Buffalo buffalo buffalo buffalo buffalo buffalo buffalo.
No, I have gone into a paroxysm of perseveration. Believe it or not, the seven buffaloes above is a real sentence. It only becomes meaningful when you can properly parse it. Although there are different meanings embedded in the sentence, the one that I like best is: Buffalo buffalo buffalo buffalo; buffalo buffalo buffalo. That semicolon helps to pull some sense from the seven Buffaloes. However it still seems like a nonsense sentence until you start to really understand that the word buffalo has three distinct meanings:
- The city of Buffalo
- The bison
- To confuse or bully.
If we take the different meanings and replace the term buffalo with the sense of the word, we get the following: Bison from the city of Buffalo confuse bison; bison (in general) confuse bison. (Perhaps buffalo think that other buffalo are strange looking cows, or perhaps maybe bears that took too many drugs in their youth.)
Ultimately what we have done here is pull apart the sentence so it makes sense – we have parsed it.
In this month’s Wizard
, we will look at parsing a little more closely. We will take the last line of a person’s address and use tools from Access to pull it apart, or parse it.
Let’s say you received a series of addresses from Excel, and you'd like to break the last address line into its component parts: city, state, and zip.
For instance, let’s use the following as examples of what you might have to deal with:
- Westford, MA 01886
- North Grafton, Mass
- North Andover MA 02645
- Winchendon Massachusetts, 7895-01475
- Lewisberg, West Virgina
What we have here are several different problems:
In the first case, things are straightforward. We have the city followed by a comma, and then a five digit ZIP Code.
In the second case, we have a city with two words in it and then a four letter abbreviation for state and no ZIP Code.
In the third case, we have a two-word city, no comma, a two-letter state, followed by a five digit ZIP Code.
In fourth case, we have a city followed by a full state name followed by a comma followed by a 4+5 ZIP Code.
In the final case, we have a city, a comma, a two-word state, and no ZIP Code.
If we had only wanted to do a few of these, it would be no problem; we would just do it by hand. If we had 50 of them, we might use the tools available in Excel to do some generic parsing, and then manually edit those that need fixing. However, if we have 2,000 addresses, or 5,000 or 50,000, we would need much stronger tools -- and that is our task here.
The Solution -- Overview
The best way to approach this type of problem is to take the big problem and break it down into a series of smaller problems.
If we keep in mind that our goal is to get a city, state, and ZIP Code out of our address, we really have three smaller problems to solve rather than one large one.
- City – it appears first and may be more than one word
- State – it can take various forms, but at the end of the day it is a defined list.
- ZIP Code – either 5 digits or nine digits with a hyphen after the fifth digit. Occasionally it may also contain a space rather than a hyphen if the person who entered the data was not playing by the rules.
Our approach will be to isolate the pieces and get them into the preferred form.
We know that the ZIP Code must come at the end of the address line. If we start at the end of the address line and work backward, inspecting every character, we can keep on going until we hit something other than a digit, hyphen, or space.
Let’s look at Westford, MA 01886
. If we start at the very end, and read backwards, we get 68810 followed by a space and then the letter A. Once we hit the letter A, we know that everything to the right is our ZIP Code.
Will the same logic work for North Grafton, Mass
which has no ZIP code? Surprisingly, yes. We begin at the very end and start working backward. The first character we come to is the letter s. Since that’s not a digit. We know that everything to the right of it is our ZIP Code. In this case, there is nothing, so our logic works.
The only really tricky problem we have left is a hyphen in our ZIP Code.
Let’s work with Winchendon Massachusetts, 7895-01475
The same logic works here as well; we start at the end and work backward until we hit the comma.
At this point we have identified our ZIP Code, and we know exactly where it starts. Everything to the left of the ZIP Code contains our city and state.
With most states, like Massachusetts, we have one word, or a set of letters that indicate an abbreviation such as “MA”
or “MASS.” So perhaps you can just take everything from the beginning of the ZIP Code and again working backward get to the state.
Unfortunately, we have to deal with states such as North Carolina, South Dakota, and West Virginia as well as states like New Mexico and Puerto Rico. Well, because United States Postal Service has declared that there is a very rigorous two letter abbreviation for all 50 states and territories, we can declare that, if we see two characters with a space on either side, then we know that that’s our state. But, we also have to deal with the fact that some people like put periods between the letters. So in order to solve that problem, we can just take out any periods that we encounter.
With this approach, we now have a two-letter abbreviation for the state; but what about illegal abbreviations? In this case, we will need a table of common abbreviations for states. If we include, the legal abbreviations in that list, then, when we get what we believe is a state, we can check it against our table. If it is valid, we use it as is; if not, then we see if it is a common substitute; in that case we use the valid entry from our table.
We have to be careful, however, not to end prematurely because, if the state is West Virginia and working backward we come to Virginia, we can easily make a mistake. So we test for two words rather than one. If the two-word candidate is not in our list. Then we test for the single word as a member in our list. If we find it that it is there, we have our state. If neither the last word, or last and next to last words are on our list, then our logic has failed and we flag this record as requiring human intervention.
At this point, we stop. We could code for all sorts of exceptional and unusual combinations of illegal state names; for instance, somebody might have typed Alsaka when they meant Alaska and we could use techniques such as Soundex see if we get there. We can also build a heavily populated table, including mistakes people commonly make when putting a state into an address line. This makes sense if you're doing hundreds of thousands records, or if you know that the data that you are dealing with is particularly sloppy. In most cases, the records won’t be sufficiently numerous to warrant the extra work associated with trying to capture 100% of the possibilities.
Capturing the City
After having found the State and ZIP Code, everything remaining on the left must be the city. At the outset, you might have thought that the city would have been the hardest to detect, since city names are so variable. In fact, since it is anything left over after having found the State and ZIP Code, it’s easy.
Conclusions and Next Steps
This month, we developed the logic to pull apart a city and state line from an address. When we take it bit by bit, we find that just as Buffalo buffalo buffalo buffalo buffalo buffalo buffalo
can be parsed so that it makes sense, we can do the same thing to turn a single address line into three separate address fields: City, State, and ZIP.
This month was the more challenging of our two tasks: developing sound logic to parse an address line. Next month, I will show you the code that we use to implement that logic. We will do this in VBA (Visual Basic for applications). I will structure things so you can just take the code and pop it into your own application.