I've been tasked with identifying records that have a physical address differing from their mailing address.
The problem is that some physical addresses may be entered as 123 N. Main St.
The mailing address in the example is 123 North Main Street.
I need to account, not only for abbreviations, but also spacing.
This, to me, means lots of Case statements to convert it, or I could parse the addresses using substring and trim.
Is there any easier way? I can't seem to put the pieces together with Google.
Posts
Alternatively, you could try using SQL Server's SOUNDEX function, and seeing if that will be sufficient. I'm skeptical that it will be, because of things such as "N" being compared with "North" - I doubt SOUNDEX will consider that a very close match.
You're going to need to get a list of abbreviations and try to match it, and normalize it. You're also going to want to use an actually programming language in this case. SQL is all well and good, and I'm sure you can do a great many things in it, but it is really not designed for this shit.
Generally, for street, your normalization process is as follows:
pre-step) create a list of abbreviations, like st rd ste and the rest
1) split string by token, in this case spaces
2) identify the house or street number, usually first or last set of entries and almost always contains a number. In this case, verify that if it's a number, it matches a number of one of the tokens in the non-normalized strings.
3) do a lookup on the abbreviation list, replace as needed
This is about as simple as you can get, but it's still not going to be 100%.
* for some definitions of "fairly"
So I realize that SQL isn't built for it and I'm not endeared to try it. I was hoping there were some functions built I could brutalize into functioning for me, but I don't seem to be able to find things that work for addresses (just special characters).
I like the idea of breaking it into parsed fields based on Space as a divider, that should work in 99.9% of all addresses in file. I will work out a way to strip the directional identifiers from each field (physical AND mailing). Thanks for the suggestions, this at least got me thinking.
If you wind up having to do the standardization yourself, I'm sorry. But here is what I would do:
- Load the addresses into a work table with a foreign key back to the source so that you can massage these addresses without updating the source.
- Get rid of anything that isn't a number or a letter.
- Convert all directionals to abbreviations. This does mean that if you had a street name like "Star South Ln", you would be changing it to "Star S Ln", but that's OK.
- Convert all suffixes to post office standard abbreviations: "Street" to "St", "Lane" to "Ln", "Avenue" to "Ave", etc. You also need to look for non-standard abbreviations: "Av" instead of "Ave", "Blv" instead of "Blvd", etc.
- Convert Unit Types ("Apt", "Unit", "Suite") to standards. Actually your best bet would probably be to strip them out, which solves the problem of having "Apt" in the Physical Address and "#" in the Mailing Address.
- Once you've standardized everything you can, strip out all spaces from the addresses. This will solve the issue of "123 Stone Hill Rd" on one side, and "123 Stonehill Rd" on the other.
Good luck!
Assuming you're using SQLServer:
where left(ltrim(physicaladdress), patindex('% %', ltrim(physicaladdress)) - 1) <> left(ltrim(mailaddress), patindex('% %', ltrim(mailaddress)) - 1)
Any solution which treats two apartments in the same building as the same address is no solution at all.
I like the design of this solution. Here's what I get when I run it (I see results for a split second returned to the grid, then the error:
and patindex('% %', ltrim(physicaladdress)) > 0
and patindex('% %', ltrim(mailaddress)) > 0
You'll probably want to run a separate query to look at these addresses that don't have a space. Hopefully they're just blank addresses. Otherwise they are most likely garbage.
The left( -1 gets me though, how/why does that work?