The new forums will be named Coin Return (based on the most recent vote)! You can check on the status and timeline of the transition to the new forums here.
The Guiding Principles and New Rules document is now in effect.
I once had to do a very similar task. I ended up creating a probability analysis when comparing strings, then, analyzing the results to decide upon a score threshold that seemed to hit nearly all matches without any false positives. Score was a modified percentage based on how closely the strings matched, and I'd pick something like 92.4% as the threshold. Anyway, if I were to do this for your situation, I'd probably add an extra column to hold the match probability value, and do the string analysis (lots of string splitting and other string operations will be needed).
Alternatively, you could try using SQL Server's SOUNDEX function, and seeing if that will be sufficient. I'm skeptical that it will be, because of things such as "N" being compared with "North" - I doubt SOUNDEX will consider that a very close match.
There isn't a good or easy way to do this, unfortunately.
You're going to need to get a list of abbreviations and try to match it, and normalize it. You're also going to want to use an actually programming language in this case. SQL is all well and good, and I'm sure you can do a great many things in it, but it is really not designed for this shit.
Generally, for street, your normalization process is as follows:
pre-step) create a list of abbreviations, like st rd ste and the rest
1) split string by token, in this case spaces
2) identify the house or street number, usually first or last set of entries and almost always contains a number. In this case, verify that if it's a number, it matches a number of one of the tokens in the non-normalized strings.
3) do a lookup on the abbreviation list, replace as needed
This is about as simple as you can get, but it's still not going to be 100%.
bowen on
not a doctor, not a lawyer, examples I use may not be fully researched so don't take out of context plz, don't @ me
Depending on how many records you have, you can whip up a fairly* simple script that uses Google's geocoding API. Convert all the addresses to latitude/longitude pairs, and then compare those. They say they only allow 2,500 lookups per day, but those limits are squishy, and if you really try you can get around them.
I'd be comparing 402,000 records (unique... I know).
So I realize that SQL isn't built for it and I'm not endeared to try it. I was hoping there were some functions built I could brutalize into functioning for me, but I don't seem to be able to find things that work for addresses (just special characters).
I like the idea of breaking it into parsed fields based on Space as a divider, that should work in 99.9% of all addresses in file. I will work out a way to strip the directional identifiers from each field (physical AND mailing). Thanks for the suggestions, this at least got me thinking.
Your best bet will be to use an address standardization tool. At my place of work we use ZP4 from Semaphore (http://www.semaphorecorp.com). It would be $99 well spent. I'm sure there are similar tools out there.
If you wind up having to do the standardization yourself, I'm sorry. But here is what I would do:
- Load the addresses into a work table with a foreign key back to the source so that you can massage these addresses without updating the source.
- Get rid of anything that isn't a number or a letter.
- Convert all directionals to abbreviations. This does mean that if you had a street name like "Star South Ln", you would be changing it to "Star S Ln", but that's OK.
- Convert all suffixes to post office standard abbreviations: "Street" to "St", "Lane" to "Ln", "Avenue" to "Ave", etc. You also need to look for non-standard abbreviations: "Av" instead of "Ave", "Blv" instead of "Blvd", etc.
- Convert Unit Types ("Apt", "Unit", "Suite") to standards. Actually your best bet would probably be to strip them out, which solves the problem of having "Apt" in the Physical Address and "#" in the Mailing Address.
- Once you've standardized everything you can, strip out all spaces from the addresses. This will solve the issue of "123 Stone Hill Rd" on one side, and "123 Stonehill Rd" on the other.
Had a thought this morning. Just compare the house numbers. It's going to be pretty rare to have truly different physical and mail addresses with the same house number.
Had a thought this morning. Just compare the house numbers. It's going to be pretty rare to have truly different physical and mail addresses with the same house number.
Had a thought this morning. Just compare the house numbers. It's going to be pretty rare to have truly different physical and mail addresses with the same house number.
Had a thought this morning. Just compare the house numbers. It's going to be pretty rare to have truly different physical and mail addresses with the same house number.
That means you have addresses that don't have a single space in them. Add this to the where clause to exclude them:
and patindex('% %', ltrim(physicaladdress)) > 0
and patindex('% %', ltrim(mailaddress)) > 0
You'll probably want to run a separate query to look at these addresses that don't have a space. Hopefully they're just blank addresses. Otherwise they are most likely garbage.
patindex() returns the position of the first occurrence of the pattern being searched for. In the case of my example, a single space. The "-1" is to subtract one from the position returned by patindex() so the left() function returns "145" rather than "145 ". Since you're searching for a space, I should have just used a rtrim() on the return of left() to get rid of the space. That way the left() function wouldn't ever be called with a negative 1, which causes it to error.
Posts
Alternatively, you could try using SQL Server's SOUNDEX function, and seeing if that will be sufficient. I'm skeptical that it will be, because of things such as "N" being compared with "North" - I doubt SOUNDEX will consider that a very close match.
You're going to need to get a list of abbreviations and try to match it, and normalize it. You're also going to want to use an actually programming language in this case. SQL is all well and good, and I'm sure you can do a great many things in it, but it is really not designed for this shit.
Generally, for street, your normalization process is as follows:
pre-step) create a list of abbreviations, like st rd ste and the rest
1) split string by token, in this case spaces
2) identify the house or street number, usually first or last set of entries and almost always contains a number. In this case, verify that if it's a number, it matches a number of one of the tokens in the non-normalized strings.
3) do a lookup on the abbreviation list, replace as needed
This is about as simple as you can get, but it's still not going to be 100%.
* for some definitions of "fairly"
So I realize that SQL isn't built for it and I'm not endeared to try it. I was hoping there were some functions built I could brutalize into functioning for me, but I don't seem to be able to find things that work for addresses (just special characters).
I like the idea of breaking it into parsed fields based on Space as a divider, that should work in 99.9% of all addresses in file. I will work out a way to strip the directional identifiers from each field (physical AND mailing). Thanks for the suggestions, this at least got me thinking.
If you wind up having to do the standardization yourself, I'm sorry. But here is what I would do:
- Load the addresses into a work table with a foreign key back to the source so that you can massage these addresses without updating the source.
- Get rid of anything that isn't a number or a letter.
- Convert all directionals to abbreviations. This does mean that if you had a street name like "Star South Ln", you would be changing it to "Star S Ln", but that's OK.
- Convert all suffixes to post office standard abbreviations: "Street" to "St", "Lane" to "Ln", "Avenue" to "Ave", etc. You also need to look for non-standard abbreviations: "Av" instead of "Ave", "Blv" instead of "Blvd", etc.
- Convert Unit Types ("Apt", "Unit", "Suite") to standards. Actually your best bet would probably be to strip them out, which solves the problem of having "Apt" in the Physical Address and "#" in the Mailing Address.
- Once you've standardized everything you can, strip out all spaces from the addresses. This will solve the issue of "123 Stone Hill Rd" on one side, and "123 Stonehill Rd" on the other.
Good luck!
Assuming you're using SQLServer:
where left(ltrim(physicaladdress), patindex('% %', ltrim(physicaladdress)) - 1) <> left(ltrim(mailaddress), patindex('% %', ltrim(mailaddress)) - 1)
Any solution which treats two apartments in the same building as the same address is no solution at all.
I like the design of this solution. Here's what I get when I run it (I see results for a split second returned to the grid, then the error:
and patindex('% %', ltrim(physicaladdress)) > 0
and patindex('% %', ltrim(mailaddress)) > 0
You'll probably want to run a separate query to look at these addresses that don't have a space. Hopefully they're just blank addresses. Otherwise they are most likely garbage.
The left( -1 gets me though, how/why does that work?