SQL: Comparing Addresses

1ddqd · May 2011

I've been tasked with identifying records that have a physical address differing from their mailing address.

The problem is that some physical addresses may be entered as 123 N. Main St.

The mailing address in the example is 123 North Main Street.

I need to account, not only for abbreviations, but also spacing.

This, to me, means lots of Case statements to convert it, or I could parse the addresses using substring and trim.

Is there any easier way? I can't seem to put the pieces together with Google.

schuss · May 2011

This is why they make normalization tools, because strings blow ass to convert/compare. How many records?

Steve Bennett · May 2011

I once had to do a very similar task. I ended up creating a probability analysis when comparing strings, then, analyzing the results to decide upon a score threshold that seemed to hit nearly all matches without any false positives. Score was a modified percentage based on how closely the strings matched, and I'd pick something like 92.4% as the threshold. Anyway, if I were to do this for your situation, I'd probably add an extra column to hold the match probability value, and do the string analysis (lots of string splitting and other string operations will be needed).

Alternatively, you could try using SQL Server's SOUNDEX function, and seeing if that will be sufficient. I'm skeptical that it will be, because of things such as "N" being compared with "North" - I doubt SOUNDEX will consider that a very close match.

bowen · May 2011

There isn't a good or easy way to do this, unfortunately.

You're going to need to get a list of abbreviations and try to match it, and normalize it. You're also going to want to use an actually programming language in this case. SQL is all well and good, and I'm sure you can do a great many things in it, but it is really not designed for this shit.

Generally, for street, your normalization process is as follows:

pre-step) create a list of abbreviations, like st rd ste and the rest

1) split string by token, in this case spaces
2) identify the house or street number, usually first or last set of entries and almost always contains a number. In this case, verify that if it's a number, it matches a number of one of the tokens in the non-normalized strings.
3) do a lookup on the abbreviation list, replace as needed

This is about as simple as you can get, but it's still not going to be 100%.

aperlscript · May 2011

Depending on how many records you have, you can whip up a fairly* simple script that uses Google's geocoding API. Convert all the addresses to latitude/longitude pairs, and then compare those. They say they only allow 2,500 lookups per day, but those limits are squishy, and if you really try you can get around them.

curl 'http://maps.googleapis.com/maps/api/geocode/xml?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=false'

* for some definitions of "fairly"

1ddqd · May 2011

I'd be comparing 402,000 records (unique... I know).

So I realize that SQL isn't built for it and I'm not endeared to try it. I was hoping there were some functions built I could brutalize into functioning for me, but I don't seem to be able to find things that work for addresses (just special characters).

I like the idea of breaking it into parsed fields based on Space as a divider, that should work in 99.9% of all addresses in file. I will work out a way to strip the directional identifiers from each field (physical AND mailing). Thanks for the suggestions, this at least got me thinking.

Chickeen · May 2011

Your best bet will be to use an address standardization tool. At my place of work we use ZP4 from Semaphore (http://www.semaphorecorp.com). It would be $99 well spent. I'm sure there are similar tools out there.

If you wind up having to do the standardization yourself, I'm sorry. But here is what I would do:

- Load the addresses into a work table with a foreign key back to the source so that you can massage these addresses without updating the source.

- Get rid of anything that isn't a number or a letter.

- Convert all directionals to abbreviations. This does mean that if you had a street name like "Star South Ln", you would be changing it to "Star S Ln", but that's OK.

- Convert all suffixes to post office standard abbreviations: "Street" to "St", "Lane" to "Ln", "Avenue" to "Ave", etc. You also need to look for non-standard abbreviations: "Av" instead of "Ave", "Blv" instead of "Blvd", etc.

- Convert Unit Types ("Apt", "Unit", "Suite") to standards. Actually your best bet would probably be to strip them out, which solves the problem of having "Apt" in the Physical Address and "#" in the Mailing Address.

- Once you've standardized everything you can, strip out all spaces from the addresses. This will solve the issue of "123 Stone Hill Rd" on one side, and "123 Stonehill Rd" on the other.

Good luck!

Chickeen · May 2011

Had a thought this morning. Just compare the house numbers. It's going to be pretty rare to have truly different physical and mail addresses with the same house number.

Assuming you're using SQLServer:

where left(ltrim(physicaladdress), patindex('% %', ltrim(physicaladdress)) - 1) <> left(ltrim(mailaddress), patindex('% %', ltrim(mailaddress)) - 1)

JHunz · May 2011

Chickeen wrote: »

Had a thought this morning. Just compare the house numbers. It's going to be pretty rare to have truly different physical and mail addresses with the same house number.

Assuming you're using SQLServer:

where left(ltrim(physicaladdress), patindex('% %', ltrim(physicaladdress)) - 1) <> left(ltrim(mailaddress), patindex('% %', ltrim(mailaddress)) - 1)

Any solution which treats two apartments in the same building as the same address is no solution at all.

Chickeen · May 2011

JHunz wrote: »

Chickeen wrote: »

Had a thought this morning. Just compare the house numbers. It's going to be pretty rare to have truly different physical and mail addresses with the same house number.

Assuming you're using SQLServer:

where left(ltrim(physicaladdress), patindex('% %', ltrim(physicaladdress)) - 1) <> left(ltrim(mailaddress), patindex('% %', ltrim(mailaddress)) - 1)

Any solution which treats two apartments in the same building as the same address is no solution at all.

I said it would be pretty rare. But yeah, I suppose you could live at Apt A your mail delivered at Apt C.

1ddqd · May 2011

Chickeen wrote: »

Had a thought this morning. Just compare the house numbers. It's going to be pretty rare to have truly different physical and mail addresses with the same house number.

Assuming you're using SQLServer:

where left(ltrim(physicaladdress), patindex('% %', ltrim(physicaladdress)) - 1) <> left(ltrim(mailaddress), patindex('% %', ltrim(mailaddress)) - 1)

I like the design of this solution. Here's what I get when I run it (I see results for a split second returned to the grid, then the error:

Invalid length parameter passed to the LEFT or SUBSTRING function.

Chickeen · May 2011

That means you have addresses that don't have a single space in them. Add this to the where clause to exclude them:

and patindex('% %', ltrim(physicaladdress)) > 0
and patindex('% %', ltrim(mailaddress)) > 0

You'll probably want to run a separate query to look at these addresses that don't have a space. Hopefully they're just blank addresses. Otherwise they are most likely garbage.

1ddqd · May 2011

So patindex( looks for multiple spaces (in this syntax) within a given field.

The left( -1 gets me though, how/why does that work?

Chickeen · May 2011

patindex() returns the position of the first occurrence of the pattern being searched for. In the case of my example, a single space. The "-1" is to subtract one from the position returned by patindex() so the left() function returns "145" rather than "145 ". Since you're searching for a space, I should have just used a rtrim() on the return of left() to get rid of the space. That way the left() function wouldn't ever be called with a negative 1, which causes it to error.

Penny Arcade

Quick Links

SQL: Comparing Addresses

Posts