[SQL] extracting location info from string
A column contains location information, which may contain any of the following: 1) null 2) country name (e.g. "France") 3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen") 4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen") 5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen") Using the tables below, I would like to derive COUNTRY.ID, REGION.CODE, city name. = CREATE TABLE person ( id integer NOT NULL, "name" character varying(256) NOT NULL, "location" character varying(256), CONSTRAINT person_pkey PRIMARY KEY (id) ); CREATE TABLE country ( id character varying(3) NOT NULL, -- alpha-3 code "name" character varying(50) NOT NULL, CONSTRAINT country_pkey PRIMARY KEY (id) ); CREATE TABLE region ( country character varying(3) NOT NULL, code character varying(3) NOT NULL, "name" character varying(50) NOT NULL, CONSTRAINT region_pkey PRIMARY KEY (country, code), CONSTRAINT country_region_fk FOREIGN KEY (country) REFERENCES country (id) MATCH SIMPLE ON UPDATE CASCADE ON DELETE RESTRICT ) = System: PostgreSQL 8.4 -- Best Regards, Tarlika Elisabeth Schmitz -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] extracting location info from string
On Sun, 22 May 2011 21:05:26 +0100 Tarlika Elisabeth Schmitz wrote: >A column contains location information, which may contain any of the >following: > >1) null >2) country name (e.g. "France") >3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen") >4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen") >5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen") I also need to cope with variations of COUNTRY.NAME and REGION.NAME. -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] extracting location info from string
On 23 May 2011 10:00, Tarlika Elisabeth Schmitz wrote: > On Sun, 22 May 2011 21:05:26 +0100 > Tarlika Elisabeth Schmitz wrote: > >>A column contains location information, which may contain any of the >>following: >> >>1) null >>2) country name (e.g. "France") >>3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen") >>4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen") >>5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen") > > > I also need to cope with variations of COUNTRY.NAME and REGION.NAME. I'm not sure I fully understand your request, but sanitising that data will be tedious, whichever way you dice it - particularly with the variations on region. Another thing of great import is whether the city can occur in the data column all by itself; if yes, it's next to impossible to distinguish it from a country. Specially if we assume that typos/misspelling are feasible on top of punctuation ... of course, you could create a list of valid cities and countries, with homophones thrown in for good measure, and compare & replace things appropriately, w/ name w/o a clean match being reported for human interaction =o) If I had a task like that to perform I'd dump the data out to file and have a good go at it w/ sed & awk, or perl, depending on how complex & voluminous the data is. Cheers, Andrej -- Please don't top post, and don't use HTML e-Mail :} Make your quotes concise. http://www.georgedillon.com/web/html_email_is_evil.shtml -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] extracting location info from string
On 23/05/2011 9:11 AM, Andrej wrote: On 23 May 2011 10:00, Tarlika Elisabeth Schmitz wrote: On Sun, 22 May 2011 21:05:26 +0100 Tarlika Elisabeth Schmitz wrote: A column contains location information, which may contain any of the following: 1) null 2) country name (e.g. "France") 3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen") 4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen") 5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen") I also need to cope with variations of COUNTRY.NAME and REGION.NAME. This is a hard problem. You're dealing with free-form data that might be easily understood by humans, but relies on various contextual information and knowledge that makes it really hard for computers to understand. If you want to do a good job of this, your best bet is to plug in 3rd party address analysis software that is dedicated to this task. Most (all?) such packages are commercial, proprietary affairs. They exist because it's really, really hard to do this right. Another thing of great import is whether the city can occur in the data column all by itself; if yes, it's next to impossible to distinguish it from a country. Not least because some places are both, eg: Luxembourg The Vatican Singapore (The Grand Duchy of Luxembourg has other cities, but still serves as an example). -- Craig Ringer Tech-related writing at http://soapyfrogs.blogspot.com/ -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql