[SQL] extracting location info from string

2011-05-22 Thread Tarlika Elisabeth Schmitz
A column contains location information, which may contain any of the
following:

1) null
2) country name (e.g. "France")
3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")

Using the tables below, I would like to derive COUNTRY.ID, REGION.CODE,
city name.




=
CREATE TABLE person
(
  id integer NOT NULL,
  "name" character varying(256) NOT NULL,
  "location" character varying(256),
  CONSTRAINT person_pkey PRIMARY KEY (id)
);

CREATE TABLE country
(
  id character varying(3) NOT NULL, -- alpha-3 code
  "name" character varying(50) NOT NULL,
  CONSTRAINT country_pkey PRIMARY KEY (id)
);

CREATE TABLE region
(
  country character varying(3) NOT NULL,
  code character varying(3) NOT NULL,
 "name" character varying(50) NOT NULL,
  CONSTRAINT region_pkey PRIMARY KEY (country, code),
  CONSTRAINT country_region_fk FOREIGN KEY (country)
  REFERENCES country (id) MATCH SIMPLE
  ON UPDATE CASCADE ON DELETE RESTRICT
)


=
System: PostgreSQL 8.4


-- 

Best Regards,
Tarlika Elisabeth Schmitz

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] extracting location info from string

2011-05-22 Thread Tarlika Elisabeth Schmitz
On Sun, 22 May 2011 21:05:26 +0100
Tarlika Elisabeth Schmitz  wrote:

>A column contains location information, which may contain any of the
>following:
>
>1) null
>2) country name (e.g. "France")
>3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
>4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
>5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")


I also need to cope with variations of COUNTRY.NAME and REGION.NAME. 


-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] extracting location info from string

2011-05-22 Thread Andrej
On 23 May 2011 10:00, Tarlika Elisabeth Schmitz
 wrote:
> On Sun, 22 May 2011 21:05:26 +0100
> Tarlika Elisabeth Schmitz  wrote:
>
>>A column contains location information, which may contain any of the
>>following:
>>
>>1) null
>>2) country name (e.g. "France")
>>3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
>>4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
>>5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")
>
>
> I also need to cope with variations of COUNTRY.NAME and REGION.NAME.

I'm not sure I fully understand your request, but sanitising that data will be
tedious, whichever way you dice it - particularly with the variations
on region.

Another thing of great import is whether the city can occur in the
data column all by itself; if yes, it's next to impossible to distinguish
it from a country.  Specially if we assume that typos/misspelling are
feasible on top of punctuation ... of course, you could create a list
of valid cities and countries, with homophones thrown in for good
measure, and compare & replace things appropriately, w/ name
w/o a clean match being reported for human interaction =o)

If I had a task like that to perform I'd dump the data out to file
and have a good go at it w/ sed & awk, or perl, depending on
how complex & voluminous the data is.



Cheers,
Andrej


-- 
Please don't top post, and don't use HTML e-Mail :}  Make your quotes concise.

http://www.georgedillon.com/web/html_email_is_evil.shtml

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] extracting location info from string

2011-05-22 Thread Craig Ringer

On 23/05/2011 9:11 AM, Andrej wrote:

On 23 May 2011 10:00, Tarlika Elisabeth Schmitz
  wrote:

On Sun, 22 May 2011 21:05:26 +0100
Tarlika Elisabeth Schmitz  wrote:


A column contains location information, which may contain any of the
following:

1) null
2) country name (e.g. "France")
3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")



I also need to cope with variations of COUNTRY.NAME and REGION.NAME.


This is a hard problem. You're dealing with free-form data that might be 
easily understood by humans, but relies on various contextual 
information and knowledge that makes it really hard for computers to 
understand.


If you want to do a good job of this, your best bet is to plug in 3rd 
party address analysis software that is dedicated to this task. Most 
(all?) such packages are commercial, proprietary affairs. They exist 
because it's really, really hard to do this right.



Another thing of great import is whether the city can occur in the
data column all by itself; if yes, it's next to impossible to distinguish
it from a country.


Not least because some places are both, eg:

  Luxembourg
  The Vatican
  Singapore

(The Grand Duchy of Luxembourg has other cities, but still serves as an 
example).


--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/

--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql