Re: [SQL] extracting location info from string
Tarlika Elisabeth Schmitz wrote:
> Lew wrote:
>> That isn't a table structure, that's a freeform text structure. You
>> didn't state your question, Tarlika, but your database structure is
>> terrible. For example, "region" and "country" should be different
>> columns.
>
> I presume you are referring to my original post:
>
> CREATE TABLE person
> (
>   id integer NOT NULL,
>   "name" character varying(256) NOT NULL,
>   "location" character varying(256),
>   CONSTRAINT person_pkey PRIMARY KEY (id)
> );
>
> Sorry, this was just a TEMPORARY table I created for quick analysis of
> my CSV data (now renamed to temp_person). The target table is:
>
> CREATE TABLE person
> (
>   id integer NOT NULL,
>   "name" character varying(100) NOT NULL,
>   country character varying(3),
>   county character varying(3),
>   town character varying(50),
>   CONSTRAINT trainer_pkey PRIMARY KEY (id),
>   CONSTRAINT country_person_fk FOREIGN KEY (country)
>     REFERENCES country (id) MATCH SIMPLE
>     ON UPDATE CASCADE ON DELETE RESTRICT,
>   CONSTRAINT county_person_fk FOREIGN KEY (country, county)
>     REFERENCES county (country, code) MATCH SIMPLE
>     ON UPDATE NO ACTION ON DELETE NO ACTION
> );

Ah, yes, that makes much more sense. Temporary tables such as you
describe can be very convenient and effective. Thanks for the
clarification.

I think this problem is very widespread, namely how to get structured
information out of freeform data. I've encountered it many times over
the years, as have so many I know.

I believe that human intervention will always be needed for this type
of work, e.g., distinguishing place names that seem the same or
correlating ones that seem distinct. I also don't know of any perfect
approach. Perhaps the best one can find is a probabilistic promise that
the error will be less than some epsilon.

That said, if you have a robust process to correct errors as the user
population discovers them, then you can approach perfection
asymptotically. Sometimes the best solution to a technical problem is a
good human process. From an engineering standpoint, user feedback is a
vital element of homeostatic control.

Edward W. Rouse's suggestion of a reference table to resolve different
forms of address or region identification would fit well with such a
process.

--
Lew
Honi soit qui mal y pense.
http://upload.wikimedia.org/wikipedia/commons/c/cf/Friz.jpg
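A minimal sketch of the reference-table idea mentioned above; the table
and column names here are illustrative, not from the thread:

-- Hypothetical lookup mapping each raw location string, exactly as it
-- arrives in the CSV, to the normalized codes used by the person table.
CREATE TABLE location_alias
(
  raw_text character varying(256) NOT NULL,
  country  character varying(3),
  county   character varying(3),
  town     character varying(50),
  CONSTRAINT location_alias_pkey PRIMARY KEY (raw_text)
);

-- Resolving an import row then reduces to a single lookup:
SELECT country, county, town
FROM location_alias
WHERE raw_text = 'Newmarket, Suffolk';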
Re: [SQL] Performance of NOT IN and <> with PG 9.0.4
On Tue, May 24, 2011 at 7:45 AM, Jasmin Dizdarevic wrote:
> Hi,
>
> found the problem.
>
> 238 sec. with set enable_material = 'on'
> 4(!) sec. with set enable_material = 'off'
>
> @Robert Haas: I thought it would be interesting to you, because you've
> committed a patch regarding materialization for 9.0. If you'd like to
> investigate this further, I can provide you with more details.

Well, it makes me glad I insisted we add enable_material. But I can't
really tell from this output what is happening. Can we see the EXPLAIN
ANALYZE output on 9.0, with and without enable_material?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
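Capturing the two plans Robert asks for is a matter of toggling the
setting in one session; a sketch, with a trivial stand-in for the
poster's actual query:

SET enable_material = on;
EXPLAIN ANALYZE
SELECT 1;  -- stand-in: substitute the actual slow query here

SET enable_material = off;
EXPLAIN ANALYZE
SELECT 1;  -- stand-in: run the same query again for comparison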
Re: [SQL] Performance of NOT IN and <> with PG 9.0.4
On Tue, May 24, 2011 at 3:32 PM, Jasmin Dizdarevic wrote:
> enable_material = off

Is there any chance you can reproduce this with a simpler test case
that doesn't involve quite so many joins? It looks to me like shutting
off enable_material is saving you mostly by accident here. There's only
one Materialize node in the whole plan.

And just incidentally, do you have any of the other enable_* settings
turned off?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
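A quick way to answer that last question from psql:

-- List every planner enable_* flag and its current value:
SELECT name, setting
FROM pg_settings
WHERE name LIKE 'enable%';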
Re: [SQL] Performance of NOT IN and <> with PG 9.0.4
Robert Haas writes:
> On Tue, May 24, 2011 at 3:32 PM, Jasmin Dizdarevic wrote:
>> enable_material = off

> Is there any chance you can reproduce this with a simpler test case
> that doesn't involve quite so many joins?

I didn't stop to count, but are there enough that join_collapse_limit
or from_collapse_limit could be in play?

			regards, tom lane
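Both limits default to 8; a query whose join tree or FROM list exceeds
the limit gets planned piecewise, which can hide better plans. A sketch
of how one might test Tom's hypothesis:

SHOW join_collapse_limit;
SHOW from_collapse_limit;

-- Raise both for the session and re-run the query to compare plans:
SET join_collapse_limit = 20;
SET from_collapse_limit = 20;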
Re: [SQL] extracting location info from string
On 05/24/2011 10:57 AM, Lew wrote:
> Tarlika Elisabeth Schmitz wrote:
>> Sorry, this was just a TEMPORARY table I created for quick analysis
>> of my CSV data (now renamed to temp_person). [...]
>
> [...]
>
> That said, if you have a robust process to correct errors as the user
> population discovers them, then you can approach perfection
> asymptotically. Sometimes the best solution to a technical problem is
> a good human process. From an engineering standpoint, user feedback is
> a vital element of homeostatic control.
>
> Edward W. Rouse's suggestion of a reference table to resolve different
> forms of address or region identification would fit well with such a
> process.

To minimize the ultimately quite necessary human adjudication, one
might make good use of what is often termed "crowd sourcing": keep all
the distinct "hand entered" values and a map to the final human
assessment. At least repeated oddities won't repeatedly involve hand
editing.
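A first step toward such a map, using the staging table named in this
thread, is simply to pull out the distinct hand-entered values and
their frequencies:

-- Each distinct raw string, most frequent first, as candidates for
-- one-time human assessment:
SELECT "location", count(*) AS occurrences
FROM temp_person
GROUP BY "location"
ORDER BY occurrences DESC;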
Re: [SQL] Performance of NOT IN and <> with PG 9.0.4
As I understand the docs, those two limits should not come into play:
the performance only degrades when I add two aggregated columns, and
only when enable_material is on.

2011/5/25 Tom Lane:
> Robert Haas writes:
>> On Tue, May 24, 2011 at 3:32 PM, Jasmin Dizdarevic wrote:
>>> enable_material = off
>
>> Is there any chance you can reproduce this with a simpler test case
>> that doesn't involve quite so many joins?
>
> I didn't stop to count, but are there enough that join_collapse_limit
> or from_collapse_limit could be in play?
>
>			regards, tom lane
Re: [SQL] extracting location info from string
On Tue, 24 May 2011 12:57:57 -0400
Lew wrote:

> Tarlika Elisabeth Schmitz wrote:
>> this was just a TEMPORARY table I created for quick analysis
>> of my CSV data (now renamed to temp_person).

> Ah, yes, that makes much more sense. Temporary tables such as you
> describe can be very convenient and effective.

The ER model is pretty simple. It has only 30 tables, of which 12 are
"active" tables; the rest are pretty static (e.g. COUNTRY).

There are 7-8 CSV formats, and my idea is to import, clean up,
normalize and distribute the data via temp tables and associated
triggers. The whole process has to be pretty much automated, as the
volume is too big to babysit the import.

As I said in my other post, with the daily import I can't reject data
just because part of it cannot [yet] be tidied, since other data will
relate to it. My intention is therefore to import partially (in the
above example, import a PERSON without resolving the location) and flag
such cases for later manual intervention in a log table.

Some data just can't be cleaned up and I have to take a pragmatic
approach - for instance, non-existent countries: I see there is an ISO
standard for split countries such as Czechoslovakia
(http://en.wikipedia.org/wiki/ISO_3166-3). That gets 12 PERSONs off my
list. But what on earth do I do with "North America"? Create a new
country, because some info is better than none? I can hardly make them
US citizens, because I know how upset people get when mistaken for
their neighbours.

> I think this problem is very widespread, namely how to get structured
> information out of freeform data.

The PERSON.location is peanuts compared to other data of 20x the
volume, with the really important information encoded in a freeform
string using all sorts of abbreviations.

> That said, if you have a robust process to correct errors as the user
> population discovers them, then you can approach perfection
> asymptotically. [...]

1 user - moi

> From an engineering standpoint, user feedback is a vital element of
> homeostatic control.

I'll be having a lot of conversations with myself. ;-)

--
Best Regards,
Tarlika Elisabeth Schmitz
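A minimal sketch of that partial-import-and-flag approach, assuming the
person and temp_person tables from earlier in the thread; the log
table, function and trigger names are all illustrative:

-- Hypothetical log table for rows whose location could not be resolved.
CREATE TABLE import_log
(
  id        serial PRIMARY KEY,
  person_id integer NOT NULL,
  raw_value character varying(256),
  noted_at  timestamp NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION distribute_person() RETURNS trigger AS $$
BEGIN
  -- Import the person even when the location cannot be resolved yet.
  INSERT INTO person (id, "name") VALUES (NEW.id, NEW."name");
  -- Real resolution logic would go here; this sketch simply flags
  -- every raw location for later manual intervention.
  IF NEW."location" IS NOT NULL THEN
    INSERT INTO import_log (person_id, raw_value)
    VALUES (NEW.id, NEW."location");
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER temp_person_distribute
  AFTER INSERT ON temp_person
  FOR EACH ROW EXECUTE PROCEDURE distribute_person();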
Re: [SQL] extracting location info from string
On Wed, 25 May 2011 09:25:48 -0600
Rob Sargent wrote:

> On 05/24/2011 10:57 AM, Lew wrote:
>> Tarlika Elisabeth Schmitz wrote:
>>
>>> CREATE TABLE person
>>> (
>>>   id integer NOT NULL,
>>>   "name" character varying(256) NOT NULL,
>>>   "location" character varying(256),
>>>   CONSTRAINT person_pkey PRIMARY KEY (id)
>>> );
>>>
>>> this was just a TEMPORARY table I created for quick analysis
>>> of my CSV data (now renamed to temp_person).

CREATE TABLE country
(
  id character varying(3) NOT NULL, -- alpha-3 code
  "name" character varying(50) NOT NULL,
  CONSTRAINT country_pkey PRIMARY KEY (id)
);

> To minimize the ultimately quite necessary human adjudication, one
> might make good use of what is often termed "crowd sourcing": Keep
> all the distinct "hand entered" values and a map to the final human
> assessment.

I was wondering how to do just that. I don't think it would be a good
idea to hard-code this into the clean-up script. Take, for instance,
variations of COUNTRY.NAME spelling. Where would I store these?

I could do with a concept for this problem, which applies to a lot of
string-type info.

--
Best Regards,
Tarlika Elisabeth Schmitz
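One possible answer, sketched against the country table above; the
alias table name is illustrative:

-- Hypothetical alias table: each observed spelling maps to the
-- canonical country row.
CREATE TABLE country_alias
(
  spelling character varying(50) NOT NULL,
  country  character varying(3) NOT NULL REFERENCES country (id),
  CONSTRAINT country_alias_pkey PRIMARY KEY (spelling)
);

-- e.g. INSERT INTO country_alias VALUES ('Deutschland', 'DEU');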
Re: [SQL] extracting location info from string
On 05/25/2011 03:13 PM, Tarlika Elisabeth Schmitz wrote:
> On Wed, 25 May 2011 09:25:48 -0600
> Rob Sargent wrote:
>> To minimize the ultimately quite necessary human adjudication, one
>> might make good use of what is often termed "crowd sourcing": Keep
>> all the distinct "hand entered" values and a map to the final human
>> assessment.
>
> I was wondering how to do just that. I don't think it would be a good
> idea to hard-code this into the clean-up script. Take, for instance,
> variations of COUNTRY.NAME spelling. Where would I store these?
>
> I could do with a concept for this problem, which applies to a lot of
> string-type info.

I think you keep your current structures used for deducing the
canonical forms, but with each unique input encountered you add it to
your seen-thus-far list, which becomes just one more check (possibly
the first such check).

CREATE TABLE address_input
(
  id serial PRIMARY KEY,
  human_input character varying(256),
  resolution character varying(256)
);

You may have to add a column for the type of input (if you know, for
instance, that the input is a street address v. a country), or you may
want the resolution to be partitioned into county, city and so on.
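A sketch of how that first check might look during clean-up, assuming
the address_input table above (the sample value is made up):

-- Look up a raw string; a NULL resolution means "seen but unresolved",
-- no row at all means "never seen before".
SELECT resolution
FROM address_input
WHERE human_input = 'N. America';

-- Record a never-before-seen value for later human assessment:
INSERT INTO address_input (human_input)
SELECT 'N. America'
WHERE NOT EXISTS
  (SELECT 1 FROM address_input WHERE human_input = 'N. America');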
[SQL] Re: [SQL] extracting location info from string
Have you looked at
http://en.m.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance ?

- Reply message -
From: "Tarlika Elisabeth Schmitz"
Date: Wed, May 25, 2011 6:13 pm
Subject: [SQL] extracting location info from string
To:

[...] I was wondering how to do just that. I don't think it would be a
good idea to hard-code this into the clean-up script. Take, for
instance, variations of COUNTRY.NAME spelling. Where would I store
these?

I could do with a concept for this problem, which applies to a lot of
string-type info.

--
Best Regards,
Tarlika Elisabeth Schmitz
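For doing this kind of fuzzy matching inside the database, PostgreSQL's
fuzzystrmatch contrib module offers a plain Levenshtein distance (a
transposition counts as two edits, unlike the Damerau variant); a
sketch against the country table from this thread, with a made-up
misspelling:

-- Requires the fuzzystrmatch contrib module to be installed.
-- Canonical country names within edit distance 2 of a raw input:
SELECT "name", levenshtein("name", 'Swtzerland') AS distance
FROM country
WHERE levenshtein("name", 'Swtzerland') <= 2
ORDER BY distance;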
Re: [SQL] extracting location info from string
On 26 May 2011 09:13, Tarlika Elisabeth Schmitz wrote:
> I was wondering how to do just that. I don't think it would be a good
> idea to hard-code this into the clean-up script. Take, for instance,
> variations of COUNTRY.NAME spelling. Where would I store these?
>
> I could do with a concept for this problem, which applies to a lot of
> string-type info.

I'd start w/ downloading a list as mentioned here:
http://answers.google.com/answers/threadview?id=596822

And run it through a wee perl script using
http://search.cpan.org/~maurice/Text-DoubleMetaphone-0.07/DoubleMetaphone.pm
to make phonetic matches ...

Then I'd run your own data through DoubleMetaphone, and clean up the
matches if not too many false positives show up.

Cheers,
Andrej
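The same double-metaphone comparison is also available without leaving
SQL: the fuzzystrmatch contrib module ships a dmetaphone() function.
This is an alternative to the Perl route Andrej describes, not part of
his suggestion, and the sample input is made up:

-- Requires the fuzzystrmatch contrib module to be installed.
-- Phonetic match of a raw input against the canonical country names:
SELECT "name"
FROM country
WHERE dmetaphone("name") = dmetaphone('Jermany');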
Re: [SQL] enum data type vs table
On Tue, May 17, 2011 at 11:23 PM, Seb wrote:
> Are there any guidelines for deciding whether to 1) create an enum
> data type or 2) create a table with the set of values and then have
> foreign keys referencing this table? Some fields in a database take a
> small number of values, and I'm not sure which of these routes to
> take. The enum data type seems like a clean way to handle this without
> creating a constellation of tables for all these values, but if one
> wants to add a new label to the enum or make changes to it at some
> point, then the tables using it have to be recreated, so it's quite
> rigid. Have I got this right? Thanks.

I think your choice depends on a few things:

1 - How do you want to interact with the tables? What I mean is, are
you planning on querying, inserting, or updating data in those tables
via text, or will you need to join to your reference table? If you
don't want to join, you'll either need to use enum types, use views
(which can be a pain if you want to update a view), or
duplicate/reference the text directly (which is slow and a bad idea for
several reasons).

2 - How much can you tolerate downtime or a busy database? Changing
types is a single transaction and requires an exclusive lock. On small
tables this is negligible, but on big tables it can require downtime.

3 - How often do you really expect changes to the enum type? If adding
a new value to an enum type is truly a rare event, an enum is probably
fine. If it's frequent or regular, you should probably have a table.

I've used both of these approaches, and I've found enum types to be
well worth the trouble of dropping/recreating types. The changes I've
made have been rare, and I've been able to schedule downtime pretty
easily, so it made the most sense for me.

Also, Postgres 9.1 allows adding values to enum types, so you could
always use that when it is finally released.

Hope this helps.

Cheers,
Peter
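A sketch of both sides of that trade-off; the type and its values are
made up:

-- An enum type keeps the labels inline, no join needed:
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE diary (entry text, feeling mood);

-- Before 9.1, adding a label means dropping and recreating the type
-- (and rewriting dependent columns). From 9.1 on it is one statement:
ALTER TYPE mood ADD VALUE 'ecstatic';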
Re: [SQL] Sorting Issue
Since no one has responded, does that mean there is no list anywhere?
Or does anyone know of a collation that allows case-insensitive sorting
while not ignoring spaces?

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Ozer, Pam
Sent: Wednesday, May 18, 2011 3:22 PM
To: Tom Lane
Cc: Samuel Gendler; [email protected]; [email protected]
Subject: Re: [SQL] Sorting Issue

Is there anywhere that gives you all the available collations and their
definitions? I found that with the C collation it now sorts the spaces
correctly, but it is also case-sensitive, which messes with some of our
other sorts.

-----Original Message-----
From: Tom Lane [mailto:[email protected]]
Sent: Tuesday, May 10, 2011 9:47 AM
To: Ozer, Pam
Cc: Samuel Gendler; [email protected]; [email protected]
Subject: Re: [SQL] Sorting Issue

"Ozer, Pam" writes:
> Isn't this the English standard for collation? Or is this a non-C
> locale as mentioned below? Is there any way around this?
> LC_COLLATE = 'en_US.utf8'

en_US is probably using somebody's idea of "dictionary order", which I
believe includes ignoring spaces in the first pass. You might be
happier using "C" collation. Unfortunately that requires re-initdb'ing
your database (as of existing PG releases).

			regards, tom lane
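Where re-initdb'ing to "C" brings back case-sensitive ordering, a
common workaround is to fold case in the sort expression; a sketch with
made-up table and column names:

-- Under "C" collation, ordering is byte-wise (so spaces sort before
-- letters); lower() restores case-insensitive ordering on top of that:
SELECT make_name
FROM vehicle_makes
ORDER BY lower(make_name), make_name;

-- An expression index keeps such sorts cheap:
CREATE INDEX vehicle_makes_lower_idx ON vehicle_makes (lower(make_name));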
Re: [SQL] enum data type vs table
On Wed, 25 May 2011 17:23:26 -0500, Peter Koczan wrote:

> I think your choice depends on a few things:
>
> [...]
>
> I've used both of these approaches, and I've found enum types to be
> well worth the trouble of dropping/recreating types. The changes I've
> made have been rare, and I've been able to schedule downtime pretty
> easily, so it made the most sense for me.

These are great guidelines, thanks.

--
Seb
