Re: [SQL] How to count from a second table in an aggregate query?
Date: Wed, 15 Apr 2009 21:23:04 -0700
From: Steve Midgley scie...@misuse.org
To: Erik Jones ejo...@engineyard.com
Subject: Re: How to count from a second table in an aggregate query?
Message-ID: 49e6b2a8.5040...@misuse.org

Erik Jones wrote:

On Apr 15, 2009, at 1:15 PM, Steve Midgley wrote:

I want to generate an analysis report that counts the values in two separate tables. I've been able to accomplish what I want with two separate queries that I then merge together in Excel. Essentially what I need is a horizontal UNION statement (or something like that): get a FK id and count of a certain column in one table, based on some criteria; for each FK id, get the count of a different column in a different table; display the counts from both queries side-by-side along with the FK id's in a single result set.

Joining against a subquery for the second count does the trick:

select src_contact_id, count(log_type), cp.count
from contact_log,
     (select contact_id, count(property_id)
      from contact_property
      group by contact_id) as cp
where src_contact_id = cp.contact_id
  and log_type in ('web', 'detail')
group by src_contact_id, cp.count
order by src_contact_id;

 src_contact_id | count | count
----------------+-------+-------
              1 |     5 |     4
              2 |     3 |     2

A friend of mine off-list provided an alternative SQL version which I thought the list might have interest in:

select src_contact_id,
       count(distinct contact_log.id),
       count(distinct contact_property.id)
from contact_log, contact_property
where contact_log.src_contact_id = contact_property.contact_id
  and contact_log.log_type in ('web', 'detail')
group by src_contact_id;

Credit to Matt Gainsborough for that one. Makes perfect sense as I look at it. It's nice to see two alternate paths to the same solution. ANSI-92 joins work just as well as his ANSI-89 join syntax for this (i.e. using the JOIN keyword to set the relation).

Steve
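For reference, here is a sketch (not from the thread) of Erik's query rewritten with the explicit JOIN keyword, per Steve's ANSI-92 comment; it should return the same result set:

select src_contact_id, count(log_type), cp.count
from contact_log
join (select contact_id, count(property_id)
      from contact_property
      group by contact_id) as cp
  on src_contact_id = cp.contact_id
where log_type in ('web', 'detail')
group by src_contact_id, cp.count
order by src_contact_id;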
[SQL] How to count from a second table in an aggregate query?
Hi, I'm trying to figure out how to do something which I'd guess is easy for a SQL whiz but has me stumped. I would greatly appreciate any help on this - it's a form of SQL query that I've never figured out, but have wanted to use many times over the years. I want to generate an analysis report that counts the values in two separate tables. I've been able to accomplish what I want with two separate queries that I then merge together in Excel. Essentially what I need is a horizontal UNION statement (or something like that). I've included some DDL and sample SQL queries that explain what I want better than I can in English, but the general idea is: get a FK id and count of a certain column in one table, based on some criteria; for each FK id, get the count of a different column in a different table; display the counts from both queries side-by-side along with the FK id's in a single result set.

Thanks for any assistance on this!

Steve

/*SQL STARTS*/
drop table if exists contact_log;
drop table if exists contact_property;

create table contact_log(
  id serial NOT null,
  src_contact_id integer,
  log_type character varying(63),
  CONSTRAINT contact_log_pkey PRIMARY KEY (id));

create table contact_property(
  id serial NOT null,
  contact_id integer,
  property_id integer,
  CONSTRAINT contact_property_pkey PRIMARY KEY (id),
  CONSTRAINT contact_property_cid_pid UNIQUE (contact_id, property_id));

insert into contact_log (src_contact_id, log_type) values(1, 'web');
insert into contact_log (src_contact_id, log_type) values(1, 'web');
insert into contact_log (src_contact_id, log_type) values(1, 'web');
insert into contact_log (src_contact_id, log_type) values(1, 'detail');
insert into contact_log (src_contact_id, log_type) values(1, 'detail');
insert into contact_log (src_contact_id, log_type) values(2, 'detail');
insert into contact_log (src_contact_id, log_type) values(2, 'detail');
insert into contact_log (src_contact_id, log_type) values(2, 'web');
insert into contact_log (src_contact_id, log_type) values(2, 'foobar');
insert into contact_log (src_contact_id, log_type) values(3, 'foobar');
insert into contact_log (src_contact_id, log_type) values(4, 'web');

insert into contact_property (contact_id, property_id) values(1, 20);
insert into contact_property (contact_id, property_id) values(1, 21);
insert into contact_property (contact_id, property_id) values(1, 22);
insert into contact_property (contact_id, property_id) values(2, 23);
insert into contact_property (contact_id, property_id) values(2, 24);
insert into contact_property (contact_id, property_id) values(1, 50);
insert into contact_property (contact_id, property_id) values(3, 51);
insert into contact_property (contact_id, property_id) values(5, 52);

-- This gets what I want from contact_log
select src_contact_id, count(log_type)
from contact_log
where contact_log.src_contact_id in (select contact_id from contact_property)
  and log_type in ('web', 'detail')
  and src_contact_id in (select contact_id from contact_property)
group by src_contact_id
order by src_contact_id;
-- correct output is: 1|5, 2|3

-- This gets what I want from contact_property
select contact_id, count(property_id)
from contact_property
where contact_id in (select src_contact_id from contact_log
                     where log_type in ('web', 'detail'))
group by contact_id
order by contact_id;
-- correct output is: 1|4, 2|2

-- THIS DOESN'T WORK (of course - but what would?)
select src_contact_id, count(log_type), count(property_id)
from contact_log
join contact_property cp on cp.contact_id = contact_log.src_contact_id
where contact_log.src_contact_id in (select contact_id from contact_property)
  and log_type in ('web', 'detail')
group by src_contact_id
order by src_contact_id;
-- correct output *should be* : 1|5|4, 2|3|2
/*SQL ENDS*/

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] How to count from a second table in an aggregate query?
Erik Jones wrote:

On Apr 15, 2009, at 1:15 PM, Steve Midgley wrote:

I want to generate an analysis report that counts the values in two separate tables. I've been able to accomplish what I want with two separate queries that I then merge together in Excel. Essentially what I need is a horizontal UNION statement (or something like that): get a FK id and count of a certain column in one table, based on some criteria; for each FK id, get the count of a different column in a different table; display the counts from both queries side-by-side along with the FK id's in a single result set.

Joining against a subquery for the second count does the trick:

select src_contact_id, count(log_type), cp.count
from contact_log,
     (select contact_id, count(property_id)
      from contact_property
      group by contact_id) as cp
where src_contact_id = cp.contact_id
  and log_type in ('web', 'detail')
group by src_contact_id, cp.count
order by src_contact_id;

 src_contact_id | count | count
----------------+-------+-------
              1 |     5 |     4
              2 |     3 |     2

Thanks Erik! This is perfect. Oliveiros showed me another neat solution a while back that involved a select statement in the from clause, but I kind of filed that solution mentally as a cool parlor trick. Now I see that I'm going to have to learn and study this form of SQL more closely, as it's extremely flexible and powerful. Thanks for the very complete and patiently instructive response - it makes perfect sense. I'll work to share this along as I go.

Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] changing multiple pk's in one update
Date: Mon, 13 Apr 2009 17:09:49 -0400
From: Glenn Maynard glennfmayn...@gmail.com
To: pgsql-sql@postgresql.org
Subject: Re: changing multiple pk's in one update
Message-ID: d18085b50904131409g10d43d6cs35dd14ede13b...@mail.gmail.com

(JMdict? I was playing with importing that into a DB a while back, but the attributes in that XML are such a pain--and then my email died while I was trying to get those changed, and I never picked it up again.)

On Mon, Apr 13, 2009 at 1:20 PM, Stuart McGraw smcg2...@frii.com wrote:

1 to the number of sentences in the entry) and the sentence text. The pk is of course the entry id and the sense number. There are other tables that have fk's to the senses.

Your PK is a composite of (entry, order)? Won't your foreign keys elsewhere all break when you shift the order around?

I guess I could add an order[1] column and use the sense number as a surrogate partial key to avoid the need for key renumbering, but all the api's (and the normal human way of thinking) are based on sense number 1 of entry x, sense number 2 of entry y, so one would need to maintain order as a gapless sequence (or add a new mapping layer to map from/to an arbitrary monotonic sequence to a 1,2,3,... sequence) -- the gain doesn't seem that big.

Why not do this in the straightforward way: three separate fields: a regular, sequential PK; an FK to the entry; and an order number. Add an index on (entry_key, order_number). It's a little more expensive since you have a new column and index (the PK), but in a table with a lot of plain text that's probably insignificant. Now you can use the plain PK for your FK's.

I'd agree with this approach. I have a number of tables which are sensitive to arbitrary ordering and they sound roughly similar to your use-case (though my tables are probably smaller). My approach is to create a string column in the table which permits defining arbitrary ordering. I use a string field b/c it's easier for me to stuff (by hand) new ordered records in between other existing records. But an integer would work just as well, so long as you make sure you keep enough space between the integers (e.g. 1000, 2000, 3000). Also, if your ordered list is truly ordinal (each record is either 1st, 2nd, 3rd, etc in a single list) you could just use 1,2,3,4 for the ordering, but then you have to mess with two records in order to swap the positions of (say) item 2 and 3. Of course you can do this pretty easily inside a transaction, and you don't have to worry about the mess of moving PK's.

Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
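A minimal sketch of the integer-with-gaps approach Steve describes (table and column names here are hypothetical, not from the thread):

create table sense (
  id serial primary key,          -- surrogate key; other tables FK to this
  entry_id integer not null,
  order_number integer not null,  -- gapped: 1000, 2000, 3000, ...
  sense_text text
);
create index idx_sense_entry_order on sense (entry_id, order_number);

-- Insert between positions 1000 and 2000 without renumbering anything:
insert into sense (entry_id, order_number, sense_text)
  values (1, 1500, 'a new sense');

-- With a gapless 1,2,3 ordering, swapping items 2 and 3 touches two rows,
-- so wrap it in a transaction (a temporary third value avoids collisions):
begin;
update sense set order_number = 0 where entry_id = 1 and order_number = 2;
update sense set order_number = 2 where entry_id = 1 and order_number = 3;
update sense set order_number = 3 where entry_id = 1 and order_number = 0;
commit;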
Re: [SQL] Nested selects
pgsql-sql-ow...@postgresql.org wrote: Date: Tue, 7 Apr 2009 22:34:38 -0400 From: Glenn Maynard glennfmayn...@gmail.com To: pgsql-sql@postgresql.org Subject: Nested selects Message-ID: d18085b50904071934g7ad206f1i14ac05f7bd29f...@mail.gmail.com I'm deriving high scores from two tables: one containing data for each time a user played (rounds), and one containing a list of stages: CREATE TABLE stage (id SERIAL NOT NULL PRIMARY KEY, name VARCHAR); CREATE TABLE round (id SERIAL NOT NULL PRIMARY KEY, score REAL, stage_id INTEGER REFERENCES stage (id)); INSERT INTO stage (name) VALUES ('stage 1'), ('stage 2'), ('stage 3'); INSERT INTO round (stage_id, score) VALUES (1, 100), (1, 150), (1, 175), (2, 250), (2, 275), (2, 220), (3, 350), (3, 380), (3, 322); SELECT r.* FROM round r WHERE r.id IN ( -- Get the high scoring round ID for each stage: SELECT ( -- Get the high score for stage s: SELECT r.id FROM round r WHERE r.stage_id = s.id ORDER BY r.score DESC LIMIT 1 ) FROM stage s ); This works fine, and with a (stage_id, score DESC) index, is reasonably fast with around 1000 stages. round may expand to millions of rows. Unfortunately, it doesn't generalize to getting the top N scores for each stage; LIMIT 2 isn't valid (more than one row returned by a subquery used as an expression). I fiddled with putting the inner results in an array, without much luck, and I'm not sure how well that'd optimize. Having the results in any particular order isn't important. (In practice, the inner select will often be more specific--high scores on the west coast, high scores this month, and so on.) This seems embarrassingly simple: return the top rounds for each stage--but I'm banging my head on it for some reason. How about this: select round.*, stage.name from round left join stage on stage.id = round.stage_id ORDER BY round.score DESC; Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
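The remaining question in the thread is how to generalize to the top N rounds per stage. One pre-window-function way to do that on 8.x (an editorial sketch, not from the thread) is a correlated count: keep a round if fewer than N rounds in the same stage beat its score. Note that score ties can return extra rows:

select r.*
from round r
where (select count(*)
       from round r2
       where r2.stage_id = r.stage_id
         and r2.score > r.score) < 2   -- N = 2: top two per stage
order by r.stage_id, r.score desc;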
Re: [SQL] Can we load all database objects in memory?
At 09:20 AM 3/26/2009, pgsql-sql-ow...@postgresql.org wrote:

Message-Id: 587e5df3-5859-48de-93f9-f7b05c37e...@rvt.dds.nl
From: ries van Twisk p...@rvt.dds.nl
To: DM dm.a...@gmail.com
In-Reply-To: eae6a62a0903251220p2edd379en50d17541edef0...@mail.gmail.com
Subject: Re: Can we load all database objects in memory?
Date: Wed, 25 Mar 2009 15:07:21 -0500
References: eae6a62a0903251220p2edd379en50d17541edef0...@mail.gmail.com
X-Archive-Number: 200903/89
X-Sequence-Number: 32332

The short answer is no, you cannot force PostgreSQL to load all objects into memory. However, when you properly configure PostgreSQL, most if not all of your data will be cached by the OS and/or PostgreSQL shared memory system.

On Mar 25, 2009, at 2:20 PM, DM wrote:

I have a database of 10GB. My Database Server has a RAM of 16GB. Is there a way that I can load all the database objects to memory?

Since you have such an abundance of RAM, I'd think you could simulate this feature by setting up some kind of psql script to pull down all 10GB of data through Postgres when the server first boots - possibly just a bunch of "select * from [table];" commands for each table in your catalog. All/most of the data should be cached into RAM by the OS at that point. Also, I don't know what the state of the art is regarding RAM disks these days, but for a read-only database, that seems like an option (10GB of RAM disk for your read-only data and 6GB of RAM for OS and Pg).

Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
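One way to generate the warm-up script Steve suggests - a sketch, assuming the data lives in ordinary user schemas (the OS can of course still evict pages later under memory pressure):

select 'select count(*) from '
       || quote_ident(schemaname) || '.' || quote_ident(tablename) || ';'
from pg_tables
where schemaname not in ('pg_catalog', 'information_schema');

-- Feed the generated statements back through psql; count(*) forces a full
-- heap scan of each table, pulling its pages through the cache.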
Re: [SQL] Alter Table/Indexing
At 02:20 AM 3/25/2009, pgsql-sql-ow...@postgresql.org wrote:

To: Zdravko Balorda zdravko.balo...@siix.com
cc: pgsql-sql@postgresql.org
Subject: Re: Alter Table/Indexing
In-reply-to: 49c89fea.8060...@siix.com
References: 49c89fea.8060...@siix.com
Comments: In-reply-to Zdravko Balorda zdravko.balo...@siix.com message dated Tue, 24 Mar 2009 09:55:06 +0100
Date: Tue, 24 Mar 2009 10:35:31 -0400
Message-ID: 27189.1237905...@sss.pgh.pa.us
From: Tom Lane t...@sss.pgh.pa.us
X-Archive-Number: 200903/84
X-Sequence-Number: 32327

Zdravko Balorda zdravko.balo...@siix.com writes:

I wonder: does ALTER TABLE TYPE, SET, depend on indexes the way INSERT does, in the sense that it may be faster to drop and recreate an index than to sort after every row inserted?

ALTER TABLE TYPE already rebuilds the indexes; you won't make the overall process any faster by doing that by hand.

regards, tom lane

I had a case (a long time ago) where I was on MS SQL in a production environment. We had a number of indices which were system related - meaning they were used infrequently to speed up certain administrative functions. When doing a bulk load we found that if we dropped these indices (but kept the ones that were crucial for production) we could significantly shorten the effective downtime of the system b/c any DDL statement executed faster. We would then schedule these indices to be re-created at later dates, spreading out the load (b/c the system was in production at that point). I wonder if Postgres functions similarly for such a use case? As Tom says, the total processing time is fixed: you have to upload the data and rebuild all the indices, but if there are non-critical indices, you can go from zero to data loaded faster by dropping them and rebuilding them manually later?

Thanks for any insight on that (and I hope my question helps the OP as well - if this seems off topic let me know),

Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
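Postgres does behave this way for plain data loads (fewer indexes means less per-row maintenance, and a bulk index build is faster than incremental upkeep). A sketch of the pattern Steve describes; the index, table, and file names are made up:

-- Before the bulk load: drop only the non-critical indexes
drop index if exists idx_orders_report_month;

-- The load now maintains one less index per inserted row
copy orders from '/tmp/orders.dat';

-- Later, during a quiet period, rebuild the dropped index
create index idx_orders_report_month on orders (report_month);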
Re: [SQL] Exclude fields from SELECT command
At 05:20 PM 3/16/2009, pgsql-sql-ow...@postgresql.org wrote:

In-Reply-To: 1992170861895942...@unknownmsgid
References: 1992170861895942...@unknownmsgid
Date: Mon, 16 Mar 2009 22:45:54 +0100
Message-ID: 162867790903161445i78127316s1c0deb3bec0e1...@mail.gmail.com
Subject: Re: Exclude fields from SELECT command
From: Pavel Stehule pavel.steh...@gmail.com
To: Charles Tam c@osm.net.au
Cc: pgsql-sql@postgresql.org

2009/3/16 Charles Tam c@osm.net.au:

Hi Everybody, I've a table with 35 fields and would like to perform a SELECT command without specifying every single field. As such, I've used the SELECT * command. Is there an approach to exclude 5 fields from being returned?

hello, no, there is no way

regards Pavel Stehule

I think Pavel is right for 99% of the cases. But there is a cure that is worse than the disease. You could select all columns from a bunch of tables without knowing what the column names were, excepting N columns, by iterating through the info schema data and building a SQL select appropriately (SQL meta-programming, I guess you would call it). But it's a real chore to do manually. If you have this need for some programmatic purpose (where some initial investment in effort will pay future dividends), then check out the info schema options:

http://developer.postgresql.org/pgdocs/postgres/infoschema-columns.html

In your case, I think you'd be looking for five values of table_name and then looking at all the column_name fields, building your column list, excepting the column_names you wish to exclude.

Best, Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
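A sketch of the meta-programming Steve describes, building the 30-column select list by excluding five columns (the table and column names here are hypothetical):

select 'select '
       || array_to_string(array(
            select column_name::text
            from information_schema.columns
            where table_name = 'mytable'
              and column_name not in ('col_a','col_b','col_c','col_d','col_e')
            order by ordinal_position), ', ')
       || ' from mytable;';

-- The output is itself a SQL statement, e.g.
--   select col_f, col_g, ... from mytable;
-- which you can then run directly or embed in application code.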
[SQL] Re: select count of all overlapping geometries and return 0 if none.
At 06:20 AM 3/13/2009, pgsql-sql-ow...@postgresql.org wrote:

Message-ID: 457532.70947...@web45913.mail.sp1.yahoo.com
Date: Thu, 12 Mar 2009 10:28:19 -0700 (PDT)
From: Duffer Do dufferd...@yahoo.com
Subject: select count of all overlapping geometries and return 0 if none.
To: pgsql-sql@postgresql.org
X-Archive-Number: 200903/24
X-Sequence-Number: 32267

Hello all, I have 2 tables, locations and user_tracker.

locations has 2 columns:
location_name
location_geometry

user_tracker has 3 columns:
user_name
user_geometry
user_timestamp

The locations table is coordinates and names of areas of interest. user_tracker basically is an archive of a user's movements as he pans his map. I have a need to assign a ranking of locations based on how many times users have intersected this location. The problem I am having is that my query only returns locations that have been intersected by a user. I need it to return ALL locations, and a zero if a location has not been intersected. As an example:

LOCATIONS
1: Talahassee, FL | talahassee's bounding box
2: Manhattan, NY | Manhattan's bounding box
3: Frankfurt, GE | Frankfurt's bounding box

USER_TRACKER
john doe | geometry that overlaps Frankfurt | today
john doe | geometry that overlaps Frankfurt | today
john doe | geometry that overlaps Frankfurt | today
john doe | geometry that overlaps Frankfurt | yesterday
john doe | geometry that overlaps Frankfurt | Monday
john doe | geometry that overlaps Frankfurt | Sunday
Mary Jane | geometry that overlaps Manhattan | today
Rob Roy | geometry that overlaps Manhattan | today
Rob Roy | geometry that overlaps Manhattan | today

I want to return the following:

locations | number_visits
Frankfurt | 6
Manhattan | 3
Talahassee | 0

My query only returns:

Frankfurt | 6
Manhattan | 3

Now I have really simplified this example for readability; my actual tables are more complex. How can I accomplish this? My query:

SELECT count(user_name) as number_visits, location_name
from locations, user_tracker
WHERE user_geometry && location_geometry
GROUP BY location_name

Thanks in advance

Hi, I am stuck at an airport right now, and had the time to hack out your solution. I hope it helps. In the future, it would be helpful to the list (and more likely to yield responses) if you include the create/insert statements such as what I've put together below. It makes it much easier for people to quickly get you an answer - in this case I had a spare 20 minutes so I did it for you. Also, I put a gist index on your locations table, but that assumes your areas are flat, which isn't right for spherical coordinates on the earth. PostGIS should be able to get you closer, if that kind of accuracy matters.

Anyway, here are the table setups and what I believe is the solution query:

-- START SQL SCRIPT --
drop table if exists locations;
create table locations (id serial, name varchar(255), area circle);
insert into locations (name, area) values
  ('Talahassee, FL','((0,0),1)'),
  ('Manhattan, NY','((2,0),1)'),
  ('Frankfurt, GE','((4,0),1)');
CREATE INDEX idx_locations_area ON locations USING gist (area);

drop table if exists user_tracker;
create table user_tracker (id serial primary key, name varchar(255), location point, time varchar(255));
insert into user_tracker (name, location, time) values
  ('john doe', '(4,0)', 'today'),
  ('john doe', '(4,0)', 'today'),
  ('john doe', '(4,0)', 'today'),
  ('john doe', '(4,0)', 'yesterday'),
  ('john doe', '(4,0)', 'Monday'),
  ('john doe', '(4,0)', 'Sunday'),
  ('Mary Jane', '(2,0)', 'today'),
  ('Rob Roy', '(2,0)', 'today'),
  ('Rob Roy', '(2,0)', 'today');

-- SOLUTION QUERY HERE --
select count(user_tracker.id), locations.name
from user_tracker
right outer join locations on user_tracker.location @ locations.area
group by locations.name;
-- END SQL SCRIPT --

OUTPUT:
3;Manhattan, NY
0;Talahassee, FL
6;Frankfurt, GE

Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Best practices for geo-spatial city name searches?
At 08:20 AM 2/25/2009, pgsql-sql-ow...@postgresql.org wrote:

To: pgsql-sql@postgresql.org
From: Mark Stosberg m...@summersault.com
Subject: Best practices for geo-spatial city name searches?
Date: Tue, 24 Feb 2009 11:19:56 -0500
Message-ID: 20090224111956.5b7a4...@summersault.com
X-Archive-Number: 200902/94
X-Sequence-Number: 32231

Hello, I use PostgreSQL and the cube type to perform geo-spatial zipcode proximity searches. I'm wondering about the best practices for also supporting a geo-spatial distance search based on a city name rather than zipcode. In our original data model, we used a 'zipcodes' table, with the zipcode as the primary key. This can of course contain a City Name column, but there is a problem with this, illustrated by a Nome, Alaska case. Nome's zipcode is 99762. It maps to multiple cities including Diomede, Alaska and Nome, Alaska. In the data model described, only the Diomede row is imported, and the other rows, including the Nome, Alaska row, are dropped. So if you try to search for Nome, Alaska, you won't find anything. One solution would be to have a cities table, with the city/state as the primary key, and a zipcode as an additional column. Then, by joining on the zipcodes table, the coordinates for a city could be found. Is there any other way I should be considering data modelling to support searches on zipcodes and cities?

Hi Mark, I built a very similar system for www.hutz.com. It uses a complete postcode database, without dropping nearby/overlapping cities. It also includes the postcode alias values, which are names that the post office uses as equivalent to the official names. Within the city table, I created a series of self-joining id's:

id | alias_city_id | post_code_city_id | muni_city_id

So a city record can be an alias, postcode or muni record. A muni record is the definitive record for a city (and is defined by the postcode record closest to the city center as defined by the USGS). A postcode record represents a postcode (zipcode) region within a city. An alias represents an alternate name that refers either to a muni record or a postcode record (and is defined as alias_city_id IS NOT NULL). So if I want to search the table for only muni city records, the query looks like:

select * from city where id = muni_city_id

I also included lat/long coordinates for every record, making it easy to calculate distances and find all city records within a certain range, etc. (I used the point and circle operators for this along with a GiST index - it's not perfect for long distances - it assumes the earth is flat, but it works great for small distances and is very fast). I hope this helps. Feel free to contact me on-list or off, if you want to discuss more.

Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
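A DDL sketch of the self-joining city table Steve describes (the column types are guesses; the thread gives only the column names):

create table city (
  id serial primary key,
  name varchar(255) not null,
  alias_city_id integer references city (id),      -- set when this row is an alias
  post_code_city_id integer references city (id),  -- this row's postcode record
  muni_city_id integer references city (id),       -- the city's definitive record
  location point                                   -- lat/long for distance searches
);

-- Only the definitive (muni) record per city:
select * from city where id = muni_city_id;

-- Only the alias records:
select * from city where alias_city_id is not null;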
Re: [SQL] problem using twice custom comparision operator
At 01:20 PM 1/24/2009, pgsql-sql-ow...@postgresql.org wrote:

From: Marek Florianczyk fra...@adm.tp.pl
Organization: TP SA
To: pgsql-sql@postgresql.org
Subject: problem using twice custom comparision operator
Date: Fri, 23 Jan 2009 21:42:44 +0100
Message-Id: 200901232142.44102.fra...@adm.tp.pl

Hi all, I wanted to make a custom operator to sort data like this: 1,2,10,1a,1b,10a into the order: 1,1a,1b,2,10,10a

Hi Marek, The following idea may be too different to fit your needs, but I got some help from this list a while back on how to force sorts for a specific query. It sounds like you want to override sorting for all queries, so this may not be appropriate. Anyway, here's an example of a solution that sorts things in arbitrary order for any given query:

SELECT *
FROM foobar
ORDER BY CASE field
  WHEN 555 THEN 1
  WHEN 342 THEN 2
  WHEN 111 THEN 3
  ELSE 4
END

This sorts 555, then 342, then 111, then everything else. Obviously this is oversimplified for your case, but you could write some comparisons in place of the static numbers (e.g. 555) that follow the same rules as the function you're writing. I don't know if performance would be anything comparable either (I'd guess that using the regex operators, like ~*, would be the way to go). I thought I'd mention this other approach in case it was of interest and you haven't run across it before.

Sincerely, Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
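Along the lines of Steve's regex suggestion, here is a sketch (not from the thread) that produces Marek's desired 1,1a,1b,2,10,10a order without a custom operator, assuming every value starts with digits; the table and column are hypothetical:

select val
from t
order by (substring(val from '^([0-9]+)'))::integer,  -- numeric prefix, sorted numerically
         substring(val from '^[0-9]+(.*)$');          -- trailing letters, sorted as text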
[SQL] Re: some howto/theory book/tutorial on practical problem solving in SQL
At 09:20 AM 1/14/2009, pgsql-sql-ow...@postgresql.org wrote:

Date: Wed, 14 Jan 2009 12:05:29 +0100
From: Ivan Sergio Borgonovo m...@webthatworks.it
To: pgsql-sql@postgresql.org
Subject: Re: some howto/theory book/tutorial on practical problem solving in SQL
Message-ID: 20090114120529.0ab11...@dawn.webthatworks.it
In-Reply-To: 375b6e92-443f-4bd3-bd03-908925639...@engineyard.com
References: 2009073227.159ab...@dawn.webthatworks.it 375b6e92-443f-4bd3-bd03-908925639...@engineyard.com

O'Reilly's SQL Hacks is a good one that fits the bill you describe.

I think it is complementary to Celko's SQL Puzzles. O'Reilly's book looks like techniques from the field; Celko's book looks like design from the field. Meanwhile I downloaded all the ecommerce, billing and ERP open source programs I know (over 60) and started to examine the code to see how they manage discounts in SQL. The ones that seem most promising to learn from are: ofbiz, webERP, promogest. I'll come back asking for comments on a possible design to solve my problem later.

Hi Ivan, You're studying something very interesting, and I hope you will post your progress and solutions, questions and ideas back to this list as you go. I've forwarded this thread already to several people who work with related issues, and they're very interested in some solutions as well. So stay in touch as you work on this, please.

Sincerely, Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] How to excute dynamically a generated SQL command?
At 10:20 AM 1/4/2009, pgsql-sql-ow...@postgresql.org wrote:

Message-ID: 618950b80901031757l15109658kdae1cdb0814d3...@mail.gmail.com
Date: Sat, 3 Jan 2009 17:57:32 -0800
From: John Zhang johnzhan...@gmail.com
To: postgis-us...@postgis.refractions.net
Subject: How to excute dynamically a generated SQL command?
X-Archive-Number: 200901/2
X-Sequence-Number: 32084

Hi the list, Referring to the PostgreSQL 8.3 documentation, 38.5.4 Executing Dynamic Commands, the command for executing a dynamic command is:

EXECUTE command-string [ INTO [STRICT] target ];

I am to execute an SQL statement created dynamically, which is represented in a variable sSql. Here is an example:

sSql='INSERT INTO hm_raster.hm_airphotos( file_ext, airphoto) VALUES ('.tif', lo_import( E''C:\\HM\\Data\\Flightmap.tif'');';
EXECUTE sSql;

It raises the error as:

ERROR: syntax error at end of input
LINE 1: ...E'C:\\HM\\Data\\Flightmap.tif') ^

I would appreciate a lot if you offer your input. Thanks a lot. John

John: You're not escaping all your strings. That error message is a tip-off, I think. Try this line:

sSql='INSERT INTO hm_raster.hm_airphotos( file_ext, airphoto) VALUES (''.tif'', lo_import( E''C:\\HM\\Data\\Flightmap.tif''));';

The part I changed was: ''.tif''

I'm not sure what language you're working in, but it's remotely possible (depending on the execution stack) that you have to doubly escape your backslashes also, in which case:

sSql='INSERT INTO hm_raster.hm_airphotos( file_ext, airphoto) VALUES (''.tif'', lo_import( E''C:\\\\HM\\\\Data\\\\Flightmap.tif''));';

I suffer on Windows wishing we could have / path separators by default. Note that these days Windows generally does support / instead of \ for paths if you're careful. If you put them in quotes, it works even on the command line, which is helpful. You can type this directly into the CMD prompt now:

dir c:/temp

All new programs I write on Windows (in Ruby) use forward slashes for paths, and it works just fine. Not sure about VB or C#, but I'd guess you can make it work. Might be simpler than all the escaping work.

Best, Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
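If the dynamic SQL is being built inside a plpgsql function, quote_literal() sidesteps most of this hand escaping. A sketch - the function name and signature are made up for illustration:

create or replace function load_airphoto(p_path text) returns void as $$
declare
  sSql text;
begin
  -- quote_literal() wraps and escapes the value correctly, so the
  -- backslashes in p_path survive into the generated statement
  sSql := 'INSERT INTO hm_raster.hm_airphotos (file_ext, airphoto) VALUES ('
          || quote_literal('.tif') || ', lo_import('
          || quote_literal(p_path) || '))';
  execute sSql;
end;
$$ language plpgsql;

-- usage:
-- select load_airphoto(E'C:\\HM\\Data\\Flightmap.tif');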
Re: [SQL] Question on Escape-string
At 05:20 AM 1/1/2009, pgsql-sql-ow...@postgresql.org wrote:

To: pgsql-sql@postgresql.org
Subject: Question on Escape-string
X-Archive-Number: 200812/132
X-Sequence-Number: 32082

Dear all, I am using pl/pgsql to develop a function to implement some logic to load BLOB data, like a .tif file, to a Postgres DB. The issue I am facing is that the file name MUST be written with double backslashes \\ in order for pgsql to process the string properly. However, when the string is escaped in my function, how can I pass it in to the lo_import() function? Is there any function to double backslash a string? Or how can we preserve a string as RAW?

ISSUE:
-- use E'C:\\tmp\\tst.tif' for the full file name for the IN parameter of the load_blob function.
-- however, when the string is escaped it becomes 'C:\tmp\tst.tif' as expected
-- the file name then needs to be passed in to the lo_import() function again, now without double \\
-- when it is passed in and escaped, the \ is gone and the filename becomes meaningless

Any input would be much appreciated! Thanks a lot. John

Hi John, If I understand you, you want to put double backslashes back into a string that has been stored in a Postgres field with single backslashes? Here's some SQL I cooked up to demonstrate what I think is a solution. Note the use of \\ and doubly-escaped backslashes in the replace parameters - that's the key.

DROP TABLE IF EXISTS test;
CREATE TABLE test (
  filename character varying(255) NOT NULL,
  data bytea
);
insert into test (filename, data) values (E'c:\\tmp\\tst.tif', '1234');

select replace(filename, E'\\', E'\\\\'), data from test;

Does this do it?

Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Best way to and from a one-to-many joined table?
At 11:20 AM 12/6/2008, [EMAIL PROTECTED] wrote:

Message-ID: [EMAIL PROTECTED]
From: Oliveiros Cristina [EMAIL PROTECTED]
To: Bryce Nesbitt [EMAIL PROTECTED], sql pgsql pgsql-sql@postgresql.org
References: [EMAIL PROTECTED]
Subject: Re: Best way to and from a one-to-many joined table?
Date: Fri, 5 Dec 2008 19:23:25 -

Howdy, Bryce. Could you please try this out and tell me if it gives what you want.

Best, Oliveiros

SELECT person_name
FROM test_people p
JOIN test_attributes a
  ON ((a.people_id = p.people_id) AND (a.attribute = @firstAttr))
JOIN test_attributes b
  ON ((b.people_id = p.people_id) AND (b.attribute = @secondAttr));

Hi, I saw a few people post answers to this question and it raised another related question for me. What are the differences between the above query and this one? Are they semantically/functionally identical but might differ in performance? Or would they be optimized down to an identical query? Or am I misreading them and they are actually different?

SELECT person_name
FROM test_people p
JOIN test_attributes a ON (a.people_id = p.people_id)
JOIN test_attributes b ON (b.people_id = p.people_id)
WHERE (a.attribute = @firstAttr)
  AND (b.attribute = @secondAttr);

Also, any suggestions about how to figure this out on my own without bugging the list in the future would be great. Thanks for any insight!

Steve

p.s. I'm posting in the same thread, but if you think I should have started a new thread let me know for the future.

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
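One self-serve way to answer this kind of question is to compare the plans Postgres chooses for the two forms; for inner joins, a condition in the ON clause and an equivalent condition in the WHERE clause are treated the same by the planner. A sketch, with the @-style parameters from the quoted code replaced by literal values:

explain analyze
SELECT person_name
FROM test_people p
JOIN test_attributes a ON (a.people_id = p.people_id)
JOIN test_attributes b ON (b.people_id = p.people_id)
WHERE a.attribute = 'first' AND b.attribute = 'second';

-- Run the same EXPLAIN ANALYZE on the ON-clause form; identical plans
-- mean the two spellings are being optimized identically.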
Re: [SQL] Sequence and nextval problem
At 11:20 PM 11/24/2008, [EMAIL PROTECTED] wrote:

Message-Id: [EMAIL PROTECTED]
From: ries van Twisk [EMAIL PROTECTED]
To: Tk421 [EMAIL PROTECTED]
In-Reply-To: [EMAIL PROTECTED]
Subject: Re: Sequence and nextval problem
Date: Mon, 24 Nov 2008 16:21:40 -0500
References: [EMAIL PROTECTED]
X-Archive-Number: 200811/144
X-Sequence-Number: 31928

On Nov 24, 2008, at 2:12 PM, Tk421 wrote:

The conversion from the Access database to Postgres worked fine. Everything is OK. But now, when I use my database, I've found a problem with sequences. In the conversion, the autonumeric fields from Access were converted to sequences; everything looked OK at first view. The problem comes because the autonumeric fields in Access always return the last value of the table +1, but Postgres does not. Postgres returns lost (I don't know what to call them) values. An example. [snip] In Access, if I execute INSERT INTO table (description) VALUES ('desc 8'), the result row is:

8 | desc 8

But in Postgres the same query's result row is:

3 | desc 8

My question is: can I do something to make ANY sequence take the last value from its associated table, and not a lost value?

This sounds as if the start of the sequence is set incorrectly. Try this:

SELECT setval('NAME OF SEQUENCE', SOME_INTEGER, true);

BTW, you should also not expect a specific value from the sequence, except that you will always get the next value from the sequence. It's also generally a bad idea to do select max(someid)+1 from table. The whole concept of a sequence is thus much better.

I think this is sound general advice for a production database. However, if you control the database such that you can prevent access to it while you are updating it, you can run something like:

SELECT setval('NAME OF SEQUENCE', (select max(id)+1 from table_of_sequence), true);

Where table_of_sequence is the name of the table which the sequence is attached to. The reason you don't normally use that syntax is that it's not multi-user safe. But if you know there are no other users running changes to that sequence when you run your updates, then you're good to go. It's a very fast way to update all your tables to make sure the sequence #'s are all valid, without having to look up the max value on each one (which would also require that you shut off access to the table and for a much longer time).

Hope that helps, Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
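A variant of Steve's statement that also avoids hardcoding the sequence name, using pg_get_serial_sequence(); this is a sketch where "mytable" and its serial column "id" are hypothetical, and the table is assumed non-empty:

SELECT setval(pg_get_serial_sequence('mytable', 'id'),
              (SELECT max(id) FROM mytable),
              true);

-- With the third argument true, the next nextval() call returns max(id)+1,
-- so newly inserted rows continue directly after the existing ids.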
Re: [SQL] grouping/clustering query
At 10:20 PM 10/22/2008, you wrote:

Message-ID: [EMAIL PROTECTED]
Date: Wed, 22 Oct 2008 12:14:49 +0700
From: David Garamond [EMAIL PROTECTED]
To: pgsql-sql@postgresql.org
Subject: grouping/clustering query
X-Archive-Number: 200810/89
X-Sequence-Number: 31731

Dear all, I have an invoices (inv) table and a bank transaction (tx) table. There's also the payment table, which is a many-to-many relation between the former two tables, because each invoice can be paid by one or more bank transactions, and each bank transaction can pay for one or more invoices. Example:

# (invoiceid, txid)
(A, 1)
(A, 3)
(B, 1)
(B, 2)
(C, 5)
(D, 6)
(D, 7)
(E, 8)
(F, 8)

For journalling, I need to group/cluster this together. Is there a SQL query that can generate this output:

# (journal: invoiceids, txids)
[A,B], [1,2,3]
[C], [5]
[D], [6,7]
[E,F], [8]

Hi Dave, I'm not following the logic here. A has 1,3 and B has 1,2. So why does the first line print:

[A,B], [1,2,3]

What's the rule that tells the query to output this way? Is it that all of B's values are between A's values? Also, in your output you've indicated [A,B] - does this mean you want two columns of output, each column being a pg array? I may not be the best person to answer the actual SQL question, but I thought I'd clarify your requirements so the list members can have the best chance of answering.

Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] many-to-many relationship
At 05:20 PM 10/8/2008, [EMAIL PROTECTED] wrote:

Date: Wed, 8 Oct 2008 11:25:10 +0200
From: Louis-David Mitterrand [EMAIL PROTECTED]
To: pgsql-sql@postgresql.org
Subject: Re: many-to-many relationship
Message-ID: [EMAIL PROTECTED]
Mail-Followup-To: pgsql-sql@postgresql.org
References: [EMAIL PROTECTED] [EMAIL PROTECTED]
In-Reply-To: [EMAIL PROTECTED]
X-Archive-Number: 200810/23
X-Sequence-Number: 31665

|id|image_url|f_table|f_key|
|1 |url..    |person |1234 |
|2 |url2.    |event  |5678 |

I think this is called a polymorphic join but I could be wrong about that. I'd guess you could construct a rule or trigger to validate the foreign key data on insert/update but that's out of my skill area.

Hi Steve, So in your solution the f_table column is just text which needs to be validated by a custom trigger?

Hi, Yup - that's exactly what I'm suggesting. Storing the text value of the related tables right in the table in question. It might seem insane, but in my experience it works out reasonably well. Ruby on Rails has popularized the approach, using it both in the data backend and in the OO frontend (so the Ruby object class to be instantiated is chosen by the text value of f_table for a given row - hence the polymorphism).

http://wiki.rubyonrails.org/rails/pages/UnderstandingPolymorphicAssociations

There are some situations where this approach could create problems, but if in general all you're doing is select statements along these lines:

select * from images where f_table = 'person' and f_key = '1234'

There's not much to go wrong. (Famous last words.) And regarding the custom validation by trigger, I'd think that would work just fine. I'm not an expert on triggers, rules and constraints in Pg though. (I do all my validation in the middleware, which might give some people here high blood pressure). :) Keep us posted on which solution you choose and how it works out for you!

Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
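A sketch of the validating trigger discussed above (assuming integer ids, an "images" table as in Steve's example, and parent tables "person" and "event"; none of this code is from the thread):

create or replace function check_image_fkey() returns trigger as $$
begin
  if new.f_table = 'person' then
    perform 1 from person where id = new.f_key;
  elsif new.f_table = 'event' then
    perform 1 from event where id = new.f_key;
  else
    raise exception 'unknown f_table value: %', new.f_table;
  end if;
  if not found then
    raise exception 'no % row with id %', new.f_table, new.f_key;
  end if;
  return new;
end;
$$ language plpgsql;

create trigger images_fkey_check
  before insert or update on images
  for each row execute procedure check_image_fkey();

-- Caveat: unlike a real foreign key, this does not stop someone from later
-- deleting the person/event row and orphaning the image.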
Re: [SQL] many-to-many relationship
At 06:20 AM 10/7/2008, [EMAIL PROTECTED] wrote:

Date: Mon, 6 Oct 2008 15:08:02 +0200
From: Louis-David Mitterrand [EMAIL PROTECTED]
To: pgsql-sql@postgresql.org
Subject: many-to-many relationship
Message-ID: [EMAIL PROTECTED]
X-Archive-Number: 200810/13
X-Sequence-Number: 31655

Hi, Say you have several objects (tables): person, location, event, etc. all of which can have several images attached. What is the best way to manage relations between a single 'image' table and these different objects? For now each 'image' row has pointers to id_person, id_location, id_event, etc. (only one of which is used for any given row). Is there a better, more elegant way to do it, without using redundant id_* pointers on each row and yet still enforce foreign keys? Thanks,

Hi, I think the relationship-tables method works pretty well, but I have another suggestion. You could store the foreign table name in the image table, along with the foreign key:

|id|image_url|f_table|f_key|
|1 |url..    |person |1234 |
|2 |url2.    |event  |5678 |

I think this is called a polymorphic join but I could be wrong about that. I'd guess you could construct a rule or trigger to validate the foreign key data on insert/update but that's out of my skill area. Hope that helps a little,

Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Finding sequential records
At 09:50 PM 9/29/2008, Richard Broersma wrote:

On Mon, Sep 29, 2008 at 7:48 PM, Steve Midgley [EMAIL PROTECTED] wrote:

In my specific case it turns out I only had duplicates, but there could have been n-plicates, so your code is still correct for my use-case (though I didn't say that in my OP).

Ya, there are a lot of neat queries that you can construct. If you have a good background in math and set theory (which I don't have) you can develop all sorts of powerful analysis queries. On a side note, I thought that I should mention that unwanted duplicates are an example where some ~have gotten bitten~ with a purely surrogate key approach. To make matters worse is when some users update part of one duplicate and others update a different duplicate on another field (or fields). Then once the designer discovers the duplicate problem, she/he has to figure out some way of merging these non-exact duplicates. So even if the designer has no intention of implementing natural primary/foreign keys, he/she will still benefit from a natural key consideration in that a strategy can be designed to prevent getting bitten by duplicated data. I only mention this because db designers get bitten by this all the time. Well, at least the ones that subscribe to www.utteraccess.com get bitten. From what I've seen not one day has gone by without someone posting a question to this site about how to both find and remove all but one of the duplicates.

Truly. I have worked with some school districts around the US and this duplicate record problem is more than theoretical. Some of the gnarliest, dirtiest, n-plicate data I've ever seen comes out of the US public education system. More generally, where I have seen a need for natural keys, I've always taken the best-of-both-worlds approach. So I always stick an integer/serial PK into any table - why not - they're cheap and sometimes handy. And then for tables along the lines of your description, I add a compound unique index which serves the business rule of "no dupes" along these lines. Am I following your point? Any reason why using serial PK's with compound natural unique indices is better/worse than just using natural PK's?

Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
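A DDL sketch of the best-of-both-worlds pattern Steve describes; the table and its natural-key columns are made up for illustration:

create table enrollment (
  id serial primary key,            -- cheap surrogate key; FKs point here
  student_id integer not null,
  school_id integer not null,
  school_year integer not null,
  unique (student_id, school_id, school_year)  -- the natural key, enforced
);

-- The unique constraint rejects the duplicates Richard warns about, while
-- joins and foreign keys still use the simple serial id.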
Re: [SQL] Finding sequential records
At 05:38 PM 9/26/2008, Oliveiros Cristina wrote:

In-Reply-To: [EMAIL PROTECTED]
References: [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED]

Howdy, Steve.

SELECT id
FROM dummy a
NATURAL JOIN (
  SELECT fkey_id, name
  FROM dummy
  GROUP BY fkey_id, name
  HAVING COUNT(*) > 1
     AND SUM(id) = (MAX(id) + MIN(id)) * (MAX(id) - MIN(id) + 1) / 2
) b
ORDER BY id;

In your table you just have duplicates? Or you may have triplicates? And quadruplicates? And in general n-uplicates? At the time, I thought you might have n-uplicates, so I designed the query to be as general as possible to handle all those cases, of which duplicates are a particular case, but now I am wondering if you don't have more than duplicates.

In my specific case it turns out I only had duplicates, but there could have been n-plicates, so your code is still correct for my use-case (though I didn't say that in my OP).

Well, anyway, the idea is as follows. The sum of a sequence is given by (first + last) / 2 * n, with n = last - first + 1, OK?

I *love* your application of that formula. It's rare for me to be able to use real math in SQL, so this was a pleasure to read (and understand!) Thanks again to Richard and Oliveiros for a truly educating experience! I hope some others were similarly enlightened.

With gratitude, Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
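A worked check of that formula against the ids from the dummy table (my arithmetic, not from the thread): for the consecutive pair 502163 and 502164, MIN = 502163 and MAX = 502164, so the formula gives (502163 + 502164) * (502164 - 502163 + 1) / 2 = 1004327, which equals SUM(id), so the group survives the HAVING clause. For the non-consecutive 'Lazy Bear Lodge' pair 502169 and 502199, the formula gives (502169 + 502199) * 31 / 2 = 15567704, while SUM(id) is only 1004368, so that group is filtered out. The formula equals SUM(id) exactly when the group's ids form an unbroken run from MIN to MAX.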
[SQL] Finding sequential records
Hi, I've been kicking this around today and I can't think of a way to solve my problem in pure SQL (i.e. I can only do it with a looping/cursor-type solution and some variables). Given a table with this DDL/data script:

drop table if exists dummy;
create table dummy (
  id integer primary key,
  name varchar(255),
  fkey_id integer
);

insert into dummy (id, name, fkey_id) values (502163,'3Br/3Ba Pool Villa in Westin St. John, USVI- Summer 2008',500100);
insert into dummy (id, name, fkey_id) values (502164,'3Br/3Ba Pool Villa in Westin St. John, USVI- Summer 2008',500100);
insert into dummy (id, name, fkey_id) values (502169,'Lazy Bear Lodge',105);
-- not sequential id to previous
insert into dummy (id, name, fkey_id) values (502199,'Lazy Bear Lodge',105);
insert into dummy (id, name, fkey_id) values (502170,'3 Bed, 1 Bath Cottage Less Than a Mile from West Dennis Beach',500089);
insert into dummy (id, name, fkey_id) values (502171,'3 Bed, 1 Bath Cottage Less Than a Mile from West Dennis Beach',500089);
-- not sequential id nor duplicate fkey_id to previous
insert into dummy (id, name, fkey_id) values (502175,'3 Bed, 1 Bath Cottage Less Than a Mile from West Dennis Beach',500102);
insert into dummy (id, name, fkey_id) values (502213,'Sea Watch',500128);
-- not duplicate fkey_id to previous
insert into dummy (id, name, fkey_id) values (502214,'Sea Watch',500130);

Find all instances where:
* name is duplicated
* fkey_id is the same (for any set of duplicated name fields)
* id is sequential (for any set of duplicated name fields)

The system should return:

502163
502164
502170
502171

Here's as far as I got:

select id from dummy
where name in (
  select name from dummy group by name having count(name) > 1
)
order by id

I can't figure out how to test for duplicate fkey_id when name is the same, nor to test for sequential id's when name is the same. Having a method for either would be great, and both would be a bonus! It seems like there's a clever way to do this without cursors but I can't figure it out! Thanks for any help!

Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Finding sequential records
Wow. Thanks to both Richard and Oliveiros. Out of the box, Oliveiros' solution does what I want but I don't understand why!

SELECT id
FROM dummy a
NATURAL JOIN (
  SELECT fkey_id, name
  FROM dummy
  GROUP BY fkey_id, name
  HAVING COUNT(*) > 1
     AND SUM(id) = (MAX(id) + MIN(id)) * (MAX(id) - MIN(id) + 1) / 2
) b
ORDER BY id;

What's going on here with the sum(id) equaling the average product of the min and max? I gather that's to match id's with id's that are one bigger than itself? Can anyone clarify how that is working?

Richard's SQL is very interesting to me in concept - but it's not getting me the results correctly:

SELECT A.*
FROM (
  SELECT ID
  FROM Dummy
  GROUP BY name, fkey_id
) AS A
INNER JOIN Dummy AS D
  ON A.id - 1 = D.id OR A.id + 1 = D.id;

This returns an error:

ERROR: column dummy.id must appear in the GROUP BY clause or be used in an aggregate function
SQL state: 42803

I'm not sure how to set up that from-select to produce id's without adding id to the group by (which would cause the query to return too many rows). Perhaps a natural join like in Oliveiros' SQL would do the job? Thanks for any advice on either of these solutions. I'm going to learn a lot here if someone can pound it into my head.

Thanks, Steve

It seems to be returning any records that have sequential id's, regardless.

At 11:02 AM 9/26/2008, Richard Broersma wrote:

On Fri, Sep 26, 2008 at 10:39 AM, Steve Midgley [EMAIL PROTECTED] wrote:

drop table if exists dummy;
create table dummy (
  id integer primary key,
  name varchar(255),
  fkey_id integer
);

The system should return:

502163
502164
502170
502171

--first get all of the duplicated ids
SELECT id FROM Dummy GROUP BY name, fkey_id

--Next, from this list, check to see if there are any siblings immediately above or below it.
SELECT A.*
FROM (
  SELECT ID
  FROM Dummy
  GROUP BY name, fkey_id
) AS A
INNER JOIN Dummy AS D
  ON A.id - 1 = D.id OR A.id + 1 = D.id;

--
Regards, Richard Broersma Jr.
Visit the Los Angeles PostgreSQL Users Group (LAPUG) http://pugs.postgresql.org/lapug

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] surrogate vs natural primary keys
At 08:20 AM 9/18/2008, [EMAIL PROTECTED] wrote:

Message-ID: [EMAIL PROTECTED]
Date: Wed, 17 Sep 2008 09:20:44 -0700
From: Richard Broersma [EMAIL PROTECTED]
To: Scott Marlowe [EMAIL PROTECTED]
Subject: Re: surrogate vs natural primary keys
In-Reply-To: [EMAIL PROTECTED]
References: [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED]
X-Archive-Number: 200809/124
X-Sequence-Number: 31576

My opinion is that the database constraints are the last line of defense to ensure business rules and data integrity are not violated. Since I highly value the ability to enforce business rules using ordinary table DDL, I try to use natural keys as often as I can.

Hi Richard, I often find your comments insightful and right on the money. This is another one of those cases. Your comments above are a great example of when natural keys make sense: I hadn't looked at it from this perspective! I'm a middleware developer (the bane of DBA's!) -- and so I generally solve these sorts of business-rule constraints in the middleware code, which of course is prone to all kinds of different problems (like lazy developers who code around the OO validation checkers!). Thanks for giving such a great explanation as to the value of natural keys! You haven't won me over, but you did teach me something - which I appreciate.

Best, Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] surrogate vs natural primary keys
To: pgsql-sql@postgresql.org
From: Seb [EMAIL PROTECTED]
Subject: Re: surrogate vs natural primary keys
Date: Mon, 15 Sep 2008 17:56:31 -0500
Organization: Church of Emacs
Lines: 20
Message-ID: [EMAIL PROTECTED]
References: [EMAIL PROTECTED] [EMAIL PROTECTED]
X-Archive-Number: 200809/101
X-Sequence-Number: 31553

On Mon, 15 Sep 2008 16:45:08 -0600, Scott Marlowe [EMAIL PROTECTED] wrote:

[...] I think this question is a lot like "how large should I set shared_buffers?" There's lots of different answers based on how you are using your data.

Yes, this is precisely what I'm after: *criteria* to help me decide which approach to take for different scenarios. Such guidance is what seems to be lacking from most of the discussions I've seen on the subject. It's hard to distill this information when most of the discussion is centered on advocating one or the other approach.

I think Scott and others have laid out the main ideas in a very cool-headed way already, but here's my follow-on input: I agree with Andrew Sullivan that using industry standard id's as your primary key can be problematic. But I do sometimes apply unique indices to such industry standard columns to ensure they are in fact unique and can be a surrogate for the real integer/serial primary key. As a rule, I have decided to stay away from meaningful (natural) primary keys for these reasons:

1) They sometimes change b/c of business rule changes, forcing technical changes to the relationship model, when only internal table schema changes should be required to support the new business requirements.

2) Generating arbitrary/surrogate keys is easier b/c you can use sequence generators. (When creating a new record, I have to figure out the value of a meaningful column before saving the record, which sometimes I don't want to do!)

3) Surrogate keys are guaranteed unique regardless of the semantic content of the table.

4) All tables can join to each other in the same ways: property.id holds the same data type as contact.id. All id fields are the same in type/format.

I think there's even a reasonable argument for globally unique surrogate keys: all keys for any table use the same sequence of id's. I implemented a system in the 90's that used globally unique id's and it opened up some interesting solutions that I wouldn't have thought of when I started the project (self joins were the same as foreign joins since the id's in both entities were guaranteed unique). I've heard some people argue the use of GUID's for id's but I've been too scared to try that in a real system.

Sequential, arbitrary primary keys (as surrogate keys) are predictable though. So if you share those keys with the public (via URL's for example), then competitors can learn information about your business (how fast keys are generated for a certain table, for example). That's an argument for random, arbitrary primary keys though, not for compound/meaningful keys.

I think natural or compound keys make more sense to DBA's and let you implement some kinds of database solutions more quickly. All in all, I don't really understand the merits of natural keys outside of data warehouse applications. In data warehouses, in my experience, compound natural keys just end up turning into fact tables! :)

In summary: I've never heard someone say they've been bitten by using an arbitrary surrogate key system, but I myself have been bitten and have heard lots of stories of problems when using natural keys.
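A sketch of the globally-unique-ids idea above - a single shared sequence feeding every table's surrogate key (the table names are illustrative):

create sequence global_id_seq;

create table property (
  id integer primary key default nextval('global_id_seq'),
  name varchar(255)
);

create table contact (
  id integer primary key default nextval('global_id_seq'),
  name varchar(255)
);

-- Every id across both tables is drawn from one sequence, so a given id
-- value can only ever belong to one entity.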
I hope this helps some, Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Join question
At 12:20 PM 8/15/2008, [EMAIL PROTECTED] wrote:

Date: Fri, 15 Aug 2008 13:46:14 -0400
From: Edward W. Rouse [EMAIL PROTECTED]
To: pgsql-sql@postgresql.org
Subject: Re: Join question
Message-ID: [EMAIL PROTECTED]

I did try that, but I can't get both the values from table a with no entries in table b and the values from table b with null entries to show up. It's either one or the other.

Edward W. Rouse

You might have luck applying some additional WHERE clause criteria to your full outer join. So if you don't want certain types of NULLs from table b, restrict against them in the WHERE clause? I could be misunderstanding the whole thing though.

Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] DELETE with JOIN
At 10:05 AM 8/7/2008, [EMAIL PROTECTED] wrote:

Date: Thu, 7 Aug 2008 09:14:49 -0700
From: [EMAIL PROTECTED]
To: pgsql-sql@postgresql.org
Subject: DELETE with JOIN
Message-ID: [EMAIL PROTECTED]

I want to delete with a join condition. Google shows this is a common problem, but the only solutions are either for MySQL or they don't work in my situation because there are too many rows selected. I also have to make this work on several databases, including, grrr, Oracle, so non-standard MySQL solutions are doubly aggravating.

DELETE FROM a WHERE a.b_id = b.id AND b.second_id = ?

I have tried to do this before and always found a way, usually

DELETE FROM a WHERE a.b_id IN (SELECT id FROM b WHERE second_id = ?)

but I have too many rows, millions, in the IN crowd, ha ha, and it barfs. EXISTS is no better. At least Oracle barfs, and I haven't got to the others yet. I figured I would go with the worst offender first, and let me tell you, it is offensive. Dang, I wish it were PostgreSQL only! I could write a Dumb Little Test Program (tm) to read in all those IN ids and execute a zillion individual DELETE statements, but it would be slow as puke, and this little delete is going to come up quite often now that I have a test program which needs to generate the junky data and play with it for several days before deleting it and starting over again.

Hi, Have you tried something where you read in all those IN id's and then group them into blocks (of say 1,000 or 10,000 or whatever number works best)? Then execute:

DELETE FROM a WHERE a.b_id in ([static_list_of_ids])

Replacing [static_list_of_ids] in a loop with each block of 1000 id's in a comma-delimited string? I use this technique sometimes in middleware and it works pretty well. There's probably a pure-SQL solution in Pg as well, but this method should work across any SQL platform, which seems like one of your requirements.

Steve

-- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
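To make the batching concrete, a sketch of what the application-generated statements would look like (the id values and the 42 parameter are placeholders):

-- First, the application reads the target ids once:
SELECT id FROM b WHERE second_id = 42;

-- Then it issues one DELETE per block of ~1000 ids:
DELETE FROM a WHERE a.b_id IN (1001, 1002, 1003 /* ... ~1000 ids ... */);
DELETE FROM a WHERE a.b_id IN (2001, 2002, 2004 /* ... next block ... */);

-- Each statement stays small enough to avoid the limits the OP is
-- hitting, and the pattern is portable across Postgres, Oracle, etc.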
Re: [SQL] Problem with ORDER BY and DISTINCT ON
At 03:51 PM 7/31/2008, Tom Lane wrote: Steve Midgley [EMAIL PROTECTED] writes: At 07:29 AM 7/16/2008, Tom Lane wrote: I think what is happening is that ORDER BY knows that and gets rid of the duplicate entries while DISTINCT ON fails to do so. Of course removing the duplicate from both areas is the correct solution and I broke down and hacked that into the auto-sql-writing code and so my immediate problem is solved. I'm happy to file this as a ticket for Pg (please point me to your ticket tool as I've never used it). This is not a very big deal but Pg has such a high compliance with wacky-but-valid SQL it does seem like it should be fixed just because. I've applied a patch for this to CVS HEAD. I doubt we'll try to fix it in the back branches, though --- it's too much of a corner case to be worth taking any risk of breaking other stuff. regards, tom lane Hey Tom, That's really great - thanks. I'm impressed how quickly you are fixing this obscure issue. I came from MS SQL and it would be hard for me to put into words how much of a better job you all are doing on Pg. Best, Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] index for group by
At 09:20 AM 7/22/2008, [EMAIL PROTECTED] wrote: Date: Tue, 22 Jul 2008 13:27:24 +0200 From: A. Kretschmer [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: Re: index for group by Message-ID: [EMAIL PROTECTED]

On Tue, 22 Jul 2008 at 13:18:30 +0200, Patrick Scharrenberg wrote:

Hi, is there a way to speed up GROUP BY queries with an index? In particular if I have a table like this:

  CREATE TABLE data (
    id1 integer,
    id2 integer,
    somedata character varying,
    ts timestamp with time zone
  );

where data is continuously logged about id1 and id2 into somedata, together with the timestamp when it was logged. So I have multiple rows with the same id1 and id2 but different timestamps (and data, maybe). At the moment I have ~40,000,000 rows in that table, so without a WHERE clause, doing

  SELECT id1, id2 FROM data GROUP BY id1, id2;

forces a seq-scan on every select.

First, why are you doing a GROUP BY when you aren't doing an aggregation (like COUNT, SUM, etc)? It seems like you can get way better performance by doing this:

  SELECT DISTINCT ON (id1, id2) id1, id2 FROM data ORDER BY id1, id2

(assuming your compound index is in id1, id2 order). Am I missing something?

A different, more cumbersome idea I have for you (if you really do need a GROUP BY) is to build a warehouse table that precalculates the data you want (see the sketch below). You can build some recurring process that runs every NN minutes or hours and fires off a stored procedure which grabs all the data from this data table, aggregates it and saves it to a warehouse table. You could aggregate against your datetime stamp by N hours or days as well. If this idea is of interest you can write back to the list or off-list to me for more info. Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
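To make both suggestions concrete (the index name and summary table are invented for the sketch):

  -- compound index to support the DISTINCT ON scan
  CREATE INDEX data_id1_id2_idx ON data (id1, id2);

  SELECT DISTINCT ON (id1, id2) id1, id2
  FROM data
  ORDER BY id1, id2;

  -- warehouse-style precalculation, refreshed by a recurring job
  CREATE TABLE data_summary AS
    SELECT id1, id2, count(*) AS row_count, max(ts) AS last_seen
    FROM data
    GROUP BY id1, id2;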
[SQL] Problem with ORDER BY and DISTINCT ON
Hi, I'm a little baffled. I'm trying to generate a SQL statement that issues a DISTINCT ON using the same values as my ORDER BY statement. I'm using a somewhat complex CASE statement in my ORDER BY clause. I'm on Pg 8.2. Here is some SQL to get you started at seeing my problem:

  -- drop table if exists property;
  create table property (
    id serial,
    state varchar(255),
    search_rate_max decimal(8,2),
    data_priority_code varchar(255)
  );

  SELECT DISTINCT ON (property.state,
    CASE WHEN (search_rate_max IS NOT NULL) THEN 1 ELSE 2 END, search_rate_max,
    CASE WHEN (search_rate_max IS NOT NULL) THEN 1 ELSE 2 END, search_rate_max,
    property.id)
  property.id
  FROM property
  WHERE ((property.data_priority_code IS NOT NULL))
  ORDER BY property.state,
    CASE WHEN (search_rate_max IS NOT NULL) THEN 1 ELSE 2 END, search_rate_max,
    CASE WHEN (search_rate_max IS NOT NULL) THEN 1 ELSE 2 END, search_rate_max,
    property.id
  LIMIT 10 OFFSET 0

  RESULTS: ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions SQL state: 42P10

Now if you run this statement it works:

  SELECT DISTINCT ON (property.state, property.id)
  property.id
  FROM property
  WHERE ((property.data_priority_code IS NOT NULL))
  ORDER BY property.state, property.id
  LIMIT 10 OFFSET 0

However if you run this statement it ALSO fails, which tells me it's not the combination of my CASE statements with the primary key that is messing things up (note: in this example I just removed the primary key property.id from the ORDER BY and DISTINCT ON clauses):

  SELECT DISTINCT ON (property.state,
    CASE WHEN (search_rate_max IS NOT NULL) THEN 1 ELSE 2 END, search_rate_max,
    CASE WHEN (search_rate_max IS NOT NULL) THEN 1 ELSE 2 END, search_rate_max)
  property.id
  FROM property
  WHERE ((property.data_priority_code IS NOT NULL))
  ORDER BY property.state,
    CASE WHEN (search_rate_max IS NOT NULL) THEN 1 ELSE 2 END, search_rate_max,
    CASE WHEN (search_rate_max IS NOT NULL) THEN 1 ELSE 2 END, search_rate_max
  LIMIT 10 OFFSET 0

  RESULTS: ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions SQL state: 42P10

Finally, if you run this statement it works fine (removing one of the duplicate search_rate_max statements):

  SELECT DISTINCT ON (property.state,
    CASE WHEN (search_rate_max IS NOT NULL) THEN 1 ELSE 2 END, search_rate_max,
    property.id)
  property.id
  FROM property
  WHERE ((property.data_priority_code IS NOT NULL))
  ORDER BY property.state,
    CASE WHEN (search_rate_max IS NOT NULL) THEN 1 ELSE 2 END, search_rate_max,
    property.id
  LIMIT 10 OFFSET 0

What's going on here? Am I doing something that isn't legitimate SQL? I can't see why having a duplicate CASE statement should foul things up like this. It's pretty clear (from additional testing not included in this email) that the duplicate search_rate_max CASE is causing the problem. Thanks for any advice or suggestions on how to get this to run correctly. Is this a bug?

Basically I'm doing this as an optimization - I can get much better performance running the DISTINCT ON in some circumstances than using DISTINCT, but the edge case above is breaking my tests and preventing me from implementing the idea. The code is generated by an application layer which is not really paying attention to whether or not the two CASE statements apply to the same field or not (sometimes they do, sometimes they don't).. Thanks! Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Copyright and Paper walls (was: Rollback in Postgres)
At 11:59 AM 7/12/2008, [EMAIL PROTECTED] wrote: Date: Sat, 12 Jul 2008 10:20:37 +0100 From: Simon Riggs [EMAIL PROTECTED] To: Dave Page [EMAIL PROTECTED] Cc: Lewis Cunningham [EMAIL PROTECTED], Scott Marlowe [EMAIL PROTECTED], samantha mahindrakar [EMAIL PROTECTED], pgsql-sql@postgresql.org Subject: Re: Rollback in Postgres Message-ID: [EMAIL PROTECTED] On Sat, 2008-07-12 at 09:40 +0100, Dave Page wrote: On Sat, Jul 12, 2008 at 8:56 AM, Simon Riggs [EMAIL PROTECTED] wrote: Please don't put links to copyrighted material on our lists. That's an odd thing to say, given that virtually every link on our lists probably points to material copyrighted in some way. Prudence is all I ask for. We don't need to provide additional advertising for others, nor do we wish to embroil ourselves in accusations over copyright violations. I don't want to pile more wood on the fire, but I think I can see both sides to this. I believe this is not so much copyright violation concern, but if the Pg team releases some cool feature relating to rollbacks down-the-road that is vaguely similar to Oracle's system, reducing the amount of discussion about Oracle's features on this list would reduce Oracle's ability to claim that the feature was a direct appropriation. That said (and IANAL), I think posting links to for-profit and/or copyrighted websites is really important in general for the list. There's a lot of good information out there and I think it's not so great if this list were to limit itself only to public domain and open copyright documentation for consideration. Just two more cents from the peanut gallery on a Saturday afternoon, Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] PERSISTANT PREPARE (another point of view)
At 04:31 PM 7/11/2008, [EMAIL PROTECTED] wrote: Date: Fri, 11 Jul 2008 23:31:03 + From: Milan Oparnica [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: PERSISTANT PREPARE (another point of view) Message-ID: [EMAIL PROTECTED] [snip] What could we gain by introducing a kind of global prepared statement area, is SIMPLICITY of DB DEVELOPMENT AND MAINTENANCE. Here is our point of view: [snip] Now, instead of preparing statements on each connection request (and we use around 900 prepared statements), why couldn't we simply prepare these statements ONCE and keep them in some global storage for future everyday usage. Hi, What's wrong with using complex views, stored procedures, functions and maybe even custom data types to accomplish what you want here? It seems like you could build a lot of prepared statements using these tools, providing your application layer developers with a consistent set of interfaces to obtain data that are not tied to the data tables themselves. And allowing them to insert/update/manage tables via structured interfaces as well. Am I missing something? Best, Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
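For instance (hypothetical table and function names), a server-side SQL function already gives you a named, centrally stored statement that every connection can call without re-preparing it per session:

  CREATE OR REPLACE FUNCTION active_orders(integer)
  RETURNS SETOF orders AS $$
    SELECT * FROM orders WHERE customer_id = $1 AND status = 'active';
  $$ LANGUAGE sql STABLE;

  -- every application connection can simply run:
  SELECT * FROM active_orders(42);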
Re: [SQL] ANSI Standard
At 02:20 AM 6/25/2008, [EMAIL PROTECTED] wrote: Date: Tue, 24 Jun 2008 17:33:11 +0300 From: Pascal Tufenkji [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: ANSI Standard Message-ID: [EMAIL PROTECTED]

Hi, How do I know if a function (or a certain sql syntax) in Postgres is a SQL ANSI Standard, hence it works on all databases such as MySQL, SQL Server, Oracle?

In general, I find that the Pg docs pretty clearly state what is ANSI standard and what isn't within Pg. You can also view the ANSI-92 standard here: http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt

In terms of making sure you're cross platform compatible, I'd say you have to designate a series of platforms (e.g. MySQL 5, Pg 8.3, Oracle X, MS SQL X, etc.) which you will test against and explicitly support. You will find that no matter how tightly you attempt to build your platform against ANSI-92 (or any other std), if you do not regularly test against a set of platforms, your solution will converge on supporting only the platforms you do regularly test against. I hope that helps, Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Tsearch
At 12:20 PM 6/12/2008, [EMAIL PROTECTED] wrote: Date: Thu, 12 Jun 2008 08:47:44 -0400 From: PostgreSQL Admin [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: Tsearch Message-ID: [EMAIL PROTECTED]

This is a small sample of the data:

  short_desc                                              | long_desc
  --------------------------------------------------------+------------------------------------------------------------------------
  CHICKEN,BROILERS OR FRYERS,LEG,MEATSKN,CKD,FRIED,BATTER | Chicken, broilers or fryers, leg, meat and skin, cooked, fried, batter

Is the best method of searching through this data full text search via tsearch, or some other method? I'm running version 8.3. Say I want to search for chicken skin? Thanks for the advice, J

Having previously established myself as a non-expert on TSearch2, I hesitate to post, but I think your issue is data semantics more than search tool. How do you want your system to handle this pseudo-data:

  Chicken broilers, pork skins

Should that return a match for chicken skins or not? If your data are semantically stored such that any record that matches chicken and also matches skin refers to the skins of chickens (the actual meaning you are looking for), then you're going to have a relatively easy time of it. If not, you'll have to consider ways to code your data so that searches behave correctly. You could develop search / coding schemes that say things like find all records with chicken and skin, where the following words do not appear between the words chicken and skin: beef, pork, cow, pig, etc. (see the sketch below). Just some thoughts for you there. Best, Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
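As a quick illustration of the semantics problem (using the tsearch functions built into 8.3): plain AND matching happily matches across unrelated list items:

  SELECT to_tsvector('english', 'Chicken broilers, pork skins')
         @@ to_tsquery('english', 'chicken & skin');
  -- returns true, even though these are pork skins, not chicken skins

So a hit on 'chicken & skin' says nothing about which words belong together in the description - which is why the data coding matters more than the search tool here.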
Re: [SQL] Conceptual Design Question
At 10:52 AM 6/10/2008, [EMAIL PROTECTED] wrote: Date: Tue, 10 Jun 2008 05:05:24 -0700 From: Bryan Emrys [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: Conceptual Design Question Message-ID: [EMAIL PROTECTED]

Hello Everyone, In a text-heavy database, I'm trying to make an initial design decision in the following context. There is a lot of long text that I could break down into three different categories: [snip] The conceptual question is what are the trade-offs between having one textual table compared with multiple text tables? Any help on pointing out practical considerations would be appreciated. Thanks. Bryan

Hi Bryan, Firstly, I might investigate the GiST index and TSearch2 in this regard. I'm not an expert on them, and it may be putting the cart before the horse, but if those tools are applicable and are easier to implement/maintain with one design approach or the other, I might use their design preferences as my guide for picking the right relationships.

Beyond that advice, it does seem to me that a polymorphic relationship (where one table holds multiple entities) *could* describe laws and treaties, though they are kind of different in their relations. Commentaries seem pretty distinct from these two things.

My overall opinion would also depend on the architecture. Will you have a unified middleware/ORM layer that can manage the business rules for the polymorphic data retrieval? Or will developers be going directly into the database to pull items directly? If you have a unified ORM that stores the business rules, you can be more aggressive about using polymorphism, b/c the complexity can be hidden from most developers.

All in all, I think your model is really describing three distinct data entities, and they should be stored in three separate tables, but that's a very high level and uninformed opinion! I'd let TSearch2 drive your design if that's a relevant consideration. Of course TSearch2 is very flexible so it might not really care much about this. :) In general, I find that a data model that looks like the real data is the one that I'm happiest with - the systems I've seen with too much UML optimization and collapsing of sets of data into single tables tend to be harder to maintain, etc.

Just some random opinions for you there. I'm sure others have different perspectives which are equally or more valid! Best, Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] design resource
At 11:20 PM 6/5/2008, [EMAIL PROTECTED] wrote: Date: Thu, 5 Jun 2008 10:14:04 -0400 From: Edward W. Rouse [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: design resource Message-ID: [EMAIL PROTECTED]

I was wondering if there were any resources that have some table designs for common problems. Since that isn't very clear I will give an example. We have an internal app from years back that needs to be updated. One of the problems is that when it was originally created, the company only had US customers. We now have international customers and need to support international addresses and phone numbers. For the phone numbers that means adding a new column for international code or expanding the data field so that it's big enough to hold the international prefix (still not sure which approach is best). But I haven't a clue as to how to set up for international addresses. So I was hoping there would be a resource that I could check where these kinds of data sets have been 'solved' to ease the effort. I have several books on design patterns for programming but I've not seen a design patterns book for common database problems. Thanks.

Hi, In addition to Craig's excellent answer, I'll give an additional nuance. I think that free-form and flexible/re-usable fields are the way to go for handling addresses. However, normalizing country is generally pretty smart (as is normalizing state/admin region within countries where you do a lot of business). This can be generally handled on the front-end with a pull-down menu of choices, but you would probably be happiest enforcing this on the back-end as well - possibly by having a country lookup table (see the DDL sketch below):

  country_id|iso2|iso3|full_name|short_name|full_accents|short_accents...etc

I keep the country names with and without accents to make searching easier across keyboards/locales. I hope this helps too -- I think Craig has given you the lion's share of good advice for sure - and I definitely follow the practices more or less as he laid them out as well. Sincerely, Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
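A sketch of that lookup table as DDL (the column list follows the sample above; the types and lengths are guesses):

  CREATE TABLE country (
    country_id serial PRIMARY KEY,
    iso2 char(2) NOT NULL UNIQUE,   -- ISO 3166-1 alpha-2 code
    iso3 char(3) NOT NULL UNIQUE,   -- ISO 3166-1 alpha-3 code
    full_name text NOT NULL,
    short_name text,
    full_accents text,              -- accented variants kept for cross-locale searching
    short_accents text
  );

The address table then carries a country_id foreign key, while street, city and postal code stay free-form.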
Re: [SQL] Extremely Low performance with ODBC
At 09:20 PM 5/27/2008, [EMAIL PROTECTED] wrote: Date: Tue, 27 May 2008 09:29:56 -0700 From: Richard Broersma [EMAIL PROTECTED] To: Sebastian Rychter [EMAIL PROTECTED] Cc: pgsql-sql@postgresql.org Subject: Re: Extremely Low performance with ODBC Message-ID: [EMAIL PROTECTED]

On Mon, May 26, 2008 at 9:34 PM, Sebastian Rychter [EMAIL PROTECTED] wrote: I'll keep on looking for any other differences on different logs.. I think I already searched all the Postgresql forums I know and no one ever reported something like that before.

I've been following this discussion with interest - years ago I was responsible for a high performance web system that used ODBC in the data layer (not quite an oxymoron there). But I haven't touched ODBC since 2001! (Be warned)..

With performance times that slow it seems like there are a couple of possibilities that I can think of. To narrow down the choices: how many rows does your query return? If it's returning a lot, try running the exact same query but with a limit 10 on the end - does that take just as long to run or does it come back quickly?

Here are some other ideas:

1) The Pg ODBC driver is broken and is looping internally in some wrong way when constructing your result set. This seems a little unlikely but totally possible. The groups that Richard recommends can probably help on this end.

2) Your ODBC DSN is misconfigured. This seems more likely to me. You may already be an expert with ODBC and have it correctly configured, in which case disregard this. But the Pg ODBC interface has a lot of unique options and any one or combination could be spiking your results speed. Have you played with the various config options in the advanced window of odbcad32.exe when creating a data source with a Pg ODBC driver? For example, try toggling Keyset query optimization or Use Declare/Fetch - also, does the cache size setting impact your query speed at all? There are a number of checkboxes here that seem like they might affect your situation. Be sure all the logging is turned off when testing performance. Also, updatable cursors is turned on by default (on my copy of Pg ODBC) - that seems wrong, and you might try your search with it turned off. Also, server side prepare is turned off on my copy, and this also seems wrong - any diff with it turned on?

Another idea: when you run the query and it takes a long time, is the CPU spiked on your client machine, on the SQL server, or neither (if they're on the same machine, look in Task Manager to see which process is eating CPU)? If it's not spiked anywhere, then it's possible that your problem is with a network socket timeout or something similar. Maybe use Wireshark or something to watch your network traffic to dig deeper if this seems possible.

Others may have a more clear idea as to what these various ODBC options are for, but in my distant memories, fiddling with ODBC settings when doing performance tuning can make a big, big difference in how an application performs. My whole goal in this regard was to find the settings in ODBC that caused ODBC to do as absolutely little as possible: just receive raw SQL, pipe it to my server and hand it back to me in a memory data structure. All the data structure parsing was done by my middleware connection wrapper to ODBC and none was done by ODBC itself. Once I got to that point, I was able to get some decent performance out of ODBC. I hope some of these ideas help! Feel free to write back on or off list.
Sincerely, Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Extremely Low performance with ODBC
At 09:20 PM 5/27/2008, [EMAIL PROTECTED] wrote: Date: Tue, 27 May 2008 09:29:56 -0700 From: Richard Broersma [EMAIL PROTECTED] To: Sebastian Rychter [EMAIL PROTECTED] Cc: pgsql-sql@postgresql.org Subject: Re: Extremely Low performance with ODBC Message-ID: [EMAIL PROTECTED] On Mon, May 26, 2008 at 9:34 PM, Sebastian Rychter [EMAIL PROTECTED] wrote: I'll keep on looking for any other differences on different logs.. I think I already searched all the Postgresql forums I know and no one ever reported something like that before. Apologies for the double post on this - I wonder also if you have tried alternative Pg ODBC drivers? I found this one while googling around: http://uda.openlinksw.com/odbc/st/odbc-progress-st/ But I'd guess there are other alternatives. Also, you might check out their ADO driver and see if it gives you the same problems.. Just some more grist for the solution mill, hopefully! Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] export CSV file through Java JDBC
At 07:20 AM 4/15/2008, you wrote: Date: Mon, 14 Apr 2008 09:41:41 -0400 From: Emi Lu [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: export CSV file through Java JDBC Message-ID: [EMAIL PROTECTED]

Good morning, Running the following command from the command line is ok, but I cannot export a table into a csv file through Java JDBC code. Please help!

JAVA code:
===
  public static void exec(String command) {
    try {
      Process p = Runtime.getRuntime().exec(command);
      p.waitFor();
      p.destroy();
    } catch(Exception e) {
      System.err.println("exec command Error: " + e.getMessage());
    }
  }

SQL Command:
=
  psql -U username -d dbName -c "\copy tableName to 'result.csv' with CSV"

When exec(command) is called, nothing happens - result.csv was not created at all? Thanks a lot!

A couple of thoughts. First, you aren't passing the password in, so that seems like a problem. Of course, psql won't let you specify a password on the command line, but last I looked you can set an ENV var before running psql: PGPASSWORD=[your password here]

Second, you don't specify a server/port, which means your Pg server is localhost:5432?

Third, you are not specifying a path to psql, so you have to be sure that it can be found in the path. Now this can be tricky: your Java application may be running in a context DIFFERENT from your command prompt. The user/env your Java app is running in will determine what path vars are available to it - it may not be able to find psql. Try running

  which psql > /tmp/which.txt

in your code above and see what happens (assuming you're on a box with which installed).

Fourth (minor), you don't specify column names in your export, which could result in variable results depending on the create statement - it's better to specify, to guarantee the same results every time.

Fifth, try capturing STDERR and STDOUT, so that if psql or the command shell generate errors you'll know what they are. Maybe Java gives you that in e.getMessage or maybe you need to put it in your psql command line.

I'm doing exactly the same thing you are doing but in Ruby/ActiveRecord, so I know this works. It works for me on Windows and Linux, fwiw. I don't know enough Java to know if the command you are running is the standard shell execute command in Java. If it's not, that's what you want, so change your code that way. You just want Java to shell out to the OS command processor. Be sure when you set your command shell env var that this env var persists long enough so that when you run your psql command it's still in effect. For example, this pseudo code might not work b/c two different child shells are run:

  system.exec("export PGPASSWORD=pass1234");
  system.exec("psql my command here");

I think you want something more like this pseudo code:

  system.set_environment("PGPASSWORD") = "pass1234";
  system.exec("psql my command here");

I hope this helps, Steve
Re: [SQL] Create on insert a unique random number
At 06:47 AM 3/19/2008, D'Arcy J.M. Cain wrote: But your suggestion was to base this key on the serial primary key so where is your index collision protection? You are going to get collisions on both the serial key and, to a lesser extent, your generated one. Besides, has anyone ever demonstrated a real issue with lookups using serial primary keys? I think you are trying to second guess the database engine with this and I don't think that that is a great idea.

Hi D'Arcy, I'm not following this line. Maybe we're talking about two different things here.. I don't know if Lance is using CRUD methodology per se, but that's a well accepted web approach and uses (generally) serial primary keys in the URL structure as (where numbers are serial pk's):

  [website]/contact/12345
  [website]/property/45678

[and the client sends GET, POST, PUT, DELETE http requests, or mimics, to activate various functions]

Whether CRUD or otherwise, in the model I was promoting, there would be two index columns in the table along with other data, a public index and a serial primary key. The public index is based on the primary key:

  pk | public_pk
  1  | md5(1 + fixed salt)
  2  | md5(2 + fixed salt)
  ...

For practical purposes, an MD5 hash generates a unique output for every unique input (collisions are theoretically possible but vanishingly unlikely for inputs like these), so the serial key and fixed salt would effectively guarantee no hash index collisions on the MD5 output. Of course if a competitor knows you're using MD5 and they know your salt, they could calculate all the md5 integer hashes and see which ones exist.. But I could care less if he uses md5 or sha-1 or GUIDs! (I just picked MD5 because another poster recommended it and it's very easy to implement in Pg). The point I care about is that there would be a public_pk that associates to one-and-only-one serial pk. Also that public_pk should be 1) not easily guessable, 2) non-clustering (and therefore non-serial). Then his url's would look something like:

  [website]/contact/c4ca4238a0b923820dcc509a6f75849b
  [website]/property/c81e728d9d4c2f636f067f89cc14862c

Earlier I wrote: The issue is about creating an index into a sparse hash so that each record is somewhat randomly located in a sparse hash index space. (One valid reason to do this would be if you wanted to hide the total number of records in your table from competitors or customers.) (Just for reference of my view on the problem: http://en.wikipedia.org/wiki/Hash_table)

To which D'Arcy replied: If that is your goal then start your serial at something other than 1. Start at 1,000,000 for example and your first user will think that you already have one million clients. Actually, he will think that you started elsewhere than 1 but he won't know where.

The original post did not want users to be able to type in random integers like:

  /contact/343

And find out if that record #343 exists or not (regardless of whether they can get access to the record - the error generated on no-authorization may be different from record-not-found). So starting at a million does not fix the OP's issue. From my perspective, wherever you start your serial index, competitors can watch it grow over time, if it's a numeric serial. That could be more valuable in many businesses than knowing the initial size of the table.

Anyway, I hope that clears up what I was recommending! I didn't anticipate it would stir up this much analysis and I hope the OP finds your input and mine useful in coming up with a final answer to his issue. Thanks for taking the time to consider the issue and I'll look forward to any additional ideas or comments you have on this too!
Sincerely, Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Create on insert a unique random number
At 11:58 AM 3/18/2008, [EMAIL PROTECTED] wrote: Date: Tue, 18 Mar 2008 13:40:42 -0500 From: Campbell, Lance [EMAIL PROTECTED] To: Vivek Khera [EMAIL PROTECTED], pgsql-sql@postgresql.org Subject: Re: Create on insert a unique random number Message-ID: [EMAIL PROTECTED]

Thanks for all of your input. It appears that the best way to do this is to create a default random number in the primary id field in the table definition and then return that value after insert. If an exception occurs because of duplicates I will simply perform the same insert statement again. I doubt there would be many duplicate hits if I use a really large number. [snip] I built a web application called the Form Builder. It allows individuals to create web forms. After a user is done building their web form the tool provides a URL for the user to access the form. Obviously the URL has the random ID of the form in it. Most of the forms created with this tool can be accessed and filled out by the general public. [snip]

Hi Lance, I think I get you as a fellow web systems (aka middleware) guy. My opinion is that the use of a sparse index is totally reasonable for the purpose you describe. But I would argue that you could take it a little further in implementation that might keep your db design sane while still giving you the sparse index function on the front-end.

1) Create a second field (as someone recommended on this list) that is an MD5 of your primary key. Use that as your accessor index from the web application, but keep the primary key as an integer serial, so that it works as expected and you can build relations normally. I think in the end you'll be happier with this method than messing around with a custom primary key system.. You can build a trigger that generates the MD5 hash every time a record is created (or you can do it in your ORM layer in the web app). A sketch of such a trigger follows below.

2) Also (but OT), put a monitor on your weblogs to look for 404 errors (page not found, for the sql-only people here). This will supplement your sparse index by detecting people who are scanning your sparse index space and generating lots of misses.

Hope that helps, Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
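A sketch of that trigger approach (the table name, public_pk column and salt literal are all placeholders):

  CREATE OR REPLACE FUNCTION set_public_pk() RETURNS trigger AS $$
  BEGIN
    -- NEW.id is already assigned from the serial default by the time
    -- a BEFORE INSERT row trigger fires, so we can hash it here
    NEW.public_pk := md5(NEW.id::text || 'some-fixed-salt');
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER forms_set_public_pk
    BEFORE INSERT ON forms
    FOR EACH ROW EXECUTE PROCEDURE set_public_pk();

The web application then looks records up by public_pk (indexed, of course), while all internal joins keep using the plain serial id.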
Re: [SQL] Create on insert a unique random number
At 12:36 PM 3/18/2008, D'Arcy J.M. Cain wrote: On Tue, 18 Mar 2008 12:23:35 -0700 Steve Midgley [EMAIL PROTECTED] wrote: 1) Create a second field (as someone recommended on this list) that is an MD5 of your primary key. Use that as your accessor index from the web

I strongly disagree for three reasons. First, if you are going to generate a key then don't store it. Just generate it every time. Second, don't generate it based on a known field. You may think that it is secure but what if your private key is compromised? Do you then change everyone's security code? Third, what if one person's code is compromised? If it is based on a calculation then you can't change that one person's security code. Generate a random number and store that. You will be much happier when something goes wrong and something always goes wrong. -- D'Arcy J.M. Cain [EMAIL PROTECTED] | Democracy is three wolves

Hi D'Arcy, I'm not clear on your concern here - an MD5 hash doesn't have a private key that can be compromised, afaik. It's a one-way hash. I don't see much difference between making an MD5 of the primary key and generating a random number for the public primary key, except that you shouldn't get index collisions with the MD5 method (whereas eventually you will with a random number, though of course using a GUID would eliminate that concern for practical purposes).

The issue raised by the OP, I believe, is not about security of the primary key # itself or its ability to provide unauthorized access to the underlying records. The system in question protects its records from unauthorized access already. The issue is about creating an index into a sparse hash so that each record is somewhat randomly located in a sparse hash index space. (One valid reason to do this would be if you wanted to hide the total number of records in your table from competitors or customers). (Just for reference of my view on the problem: http://en.wikipedia.org/wiki/Hash_table)

Whether SHA-1 or MD5, I think the point is that if you don't care about speed in generating the hash index (which the OP doesn't, apparently), hash indexing via a cryptographic hash function will ensure that the hash index is relatively free of clustering - which, as I understand it, is the point of this exercise. A cryptographic hash as a hash index generator is imperfect for sure, as the Wikipedia article goes at length to discuss, but from my perspective it does the job - at least as far as the OP describes it (or I understood it!). [smile] I may be way off here of course, and I appreciate the input - any thoughts? Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Insert problem
At 09:20 AM 3/10/2008, [EMAIL PROTECTED] wrote: Date: Mon, 10 Mar 2008 00:14:12 + From: Jamie Tufnell [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: Re: Insert problem Message-ID: [EMAIL PROTECTED] [snip] table defination create sequence schItem_item_seq create table schItem (scid int NOT NULL references schedule ON DELETE CASCADE, item int NOT NULL default nextval('schItem_item_seq'), [snip] It looks like there's already a row where scid=2072 and item=nextval('schItem_item_seq'). Try: SELECT setval('schItem_item_seq', (SELECT max(item)+1 FROM schItem)); And then run your query again. Cheers, J. A friendly amendment to Jamie's (correct) advice. Be sure that no other tools are obtaining id's from the sequence or inserting rows into schItem when you run this.. (Safest way is to have the db offline when running this). I got some good info from the experts here about this while back and wrote up a little blog article detailing the issue (in short it's a big pain to do it while the db is online, with no obvious sure-fire solution): http://www.misuse.org/science/2007/08/07/obtaining-a-block-of-ids-from-a-sequence-in-postgresql/ I hope this helps a little! Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
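One small nuance on the fix above (the advice is right; this is just a footnote): setval(seq, n) makes the next nextval() return n+1, so the max(item)+1 form leaves a one-value gap. Harmless, but the minimal version would be:

  SELECT setval('schItem_item_seq', (SELECT max(item) FROM schItem));
  -- the next nextval() now returns max(item) + 1

If the table could be empty, max(item) is NULL and setval will complain; in that case fall back to the three-argument form setval('schItem_item_seq', 1, false), which makes the first nextval() return 1.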
Re: [SQL] works but ...
At 02:49 PM 3/10/2008, A. R. Van Hook wrote: The following code seems to work but it leads to the following question(s): Is there a sequence for each scid,item or is there one sequence that must be reset when changing scid?

  $cmd = "select setval('schItem_item_seq',
            (select max(item)+1 from schItem where scid=$newScid))";
  $result = $conn->exec($cmd);
  $OK += cmp_eq($conn, PGRES_TUPLES_OK, $result->resultStatus);

  $cmd = "insert into schItem (scid, value, iflag, outalts, sidate, istid)
          select $newScid, i.value, i.iflag, i.outalts, i.sidate, i.istid
          from schItem i, schItem s
          where (i.scid=$wrongScid and i.item = $searchItem)
            and (s.scid=$newScid)
          group by i.value, i.iflag, i.outalts, i.sidate, i.istid, i.scid";
  $result = $conn->exec($cmd);
  $OK += cmp_eq($conn, PGRES_COMMAND_OK, $result->resultStatus);

  $cmd = "delete from schItem where scid = $wrongScid and item = $searchItem";

ps the full perl script is attached. thanks again, art -- Arthur R. Van Hook

Hi Arthur, I'm not totally clear on your question, but generally speaking there is one sequence per serial (or nextval-defaulted) column - so here a single sequence for the whole schItem table, not one per scid. If you reset it, the table will start issuing item values from that new number for every scid. Another way to be safer is to add a margin (say +5) when you reset the sequence, so that even if a few inserts slip in, you're still ahead of the game.. Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] using copy from in function
At 03:20 AM 3/5/2008, [EMAIL PROTECTED] wrote: Date: Wed, 5 Mar 2008 01:51:19 +0300 From: Yura Gal [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: using copy from in function Message-ID: [EMAIL PROTECTED]

I'm trying to automate importing data using COPY FROM. For this purpose I wrote a plpgsql function. As my postgres works on Windows Vista I need to use E'' syntax for the path-to-file. This syntax works fine in SQL queries like: COPY table FROM E'path_to_file_with_double_backslashes'; [snip] _file := $$c:\folder1\folder2\$$ || _chrom || '.txt';

Hi, I'm not sure if this is related, but I have had terrible trouble using \ marks for paths in WinXP.. I have found, surprisingly, that / works and doesn't cause any parsing problems. Also, I believe that if you use syntax like:

  '/folder1/folder2/' || _chrom || '.txt'

(i.e. leaving off the c: part too), you may find that everything just works a little cleaner, with fewer unexpected surprises (see the sketch below). Like I said, I don't know if this is your issue (and Vista), but it's been my experience with WinXP and file paths in Postgresql. Best, Steve -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://mail.postgresql.org/mj/mj_wwwusr?domain=postgresql.orgextra=pgsql-sql
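For example, this style has worked cleanly for me on WinXP (the path and table names are made up):

  COPY mytable FROM 'c:/folder1/folder2/chrom1.txt';

  -- or building the path inside plpgsql, with forward slashes and no escapes needed:
  _file := '/folder1/folder2/' || _chrom || '.txt';
  EXECUTE 'COPY mytable FROM ' || quote_literal(_file);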
Re: [SQL] Usage of UUID with 8.3 (Windows)
At 09:20 AM 2/12/2008, [EMAIL PROTECTED] wrote: Date: Mon, 11 Feb 2008 11:56:33 -0500 From: Tom Lane [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: pgsql-sql@postgresql.org Subject: Re: Usage of UUID with 8.3 (Windows) Message-ID: [EMAIL PROTECTED] [EMAIL PROTECTED] writes: Now, what do I have to do in order to generate a valid UUID (or the 5 different versions as implemented by the RFC) under Windows? Figure out how to build ossp-uuid on Windows ... regards, tom lane You can review this thread too. MS provides native WinAPI tools and language wrappers for this job: http://answers.google.com/answers/threadview?id=553194 Here's the specific API call, which could possibly be wrapped inside PG - however that raises the problem Tom notes about boatloads of platform specific code -- I'd guess that a custom written stored procedure is the way to go? Not sure how to make WinAPI calls from PG - perhaps someone on list has experience or references for that: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/rpc/rpc/uuidcreate.asp HTH, Steve ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[SQL] Columns view? (Finding column names for a table)
Hi, I see this documentation item but can't figure out how to use it: http://www.postgresql.org/docs/8.2/interactive/infoschema-columns.html "The view columns contains information about all table columns (or view columns) in the database." However, if I execute select columns; I get a not found error. I'm sure there's some simple explanation - I'm interested specifically in listing the column names of a specific table (in Pg 8.2). The manual SQL I've developed is (where [table_name] is the table I want columns for):

  select pg_attribute.attname, *
  from pg_attribute
  join pg_class on pg_class.oid = pg_attribute.attrelid
  where pg_class.relname = '[table_name]'
    and attnum > 0
    and atttypid > 0

Not pretty but seems to work. Of course if there were a view that encapsulated this and future-proofed it, that'd be much nicer. Any assistance is appreciated! Thanks, Steve ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
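The answer, for the archives: the columns view lives in the information_schema schema, so it needs to be schema-qualified (or information_schema added to the search_path). For example (table name substituted for your own):

  SELECT column_name, data_type
  FROM information_schema.columns
  WHERE table_name = 'my_table'
  ORDER BY ordinal_position;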
Re: [SQL] accounting schema
At 05:09 PM 2/6/2008, [EMAIL PROTECTED] wrote: Date: Wed, 6 Feb 2008 17:08:54 -0800 From: Medi Montaseri [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: accounting schema Message-ID: [EMAIL PROTECTED]

Hi, I am learning my way into Accounting and was wondering how Accounting applications are designed. Perhaps you could point the way. On one hand, accountants talk about a sacred equation A = L + OE (Asset = Liability + Owner Equity), and then under each category there are one or many accounts. On the other hand a DBA thinks in terms of tables and relations. Instead of getting theoretical, allow me to set up an example

Hi Medi, You might read some source code and docs for open source accounting software and see how it's done. Here's one example that might be interesting and simple enough to follow: http://www.gnucash.org/

In general, I think there are many different accounting methods, so you have to get clear about which one you're using. Double-entry accounting is common. Cash vs. accrual posting methods matter (i.e. when does an expense or receivable charge against the assets balance?)

My most basic understanding is that in general you track assets as they come in, to an Accounts Receivable ledger (i.e. table), and liabilities to an Accounts Payable ledger. Then you reconcile these two books into a General Ledger table which gives you something like an audit trail of all activity (and a running balance). I'm sure Wikipedia will define these three terms and lots more with greater clarity. But my (limited) experience with accounting schemas is that they often involve these three tables (AR/AP/GL) at their core. As you add bank accounts, complex investment instruments, depreciation etc, things get considerably more complex of course.

I'll readily admit my limited experience, and I'm sure others on this list have far better information. I hope this gets you started anyway. Sincerely, Steve ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [SQL] accounting schema
At 06:54 PM 2/6/2008, Medi Montaseri wrote: Thanks Steve... And finally you mentioned that bank accounts are tricky...can you expand on this please. After all I am under the impression that bank accounts are a corner stone of this whole book keeping...I mean...bank accounts have debits and credits just like any account...eg interest earned is a credit and bank fees are debits...what worries you about bank accounts... Thanks guys...this is very very nice Medi Hi Medi, We may be well off-topic for Pgsql but it is a modeling issue so I'll reply on-list.. Consider this scenario: You want to track your AP/AR accounts by business unit - so you have a Marketing account, Sales account, Engineering account, etc. But let's say you want to keep all the money for all the units in only two bank accounts depending on which region the expenses will be paid out from (to reduce processing fees). So you might pay for things accrued by Engineering and Marketing on the West Coast from one bank account and Engineering and Marketing expenses on the East Coast from the other (a US-centric example). So the bank accounts where cash is actually deposited/withdrawn is different from where the money is received or spent from a logical perspective in the AR/AP ledgers. This is a simple example and real-world issues become truly horrendous. (For example, try to account for investment instruments like long term property holdings with lines of credit against the real estate). I've always thought that if DBA's existed when Accounting was invented, things would look very, very different. :) Good luck, Steve ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
[SQL] Negative numbers for PK/ID's?
Hi, A while ago on a different SQL platform, I had the idea to use negative numbers as id's for certain system records that I didn't prefer to have interspersed with other records in a table. (For example, we had template records which we used to spawn new records, and rather than store them in a different table we just had records id=-1,-2 etc and did an insert/select to create new live rows based on the templates). Preventing them from displaying in the results was as simple as ensuring that all select statements had an id > 0 condition in them (a sketch of the pattern follows below)..

I'm wondering if there are any Bad Things that happen if I use negative integers for primary key values in Postgres (v8.2)? My primary keys are all bigserial type. Any comments or advice? Thanks! Steve ---(end of broadcast)--- TIP 6: explain analyze is your friend
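To illustrate the pattern in question (the table and columns are hypothetical):

  -- template row parked at a negative id
  INSERT INTO widgets (id, name, config) VALUES (-1, 'template', 'defaults...');

  -- spawn a live row from the template; the bigserial default assigns a positive id
  INSERT INTO widgets (name, config)
    SELECT name, config FROM widgets WHERE id = -1;

  -- normal queries exclude the templates
  SELECT * FROM widgets WHERE id > 0;

As far as I know, nothing in Postgres objects to negative values in a bigserial column - the column is just a bigint with a nextval() default, and the sequence itself only hands out positive numbers unless told otherwise.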
[SQL] Re: Proposed archival read only trigger on rows - prevent history modification
At 07:50 PM 1/29/2008, [EMAIL PROTECTED] wrote: Date: Mon, 28 Jan 2008 20:16:35 -0800 From: Bryce Nesbitt [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: Proposed archival read only trigger on rows - prevent history modification [snip] I'm considering building a protective mechanism, and am seeking feedback on the idea. The approach would be to add a new column named ro to each table at invoice level and below. Then have a trigger on 'ro'==true deny the write, and probably raise a huge stink. As invoice are mailed each month, all the supporting data would be set to ro true. [snip] Hi Bryce, I have a similar situation but a little in reverse. I have many sets of *incoming* records, which I want to preserve, though at any one time there is only one live version of the incoming records. Sometimes I have to read and compare versions of the records, live or otherwise. The logical records I'm talking about occupy a number of tables that are joined together in the database itself. My solution, which required a little middleware engineering, was to create two tables for each table that had multiple versions of records. So, I have a property table and a property_versions table that have identical table structures. I have a column common to every such versioned set of tables called import_group_id. For live tables, this just tells me which version from the *_versions table is currently being used for that row. The live tables have primary keys just like normal (id as a serial int field). The versions tables' primary keys are different, compound keyed off id and import_group_id. This permits normalization but also allows multiple versions of the same records. In your case, I'd say you could archive your data table to a data_versions table. You might archive periodically and leave the records on the live table (but knowing that audit versions are safely tucked away and easily accessible/comparable), or you might migrate the records off the live table onto the versions table (insert followed by a delete in a transaction). If you adopted the latter method, you could union the two tables to get a complete set of rows. (Via a view even? Not sure if you can create a view on a union but it seems likely you can..) You could also choose (like I did) to store multiple versions of the records, if your data are slowly changing rather than completely unchanging. However it sounds like your situation is such that you want to ensure old records are not modified ever. If you set up triggers and/or permission on the data_versions table, no one can ever delete or modify anything there. This might work perfectly for your purposes. Data in the live table can be edited as you like but changes to the data_versions table is simply not permitted. Inserts are allowed to data_versions but nothing else.. I hope this idea is useful. If I haven't explained it well, drop me a line and I'll try to clarify. Good luck with the project! Steve ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
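A sketch of that union idea as a view (table and column names invented to match the description - and yes, a view over a UNION works fine):

  CREATE VIEW all_data AS
    SELECT id, import_group_id, payload FROM data            -- live rows
    UNION ALL
    SELECT id, import_group_id, payload FROM data_versions;  -- archived rows

In the migrate-then-delete variant the two sets are disjoint by construction, so UNION ALL is the right choice: skipping the duplicate-elimination step keeps the view cheap.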
Re: [SQL] improvements to query with hierarchical elements
At 07:24 PM 1/22/2008, you wrote: Hi all, I have created a little test database to help illustrate my situation. CREATE TABLE categories ( id integer NOT NULL, name character varying(255) NOT NULL, description character varying(255), vocabulary_id integer, derived boolean ); CREATE TABLE category_descendants ( id integer NOT NULL, ancestor_id integer, descendant_id integer, distance integer, derived boolean ); CREATE TABLE category_links ( id integer NOT NULL, parent_id integer, child_id integer, derived boolean ); [snip..] As stated in my last post, any help you can give on how to improve queries of this type would be very much appreciated. Thanks! Ryan Hi Ryan, I've been toying with your sample data for a bit and I apologize but your query has me baffled. Not that it's wrong - it actually looks very sophisticated, but it seems super complex to me - kind of like how I usually feel reading perl.. :) I'm sure real sql-heads would get it right away but I'm not able to. If you're looking to optimize the use-case you provided in your first email, the best thing I can suggest from what I understand would make an assumption: Are the data in your tables are slowly changing? So could you build some analytic/pre-calculated data into these tables or related supporting ones to guide your searches/queries? For example, if you want to find only records which are immediate children of other records, why not make a table which stores just that information? Your current tables are fully hierarchical which is great, but you want to look things up quickly based on a specific relationship: records who are direct children of a particular record.. So if you made a calculated table that stores this information, you could keep it up to date either by running the calculation script periodically or by attaching updates to relevant triggers / rules. I'm sorry I'm not able to get into the SQL / example you sent further. I got lost in the code, which I'm a little embarrassed to admit but there you are. If you're interested in this idea of precalculating values to optimize your search, I'd be happy to discuss further. Also, Ralph Kimball's Data Warehousing books are excellent on this subject (one of the few authors who truly changed the way I think about data). Steve ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
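A sketch of the precalculation idea using the tables from your test database (the distance = 1 assumption is mine - adjust if category_links is already authoritative for immediate children):

  CREATE TABLE category_immediate_children AS
    SELECT ancestor_id AS parent_id, descendant_id AS child_id
    FROM category_descendants
    WHERE distance = 1;

  CREATE INDEX cic_parent_idx ON category_immediate_children (parent_id);

Refresh it from a periodic script, or keep it current with triggers on category_links, and the immediate-children half of your query becomes a single indexed lookup.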
Re: [SQL] improvements to query with hierarchical elements
Date: Sun, 20 Jan 2008 20:01:08 -0800 From: Ryan Wallace [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: improvements to query with hierarchical elements Message-ID: [EMAIL PROTECTED] Greetings, I have a complex query which I am trying to figure out the most efficient way of performing. My database is laid out as follows: items -have_many- events -have_many- event_locations -have_many- locations also rows in the location_links table link two locations together in a parent-child relationship and rows in the location_descendants table provide a full list of the descendants of a particular location. I am trying to find all locations which both are direct children of a given parent location, and are associated with at least one item in a constrained subset of items. (eg. Find all states of the USA in which at least one wooden axe was made. Also find the number of wooden axes made in each state.) I have developed the following query: SELECT locations.*, location_ids.item_count AS item_count FROMlocations JOIN (SELECT immediate_descendants.ancestor_id AS id, COUNT(DISTINCT creation_events.item_id) AS item_count FROMevent_locations JOIN (SELECT * FROMlocation_descendants WHERE ancestor_id IN (SELECT child_id FROMlocation_links WHERE parent_id = *note 1* ) ) AS immediate_descendants ON event_locations.location_id = immediate_descendants.descendant_id JOIN (SELECT * FROMevents WHERE item_id IN (*note 2*) AND association = 'creation' ) AS creation_events ON event_locations.event_id = creation_events.id GROUP BY immediate_descendants.ancestor_id ) AS location_ids ON locations.id = location_ids.id *note 1* - the id of the parent location. *note 2* - the query which returns a list of constrained item ids This works but I am looking for any way to improve the performance of the query (including changing the layout of the tables). Any ideas, suggestions or general pointers would be greatly appreciated. Thanks very much, Ryan Hi Ryan, I have built some similar queries so I might be able to help you. But it's a little hard (for me) to dig into your query without a test set. Could you please post some create table and insert statements to give us a little test bed to run your query in? I realize that may be a fair bit of work for you but it would help me to give you some ideas. Without seeing a more formal schema and being able to toy with it, I'm not sure I can give good advice. Others may have different opinions which I would welcome. Sincerely, Steve ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [SQL] UTF8 encoding and non-text data types
On Jan 13, 2008 8:51 PM, Steve Midgley [EMAIL PROTECTED] wrote:

At 02:22 PM 1/13/2008, [EMAIL PROTECTED] wrote: Date: Sat, 12 Jan 2008 14:21:00 -0800 From: Medi Montaseri [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: UTF8 encoding and non-text data types Message-ID: [EMAIL PROTECTED]

I understand PG supports UTF-8 encoding and I have successfully inserted Unicode text into columns. I was wondering about other data types such as numbers, decimal, dates. That is, say I have a table t1 with create table t1 { name text, cost decimal }. I can insert UTF8 text datatypes into this table with no problem, but if my application attempts to insert numbers encoded in UTF8, then I get a wrong datatype error. Is the solution for the application layer (not the database) to convert the non-text UTF8 numbers to ASCII and then insert them into the database? Thanks Medi

Hi Medi, I have only limited experience in this area, but it sounds like you're sending your numbers as strings? In your example: create table t1 { name text, cost decimal }; insert into t1 (name, cost) values ('name1', '1'); I can't think of how else you're sending numeric values as UTF8? I know that Pg will accept numbers as strings and convert internally (that has worked for me in some object relational environments where I don't choose to cope with data types), but I think it would be better if you simply didn't send your numeric data in quotations, whether as UTF8 or ASCII. If you don't have control over this layer (that quotes your values), then I'd say converting to ASCII would solve the problem. But better to convert to numeric and not ship quoted strings at all. I may be totally off-base and missing something fundamental and I'm very open to correction (by anyone), but that's what I can see here. Best regards, Steve

At 11:01 AM 1/14/2008, Medi Montaseri wrote: Thanks Steve, Actually I do not insert text data into my numeric field. As I mentioned, given create table t1 { name text, cost decimal }, I would like to insert numeric data into column cost because then I can later benefit from numerical operators like SUM, AVG, etc. More specifically, I am using HTML, Perl and PG. So from the HTML point of view a textfield is just some strings. So my user would enter 12345 but expressed in UTF8. Perl would get this and use DBI to insert it into PG. What I am experiencing now is DB errors saying that I am trying to insert incorrect data into column cost, which is numeric, while the data is coming in from HTML in UTF8. Maybe I have to convert it to ASCII numbers in Perl before inserting them into PG. Thanks Medi

Hi Medi, I agree that you should convert your values in Perl before handing to DBI. I'm not familiar with DBI but presumably if you're sending it UTF8 values it's attempting to quote them or do something with them that a numeric field in Pg can't handle. Can you trap/monitor the exact sql statement that is generated by DBI and sent to Pg? That would help a lot in knowing what it is doing, but I suspect if you just convert your numbers from the HTML/UTF8 source values into actual Perl numeric values and then ship to DBI you'll be better off. And you'll get some input validation for free. I hope this helps, Steve
Re: [SQL] UTF8 encoding and non-text data types
At 12:43 PM 1/14/2008, Medi Montaseri wrote: Here are my traces from the Perl CGI code. I'll include two samples, one in ASCII and one in UTF8, so we know what to expect. Here is the actual SQL statement being executed in Perl and DBI. I do not quote the numerical value; it is provided to DBI raw.

  insert into t1 (c1, cost) values ('tewt', 1234)

This works fine.

  insert into t1 (c1, cost) values ('&#1588;&#1583;', &#1777;&#1778;&#1779;&#1780;)

  DBD::Pg::db do failed: ERROR: syntax error at or near ; at character 59

And the PG log itself is very similar and says: ERROR: syntax error at or near ; at character 59. Character 59, by the way, is the first occurrence of a semi-colon (as in '&#...;'), which is being caught by the PG parser. Medi

On Jan 14, 2008 12:18 PM, Steve Midgley [EMAIL PROTECTED] wrote: [snip]
Perl would get this and use DBI to insert it into PG What I am experiencing now is that DB errors that I am trying to insert an incorrect data into column cost which is numeric and the data is coming in from HTML in UTF8 Mybe I have to convert it to ASCII numbers in Perl before inserting them into PG Thanks Medi Hi Medi, I agree that you should convert your values in Perl before handing to DBI. I'm not familiar with DBI but presumably if you're sending it UTF8 values it's attempting to quote them or do something with them, that a numeric field in Pg can't handle. Can you trap/monitor the exact sql statement that is generated by DBI and sent to Pg? That would help a lot in knowing what it is doing, but I suspect if you just convert your numbers from the HTML/UTF8 source values into actual Perl numeric values and then ship to DBI you'll be better off. And you'll get some input validation for free. I hope this helps, Steve Hi Medi, That structure for numeric values is never going to work, as best as I understand Postgres (and other sql pipes). You have to convert those UTF chars to straight numeric format. Hopefully that solves your problem? I hope it's not too hard for you to get at the code which is sending the numbers as UTF? Steve
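As an aside, if the application layer can't easily be fixed, the digit conversion can also be done inside Postgres. A minimal sketch, assuming the incoming characters are the Eastern Arabic-Indic digits from Medi's trace (table t1 is from the thread; everything else here is illustrative):

select translate('۱۲۳۴', '۰۱۲۳۴۵۶۷۸۹', '0123456789')::numeric as cost;
-- translate() maps each digit character to its ASCII counterpart,
-- so the same expression can feed the numeric column directly:
insert into t1 (c1, cost)
values ('name1', translate('۱۲۳۴', '۰۱۲۳۴۵۶۷۸۹', '0123456789')::numeric);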
[SQL] Re: How to keep at-most N rows per group? periodic DELETEs or constraints or..?
At 07:20 AM 1/9/2008, [EMAIL PROTECTED] wrote: Date: Tue, 8 Jan 2008 17:41:18 + From: Jamie Tufnell [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: Re: How to keep at-most N rows per group? periodic DELETEs or constraints or..? Message-ID: [EMAIL PROTECTED] On 1/8/08, codeWarrior [EMAIL PROTECTED] wrote: Jamie: I think you are probably having slowdown issues in your DELETE FROM ... WHERE NOT IN (SELECT ... ORDER BY ... DESCENDING) construct -- that seems a bit convoluted to me. Hmm, so rather than NOT IN ( ... LIMIT 50) would you suggest IN ( ... OFFSET 50) like in Erik's example? Or something else entirely? ALSO: It looks to me like you have a column named 'timestamp'??? This is bad practice since timestamp is a reserved word... You really ought NOT to use reserved words for column names... different debate. I do realize it would be better to use something else, and thanks for the tip. This is an established database and timestamp has been used in other tables, which is why I stuck to it here.. one day when time permits maybe I'll rename them all! Why bother deleting records anyway? Why not alter your query that tracks the 50 records to LIMIT 50??? The read query does LIMIT 50, and the reason for deleting the rest of the records is that they're not needed by the application and there are loads of them being created all the time (currently several million unnecessary rows) -- I imagine eventually this will slow things down? Do you think a regular batch process to delete rows might be more appropriate than a trigger in this scenario? Thanks, Jamie This is kludgy, but you would have some kind of random number test at the start of the trigger - if it evals true once per every ten calls to the trigger (say), you'd cut your delete statement execs by about 10x and still periodically truncate every set of user rows fairly often. On average you'd have ~55 rows per user, never less than 50, and a few outliers with 60 or 70 rows before they get trimmed back down to 50.. Seems more reliable than a cron job, and solves your problem of an ever growing table? You could adjust the random number test easily if you change your mind about the balance of size of table vs. # of delete statements down the road. Steve
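A minimal sketch of that randomized trigger, assuming the user_item_history table from Jamie's original post (below) and that plpgsql is installed; NEW.user_id is how the user id reaches the per-user subquery:

CREATE OR REPLACE FUNCTION trim_user_history() RETURNS trigger AS $$
BEGIN
  -- evals true on roughly one in ten calls, per the suggestion above
  IF random() < 0.1 THEN
    DELETE FROM user_item_history
     WHERE user_id = NEW.user_id
       AND id NOT IN (SELECT id FROM user_item_history
                       WHERE user_id = NEW.user_id
                       ORDER BY "timestamp" DESC
                       LIMIT 50);
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER user_history_trim AFTER INSERT ON user_item_history
  FOR EACH ROW EXECUTE PROCEDURE trim_user_history();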
[SQL] Re: How to keep at-most N rows per group? periodic DELETEs or constraints or..?
I think what you want is related to this post on how to create a FIFO queue in Postgres: http://people.planetpostgresql.org/greg/index.php?/archives/89-Implementing-a-queue-in-SQL-Postgres-version.html The major difference is that you want a FIFO queue per user_id, so the triggering code would want to bump old records aggregating on user_id to calculate the limit subquery. His original code is this: DELETE FROM q WHERE id NOT IN (SELECT id FROM q ORDER BY id DESC LIMIT maxrows); And that subquery is where (I think!) you'd want to add where user_id = [user_id] - I'm not sure how you'll pass user_id into this function, maybe someone else can help with that? Hopefully this is useful? Steve At 06:24 AM 1/8/2008, [EMAIL PROTECTED] wrote: Date: Tue, 8 Jan 2008 14:24:22 + From: Jamie Tufnell [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: How to keep at-most N rows per group? periodic DELETEs or constraints or..? Message-ID: [EMAIL PROTECTED] Hi, I have a table that stores per-user histories of recently viewed items and I'd like to limit the number of history items to <= 50 per user. I'm considering doing this with a query run from cron every so often, but I'm not happy with what I've come up with so far, and since it's a quite active table I thought I'd ask here to see if there's a more efficient way. Right now the table structure is as follows... user_item_history: id (PK), user_id (FK), item_id (FK), timestamp For user_ids that have more than 50 rows, I want to keep the most recent 50 and delete the rest. The most obvious way of doing this for me is: -- Get the user_ids with more than 50 history entries like this -- SELECT user_id, count(*) FROM user_item_history GROUP BY user_id HAVING count(*) > 50; -- Then iterate the ids above (_user_id) -- DELETE FROM user_item_history WHERE user_id = _user_id AND id NOT IN ( SELECT id FROM user_item_history WHERE user_id = _user_id ORDER BY timestamp DESC LIMIT 50); I've left out the simple logic tying the above two queries together for clarity.. I haven't actually tested this, but while I assume it would work, I imagine there is a neater and possibly more efficient way of attacking this. I'm also open to different approaches to limiting the user's history ... perhaps with table constraints so they can simply never exceed 50 entries? But I'm not sure how to do this.. Any help would be greatly appreciated.. Thanks, Jamie
[SQL] System catalog future changes
Hello again, Reading a previous recent post and its answers, called Describe Table, got me thinking about a little piece of SQL I use in an application to get a list of all the tables for a specific namespace: select pg_class.relname as table_name from pg_class join pg_namespace on pg_namespace.oid = relnamespace where pg_class.relkind = 'r' and pg_namespace.nspname = 'public' order by pg_class.relname I have assumed that this sort of query would be relatively durable - that future system catalog changes are unlikely to make this query stop working? Does anyone have any knowledge of how fixed the columns and values of this query are (i.e. are there a lot of internal and external dependencies that make future Pg versions unlikely to break the above code)? Any other input on the above SQL - should I be doing this in another way? Thanks for any thoughts or advice, Steve
Re: [SQL] System catalog future changes
Thanks Erik - that cleans things up significantly for me. For the record, the SQL for finding all the table names (alpha ordered) in the public schema using the pg_tables view is: select tablename from pg_tables where schemaname='public' order by tablename Steve At 09:38 AM 12/18/2007, Erik Jones wrote: If all you're looking for is regular tables, then I'd use the pg_tables view. Erik Jones Software Developer | Emma® [EMAIL PROTECTED]
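Another version-durable option (not mentioned in the thread, but worth noting) is the SQL-standard information_schema, available since Pg 7.4, which is insulated from system catalog changes in the same way the built-in views are:

select table_name
from information_schema.tables
where table_schema = 'public' and table_type = 'BASE TABLE'
order by table_name;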
Re: [SQL] Query design assistance - getting daily totals
Hi, Rodrigo is exactly right in my opinion. To provide a little more info on this calendar or day dimension idea.. You can create, for example, a time table dimension which stores every day of every year as a unique record (for as far into the future as you need). You can then associate various attributes to each day, depending on your business needs, like so: id|datetime|is_business_day|is_weekday|is_fed_holiday Of course it's not normalized, but that's the point. You then just store the id in various places, and it's easy to join back to this table and figure out if a particular day has an attribute you're interested in (or you can find the ids for all the days which have a particular attribute for a given date range - to go the other direction, for example). You can get more on this type of thinking from the most excellent resource by Ralph Kimball, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition) - this book did more to open my eyes to alternatives to traditional normalized modeling than anything else. It also made me feel less guilty about building certain non-normal structures. :) I hope that's helpful.. Steve At 12:21 PM 12/13/2007, [EMAIL PROTECTED] wrote: Date: Wed, 12 Dec 2007 14:53:08 -0500 From: Rodrigo De León [EMAIL PROTECTED] To: Paul Lambert [EMAIL PROTECTED] Cc: pgsql-sql@postgresql.org Subject: Re: Query design assistance - getting daily totals Message-ID: [EMAIL PROTECTED] On Dec 12, 2007 1:39 AM, Paul Lambert [EMAIL PROTECTED] wrote: It's a financial application which needs to work using a concept of 'financial periods' which may not necessarily correspond to calendar months, and it's much easier to manage in this way than it is to merge it all together using a date field. E.g., 1st January may actually be the 15th 'working day' of the 9th 'financial period' - however, looking at just a date of jan-1 there is no way of knowing this, and it's the periods that matter more so than the actual date. I think what you need is a Calendar Table to map actual dates to buckets, e.g. 'financial periods', etc. See: http://codeinet.blogspot.com/2006/08/auxiliary-calendar-table-for-sql.html
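A minimal sketch of such a day-dimension table (all names and the five-year range here are illustrative, not from the thread); date arithmetic against generate_series fills it in one statement:

create table day_dim (
  id serial primary key,
  day date not null unique,
  is_weekday boolean,
  is_fed_holiday boolean default false  -- flag the holidays afterward
);
insert into day_dim (day, is_weekday)
select d, extract(dow from d) between 1 and 5  -- dow: 0 = Sunday .. 6 = Saturday
from (select date '2008-01-01' + s.i as d
      from generate_series(0, 1825) as s(i)) as days;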
Re: [SQL] INSERT INTO relational tables
At 09:23 AM 12/7/2007, [EMAIL PROTECTED] wrote: Date: Fri, 07 Dec 2007 14:22:26 +0100 From: Stefan Scheidegger [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: INSERT INTO relational tables Message-ID: [EMAIL PROTECTED] Hi all I'm new to SQL and I'm facing a problem I can't find any information about (google, forums). By now I wonder if I understood something wrong about relational DBs. An example to explain my problem: Let's say I have a table containing information about the customer (name, address, ...) and about his order (pieces, product-name, price). Because one customer can order several products, I split the table into two relational tables to prevent redundancy: tbl_customer (cust_id, cust_name, cust_address) and tbl_order (ord_pieces, ord_productname, ord_price, ord_customer REFERENCES tbl_customer(cust_id)) Now I want to insert several new customers with several orders each at once. If I had all information in one table, this would be easy with something like: INSERT INTO tbl_customerorders (name, address, pieces, productname, price) VALUES ('MR. X', '1st street', 3, 't-shirts', 30), ('MR. X', '1st street', 5, 'books', 50), ('MRS. Y', '2nd street', 1, 't-shirt', 10),... But how can I do this in one query if I split the table? I can add one new customer, get his ID with currval() and then add his orders. But this won't work if I want to add several customers at once. To read this information I can do a query with the argument WHERE cust_id=ord_customer. I can create a VIEW doing this so I can read the data as if it was stored in only one table. But is there in postgres/sql an abstraction-layer that allows me to insert as if the information was stored in one table? (Something like a VIEW that provides INSERT, UPDATE, ... and automatically inserts the referenced ID.) Thanks for any help! Stefan Andreas has given you some deep answers (i.e. smarter than what I'm going to say), but since you say you're a newbie to SQL maybe this will help. Perhaps all you need is to wrap your entire set of statements into a transaction (http://www.postgresql.org/docs/8.2/interactive/tutorial-transactions.html) to ensure that your entries are ACID (http://databases.about.com/od/specificproducts/a/acid.htm). BEGIN; insert tbl_customer ...; insert tbl_order ...; insert tbl_order ...; END; It's not clear whether you're looking for syntax simplicity or referential integrity (or both). If syntax is your need, then Andreas has given some good ideas. As another thought about syntax enhancement: there are lots of object-relational mappers out there that let you model your relations in a higher level language (I use Ruby and ActiveRecord, which are easy to write and learn, but Java, .NET, etc. all have great stuff). Then you just build your entities in the domain specific language and it handles all the id inserts and relational mapping for you. They can even handle mapping many-to-many joined entities, if you're careful in setting it up. I hope this is helpful, Steve
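A sketch of that transaction pattern for Stefan's example tables, using currval() to pick up the id of the customer just inserted (tbl_customer_cust_id_seq is the default name Pg gives a serial column's sequence - an assumption here):

BEGIN;
INSERT INTO tbl_customer (cust_name, cust_address) VALUES ('MR. X', '1st street');
-- currval() is session-local, so this is safe even with concurrent inserts:
INSERT INTO tbl_order (ord_pieces, ord_productname, ord_price, ord_customer)
  VALUES (3, 't-shirts', 30, currval('tbl_customer_cust_id_seq'));
INSERT INTO tbl_order (ord_pieces, ord_productname, ord_price, ord_customer)
  VALUES (5, 'books', 50, currval('tbl_customer_cust_id_seq'));
COMMIT;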
[SQL] (repost) Help understanding expressions in order by clause
Hi, (I posted this to the list on 10/25 but it didn't seem to get distributed - apologies if it did and I'm actually double posting right now..) I've read on this list about some pretty powerful examples of using expressions in order by clauses to sort according to very complex rules. I kind of half get it and was playing around with it today to learn more. I'm hoping for one or two additional pointers from the resident experts.. Take this sample: --sql start drop table if exists test_order; create temp table test_order ( id serial, email varchar(255) ); insert into test_order (email) values ('[EMAIL PROTECTED]'); insert into test_order (email) values ('[EMAIL PROTECTED]'); insert into test_order (email) values ('[EMAIL PROTECTED]'); insert into test_order (email) values ('[EMAIL PROTECTED]'); insert into test_order (email) values ('[EMAIL PROTECTED]'); select * from test_order order by id <> 3, email --sql end Putting the expression id <> 3 causes the id of 3 to sort to the TOP of the list. This statement does the same thing: select * from test_order order by id = 3 DESC, email I know there is a good, rational reason for this, and I'm hoping someone will explain it to me? From what I can see, true is sorting AFTER false in Postgres? In general, is it better/preferred just to use case statements to get the results I'm expecting? Or maybe someone could point me to the relevant docs that explain order by behavior in more detail? I read this http://www.postgresql.org/docs/8.2/interactive/sql-select.html#SQL-ORDERBY and it didn't make things any clearer. I'm quite interested in the power of expressioned sort by's and so I appreciate any assistance in this regard. Thanks! Steve
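For the record, since booleans sort with false before true, the expression form and an explicit CASE spelling are interchangeable here (both put the id = 3 row first):

select * from test_order order by (id <> 3), email;
-- equivalent, with the ordering rule spelled out:
select * from test_order
order by case when id = 3 then 0 else 1 end, email;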
[SQL] SQL performance help: self join or static var
Hi, Given a table (I've truncated some real fields for simplicity):

CREATE TABLE city (
  id serial NOT NULL,
  muni_city_id integer,
  post_code_city_id integer,
  alias_city_id integer,
  city_type character varying(15),
  post_code_type character varying(15),
  CONSTRAINT city_pkey PRIMARY KEY (id)
) WITHOUT OIDS;
CREATE INDEX index_city_on_muni_city_id ON city USING btree (muni_city_id);
CREATE INDEX index_city_on_post_code_type ON city USING btree (post_code_type);

Filled with ~168,000 records. Which of the following SQL statements should I expect better performance on?

select * from city where post_code_type in ('P', 'R')

EXPLAIN ANALYZE:
Seq Scan on city (cost=0.00..4492.82 rows=76172 width=290) (actual time=0.039..163.564 rows=30358 loops=1)
  Filter: ((post_code_type)::text = ANY (('{P,R}'::character varying[])::text[]))
Total runtime: 231.947 ms

OR

select * from city where id = muni_city_id

EXPLAIN ANALYZE:
Seq Scan on city (cost=0.00..3535.41 rows=383 width=290) (actual time=0.022..124.463 rows=30200 loops=1)
  Filter: (muni_city_id = id)
Total runtime: 195.342 ms

In my case both statements are semantically equivalent, and I'm trying to figure out if I should prefer the search of a varchar field for static values over the self join constraint to an indexed integer column? My (almost completely ignorant) eyes say that the latter (id=muni_city_id) is faster by a little - any voices of support or other ideas in this regard? Caveats? Thanks, Steve
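One further thing that may be worth testing (an assumption, not something from the thread): with both predicates matching ~30,000 of 168,000 rows, a seq scan is the expected plan either way; if the production filter were much more selective, a partial index matching the self-join predicate would give the planner a smaller structure to scan:

-- hypothetical; only pays off if the predicate is selective enough
CREATE INDEX index_city_is_muni ON city (id) WHERE id = muni_city_id;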
Re: [SQL] raw data into table process
Hi, I'm not sure if you have access to a scripting language (like perl or ruby), but my experience is that if you transform the source text file into a secondary text file that postgres copy can read natively/directly into the data formats you want, the copy command will move everything into your primary table, and way faster than using sql to transform. I've had great experience with using Ruby/regex to do text file transforms such as this - it's amazingly fast (considering many say it's a performance dog). The best way to get started is to copy some sample data OUT of PG to a file and then use your scripting language to build a transformation that formats exactly like your sample. (I've imported polygon, circle and point types using this method, as well as dates, like what you're trying to do). I hope that helps - drop me a line off-list if you'd like some sample Ruby code to read/write/transform your source. Steve At 06:33 AM 8/22/2007, [EMAIL PROTECTED] wrote: Date: Wed, 22 Aug 2007 14:36:15 +1000 From: novice [EMAIL PROTECTED] To: [EMAIL PROTECTED], pgsql-sql@postgresql.org Subject: raw data into table process Message-ID: [EMAIL PROTECTED] I am trying to record the following entries into a table. I'm curious to know if there's an efficient/effective way of doing this? These entries come from an ancient datalogger (note: separated by space and uses YY/MM/DD format to record date). Plain file sample.dat:

3665 OK BS 07/08/16 07:28
3665 CC BS 07/08/16 07:29
3665 CS BS 07/08/16 07:29
3665 CS BS 07/08/16 07:29
4532 OK BS 07/08/16 07:34
4004 OK BS 07/08/16 07:51
3991 OK BS 07/08/16 07:54

This is the table that I'm adding the entries to:

CREATE TABLE maintenance (
  maintenance_id SERIAL PRIMARY KEY,
  meter_id integer,
  status text,
  inspector text,
  inspection_date timestamp with time zone
);

-- Begin SQL Script
-- First table to dump the records in
CREATE TABLE dataload1 (data text);
-- Dump records using \copy
\copy dataload1 FROM sample.dat
-- Second table to import unique records ONLY
CREATE TABLE dataload2 AS SELECT DISTINCT data FROM dataload1;
-- Now I update unique records into the maintenance table
-- maintenance_id is SERIAL so it will be populated automatically
INSERT INTO maintenance (meter_id, status, inspector, inspection_date)
SELECT substr(data, 1, 4)::int,
       substr(data, 8, 3),
       substr(data, 21, 2),
       ('20'||substr(data, 24, 2) ||'-'|| substr(data, 27, 2) ||'-'|| substr(data, 30, 2)||' '||substr(data, 33, 5))::timestamp as inspection_date
FROM dataload2
-- So the new records will also be in timestamp order
ORDER BY inspection_date;
-- Some housekeeping
VACUUM FULL VERBOSE ANALYZE maintenance;
-- Finally, drop the temporary tables
DROP TABLE dataload1;
DROP TABLE dataload2;
-- End SQL script

Any thoughts and suggestions welcome.
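As a side note, the date assembly can be collapsed into a single to_timestamp() call, assuming the fixed-width layout implied by the substr() offsets above (positions 24-37 hold "YY/MM/DD HH:MI"):

SELECT to_timestamp('20' || substr(data, 24, 14), 'YYYY/MM/DD HH24:MI')
FROM dataload2;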
Re: [SQL] Increment a sequence by more than one
Hi Peter, I struggled to implement Michael's suggestion to use CACHE in this regard when he made it, but after your encouragement I've studied it more, and you and he are both totally right - CACHE is designed to do exactly what I want. Here's the sample code, so as to put this issue to bed and to record what the solution is: Scenario: Bob wants a block of 50 ids. Alice just wants a single id but will accidentally interlope into Bob's sequence obtainment. property_id_seq = 100

Bob:   # alter sequence property_id_seq CACHE 50
Alice: # select nextval('property_id_seq') = 101 (wastes ids up to 150)
Bob:   # select nextval('property_id_seq') = 151 (Bob now knows that 151-200 are locked permanently for his exclusive use)
Bob:   # alter sequence property_id_seq CACHE 1 = Sequence will now return single ids to everyone

So in the worst case, there will be id wastage equal to the CACHE size times the number of interlopers who grab ids while Bob is obtaining his block. And Bob's time to grab a set of ids is fairly small, since he's only issuing a couple of very fast sql statements.. NOTE: If all calling parties agree to always use the same CACHE number for obtaining blocks of ids, then this method seems bulletproof (if two parties use differing CACHE #'s, then they could cause too few ids to be CACHED to one of the parties). I hope this helps someone else on the archives down the road. Thanks to everyone for putting their time and attention on this problem. I'm very grateful. Sincerely, Steve At 08:00 AM 8/6/2007, Peter Childs wrote: On 03/08/07, Michael Glaesemann [EMAIL PROTECTED] wrote: On Aug 3, 2007, at 15:27, Erik Jones wrote: Is there actually a requirement that the block of 5000 values not have gaps? Good point. If not, why not make the versioned table's id column default to nextval from the same sequence? Of course, the ids of the two tables could be interleaved in this case. This might not be an issue, of course. Michael Glaesemann grzm seespotcode net It seems to me that one should use the cache feature of a sequence, which is there just for this purpose. That way when you get the next value, your session caches it and any other sessions will get one after your cache range. Peter
[SQL] Increment a sequence by more than one
Hi, I'm writing an import app in a third party language. It's going to use COPY to move data from STDIN to a postgres (8.2) table. There are some complexities though: it's going to copy the records to a versioned table first, and then at a later time the records will be copied by a different procedure to the live table. The live table and versioned table are identical in terms of their field definitions. But there is no sequence associated with the versioned table (whose primary key is id plus import_group_id, whereas the live table's pk is just id). So all versioned table entries must already know what their id would be in the live table. (This makes sense for other business processes we have, but it's a bit of a problem in this instance). My problem: I'd like to be able to grab a block of ids from the live table's pk sequence. So let's say my importer has 5,000 new rows to import and the current max pk in the live table is 540,203. I'd like to be able to increment the primary key sequence in such a way that I get a block of ids all to myself and the sequence is reset to 545,203 with a guarantee that all the ids between 540203 and 545203 are unused. I'm guessing this can be done with a stored procedure, but if possible I'd like to be able to run this command from my third party app without calling a stored procedure (we try not to use stored procedures here b/c we code entirely in this third party language - if we had to, it's possible we could install a stored procedure though). But since I've seen so much magic on display from people on this list, I'm going to ask if it's possible to do this solely from PG SQL sent from a third party language? The main tricky bit seems to be ensuring that everything is locked so two such increment calls at the same time don't yield overlapping blocks of ids. Is there a way to lock the sequence generator for the duration of a nextval and setval call? Since pk sequence functions like nextval cannot be rolled back, I'm guessing that begin/end won't do the job? I.e.:

-- need magic lock statement on pk sequence here
nextval -- returns 540203
setval(545203) -- now sequence is set to where I want it and I own 5000 ids
-- release magic lock here

My fallback is to just have a routine that calls nextval a bunch of times and stores all the ids it gets - they may or may not be sequential, but they'll be unique. This is going to be a really slow way to get a large number of ids of course, and just seems plain wrongheaded in many ways. Any insights? All help is appreciated, and input on a better way to solve the problem completely is of course welcome as well. Sincerely, Steve
Re: [SQL] Increment a sequence by more than one
Hi Scott, Thanks for this info (and Michael too!). Let me see if I understand your suggestion. I would run these three commands in sequence: # select nextval('[my_seq_name]'); returns = 52 [I believe that the sequence is at 52] # alter sequence [my_seq_name] increment by 5000; # select nextval('[my_seq_name]'); returns = 5052 If the third command doesn't return 5052 - i.e. it returns 5053 - then I know that somewhere in this sequence another process grabbed an id out from under me. It doesn't matter where, but I would know that my 5000 ids are not unique and should be discarded? If the third command DOES return 5052, then I know that those 5000 ids are locked for my use and no other application could have grabbed one of them? Can anyone see a flaw in that? It looks right to me.. Scott - it also seems to me that I need not waste all those ids if another application does grab one during my statement: If I detect a failure, I could just reset the pk sequence back to the max id of the underlying table before trying again. I think this code would do it (stolen from Ruby's postgres adaptor): SELECT setval('[seq_name]', (SELECT COALESCE(MAX([pk_of_seq]), (SELECT min_value FROM [seq_name])) FROM [table_of_pk]), false) So for table property with pk of id and sequence name property_id_seq: SELECT setval('property_id_seq', (SELECT COALESCE(MAX(id), (SELECT min_value FROM property_id_seq)) FROM property), false) Then I could just retry the first process - though on a table that is very busy, I might have to retry for a while.. Regarding Michael's suggestion - I tried messing around with LOCK and similar commands, but they're only allowed to run against TABLES, not SEQUENCES - too bad - that would have been perfect. I'm now starting to think that there's no way to solve this problem in an elegant manner even in a stored procedure? Your method seems to be as good as it's going to get? (Not that I'm complaining!) Thanks again - any thoughts are welcome, Steve At 08:01 PM 8/3/2007, Scott Marlowe wrote: The real danger in doing this is race conditions. Most anything you do involves a possible race condition. As long as the race condition doesn't result in an id getting used twice, you're safe.
So:

test=# create sequence m;
CREATE SEQUENCE
test=# select nextval('m');
 nextval
---------
       1
(1 row)

test=# alter sequence m increment by 5000;
ALTER SEQUENCE
test=# select nextval('m');
 nextval
---------
    5001
(1 row)

test=# alter sequence m increment by 1;
ALTER SEQUENCE
test=# select nextval('m');
 nextval
---------
    5002
(1 row)

In this method, the only possible race condition is that someone might run a nextval('m') between the time you set the increment to 5000 and 1 again. If that happens, you'd have 5,000 missing ids, but since sequences are designed to prevent dupes, not holes, that's ok. Is there a way to lock the sequence generator for the duration of a nextval and setval call? Avoiding the setval is the real key. It doesn't scale. Missing 5,000 ids is no big deal. Repeating them IS a big deal. Not using setval is the best answer.
[SQL] Race condition in resetting a sequence
Hi Scott, You've moved into more general territory, so I'm starting a new thread. The code I provided to reset a primary key sequence is actually part of Ruby on Rails' core library - actually they use something very similar to what I originally sent: SELECT setval('#{sequence}', (SELECT COALESCE(MAX(#{pk})+(SELECT increment_by FROM #{sequence}), (SELECT min_value FROM #{sequence})) FROM #{table}), false) Where: #{sequence} = sequence name #{pk} = primary key of table under sequence #{table} = table under sequence Their code is a little different from what I provided before b/c it increments by one (times the increment #) above the max(pk). But essentially it's the same. (I think their method leaves small gaps in the sequence every time it runs). Also I think their method is likely to be a little slower (one extra select statement) and therefore (perhaps) more vulnerable to a race? You mentioned something more general though: As long as you're using setval you have a race condition? However the postgres manual states: The sequence functions, listed in Table 9-34 (http://www.postgresql.org/docs/8.2/interactive/functions-sequence.html#FUNCTIONS-SEQUENCE-TABLE), provide simple, multiuser-safe methods for obtaining successive sequence values from sequence objects. (http://www.postgresql.org/docs/8.2/interactive/functions-sequence.html) Included in Table 9-34 is setval - so I'm not clear how it can have a race condition all by itself? Or maybe it only has a race condition when used in ways similar to how Ruby/Rails is using it? (i.e. with a compound select/coalesce statement as one of its parameters?) Would this command have a race condition: select setval('my_pk_seq', 500) This issue is reasonably important since Ruby on Rails is fairly widely used. As you say, the race window would be pretty small on a compound select -- and the Ruby function doesn't actually get called very often, but if you wouldn't mind explaining how the race condition would manifest, I'll post a warning on the RoR bug tracking site so that people can at least understand that there's a potential bug here.. Thanks again, Steve At 08:42 PM 8/3/2007, Scott Marlowe wrote: On 8/3/07, Steve Midgley [EMAIL PROTECTED] wrote: Hi Scott, Thanks for this info (and Michael too!). Let me see if I understand your suggestion. I would run these three commands in sequence: # select nextval('[my_seq_name]'); returns = 52 [I believe that the sequence is at 52] # alter sequence [my_seq_name] increment by 5000; # select nextval('[my_seq_name]'); returns = 5052 If the third command doesn't return 5052 - i.e. it returns 5053 - then I know that somewhere in this sequence another process grabbed an id out from under me. It doesn't matter where, but I would know that my 5000 ids are not unique and should be discarded? If the third command DOES return 5052, then I know that those 5000 ids are locked for my use and no other application could have grabbed one of them? No, that's not what would happen. If someone grabbed an id after the increment value was changed, then you'd get 10052, cause they would increment the sequence by 5,000. Since you're not using setval, and you're keeping the increment positive, there's no danger of collision, only of over-incrementing and leaving a giant hole in your sequence, which is ok. Can anyone see a flaw in that? It looks right to me..
Scott - it also seems to me that I need not waste all those id's if another application does grab one during my statement: If I detect a failure, I could just reset the pk sequence back to the max id of the underlying table before trying again. I think this code would do it (stolen from Ruby's postgres adaptor): That is open to a race condition. The bad kind. SELECT setval('[seq_name]', (SELECT COALESCE(MAX([pk_of_seq]), (SELECT min_value FROM [seq_name])) FROM [table_of_pk]), false) As long as you're using setval, you have a race condition. Please avoid it. Unless you can guarantee that no one else is using the database at the same time (during a maintenance window etc...) So for table property with pk of id and sequence name property_id_seq: SELECT setval('property_id_seq', (SELECT COALESCE(MAX(id), (SELECT min_value FROM property_id_seq)) FROM property), false) You'd think that the select coalesce and the outer select setval would not have a race condition, but they still do. Just a much smaller one. I'm now starting to think that there's no way to solve this problem in an elegant manner even in a stored procedure? Your method seems to be as good as it's going to get? (Not that I'm complaining!) Yep. Safe is better than pretty or elegant. :)
[SQL] Selecting rows with static ordering
Hello, I have a strange problem (or one that I've never had before anyway). I am searching for a list of ids for a given table (these id values are generated at run-time and held statically in an application-local variable). From that application, I want to retrieve all those rows, and I want them in the order they are currently stored in that variable. So take for example this foreign application variable: ids = 3,2,5,1,4 The application then executes this sql: select * from table where id in (3,2,5,1,4) As-is, of course, the above query will return the 5 records in a semi-random (i.e. unpredictable/unreliable) order. And I don't want to just order by id - I want to order by id(3,2,5,1,4) (if you see what I mean). Is there a neat trick that anyone knows for pulling this off in a single query? Basically right now I'm issuing 5 queries to the backend to ensure ordering, but this is horribly inefficient. Any input or advice would be appreciated, Steve
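One single-query approach (a sketch - my_table stands in for the real table name) is to join against a VALUES list that carries the desired rank alongside each id; VALUES in FROM works as of 8.2:

select t.*
from my_table t
join (values (3, 1), (2, 2), (5, 3), (1, 4), (4, 5)) as ord (id, rank)
  on t.id = ord.id
order by ord.rank;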
Re: [SQL] [pgsql-sql] Daily digest v1.2492 (19 messages)
Hi John, It sounds like a disk-bound operation, so cpu is not maxed out. I'm not clear on all the details of your operation, but it sounds like you're using Java to do row-by-row based inserts, selects and updates within a transaction, from a file. This can be a very slow process if you have many rows. The OS stats you describe fit that theory (but not conclusively). If you are using (pseudo-)code such as:

Open file {
  Read line {
    select from Pg: select from [other_table] where val = [line[colN]]
    exec to Pg: insert into [table] (col1, col2, ...) values (line[col1], line[col2], ...)
  }
}

You can radically speed up such a system by using the copy (http://www.postgresql.org/docs/8.2/interactive/sql-copy.html) command to load all the data at once from the file into Pg, and then do post-processing with Java/SQL to get all the fields looking right. Doing a bulk update with a join across several tables is so much faster than looping through them with a wrapper in Java (or other lang) you won't believe it. I hope this helps and is on-topic for you. Steve At 09:38 AM 4/3/2007, [EMAIL PROTECTED] wrote: Date: Tue, 03 Apr 2007 22:16:13 +0800 From: John Summerfield [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: A long-running transaction Message-ID: [EMAIL PROTECTED] I have a Java (java 1.1) program that I wrote some years ago, to read records from a text file and insert them into a postgresql database. One of the assumptions I made was that one file contained one day's data, maybe as many as 1500 records, and I coded it to do the whole lot as one transaction so either a single file was loaded in its entirety, or none of its data was. I lost the Java code, but revived the idea and I've collected about two years' data using (Linux) shell scripts, and loading the data using psql. Then, I found the Java code on a disused hard disk :-) I made the necessary changes for it to build in java 1.5, and used psql to extract data from my new database in the correct format for the old program. This time, I have a little more data than I ever loaded at once before:

[EMAIL PROTECTED]:~$ wc -l testdata
6242217 testdata
[EMAIL PROTECTED]:~$ \ls -hl testdata
-rw-r--r-- 1 summer summer 285M 2007-03-28 22:32 testdata
[EMAIL PROTECTED]:~$

Now, it wouldn't surprise me if postgresql used lots of memory - but how much could it possibly need? My laptop, where I first tried this, has 1.25 Gbytes, so I could allow it some. It wouldn't surprise me a lot if it used lots of memory and caused all sorts of problems, but no, it's chugging away, still using no more RAM than it could have had on my old Pentium 133 all those years ago. In the process of checking it out, I've set it running on a machine with an AMD Sempron(tm) 2400+ running Kubuntu 6.10 (kernel is 2.6.17-6-server-xen0) and 512 Mbytes of RAM. This is the java program :-)

summer pts/6:0.0 Thu205days 1:07 1:07 /usr/bin/gij-4.1 -cp /usr/s

It's been running five days so far, and I can see where it's up to by attaching strace. It's reading 2k of the input file every few seconds. Okay, clearly something's wrong, and I don't think it's all my cruddy code. No probs swapping:

[EMAIL PROTECTED]:~$ free
             total       used       free     shared    buffers     cached
Mem:        460800     456472       4328          0        860     262164
-/+ buffers/cache:     193448     267352
Swap:      1461872        284    1461588
[EMAIL PROTECTED]:~$

It is hitting the disk pretty hard now on this machine, but the laptop's still going too, and the disk seems to run about half the time, part of a second running, part idle (but the intervals are getting shorter).
It struck me as fairly curious that neither postgresql nor the application was hogging the CPU. Perhaps the laptop is more interesting: look at the size of the buffer pool:

[EMAIL PROTECTED]:~ free
             total       used       free     shared    buffers     cached
Mem:       1295528    1268548      26980          0       3976     392388
-/+ buffers/cache:     872184     423344
Swap:      1941496      32656    1908840
[EMAIL PROTECTED]:~

Again, no problem with over-use of RAM, and I'm logged on using KDE too and that's running fine. It's been running a little longer here:

summer pts/2328Mar07 5days 25:12 25:11 java -cp /home/summer/Classes/:/usr/share/p

This is Sun's Java 1.5 on OpenSUSE 10.2. This is what suggested I should write:

[EMAIL PROTECTED]:~ procinfo
Linux 2.6.18.8-0.1-default ([EMAIL PROTECTED]) (gcc 4.1.2 20061115) #1 1CPU [Echidna.]

Memory:      Total        Used        Free      Shared     Buffers
Mem:       1295528     1271720       23808           0        3716
Swap:      1941496       32656     1908840

Bootup: Tue Mar 27 18:50:19 2007    Load average: 2.21 2.65 2.69 2/243 19305

user  :   1d  3:17:04.03  16.0%   page in :  131097310   disk 1: 3079516r 20087664w
nice  :       0:05:39.64   0.1%   page out:  197016649
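A sketch of the COPY-then-join approach Steve recommends above (all table and column names are hypothetical stand-ins for John's schema):

-- load the whole day file in one shot instead of row-by-row inserts:
COPY staging (col1, col2, lookup_val) FROM '/tmp/testdata';
-- then resolve the per-row lookup as a single set-based statement:
INSERT INTO target (col1, col2, other_id)
SELECT s.col1, s.col2, o.id
FROM staging s
JOIN other_table o ON o.val = s.lookup_val;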
Re: [SQL] Issue with copying data from a text file.
Hi, I think I had the exact same problem as you do a while back, and I solved it by removing the header row and the CSV HEADER clause of the statement. For the large files I had, it was easier (for me) to remove the header row than it was to escape out all the quotes (or regen the file): COPY deal_lines_temp_load FROM 'c:/temp/autodrs_deal_lines.txt' WITH DELIMITER AS '^'; I think the parser doesn't look for nor generate quoted rows except when CSV is specified.. It would be nice if there was a way to specify a HEADER row without invoking CSV parsing rules (friendly hint to core devs!) :) Let us all know if that works! Steve At 03:14 AM 3/20/2007, you wrote: Date: Tue, 20 Mar 2007 11:25:38 +0900 From: Paul Lambert [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: Issue with copying data from a text file. Message-ID: [EMAIL PROTECTED] I have a procedure in place that copies data from a caret delimited text file into a table storing some information. One of the fields in the table contains an item description which may contain item dimensions such as 17" alloy wheels. The problem I am getting when I do my load is I believe due to the presence of the double quotation marks, giving the copy the impression that it is to include the information following as a single text string until it gets to the next set of double quotes. As a result, I get the following: AutoDRS=# COPY deal_lines_temp_load FROM 'c:/temp/autodrs_deal_lines.txt' WITH DELIMITER AS '^' CSV HEADER; ERROR: value too long for type character varying(30) CONTEXT: COPY deal_lines_temp_load, line 87, column order_desc: 17" 5 spoke alloy wheels.^1291.18^117.38^983.69^1291.18^^C^^ The column as you can see is defined as a 30 character field; the load contains in this column ^17" 5 spoke alloy wheels.^ I note an option in the COPY command to specify the quote character, defaulting to double quote. The problem being a single quote will also be used in the data, as will other characters. Is there any way to get a copy to have no quote character? I.e. read the file and put whatever is between the caret characters straight into the appropriate field exactly as is. TIA, Paul.
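If the header row needs to stay, another workaround sometimes used (untested here, so treat it as a sketch) is to keep CSV mode but point QUOTE at a character that can never occur in the data, such as backspace:

COPY deal_lines_temp_load FROM 'c:/temp/autodrs_deal_lines.txt'
WITH DELIMITER AS '^' CSV HEADER QUOTE AS E'\b';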
Re: [SQL] A form of inheritance with PostgreSQL
Hi Greg, While not in a C++ framework, you might find that it's not too hard to implement something similar in your system - it's called Single Table Inheritance. References to the Ruby on Rails implementation here: http://wiki.rubyonrails.org/rails/pages/SingleTableInheritance It's based on Martin Fowler's Patterns of Enterprise Application Architecture book - please find references to his original patterns here: http://www.martinfowler.com/eaaCatalog/singleTableInheritance.html The key, I believe, is simply adding a type and a parent_id to the class table, so you can model all your types and their hierarchical relations. Fowler's diagram is pretty clear. I think then you would store the data in another table (or tables) and link into this inheritance structure to establish ancestry for any piece of data (some people try to store the data in this table too, but I think that's a mistake personally). If I understand what you're trying to do, you can use this design pattern in your application language to implement an inheritance scheme without any special database features (i.e. in a SQL-standard manner). I hope this is helpful, Steve At 12:28 AM 3/9/2007, [EMAIL PROTECTED] wrote: Date: Thu, 08 Mar 2007 13:01:51 -0500 From: Greg Toombs [EMAIL PROTECTED] To: pgsql-sql@postgresql.org Subject: A form of inheritance with PostgreSQL Message-ID: [EMAIL PROTECTED] Hello. I'm trying to figure out how to nicely implement a C++ class-like system with PostgreSQL. Consider the following: Tables Fruit, Apple, Orange. I want to design the foreign key scheme such that there are relations between Fruit and Apple, and Fruit and Orange, that imply that Apple is a Fruit, and Orange is a Fruit. I don't want to eliminate the existence of Apple and Orange tables, because there will be columns specific to both Apple and Orange; if I include these columns in Fruit, then if a Fruit is an Orange, the Apple columns will be needlessly present in Orange rows. The different ways of implementing this scheme that I've thought of (some uglier than others): - Have Fruit contain foreign keys to both Apple and Orange, and write a check constraint in Fruit specifying that exactly one of (Apple FK, Orange FK) needs to be non-null. The disadvantage of this method is that it isn't exactly loosely coupled. For every other fruit type table I implemented, I'd have to go back and add a foreign key in Fruit. - Have a foreign key in Apple to Fruit, and in Orange to Fruit; then somehow create a constraint that imposes uniqueness on the union of foreign keys in both Apple and Orange. To figure out what type of fruit a Fruit row is, run a query for foreign keys in Orange and Apple matching the primary key of Fruit. You'd also want to somehow create a constraint that the result of this query should always return exactly one row (perhaps with a trigger?) Any advice will be appreciated! As I'm relatively new to Postgres, I might need some help with the actual implementation as well. Thank you. - Greg
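A minimal sketch of that shape (hypothetical columns; this is Fowler's pattern expressed in plain SQL, not Postgres's own table inheritance): a class table carrying name and parent_id to model the hierarchy, with data rows linking into it:

create table fruit_class (
  id serial primary key,
  name text not null,                            -- 'fruit', 'apple', 'orange'
  parent_id integer references fruit_class (id)  -- null for the root class
);
create table fruit_item (
  id serial primary key,
  class_id integer not null references fruit_class (id),
  attrs text                                     -- subtype-specific data
);
insert into fruit_class (name, parent_id) values ('fruit', null);
insert into fruit_class (name, parent_id) values ('apple', 1);
insert into fruit_class (name, parent_id) values ('orange', 1);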
Re: [SQL] GiST index question: performance
Hi Peter, :) All my Pg code is written via (or handed to) an abstraction layer, and I actually write no functions or stored procedures at all. I write using Rails, so in this case it's a Ruby library called ActiveRecord which has a Postgres module that allows me to talk via ActiveRecord-speak or via direct Postgres sql commands. (For example, AR has no idea how to create a GiST index, so I issue that DDL statement manually using the special syntax - also AR is not always so smart about SQL queries, so tricky ones I write by hand). Maybe I misunderstand Q3C completely, but it looks like C code that has to be installed into the Postgres server itself - not a series of SQL functions that can be implemented on an unmodified server. I think my ISP is fine with anything that gets installed via user-level privileges. Anything that requires root and/or anything that involves binary code they are more cautious about. To be fair, I'm cautious about the same things, but given Oleg's reputation and contributions to Pg, I wouldn't be so concerned about Q3C specifically. Am I ignorant of something fundamental in this conversation? I really do appreciate any education or insight here. Are C code patches or functions more of a risk to server stability/reliability than higher level code? Or am I speaking gibberish? Thanks, Steve At 01:01 AM 3/6/2007, Peter Eisentraut wrote: Steve Midgley wrote: my ISP that manages my Pg SQL server is (in my interests) concerned about installing anything non-standard (read: unstable) onto their server. I was able to get them to install your TSearch2 b/c it's been proven many times, but I'm hesitant to even bring up Q3C since it's less widely deployed. How do you manage to get your own code installed under that theory? -- Peter Eisentraut http://developer.postgresql.org/~petere/
[SQL] GiST index question: performance
Hi, First off, can I say how much I love GiST? It's already solved a few problems for me that seemed impossible to solve in real-time queries. Thanks to everyone who works on that project! I'm developing a geographic index based on a set of zip code boundaries. Points of interest (POI) will fall within some boundaries and not others. I need to search to find which POI are within a specified boundary. I think I have two options (see below), and I'm wondering if anyone has an opinion or experience as to whether one or the other will have substantially different performance characteristics. I can obviously test when I get that far, but I'd prefer to try the anticipated faster route first, if anyone has existing experience they can share: 1) Index a series of circles of NN radius around each boundary marker (lat/long point). Run a search on POI for those that fall within any of the specified circles. 2) Index a set of polygons that mark the minimum area around the boundary markers in question. Run a search on POI that fall within this single polygon. The polygon will have more points, but there will be more circles to search - my understanding of GiST is limited, so I'm not sure if there's a performance benefit to searching many circles or a few polygons. My tables are of this size: # of POI: 50,000 # of zip blocks (with and without regions): 217,000 # of zip blocks in a given city (and hence in a given polygon): ~5 Any thoughts or ideas? Thank you, Steve p.s. I could use a GIS system alongside of Postgres, but performance and efficiency are key to this system, and it seems to me that raw GiST indexed SQL queries are going to be fastest and create the lowest load on the server?
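A sketch of option 1 using only the built-in geometric types (table and column names are made up; the radius is in raw coordinate units, matching the rough-but-fast approximation discussed in this thread): the circles get a GiST index, and a zero-radius circle stands in for each POI point so the circle-contains-circle operator can use it:

create table zip_circle (zip varchar(5), area circle);
create index zip_circle_area_gist on zip_circle using gist (area);

-- POIs falling inside any of a city's circles (poi.loc is a point column):
select p.id
from poi p, zip_circle z
where z.zip in ('99762')
  and z.area @> circle(p.loc, 0);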
Re: [SQL] GiST index question: performance
Thanks Oleg - very interesting stuff you are working on. You may recall I exchanged emails with you on openfts a little while ago - my ISP that manages my Pg SQL server is (in my interests) concerned about installing anything non-standard (read: unstable) onto their server. I was able to get them to install your TSearch2 b/c it's been proven many times, but I'm hesitant to even bring up Q3C since it's less widely deployed. The search method I proposed in my first email is not totally accurate, but just searching circles with radii using a GiST index and standard Pg circle datatypes seems like a close enough solution for me (as opposed to Q3C's conical search intersections with a spherical projection). I realize that at higher latitudes my circles will be elliptical, but our needs are for approximations that are very fast rather than accurate, and the radii being searched are small relative to the size of the sphere (i.e. when searching Nome, find everything in +/- 40 miles and especially don't return Anchorage POI).. It's an end user database, so if the query takes 500ms, that's really too long. On the Q3C site, I see that your measure of speed is processing many, many rows in 20 hours, which is a whole different ballgame. :) Do you have a thought as to whether GiST is going to be faster/more efficient with Pg standard types of polygons or circles? I suppose I should just test out both, and quit wasting your time. I'll certainly repost to the list with whatever I uncover. I really do appreciate the help you've provided. Sincerely, Steve At 12:21 PM 3/5/2007, you wrote: On Mon, 5 Mar 2007, Steve Midgley wrote: Hi, First off, can I say how much I love GiST? It's already solved a few problems for me that seemed impossible to solve in real-time queries. Thanks to everyone who works on that project! Thanks, Steve! I'm developing a geographic index based on a set of zip code boundaries. Points of interest (POI) will fall within some boundaries and not others. I need to search to find which POI are within a specified boundary. Your POI is what we call a ConeSearch query in astronomy. Please take a look at the Q3C algorithm available from http://q3c.sf.net. Some information: http://www.sai.msu.su/~megera/wiki/SkyPixelization This is what we use in our Virtual Observatory project, and we're able to work with 10^9 objects on moderate hardware. It doesn't use GiST, but a special pixelization scheme allows the use of a standard btree. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Re: [SQL] COPY FROM query
Hi, For what it's worth, WindowsXP (at least) seems to currently support forward slashes at the OS level, so this Postgres behavior isn't as odd as it might seem. If you enclose your paths with double quote (") marks, Windows will even accept Unix style paths for some instructions on the command line: dir "/temp" Feeding Windows API calls with forward slashes seems to work with everything I've tried so far, so using them in Postgres seems perfectly smart and reasonable.. Hope that helps, Steve At 11:03 AM 2/12/2007, [EMAIL PROTECTED] wrote: Date: Mon, 12 Feb 2007 09:09:08 +0900 From: Paul Lambert [EMAIL PROTECTED] To: Joe [EMAIL PROTECTED] Cc: pgsql-sql@postgresql.org Subject: Re: COPY FROM query. Message-ID: [EMAIL PROTECTED] Joe wrote: Hi Paul, On Mon, 2007-02-12 at 08:19 +0900, Paul Lambert wrote: I'm attempting to copy from a table into a file using a select query inside the copy. The following is my command: COPY (SELECT DISTINCT ON (dealer_id,appraisal_id) * FROM appraisals_temp) TO 'c:\autodrs_appraisal_new.txt' WITH DELIMITER AS '^' CSV HEADER; I get the following returned: WARNING: nonstandard use of escape in a string literal LINE 1: ...ealer_id,appraisal_id) * FROM appraisals_temp) TO 'C:\autodr... ^ HINT: Use the escape string syntax for escapes, e.g., E'\r\n'. ERROR: relative path not allowed for COPY to file SQL state: 42602 (The caret character is pointing to the M in FROM) I believe that on Windows you need to use double backslashes, i.e., 'c:\\autodrs_appraisal_new.txt', although the regular slash may also work, i.e., 'c:/autodrs_appraisal_new.txt'. Joe If this is the case, it is strange that the first copy statement works, as that is also only using a single backslash. Having said that, if I switch it to a forward slash it works without error... odd, given that weenblows standard is backslash. Thanks for the help though. -- Paul Lambert Database Administrator AutoLedgers