Re: [PERFORM] queries with lots of UNIONed relations
14.01.11 00:26, Tom Lane wrote:
> Robert Haas writes:
>> On Thu, Jan 13, 2011 at 3:12 PM, Jon Nelson wrote:
>>> I still think that having UNION do de-duplication of each contributory
>>> relation is a beneficial thing to consider -- especially if postgresql
>>> thinks the uniqueness is not very high.
>>
>> This might be worth a TODO.
>
> I don't believe there is any case where hashing each individual relation
> is a win compared to hashing them all together. If the optimizer were
> smart enough to be considering the situation as a whole, it would always
> do the latter.

How about cases when the individual relations are already sorted? That would mean they can be de-duplicated quickly and in a streaming manner. Even a partial sort order may help, because you would only need to de-duplicate within groups that share the sorted fields, which takes much less memory and is far more amenable to streaming. And if all the individual de-duplications are streaming and sorted the same way, you can simply do a merge on top.

Best regards, Vitalii Tymchyshyn.
Re: [PERFORM] plan question - query with order by and limit not choosing index depends on size of limit, table
Thanks Robert, this is what I was looking for. I will try these suggestions and follow up if any of them are the silver bullet.

On Fri, Jan 14, 2011 at 7:11 AM, Robert Haas wrote:
> On Thu, Jan 6, 2011 at 4:36 PM, Mike Broers wrote:
>> Thanks for the assistance.
>> Here is an explain analyze of the query with the problem limit:
>>
>> production=# explain analyze select * from landing_page.messages where
>> ((messages.topic = E'x') AND (messages.processed = 'f')) ORDER BY
>> messages.created_at ASC limit 10;
>>
>> QUERY PLAN
>> Limit (cost=0.00..2891.06 rows=10 width=1340) (actual time=207922.586..207922.586 rows=0 loops=1)
>>   -> Index Scan using idx_landing_page_messages_created_at on messages
>>      (cost=0.00..449560.48 rows=1555 width=1340) (actual time=207922.581..207922.581 rows=0 loops=1)
>>        Filter: ((NOT processed) AND ((topic)::text = 'x'::text))
>> Total runtime: 207949.413 ms
>> (4 rows)
>
> You're not the first person to have been bitten by this. The
> optimizer thinks that rows WHERE NOT processed AND topic = 'x' are
> reasonably common, so it figures that it can just index scan until it
> finds 10 of them. But when it turns out that there are none at all,
> it ends up having to scan the entire index, which stinks big-time.
>
> The alternative plan is to use a different index to find ALL the
> relevant rows, sort them, and then take the top 10. That would suck
> if there actually were tons of rows like this, but there aren't.
>
> So the root of the problem, in some sense, is that the planner's
> estimate of the selectivity of "NOT processed AND topic = 'x'" is not
> very good. Some things to try:
>
> - increase the statistics target for the "processed" and "topic"
>   columns even higher
> - put the processed rows in one table and the not processed rows in
>   another table
> - do something like SELECT * FROM (SELECT .. LIMIT 200 OFFSET 0) LIMIT 10
>   to try to fool the planner into planning based on the higher, inner
>   limit (see the sketch after this message)
> - create a partial index on messages (topic) WHERE NOT processed and
>   see if the planner will use it
>
> ...Robert
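A minimal sketch of the last two suggestions, assuming the table and column names quoted above (landing_page.messages with columns topic, processed, and created_at); the index name and subquery alias are made up for illustration:

    -- Partial index covering only the unprocessed rows, so a probe for
    -- "topic = 'x' AND NOT processed" touches a much smaller structure
    -- (appending created_at to the key is an optional variation):
    CREATE INDEX idx_messages_unprocessed_topic
        ON landing_page.messages (topic)
        WHERE NOT processed;

    -- The OFFSET 0 fence: the inner subquery is planned for up to 200
    -- rows and cannot be flattened into the outer query; the outer
    -- LIMIT then trims the result to 10:
    SELECT *
    FROM (SELECT *
          FROM landing_page.messages
          WHERE topic = E'x' AND NOT processed
          ORDER BY created_at ASC
          LIMIT 200 OFFSET 0) AS unprocessed
    ORDER BY created_at ASC
    LIMIT 10;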
Re: [PERFORM] The good, old times
Craig Ringer wrote:
> On 01/12/2011 10:16 PM, Guillaume Cottenceau wrote:
>> What's your point, and how is it related to this mailing list?
>
> Given the package names, I suspect this is a poorly-expressed complaint
> about the performance of downloads from the pgdg/psqlrpms site. If that
> was the original poster's intent, they would've been better served with
> a post that included some minimal details like:

Yes, it was a complaint about the download speed.

> - Information about their local connectivity
> - mtr --report / traceroute output
> - tests from other available hosts

As for the traceroute information, here it is:

traceroute yum.pgrpms.org
traceroute to yum.pgrpms.org (77.79.103.58), 30 hops max, 40 byte packets
 1  216.169.135.254 (216.169.135.254)  0.389 ms  0.404 ms  0.451 ms
 2  host189.131.26.216.vmsinfo.com (216.26.131.189)  9.355 ms  9.357 ms  9.368 ms
 3  v11.lc2.lou.peak10.net (216.26.190.10)  9.645 ms  9.645 ms  9.637 ms
 4  ge-7-41.car1.Cincinnati1.Level3.net (4.53.64.41)  13.002 ms  13.002 ms  13.018 ms
 5  ae-2-5.bar1.Cincinnati1.Level3.net (4.69.132.206)  13.101 ms  13.098 ms  13.087 ms
 6  ae-10-10.ebr2.Chicago1.Level3.net (4.69.136.214)  22.096 ms  21.358 ms  21.329 ms
 7  ae-1-100.ebr1.Chicago1.Level3.net (4.69.132.41)  27.729 ms  10.812 ms  24.132 ms
 8  ae-2-2.ebr2.NewYork2.Level3.net (4.69.132.66)  34.008 ms  33.960 ms  34.088 ms
 9  ae-1-100.ebr1.NewYork2.Level3.net (4.69.135.253)  34.152 ms  35.353 ms  37.068 ms
10  ae-4-4.ebr1.NewYork1.Level3.net (4.69.141.17)  36.998 ms  37.248 ms  36.986 ms
11  ae-43-43.ebr2.London1.Level3.net (4.69.137.73)  107.031 ms  ae-42-42.ebr2.London1.Level3.net (4.69.137.69)  104.624 ms  107.000 ms
12  ae-2-52.edge4.London1.Level3.net (4.69.139.106)  107.506 ms  106.993 ms  180.229 ms
13  (195.50.122.174)  168.849 ms  160.917 ms  161.713 ms
14  static.turktelekom.com.tr (212.156.103.42)  176.503 ms  179.012 ms  179.394 ms
15  gayrettepe-t3-1-gayrettepe-t2-1.turktelekom.com.tr (212.156.118.29)  167.867 ms  167.870 ms  167.862 ms
16  88.255.240.110 (88.255.240.110)  167.515 ms  168.172 ms  165.829 ms
17  ns1.gunduz.org (77.79.103.58)  171.574 ms !X * *
[mgogala@lpo-postgres-d01 ~]$

Are there any good mirrors? Apparently, there is something slow in the force.

> If that wasn't the original poster's intent, perhaps it'd be worth a
> second try to explain what they were *trying* to say? Was it just a
> joke - 'cos if so, it was kinda flat.
>
> --
> Craig Ringer

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions
Re: [PERFORM] plan question - query with order by and limit not choosing index depends on size of limit, table
On Fri, Jan 14, 2011 at 11:36 AM, Mike Broers wrote:
> Thanks Robert, this is what I was looking for. I will try these
> suggestions and follow up if any of them are the silver bullet.

No problem - and sorry for the off-list reply. I was a little sleepy when I wrote that; thanks for getting it back on-list.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [PERFORM] "SELECT .. WHERE NOT IN" query running for hours
2011/1/10 Mladen Gogala :
> Well, I really hoped that Bruce, Robert or Greg would take on this one,
> but since there are no more qualified takers, I'll take a shot at this
> one. For the "NOT IN (result of a correlated sub-query)", the sub-query
> needs to be executed for every row matching the conditions on the
> driving table, while the !EXISTS is just a complement of join. It's all
> in the basic set theory which serves as a model for the relational
> databases.

As Scott says, the real problem is the NULL handling. The semantics are otherwise similar.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
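A small illustration of the NULL-handling difference, using hypothetical tables big and little joined on an id column:

    -- NOT IN: if little.id contains even one NULL, "id NOT IN (...)"
    -- can never evaluate to true, so this returns no rows at all, and
    -- the planner must pick a plan that is safe under those semantics:
    SELECT * FROM big
    WHERE id NOT IN (SELECT id FROM little);

    -- NOT EXISTS: NULLs in little.id simply never match, so this
    -- behaves like an anti-join and can usually be planned as one:
    SELECT * FROM big b
    WHERE NOT EXISTS (SELECT 1 FROM little l WHERE l.id = b.id);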
Re: [PERFORM] Problems with FTS
On Tue, Jan 11, 2011 at 3:16 AM, Rauan Maemirov wrote:
> Hi, Kevin. Sorry for the long delay.
>
> EXPLAIN ANALYZE SELECT "v"."id", "v"."title" FROM "video" AS "v"
> WHERE (v.active) AND (v.fts @@
> 'dexter:A|season:A|seri:A|декстер:A|качество:A|сезон:A|серия:A'::tsquery
> and v.id <> 500563)
> ORDER BY COALESCE(ts_rank_cd('{0.1, 0.2, 0.7, 1.0}', v.fts,
> 'dexter:A|season:A|seri:A|декстер:A|качество:A|сезон:A|серия:A'::tsquery), 1) DESC,
> v.views DESC
> LIMIT 6
>
> "Limit (cost=103975.50..103975.52 rows=6 width=280) (actual time=2893.193..2893.199 rows=6 loops=1)"
> "  -> Sort (cost=103975.50..104206.07 rows=92228 width=280) (actual time=2893.189..2893.193 rows=6 loops=1)"
> "      Sort Key: (COALESCE(ts_rank_cd('{0.1,0.2,0.7,1}'::real[], fts, '( ( ( ( ( ''dexter'':A | ''season'':A ) | ''seri'':A ) | ''декстер'':A ) | ''качество'':A ) | ''сезон'':A ) | ''серия'':A'::tsquery), 1::real)), views"
> "      Sort Method: top-N heapsort Memory: 25kB"
> "      -> Seq Scan on video v (cost=0.00..102322.34 rows=92228 width=280) (actual time=0.100..2846.639 rows=54509 loops=1)"
> "          Filter: (active AND (fts @@ '( ( ( ( ( ''dexter'':A | ''season'':A ) | ''seri'':A ) | ''декстер'':A ) | ''качество'':A ) | ''сезон'':A ) | ''серия'':A'::tsquery) AND (id <> 500563))"
> "Total runtime: 2893.264 ms"
>
> Table schema:
>
> CREATE TABLE video
> (
>   id bigserial NOT NULL,
>   hash character varying(12),
>   account_id bigint NOT NULL,
>   category_id smallint NOT NULL,
>   converted boolean NOT NULL DEFAULT false,
>   active boolean NOT NULL DEFAULT true,
>   title character varying(255),
>   description text,
>   tags character varying(1000),
>   authorized boolean NOT NULL DEFAULT false,
>   adult boolean NOT NULL DEFAULT false,
>   views bigint DEFAULT 0,
>   rating real NOT NULL DEFAULT 0,
>   screen smallint DEFAULT 2,
>   duration smallint,
>   "type" smallint DEFAULT 0,
>   mp4 smallint NOT NULL DEFAULT 0,
>   size bigint,
>   size_high bigint DEFAULT 0,
>   source character varying(255),
>   storage_id smallint NOT NULL DEFAULT 1,
>   rule_watching smallint,
>   rule_commenting smallint,
>   count_comments integer NOT NULL DEFAULT 0,
>   count_likes integer NOT NULL DEFAULT 0,
>   count_faves integer NOT NULL DEFAULT 0,
>   fts tsvector,
>   modified timestamp without time zone NOT NULL DEFAULT now(),
>   created timestamp without time zone DEFAULT now(),
>   CONSTRAINT video_pkey PRIMARY KEY (id),
>   CONSTRAINT video_hash_key UNIQUE (hash)
> )
> WITH (
>   OIDS=FALSE
> );
>
> Indexes:
>
> CREATE INDEX idx_video_account_id ON video USING btree (account_id);
> CREATE INDEX idx_video_created ON video USING btree (created);
> CREATE INDEX idx_video_fts ON video USING gin (fts);
> CREATE INDEX idx_video_hash ON video USING hash (hash);
>
> (here I tried both gist and gin indexes)
>
> I have 32Gb of RAM and two quad-core E5520s, 2.27GHz (8Mb cache).
>
> Pgsql conf:
>
> max_connections = 200
> shared_buffers = 7680MB
> work_mem = 128MB
> maintenance_work_mem = 1GB
> effective_cache_size = 22GB
> default_statistics_target = 100
>
> Anything else?

For returning that many rows, an index scan might actually be slower. Maybe it's worth testing. Try:

SET enable_seqscan=off;
EXPLAIN ANALYZE ...

and see what you get. If it's slower, well, then be happy it didn't use the index (maybe the question is... what index should you have instead?). If it's faster, post the results...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
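A minimal version of the test Robert suggests, filling in the EXPLAIN ANALYZE target with the same query quoted above; enable_seqscan affects only the current session, and RESET restores the default:

    SET enable_seqscan = off;

    EXPLAIN ANALYZE
    SELECT v.id, v.title
    FROM video AS v
    WHERE v.active
      AND v.fts @@ 'dexter:A|season:A|seri:A|декстер:A|качество:A|сезон:A|серия:A'::tsquery
      AND v.id <> 500563
    ORDER BY COALESCE(ts_rank_cd('{0.1, 0.2, 0.7, 1.0}', v.fts,
          'dexter:A|season:A|seri:A|декстер:A|качество:A|сезон:A|серия:A'::tsquery), 1) DESC,
      v.views DESC
    LIMIT 6;

    -- Restore the planner setting afterwards:
    RESET enable_seqscan;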
[PERFORM] "COPY TO stdout" statements occurrence in log files
Hi there,

I have one log file per week and I am logging all statements with execution time >= 500 ms. But, mixed in with the "normal" statements, entries like this are showing up:

2011-01-13 00:11:38 BRT LOG: duration: 2469.000 ms statement: FETCH 1000 IN bloboid
2011-01-13 00:12:01 BRT LOG: duration: 797.000 ms statement: SELECT tableoid, oid, nspname, (SELECT rolname FROM pg_catalog.pg_roles WHERE oid = nspowner) as rolname, nspacl FROM pg_namespace
2011-01-13 00:12:06 BRT LOG: duration: 766.000 ms statement: COPY public.log (codlog, matricula, data, descricao, codcurso, ip) WITH OIDS TO stdout;
2011-01-13 00:12:10 BRT LOG: duration: 2328.000 ms statement: FETCH 1000 IN bloboid
2011-01-13 00:12:34 BRT LOG: duration: 594.000 ms statement: SELECT tableoid, oid, nspname, (SELECT rolname FROM pg_catalog.pg_roles WHERE oid = nspowner) as rolname, nspacl FROM pg_namespace
2011-01-13 00:12:38 BRT LOG: duration: 672.000 ms statement: COPY public.avaliacao_topico_opcao (codavaliacao_topico_opcao, codavaliacao_topico, descricao, selecao) WITH OIDS TO stdout;
2011-01-13 00:12:39 BRT LOG: duration: 891.000 ms statement: COPY public.log (codlog, matricula, data, descricao, codcurso, ip) WITH OIDS TO stdout;

Is this normal? I'm concerned because my application doesn't run this kind of statement, so how can I find out what is issuing these commands? Maybe pg_dump?

Thank you!
Fernando
Re: [PERFORM] queries with lots of UNIONed relations
On Thu, Jan 13, 2011 at 6:10 PM, Tom Lane wrote:
> Jon Nelson writes:
>> On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane wrote:
>>> If you have enough memory to de-dup them individually, you surely have
>>> enough to de-dup all at once.
>>
>> If everything were available up-front, sure.
>> However, and please correct me if I'm wrong, but doesn't postgresql
>> work in a fairly linear fashion, moving from table to table performing
>> a series of operations on each?
>
> Doing a single sort+uniq works like that. But the alternate plan you
> are proposing we should consider involves building all the lower
> hashtables, and then reading from them to fill the upper hashtable.
> Max memory consumption *is* worst case here. Remember HashAggregate
> is incapable of swapping to disk (and if it did, you wouldn't be nearly
> as pleased with its performance).

That's not exactly what I'm proposing - but that is probably due to a lack of understanding on my part of some of the underlying details of how postgresql works. I had assumed that the result of a HashAggregate or any other de-duplication process was a table-like structure.

Regarding being pleased with hash aggregate - I am! - except when it goes crazy and eats all of the memory in the machine. I'd trade a bit of performance loss for not using up all of the memory and crashing.

However, maybe I'm misunderstanding how SELECT DISTINCT works internally. In the case where a hashtable is used, does postgresql produce a table-like structure, or does it remain a hashtable in memory? If it's a hashtable, couldn't the hashtable be built on the fly rather than only after all of the underlying tuples are available? I'd love a bit more explanation as to how this works.

Another example of where this might be useful: I'm currently running a SELECT DISTINCT query over some 500 million rows (120 contributory tables). I expect a de-duplicated row count of well under 10% of that 500 million, probably below 1%. The plan as it stands is to execute a series of sequential scans, appending each of the rows from each contributory table and then aggregating them. If the expected distinctness of each contributory subquery is, say, 5%, then instead of aggregating over 500 million tuples the aggregation would take place over 25 million. In this case, that is a savings of approximately 10 gigabytes. Yes, it's true, the same amount of data has to be scanned. However, the amount of data that needs to be stored (in memory or on disk) in order to perform the final de-duplication is much smaller (see the sketch below).

--
Jon
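The manual rewrite under discussion can be sketched as follows, using two hypothetical contributory tables t1 and t2 with a common column col; de-duplicating each branch before the final UNION shrinks what the top-level de-duplication step has to hold:

    -- Plain UNION: all contributory rows feed one big de-dup step.
    SELECT col FROM t1
    UNION
    SELECT col FROM t2;

    -- Per-branch DISTINCT: each branch is de-duplicated first, so the
    -- final UNION only has to merge the much smaller distinct sets.
    SELECT DISTINCT col FROM t1
    UNION
    SELECT DISTINCT col FROM t2;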
Re: [PERFORM] "COPY TO stdout" statements occurrence in log files
> Is this normal? I'm concerned because my application doesn't run this
> kind of statement, so how can I find out what is issuing these commands?
> Maybe pg_dump?

I think pg_dump is likely, yes, if you have that scheduled. I don't think anything in the log file will identify it as pg_dump explicitly (I believe as far as the server is concerned, pg_dump is just another client), but if you're concerned about this, you can add the client pid (%p) to log_line_prefix in postgresql.conf, log the pg_dump pid through whatever mechanism manages that, and compare.

---
Maciek Sakrejda | System Architect | Truviso
1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
msakre...@truviso.com
www.truviso.com
[PERFORM] Best way to get the latest revision from a table
I'm using 8.3 and I have a table that contains many revisions of the same entity, and I have a query that is super slow. Please help! I'm going to paste in some SQL to set up test cases and some plans below. If that isn't the right way to post to this list, please let me know and I'll revise.

My table looks kind of like this, but wider:

CREATE TEMPORARY TABLE test (
  revision SERIAL NOT NULL PRIMARY KEY,
  a INTEGER NOT NULL,
  b INTEGER NOT NULL,
  c INTEGER NOT NULL);
INSERT INTO test (a, b, c)
  SELECT a, 1, 25
  FROM generate_series(1, 10) AS t1(a), generate_series(1, 10) as t2(b);
CREATE INDEX test_a ON test (a);
ANALYZE test;

I need to SELECT all the columns with the latest revision for a subset of As. What is the right way to do this quickly? When I do it like this:

CREATE TEMPORARY TABLE request (a INTEGER NOT NULL);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
ANALYZE request;
SELECT *
  FROM request
  JOIN (SELECT a, MAX(b) as b FROM test GROUP BY a) max USING (a)
  JOIN test USING (a, b);
DROP TABLE request;

the plan for the SELECT is pretty bad:

"Hash Join (cost=32792.50..77907.29 rows=62288 width=20) (actual time=769.570...050 rows=199 loops=1)"
"  Hash Cond: ((max(pg_temp_7.test.revision)) = pg_temp_7.test.revision)"
"  -> Hash Join (cost=5.48..38659.23 rows=62288 width=8) (actual time=20.621..830.235 rows=199 loops=1)"
"     Hash Cond: (pg_temp_7.test.a = request.a)"
"     -> GroupAggregate (cost=0.00..37170.11 rows=62601 width=8) (actual time=16.847..808.475 rows=10 loops=1)"
"        -> Index Scan using test_a on test (cost=0.00..31388.04 rows=12 width=8) (actual time=16.826..569.035 rows=100 loops=1)"
"     -> Hash (cost=2.99..2.99 rows=199 width=4) (actual time=3.736..3.736 rows=199 loops=1)"
"        -> Seq Scan on request (cost=0.00..2.99 rows=199 width=4) (actual time=3.658..3.689 rows=199 loops=1)"
"  -> Hash (cost=15405.12..15405.12 rows=12 width=16) (actual time=723.673..723.673 rows=100 loops=1)"
"     -> Seq Scan on test (cost=0.00..15405.12 rows=12 width=16) (actual time=0.006..290.313 rows=100 loops=1)"
"Total runtime: .267 ms"

If I instead issue the query as:

CREATE TEMPORARY TABLE request (a INTEGER NOT NULL, revision INTEGER);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
UPDATE request SET revision =
  (SELECT MAX(revision) FROM test WHERE request.a = test.a);
ANALYZE request;
SELECT * FROM request JOIN test USING (revision);
DROP TABLE request;

the whole thing runs tons faster. The UPDATE uses the right index and is way sub-second, and the SELECT's plan is fine:

"Merge Join (cost=11.66..76.09 rows=199 width=20) (actual time=0.131..0.953 rows=199 loops=1)"
"  Merge Cond: (test.revision = request.revision)"
"  -> Index Scan using test_pkey on test (cost=0.00..31388.04 rows=12 width=16) (actual time=0.017..0.407 rows=2001 loops=1)"
"  -> Sort (cost=11.59..12.09 rows=199 width=8) (actual time=0.102..0.133 rows=199 loops=1)"
"     Sort Key: request.revision"
"     Sort Method: quicksort Memory: 34kB"
"     -> Seq Scan on request (cost=0.00..3.99 rows=199 width=8) (actual time=0.020..0.050 rows=199 loops=1)"
"Total runtime: 1.005 ms"

Am I missing something, or is this really the best way to do this in 8.3?

Thanks for slogging through all this,
Nik Everett
Re: [PERFORM] "COPY TO stdout" statements occurrence in log files
msakre...@truviso.com (Maciek Sakrejda) writes:
>> Is this normal? I'm concerned because my application doesn't run this
>> kind of statement, so how can I find out what is issuing these
>> commands? Maybe pg_dump?
>
> I think pg_dump is likely, yes, if you have that scheduled. I don't
> think anything in the log file will identify it as pg_dump explicitly
> (I believe as far as the server is concerned, pg_dump is just another
> client), but if you're concerned about this, you can add the client
> pid (%p) to log_line_prefix in postgresql.conf, log the pg_dump pid
> through whatever mechanism manages that, and compare.

That's an option... More are possible...

1. Our DBAs have been known to create users specifically for doing backups ("dumpy"). It doesn't seem like a *huge* proliferation of users to have some 'utility' user names for common processes (a sketch follows below).

2. In 9.1, there will be a new answer, as there's a GUC to indicate the "application_name".

--
"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs, and the Universe trying
to produce bigger and better idiots. So far, the Universe is winning."
-- Rich Cook
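A minimal sketch of the dedicated-backup-user idea, using the hypothetical role name "dumpy" from above and a hypothetical database "mydb"; the GRANT ... ON ALL TABLES form assumes 9.0 or later (on older versions you would grant per table):

    -- A login role used only for backups:
    CREATE ROLE dumpy LOGIN PASSWORD 'secret';
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO dumpy;

Backups then run as that role, so their statements are identifiable in the logs by user name (%u in log_line_prefix):

    pg_dump -U dumpy mydb > mydb.sql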
Re: [PERFORM] Best way to get the latest revision from a table
Nikolas Everett wrote:

> Am I missing something or is this really the best way to do this in 8.3?

How about this?:

SELECT y.*
  from (select a, max(revision) as revision
          from test where a between 2 and 200
          group by a) x
  join test y using (a, revision);

-Kevin
Re: [PERFORM] Best way to get the latest revision from a table
On Fri, Jan 14, 2011 at 5:30 PM, Kevin Grittner wrote:
> SELECT y.*
>   from (select a, max(revision) as revision
>           from test where a between 2 and 200
>           group by a) x
>   join test y using (a, revision);

While certainly simpler than my temp table, this really just exposes a flaw in my example: I'm really going to be doing this with an arbitrary list of As.
Re: [PERFORM] "COPY TO stdout" statements occurrence in log files
On Fri, Jan 14, 2011 at 23:19, Chris Browne wrote:
> 2. In 9.1, there will be a new answer, as there's a GUC to indicate the
> "application_name".

Actually this was already introduced in PostgreSQL 9.0 :)

You can add application_name to your log_line_prefix with %a. For pg_dump it will display "pg_dump".

Regards,
Marti
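A minimal postgresql.conf sketch combining the two suggestions from this thread; %t (timestamp), %p (client pid), and %a (application name) are all documented log_line_prefix escapes:

    log_line_prefix = '%t [%p] %a '

With that in place, a statement issued by pg_dump would be logged with "pg_dump" in its prefix, along the lines of this hypothetical log line:

    2011-01-13 00:12:06 BRT [12345] pg_dump LOG: duration: 766.000 ms statement: COPY public.log (codlog, matricula, data, descricao, codcurso, ip) WITH OIDS TO stdout;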
Re: [PERFORM] Best way to get the latest revision from a table
Nikolas Everett wrote:

> I'm really going to be doing this with an arbitrary list of As.

OK, how about this?:

CREATE TEMPORARY TABLE request (a INTEGER NOT NULL);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
ANALYZE request;
SELECT y.*
  from (select a, max(revision) as revision
          from test join request using (a)
          group by a) x
  join test y using (a, revision);
DROP TABLE request;

-Kevin
Re: [PERFORM] Best way to get the latest revision from a table
On 01/14/2011 03:17 PM, Nikolas Everett wrote:

> SELECT *
>   FROM request
>   JOIN (SELECT a, MAX(b) as b FROM test GROUP BY a) max USING (a)
>   JOIN test USING (a, b);

This actually looks like a perfect candidate for DISTINCT ON.

SELECT DISTINCT ON (a, b) a, b, revision
  FROM test
 ORDER BY a, b DESC;

Maybe I'm just misunderstanding your situation, though.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
stho...@peak6.com
Re: [PERFORM] Best way to get the latest revision from a table
Shaun Thomas wrote:
> This actually looks like a perfect candidate for DISTINCT ON.
>
> SELECT DISTINCT ON (a, b) a, b, revision
>   FROM test
>  ORDER BY a, b DESC;

I wouldn't say perfect. It runs about eight times slower than what I suggested and returns a fairly random value for revision instead of the max(revision).

-Kevin
Re: [PERFORM] Best way to get the latest revision from a table
"Kevin Grittner" writes: > Shaun Thomas wrote: >> This actually looks like a perfect candidate for DISTINCT ON. >> >> SELECT DISTINCT ON (a, b) a, b, revision >> FROM test >> ORDER BY a, b DESC; > I wouldn't say perfect. It runs about eight times slower than what > I suggested and returns a fairly random value for revision instead > of the max(revision). Shaun's example is a bit off: normally, when using DISTINCT ON, you want an ORDER BY key that uses all the given DISTINCT keys and then some more. To get the max revision for each a/b combination it ought to be SELECT DISTINCT ON (a, b) a, b, revision FROM test ORDER BY a, b, revision DESC; As for speed, either one might be faster in a particular situation. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Best way to get the latest revision from a table
Tom Lane wrote:
> Shaun's example is a bit off

> As for speed, either one might be faster in a particular
> situation.

After fixing a mistake in my testing and learning from Tom's example, I generated queries against the OP's test data which produce identical results, and I'm finding no significant difference between run times for the two versions. The OP should definitely try both against the real tables. Here are the queries which run against the test set:

DROP TABLE IF EXISTS request;
CREATE TEMPORARY TABLE request (a INTEGER NOT NULL);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
ANALYZE request;
SELECT y.*
  from (select a, max(revision) as revision
          from test join request using (a)
          group by a) x
  join test y using (a, revision)
  order by a, revision DESC;

DROP TABLE IF EXISTS request;
CREATE TEMPORARY TABLE request (a INTEGER NOT NULL);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
ANALYZE request;
SELECT DISTINCT ON (a, b, c) revision, a, b, c
  FROM test join request using (a)
  ORDER BY a, b, c, revision DESC;

Sorry for not sorting it out better initially.

-Kevin
Re: [PERFORM] Best way to get the latest revision from a table
On Fri, Jan 14, 2011 at 7:59 PM, Kevin Grittner wrote:
> Tom Lane wrote:
>> Shaun's example is a bit off
>
>> As for speed, either one might be faster in a particular
>> situation.
>
> After fixing a mistake in my testing and learning from Tom's example,
> I generated queries against the OP's test data which produce
> identical results, and I'm finding no significant difference between
> run times for the two versions. The OP should definitely try both
> against the real tables.
>
> -Kevin

After trying both against the real tables, DISTINCT ON seems to be about two orders of magnitude faster than the other options. Thanks so much!

Nik Everett