Re: [PERFORM] Optimizing a VIEW
On Fri, 15 Aug 2008, Madison Kelly wrote:
> Below I will post the VIEW and a sample of the query's EXPLAIN ANALYZE.
> Thanks for any tips/help/clue-stick-beating you may be able to share!

This query looks incredibly expensive:

   SELECT ...
     FROM customer a,
          history.customer_data b, history.customer_data c,
          history.customer_data d, history.customer_data e,
          history.customer_data f, history.customer_data g,
          history.customer_data h, history.customer_data i,
          history.customer_data j, history.customer_data k,
          history.customer_data l
    WHERE a.cust_id = b.cd_cust_id
      AND a.cust_id = c.cd_cust_id
      AND a.cust_id = d.cd_cust_id
      AND a.cust_id = e.cd_cust_id
      AND a.cust_id = f.cd_cust_id
      AND a.cust_id = g.cd_cust_id
      AND a.cust_id = h.cd_cust_id
      AND a.cust_id = i.cd_cust_id
      AND a.cust_id = j.cd_cust_id
      AND a.cust_id = k.cd_cust_id
      AND a.cust_id = l.cd_cust_id
      AND ...

I would refactor this significantly, so that instead of returning one wide row, it returns more than one row per customer. Just do a single join between customer and history.customer_data - it will run much faster.

Matthew

--
Here we go - the Fairy Godmother redundancy proof.
                              -- Computer Science Lecturer
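For illustration, a minimal sketch of the tall-format query Matthew describes. Only cd_cust_id appears in the original post; the cd_name and cd_value columns are assumptions standing in for whatever attribute columns history.customer_data actually has:

   -- one row per customer attribute instead of eleven self-joins;
   -- cd_name/cd_value are hypothetical attribute columns
   SELECT a.cust_id, b.cd_name, b.cd_value
     FROM customer a
     JOIN history.customer_data b ON b.cd_cust_id = a.cust_id;

The application (or a later pivot step) can reassemble the wide per-customer view from these rows if it is still needed.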
[PERFORM] Cross Join Problem
Hi,

Following is the query:

   SELECT sum(id), sum(cd), sum(ad)
     FROM table1 a, table2 b CROSS JOIN table3 c
    WHERE a.nkey = b.key
      AND a.dkey = c.key
      AND c.date = '2008-02-01'
      AND b.id = 999;

We ran this on our production system, which is Postgres 8.1.3, and got the following EXPLAIN ANALYZE output:

   Aggregate  (cost=11045.52..11045.53 rows=1 width=24) (actual time=79.290..79.291 rows=1 loops=1)
     ->  Nested Loop  (cost=49.98..11043.42 rows=279 width=24) (actual time=1.729..50.498 rows=10473 loops=1)
           ->  Nested Loop  (cost=0.00..6.05 rows=1 width=8) (actual time=0.028..0.043 rows=1 loops=1)
                 ->  Index Scan using rnididx on table2 b  (cost=0.00..3.02 rows=1 width=4) (actual time=0.011..0.015 rows=1 loops=1)
                       Index Cond: (id = 999)
                 ->  Index Scan using rddtidx on table3 c  (cost=0.00..3.02 rows=1 width=4) (actual time=0.010..0.016 rows=1 loops=1)
                       Index Cond: (date = '2008-02-01 00:00:00'::timestamp without time zone)
           ->  Bitmap Heap Scan on table1 a  (cost=49.98..10954.93 rows=5496 width=32) (actual time=1.694..19.006 rows=10473 loops=1)
                 Recheck Cond: ((a.nkey = "outer"."key") AND (a.dkey = "outer"."key"))
                 ->  Bitmap Index Scan on rndateidx  (cost=0.00..49.98 rows=5496 width=0) (actual time=1.664..1.664 rows=10473 loops=1)
                       Index Cond: ((a.nkey = "outer"."key") AND (a.dkey = "outer"."key"))
   Total runtime: 79.397 ms
   Time: 80.752 ms

The same query on Postgres 8.3.3 gives this EXPLAIN ANALYZE output:

   Aggregate  (cost=1171996.35..1171996.36 rows=1 width=24) (actual time=6360.783..6360.785 rows=1 loops=1)
     ->  Nested Loop  (cost=0.00..1171994.28 rows=275 width=24) (actual time=3429.309..6330.424 rows=10473 loops=1)
           Join Filter: (a.nkey = b.key)
           ->  Index Scan using rnididx on table2 b  (cost=0.00..4.27 rows=1 width=4) (actual time=0.030..0.033 rows=1 loops=1)
                 Index Cond: (id = 999)
           ->  Nested Loop  (cost=0.00..1169411.17 rows=206308 width=28) (actual time=0.098..4818.450 rows=879480 loops=1)
                 ->  Index Scan using rddtidx on table1 c  (cost=0.00..4.27 rows=1 width=4) (actual time=0.031..0.034 rows=1 loops=1)
                       Index Cond: (date = '2008-02-01 00:00:00'::timestamp without time zone)
                 ->  Index Scan using rdnetidx on table1 a  (cost=0.00..1156050.51 rows=1068511 width=32) (actual time=0.047..1732.229 rows=879480 loops=1)
                       Index Cond: (a.dkey = c.key)
   Total runtime: 6360.978 ms

The query used to take only 80.752 ms on Postgres 8.1.3 and is now taking 6364.950 ms. We have run VACUUM ANALYZE on all the tables. Can anybody help out here? What may be wrong, and why does the query take so long on Postgres 8.3.3? Is this an 8.3.3 problem, or a problem with cross joins on 8.3.3?

Thanks
--
Regards
Gauri
Re: [PERFORM] Cross Join Problem
"Gauri Kanekar" <[EMAIL PROTECTED]> writes: > Following is the Query : > SELECT sum(id), sum(cd), sum(ad) >FROM table1 a , table2 b cross join table3 c >WHERE a.nkey = b.key > AND a.dkey = c.key > AND c.date = '2008-02-01' > AND b.id = 999 ; > We have fired this on our production system which is postgres 8.1.3, and got > the following explain analyse of it > Aggregate (cost=11045.52..11045.53 rows=1 width=24) (actual > time=79.290..79.291 rows=1 loops=1) >-> Nested Loop (cost=49.98..11043.42 rows=279 width=24) (actual > time=1.729..50.498 rows=10473 loops=1) > -> Nested Loop (cost=0.00..6.05 rows=1 width=8) (actual > time=0.028..0.043 rows=1 loops=1) >-> Index Scan using rnididx on table2 b (cost=0.00..3.02 > rows=1 width=4) (actual time=0.011..0.015 rows=1 loops=1) > Index Cond: (id = 999) >-> Index Scan using rddtidx on table3 c (cost=0.00..3.02 > rows=1 width=4) (actual time=0.010..0.016 rows=1 loops=1) > Index Cond: (date = '2008-02-01 00:00:00'::timestamp > without time zone) > -> Bitmap Heap Scan on table1 a (cost=49.98..10954.93 rows=5496 > width=32) (actual time=1.694..19.006 rows=10473 loops=1) >Recheck Cond: ((a.nkey = "outer"."key") AND (a.dkey = > "outer"."key")) >-> Bitmap Index Scan on rndateidx (cost=0.00..49.98 > rows=5496 width=0) (actual time=1.664..1.664 rows=10473 loops=1) > Index Cond: ((a.nkey = "outer"."key") AND (a.dkey = > "outer"."key")) > Total runtime: 79.397 ms No PG release since 7.3 would have voluntarily planned that query that way. Maybe you were using join_collapse_limit = 1 to force the join order? regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Slow query with a lot of data
Hi,

I run this query:

   select max(a."user"), b.category, count(1)
     from result a, domain_categories b
    where a."domain" = b."domain"
    group by b.category;

The table result contains all websites a user visited, and the table domain_categories contains all categories a domain is in. result has 20 million rows and domain_categories has about 12 million. There are 500,000 different users.

I have indexes on result.domain, domain_categories.domain, result.user and domain_categories.category, and I clustered result on user and domain_categories on domain.

EXPLAIN ANALYZE says (limited to one user with id 1337):

   HashAggregate  (cost=2441577.16..2441614.72 rows=2504 width=8) (actual time=94667.335..94671.508 rows=3361 loops=1)
     ->  Merge Join  (cost=2119158.02..2334105.00 rows=14329622 width=8) (actual time=63559.938..94621.557 rows=36308 loops=1)
           Merge Cond: (a.domain = b.domain)
           ->  Sort  (cost=395.52..405.49 rows=3985 width=8) (actual time=0.189..0.211 rows=19 loops=1)
                 Sort Key: a.domain
                 Sort Method: quicksort  Memory: 27kB
                 ->  Index Scan using result_user_idx on result a  (cost=0.00..157.21 rows=3985 width=8) (actual time=0.027..0.108 rows=61 loops=1)
                       Index Cond: ("user" = 1337)
           ->  Materialize  (cost=2118752.28..2270064.64 rows=12104989 width=8) (actual time=46460.599..82336.116 rows=12123161 loops=1)
                 ->  Sort  (cost=2118752.28..2149014.75 rows=12104989 width=8) (actual time=46460.592..59595.851 rows=12104989 loops=1)
                       Sort Key: b.domain
                       Sort Method: external sort  Disk: 283992kB
                       ->  Seq Scan on domain_categories b  (cost=0.00..198151.89 rows=12104989 width=8) (actual time=14.352..22572.869 rows=12104989 loops=1)
   Total runtime: 94817.058 ms

This is running on a pretty small server with 1 GB of RAM and a slow SATA disk. shared_buffers is 312 MB, max_fsm_pages = 153600; everything else is commented out. PostgreSQL v8.3.3 on Ubuntu 8.04.

It would be great if someone could help improve this query. This is for a research project at my university.

Thanks in advance,

Moritz
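The setup described corresponds roughly to the following DDL. This is a sketch for reference only: result_user_idx is the only index name visible in the plan above, the other names are assumptions, and CLUSTER ... USING is the 8.3 syntax:

   CREATE INDEX result_user_idx ON result ("user");
   CREATE INDEX result_domain_idx ON result ("domain");                          -- assumed name
   CREATE INDEX domain_categories_domain ON domain_categories ("domain");
   CREATE INDEX domain_categories_category_idx ON domain_categories (category);  -- assumed name

   CLUSTER result USING result_user_idx;
   CLUSTER domain_categories USING domain_categories_domain;
   ANALYZE result;              -- refresh planner statistics after clustering
   ANALYZE domain_categories;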
Re: [PERFORM] Slow query with a lot of data
On Mon, 18 Aug 2008, Moritz Onken wrote:
> I have indexes on result.domain, domain_categories.domain, result.user and
> domain_categories.category, and I clustered result on user and
> domain_categories on domain.

>           ->  Materialize  (cost=2118752.28..2270064.64 rows=12104989 width=8) (actual time=46460.599..82336.116 rows=12123161 loops=1)
>                 ->  Sort  (cost=2118752.28..2149014.75 rows=12104989 width=8) (actual time=46460.592..59595.851 rows=12104989 loops=1)
>                       Sort Key: b.domain
>                       Sort Method: external sort  Disk: 283992kB
>                       ->  Seq Scan on domain_categories b  (cost=0.00..198151.89 rows=12104989 width=8) (actual time=14.352..22572.869 rows=12104989 loops=1)

This is weird, given you say you have clustered domain_categories on domain. Have you analysed? You should be able to run:

   EXPLAIN SELECT * FROM domain_categories ORDER BY domain;

and have it say "Index Scan" instead of "Seq Scan" followed by a disc sort.

Matthew

--
Patron: "I am looking for a globe of the earth."
Librarian: "We have a table-top model over here."
Patron: "No, that's not good enough. Don't you have a life-size?"
Librarian: (pause) "Yes, but it's in use right now."
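Putting Matthew's check together with a statistics refresh (only commands already named in the thread; no assumptions beyond the table name):

   ANALYZE domain_categories;    -- make sure the planner sees the post-CLUSTER state
   EXPLAIN SELECT * FROM domain_categories ORDER BY domain;
   -- after clustering, the plan should open with an Index Scan on the
   -- domain index rather than a Seq Scan feeding an external sort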
Re: [PERFORM] Slow query with a lot of data
On 18.08.2008, at 16:30, Matthew Wakeling wrote:
> This is weird, given you say you have clustered domain_categories on
> domain. Have you analysed? You should be able to run:
>
>    EXPLAIN SELECT * FROM domain_categories ORDER BY domain;
>
> and have it say "Index Scan" instead of "Seq Scan" followed by a disc sort.

Thanks, the index was created, but I forgot to run ANALYZE again on that table.

There was also a small mistake in my previous SQL query. The corrected version is this (note the additional GROUP BY column):

   explain analyze
   select a."user", b.category, count(1)
     from result a, domain_categories b
    where a."domain" = b."domain"
      and a."user" = 1337
    group by a."user", b.category;

EXPLAIN ANALYZE:

   HashAggregate  (cost=817397.78..817428.92 rows=2491 width=8) (actual time=42874.339..42878.419 rows=3361 loops=1)
     ->  Merge Join  (cost=748.47..674365.50 rows=19070970 width=8) (actual time=15702.449..42829.388 rows=36308 loops=1)
           Merge Cond: (b.domain = a.domain)
           ->  Index Scan using domain_categories_domain on domain_categories b  (cost=0.00..391453.79 rows=12105014 width=8) (actual time=39.018..30166.349 rows=12104989 loops=1)
           ->  Sort  (cost=395.52..405.49 rows=3985 width=8) (actual time=0.188..32.345 rows=36309 loops=1)
                 Sort Key: a.domain
                 Sort Method: quicksort  Memory: 27kB
                 ->  Index Scan using result_user_idx on result a  (cost=0.00..157.21 rows=3985 width=8) (actual time=0.021..0.101 rows=61 loops=1)
                       Index Cond: ("user" = 1337)
   Total runtime: 42881.382 ms

This is still very slow...
Re: [PERFORM] Slow query with a lot of data
On Mon, 18 Aug 2008, Moritz Onken wrote:
> HashAggregate  (cost=817397.78..817428.92 rows=2491 width=8) (actual time=42874.339..42878.419 rows=3361 loops=1)
>   ->  Merge Join  (cost=748.47..674365.50 rows=19070970 width=8) (actual time=15702.449..42829.388 rows=36308 loops=1)
>         Merge Cond: (b.domain = a.domain)
>         ->  Index Scan using domain_categories_domain on domain_categories b  (cost=0.00..391453.79 rows=12105014 width=8) (actual time=39.018..30166.349 rows=12104989 loops=1)
>         ->  Sort  (cost=395.52..405.49 rows=3985 width=8) (actual time=0.188..32.345 rows=36309 loops=1)
>               Sort Key: a.domain
>               Sort Method: quicksort  Memory: 27kB
>               ->  Index Scan using result_user_idx on result a  (cost=0.00..157.21 rows=3985 width=8) (actual time=0.021..0.101 rows=61 loops=1)
>                     Index Cond: ("user" = 1337)
> Total runtime: 42881.382 ms
>
> This is still very slow...

Well, you're getting the database to read the entire contents of the domain_categories table in order. That's 12 million rows - a fair amount of work.

You may find that removing the "user = 1337" constraint doesn't make the query much slower - that's where you get a big win by clustering on domain. You might also want to cluster the result table on domain.

If you want the results for just one user, it would be very helpful to have a user column on the domain_categories table, and an index on that column. However, that will slow down the query for all users a little.

Matthew

--
For every complex problem, there is a solution that is simple, neat, and wrong.
                              -- H. L. Mencken
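A hedged sketch of the two follow-ups Matthew suggests. The index name result_domain_idx matches the DDL sketch earlier in the thread (an assumption), and the per-user column on domain_categories is purely hypothetical:

   -- cluster the result table on domain as well (8.3 syntax, assumed index name)
   CLUSTER result USING result_domain_idx;
   ANALYZE result;

   -- hypothetical single-user optimization, at some cost to all-user queries:
   -- ALTER TABLE domain_categories ADD COLUMN "user" integer;
   -- CREATE INDEX domain_categories_user_idx ON domain_categories ("user");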
Re: [PERFORM] Cross Join Problem
[ please keep the list cc'd for the archives' sake ]

"Gauri Kanekar" <[EMAIL PROTECTED]> writes:
> On Mon, Aug 18, 2008 at 7:32 PM, Tom Lane <[EMAIL PROTECTED]> wrote:
>> No PG release since 7.3 would have voluntarily planned that query that
>> way. Maybe you were using join_collapse_limit = 1 to force the join
>> order?

> Yes, we have join_collapse_limit set to 1.

Ah, so really your question is why join_collapse_limit isn't working as you expect. That code changed quite a bit in 8.2, and the way it works now is that the critical decision occurs while deciding whether to fold the cross join (a sub-problem of size 2) into the top-level join problem - a decision that is driven by from_collapse_limit, not join_collapse_limit.

So one way you could make it work is to reduce from_collapse_limit to less than 3, but I suspect you'd find that that has too many bad consequences for other queries. What's probably best is to write the problem query like this:

   FROM table1 a CROSS JOIN ( table2 b CROSS JOIN table3 c )

which will cause join_collapse_limit to be the relevant number at both steps.

			regards, tom lane
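Folding Tom's rewrite back into the original query from earlier in the thread gives roughly the following (a sketch assembled from the two posts, not a statement tested by either author):

   SELECT sum(id), sum(cd), sum(ad)
     FROM table1 a CROSS JOIN (table2 b CROSS JOIN table3 c)
    WHERE a.nkey = b.key
      AND a.dkey = c.key
      AND c.date = '2008-02-01'
      AND b.id = 999;

   -- With the inner cross join written as an explicit parenthesized JOIN,
   -- join_collapse_limit (rather than from_collapse_limit) governs the fold
   -- decision at both steps, so join_collapse_limit = 1 again forces the
   -- written join order.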
Re: [PERFORM] file system and raid performance
On Fri, Aug 15, 2008 at 12:22 PM, Bruce Momjian <[EMAIL PROTECTED]> wrote:
> Mark Wong wrote:
>> On Mon, Aug 4, 2008 at 10:04 PM, <[EMAIL PROTECTED]> wrote:
>>> On Mon, 4 Aug 2008, Mark Wong wrote:
>>>
>>>> Hi all,
>>>>
>>>> We've thrown together some results from simple i/o tests on Linux
>>>> comparing various file systems, hardware and software raid with a
>>>> little bit of volume management:
>>>>
>>>> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide
>
> Mark, very useful analysis. I am curious why you didn't test
> 'data=writeback' on ext3; 'data=writeback' is the recommended mount
> method for that file system, though I see that is not mentioned in our
> official documentation.

I have one set of results with ext3 data=writeback, and it appears that some of the write tests have less throughput than with data=ordered. For anyone who wants to look at the details:

   http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide

It's under the "Aggregate Bandwidth (MB/s) - RAID 5 (256KB stripe) - No partition table" table.

Regards,
Mark
Re: [PERFORM] Slow query with a lot of data
> Well, you're getting the database to read the entire contents of the
> domain_categories table in order. That's 12 million rows - a fair amount
> of work. You may find that removing the "user = 1337" constraint doesn't
> make the query much slower - that's where you get a big win by clustering
> on domain. You might also want to cluster the result table on domain.

Running the query for more than one user is indeed not much slower. That's what I need. I'm clustering the result table on domain right now. But why is this better than clustering it on "user"?

> If you want the results for just one user, it would be very helpful to
> have a user column on the domain_categories table, and an index on that
> column. However, that will slow down the query for all users a little.

A row in domain_categories can belong to more than one user, and I don't need to run this query for only one user anyway.

Thanks so far,
Re: [PERFORM] Slow query with a lot of data
On Mon, 18 Aug 2008, Moritz Onken wrote:
> Running the query for more than one user is indeed not much slower. That's
> what I need. I'm clustering the result table on domain right now. But why
> is this better than clustering it on "user"?

The reason is the way that the merge join algorithm works. It takes two tables and sorts them both by the join fields. Then it can stream through both tables, producing results as it goes. It's the best join algorithm, but it does require both tables to be sorted by the same thing, which is domain in this case. The aggregation on user happens after the join has been done, and the hash aggregate can accept the users in random order.

If you look at your last EXPLAIN, you can see that it has to sort the result table on domain, although it can read domain_categories in domain order thanks to the clustered index:

   HashAggregate
     ->  Merge Join
           Merge Cond: (b.domain = a.domain)
           ->  Index Scan using domain_categories_domain on domain_categories b
           ->  Sort
                 Sort Key: a.domain
                 Sort Method: quicksort  Memory: 27kB
                 ->  Index Scan using result_user_idx on result a
                       Index Cond: ("user" = 1337)

Without the user restriction and after re-clustering, this should become:

   HashAggregate
     ->  Merge Join
           Merge Cond: (b.domain = a.domain)
           ->  Index Scan using domain_categories_domain on domain_categories b
           ->  Index Scan using result_domain on result a

Matthew

--
Vacuums are nothings. We only mention them to let them know we know they're there.
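For completeness, a sketch of the unrestricted query that the second plan shape corresponds to - the same tables and columns as above, with the join written explicitly (this exact statement does not appear in the thread):

   explain analyze
   select a."user", b.category, count(1)
     from result a
     join domain_categories b on a."domain" = b."domain"
    group by a."user", b.category;

   -- with both tables clustered (and analyzed) on domain, the merge join can
   -- read each input through its domain index, so no explicit Sort is needed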