Re: [PERFORM] query optimization
Total runtime: 258.245 ms
(13 rows)

krichy=> ANALYZE c;
ANALYZE
krichy=> EXPLAIN ANALYZE SELECT * FROM a LEFT JOIN (b INNER JOIN c ON b.id = c.id) ON a.b = b.id WHERE a.id = 1;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=2651.29..6419.30 rows=1 width=16) (actual time=83.295..255.653 rows=1 loops=1)
   Hash Cond: (b.id = a.b)
   ->  Hash Join  (cost=2643.00..6036.00 rows=10 width=8) (actual time=83.275..222.373 rows=10 loops=1)
         Hash Cond: (b.id = c.id)
         ->  Seq Scan on b  (cost=0.00..1393.00 rows=10 width=4) (actual time=0.010..35.726 rows=10 loops=1)
         ->  Hash  (cost=1393.00..1393.00 rows=10 width=4) (actual time=83.238..83.238 rows=10 loops=1)
               Buckets: 16384  Batches: 1  Memory Usage: 2344kB
               ->  Seq Scan on c  (cost=0.00..1393.00 rows=10 width=4) (actual time=0.009..36.243 rows=10 loops=1)
   ->  Hash  (cost=8.28..8.28 rows=1 width=8) (actual time=0.011..0.011 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         ->  Index Scan using a_idx1 on a  (cost=0.00..8.28 rows=1 width=8) (actual time=0.008..0.009 rows=1 loops=1)
               Index Cond: (id = 1)
 Total runtime: 256.008 ms
(13 rows)

So after analyzing all the tables the result changed a lot: PostgreSQL now chooses different plans, and in my case they are effectively much slower. My configuration file has:

work_mem = 16MB

With this removed, the query is fast again, but I don't know why giving the planner more memory makes it choose a worse plan.

Thanks in advance,

Kojedzinszky Richard
Euronet Magyarorszag Informatikai Zrt.

On Thu, 26 Apr 2012, Tom Lane wrote:

Date: Thu, 26 Apr 2012 15:17:18 -0400
From: Tom Lane t...@sss.pgh.pa.us
To: Richard Kojedzinszky kri...@tvnetwork.hu
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] query optimization

> Um ... did you analyze all the tables, or just some of them? I get
> sub-millisecond runtimes if all four tables have been analyzed, but it
> does seem to pick lousy plans if, say, only a and b have been analyzed.
> What you really need for this query structure is the parameterized-path
> work I've been doing for 9.2; but at least on the exact example given,
> I'm not seeing that 9.1 is that much worse.
>
> regards, tom lane

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
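Richard's work_mem observation can be reproduced per-session without editing postgresql.conf. A hedged sketch against the thread's a/b/c tables (the exact crossover point where the planner switches plans depends on the data and the machine):

```sql
-- Compare plans under different work_mem settings in one session.
-- With a larger work_mem, a single-batch hash join looks cheap to the
-- planner; with a smaller one it tends to fall back to the merge or
-- nested-loop plans that were fast in this thread.
SET work_mem = '16MB';
EXPLAIN ANALYZE
SELECT * FROM a
LEFT JOIN (b INNER JOIN c ON b.id = c.id) ON a.b = b.id
WHERE a.id = 1;

SET work_mem = '1MB';   -- closer to the default
EXPLAIN ANALYZE
SELECT * FROM a
LEFT JOIN (b INNER JOIN c ON b.id = c.id) ON a.b = b.id
WHERE a.id = 1;

RESET work_mem;
```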
[PERFORM] query optimization
Dear list,

We have a database schema which looks the same as the attached script. When filling the tables with data and skipping ANALYZE on the table (so pg_stats contains no records for table 'a'), the first select in the script runs fast, but after an ANALYZE the planner decides to sequential-scan tables b and c, making the query much slower. Can somebody help me solve this issue, or tune our installation so it does not use sequential scans in this case?

Thanks in advance,

Kojedzinszky Richard
Euronet Magyarorszag Informatikai Zrt.

test.sql Description: application/sql
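The attached test.sql is not reproduced in the archive. Judging from the queries and index names quoted later in the thread, a minimal schema that reproduces the behaviour presumably looks something like the following; the column types and row counts here are guesses, not the actual script:

```sql
-- Hypothetical reconstruction of test.sql: three tables joined on id,
-- each with a btree index, populated with enough rows that the planner's
-- choice between index and sequential scans matters.
CREATE TABLE a (id integer, b integer);
CREATE TABLE b (id integer);
CREATE TABLE c (id integer);

CREATE INDEX a_idx1 ON a (id);
CREATE INDEX b_idx1 ON b (id);
CREATE INDEX c_idx1 ON c (id);

INSERT INTO a SELECT i, i FROM generate_series(1, 100000) i;
INSERT INTO b SELECT i    FROM generate_series(1, 100000) i;
INSERT INTO c SELECT i    FROM generate_series(1, 100000) i;

-- The query discussed in the thread:
SELECT * FROM a
LEFT JOIN (b INNER JOIN c ON b.id = c.id) ON a.b = b.id
WHERE a.id = 1;
```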
Re: [PERFORM] query optimization
Richard Kojedzinszky kri...@tvnetwork.hu wrote:
> tuning our installation to not to use sequence scans in this case?

Make sure effective_cache_size is set to the sum of shared_buffers and whatever your OS shows as usable for caching. Try adjusting cost factors: maybe random_page_cost between 1 and 2, and cpu_tuple_cost between 0.03 and 0.05.

-Kevin
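Kevin's suggestions as a session-level sketch. The concrete values here are illustrative assumptions, not tuned recommendations; effective_cache_size in particular should approximate shared_buffers plus the OS filesystem cache on the machine in question:

```sql
-- Planner cost knobs can be tried per-session before committing them
-- to postgresql.conf; lower random_page_cost makes index scans look
-- cheaper relative to sequential scans.
SET effective_cache_size = '4GB';   -- shared_buffers + OS cache (assumed)
SET random_page_cost = 1.5;         -- between 1 and 2, per the suggestion
SET cpu_tuple_cost = 0.04;          -- between 0.03 and 0.05

EXPLAIN ANALYZE
SELECT * FROM a
LEFT JOIN (b INNER JOIN c ON b.id = c.id) ON a.b = b.id
WHERE a.id = 1;
```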
Re: [PERFORM] query optimization
Richard Kojedzinszky kri...@tvnetwork.hu writes:
> Dear list, We have a database schema, which looks the same as the attached
> script. When filling the tables with data, and skipping analyze on the
> table (so pg_stats contains no records for table 'a'), the first select in
> the script runs fast, but after an analyze the planner decides to sequence
> scan tables b and c, thus making the query much slower. Can somebody help
> me solving this issue, or tuning our installation to not to use sequence
> scans in this case?

Um ... did you analyze all the tables, or just some of them? I get sub-millisecond runtimes if all four tables have been analyzed, but it does seem to pick lousy plans if, say, only a and b have been analyzed.

What you really need for this query structure is the parameterized-path work I've been doing for 9.2; but at least on the exact example given, I'm not seeing that 9.1 is that much worse.

regards, tom lane
Re: [PERFORM] query optimization
Tom Lane wrote on 26.04.2012 21:17:
> Richard Kojedzinszky kri...@tvnetwork.hu writes:
>> Dear list, We have a database schema, which looks the same as the
>> attached script. [...]
> Um ... did you analyze all the tables, or just some of them? I get
> sub-millisecond runtimes if all four tables have been analyzed, but it
> does seem to pick lousy plans if, say, only a and b have been analyzed.

Here it's similar to Richard's experience. Before analyzing the four tables, the first statement yields this plan:

 Merge Left Join  (cost=504.89..2634509.91 rows=12500 width=16) (actual time=0.103..0.108 rows=1 loops=1)
   Merge Cond: (a.b = b.id)
   ->  Sort  (cost=504.89..506.14 rows=500 width=8) (actual time=0.043..0.043 rows=1 loops=1)
         Sort Key: a.b
         Sort Method: quicksort  Memory: 17kB
         ->  Bitmap Heap Scan on a  (cost=12.14..482.47 rows=500 width=8) (actual time=0.028..0.029 rows=1 loops=1)
               Recheck Cond: (id = 4)
               ->  Bitmap Index Scan on a_idx1  (cost=0.00..12.01 rows=500 width=0) (actual time=0.021..0.021 rows=1 loops=1)
                     Index Cond: (id = 4)
   ->  Materialize  (cost=0.00..884002.52 rows=5000 width=8) (actual time=0.041..0.057 rows=5 loops=1)
         ->  Merge Join  (cost=0.00..759002.52 rows=5000 width=8) (actual time=0.037..0.051 rows=5 loops=1)
               Merge Cond: (b.id = c.id)
               ->  Index Scan using b_idx1 on b  (cost=0.00..4376.26 rows=10 width=4) (actual time=0.016..0.018 rows=5 loops=1)
               ->  Materialize  (cost=0.00..4626.26 rows=10 width=4) (actual time=0.017..0.022 rows=5 loops=1)
                     ->  Index Scan using c_idx1 on c  (cost=0.00..4376.26 rows=10 width=4) (actual time=0.014..0.017 rows=5 loops=1)
 Total runtime: 0.209 ms

This continues to be the plan for about 10-15 repetitions, then it turns to this one:

 Hash Right Join  (cost=2701.29..6519.30 rows=1 width=16) (actual time=79.604..299.227 rows=1 loops=1)
   Hash Cond: (b.id = a.b)
   ->  Hash Join  (cost=2693.00..6136.00 rows=10 width=8) (actual time=79.550..265.251 rows=10 loops=1)
         Hash Cond: (b.id = c.id)
         ->  Seq Scan on b  (cost=0.00..1443.00 rows=10 width=4) (actual time=0.011..36.158 rows=10 loops=1)
         ->  Hash  (cost=1443.00..1443.00 rows=10 width=4) (actual time=79.461..79.461 rows=10 loops=1)
               Buckets: 16384  Batches: 1  Memory Usage: 2735kB
               ->  Seq Scan on c  (cost=0.00..1443.00 rows=10 width=4) (actual time=0.010..32.930 rows=10 loops=1)
   ->  Hash  (cost=8.28..8.28 rows=1 width=8) (actual time=0.015..0.015 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         ->  Index Scan using a_idx1 on a  (cost=0.00..8.28 rows=1 width=8) (actual time=0.010..0.012 rows=1 loops=1)
               Index Cond: (id = 4)
 Total runtime: 299.564 ms

(I guess autovacuum kicked in, because that is the same plan I get when running ANALYZE on all four tables right after populating them.)

And the second one yields this one here (regardless of analyze or not):

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..16.89 rows=1 width=16) (actual time=0.027..0.031 rows=1 loops=1)
   ->  Nested Loop Left Join  (cost=0.00..16.57 rows=1 width=12) (actual time=0.020..0.022 rows=1 loops=1)
         ->  Index Scan using a_idx1 on a  (cost=0.00..8.28 rows=1 width=8) (actual time=0.011..0.012 rows=1 loops=1)
               Index Cond: (id = 4)
         ->  Index Scan using b_idx1 on b  (cost=0.00..8.28 rows=1 width=4) (actual time=0.004..0.005 rows=1 loops=1)
               Index Cond: (a.b = id)
   ->  Index Scan using c_idx1 on c  (cost=0.00..0.31 rows=1 width=4) (actual time=0.004..0.005 rows=1 loops=1)
         Index Cond: (b.id = id)
 Total runtime: 0.104 ms

My version: PostgreSQL 9.1.3, compiled by Visual C++ build 1500, 32-bit, running on Windows XP SP3

shared_buffers = 768MB
work_mem = 24MB
effective_cache_size = 1024MB

All other (relevant) settings are at their defaults.

Regards
Thomas
Re: [PERFORM] query optimization
Thomas Kellerer spam_ea...@gmx.net writes:
> Tom Lane wrote on 26.04.2012 21:17:
>> Um ... did you analyze all the tables, or just some of them? I get
>> sub-millisecond runtimes if all four tables have been analyzed, but it
>> does seem to pick lousy plans if, say, only a and b have been analyzed.
> Here it's similar to Richard's experience: Before analyzing the four
> tables, the first statement yields this plan: [ merge joins ]
> This continues to stay the plan for about 10-15 repetitions, then it
> turns to this plan [ hash joins ]

Hmm. I see it liking the merge-join plan (with minor variations) with or without analyze data, but if just some of the tables have been analyzed, it goes for the hash plan, which is a good deal slower. The cost estimates aren't that far apart, though. In any case, the only reason the merge join is so fast is that the data is perfectly ordered in each table; on a less contrived example, it could well be a lot slower.

> And the second one yields this one here (Regardless of analyze or not):

Yeah, the trick there is that it's valid to re-order the joins, since they're both left joins.
In git HEAD I get something like this:

regression=# explain analyze select * from a left join (b inner join c on b.id = c.id) on a.b = b.id where a.id = 4;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..17.18 rows=1 width=16) (actual time=0.024..0.026 rows=1 loops=1)
   ->  Index Scan using a_idx1 on a  (cost=0.00..8.38 rows=1 width=8) (actual time=0.010..0.010 rows=1 loops=1)
         Index Cond: (id = 4)
   ->  Nested Loop  (cost=0.00..8.80 rows=1 width=8) (actual time=0.011..0.012 rows=1 loops=1)
         ->  Index Only Scan using b_idx1 on b  (cost=0.00..8.38 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=1)
               Index Cond: (id = a.b)
               Heap Fetches: 1
         ->  Index Only Scan using c_idx1 on c  (cost=0.00..0.41 rows=1 width=4) (actual time=0.004..0.005 rows=1 loops=1)
               Index Cond: (id = b.id)
               Heap Fetches: 1
 Total runtime: 0.080 ms
(11 rows)

but 9.1 and older are not smart enough to do it like that when they can't re-order the joins.

regards, tom lane
Re: [PERFORM] query optimization
On 04/26/2012 04:08 PM, Tom Lane wrote:
> Hmm. I see it liking the merge-join plan (with minor variations) with or
> without analyze data, but if just some of the tables have been analyzed,
> it goes for the hash plan which is a good deal slower. The cost estimates
> aren't that far apart though. In any case, the only reason the merge join
> is so fast is that the data is perfectly ordered in each table; on a less
> contrived example, it could well be a lot slower.

It's not so terribly contrived, is it? It's common enough to have tables which are append-only and to join them by something that corresponds to the append order (serial field, timestamp, etc.)

cheers

andrew
Re: [PERFORM] Query optimization using order by and limit
Stephen Frost sfr...@snowman.net writes:
> What I think is happening here is that PG is pushing down that filter
> (not typically a bad thing..), but with that condition, it's going to
> scan the index until it finds a match for that filter before returning
> back up only to have that result cut out due to the limit.

Yeah, it's spending quite a lot of time finding the first matching row in each child table. I'm curious why that is though; are the child tables not set up with nonoverlapping firstloadtime ranges?

> What would be great is if PG would realize that the CHECK constraints
> prevent earlier records from being in these earlier tables,

The explain shows that that isn't the case, because it *is* finding at least one candidate row in each table. It's just running quite far into the firstloadtime sequence to do it.

If you're stuck with this table arrangement, one thing that would help is a two-column index on (host_guid, firstloadtime) on each child table. That would match the search condition exactly, and so reduce the cost to find the first matching row to nearly nil. Whether this query's speed is important enough to justify maintaining such an index is a question I can't answer for you.

regards, tom lane
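Tom's suggested two-column index, written out against one of the child tables named later in the thread (one such index would be needed per child table):

```sql
-- Matches the "WHERE host_guid = ... ORDER BY firstloadtime DESC" shape
-- of the query exactly, so the first qualifying row in this child table
-- can be found directly from the index rather than by walking the
-- single-column firstloadtime index and filtering.
CREATE INDEX idx_osmoduleloads_2011_09_14_host_guid_firstloadtime
    ON osmoduleloads_2011_09_14 (host_guid, firstloadtime);
```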
Re: [PERFORM] Query optimization using order by and limit
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Yeah, it's spending quite a lot of time finding the first matching row
> in each child table. I'm curious why that is though; are the child
> tables not set up with nonoverlapping firstloadtime ranges?

They are set up w/ nonoverlapping firstloadtime ranges, using CHECK constraints such as:

osmoduleloads_2011_09_14_firstloadtime_check CHECK (firstloadtime >= 1296044640::bigint::numeric AND firstloadtime < 1296053280::bigint::numeric)

The issue here is that the query is saying "Give me the first 150 records with this host_id in this week-long range." PG happily eliminates all the tables that are outside of the week-long range during constraint exclusion. After that, however, it hunts down the earliest records (which match 'host_id') from each child table. Sure, from each table there's a record in the week-long range with a host_id that matches. What PG doesn't realize is that it can stop after pulling the 150 records from the most recent table (and flipping the direction of the query or the tables doesn't help: PG still pulls a record from each table).

> > What would be great is if PG would realize that the CHECK constraints
> > prevent earlier records from being in these earlier tables,
> The explain shows that that isn't the case, because it *is* finding at
> least one candidate row in each table. It's just running quite far into
> the firstloadtime sequence to do it.

My point above is that the CHECK constraints ensure an ordering which could be leveraged to use the latest table first and then stop if enough tuples are returned (or immediately go to the next table), without ever considering the other tables. I'm not looking for PG to eliminate those other tables from consideration in all cases; if the limit is large enough, it may get all the way down to them.

I'm pretty sure this isn't something which PG does today and I don't expect teaching it to do this to be trivial, but it certainly would be nice, as this strikes me as a very common use-case.

> If you're stuck with this table arrangement, one thing that would help
> is a two-column index on (host_guid, firstloadtime) on each child table.

Agreed, I mentioned this to the OP previously and it's on his list of things to try.

Thanks,

Stephen
Re: [PERFORM] Query optimization using order by and limit
Stephen Frost sfr...@snowman.net writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Yeah, it's spending quite a lot of time finding the first matching row
>> in each child table. I'm curious why that is though; are the child
>> tables not set up with nonoverlapping firstloadtime ranges?
> The issue here is that the query is saying "Give me the first 150
> records with this host_id in this week-long range."

Oh, I see. So the query range overlaps multiple child tables, even after constraint exclusion eliminates a lot of them.

> My point above is that the CHECK constraints ensure an ordering which
> could be leveraged to use the latest table first and then stop if enough
> tuples are returned (or immediately go to the next table), without ever
> considering the other tables.

Yeah. My opinion is that trying to reverse-engineer that from the CHECK constraints would cost a lot more than it's worth. What we need, and will hopefully have sooner or later, is an abstract concept of partitioned table in which this kind of relationship is known a-priori instead of having to be laboriously re-deduced every time we plan a query.

>> If you're stuck with this table arrangement, one thing that would help
>> is a two-column index on (host_guid, firstloadtime) on each child table.
> Agreed, I mentioned this to the OP previously and it's on his list of
> things to try.

AFAICS the fact that this example would be fast if we were only paying attention to the newest table is mere luck. If it can take a long time to find the first matching host_guid record in several of the child tables, why might it not take just as long to find said record in the other one? I think you really need the two-column indexes, if keeping this query's runtime to a minimum is critical.

regards, tom lane
Re: [PERFORM] Query optimization using order by and limit
On Wed, Sep 21, 2011 at 11:22:53PM -0400, Tom Lane wrote:
> Michael Viscuso michael.visc...@getcarbonblack.com writes:
>> Greg/Tom, you are correct, these columns should be modified to whatever
>> is easiest for Postgres to recognize 64-bit unsigned integers. Would
>> you still recommend bigint for unsigned integers? I likely read the
>> wrong documentation that suggested bigint for signed 64-bit integers
>> and numeric(20) for unsigned 64-bit integers.
> Unsigned? Oh, hm, that's a bit of a problem because we don't have any
> unsigned types. If you really need to go to 2^64 and not 2^63 then
> you're stuck with numeric ... but that last bit is costing ya a lot.
>
> regards, tom lane

Hi Michael,

If you have access to the application, you can map the unsigned 64-bit values to the PostgreSQL signed 64-bit type with a simple subtraction. That will allow you to drop all the numeric use. Also, if the guid is a 64-bit value stuffed into a numeric(20), you can do it there as well. I achieved a hefty performance boost by making those application-level changes in a similar situation.

Regards,
Ken
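Ken's subtraction trick, sketched in SQL: shifting an unsigned 64-bit value down by 2^63 maps the full unsigned range onto bigint while preserving order, so comparisons, sorts and range constraints keep working; only the stored values are offset.

```sql
-- Subtracting 2^63 maps [0, 2^64) onto [-2^63, 2^63), i.e. exactly the
-- bigint range, monotonically. The largest unsigned 64-bit value maps
-- to the largest bigint:
SELECT (18446744073709551615::numeric - 9223372036854775808::numeric)::bigint;
-- => 9223372036854775807

-- And zero maps to the smallest bigint:
SELECT (0::numeric - 9223372036854775808::numeric)::bigint;
-- => -9223372036854775808
```

The application applies the same offset when writing and reading, so no precision is lost and the numeric type (with its per-row overhead) can be dropped entirely.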
Re: [PERFORM] Query optimization using order by and limit
Thanks Ken,

I'm discussing with my coworker how to best make that change *as we speak*. Do you think this will also resolve the original issue I'm seeing, where the query doesn't limit out properly and spends time in child tables that won't yield any results? I was hoping that by using the check constraints I could query over a week's or month's worth of partitioned tables, and the combination of order by and limit would eliminate any time spent searching unnecessary tables, but that doesn't appear to be true. (I'm still very new to high-end Postgres performance, so I could be mistaken.)

Regardless, in the meantime I'll switch those columns to bigint instead of numeric and have an update as soon as possible. Thanks for your help!

Mike

On 9/22/2011 9:41 AM, k...@rice.edu wrote:
> If you have access to the application, you can map the unsigned 64-bits
> to the PostgreSQL signed 64-bit type with a simple subtraction. That
> will allow you to drop all the numeric use. [...]
Re: [PERFORM] Query optimization using order by and limit
* Michael Viscuso (michael.visc...@getcarbonblack.com) wrote:
> Adding the final condition hosts_guid = '2007075705813916178' is what
> ultimately kills it http://explain.depesz.com/s/8zy. By adding the
> host_guid, it spends considerably more time in the older tables than
> without this condition and I'm not sure why.

What I think is happening here is that PG is pushing down that filter (not typically a bad thing..), but with that condition, it's going to scan the index until it finds a match for that filter before returning back up, only to have that result cut out due to the limit. Having it as numerics isn't helping here, but the bigger issue is having to check all those tuples for a match to the filter. Mike, the filter has to be applied before the order by/limit, since those clauses come after the filter has been applied (you wouldn't want a 'where x = 2 limit 10' to return early just because it found 10 records where x didn't equal 2).

What would be great is if PG would realize that the CHECK constraints prevent earlier records from being in these earlier tables, so it shouldn't need to consider them at all once the records from the 'latest' table have been found and the limit reached (reverse all this for an 'ascending' query, of course), which we can do when there's no order by. I don't believe we have that kind of logic or that information available at this late stage: the CHECK constraints are used to eliminate the impossible-to-match tables, but that's it.

One option, which isn't great of course, would be to implement your own 'nested loop' construct (something I typically despise..) in the application, which just walks backwards from the latest and pulls whatever records it can from each day and then stops once it hits the limit.

Thanks,

Stephen
Re: [PERFORM] Query optimization using order by and limit
Stephen,

I spent the better part of the day implementing an application-layer nested loop and it seems to be working well. Of course it's a little slower than a Postgres-only solution, because it has to pass data back and forth for each daily table query until it reaches the limit, but at least I don't have runaway queries like I was seeing before. That should be a pretty good stopgap solution for the time being.

I was really hoping there was a Postgres-exclusive answer though! :) If there are any other suggestions, it's a simple flag in my application to query the other way again...

Thanks for all your help - and I'm still looking to change those numerics to bigints, just haven't figured out the best way yet.

Mike

On 9/22/2011 10:53 AM, Stephen Frost wrote:
> One option, which isn't great of course, would be to implement your own
> 'nested loop' construct (something I typically despise..) in the
> application which just walks backwards from the latest and pulls
> whatever records it can from each day and then stops once it hits the
> limit. [...]
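The per-day loop Michael describes can also be approximated in a single SQL statement, at the price of naming the child tables explicitly. A hedged sketch: the child-table names here are assumed from the thread's naming scheme, and the host_guid and limit of 251 come from the query quoted elsewhere in the thread.

```sql
-- Each branch carries its own ORDER BY ... LIMIT, so each child table can
-- stop early on its (host_guid, firstloadtime) index; the outer sort then
-- merges at most 2 * 251 rows instead of scanning every partition.
SELECT * FROM (
    (SELECT * FROM osmoduleloads_2011_09_15          -- hypothetical child table
      WHERE host_guid = '2007075705813916178'
      ORDER BY firstloadtime DESC LIMIT 251)
    UNION ALL
    (SELECT * FROM osmoduleloads_2011_09_14
      WHERE host_guid = '2007075705813916178'
      ORDER BY firstloadtime DESC LIMIT 251)
) merged
ORDER BY firstloadtime DESC
LIMIT 251;
```

The obvious drawback, as with the application-level loop, is that the list of child tables has to be maintained by hand (or generated by the application) as new daily partitions are created.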
Re: [PERFORM] Query optimization using order by and limit
Mike,

* Michael Viscuso (michael.visc...@getcarbonblack.com) wrote:
> I spent the better part of the day implementing an application layer
> nested loop and it seems to be working well. Of course it's a little
> slower than a Postgres only solution because it has to pass data back
> and forth for each daily table query until it reaches the limit, but at
> least I don't have runaway queries like I was seeing before. That
> should be a pretty good stopgap solution for the time being.

Glad to hear that you were able to get something going which worked for you.

> I was really hoping there was a Postgres exclusive answer though! :) If
> there are any other suggestions, it's a simple flag in my application to
> query the other way again...

I continue to wonder if some combination of multi-column indexes might have made the task of finding the 'lowest' record from each of the tables fast enough that it wouldn't be an issue.

> Thanks for all your help - and I'm still looking to change those
> numerics to bigints, just haven't figured out the best way yet.

Our timestamps are also implemented using 64-bit integers and would allow you to use all the PG date/time functions and operators. Just a thought.

Thanks,

Stephen
Re: [PERFORM] Query optimization using order by and limit
Stephen,

Yes, I couldn't agree more. The next two things I will be looking at very carefully are the timestamps and indexes. I will reply to this post if either dramatically helps. Thanks again for all your help. My eyes were starting to bleed from staring at explain logs!

Mike

On Thu, Sep 22, 2011 at 7:14 PM, Stephen Frost sfr...@snowman.net wrote:
> I continue to wonder if some combination of multi-column indexes might
> have made the task of finding the 'lowest' record from each of the
> tables fast enough that it wouldn't be an issue. [...]
[PERFORM] Query optimization using order by and limit
First of all, thank you for taking the time to review my question. After attending the PostgresOpen conference in Chicago last week, I've been poring over explain logs for hours on end, and although my system is MUCH better, I still can't resolve a few issues. Luckily my data is pretty well structured, so solving one issue will likely solve many more, so I'll start with this one.

Version: PostgreSQL 9.1rc1, compiled by Visual C++ build 1500, 64-bit
OS: Windows 7 64-bit
ORM: SQLAlchemy

Postgres table structure: I have daily partitioned tables for each of 4 core tables (the tables with the majority of my application's data). Each daily table inherits from its parent. I do not explicitly define a REFERENCE between these tables because I cannot guarantee the order in which the events are inserted into the database, but where there are references, the referenced row should exist in the other's daily table. The reason I partitioned the data in this manner is to increase query speed and make it easy to archive old data. (I'm new to high-end Postgres performance, so there are likely several fundamental flaws in my assumptions. I won't turn down any recommendation.)
An example of a daily partitioned table follows:

cb=# \d osmoduleloads_2011_09_14;
                       Table "public.osmoduleloads_2011_09_14"
        Column         |            Type             |                  Modifiers
-----------------------+-----------------------------+---------------------------------------------
 guid                  | numeric(20,0)               | not null
 osprocess_guid        | numeric(20,0)               | not null
 filepath_guid         | numeric(20,0)               | not null
 firstloadtime         | numeric(20,0)               | not null
 md5hash               | bytea                       | not null
 host_guid             | numeric(20,0)               | default NULL::numeric
 process_create_time   | numeric(20,0)               | default NULL::numeric
 process_filepath_guid | numeric(20,0)               | default NULL::numeric
 event_time            | timestamp without time zone | default '2011-09-14 00:00:00'::timestamp without time zone
Indexes:
    "osmoduleloads_2011_09_14_pkey" PRIMARY KEY, btree (guid)
    "idx_osmoduleloads_2011_09_14_filepath_guid" btree (filepath_guid)
    "idx_osmoduleloads_2011_09_14_firstload_time" btree (firstloadtime)
    "idx_osmoduleloads_2011_09_14_host_guid" btree (host_guid)
    "idx_osmoduleloads_2011_09_14_md5hash" btree (md5hash)
    "idx_osmoduleloads_2011_09_14_osprocess_guid" btree (osprocess_guid)
Check constraints:
    "osmoduleloads_2011_09_14_event_time_check" CHECK (event_time = '2011-09-14 00:00:00'::timestamp without time zone)
    "osmoduleloads_2011_09_14_firstloadtime_check" CHECK (firstloadtime >= 1296044640::bigint::numeric AND firstloadtime < 1296053280::bigint::numeric)
Inherits: osmoduleloads

Objective: The firstloadtime check constraint ensures that the record is applicable to that daily table. (In case you were wondering, the large numerics correspond to the Windows 100-nanosecond-intervals-since-the-Epoch timestamp format.) I'm inserting millions of records into each daily table, so query slowness is quite easy to spot. Given that there is so much data per daily table, I was hoping to use the order by and limit clauses to stop a query once it satisfied the limit clause and not be forced to visit each daily table. However, I'm spending way too much time in the older tables, which leads me to believe that I'm doing something wrong.
For ease of viewing, my explain analyze can be found at http://explain.depesz.com/s/tot

I'm still very new to this, so I'm not sure whether explain.depesz.com saves the original query. It wasn't readily apparent that it did, so here is the original query:

SELECT osm_1.*, storefiles_1.*, filepaths_1.*, filepaths_2.*
FROM (SELECT * FROM osmoduleloads
      JOIN hosts ON hosts.guid = osmoduleloads.host_guid
      WHERE hosts.guid = '2007075705813916178'
        AND osmoduleloads.firstloadtime >= 1296043200
        AND osmoduleloads.firstloadtime < 12961036799000
        AND hosts.enabled = true
        AND hosts.user_id = 111
      ORDER BY osmoduleloads.firstloadtime DESC
      LIMIT 251) AS osm_1
LEFT OUTER JOIN storefiles AS storefiles_1 ON osm_1.md5hash = storefiles_1.md5hash
LEFT OUTER JOIN filepaths AS filepaths_1 ON osm_1.process_filepath_guid = filepaths_1.guid AND osm_1.event_time = filepaths_1.event_time
LEFT OUTER JOIN filepaths AS filepaths_2 ON osm_1.filepath_guid = filepaths_2.guid AND osm_1.event_time = filepaths_2.event_time
ORDER BY osm_1.firstloadtime DESC;

Hopefully my assumptions about ORDER BY and LIMIT are correct and this query can be optimized. Again, I appreciate any help you can lend. Thanks in advance. Mike
Re: [PERFORM] Query optimization using order by and limit
On 09/21/2011 07:14 PM, Michael Viscuso wrote:
Check constraints:
    "osmoduleloads_2011_09_14_event_time_check" CHECK (event_time = '2011-09-14 00:00:00'::timestamp without time zone)
    "osmoduleloads_2011_09_14_firstloadtime_check" CHECK (firstloadtime >= 1296044640::bigint::numeric AND firstloadtime < 1296053280::bigint::numeric)
Inherits: osmoduleloads

That weird casting can't be helping. I'm not sure if it's your problem here, but the constraint exclusion code is pretty picky about matching the thing you're looking for against the CHECK constraint, and this is a messy one. The bigint conversion in the middle there isn't doing anything useful for you anyway; you really should simplify this to just look like this:

firstloadtime >= 1296044640::numeric

SELECT osm_1.*, storefiles_1.*, filepaths_1.*, filepaths_2.* FROM (SELECT * FROM osmoduleloads JOIN hosts ON hosts.guid = osmoduleloads.host_guid WHERE hosts.guid = '2007075705813916178' AND osmoduleloads.firstloadtime >= 1296043200 AND osmoduleloads.firstloadtime < 12961036799000 AND hosts.enabled = true AND hosts.user_id = 111 ORDER BY osmoduleloads.firstloadtime DESC LIMIT 251) AS osm_1 LEFT OUTER JOIN storefiles AS storefiles_1 ON osm_1.md5hash = storefiles_1.md5hash LEFT OUTER JOIN filepaths AS filepaths_1 ON osm_1.process_filepath_guid = filepaths_1.guid AND osm_1.event_time = filepaths_1.event_time LEFT OUTER JOIN filepaths AS filepaths_2 ON osm_1.filepath_guid = filepaths_2.guid AND osm_1.event_time = filepaths_2.event_time ORDER BY osm_1.firstloadtime DESC;

What you should start with here is confirming whether or not a simpler query touches all of the partitions or just the ones you expect it to. A simpler one like this:

SELECT * FROM osmoduleloads WHERE osmoduleloads.firstloadtime >= 1296043200 AND osmoduleloads.firstloadtime < 12961036799000;

would be the place to begin. Once you've got that working, then you can build up more pieces, and see if one of them results in the query not excluding partitions anymore.
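Greg's two suggestions, simplifying the constraint and probing with the smallest possible query, might look like this in practice (constraint name taken from the \d output above; the constants are the ones quoted in the thread):

```sql
-- Replace the double-cast CHECK with a plain numeric comparison, which the
-- constraint exclusion matcher can line up against the query's WHERE clause.
ALTER TABLE osmoduleloads_2011_09_14
    DROP CONSTRAINT osmoduleloads_2011_09_14_firstloadtime_check;
ALTER TABLE osmoduleloads_2011_09_14
    ADD CONSTRAINT osmoduleloads_2011_09_14_firstloadtime_check
    CHECK (firstloadtime >= 1296044640::numeric
       AND firstloadtime <  1296053280::numeric);

-- Then verify which partitions a bare range query plans against:
EXPLAIN SELECT * FROM osmoduleloads
 WHERE firstloadtime >= 1296043200
   AND firstloadtime <  12961036799000;
```

If the EXPLAIN output lists only the expected daily children, exclusion is working and the complexity can be added back one piece at a time.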
I can't figure out if you're running into a basic error here, where constraint exclusion just isn't working at all, or if you are only having this problem because the query is too complicated. Figuring that out will narrow the potential solutions. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
Re: [PERFORM] Query optimization using order by and limit
Greg Smith g...@2ndquadrant.com writes: That weird casting can't be helping. I'm not sure if it's your problem here, but the constraint exclusion code is pretty picky about matching the thing you're looking for against the CHECK constraint, and this is a messy one. The bigint conversion in the middle there isn't doing anything useful for you anyway; you really should simplify this to just look like this: firstloadtime >= 1296044640::numeric

I have a more aggressive suggestion: change all the numeric(20,0) fields to bigint. Unless the OP actually needs values wider than 64 bits, the choice to use numeric is a significant performance penalty for nothing.

regards, tom lane

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
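A sketch of Tom's suggestion, assuming the stored values actually fit in 63 bits. Note that ALTER COLUMN ... TYPE rewrites each table, so on a large partitioned set this is a substantial operation and may be better done per partition during a maintenance window:

```sql
-- Altering the parent recurses to the inheriting daily children.
-- This rewrites every table and rebuilds the affected indexes.
ALTER TABLE osmoduleloads
    ALTER COLUMN guid TYPE bigint,
    ALTER COLUMN firstloadtime TYPE bigint;
```

The CHECK constraints on each child would then compare plain bigint literals, which is both cheaper and easier for the constraint exclusion matcher.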
Re: [PERFORM] Query optimization using order by and limit
Thanks guys,

First of all, I should have included my postgresql.conf file with the original submission - sorry about that. It is now attached. Based on a recommendation, I also should have shown the parent/child relationship between osmoduleloads and its daily partitioned tables. To reduce clutter, it is at the end of this message.

Taking this one step at a time, and taking Greg's second suggestion first, issuing

select * from osmoduleloads WHERE osmoduleloads.firstloadtime >= 1296043200 AND osmoduleloads.firstloadtime < 12961036799000;

appears to only query the appropriate daily tables (2011_09_13 through 2011_09_20 - http://explain.depesz.com/s/QCG), so it appears that constraint_exclusion is working properly. Putting a limit on the query like:

select * from osmoduleloads WHERE osmoduleloads.firstloadtime >= 1296043200 AND osmoduleloads.firstloadtime < 12961036799000 limit 251;

has the result that I'd expect to see: http://explain.depesz.com/s/O7fZ. Ordering by firstloadtime AND limiting like:

select * from osmoduleloads WHERE osmoduleloads.firstloadtime >= 1296043200 AND osmoduleloads.firstloadtime < 12961036799000 order by firstloadtime desc limit 251;

also has the result that I'd expect to see: http://explain.depesz.com/s/RDh. Adding the hosts join condition to the mix was still OK: http://explain.depesz.com/s/2Ns. Adding the hosts.enabled condition was still OK: http://explain.depesz.com/s/UYN. Adding hosts.user_id = 111 started the descent, but it appears to still be obeying the constraint exclusion that I'd expect, just with a ton of rows returned from the most recent daily tables: http://explain.depesz.com/s/4WE. Adding the final condition hosts.guid = '2007075705813916178' is what ultimately kills it: http://explain.depesz.com/s/8zy. By adding the host_guid, it spends considerably more time in the older tables than without this condition, and I'm not sure why.
Thanks Greg for the recommendation to step through it like that - hopefully this helps get us closer to a resolution. Greg/Tom, you are correct: these columns should be modified to whatever is easiest for Postgres to recognize as 64-bit unsigned integers. Would you still recommend bigint for unsigned integers? I likely read the wrong documentation that suggested bigint for signed 64-bit integers and numeric(20) for unsigned 64-bit integers. Thanks again for all your help! Perhaps 15 hours of poring over explain logs will finally pan out! Mike

cb=# \d+ osmoduleloads;
                              Table "public.osmoduleloads"
        Column         |            Type             |       Modifiers       | Storage  | Description
-----------------------+-----------------------------+-----------------------+----------+-------------
 guid                  | numeric(20,0)               | not null              | main     |
 osprocess_guid        | numeric(20,0)               | not null              | main     |
 filepath_guid         | numeric(20,0)               | not null              | main     |
 firstloadtime         | numeric(20,0)               | not null              | main     |
 md5hash               | bytea                       | not null              | extended |
 host_guid             | numeric(20,0)               | default NULL::numeric | main     |
 process_create_time   | numeric(20,0)               | default NULL::numeric | main     |
 process_filepath_guid | numeric(20,0)               | default NULL::numeric | main     |
 event_time            | timestamp without time zone |                       | plain    |
Indexes:
    "osmoduleloads_pkey" PRIMARY KEY, btree (guid)
Child tables: osmoduleloads_2001_12_31, osmoduleloads_2010_10_11, osmoduleloads_2010_10_12, osmoduleloads_2010_10_13, osmoduleloads_2011_07_27, osmoduleloads_2011_08_04, osmoduleloads_2011_08_05, osmoduleloads_2011_08_06, osmoduleloads_2011_08_07, osmoduleloads_2011_08_08, osmoduleloads_2011_08_09, osmoduleloads_2011_08_10, osmoduleloads_2011_08_11, osmoduleloads_2011_08_12, osmoduleloads_2011_08_13, osmoduleloads_2011_08_14, osmoduleloads_2011_08_15, osmoduleloads_2011_08_16, osmoduleloads_2011_08_17, osmoduleloads_2011_08_18, osmoduleloads_2011_08_19, osmoduleloads_2011_08_20, osmoduleloads_2011_08_21, osmoduleloads_2011_08_22, osmoduleloads_2011_08_23, osmoduleloads_2011_08_24, osmoduleloads_2011_08_25, osmoduleloads_2011_08_26, osmoduleloads_2011_08_27, osmoduleloads_2011_08_28,
osmoduleloads_2011_08_29, osmoduleloads_2011_08_30, osmoduleloads_2011_08_31,
Re: [PERFORM] Query optimization using order by and limit
Michael Viscuso michael.visc...@getcarbonblack.com writes: Greg/Tom, you are correct, these columns should be modified to whatever is easiest for Postgres to recognize 64-bit unsigned integers. Would you still recommend bigint for unsigned integers? I likely read the wrong documentation that suggested bigint for signed 64-bit integers and numeric(20) for unsigned 64-bit integers. Unsigned? Oh, hm, that's a bit of a problem because we don't have any unsigned types. If you really need to go to 2^64 and not 2^63 then you're stuck with numeric ... but that last bit is costing ya a lot. regards, tom lane
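If the full unsigned 64-bit range really is required, one workaround (my suggestion, not something from the thread) is to shift values by 2^63 so they fit in a signed bigint while preserving sort order. The helper function name here is hypothetical:

```sql
-- Map an unsigned 64-bit value [0, 2^64-1] onto the signed bigint range
-- [-2^63, 2^63-1]. Subtracting the constant offset preserves ordering,
-- so btree indexes and range predicates keep working.
CREATE FUNCTION u64_to_bigint(v numeric) RETURNS bigint AS $$
    SELECT (v - 9223372036854775808)::bigint;
$$ LANGUAGE sql IMMUTABLE;

-- The largest u64 maps to the largest bigint:
SELECT u64_to_bigint(18446744073709551615);  -- 9223372036854775807
```

Application code has to apply the same offset when reading values back, which is why this trick only pays off when the numeric-vs-bigint penalty really matters.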
Re: [PERFORM] Query optimization help
Hi,

On 30 August 2011 15:36, Szymon Kosok szy...@mwg.pl wrote: Hello, I asked that question on StackOverflow, but didn't get any valuable response, so I'll ask it here. :) I have such query:

Could you please re-post your explain using this web site: http://explain.depesz.com/ and post a link to the StackOverflow question? What is your Postgres version? Database settings?

I see a huge discrepancy between predicted and actual row numbers (like 1264420 vs 485). I would try the following:
- check column statistics (pg_stats) and focus on the following columns: n_distinct, null_frac, most_common_vals. If they are way off from the actual values then you should tweak the (auto)analyze process: run manual/auto analyze more often (check pg_stat_user_tables), increase default_statistics_target (per column or globally)
- try to disable nested loop join (set enable_nestloop=off)

-- Ondrej Ivanic (ondrej.iva...@gmail.com)
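Ondrej's checklist translates roughly into the following commands (the table name is a placeholder, since the actual query had not been posted yet at this point in the thread):

```sql
-- Inspect the planner's per-column statistics for suspicious values:
SELECT attname, n_distinct, null_frac, most_common_vals
  FROM pg_stats
 WHERE tablename = 'your_table';

-- Check when (auto)analyze last ran on your tables:
SELECT relname, last_analyze, last_autoanalyze
  FROM pg_stat_user_tables;

-- As an experiment only, per session, to see whether the planner's
-- preference for nested loops is the problem:
SET enable_nestloop = off;
```

Disabling nested loops is a diagnostic tool, not a fix; if it helps, the underlying row-count estimates are usually what need attention.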
Re: [PERFORM] Query optimization help
2011/8/30 Ondrej Ivanič ondrej.iva...@gmail.com: Could you please re-post your explain using this web site: http://explain.depesz.com/ and post links to Stackoverflow question?

Here it is: http://explain.depesz.com/s/Iaa

- try to disable nested loop join (set enable_nestloop=off)

Even worse performance (http://explain.depesz.com/s/mMi). My configuration: http://pastie.org/2453148 (copied and pasted only the uncommented important variables). It's decent hardware: i7, 16 GB of RAM, 3x2 RAID 10 (7200rpm) for OS + data, RAID 1 (2 disks, 7200rpm) for WAL, RAID controller with BBU and 512 MB memory cache (cache is set to write only).

PS. Sorry Ondrej, accidentally I sent the reply to you, not to the list.
Re: [PERFORM] Query optimization help
Hi,

2011/8/30 Szymon Kosok szy...@mwg.pl: Here it is: http://explain.depesz.com/s/Iaa

Thanks. I would try to materialise the spoleczniak_tablica part first. Your query looks like this:

select ... from spoleczniak_tablica inner join ... where ... order by spoleczniak_tablica.id desc limit 21

so I would rewrite your query like this:

select ... from (select ... from spoleczniak_tablica where ... order by spoleczniak_tablica.id desc limit 21) as x inner join ...

-- Ondrej Ivanic (ondrej.iva...@gmail.com)
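Concretely, with the table names from Szymon's query, the suggested shape would be something like the sketch below. The filter is abbreviated to one branch; the rewrite is only equivalent when the WHERE clause references spoleczniak_tablica columns alone:

```sql
-- Push ORDER BY/LIMIT into a subquery so at most 21 rows are joined,
-- instead of filtering and sorting after the join.
SELECT x.*, p.*
FROM (SELECT t.*
        FROM spoleczniak_tablica t
       WHERE t.postac_id = 1           -- plus the remaining OR branches
       ORDER BY t.id DESC
       LIMIT 21) AS x
JOIN postac_postacie p ON x.postac_id = p.id
ORDER BY x.id DESC;
```

The outer ORDER BY is repeated because the join does not guarantee that the subquery's ordering survives.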
[PERFORM] Query optimization help
Hello, I asked that question on StackOverflow, but didn't get any valuable response, so I'll ask it here. :) I have this query:

SELECT spoleczniak_tablica.id, spoleczniak_tablica.postac_id, spoleczniak_tablica.hash, spoleczniak_tablica.typ, spoleczniak_tablica.ikona, spoleczniak_tablica.opis, spoleczniak_tablica.cel, spoleczniak_tablica.data, postac_postacie.id, postac_postacie.user_id, postac_postacie.avatar, postac_postacie.ikonka, postac_postacie.imie, postac_postacie.nazwisko, postac_postacie.pseudonim, postac_postacie.plec, postac_postacie.wzrost, postac_postacie.waga, postac_postacie.ur_tydz, postac_postacie.ur_rok, postac_postacie.ur_miasto_id, postac_postacie.akt_miasto_id, postac_postacie.kasa, postac_postacie.punkty, postac_postacie.zmeczenie, postac_postacie.zdrowie, postac_postacie.kariera
FROM spoleczniak_tablica
INNER JOIN postac_postacie ON (spoleczniak_tablica.postac_id = postac_postacie.id)
WHERE spoleczniak_tablica.postac_id = 1
   or spoleczniak_tablica.id in (select wpis_id from spoleczniak_oznaczone where etykieta_id in (select tag_id from spoleczniak_subskrypcje where postac_id = 1))
   or (spoleczniak_tablica.postac_id in (select obserwowany_id from spoleczniak_obserwatorium where obserwujacy_id = 1) and hash not in ('dyskusja', 'kochanie', 'szturniecie'))
   or (spoleczniak_tablica.cel = 1 and spoleczniak_tablica.hash in ('dyskusja', 'kochanie', 'obserwatorium', 'szturchniecie'))
   or spoleczniak_tablica.hash = 'administracja-info'
   or exists (select 1 from spoleczniak_komentarze where kredka_id = spoleczniak_tablica.id and postac_id = 1)
ORDER BY spoleczniak_tablica.id DESC
LIMIT 21;

and it's a real performance bottleneck for us - it's one of the most frequently executed queries on our site.
Here is EXPLAIN ANALYZE:

 Limit  (cost=52.69..185979.44 rows=21 width=283) (actual time=5.981..149.110 rows=21 loops=1)
   ->  Nested Loop  (cost=52.69..27867127142.57 rows=3147528 width=283) (actual time=5.981..149.103 rows=21 loops=1)
         ->  Index Scan Backward using spoleczniak_tablica_pkey on spoleczniak_tablica  (cost=52.69..27866103743.37 rows=3147528 width=194) (actual time=5.971..148.963 rows=21 loops=1)
               Filter: ((postac_id = 1) OR (SubPlan 1) OR ((hashed SubPlan 2) AND ((hash)::text <> ALL ('{dyskusja,kochanie,szturniecie}'::text[]))) OR ((cel = 1) AND ((hash)::text = ANY ('{dyskusja,kochanie,obserwatorium,szturchniecie}'::text[]))) OR ((hash)::text = 'administracja-info'::text) OR (alternatives: SubPlan 3 or hashed SubPlan 4))
               SubPlan 1
                 ->  Materialize  (cost=13.28..11947.85 rows=1264420 width=4) (actual time=0.000..0.024 rows=485 loops=2137)
                       ->  Nested Loop  (cost=13.28..685.75 rows=1264420 width=4) (actual time=0.119..0.664 rows=485 loops=1)
                             ->  HashAggregate  (cost=5.89..5.90 rows=1 width=4) (actual time=0.015..0.017 rows=7 loops=1)
                                   ->  Index Scan using spoleczniak_subskrypcje_postac_id on spoleczniak_subskrypcje  (cost=0.00..5.89 rows=2 width=4) (actual time=0.005..0.009 rows=7 loops=1)
                                         Index Cond: (postac_id = 1)
                             ->  Bitmap Heap Scan on spoleczniak_oznaczone  (cost=7.38..674.96 rows=391 width=8) (actual time=0.019..0.082 rows=69 loops=7)
                                   Recheck Cond: (etykieta_id = spoleczniak_subskrypcje.tag_id)
                                   ->  Bitmap Index Scan on spoleczniak_oznaczone_etykieta_id  (cost=0.00..7.29 rows=391 width=0) (actual time=0.013..0.013 rows=69 loops=7)
                                         Index Cond: (etykieta_id = spoleczniak_subskrypcje.tag_id)
               SubPlan 2
                 ->  Index Scan using spoleczniak_obserwatorium_obserwujacy_id on spoleczniak_obserwatorium  (cost=0.00..39.36 rows=21 width=4) (actual time=0.006..0.030 rows=26 loops=1)
                       Index Cond: (obserwujacy_id = 1)
               SubPlan 3
                 ->  Bitmap Heap Scan on spoleczniak_komentarze  (cost=18.67..20.68 rows=1 width=0) (never executed)
                       Recheck Cond: ((kredka_id = spoleczniak_tablica.id) AND (postac_id = 1))
                       ->  BitmapAnd  (cost=18.67..18.67 rows=1 width=0) (never executed)
                             ->  Bitmap Index Scan on spoleczniak_komentarze_kredka_id  (cost=0.00..2.98 rows=24 width=0) (never executed)
                                   Index Cond: (kredka_id = spoleczniak_tablica.id)
                             ->  Bitmap Index Scan on spoleczniak_komentarze_postac_id  (cost=0.00..15.44 rows=890 width=0) (never executed)
                                   Index Cond: (postac_id = 1)
               SubPlan 4
                 ->  Index Scan using spoleczniak_komentarze_postac_id on spoleczniak_komentarze  (cost=0.00..1610.46 rows=890 width=4) (actual time=0.013..2.983 rows=3605 loops=1)
                       Index Cond: (postac_id = 1)
         ->  Index Scan using postac_postacie_pkey on
[PERFORM] Query optimization problem
I have a query:

SELECT d1.ID, d2.ID
FROM DocPrimary d1
  JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE (d1.ID=234409763) or (d2.ID=234409763)

I think the query optimizer could make it faster (currently it uses seq scans, and on a million records it takes 7 seconds), as if it had been written:

SELECT d1.ID, d2.ID
FROM DocPrimary d1
  JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE (d2.BasedOn=234409763) or (d2.ID=234409763)

-- Slow Query --
test=# EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS off) SELECT d1.ID, d2.ID
test-# FROM DocPrimary d1
test-# JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
test-# WHERE (d1.ID=234409763) or (d2.ID=234409763);
 Hash Join  (cost=58.15..132.35 rows=2 width=8) (actual time=0.007..0.007 rows=0 loops=1)
   Output: d1.id, d2.id
   Hash Cond: (d2.basedon = d1.id)
   Join Filter: ((d1.id = 234409763) OR (d2.id = 234409763))
   ->  Seq Scan on public.docprimary d2  (cost=0.00..31.40 rows=2140 width=8) (actual time=0.002..0.002 rows=0 loops=1)
         Output: d2.id, d2.basedon
   ->  Hash  (cost=31.40..31.40 rows=2140 width=4) (never executed)
         Output: d1.id
         ->  Seq Scan on public.docprimary d1  (cost=0.00..31.40 rows=2140 width=4) (never executed)
               Output: d1.id

-- Fast Query --
test=# EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS off) SELECT d1.ID, d2.ID
test-# FROM DocPrimary d1
test-# JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
test-# WHERE (d2.BasedOn=234409763) or (d2.ID=234409763);
 Nested Loop  (cost=8.60..58.67 rows=12 width=8) (actual time=0.026..0.026 rows=0 loops=1)
   Output: d1.id, d2.id
   ->  Bitmap Heap Scan on public.docprimary d2  (cost=8.60..19.31 rows=12 width=8) (actual time=0.023..0.023 rows=0 loops=1)
         Output: d2.id, d2.basedon
         Recheck Cond: ((d2.basedon = 234409763) OR (d2.id = 234409763))
         ->  BitmapOr  (cost=8.60..8.60 rows=12 width=0) (actual time=0.018..0.018 rows=0 loops=1)
               ->  Bitmap Index Scan on basedon_idx  (cost=0.00..4.33 rows=11 width=0) (actual time=0.008..0.008 rows=0 loops=1)
                     Index Cond: (d2.basedon = 234409763)
               ->  Bitmap Index Scan on id_pk  (cost=0.00..4.26 rows=1 width=0) (actual time=0.003..0.003 rows=0 loops=1)
                     Index Cond: (d2.id = 234409763)
   ->  Index Scan using id_pk on public.docprimary d1  (cost=0.00..3.27 rows=1 width=4) (never executed)
         Output: d1.id, d1.basedon
         Index Cond: (d1.id = d2.basedon)

PGver: PostgreSQL 9.0b x86
OS: Win7 x64

Create table query:
CREATE TABLE docprimary (
    id integer NOT NULL,
    basedon integer,
    CONSTRAINT id_pk PRIMARY KEY (id)
);
CREATE INDEX basedon_idx ON docprimary USING btree (basedon);
Re: [PERFORM] Query optimization problem
Hello Zotov,

Somehow the equivalence d2.basedon=d1.id is not used in the slow query, probably because the equivalent constant value would have to be applied inside a non-base expression (the OR). You can see that the equivalence values *are* used by changing the OR to an AND and comparing both queries. The only thing you can do to guarantee the planner has all the information in cases like this is to explicitly name the equivalence inside the OR expressions, e.g.

SELECT d1.ID, d2.ID
FROM DocPrimary d1
  JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE (d1.ID=234409763 and d2.basedon=234409763) or (d2.ID=234409763);

regards,
Yeb Havinga

PS: the analyze time of the slow query showed 0.007 ms?
[PERFORM] Query Optimization
Hi, Can anyone suggest why this query is so slow?

SELECT version();
 PostgreSQL 8.4.2 on i386-portbld-freebsd7.2, compiled by GCC cc (GCC) 4.2.1 20070719 [FreeBSD], 32-bit
(1 row)

explain analyze
SELECT DT.value, DT.meassure_date, DT.ms_status_id as status_id, S.descr_bg as status_bg, S.descr_en as status_en, VT.id as value_type_id, VT.descr_en as value_type_en, VT.descr_bg as value_type_bg, T.unit as value_type_unit, T.name as general_value_type, T.ms_db_type_id
FROM ms_data AS DT, ms_statuses AS S, ms_value_types AS VT, ms_types AS T, ms_commands_history AS CH
WHERE DT.ms_value_type_id = 88
  AND DT.meassure_date >= '2010-04-01 1:00:00'
  AND DT.meassure_date <= '2010-04-01 1:10:00'
  AND DT.ms_command_history_id = CH.id
  AND CH.ms_device_id = 7
  AND DT.ms_value_type_id = VT.id
  AND VT.ms_type_id = T.id
  AND DT.ms_status_id = S.id
GROUP BY value, meassure_date, status_id, status_bg, status_en, value_type_id, value_type_en, value_type_bg, value_type_unit, general_value_type, ms_db_type_id
ORDER BY meassure_date DESC;

 Group  (cost=23.93..23.96 rows=1 width=229) (actual time=63274.021..63274.021 rows=0 loops=1)
   ->  Sort  (cost=23.93..23.94 rows=1 width=229) (actual time=63274.016..63274.016 rows=0 loops=1)
         Sort Key: dt.meassure_date, dt.value, dt.ms_status_id, s.descr_bg, s.descr_en, vt.id, vt.descr_en, vt.descr_bg, t.unit, t.name, t.ms_db_type_id
         Sort Method: quicksort  Memory: 17kB
         ->  Nested Loop  (cost=0.00..23.92 rows=1 width=229) (actual time=63273.982..63273.982 rows=0 loops=1)
               ->  Nested Loop  (cost=0.00..19.64 rows=1 width=165) (actual time=63273.977..63273.977 rows=0 loops=1)
                     ->  Nested Loop  (cost=0.00..15.36 rows=1 width=101) (actual time=63273.974..63273.974 rows=0 loops=1)
                           ->  Nested Loop  (cost=0.00..11.08 rows=1 width=23) (actual time=63273.970..63273.970 rows=0 loops=1)
                                 ->  Index Scan using ms_commands_history_ms_device_id_idx on ms_commands_history ch  (cost=0.00..4.33 rows=1 width=8) (actual time=0.163..25254.004 rows=9807 loops=1)
                                       Index Cond: (ms_device_id = 7)
                                 ->  Index Scan using ms_data_ms_command_history_id_idx on ms_data dt  (cost=0.00..6.74 rows=1 width=31) (actual time=3.868..3.868 rows=0 loops=9807)
                                       Index Cond: (dt.ms_command_history_id = ch.id)
                                       Filter: ((dt.meassure_date >= '2010-04-01 01:00:00'::timestamp without time zone) AND (dt.meassure_date <= '2010-04-01 01:10:00'::timestamp without time zone) AND (dt.ms_value_type_id = 88))
                           ->  Index Scan using ms_value_types_pkey on ms_value_types vt  (cost=0.00..4.27 rows=1 width=82) (never executed)
                                 Index Cond: (vt.id = 88)
                     ->  Index Scan using ms_types_pkey on ms_types t  (cost=0.00..4.27 rows=1 width=72) (never executed)
                           Index Cond: (t.id = vt.ms_type_id)
               ->  Index Scan using ms_statuses_pkey on ms_statuses s  (cost=0.00..4.27 rows=1 width=68) (never executed)
                     Index Cond: (s.id = dt.ms_status_id)
 Total runtime: 63274.256 ms

Thanks in advance. Kaloyan Iliev
Re: [PERFORM] Query Optimization
In response to Kaloyan Iliev Iliev: Hi, Can anyone suggest why this query so slow.

->  Index Scan using ms_commands_history_ms_device_id_idx on ms_commands_history ch  (cost=0.00..4.33 rows=1 width=8) (actual time=0.163..25254.004 rows=9807 loops=1)

Estimated rows: 1, actual rows: 9807. That's a BIG difference and, maybe, your problem. Btw.: your explain is hard to read (line-wrapping). It's better to attach the explain as its own file...

Andreas
-- Andreas Kretschmer Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: - Header) GnuPG: 0x31720C99, 1006 CCB4 A326 1D42 6431 2EB0 389D 1DC2 3172 0C99
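A common next step when one scan's estimate is off by four orders of magnitude (my suggestion, filling in what the reply implies) is to raise the statistics target for the offending column and re-analyze; the value 1000 is only an example:

```sql
-- Collect a larger sample for ms_device_id, then refresh the statistics.
ALTER TABLE ms_commands_history ALTER COLUMN ms_device_id SET STATISTICS 1000;
ANALYZE ms_commands_history;

-- Re-check whether the row estimate now tracks reality:
EXPLAIN SELECT * FROM ms_commands_history WHERE ms_device_id = 7;
```

If the estimate improves, the planner should stop nesting an expensive scan inside a loop it believed would run once.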
Re: [PERFORM] query optimization
On Fri, Nov 27, 2009 at 4:47 PM, Faheem Mitha fah...@email.unc.edu wrote: If not, you might want to look at some way of pre-marking the non-duplicate rows so that you don't have to recompute that each time. What are the options re pre-marking? Well, what I usually do is - if I'm going to do the same distinct-ification frequently, I add an extra column (say, a boolean) and set it to true for all and only those rows which will pass the distinct-ification filter. Then I can just say WHERE that-column-name. Yes, I see. The problem with premarking is that the selection is somewhat dynamic, in the sense that it depends on the idlink table, which depends on patient data, which can change. Yeah. For things like this I find you have to think hard about how to organize your schema so that you can optimize the queries you care about. There are no "just do this and it works" solutions to performance problems of this type. Still, many of them are solvable by making the right decisions elsewhere. Sometimes you can use triggers to recompute your premarks when the data in the other table changes. Another strategy is to keep a cache of precomputed results somewhere. When the underlying data changes, you use triggers to invalidate anything in the cache that might now be wrong, and set things up so that it will be recomputed when next it is used. But in either case you have to figure out the right place to do the computation so that it gains you more than it saves you, and adjusting your schema is often necessary. Then you might be able to use the underlying table directly in the next CTE, which will usually permit better optimization, more use of indices, etc. It seems pretty unfortunate that dedup_patient_anno joins against geno and then patient_geno does what appears to be the same join again. Is there some way to eliminate that? If so it will probably help. You don't say whether you are looking at the PED or TPED query, so I'll assume PED. They are similar anyway.
I see your point re the joins. You mean anno INNER JOIN geno followed by geno INNER JOIN dedup_patient_anno? I think the point of the first join is to reduce the anno table based on information from the geno table. The result is basically a subset of the anno table with some potential duplication removed, which is then re-joined to the geno table. I agree this seems a bit suboptimal, and there might be a better way to do this. Yeah, I didn't think about it in detail, but it looks like it should be possible. Eliminating joins can sometimes have *dramatic* effects on query performance, and it never hurts. Failing all else, couldn't I smoosh together the two queries and do a triple join? For reference, the two CTEs in question, from the PED query, are as follows.

dedup_patient_anno AS (
    SELECT * FROM
        (SELECT *, row_number() OVER (PARTITION BY anno.rsid ORDER BY anno.id)
         FROM anno INNER JOIN geno ON anno.id = geno.anno_id
         WHERE idlink_id = (SELECT MIN(id) FROM idlink)
        ) AS s
    WHERE row_number = '1'
),
patient_geno AS (
    SELECT geno.idlink_id AS idlink_id, geno.anno_id AS anno_id,
           geno.snpval_id AS snpval_id, allelea_id, alleleb_id
    FROM geno INNER JOIN dedup_patient_anno
        ON geno.anno_id = dedup_patient_anno.id
),

If that will give the same results, which I'm not immediately certain about, then I highly recommend it. In general I would recommend only using CTEs to express concepts that can't sensibly be expressed in other ways, not to beautify your queries. Keep in mind that joins can be reordered and/or executed using different methods, but most other operations can't be, so trying to get your joins together in one place is usually a good strategy, in my experience. And of course if that lets you reduce the total number of joins, that's even better. ...Robert -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
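An untested sketch of the "smooshed" version Faheem asks about, folding both CTEs above into a single query and replacing the row_number() window with DISTINCT ON; whether it returns exactly the same rows as the original formulation would need to be verified before use:

```sql
-- Hedged rewrite: collapse dedup_patient_anno + patient_geno into one query.
-- DISTINCT ON (a.rsid) ... ORDER BY a.rsid, a.id keeps the smallest-id row
-- per rsid, mirroring row_number() OVER (PARTITION BY rsid ORDER BY id) = 1.
SELECT g.idlink_id, g.anno_id, g.snpval_id, g.allelea_id, g.alleleb_id
FROM geno g
JOIN (SELECT DISTINCT ON (a.rsid) a.id
      FROM anno a
      JOIN geno g2 ON a.id = g2.anno_id
      WHERE g2.idlink_id = (SELECT min(id) FROM idlink)
      ORDER BY a.rsid, a.id) dedup
  ON g.anno_id = dedup.id;
```

Expressed this way, all three joins sit in one query block, so the planner is free to reorder them, which is exactly the flexibility Robert recommends preserving.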
Re: [PERFORM] query optimization
On Wed, 25 Nov 2009, Robert Haas wrote: On Wed, Nov 25, 2009 at 5:54 PM, Faheem Mitha fah...@email.unc.edu wrote: Well, any method of DISTINCT-ifying is likely to be somewhat slow, but I've had good luck with SELECT DISTINCT ON (...) in the past, as compared with other methods. YMMV - the only way to find out is to benchmark it. I don't think it's non-deterministic if you order by the DISTINCT-ON columns and enough extras to break any ties - you should get the first one of each set. Right, but adding enough extras to break ties is up to the user, and the language doesn't guarantee anything, so it feels more fragile. If not, you might want to look at some way of pre-marking the non-duplicate rows so that you don't have to recompute that each time. What are the options re pre-marking? Well, what I usually do is - if I'm going to do the same distinct-ification frequently, I add an extra column (say, a boolean) and set it to true for all and only those rows which will pass the distinct-ification filter. Then I can just filter with WHERE on that column. Yes, I see. The problem with premarking is that the selection is somewhat dynamic, in the sense that this depends on the idlink table, which depends on patient data, which can change. Then you might be able to use the underlying table directly in the next CTE, which will usually permit better optimization, more use of indices, etc. It seems pretty unfortunate that dedup_patient_anno joins against geno and then patient_geno does what appears to be the same join again. Is there some way to eliminate that? If so it will probably help. You don't say whether you are looking at the PED or TPED query, so I'll assume PED. They are similar anyway. I see your point re the joins. You mean anno INNER JOIN geno followed by geno INNER JOIN dedup_patient_anno? I think the point of the first join is to reduce the anno table based on information from the geno table.
The result is basically a subset of the anno table with some potential duplication removed, which is then re-joined to the geno table. I agree this seems a bit suboptimal, and there might be a better way to do this. Yeah, I didn't think about it in detail, but it looks like it should be possible. Eliminating joins can sometimes have *dramatic* effects on query performance, and it never hurts. Failing all else, couldn't I smoosh together the two queries and do a triple join? For reference, the two CTEs in question, from the PED query, are as follows.

dedup_patient_anno AS (
    SELECT * FROM
        (SELECT *, row_number() OVER (PARTITION BY anno.rsid ORDER BY anno.id)
         FROM anno INNER JOIN geno ON anno.id = geno.anno_id
         WHERE idlink_id = (SELECT MIN(id) FROM idlink)
        ) AS s
    WHERE row_number = '1'
),
patient_geno AS (
    SELECT geno.idlink_id AS idlink_id, geno.anno_id AS anno_id,
           geno.snpval_id AS snpval_id, allelea_id, alleleb_id
    FROM geno INNER JOIN dedup_patient_anno
        ON geno.anno_id = dedup_patient_anno.id
),

Regards, Faheem. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
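The trigger-maintained premark Robert describes could be sketched against the tables above as follows; the `is_first` column name is invented for illustration and the statements are untested:

```sql
-- Hypothetical premark column: true for the one anno row per rsid that
-- the "row_number = 1" filter would keep (smallest id wins, matching
-- ORDER BY anno.id in the window clause of dedup_patient_anno).
ALTER TABLE anno ADD COLUMN is_first boolean NOT NULL DEFAULT false;

-- One-shot backfill of the flag.
UPDATE anno
SET is_first = (anno.id = s.min_id)
FROM (SELECT rsid, min(id) AS min_id FROM anno GROUP BY rsid) s
WHERE anno.rsid = s.rsid;

-- Later queries can then replace the window-function CTE with a plain filter:
--   SELECT ... FROM anno WHERE is_first;
```

As Faheem notes, the selection also depends on geno and idlink, so the flag would have to be refreshed (for instance by triggers on those tables) whenever the underlying data changes.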
Re: [PERFORM] query optimization
On Mon, Nov 23, 2009 at 5:47 PM, Faheem Mitha fah...@email.unc.edu wrote: Hi everybody, I've got two queries that needs optimizing. Actually, there are others, but these are pretty representative. You can see the queries and the corresponding plans at http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.pdf or http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.tex if you prefer text (latex file, effectively text in this case) The background to this is at http://bulldog.duhs.duke.edu/~faheem/snpdb/diag.pdf If more details are required, let me know and I can add them. I'd appreciate suggestions about how to make these queries go faster. Please CC this email address on any replies. I've found that a good way to approach optimizing queries of this type is to look at the EXPLAIN ANALYZE results and figure out which parts of the query are slow. Then simplify the rest of the query as much as possible without eliminating the slowness. Then try to figure out how to optimize the simplified query: rewrite the logic, add indices, change the schema, etc. Lastly start adding the other bits back in. It looks like the dedup_patient_anno CTE is part of your problem. Try pulling that piece out and optimizing it separately. I wonder if that could be rewritten to use SELECT DISTINCT ON (...) and whether that would be any faster. If not, you might want to look at some way of pre-marking the non-duplicate rows so that you don't have to recompute that each time. Then you might be able to use the underlying table directly in the next CTE, which will usually permit better optimization, more use of indices, etc. It seems pretty unfortunate that dedup_patient_anno joins against geno and then patient_geno does what appears to be the same join again. Is there some way to eliminate that? If so it will probably help. Once you've got those parts of the query as well-optimized as you can, add the next pieces in and start hacking on those. 
...Robert -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] query optimization
Hi Robert, Thanks very much for your suggestions. On Wed, 25 Nov 2009, Robert Haas wrote: On Mon, Nov 23, 2009 at 5:47 PM, Faheem Mitha fah...@email.unc.edu wrote: Hi everybody, I've got two queries that need optimizing. Actually, there are others, but these are pretty representative. You can see the queries and the corresponding plans at http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.pdf or http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.tex if you prefer text (latex file, effectively text in this case) The background to this is at http://bulldog.duhs.duke.edu/~faheem/snpdb/diag.pdf If more details are required, let me know and I can add them. I'd appreciate suggestions about how to make these queries go faster. Please CC this email address on any replies. I've found that a good way to approach optimizing queries of this type is to look at the EXPLAIN ANALYZE results and figure out which parts of the query are slow. Then simplify the rest of the query as much as possible without eliminating the slowness. Then try to figure out how to optimize the simplified query: rewrite the logic, add indices, change the schema, etc. Lastly start adding the other bits back in. Good strategy. Now I just have to understand EXPLAIN ANALYZE well enough to figure out which bits are slow. :-) It looks like the dedup_patient_anno CTE is part of your problem. Try pulling that piece out and optimizing it separately. I wonder if that could be rewritten to use SELECT DISTINCT ON (...) and whether that would be any faster. Isn't SELECT DISTINCT supposed to be evil, since in general the result is not deterministic? I think I had SELECT DISTINCT earlier, and removed it because of that, with the help of Andrew (RhodiumToad on #postgresql). I didn't compare the corresponding subqueries separately, so I don't know what speed difference this made. If not, you might want to look at some way of pre-marking the non-duplicate rows so that you don't have to recompute that each time.
What are the options re pre-marking? Then you might be able to use the underlying table directly in the next CTE, which will usually permit better optimization, more use of indices, etc. It seems pretty unfortunate that dedup_patient_anno joins against geno and then patient_geno does what appears to be the same join again. Is there some way to eliminate that? If so it will probably help. You don't say whether you are looking at the PED or TPED query, so I'll assume PED. They are similar anyway. I see your point re the joins. You mean anno INNER JOIN geno followed by geno INNER JOIN dedup_patient_anno ? I think the point of the first join is to reduce the anno table based on information from the geno table. The result is basically a subset of the anno table with some potential duplication removed, which is then re-joined to the geno table. I agree this seems a bit suboptimal, and there might be a better way to do this. Once you've got those parts of the query as well-optimized as you can, add the next pieces in and start hacking on those. Regards, Faheem. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] query optimization
On Wed, Nov 25, 2009 at 5:54 PM, Faheem Mitha fah...@email.unc.edu wrote: Hi Robert, Thanks very much for your suggestions. Hi everybody, I've got two queries that needs optimizing. Actually, there are others, but these are pretty representative. You can see the queries and the corresponding plans at http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.pdf or http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.tex if you prefer text (latex file, effectively text in this case) The background to this is at http://bulldog.duhs.duke.edu/~faheem/snpdb/diag.pdf If more details are required, let me know and I can add them. I'd appreciate suggestions about how to make these queries go faster. Please CC this email address on any replies. I've found that a good way to approach optimizing queries of this type is to look at the EXPLAIN ANALYZE results and figure out which parts of the query are slow. Then simplify the rest of the query as much as possible without eliminating the slowness. Then try to figure out how to optimize the simplified query: rewrite the logic, add indices, change the schema, etc. Lastly start adding the other bits back in. Good strategy. Now I just have to understand EXPLAIN ANALYZE well enough to figure out which bits are slow. :-) Well, you basically just look for the big numbers. The actual numbers are in ms, and each node includes the times for the things beneath it, so usually my approach is to just look at lower and lower levels of the tree (i.e. the parts that are more indented) until I find the lowest level that is slow. Then I look at the query bits presented there to figure out which piece of the SQL it corresponds to. Looking at the estimates (which are not in ms or any other particular unit) can be helpful too, in that it can help you find places where the planner thought it would be fast but it was actually slow. To do this, look at the top level of the query and get a sense of what the ratio between estimated-cost-units and actual-ms is. 
Then look for big (order of magnitude) deviations from this throughout the plan. Those are places where you want to either gather better statistics, or rewrite the query so that it can make better use of statistics. The latter is more of an art than a science - I or someone else on this list can help you with it if we find a specific case to look at. It looks like the dedup_patient_anno CTE is part of your problem. Try pulling that piece out and optimizing it separately. I wonder if that could be rewritten to use SELECT DISTINCT ON (...) and whether that would be any faster. Isn't SELECT DISTINCT supposed to be evil, since in general the result is not deterministic? I think I had SELECT DISTINCT earlier, and removed it because of that, with the help of Andrew (RhodiumToad on #postgresql) I didn't compare the corresponding subqueries separately, so don't know what speed difference this made. Well, any method of DISTINCT-ifying is likely to be somewhat slow, but I've had good luck with SELECT DISTINCT ON (...) in the past, as compared with other methods. YMMV - the only way to find out is to benchmark it. I don't think it's non-deterministic if you order by the DISTINCT-ON columns and enough extras to break any ties - you should get the first one of each set. If not, you might want to look at some way of pre-marking the non-duplicate rows so that you don't have to recompute that each time. What are the options re pre-marking? Well, what I usually do is - if I'm going to do the same distinct-ification frequently, I add an extra column (say, a boolean) and set it to true for all and only those rows which will pass the distinct-ification filter. Then I can just say WHERE that column name. Then you might be able to use the underlying table directly in the next CTE, which will usually permit better optimization, more use of indices, etc. 
It seems pretty unfortunate that dedup_patient_anno joins against geno and then patient_geno does what appears to be the same join again. Is there some way to eliminate that? If so it will probably help. You don't say whether you are looking at the PED or TPED query, so I'll assume PED. They are similar anyway. I see your point re the joins. You mean anno INNER JOIN geno followed by geno INNER JOIN dedup_patient_anno ? I think the point of the first join is to reduce the anno table based on information from the geno table. The result is basically a subset of the anno table with some potential duplication removed, which is then re-joined to the geno table. I agree this seems a bit suboptimal, and there might be a better way to do this. Yeah, I didn't think about it in detail, but it looks like it should be possible. Eliminating joins can sometimes have *dramatic* effects on query performance, and it never hurts. ...Robert -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your
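Robert's point about tie-breaking can be made concrete with the thread's anno table; an illustrative sketch, not a tested query from the discussion:

```sql
-- Deterministic: ORDER BY starts with the DISTINCT ON expression and then
-- a unique column (id), so "first row per rsid" is well defined.
SELECT DISTINCT ON (rsid) *
FROM anno
ORDER BY rsid, id;

-- Underdetermined: without the id tie-breaker, which row survives for a
-- given rsid is left to the planner and may vary between executions.
SELECT DISTINCT ON (rsid) *
FROM anno;
```

The first form is what makes SELECT DISTINCT ON safe to use here: the result is fixed by the data, even though, as Faheem says, the language leaves it to the user to supply enough ordering columns.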
[PERFORM] query optimization
Hi everybody, I've got two queries that need optimizing. Actually, there are others, but these are pretty representative. You can see the queries and the corresponding plans at http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.pdf or http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.tex if you prefer text (latex file, effectively text in this case) The background to this is at http://bulldog.duhs.duke.edu/~faheem/snpdb/diag.pdf If more details are required, let me know and I can add them. I'd appreciate suggestions about how to make these queries go faster. Please CC this email address on any replies. Regards, Faheem. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] query optimization
2009/11/23 Faheem Mitha fah...@email.unc.edu Hi everybody, I've got two queries that needs optimizing. Actually, there are others, but these are pretty representative. You can see the queries and the corresponding plans at http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.pdf or http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.tex if you prefer text (latex file, effectively text in this case) The background to this is at http://bulldog.duhs.duke.edu/~faheem/snpdb/diag.pdf If more details are required, let me know and I can add them. I'd appreciate suggestions about how to make these queries go faster. Please CC this email address on any replies. Regards, Faheem. Hi Faheem, There appears to be a discrepancy between the 2 PDFs you provided. One says you're using PostgreSQL 8.3, and the other shows you using common table expressions, which are only available in 8.4+. Thom
Re: [PERFORM] query optimization
On Mon, 23 Nov 2009, Thom Brown wrote: Hi Faheem, There appears to be a discrepancy between the 2 PDFs you provided. One says you're using PostgreSQL 8.3, and the other shows you using common table expressions, which are only available in 8.4+. Yes, sorry. I'm using Postgresql 8.4. I guess I should go through diag.pdf and make sure all the information is current. Thanks for pointing out my error. Regards, Faheem. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] query optimization
On Tue, Nov 24, 2009 at 12:49 AM, Faheem Mitha fah...@email.unc.edu wrote: Yes, sorry. I'm using Postgresql 8.4. I guess I should go through diag.pdf and make sure all the information is current. Thanks for pointing out my error. Excellent report! About the copy problem: you seem to have created the primary key before doing the copy (at least that's what the dump before copy says). This is bad. Create it after the copy. Greetings Marcin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
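Marcin's load-then-index pattern looks roughly like this; the table definition and file path are placeholders for illustration, not taken from Faheem's schema:

```sql
-- Create the table without any index or primary key.
CREATE TABLE geno_load (
    id      bigint,
    anno_id bigint
);

-- Bulk load: with no index present, COPY avoids per-row index maintenance.
COPY geno_load FROM '/tmp/geno.csv' WITH CSV;

-- Build the primary key index once, in a single pass over the loaded data.
ALTER TABLE geno_load ADD PRIMARY KEY (id);
ANALYZE geno_load;
```

Building the unique index after the load is generally much cheaper than maintaining it row by row during the COPY, which is why creating the primary key first is "bad" here.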
Re: [PERFORM] query optimization
How often are the tables you query from updated? Rgds Sebastian On Tue, Nov 24, 2009 at 12:52 AM, marcin mank marcin.m...@gmail.com wrote: On Tue, Nov 24, 2009 at 12:49 AM, Faheem Mitha fah...@email.unc.edu wrote: Yes, sorry. I'm using Postgresql 8.4. I guess I should go through diag.pdf and make sure all the information is current. Thanks for pointing out my error. excellent report! about the copy problem: You seem to have created the primary key before doing the copy (at least that`s what the dump before copy says). This is bad. Create it after the copy. Greetings Marcin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] query optimization
On Tue, 24 Nov 2009, Sebastian Jörgensen wrote: How often are the tables you query from updated? Quite rarely. Once in a while. The large tables, eg. geno, are basically static. Regards, Faheem. Rgds Sebastian On Tue, Nov 24, 2009 at 12:52 AM, marcin mank marcin.m...@gmail.com wrote: On Tue, Nov 24, 2009 at 12:49 AM, Faheem Mitha fah...@email.unc.edu wrote: Yes, sorry. I'm using Postgresql 8.4. I guess I should go through diag.pdf and make sure all the information is current. Thanks for pointing out my error. excellent report! about the copy problem: You seem to have created the primary key before doing the copy (at least that`s what the dump before copy says). This is bad. Create it after the copy. Greetings Marcin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Query optimization
I am struggling with the following query, which fetches a random subset of 200 questions that match certain tags within certain languages. However, the query takes forever to evaluate, even though I have a LIMIT 200 appended. Any ideas on how to optimize it?

QUERY:

SELECT distinct q.question_id
FROM question_tags qt, questions q
WHERE q.question_id = qt.question_id
  AND q.STATUS = 1
  AND not q.author_id = 105
  AND ((qt.language_id = 5 and qt.tag_id in (1101,917,734,550,367,183))
    or (qt.language_id = 4 and qt.tag_id in (856,428))
    or (qt.language_id = 3 and qt.tag_id in (1156,1028,899,771,642,514,385,257,128))
    or (qt.language_id = 2 and qt.tag_id in (1193,1101,1009,917,826,734,642,550,458,367,275,183,91)))
  and q.question_id not in (413)
LIMIT 200

EXPLAIN ANALYZE:

Limit  (cost=1.50..1267.27 rows=200 width=4) (actual time=278.169..880.934 rows=200 loops=1)
  ->  Unique  (cost=1.50..317614.50 rows=50185 width=4) (actual time=278.165..880.843 rows=200 loops=1)
        ->  Merge Join  (cost=1.50..317489.04 rows=50185 width=4) (actual time=278.162..880.579 rows=441 loops=1)
              Merge Cond: (qt.question_id = q.question_id)
              ->  Index Scan using question_tags_question_id on question_tags qt  (cost=0.00..301256.96 rows=82051 width=4) (actual time=24.171..146.811 rows=6067 loops=1)
                    Filter: (((language_id = 5) AND (tag_id = ANY ('{1101,917,734,550,367,183}'::integer[]))) OR ((language_id = 4) AND (tag_id = ANY ('{856,428}'::integer[]))) OR ((language_id = 3) AND (tag_id = ANY ('{1156,1028,899,771,642,514,385,257,128}'::integer[]))) OR ((language_id = 2) AND (tag_id = ANY ('{1193,1101,1009,917,826,734,642,550,458,367,275,183,91}'::integer[]))))
              ->  Index Scan using questions_pkey on questions q  (cost=0.00..15464.12 rows=83488 width=4) (actual time=222.956..731.737 rows=1000 loops=1)
                    Filter: ((q.author_id <> 105) AND (q.question_id <> 413) AND (q.status = 1))
Total runtime: 881.152 ms
(9 rows)

An index on (language_id, tag_id) should be the first thing to try.
Or perhaps even (status, language_id, tag_id) or (language_id, tag_id, status) (but that depends on the stats on the status column). An index on author_id will probably not be useful for this particular query because your condition is author_id != constant. Also CLUSTER question_tags on (language_id, tag_id). What is the database size versus RAM? You must have a hell of a lot of questions to make this slow... (or bloat?) -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
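Spelled out, the suggested index and CLUSTER would look something like this (the index name is arbitrary, and `CLUSTER ... USING` is the modern spelling; older releases used `CLUSTER index ON table`):

```sql
-- Composite index matching the per-language tag filter, so the planner can
-- use index conditions instead of filtering every scanned row.
CREATE INDEX question_tags_lang_tag_idx
    ON question_tags (language_id, tag_id);

-- Physically reorder the table to match the index, then refresh statistics.
CLUSTER question_tags USING question_tags_lang_tag_idx;
ANALYZE question_tags;
```

CLUSTER rewrites the table, so it takes an exclusive lock and is best run during a quiet period; it mainly helps when the query touches many rows that the index would otherwise scatter across the heap.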
[PERFORM] Query optimization
I am struggling with the following query, which fetches a random subset of 200 questions that match certain tags within certain languages. However, the query takes forever to evaluate, even though I have a LIMIT 200 appended. Any ideas on how to optimize it?

QUERY:

SELECT distinct q.question_id
FROM question_tags qt, questions q
WHERE q.question_id = qt.question_id
  AND q.STATUS = 1
  AND not q.author_id = 105
  AND ((qt.language_id = 5 and qt.tag_id in (1101,917,734,550,367,183))
    or (qt.language_id = 4 and qt.tag_id in (856,428))
    or (qt.language_id = 3 and qt.tag_id in (1156,1028,899,771,642,514,385,257,128))
    or (qt.language_id = 2 and qt.tag_id in (1193,1101,1009,917,826,734,642,550,458,367,275,183,91)))
  and q.question_id not in (413)
LIMIT 200

EXPLAIN ANALYZE:

Limit  (cost=1.50..1267.27 rows=200 width=4) (actual time=278.169..880.934 rows=200 loops=1)
  ->  Unique  (cost=1.50..317614.50 rows=50185 width=4) (actual time=278.165..880.843 rows=200 loops=1)
        ->  Merge Join  (cost=1.50..317489.04 rows=50185 width=4) (actual time=278.162..880.579 rows=441 loops=1)
              Merge Cond: (qt.question_id = q.question_id)
              ->  Index Scan using question_tags_question_id on question_tags qt  (cost=0.00..301256.96 rows=82051 width=4) (actual time=24.171..146.811 rows=6067 loops=1)
                    Filter: (((language_id = 5) AND (tag_id = ANY ('{1101,917,734,550,367,183}'::integer[]))) OR ((language_id = 4) AND (tag_id = ANY ('{856,428}'::integer[]))) OR ((language_id = 3) AND (tag_id = ANY ('{1156,1028,899,771,642,514,385,257,128}'::integer[]))) OR ((language_id = 2) AND (tag_id = ANY ('{1193,1101,1009,917,826,734,642,550,458,367,275,183,91}'::integer[]))))
              ->  Index Scan using questions_pkey on questions q  (cost=0.00..15464.12 rows=83488 width=4) (actual time=222.956..731.737 rows=1000 loops=1)
                    Filter: ((q.author_id <> 105) AND (q.question_id <> 413) AND (q.status = 1))
Total runtime: 881.152 ms
(9 rows)

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Query optimization
On Sunday 30 November 2008 19:45:11, tmp wrote: I am struggling with the following query, which fetches a random subset of 200 questions that match certain tags within certain languages. However, the query takes forever to evaluate, even though I have a LIMIT 200 appended. Any ideas on how to optimize it?

QUERY:

SELECT distinct q.question_id
FROM question_tags qt, questions q
WHERE q.question_id = qt.question_id
  AND q.STATUS = 1
  AND not q.author_id = 105
  AND ((qt.language_id = 5 and qt.tag_id in (1101,917,734,550,367,183))
    or (qt.language_id = 4 and qt.tag_id in (856,428))
    or (qt.language_id = 3 and qt.tag_id in (1156,1028,899,771,642,514,385,257,128))
    or (qt.language_id = 2 and qt.tag_id in (1193,1101,1009,917,826,734,642,550,458,367,275,183,91)))
  and q.question_id not in (413)
LIMIT 200

EXPLAIN ANALYZE:

Limit  (cost=1.50..1267.27 rows=200 width=4) (actual time=278.169..880.934 rows=200 loops=1)
  ->  Unique  (cost=1.50..317614.50 rows=50185 width=4) (actual time=278.165..880.843 rows=200 loops=1)
        ->  Merge Join  (cost=1.50..317489.04 rows=50185 width=4) (actual time=278.162..880.579 rows=441 loops=1)
              Merge Cond: (qt.question_id = q.question_id)
              ->  Index Scan using question_tags_question_id on question_tags qt  (cost=0.00..301256.96 rows=82051 width=4) (actual time=24.171..146.811 rows=6067 loops=1)
                    Filter: (((language_id = 5) AND (tag_id = ANY ('{1101,917,734,550,367,183}'::integer[]))) OR ((language_id = 4) AND (tag_id = ANY ('{856,428}'::integer[]))) OR ((language_id = 3) AND (tag_id = ANY ('{1156,1028,899,771,642,514,385,257,128}'::integer[]))) OR ((language_id = 2) AND (tag_id = ANY ('{1193,1101,1009,917,826,734,642,550,458,367,275,183,91}'::integer[]))))
              ->  Index Scan using questions_pkey on questions q  (cost=0.00..15464.12 rows=83488 width=4) (actual time=222.956..731.737 rows=1000 loops=1)
                    Filter: ((q.author_id <> 105) AND (q.question_id <> 413) AND (q.status = 1))
Total runtime: 881.152 ms
(9 rows)

First, because of the DISTINCT, the LIMIT 200 won't reduce the work to be done much: it will still have to sort for the Unique step, and only when everything is sorted will it take only the first 200 records. In any case it seems there are only 441 rows before doing the distinct, so, at least for this query, it won't change the timings a lot. Then it seems to me that you could try creating composite indexes, because there is a lot of filtering after the index scans (that is, if you want the query to be faster). Maybe (q.author_id, q.status). For qt you could try (qt.language_id, qt.tag_id)... Hope it helps Cheers -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Query Optimization with Kruskal’s Algorithm
Hello friends, I'm working on optimizing queries using the Kruskal algorithm ( http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4318118). I ran several tests against the database itself and saw interesting results. I did 10 executions of each query, first with the unchanged Postgres source and then with the source adapted to the Kruskal algorithm. The query I used is composed of 12 tables and 11 joins.

Results with unmodified PostgreSQL (ms, via \timing): 170.690, 168.214, 182.832, 166.172, 174.466, 167.143, 167.287, 172.891, 170.452, 165.665; average = 170.5812 ms

Results with the Kruskal algorithm (ms, via \timing): 520.590, 13.533, 8.410, 5.162, 5.543, 4.999, 9.871, 4.984, 5.010, 8.883; average = 58.6985 ms

As you can see from the results, with the Kruskal algorithm the first execution takes much more time to return results. This does not occur with the original Postgres source. So what is the best method to conduct the tests? Should I take the average of the 10 executions into consideration, or just the first one? Do you think I must clear the cache after each query? (Because the other nine executions may be served from information already in memory.) regards, Tarcizio Bini.
Re: [PERFORM] Query Optimization with Kruskal’s Algorithm
On 5/7/08, Tarcizio Bini [EMAIL PROTECTED] wrote: I'm working on optimizing queries using the Kruskal algorithm (http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4318118). That paper looks very interesting. I would love to hear what the PostgreSQL committers think of this algorithm. Alexander. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Query Optimization
Dear Sirs, I am doing a project on optimizing the pg-sql query engine with compressed annealing. I would like to know of any deficiencies in the existing GEQO. If there are any TODO items remaining in GEQO, kindly brief me on the same. Awaiting discussions on this. GN
[PERFORM] Query optimization....
This is a followup to a question I put forward here on performance, which I traced to the stats bug (and fixed it). Now I'm trying to optimize that query and... I'm getting confused fast... I have the following (fairly complex) statement which is run with some frequency:

select post.forum, post.subject, post.replied
from post
where toppost = 1
  and (replied > (select lastview from forumlog
                  where login = 'someone'
                    and forum = post.forum
                    and number is null)) is not false
  and (replied > (select lastview from forumlog
                  where login = 'someone'
                    and forum = post.forum
                    and number = post.number)) is not false
order by pinned desc, replied desc;

This gives me exactly what I'm looking for BUT can be quite slow. The forumlog table has one tuple for each post and user; it has the fields forum, number, login and lastview. The post items have a forum, number and replied field (which is used to match the lastview one.) When you look at a post (which may have replies) the application updates your existing entry in that table if there is one, or INSERTs a new tuple if not. Therefore, for each post you have viewed, there is a tuple in the forumlog table which represents the last time you looked at that item. The problem is that for a person who has NOT visited a specific thread of discussion, there is no forumlog entry for that person and post in the table. Thus, to get all posts which (1) you've not seen at all, or (2) you've seen but someone has added to since you saw them, the above complex query is what I've come up with; there may be a null table entry which is a wildcard match if it's present - if there is no match then the item also must be treated as new. The above statement works - but it's slow. The following query is VERY fast but only returns those in which there IS an entry in the table (e.g. you've visited the item at least once):

select post.forum, post.subject, post.replied
from post, forumlog
where post.number = forumlog.number
  and post.toppost = 1
  and post.replied > forumlog.lastview
  and forumlog.login = 'someone'
order by pinned desc, replied desc;

What I haven't been able to figure out is how to structure a query that is both fast and will return the posts for which you DO NOT have a matching entry in the forumlog table for the specific post but DO either (1) match the null number entry (that is, they're posted later than that) OR (2) have no match at all. (The first statement matches these other two cases.) Any ideas? (It's OK if the queries are separate; in other words, it's cool if I have to execute two or even three queries and get the results separately - in fact, that might be preferable in some circumstances) Ideas? -- Karl Denninger ([EMAIL PROTECTED]) http://www.denninger.net ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
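One common shape for the "no matching forumlog entry at all" case is an anti-join with NOT EXISTS; this is an untested sketch against the schema Karl describes, and it would still need to be combined (e.g. via UNION) with the fast join query and with the number-IS-NULL wildcard check to cover all three cases:

```sql
-- Top posts with no per-post forumlog row for this user (anti-join).
-- The planner can execute NOT EXISTS as a hash or merge anti-join,
-- which is usually far cheaper than a correlated subselect per row.
SELECT p.forum, p.subject, p.replied
FROM post p
WHERE p.toppost = 1
  AND NOT EXISTS (SELECT 1
                  FROM forumlog f
                  WHERE f.login = 'someone'
                    AND f.forum = p.forum
                    AND f.number = p.number)
ORDER BY p.pinned DESC, p.replied DESC;
```

An index on forumlog (login, forum, number) would let the anti-join probe or scan efficiently.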
Re: [PERFORM] Query Optimization
It's not necessarily the join order that's an issue; it could also be due to the merge join that it does in the first case. I've also run into situations where the cost estimate for a merge join is way off the mark. Rather than forcing the join order, you might try setting enable_mergejoin=false.

On Mon, Feb 19, 2007 at 06:03:22PM +0100, Reinhard Vicinus wrote: PostgreSQL version: 8.1.6 OS: Debian etch The following query needs a lot of time because the query planner reorders the joins: select m.message_idnr, v.headervalue, n.headername from dbmail_messages m join dbmail_headervalue v ON v.physmessage_id=m.physmessage_id join dbmail_headername n ON v.headername_id=n.id where m.mailbox_idnr = 1022 AND message_idnr BETWEEN 698928 AND 1496874 and lower(n.headername) IN ('from','to','cc','subject','date','message-id', 'priority','x-priority','references','newsgroups','in-reply-to', 'content-type','x-spam-status','x-spam-flag'); If I prevent the query planner from reordering the joins with 'set join_collapse_limit=1;' then the same query is faster. At the end of the mail is the output of an explain analyze for both cases. The statistics of the database are updated each night. Is there an error (in the statistical data) which is responsible for the reordering of the joins? And if not, are there other alternatives for preventing join reordering?
Thanks Reinhard

Explain analyze with set join_collapse_limit=8:

Merge Join (cost=388657.62..391332.20 rows=821 width=127) (actual time=82677.950..89103.192 rows=2699 loops=1)
  Merge Cond: ("outer".physmessage_id = "inner".physmessage_id)
  -> Sort (cost=2901.03..2902.61 rows=632 width=16) (actual time=247.238..247.578 rows=373 loops=1)
       Sort Key: m.physmessage_id
       -> Bitmap Heap Scan on dbmail_messages m (cost=9.16..2871.63 rows=632 width=16) (actual time=38.072..246.509 rows=373 loops=1)
            Recheck Cond: (mailbox_idnr = 1022)
            Filter: ((message_idnr >= 698928) AND (message_idnr <= 1496874))
            -> Bitmap Index Scan on dbmail_messages_8 (cost=0.00..9.16 rows=902 width=0) (actual time=25.561..25.561 rows=615 loops=1)
                 Index Cond: (mailbox_idnr = 1022)
  -> Sort (cost=385756.58..387089.35 rows=533108 width=127) (actual time=80156.731..85760.186 rows=3278076 loops=1)
       Sort Key: v.physmessage_id
       -> Hash Join (cost=51.00..285787.17 rows=533108 width=127) (actual time=34.519..28260.855 rows=3370242 loops=1)
            Hash Cond: ("outer".headername_id = "inner".id)
            -> Seq Scan on dbmail_headervalue v (cost=0.00..241200.39 rows=7840939 width=115) (actual time=0.006..16844.479 rows=7854485 loops=1)
            -> Hash (cost=50.72..50.72 rows=113 width=28) (actual time=34.493..34.493 rows=35 loops=1)
                 -> Bitmap Heap Scan on dbmail_headername n (cost=28.44..50.72 rows=113 width=28) (actual time=11.796..34.437 rows=35 loops=1)
                      Recheck Cond: ((lower((headername)::text) = 'from'::text) OR (lower((headername)::text) = 'to'::text) OR (lower((headername)::text) = 'cc'::text) OR (lower((headername)::text) = 'subject'::text) OR (lower((headername)::text) = 'date'::text) OR (lower((headername)::text) = 'message-id'::text) OR (lower((headername)::text) = 'priority'::text) OR (lower((headername)::text) = 'x-priority'::text) OR (lower((headername)::text) = 'references'::text) OR (lower((headername)::text) = 'newsgroups'::text) OR (lower((headername)::text) = 'in-reply-to'::text) OR (lower((headername)::text) = 'content-type'::text) OR (lower((headername)::text) = 'x-spam-status'::text) OR (lower((headername)::text) = 'x-spam-flag'::text))
                      -> BitmapOr (cost=28.44..28.44 rows=116 width=0) (actual time=11.786..11.786 rows=0 loops=1)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.037..0.037 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'from'::text)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.013..0.013 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'to'::text)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.013..0.013 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'cc'::text)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.014..0.014 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'subject'::text)
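Both workarounds mentioned in this thread are plain session-level settings, so they can be tried without touching postgresql.conf. A sketch (SET LOCAL scopes the change to the transaction; the trailing query stands in for the one under test):

```sql
BEGIN;
SET LOCAL join_collapse_limit = 1;   -- planner keeps the join order as written
SET LOCAL enable_mergejoin = off;    -- or: rule out the mis-costed merge join
EXPLAIN ANALYZE SELECT ...;          -- the query under test
COMMIT;                              -- settings revert automatically
```

Comparing the EXPLAIN ANALYZE output with and without each setting shows which one the bad plan actually depends on.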
[PERFORM] Query Optimization
PostgreSQL version: 8.1.6 OS: Debian etch

The following query needs a lot of time because the query planner reorders the joins:

select m.message_idnr, v.headervalue, n.headername from dbmail_messages m join dbmail_headervalue v ON v.physmessage_id=m.physmessage_id join dbmail_headername n ON v.headername_id=n.id where m.mailbox_idnr = 1022 AND message_idnr BETWEEN 698928 AND 1496874 and lower(n.headername) IN ('from','to','cc','subject','date','message-id', 'priority','x-priority','references','newsgroups','in-reply-to', 'content-type','x-spam-status','x-spam-flag');

If I prevent the query planner from reordering the joins with 'set join_collapse_limit=1;' then the same query is faster. At the end of the mail is the output of an explain analyze for both cases. The statistics of the database are updated each night. Is there an error (in the statistical data) which is responsible for the reordering of the joins? And if not, are there other alternatives for preventing join reordering?

Thanks Reinhard

Explain analyze with set join_collapse_limit=8:

Merge Join (cost=388657.62..391332.20 rows=821 width=127) (actual time=82677.950..89103.192 rows=2699 loops=1)
  Merge Cond: ("outer".physmessage_id = "inner".physmessage_id)
  -> Sort (cost=2901.03..2902.61 rows=632 width=16) (actual time=247.238..247.578 rows=373 loops=1)
       Sort Key: m.physmessage_id
       -> Bitmap Heap Scan on dbmail_messages m (cost=9.16..2871.63 rows=632 width=16) (actual time=38.072..246.509 rows=373 loops=1)
            Recheck Cond: (mailbox_idnr = 1022)
            Filter: ((message_idnr >= 698928) AND (message_idnr <= 1496874))
            -> Bitmap Index Scan on dbmail_messages_8 (cost=0.00..9.16 rows=902 width=0) (actual time=25.561..25.561 rows=615 loops=1)
                 Index Cond: (mailbox_idnr = 1022)
  -> Sort (cost=385756.58..387089.35 rows=533108 width=127) (actual time=80156.731..85760.186 rows=3278076 loops=1)
       Sort Key: v.physmessage_id
       -> Hash Join (cost=51.00..285787.17 rows=533108 width=127) (actual time=34.519..28260.855 rows=3370242 loops=1)
            Hash Cond: ("outer".headername_id = "inner".id)
            -> Seq Scan on dbmail_headervalue v (cost=0.00..241200.39 rows=7840939 width=115) (actual time=0.006..16844.479 rows=7854485 loops=1)
            -> Hash (cost=50.72..50.72 rows=113 width=28) (actual time=34.493..34.493 rows=35 loops=1)
                 -> Bitmap Heap Scan on dbmail_headername n (cost=28.44..50.72 rows=113 width=28) (actual time=11.796..34.437 rows=35 loops=1)
                      Recheck Cond: ((lower((headername)::text) = 'from'::text) OR (lower((headername)::text) = 'to'::text) OR (lower((headername)::text) = 'cc'::text) OR (lower((headername)::text) = 'subject'::text) OR (lower((headername)::text) = 'date'::text) OR (lower((headername)::text) = 'message-id'::text) OR (lower((headername)::text) = 'priority'::text) OR (lower((headername)::text) = 'x-priority'::text) OR (lower((headername)::text) = 'references'::text) OR (lower((headername)::text) = 'newsgroups'::text) OR (lower((headername)::text) = 'in-reply-to'::text) OR (lower((headername)::text) = 'content-type'::text) OR (lower((headername)::text) = 'x-spam-status'::text) OR (lower((headername)::text) = 'x-spam-flag'::text))
                      -> BitmapOr (cost=28.44..28.44 rows=116 width=0) (actual time=11.786..11.786 rows=0 loops=1)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.037..0.037 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'from'::text)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.013..0.013 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'to'::text)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.013..0.013 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'cc'::text)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.014..0.014 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'subject'::text)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.014..0.014 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'date'::text)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.019..0.019 rows=4 loops=1)
Re: [PERFORM] Query Optimization
Reinhard Vicinus [EMAIL PROTECTED] writes: PostgreSQL version: 8.1.6 The following query needs a lot of time because the query planner reorders the joins: Try reducing random_page_cost, increasing effective_cache_size, and/or updating to PG 8.2. Any of these are likely to make it like the nestloop plan better... regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
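For anyone trying Tom's suggestions, both settings can be tested per-session before committing anything to postgresql.conf. The values below are purely illustrative, not recommendations:

```sql
-- Illustrative values only; tune to your own hardware.
SET random_page_cost = 2.0;         -- default is 4.0; lower favors index scans
SET effective_cache_size = 131072;  -- in 8kB pages on 8.1: ~1GB of OS cache
EXPLAIN ANALYZE SELECT ...;         -- re-check which plan the planner picks
```

Lowering random_page_cost and raising effective_cache_size both make index-driven nestloop plans look cheaper to the planner, which is the effect Tom describes.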
[PERFORM] Query optimization with X Y JOIN
Hey guys, how u been. This is quite a newbie question, but I need to ask it. I'm trying to wrap my mind around the syntax of join and why and when to use it. I understand the concept of making a query go faster by creating indexes, but it seems that when I want data from multiple tables that link together, the query goes slow. The slowness is typically due to expensive nested loops. The reason is, all my brain understands is: select tablea.data, tableb.data, tablec.data from tablea, tableb, tablec where tablea.pri_key = tableb.foreign_key AND tableb.pri_key = tablec.foreign_key AND... From what I read, it seems you can use inner/outer right/left join on (bla), but when I see syntax examples I see that sometimes tables are omitted from the 'from' section of the query and other times, no. Sometimes I see that the join commands are nested and others, no, and sometimes I see join syntax that only applies to one table. From what I understand, join can be used to tell the database the fast way to merge table data together to get results, by specifying the table that has the primary keys and the table that has the foreign keys. I've read all through the postgres docs on this command and I'm still left lost. Can someone please explain to me in simple language how to use these commands or provide me with a link? I need it to live right now. Thanx.
Re: [PERFORM] Query optimization with X Y JOIN
First, this isn't really the right place to ask -- this forum is about performance, not SQL syntax. Second, this isn't a question anyone can answer in a reasonable length of time. What you're asking for usually is taught in a class on relational database theory, which is typically a semester or two in college. If you really need a crash course, dig around on the web for terms like SQL Tutorial. Good luck, Craig

[EMAIL PROTECTED] wrote: [original message snipped]

---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Query optimization with X Y JOIN
If I want my database to go faster, due to X, then I would think that the issue is about performance. I wasn't aware of a particular constraint on X. I have more than a rudimentary understanding of what's going on here; I was just hoping that someone could shed some light on the basic principle of this JOIN command and its syntax. Most people I ask don't give me straight answers, and what I have already read on the web is not very helpful thus far.

- Original Message - From: Craig A. James [EMAIL PROTECTED] To: pgsql-performance@postgresql.org Sent: Thursday, January 26, 2006 11:12 AM Subject: Re: [PERFORM] Query optimization with X Y JOIN

[quoted reply snipped]

---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Query optimization with X Y JOIN
[EMAIL PROTECTED] wrote: If I want my database to go faster, due to X, then I would think that the issue is about performance. I wasn't aware of a particular constraint on X. I have more than a rudimentary understanding of what's going on here; I was just hoping that someone could shed some light on the basic principle of this JOIN command and its syntax. Most people I ask don't give me straight answers, and what I have already read on the web is not very helpful thus far.

What you are looking for is here: http://sqlzoo.net/ It is an excellent website that discusses, in depth but at a tutorial level, what SQL is and how to use it - including JOINs. FYI, a JOIN is basically a FROM with an integrated WHERE clause. That is a very simplified description and isn't 100% accurate, but it is close. I strongly suggest the website I mentioned above, as it will resolve your question. Joshua D. Drake -- The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564 PostgreSQL Replication, Consulting, Custom Development, 24x7 support Managed Services, Shared and Dedicated Hosting Co-Authors: PLphp, PLperl - http://www.commandprompt.com/ ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
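Joshua's "a JOIN is basically a FROM with an integrated WHERE clause" can be shown concretely. Using the generic table names from the original question (these are not real tables), the two forms below return the same rows for an inner join:

```sql
-- Implicit join: plain FROM list plus a WHERE condition.
SELECT a.data, b.data
FROM tablea a, tableb b
WHERE a.pri_key = b.foreign_key;

-- Explicit JOIN: the same condition moves into the ON clause.
SELECT a.data, b.data
FROM tablea a
JOIN tableb b ON a.pri_key = b.foreign_key;
```

The planner treats both identically for inner joins; the explicit syntax mainly matters once outer joins enter the picture, where it is the only way to express the query.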
Re: [PERFORM] Query optimization with X Y JOIN
[EMAIL PROTECTED] wrote: If I want my database to go faster, due to X, then I would think that the issue is about performance. I wasn't aware of a particular constraint on X. You haven't asked a performance question yet though. I have more than a rudimentary understanding of what's going on here; I was just hoping that someone could shed some light on the basic principle of this JOIN command and its syntax. Most people I ask don't give me straight answers, and what I have already read on the web is not very helpful thus far.

OK - firstly, it's not a JOIN command. It's a SELECT query that happens to join (in your example) three tables together. The syntax is specified in the SQL reference section of the manuals, and I don't think it's different from the standard SQL spec here. A query that joins two or more tables (be they real base-tables, views or sub-query result-sets) produces the Cartesian product of the two. Normally you don't want this, so you apply constraints to that join (table_a.col1 = table_b.col2). In some cases you want all the rows from one side of a join, whether or not you get a match on the other side of the join. This is called an outer join and results in NULLs for all the columns on the outside of the join. A left-join returns all rows from the table on the left of the join, a right-join from the table on the right of it. When planning a join, the planner will try to estimate how many matches it will see on each side, taking into account any extra constraints (you might want only some of the rows in table_a anyway). It then decides whether to use any indexes on the relevant column(s). Now, if you think the planner is making a mistake, we'll need to see the output of EXPLAIN ANALYSE for the query, and will want to know that you've vacuumed and analysed the tables in question. Does that help at all? -- Richard Huxton Archonet Ltd ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
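Richard's outer-join description in concrete form (generic table and column names, not from the original post):

```sql
-- Every row of table_a survives the join; where table_b has no matching
-- row, b.col2 comes back as NULL instead of the row being dropped.
SELECT a.col1, b.col2
FROM table_a a
LEFT JOIN table_b b ON a.col1 = b.col2;
```

A handy corollary: adding `WHERE b.col2 IS NULL` to the above keeps only the table_a rows with no match at all, which is a common way to find "orphan" rows.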
Re: [PERFORM] Query optimization with X Y JOIN
Yes, that helps a great deal. Thank you so much.

- Original Message - From: Richard Huxton dev@archonet.com To: [EMAIL PROTECTED] Cc: pgsql-performance@postgresql.org Sent: Thursday, January 26, 2006 11:47 AM Subject: Re: [PERFORM] Query optimization with X Y JOIN

[quoted reply snipped]

---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[PERFORM] Query Optimization - Hash Join estimate off?
Hello! First off, I'm a real newbie at trying to read the output of explain analyze. I have several similar queries in my application that I've got incorporated into views. When they run sub 300ms, the users don't seem to mind. However, one of them (query is below along with some relevant table information) is running about 800ms and my users are starting to grumble. I ran explain analyze on it (explain analyze results are below). I noticed that the biggest chunk of time is being taken by a Hash Join near the top of the output (I'm still not sure what the indentation means and what the order means). If I look at the estimate, it is comparable to several other hash join estimates in the query; however, the actual cost in time is significantly higher than those other hash joins. Is this significant? I tried optimizing according to "SQL Tuning" by Tow, but this actually seemed to slow things down. It also seemed that the query optimizer in PostgreSQL reordered things on its own according to its own plan anyway. Is this correct? I'd appreciate any help I can get to try to get this query below 300ms. Thanks! Mark The platform is a dual 2.2GHz Xeon 1.2GB RAM with mirrored drives (raid 1) running Win2000 Pro. I run "vacuum analyze" every night. The postgresql.conf is basically standard except that I've opened it up to listen to the external network. 
Other changes: max_connections = 100 shared_buffers = 1

query (the person_id = 1 in the where clause is changed on a case by case basis - depending upon who's running the query):

explain analyze SELECT DISTINCT c.job_id, g.person_id, c.job_no, b.deadline, c.name, bid_date(c.job_id) AS bid_date, c.miscq, c.city, c.st, j.name AS eng, c.s_team AS salesteam, CASE WHEN c.file_loc = 0 THEN 'No Bid'::character varying WHEN c.file_loc = -1 THEN 'Bid Board'::character varying WHEN c.file_loc = -2 THEN 'Lost Job'::character varying WHEN c.file_loc = -3 THEN 'See Job Notes'::character varying WHEN c.file_loc < -3 OR c.file_loc IS NULL THEN ''::character varying WHEN h.initials IS NOT NULL THEN h.initials ELSE 'Unknown person'::character varying END AS file_loc, COALESCE(c.city::text || COALESCE(', '::text || c.st::text, ''::text), COALESCE(c.st, ''::character varying)::text) AS "location", c.file_loc AS file_loc_id FROM status a LEFT JOIN status_list b ON a.status_id = b.status_id AND b.active LEFT JOIN job c ON c.job_id = b.job_id LEFT JOIN builder_list d ON c.job_id = d.job_id AND (d.won_heat OR d.won_vent OR d.won_tc OR c.heat AND d.bid_heat AND d.won_heat IS NULL OR c.vent AND d.bid_vent AND d.won_vent IS NULL OR c.tc AND d.bid_tc AND d.won_tc IS NULL) AND d.role = 'C'::bpchar LEFT JOIN company e ON d.company_id = e.company_id LEFT JOIN call_list f ON e.company_id = f.company_id LEFT JOIN person g ON f.person_id = g.person_id OR "position"(c.s_team::text, g.initials::text) > 0 LEFT JOIN person h ON c.file_loc = h.person_id LEFT JOIN builder_list i ON c.job_id = i.job_id AND i.role = 'E'::bpchar LEFT JOIN company j ON i.company_id = j.company_id WHERE a.name::text = 'Awaiting Award'::character varying::text and g.person_id = 1 ORDER BY c.job_id, g.person_id, c.job_no, b.deadline, c.name, bid_date(c.job_id), c.miscq, c.city, COALESCE(c.city::text || COALESCE(', '::text || c.st::text, ''::text), COALESCE(c.st, ''::character varying)::text), c.st, CASE WHEN c.file_loc = 0 THEN 'No Bid'::character varying WHEN c.file_loc = -1 THEN 'Bid Board'::character varying WHEN c.file_loc = -2 THEN 'Lost Job'::character varying WHEN c.file_loc = -3 THEN 'See Job Notes'::character varying WHEN c.file_loc < -3 OR c.file_loc IS NULL THEN ''::character varying WHEN h.initials IS NOT NULL THEN h.initials ELSE 'Unknown person'::character varying END, j.name, c.s_team, c.file_loc;

Tables: status - 14 rows, status_list - 6566 rows, job - 2210 rows, builder_list - 9670 rows, company - 1249 rows, call_list - 4731 rows, person - 27 rows

Primary keys: any field with a "_id" suffix is a primary key, and thus is implicitly indexed. Other indexes: status_list(job_id) btree, status_list(status_id) btree, job(file_loc) btree, builder_list(company_id) btree, call_list(company_id) btree, call_list(person_id) btree, person(company_id) btree

explain analyze:

Unique (cost=1798.47..1809.38 rows=291 width=114) (actual time=766.000..781.000 rows=566 loops=1)
  -> Sort (cost=1798.47..1799.19 rows=291 width=114) (actual time=766.000..766.000 rows=1473 loops=1)
       Sort Key: c.job_id, g.person_id, c.job_no, b.deadline, c.name, bid_date(c.job_id), c.miscq, c.city, COALESCE(((c.city)::text || COALESCE((', '::text || (c.st)::text), ''::text)), (COALESCE(c.st, ''::character varying))::text), c.st, CASE WHEN (c.fi (..)
       -> Hash Left Join (cost=1750.81..1786.56 rows=291 width=114) (actual time=453.000..750.000 rows=1473 loops=1)
            Hash Cond: ("outer".company_id = "inner".company_id)
            -> Merge Left Join (cost=1707.20..1722.53 rows=291 width=95) (actual time=437.000..484.000 rows=1473 loops=1)
                 Merge Cond:
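On the indentation question: each "->" line in EXPLAIN ANALYZE output is a plan node feeding the node one level up, so the Hash Left Join near the top consumes the output of everything indented beneath it, and the outermost (least indented) node finishes last. A quick, session-local way to test whether the hash join itself is the problem (sketch only; the SELECT stands in for the view's query):

```sql
SET enable_hashjoin = off;   -- steer the planner away from hash joins
EXPLAIN ANALYZE SELECT ...;  -- re-run the report query and compare timings
SET enable_hashjoin = on;    -- restore the default for this session
```

If the query gets faster with hash joins disabled, the cost estimate on that node is likely off and the usual levers (statistics targets, work_mem, planner cost settings) are worth a look; if it gets slower, the hash join was the right call and the time is genuinely in the data volume.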
[PERFORM] query optimization help
Hi All, I have the following query to generate a report grouped by "states".

SELECT distinct upper(cd.state) as mystate, SUM(d.amount) as total_amount, SUM(COALESCE(d.fee,0) + COALESCE(mp.seller_fee, 0) + COALESCE(mp.buyer_fee,0)) as total_fee FROM data d left JOIN customerdata cd ON d.uid = cd.uid LEFT JOIN merchant_purchase mp ON d.id = mp.data_id WHERE d.what IN (26,0,15) AND d.flags IN (1,9,10,12) AND lower(cd.country) = 'us' AND date_part('year',d.time) = 2004 GROUP BY mystate ORDER BY mystate;

 mystate | total_amount | total_fee
---------+--------------+-----------
         |         3695 |         0
 AR      |         3000 |         0
 AZ      |         1399 |         0
 CA      |       113100 |      6242
 FL      |       121191 |      9796
 GA      |     34826876 |    478888
 GEORGIA |        57990 |      3500
 IEIE    |       114000 |      4849
 MD      |            2 |      1158
 MI      |       906447 |         0
 NY      |         8000 |       600
 PA      |         6200 |       375
 SC      |        25000 |       600
 TN      |      1443681 |      1124
         |        13300 |         0
(15 rows)

If you notice, my problem in this query is that the records for GA and GEORGIA appear separately. But what I want to do is to have them combined into a single entry with their values summed up. Initially we had accepted both formats as input for the state field. Also, there are some invalid entries for the state field (like the "IEIE" and null values), which appear because the input for state was not validated initially. These entries have to be eliminated from the report. This query did not take a long time to complete, but did not meet the needs for the report.
So, the query was rewritten to the following query, which takes nearly 7-8 mins to complete on our test database:

SELECT (SELECT DISTINCT pc.state FROM postalcode pc WHERE UPPER(cd.state) IN (pc.state, pc.state_code)) as mystate, SUM(d.amount) as total_amount, SUM(COALESCE(d.fee,0) + COALESCE(mp.seller_fee, 0) + COALESCE(mp.buyer_fee,0)) as total_fee FROM data d JOIN customerdata cd ON d.uid = cd.uid LEFT JOIN merchant_purchase mp ON d.id = mp.data_id WHERE d.what IN (26,0,15) AND d.flags IN (1,9,10,12) AND lower(cd.country) = 'us' AND date_part('year', d.time) = 2004 GROUP BY mystate ORDER BY mystate;

 mystate        | total_amount | total_fee
----------------+--------------+-----------
 ARIZONA        |         1399 |         0
 ARKANSAS       |         3000 |         0
 CALIFORNIA     |       113100 |      6242
 FLORIDA        |       121191 |      9796
 GEORGIA        |     34884866 |    482388
 MARYLAND       |            2 |      1158
 MICHIGAN       |       906447 |         0
 NEW YORK       |         8000 |       600
 PENNSYLVANIA   |         6200 |       375
 SOUTH CAROLINA |        25000 |       600
 TENNESSEE      |      1443681 |      1124
                |       130995 |      4849

Here is the explain analyze of this query:

QUERY PLAN
Aggregate (cost=1226.57..1226.58 rows=1 width=38) (actual time=362355.58..362372.09 rows=12 loops=1)
  -> Group (cost=1226.57..1226.57 rows=1 width=38) (actual time=362355.54..362367.73 rows=2197 loops=1)
       -> Sort (cost=1226.57..1226.57 rows=1 width=38) (actual time=362355.53..362356.96 rows=2197 loops=1)
            Sort Key: (subplan)
            -> Nested Loop (cost=0.00..1226.56 rows=1 width=38) (actual time=166.11..362321.46 rows=2197 loops=1)
                 -> Nested Loop (cost=0.00..1220.53 rows=1 width=26) (actual time=1.68..361.32 rows=2115 loops=1)
                      -> Seq Scan on customerdata cd (cost=0.00..274.32 rows=31 width=10) (actual time=0.04..29.87 rows=3303 loops=1)
                           Filter: (lower((country)::text) = 'us'::text)
                      -> Index Scan using data_uid_idx on data d (cost=0.00..30.08 rows=1 width=16) (actual time=0.04..0.09 rows=1 loops=3303)
                           Index Cond: (d.uid = "outer".uid)
                           Filter: (((what = 26) OR (what = 0) OR (what = 15)) AND ((flags = 1) OR (flags = 9) OR (flags = 10) OR (flags = 12)) AND (date_part('year'::text, "time") = 2004::double precision))
                 -> Index Scan using merchant_purchase_data_idx on merchant_purchase mp (cost=0.00..6.01 rows=1 width=12) (actual time=0.05..0.05 rows=1 loops=2115)
                      Index Cond: ("outer".id = mp.data_id)
                 SubPlan
                   -> Unique (cost=2237.12..2243.22 rows=122 width=13) (actual time=161.25..164.68 rows=1 loops=2197)
                        -> Sort (cost=2237.12..2240.17 rows=1220 width=13) (actual time=161.21..161.88 rows=1033 loops=2197)
                             Sort Key: state
                             -> Seq Scan on postalcode pc (cost=0.00..2174.56 rows=1220 width=13) (actual time=35.79..148.33 rows=1033 loops=2197)
                                  Filter: ((upper(($0)::text) = (state)::text) OR (upper(($0)::text) = (state_code)::text))
Total runtime: 362372.57 msec

The postalcode table is used in the query to validate the states and to combine the entries like GA and GEORGIA.

\d postalcode
              Table "public.postalcode"
   Column   |         Type          | Modifiers
------------+-----------------------+-------------------------------------------------------------
 id         | integer               | not null default nextval('public.postalcode_id_seq'::text)
 country    | character(2)          |
 state      | character varying(30) |
 zipcode    | character varying(20) |
 city       | character varying(50) |
 city_alias | character varying(20) |
 state_code | character varying(2)  |
Indexes: postalcode_country_key unique btree (country, state_code, zipcode), postalcode_state_code_idx btree (state_code), postalcode_state_idx btree (state)

The postalcode table has 70328 rows! Can someone please help me
Re: [PERFORM] query optimization help
Please post in plaintext, not html where possible. Your group by clause was 'myst'...was that supposed to be mystate? Here is something you might try...use the original query form and create a function which resolves the state code from the input data...you are already doing that with upper. So,

create function get_state_code(text) returns char(2) as $$
  select case when length($1) = 2 then upper($1)
         else lookup_state_code($1) end;
$$ language sql stable;

lookup_state_code is a similar function which boils down to a select from a lookup table. Or, you could make a giant CASE statement (when GEORGIA then GA, etc.), and now your function becomes IMMUTABLE and should execute very fast. Just make sure all the states are spelled correctly in the original table via a domain constraint. Merlin

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of sarlav kumar Sent: Friday, January 14, 2005 9:40 AM To: pgsqlnovice; pgsqlperform Subject: [PERFORM] query optimization help

[original message snipped]

---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
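Merlin's two-function suggestion as a fuller sketch. Everything here is hypothetical: the states lookup table, its columns, and lookup_state_code are assumptions filled in for illustration (and $$ quoting assumes PostgreSQL 8.0+):

```sql
-- Hypothetical lookup table mapping full state names to two-letter codes.
CREATE TABLE states (name varchar(30) PRIMARY KEY, code char(2) NOT NULL);

-- Resolve a full name ('GEORGIA') to its code ('GA'); NULL if unknown.
CREATE FUNCTION lookup_state_code(text) RETURNS char(2) AS $$
  SELECT code FROM states WHERE name = upper($1);
$$ LANGUAGE sql STABLE;

-- Pass two-letter inputs through; translate everything else.
CREATE FUNCTION get_state_code(text) RETURNS char(2) AS $$
  SELECT CASE WHEN length($1) = 2 THEN upper($1)::char(2)
              ELSE lookup_state_code($1) END;
$$ LANGUAGE sql STABLE;
```

The all-CASE variant Merlin mentions removes the table lookup entirely, which is why it can be declared IMMUTABLE; the tradeoff is that adding a state means editing the function instead of inserting a row.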
Re: [PERFORM] query optimization help
Hi, Thanks for the help. I actually got around it by doing the following. I created a temporary table:

create table statesnew as select distinct state, state_code from postalcode where lower(country)='us';

And then changed the query to:

SELECT (SELECT sn.state FROM statesnew sn WHERE UPPER(cd.state) IN (sn.state, sn.state_code)) as mystate, SUM(d.amount) as total_amount, SUM(COALESCE(d.fee,0) + COALESCE(mp.seller_fee, 0) + COALESCE(mp.buyer_fee,0)) as total_fee FROM data d JOIN customerdata cd ON d.uid = cd.uid LEFT JOIN merchant_purchase mp ON d.id = mp.data_id WHERE d.what IN (26, 0, 15) AND d.flags IN (1, 9, 10, 12) AND lower(cd.country) = 'us' AND date_part('year', d.time) = 2004 GROUP BY mystate ORDER BY mystate;

This worked well, as it reduced the number of entries it had to search. I am not sure how to use the function you have written. Can you give me pointers on that? Thanks, Saranya

Merlin Moncure [EMAIL PROTECTED] wrote: Please post in plaintext, not html where possible. Your group by clause was 'myst'...was that supposed to be mystate?

Yes, it is mystate. It continues on the next line :)

Here is something you might try...use the original query form and create a function which resolves the state code from the input data...you are already doing that with upper. So, create function get_state_code(text) returns char(2) as $$ select case when length($1) = 2 then upper($1) else lookup_state_code($1) end; $$ language sql stable; lookup_state_code is a similar function which boils down to a select from a lookup table. Or, you could make a giant case statement (WHEN 'GEORGIA' THEN 'GA', etc.), and now your function becomes IMMUTABLE and should execute very fast. Just make sure all the states are spelled correctly in the original table via domain constraint. Merlin
Re: [PERFORM] Query Optimization
sarlav kumar wrote: Hi all, Can someone please help me optimize this query? Is there a better way to write this query? I am generating a report of transactions ordered by time and with details of the sender and receiver etc. SELECT distinct a.time::date ||'br'||substring(a.time::time::text,1,8) as Time, CASE WHEN a.what = 0 THEN 'Money Transfer' WHEN a.what = 15 THEN 'Purchase' WHEN a.what = 26 THEN 'Merchant Streamline' WHEN a.what = 13 THEN 'Reversal' END as Transaction_Type , c1.account_no as SenderAccount, c2.account_no as RecieverAccount, b.country as SenderCountry, d.country as RecieverCountry, b.firstname as SenderFirstName, b.lastname as SenderLastName, d.firstname as ReceiverFirstName, d.lastname as ReceiverLastName, a.status as status, (select sum(td.amount * 0.01) from transaction_data td where td.data_id = a2.id and td.dir = 1 and td.uid = a.target_uid) as ReversedAmount, (select sum(td.amount * 0.01) from transaction_data td where td.data_id = a2.id and td.dir = 0 and td.uid = a.uid ) as DepositedAmount, a.flags, (a.amount * 0.01) as Amount, (a.fee * 0.01) as Fee FROM data a, customerdata b, customerdata d, customer c1, customer c2 , participant p, data a2 WHERE p.id = a.partner_id AND (a.uid = c1.id) AND (a.target_uid = c2.id) and c1.id=b.uid and c2.id=d.uid and a.confirmation is not null AND (a2.ref_id = a.id) and ((a2.what = 13) or (a2.what = 17) ) ORDER BY time desc ; (query plan followed) The expensive operation is the UNIQUE. Are you sure, in terms of business logic, that this is necessary? Is it actually possible to have duplicate transactions at the exact same time, and if so, would you really want to eliminate them? As an aside, I prefer to have numeric constants like the 'what' field in a small lookup table of two columns (what_code, what_description); it's easier to extend and to document. 
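Andrew's aside about moving magic numbers into a lookup table might look like the following. The table and column names are illustrative only; they are not part of the original schema:

```sql
-- Hypothetical lookup table for the 'what' transaction-type codes.
CREATE TABLE what_code (
    what_code        integer PRIMARY KEY,
    what_description text NOT NULL
);

INSERT INTO what_code VALUES
    (0,  'Money Transfer'),
    (13, 'Reversal'),
    (15, 'Purchase'),
    (26, 'Merchant Streamline');

-- The CASE expression in the report then collapses to a join:
-- SELECT ..., w.what_description AS Transaction_Type
-- FROM data a JOIN what_code w ON w.what_code = a.what ...
```

New codes can then be added with an INSERT instead of editing every report query.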
[PERFORM] Query Optimization
Hi all, Can someone please help me optimize this query? Is there a better way to write this query? I am generating a report of transactions ordered by time and with details of the sender and receiver etc. SELECT distinct a.time::date ||'br'||substring(a.time::time::text,1,8) as Time,CASE WHEN a.what = 0 THEN 'Money Transfer' WHEN a.what = 15 THEN 'Purchase' WHEN a.what = 26 THEN 'Merchant Streamline' WHEN a.what = 13 THEN 'Reversal' END as Transaction_Type ,c1.account_no as SenderAccount, c2.account_no as RecieverAccount, b.country as SenderCountry, d.country as RecieverCountry,b.firstname as SenderFirstName, b.lastname as SenderLastName, d.firstname as ReceiverFirstName, d.lastname as ReceiverLastName, a.status as status,(select sum(td.amount * 0.01) from transaction_data td where td.data_id = a2.id and td.dir = 1 and td.uid = a.target_uid) as ReversedAmount,(select sum(td.amount * 0.01) from transaction_data td where td.data_id = a2.id and td.dir = 0 and td.uid = a.uid ) as DepositedAmount, a.flags, (a.amount * 0.01) as Amount,(a.fee * 0.01) as Fee FROM data a, customerdata b, customerdata d, customer c1, customer c2 , participant p, data a2 WHERE p.id = a.partner_id AND (a.uid = c1.id) AND (a.target_uid = c2.id) and c1.id=b.uid and c2.id=d.uidand a.confirmation is not null AND (a2.ref_id = a.id) and ((a2.what = 13) or (a2.what = 17) ) ORDER BY time desc ; QUERY PLAN-Unique (cost=2978.27..2981.54 rows=8 width=150) (actual time=502.29..506.75 rows=382 loops=1) - Sort (cost=2978.27..2978.46 rows=77 width=150) (actual time=502.29..502.61 rows=461 loops=1) Sort Key: a."time")::date)::text || 'br'::text) || "substring"(((a."time")::time without time zone)::text, 1, 8)), CASE WHEN (a.what = 0) THEN 'Money Transfer'::text WHEN (a.what = 15) THEN 'Purchase'::text WHEN (a.what = 26) THEN 'Merchant Streamline'::text WHEN (a.what = 13) THEN 'Reversal'::text ELSE NULL::text END, c1.account_no, c2.account_no, b.country, d.country, b.firstname, b.lastname, d.firstname, 
d.lastname, a.status, (subplan), (subplan), a.flags, ((a.amount)::numeric * 0.01), ((a.fee)::numeric * 0.01) - Hash Join (cost=2687.00..2975.86 rows=77 width=150) (actual time=423.91..493.48 rows=461 loops=1) Hash Cond: ("outer".partner_id = "inner".id) - Nested Loop (cost=2494.67..2781.99 rows=77 width=146) (actual time=413.19..441.61 rows=472 loops=1) - Merge Join (cost=2494.67..2526.04 rows=77 width=116) (actual time=413.09..429.86 rows=472 loops=1) Merge Cond: ("outer".id = "inner".ref_id) - Sort (cost=1443.39..1458.57 rows=6069 width=108) (actual time=370.14..377.72 rows=5604 loops=1) Sort Key: a.id - Hash Join (cost=203.50..1062.01 rows=6069 width=108) (actual time=20.35..335.44 rows=5604 loops=1) Hash Cond: ("outer".uid = "inner".id) - Merge Join (cost=0.00..676.43 rows=6069 width=91) (actual time=0.42..255.33 rows=5611 loops=1) Merge Cond: ("outer".target_uid = "inner".uid) - Merge Join (cost=0.00..1224.05 rows=6069 width=61) (actual time=0.34..156.74 rows=5611 loops=1) Merge Cond: ("outer".target_uid = "inner".id) - Index Scan using data_target_uid on data a (cost=0.00..2263.05 rows=6069 width=44) (actual time=0.23..63.87 rows=5630 loops=1) Filter: (confirmation IS NOT NULL) - Index Scan using customer_pkey on customer c2 (cost=0.00..631.03 rows=6120 width=17) (actual time=0.05..50.97 rows=10862 loops=1) - Index Scan using customerdata_uid_idx on customerdata d (cost=0.00..312.36 rows=6085 width=30) (actual time=0.06..48.95 rows=10822 loops=1) - Hash (cost=188.20..188.20 rows=6120 width=17) (actual time=19.81..19.81 rows=0 loops=1) - Seq Scan on customer c1 (cost=0.00..188.20 rows=6120 width=17) (actual time=0.03..12.30 rows=6157 loops=1) - Sort (cost=1051.28..1052.52 rows=497 width=8) (actual time=42.05..42.51 rows=542 loops=1) Sort Key: a2.ref_id - Seq Scan on data a2 (cost=0.00..1029.00 rows=497 width=8) (actual time=0.21..41.14 rows=545 loops=1) Filter: ((what = 13) OR (what = 17)) - Index Scan using customerdata_uid_idx on customerdata b
(cost=0.00..3.31 rows=1 width=30) (actual time=0.01..0.01 rows=1 loops=472) Index Cond: (b.uid = "outer".uid) - Hash (cost=192.26..192.26 rows=26 width=4) (actual time=10.50..10.50 rows=0 loops=1) - Seq Scan on participant p (cost=0.00..192.26 rows=26 width=4) (actual time=10.42..10.46 rows=26 loops=1) SubPlan - Aggregate (cost=6.08..6.08 rows=1 width=4) (actual time=0.03..0.03 rows=1 loops=461) - Index Scan using td_data_id_idx on transaction_data td (cost=0.00..6.08 rows=1 width=4) (actual time=0.02..0.02 rows=1 loops=461) Index Cond: (data_id = $0) Filter: ((dir = 1) AND (uid = $1)) - Aggregate (cost=6.08..6.08 rows=1 width=4) (actual time=0.02..0.02 rows=1 loops=461) - Index Scan using td_data_id_idx on transaction_data td (cost=0.00..6.08 rows=1 width=4) (actual time=0.01..0.01 rows=1 loops=461) Index Cond: (data_id = $0) Filter: ((dir = 0) AND (uid =
Re: [PERFORM] query optimization question
On Thu, 29 Jan 2004, Tom Lane wrote: jackdb-# GROUP BY memberid_ HAVING ( Um, that's not what I had in mind at all. Does GROUP BY actually do anything at all here? (You didn't answer me as to whether memberid_ is a unique identifier or not, but if it is, this GROUP BY is just an expensive no-op.) From your comment I assume that there is no transformation in pg that detects that the group by columns are unique? this is all blue-sky speculation anyway. What I'm actually more interested in is your statement that MSSQL can do the original query quickly. I find that a bit hard to believe because I don't see any relevant optimization techniques. Getting rid of the group by would not give that kind of speedup? Maybe MSSQL manages to rewrite the query like that before executing. -- /Dennis Björklund ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] query optimization question
On Thu, 29 Jan 2004, Jack Coates wrote: Probably better to repost it as a gzip'd attachment. That should complete with a picture of the GUI version. 26k zipped, let's see if this makes it through. Are you sure you attached it? At least when it got here there was no attachment. -- /Dennis Björklund ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] query optimization question
Tom Lane [EMAIL PROTECTED] writes: Jack Coates [EMAIL PROTECTED] writes: yup -- here it is. It will probably be a nasty mess after linewrap gets done with it, yup, sure is :-( If I was familiar with the layout I could probably decipher where the line breaks are supposed to be, but right now I'm just confused. I just replaced all newlines that are followed by lines starting in column 1 with spaces and got something reasonable: SELECT DISTINCT members_.memberid_ FROM members_ WHERE ( members_.List_='list1' AND members_.MemberType_='normal'AND members_.SubType_='mail' ) GROUP BY memberid_ HAVING ( ( select count(*) from lyrActiveRecips, outmail_where outmail11 1 0 NULLNULL1 NULL102274.5 NULLNULLNULL104.10356 NULLNULLSELECT 0 NULL |--Parallelism(Gather Streams)11 2 1 Parallelism Gather Streams NULLNULL102274.50.0 0.22011127 23 104.10356 [members_].[MemberID_] NULLPLAN_ROW-1 1.0 |--Filter(WHERE:(If ([Expr1006] IS NULL) then 0 else [Expr1006]+If ([Expr1012] IS NULL) then 0 else [Expr1012]=3)) 11 3 2 Filter Filter WHERE:(If ([Expr1006] IS NULL) then 0 else [Expr1006]+If ([Expr1012] IS NULL) then 0 else [Expr1012]=3) NULL102274.50.0 3.5393338 23 103.88345 [members_].[MemberID_] NULLPLAN_ROW-1 1.0 |--Hash Match(Right Outer Join, HASH:([lyrCompletedRecips].[MemberID])=([members_].[MemberID_]), RESIDUAL:([members_].[MemberID_]=[lyrCompletedRecips].[MemberID])) 11 4 3 Hash Match Right Outer Join HASH:([lyrCompletedRecips].[MemberID])=([members_].[MemberID_]), RESIDUAL:([members_].[MemberID_]=[lyrCompletedRecips].[MemberID]) NULL 4782883.5 0.0 21.874712 23 100.34412 [members_].[MemberID_], [Expr1006], [Expr1012] NULLPLAN_ROW-1 1.0 |--Compute Scalar(DEFINE:([Expr1012]=Convert([Expr1020]))) 11 5 4 Compute Scalar Compute Scalar DEFINE:([Expr1012]=Convert([Expr1020])) [Expr1012]=Convert([Expr1020]) 119575.35 0.0 1.3723248 15 4.3749919 [lyrCompletedRecips].[MemberID], [Expr1012] NULLPLAN_ROW-1 1.0 ||--Hash Match(Aggregate, HASH:([lyrCompletedRecips].[MemberID]), 
RESIDUAL:([lyrCompletedRecips].[MemberID]=[lyrCompletedRecips].[MemberID]) DEFINE:([Expr1020]=COUNT(*))) 11 6 5 Hash Match Aggregate HASH:([lyrCompletedRecips].[MemberID]), RESIDUAL:([lyrCompletedRecips].[MemberID]=[lyrCompletedRecips].[MemberID]) [Expr1020]=COUNT(*) 119575.35 0.0 1.3723248 15 4.3749919 [lyrCompletedRecips].[MemberID], [Expr1020] NULLPLAN_ROW-1 1.0 | |--Parallelism(Repartition Streams, PARTITION COLUMNS:([lyrCompletedRecips].[MemberID])) 11 7 6 Parallelism Repartition Streams PARTITION COLUMNS:([lyrCompletedRecips].[MemberID]) NULL 119640.60.0 0.32407209 173 3.002667 [lyrCompletedRecips].[MemberID] NULLPLAN_ROW-1 1.0 | |--Nested Loops(Inner Join, OUTER REFERENCES:([outmail_].[MessageID_])) 11 8 7 Nested LoopsInner Join OUTER REFERENCES:([outmail_].[MessageID_]) NULL119640.60.0 0.75014657 173 2.6785948 [lyrCompletedRecips].[MemberID] NULL PLAN_ROW-1 1.0 | |--Parallelism(Distribute Streams) 11 9 8 Parallelism Distribute Streams NULLNULL1.0 0.0 2.8501874E-2128 9.4664574E-2[outmail_].[MessageID_] NULLPLAN_ROW -1 1.0 | ||--Clustered Index Scan(OBJECT:([lmdb].[dbo].[outmail_].[IX_outmail_list]), WHERE:([outmail_].[Type_]='list'))11 10 9 Clustered Index Scan Clustered Index ScanOBJECT:([lmdb].[dbo].[outmail_].[IX_outmail_list]), WHERE:([outmail_].[Type_]='list') [outmail_].[Type_], [outmail_].[MessageID_] 1.0 0.01878925 3.981E-5128 3.7658099E-2[outmail_].[Type_], [outmail_].[MessageID_] NULLPLAN_ROW0 1.0 | |--Clustered Index Seek(OBJECT:([lmdb].[dbo].[lyrCompletedRecips].[IX_CompletedRecipsMailingID]), SEEK:([lyrCompletedRecips].[MailingID]=[outmail_].[MessageID_]), WHERE:([lyrCompletedRecips].[CompletionStatusID]=300 AN11 11 8 Clustered Index SeekClustered Index Seek OBJECT:([lmdb].[dbo].[lyrCompletedRecips].[IX_CompletedRecipsMailingID]), SEEK:([lyrCompletedRecips].[MailingID]=[outmail_].[MessageID_]),
Re: [PERFORM] query optimization question
On Thu, 2004-01-29 at 23:23, Dennis Bjorklund wrote: On Thu, 29 Jan 2004, Jack Coates wrote: Probably better to repost it as a gzip'd attachment. That should complete with a picture of the GUI version. 26k zipped, let's see if this makes it through. Are you sure you attached it? At least when it got here there was no attachment. argh; attached the 40K version which was in color, removed it to make the new one with greyscale and forgot to attach that. Here it is again: -- Jack Coates, Lyris Technologies Applications Engineer 510-549-4350 x148, [EMAIL PROTECTED] Interoperability is the keyword, uniformity is a dead end. --Olivier Fourdan pg-perf-sql-plan.tgz Description: application/compressed-tar ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] query optimization question
On Wed, 2004-01-28 at 18:04, Tom Lane wrote: Jack Coates [EMAIL PROTECTED] writes: I've got a query that needs some help, please. Is there a way to avoid all the looping? I've got freedom to work with the double-indented sections below ) AND (, but the initial select distinct wrapper is much more difficult to change. This is auto-generated code. Well, you're not going to get any serious improvement without a wholesale rewrite of the query --- I'd think that something driven by a GROUP BY memberid_ HAVING count(*) = whatever at the outer level would be a better way to approach it. As you have it, the system has no choice but to fully evaluate two very expensive subselects, from scratch, for each outer row. I hear you. There's definitely an understanding that this tool can generate some gnarly queries, and we want to redesign in a way that will allow some more intelligence to be applied to the problem. In the meantime, I'll be happy if PG grinds at the same level as other databases. MS-SQL completed that query in 25 minutes on a database with 31 times the data in it. Since I'm one of the bigger *nix fans around here, that doesn't make me happy. However... ( select count(*) from lyrActiveRecips, members_ a, outmail_ where lyrActiveRecips.UserName = a.UserNameLC_ and lyrActiveRecips.Domain = a.Domain_ and a.MemberID_ = members_.MemberID_ and outmail_.MessageID_ = lyrActiveRecips.MailingID Is memberid_ a unique identifier for members_, as one would think from the name? If so, can't you drop the join of members_ a in this subselect, and just use the corresponding fields from the outer table? ( select count(*) from lyrCompletedRecips, members_ a, outmail_ where a.MemberID_ = lyrCompletedRecips.MemberID and a.UserNameLC_ = members_.UserNameLC_ and a.Domain_ = members_.Domain_ and outmail_.MessageID_ = lyrCompletedRecips.MailingID Why are the join conditions different here from the other subselect? 
Can't you rephrase them the same as above, and then again remove the inner appearance of members_ ? regards, tom lane unfortunately, the column names are different between lyrcompletedrecips and lyractiverecips. However, one thing we were able to do is to reduce the number of queries by not trying to match across multiple lists. SELECT DISTINCT members_.emailaddr_, members_.memberid_ FROM members_ WHERE ( members_.List_='list1' AND members_.MemberType_='normal' AND members_.SubType_='mail' AND members_.emailaddr_ IS NOT NULL ) AND ( ( select count(*) from lyrActiveRecips, outmail_ where outmail_.MessageID_ = lyrActiveRecips.MailingID and outmail_.Type_ = 'list' and members_.MemberID_ = lyrActiveRecips.MemberID and lyrActiveRecips.NextAttempt '2004-01-20 00:00:00' ) + ( select count(*) from lyrCompletedRecips, outmail_ where members_.MemberID_ = lyrCompletedRecips.MemberID and outmail_.MessageID_ = lyrCompletedRecips.MailingID and outmail_.Type_ = 'list' and lyrCompletedRecips.FinalAttempt '2004-01-20 00:00:00' and lyrCompletedRecips.CompletionStatusID = 300 ) = 3 ); That completed in 3.5 minutes on MS-SQL. I killed the query this morning after 15 hours on PostgreSQL 7.4. I tried a GROUP BY memberid_ HAVING variation, which completed in 59 seconds on MS-SQL. I killed it after 35 minutes on PostgreSQL. On a more positive note, if you remember the benchmarking I was doing last month, PostgreSQL got some pretty good relative numbers. It requires a lot of hand-holding and tuning relative to MS-SQL, but it certainly beat the pants off of Oracle 8 and 9 for speed and ease of management. Oracle 8 was in fact unable to complete the uglier stress tests. I'll be working on a tuning recommendations white paper today. thanks for all the help, -- Jack Coates, Lyris Technologies Applications Engineer 510-549-4350 x148, [EMAIL PROTECTED] Interoperability is the keyword, uniformity is a dead end. --Olivier Fourdan ---(end of broadcast)--- TIP 6: Have you searched our list archives? 
http://archives.postgresql.org
Re: [PERFORM] query optimization question
Jack Coates [EMAIL PROTECTED] writes: That completed in 3.5 minutes on MS-SQL. I killed the query this morning after 15 hours on PostgreSQL 7.4. I tried a GROUP BY memberid_ HAVING variation, which completed in 59 seconds on MS-SQL. I killed it after 35 minutes on PostgreSQL. Hm. I'd like to think that 7.4 would be competitive on grouping queries. What sort of plan did you get from it? regards, tom lane ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] query optimization question
Jack Coates [EMAIL PROTECTED] writes: jackdb=# explain SELECT DISTINCT members_.memberid_ jackdb-# FROM members_ jackdb-# WHERE ( members_.List_='list1' jackdb(# AND members_.MemberType_='normal' jackdb(# AND members_.SubType_='mail' jackdb(# AND members_.emailaddr_ IS NOT NULL ) jackdb-# GROUP BY memberid_ HAVING ( Um, that's not what I had in mind at all. Does GROUP BY actually do anything at all here? (You didn't answer me as to whether memberid_ is a unique identifier or not, but if it is, this GROUP BY is just an expensive no-op.) What I was envisioning was pulling the sub-selects up to the top level and using grouping to calculate the count(*) values for all memberids in parallel. Roughly speaking it would look like (again assuming memberid_ is unique) SELECT memberid_ FROM ( SELECT memberid_ FROM lyrActiveRecips, members_, outmail WHERE (all the conditions for this case) UNION ALL SELECT memberid_ FROM lyrCompletedRecips, members_, outmail WHERE (all the conditions for this case) ) GROUP BY memberid_ HAVING count(*) = 3; However, if you can't change the boilerplate part of your query then this is all blue-sky speculation anyway. What I'm actually more interested in is your statement that MSSQL can do the original query quickly. I find that a bit hard to believe because I don't see any relevant optimization techniques. Do they have any equivalent to EXPLAIN that would give some hint how they're doing it? regards, tom lane ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
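Tom's UNION ALL rewrite above, written out as complete SQL. The join conditions are abbreviated as comments just as in his sketch; note that PostgreSQL requires an alias on the derived table, which may explain the "subquery errors" mentioned in the follow-up:

```sql
SELECT memberid_
FROM (
    SELECT m.MemberID_ AS memberid_
    FROM lyrActiveRecips r, members_ m, outmail_ o
    WHERE r.MemberID = m.MemberID_
      AND o.MessageID_ = r.MailingID
      -- ... plus the remaining conditions for the active-recipients case
    UNION ALL
    SELECT m.MemberID_ AS memberid_
    FROM lyrCompletedRecips r, members_ m, outmail_ o
    WHERE r.MemberID = m.MemberID_
      AND o.MessageID_ = r.MailingID
      -- ... plus the remaining conditions for the completed-recipients case
) AS both_recips   -- the subquery alias is mandatory; omitting it is a syntax error
GROUP BY memberid_
HAVING count(*) = 3;
```

This evaluates each branch once over the whole data set and counts matches per member in a single grouping pass, instead of re-running two correlated subselects per outer row.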
Re: [PERFORM] query optimization question
On Thu, 2004-01-29 at 11:31, Tom Lane wrote: Jack Coates [EMAIL PROTECTED] writes: jackdb=# explain SELECT DISTINCT members_.memberid_ jackdb-# FROM members_ jackdb-# WHERE ( members_.List_='list1' jackdb(# AND members_.MemberType_='normal' jackdb(# AND members_.SubType_='mail' jackdb(# AND members_.emailaddr_ IS NOT NULL ) jackdb-# GROUP BY memberid_ HAVING ( Um, that's not what I had in mind at all. Does GROUP BY actually do anything at all here? (You didn't answer me as to whether memberid_ is a unique identifier or not, but if it is, this GROUP BY is just an expensive no-op.) Sorry for the misunderstanding. It should be unique, yes. What I was envisioning was pulling the sub-selects up to the top level and using grouping to calculate the count(*) values for all memberids in parallel. Roughly speaking it would look like (again assuming memberid_ is unique) SELECT memberid_ FROM ( SELECT memberid_ FROM lyrActiveRecips, members_, outmail WHERE (all the conditions for this case) UNION ALL SELECT memberid_ FROM lyrCompletedRecips, members_, outmail WHERE (all the conditions for this case) ) GROUP BY memberid_ HAVING count(*) = 3; However, if you can't change the boilerplate part of your query then this is all blue-sky speculation anyway. Got it now -- I'm running into some subquery errors trying to implement this, anyway. What I'm actually more interested in is your statement that MSSQL can do the original query quickly. I find that a bit hard to believe because I don't see any relevant optimization techniques. Do they have any equivalent to EXPLAIN that would give some hint how they're doing it? yup -- here it is. It will probably be a nasty mess after linewrap gets done with it, so let me know if you'd like me to post a copy on ftp. 
SELECT DISTINCT members_.memberid_ FROM members_ WHERE ( members_.List_='list1'AND members_.MemberType_='normal'AND members_.SubType_='mail' ) GROUP BY memberid_ HAVING ( ( select count(*) from lyrActiveRecips, outmail_where outmail 11 1 0 NULLNULL1 NULL102274.5NULL NULLNULL104.10356 NULLNULLSELECT 0 NULL |--Parallelism(Gather Streams)11 2 1 Parallelism Gather Streams NULLNULL102274.50.0 0.22011127 23 104.10356 [members_].[MemberID_] NULLPLAN_ROW-1 1.0 |--Filter(WHERE:(If ([Expr1006] IS NULL) then 0 else [Expr1006]+If ([Expr1012] IS NULL) then 0 else [Expr1012]=3)) 11 3 2 Filter Filter WHERE:(If ([Expr1006] IS NULL) then 0 else [Expr1006]+If ([Expr1012] IS NULL) then 0 else [Expr1012]=3) NULL102274.50.0 3.5393338 23 103.88345 [members_].[MemberID_] NULLPLAN_ROW-1 1.0 |--Hash Match(Right Outer Join, HASH:([lyrCompletedRecips].[MemberID])=([members_].[MemberID_]), RESIDUAL:([members_].[MemberID_]=[lyrCompletedRecips].[MemberID])) 11 4 3 Hash Match Right Outer Join HASH:([lyrCompletedRecips].[MemberID])=([members_].[MemberID_]), RESIDUAL:([members_].[MemberID_]=[lyrCompletedRecips].[MemberID]) NULL 4782883.5 0.0 21.874712 23 100.34412 [members_].[MemberID_], [Expr1006], [Expr1012] NULLPLAN_ROW-1 1.0 |--Compute Scalar(DEFINE:([Expr1012]=Convert([Expr1020]))) 11 5 4 Compute Scalar Compute Scalar DEFINE:([Expr1012]=Convert([Expr1020])) [Expr1012]=Convert([Expr1020]) 119575.35 0.0 1.3723248 15 4.3749919 [lyrCompletedRecips].[MemberID], [Expr1012] NULLPLAN_ROW-1 1.0 ||--Hash Match(Aggregate, HASH:([lyrCompletedRecips].[MemberID]), RESIDUAL:([lyrCompletedRecips].[MemberID]=[lyrCompletedRecips].[MemberID]) DEFINE:([Expr1020]=COUNT(*)))11 6 5 Hash Match Aggregate HASH:([lyrCompletedRecips].[MemberID]), RESIDUAL:([lyrCompletedRecips].[MemberID]=[lyrCompletedRecips].[MemberID]) [Expr1020]=COUNT(*) 119575.35 0.0 1.3723248 15 4.3749919 [lyrCompletedRecips].[MemberID], [Expr1020] NULLPLAN_ROW-1 1.0 | |--Parallelism(Repartition Streams, PARTITION 
COLUMNS:([lyrCompletedRecips].[MemberID])) 11 7 6 Parallelism Repartition Streams PARTITION COLUMNS:([lyrCompletedRecips].[MemberID]) NULL 119640.60.0 0.32407209 173 3.002667 [lyrCompletedRecips].[MemberID] NULLPLAN_ROW-1 1.0 | |--Nested Loops(Inner Join, OUTER REFERENCES:([outmail_].[MessageID_])) 11 8 7 Nested LoopsInner JoinOUTER REFERENCES:([outmail_].[MessageID_])NULL119640.60.0 0.75014657 173 2.6785948
Re: [PERFORM] query optimization question
On Thu, 2004-01-29 at 14:01, Tom Lane wrote: Probably better to repost it as a gzip'd attachment. That should protect the formatting and get it into the list archives. regards, tom lane complete with a picture of the GUI version. 26k zipped, let's see if this makes it through. -- Jack Coates, Lyris Technologies Applications Engineer 510-549-4350 x148, [EMAIL PROTECTED] Interoperability is the keyword, uniformity is a dead end. --Olivier Fourdan ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
[PERFORM] query optimization differs between view and explicit query
I have a large query which I would like to place in a view. The explicit query is sufficiently fast, but the same query as a view is much slower and uses a different plan. I would appreciate an explanation of why this is, and, more importantly whether/how I might coax the view to use a different plan. The original query: [EMAIL PROTECTED] select distinct on (AH.p2gblataln_id) AH.p2gblataln_id,H.pseq_id,min(H.pstart) as pstart, max(H.pstop) as pstop,A.ident,(A.ident/Q.len::float*100)::int as pct_ident, sum(H.pstop-H.pstart+1) as aln_length,H.genasm_id,H.chr,H.plus_strand,min(H.gstart) as gstart, max(H.gstop) as gstop from p2gblathsp H join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id join pseq Q on H.pseq_id=Q.pseq_id where H.pseq_id=76 group by AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len \g /dev/null Time: 277.804 ms Now as a view: [EMAIL PROTECTED] create view v1 as select distinct on (AH.p2gblataln_id) AH.p2gblataln_id,H.pseq_id,min(H.pstart) as pstart, max(H.pstop) as pstop,A.ident,(A.ident/Q.len::float*100)::int as pct_ident, sum(H.pstop-H.pstart+1) as aln_length,H.genasm_id,H.chr,H.plus_strand,min(H.gstart) as gstart, max(H.gstop) as gstop from p2gblathsp H join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id join pseq Q on H.pseq_id=Q.pseq_id group by AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len; CREATE VIEW Time: 103.041 ms [EMAIL PROTECTED] select * from v1 where pseq_id=76 \g /dev/null Time: 31973.979 ms Okay, that's ~100x slower. The plans: [EMAIL PROTECTED] explain select distinct on snip... 
same as the first query above QUERY PLAN -- Unique (cost=11157.75..11187.26 rows=454 width=40) - GroupAggregate (cost=11157.75..11186.13 rows=454 width=40) - Sort (cost=11157.75..11158.89 rows=454 width=40) Sort Key: ah.p2gblataln_id, h.pseq_id, h.genasm_id, h.chr, h.plus_strand, a.ident, q.len - Nested Loop (cost=11125.62..11137.71 rows=454 width=40) - Index Scan using pseq_pkey on pseq q (cost=0.00..3.01 rows=2 width=6) Index Cond: (76 = pseq_id) - Materialize (cost=11125.62..11127.89 rows=227 width=38) - Nested Loop (cost=546.15..11125.62 rows=227 width=38) - Hash Join (cost=546.15..10438.72 rows=227 width=34) Hash Cond: (outer.p2gblathsp_id = inner.p2gblathsp_id) - Seq Scan on p2gblatalnhsp ah (cost=0.00..6504.03 rows=451503 width=8) - Hash (cost=545.58..545.58 rows=227 width=34) - Index Scan using p2gblathsp_p_lookup on p2gblathsp h (cost=0.00..545.58 rows=227 wid Index Cond: (pseq_id = 76) - Index Scan using p2gblataln_pkey on p2gblataln a (cost=0.00..3.01 rows=1 width=8) Index Cond: (outer.p2gblataln_id = a.p2gblataln_id) (17 rows) [EMAIL PROTECTED] explain select * from v1 where pseq_id=76; QUERY PLAN -- Subquery Scan v1 (cost=246907.54..281897.70 rows=2258 width=77) Filter: (pseq_id = 76) - Unique (cost=246907.54..276254.13 rows=451486 width=40) - GroupAggregate (cost=246907.54..275125.41 rows=451486 width=40) - Sort (cost=246907.54..248036.25 rows=451486 width=40) Sort Key: ah.p2gblataln_id, h.pseq_id, h.genasm_id, h.chr, h.plus_strand, a.ident, q.len - Hash Join (cost=14019.29..204503.24 rows=451486 width=40) Hash Cond: (outer.p2gblataln_id = inner.p2gblataln_id) - Hash Join (cost=7632.79..191344.45 rows=451486 width=36) Hash Cond: (outer.p2gblathsp_id = inner.p2gblathsp_id) - Merge Join (cost=0.00..176939.38 rows=451486 width=36) Merge Cond: (outer.pseq_id = inner.pseq_id) - Index Scan using p2gblathsp_p_lookup on p2gblathsp h (cost=0.00..16102.40 rows=451485 widt
Re: [PERFORM] query optimization differs between view and explicit
Stephan Szabo [EMAIL PROTECTED] writes: On Thu, 29 Jan 2004, Reece Hart wrote: I have a large query which I would like to place in a view. The explicit query is sufficiently fast, but the same query as a view is much slower and uses a different plan. I would appreciate an explanation of why this is, and, more importantly whether/how I might coax the view to use a different plan. Well, in general [ they're not the same query ] Right. The reason the performance is so much worse is that the restriction pseq_id=76 cannot be pushed down into the view subquery; we have to form the entire logical output of the view and then filter on pseq_id=76. In your inline query you have done the pushing down anyway and so the restriction is applied much lower in the plan, resulting in lots less work. But the results might be different. The point that Stephan makes is explicitly understood by the planner as of PG 7.4: * 3. If the subquery uses DISTINCT ON, we must not push down any quals that * refer to non-DISTINCT output columns, because that could change the set * of rows returned. It's hard to give any advice on how to make a faster view without more context. What's the actual intention in all this? What's the semantics of pseq_id --- is it unique? It might be you could fix the problem by adding pseq_id to the DISTINCT ON list, but we don't have enough info to understand whether that would break the desired behavior. regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
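Tom's closing suggestion, sketched against the view from the original post. The column list is abbreviated, and whether adding pseq_id to the DISTINCT ON list preserves the intended one-row-per-alignment semantics depends on the data, so treat this as an assumption to verify:

```sql
-- Sketch only: with pseq_id in the DISTINCT ON list, a qual on pseq_id
-- no longer changes the set of rows the view returns, so the planner
-- is free to push it down into the view subquery.
CREATE VIEW v2 AS
SELECT DISTINCT ON (H.pseq_id, AH.p2gblataln_id)
       AH.p2gblataln_id, H.pseq_id,
       min(H.pstart) AS pstart, max(H.pstop) AS pstop
       -- ... remaining output columns as in v1 ...
FROM p2gblathsp H
JOIN p2gblatalnhsp AH ON H.p2gblathsp_id = AH.p2gblathsp_id
JOIN p2gblataln A     ON AH.p2gblataln_id = A.p2gblataln_id
JOIN pseq Q           ON H.pseq_id = Q.pseq_id
GROUP BY AH.p2gblataln_id, H.pseq_id, H.genasm_id, H.chr, H.plus_strand,
         A.ident, Q.len;

-- SELECT * FROM v2 WHERE pseq_id = 76;  -- restriction now applied below the DISTINCT
```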
[PERFORM] query optimization question
Hi all, I've got a query that needs some help, please. Is there a way to avoid all the looping? I've got freedom to work with the double-indented sections below ) AND (, but the initial select distinct wrapper is much more difficult to change. This is auto-generated code. explain analyze SELECT DISTINCT members_.emailaddr_, members_.memberid_ FROM members_ WHERE ( members_.List_='list1' AND members_.MemberType_='normal' AND members_.SubType_='mail' AND members_.emailaddr_ IS NOT NULL ) AND ( ( select count(*) from lyrActiveRecips, members_ a, outmail_ where lyrActiveRecips.UserName = a.UserNameLC_ and lyrActiveRecips.Domain = a.Domain_ and a.MemberID_ = members_.MemberID_ and outmail_.MessageID_ = lyrActiveRecips.MailingID and outmail_.Type_ = 'list' and lyrActiveRecips.NextAttempt '2004-01-20 00:00:00' ) + ( select count(*) from lyrCompletedRecips, members_ a, outmail_ where a.MemberID_ = lyrCompletedRecips.MemberID and a.UserNameLC_ = members_.UserNameLC_ and a.Domain_ = members_.Domain_ and outmail_.MessageID_ = lyrCompletedRecips.MailingID and outmail_.Type_ = 'list' and lyrCompletedRecips.FinalAttempt '2004-01-20 00:00:00' and lyrCompletedRecips.CompletionStatusID = 300 ) = 3 ) ; QUERY PLAN - Unique (cost=537.06..537.07 rows=1 width=72) (actual time=114460.908..114460.908 rows=0 loops=1) - Sort (cost=537.06..537.06 rows=1 width=72) (actual time=114460.905..114460.905 rows=0 loops=1) Sort Key: emailaddr_, memberid_ - Index Scan using ix_members_list_notifyerr on members_ (cost=0.00..537.05 rows=1 width=72) (actual time=114460.893..114460.893 rows=0 loops=1) Index Cond: ((list_)::text = 'list1'::text) Filter: (((membertype_)::text = 'normal'::text) AND ((subtype_)::text = 'mail'::text) AND (emailaddr_ IS NOT NULL) AND (((subplan) + (subplan)) = 3)) SubPlan - Aggregate (cost=52.39..52.39 rows=1 width=0) (actual time=0.089..0.090 rows=1 loops=818122) - Hash Join (cost=47.55..52.39 rows=1 width=0) (actual time=0.086..0.086 rows=0 loops=818122) Hash Cond: 
(outer.memberid_ = inner.memberid) - Index Scan using ix_members_emaillc on members_ a (cost=0.00..4.83 rows=1 width=4) (actual time=0.077..0.081 rows=1 loops=818122) Index Cond: (((domain_)::text = ($2)::text) AND ((usernamelc_)::text = ($1)::text)) - Hash (cost=47.55..47.55 rows=1 width=4) (actual time=0.025..0.025 rows=0 loops=1) - Hash Join (cost=25.00..47.55 rows=1 width=4) (actual time=0.023..0.023 rows=0 loops=1) Hash Cond: (outer.messageid_ = inner.mailingid) - Seq Scan on outmail_ (cost=0.00..22.50 rows=6 width=4) (actual time=0.001..0.001 rows=0 loops=1) Filter: ((type_)::text = 'list'::text) - Hash (cost=25.00..25.00 rows=2 width=8) (actual time=0.003..0.003 rows=0 loops=1) - Seq Scan on lyrcompletedrecips (cost=0.00..25.00 rows=2 width=8) (actual time=0.001..0.001 rows=0 loops=1) Filter: ((finalattempt '2004-01-20 00:00:00'::timestamp without time zone) AND (completionstatusid = 300)) - Aggregate (cost=51.59..51.59 rows=1 width=0) (actual time=0.033..0.034 rows=1 loops=818122) - Hash Join (cost=27.35..51.59 rows=1 width=0) (actual time=0.028..0.028 rows=0 loops=818122) Hash Cond: (((outer.username)::text = (inner.usernamelc_)::text) AND ((outer.domain)::text = (inner.domain_)::text)) - Hash Join (cost=22.52..46.72 rows=3 width=211) (actual time=0.003..0.003 rows=0 loops=818122) Hash Cond: (outer.mailingid = inner.messageid_) - Seq Scan on lyractiverecips (cost=0.00..22.50 rows=334 width=215) (actual time=0.001..0.001 rows=0 loops=818122) Filter: (nextattempt '2004-01-20 00:00:00'::timestamp