Re: [PERFORM] Very specialised query
On Fri, 27 Mar 2009, Dimitri Fontaine wrote:
> Maybe it's just that I didn't devote enough time to reading your detailed explanation above, but this part sounds like it could be done in an aggregate you'd use in a correlated subquery containing the right ORDER BY, couldn't it?
> http://www.postgresql.org/docs/8.3/interactive/sql-createaggregate.html

Can you return multiple rows from an aggregate function?

Matthew

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Very specialised query
On Fri, 27 Mar 2009, Marc Mamin wrote:
> if your data are mostly static and you have a few main objects, maybe you can have some gain while defining conditional indexes for those plus one for the rest and then slicing the query:

Maybe. I thought about doing that. However, I am not convinced it would be much of a gain, and it would require major rewriting of the queries, as you suggest.

> WHERE (l2.start BETWEEN l1.start AND l1.end
>     OR l1.start BETWEEN l2.start AND l2.end)

Yes, that's another way to calculate an overlap. However, it turns out not to be that fast. The problem is that OR, which causes a bitmap index scan as the leaf of a nested loop join, which can be rather slow.

Matthew

--
I'd try being a pessimist, but it probably wouldn't work anyway.
Re: [PERFORM] Very specialised query
On Fri, 27 Mar 2009, Tom Lane wrote:
>> Notice the two different index conditions:
>>
>> (l1.end > l2.start) AND (l1.start < l2.start)  - between
>> (l1.end > l2.start) AND (l1.start >= l2.start) - open-ended
>>
>> Both have a cost of (cost=0.00..123.10 rows=4809 width=12)
>
> Currently the planner only notices that for a range check that involves comparisons of the same variable expression to two constants (or pseudoconstants anyway). In principle it might be reasonable to have a heuristic that reduces the estimated selectivity in the example above, but it looks to me like it'd make clauselist_selectivity() a lot slower and more complicated. When you see (l1.end > l2.start), how do you know which variable to try to match up against others? And if you try to match both, what do you do when you get matches for both?

Those two index conditions are on an index scan on the field l2.start. Therefore, I would expect to only have to take notice of l2.start when working out selectivity on a range check for this particular plan. When there is an index scan on a different field, then try and match that one up instead.

Matthew
Re: [PERFORM] Bad plan for nested loop + limit
On Sun, Mar 1, 2009 at 4:32 AM, Robert Haas robertmh...@gmail.com wrote:
> What do you have default_statistics_target set to? If it's less than 100, you should probably raise it to 100 and re-analyze (the default value for 8.4 will be 100, but for 8.3 and prior it is 10).

Changing it to 100 fixed the problem. Thanks for alerting me to the existence of default_statistics_target.

Alexander.
[PERFORM] Forcing seq_scan off for large table joined with tiny table yields improved performance
I have two tables, like this:

Big table:

CREATE TABLE photo_info_data
(
  photo_id integer NOT NULL,
  field_name character varying NOT NULL,
  field_value character varying,
  CONSTRAINT photo_info_data_pk PRIMARY KEY (photo_id, field_name)
)
WITH (OIDS=FALSE);

CREATE INDEX photo_info_data_ix_field_value
  ON photo_info_data USING btree (field_value);

Small table:

CREATE TABLE t_query_data
(
  i integer,
  key character varying,
  op character varying,
  value character varying
)
WITH (OIDS=FALSE);

I have around 2,400,000 rows in photo_info_data, and just two rows in t_query_data:

 i |   key   | op | value
---+---------+----+-------
 1 | f-stop  | eq | 2.6
 2 | shutter | gt | 1/100

This is the query I'm executing:

SELECT * FROM photo_info_data u JOIN t_query_data t ON u.field_name = key

This query takes around 900 ms to execute. It returns 6 rows. When I do EXPLAIN ANALYZE, for some reason it takes around 7 seconds, and this is what I get:

phototest=# explain analyze select * from photo_info_data u join t_query_data t on u.field_name = key;

 Hash Join  (cost=1.04..58676.31 rows=218048 width=68) (actual time=2381.895..7087.225 rows=6 loops=1)
   Hash Cond: ((u.field_name)::text = (t.key)::text)
   ->  Seq Scan on photo_info_data u  (cost=0.00..47500.30 rows=2398530 width=50) (actual time=0.042..3454.112 rows=2398446 loops=1)
   ->  Hash  (cost=1.02..1.02 rows=2 width=18) (actual time=0.016..0.016 rows=2 loops=1)
         ->  Seq Scan on t_query_data t  (cost=0.00..1.02 rows=2 width=18) (actual time=0.003..0.007 rows=2 loops=1)
 Total runtime: 7087.291 ms
(6 rows)

Time: 7088.663 ms

I can rerun this query many times; it's always around 7 seconds. I/O wait during the query is nonexistent, it just takes 100% of CPU time (I have a dual-core Opteron server).
If I force the planner not to use a sequential scan, here is what I get:

phototest=# explain analyze select * from photo_info_data u join t_query_data t on u.field_name = key;

 Nested Loop  (cost=100039134.84..100130206.79 rows=218048 width=68) (actual time=271.138..540.998 rows=6 loops=1)
   ->  Seq Scan on t_query_data t  (cost=100000000.00..100000001.02 rows=2 width=18) (actual time=0.008..0.015 rows=2 loops=1)
   ->  Bitmap Heap Scan on photo_info_data u  (cost=39134.84..63740.08 rows=109024 width=50) (actual time=270.464..270.469 rows=3 loops=2)
         Recheck Cond: ((u.field_name)::text = (t.key)::text)
         ->  Bitmap Index Scan on photo_info_data_pk  (cost=0.00..39107.59 rows=109024 width=0) (actual time=270.435..270.435 rows=3 loops=2)
               Index Cond: ((u.field_name)::text = (t.key)::text)
 Total runtime: 541.065 ms
(7 rows)

Time: 542.147 ms

The database currently has only those two tables. I have vacuumed them prior to running the above queries. I thought this information might also be important:

phototest=# select key, count(*) from photo_info_data u join t_query_data t on u.field_name = key group by key;

   key   | count
---------+-------
 f-stop  |     3
 shutter |     3
(2 rows)

Am I doing something wrong here? The photo_info_data table will hold around 10.000.000 records. Should I be doing 'set seq_scan to false' each time I want to run this query? (Since I'm accessing Postgres through JDBC, I'll have the same situation I had weeks ago, which I also described here.)

Mike
Re: [PERFORM] Forcing seq_scan off for large table joined with tiny table yields improved performance
Mario Splivalo mario.spliv...@megafon.hr writes:
> ->  Bitmap Heap Scan on photo_info_data u  (cost=39134.84..63740.08 rows=109024 width=50) (actual time=270.464..270.469 rows=3 loops=2)
>       Recheck Cond: ((u.field_name)::text = (t.key)::text)
>       ->  Bitmap Index Scan on photo_info_data_pk  (cost=0.00..39107.59 rows=109024 width=0) (actual time=270.435..270.435 rows=3 loops=2)
>             Index Cond: ((u.field_name)::text = (t.key)::text)

You need to figure out why that rowcount estimate is off by more than four orders of magnitude :-(

			regards, tom lane
Re: [PERFORM] Forcing seq_scan off for large table joined with tiny table yields improved performance
Tom Lane wrote:
>> ->  Bitmap Heap Scan on photo_info_data u  (cost=39134.84..63740.08 rows=109024 width=50) (actual time=270.464..270.469 rows=3 loops=2)
>>       Recheck Cond: ((u.field_name)::text = (t.key)::text)
>>       ->  Bitmap Index Scan on photo_info_data_pk  (cost=0.00..39107.59 rows=109024 width=0) (actual time=270.435..270.435 rows=3 loops=2)
>>             Index Cond: ((u.field_name)::text = (t.key)::text)
>
> You need to figure out why that rowcount estimate is off by more than four orders of magnitude :-(

Huh, thnx! :) Could you give me some starting points; what do I do? Could it be because the table is quite large, and only 3 rows match the join condition?

Now, after I finished writing the above lines, index creation on photo_info_data(field_name) was done. When I rerun the above query, here is what I get:

phototest=# explain analyze select field_name, count(*) from t_query_data t join photo_info_data u on t.key = u.field_name group by field_name;

 HashAggregate  (cost=57414.33..57414.61 rows=22 width=9) (actual time=0.135..0.139 rows=2 loops=1)
   ->  Nested Loop  (cost=2193.50..56324.09 rows=218048 width=9) (actual time=0.063..0.114 rows=6 loops=1)
         ->  Seq Scan on t_query_data t  (cost=0.00..1.02 rows=2 width=6) (actual time=0.019..0.022 rows=2 loops=1)
         ->  Bitmap Heap Scan on photo_info_data u  (cost=2193.50..26798.74 rows=109024 width=9) (actual time=0.025..0.030 rows=3 loops=2)
               Recheck Cond: ((u.field_name)::text = (t.key)::text)
               ->  Bitmap Index Scan on photo_info_data_ix__field_name  (cost=0.00..2166.24 rows=109024 width=0) (actual time=0.019..0.019 rows=3 loops=2)
                     Index Cond: ((u.field_name)::text = (t.key)::text)
 Total runtime: 0.200 ms
(8 rows)

So, I guess I solved my problem! :) The EXPLAIN ANALYZE still shows that the row estimate is quite off (109024 estimated vs. only 3 actual), but the query is lightning fast :) I thought that having a primary key (and its automatic index) on (photo_id, field_name) would be enough.
Now I have two indexes on field_name, but that seems to do good.

Mike
Re: [PERFORM] Very specialised query
> WHERE (l2.start BETWEEN l1.start AND l1.end
>     OR l1.start BETWEEN l2.start AND l2.end)
>
> Yes, that's another way to calculate an overlap. However, it turns out not to be that fast. The problem is that OR, which causes a bitmap index scan as the leaf of a nested loop join, which can be rather slow.

OK, then splitting these checks into 2 queries with UNION is better. But I have often read that BETWEEN is faster than using 2 comparison operators. Here I guess that a combined index on (start, end) makes sense:

.. WHERE l2.start BETWEEN l1.start AND l1.end ..
UNION
.. WHERE l1.start BETWEEN l2.start AND l2.end ..

The first clause being equivalent to

  AND l1.start <= l2.end
  AND l1.end >= l2.start
  AND l1.start <= l2.start

I don't know how you have to deal with the limit conditions...

Marc Mamin
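Marc's two UNION branches together cover exactly the ordinary two-comparison overlap test for closed intervals: whichever interval starts later must start inside the other one. A brute-force check of that equivalence over small integer intervals (a Python sketch for illustration only, not anything from the thread):

```python
# Check: (l2.start BETWEEN l1.start AND l1.end) OR
# (l1.start BETWEEN l2.start AND l2.end) is the same predicate as the
# usual closed-interval overlap test l1.start <= l2.end AND l2.start <= l1.end.
from itertools import product

def between(x, lo, hi):
    return lo <= x <= hi

def branch_union(l1, l2):
    # The two UNION branches of the rewritten query, OR-ed together.
    return between(l2[0], l1[0], l1[1]) or between(l1[0], l2[0], l2[1])

def overlap(l1, l2):
    # The plain overlap condition on closed intervals.
    return l1[0] <= l2[1] and l2[0] <= l1[1]

# All intervals (s, e) with 0 <= s <= e < 6: 441 pairs to compare.
intervals = [(s, e) for s, e in product(range(6), repeat=2) if s <= e]
assert all(branch_union(a, b) == overlap(a, b)
           for a, b in product(intervals, repeat=2))
print("equivalent over all small test intervals")
```

The UNION form trades the OR (and its bitmap scan) for two separate index-friendly range conditions, at the price of scanning twice.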
Re: [PERFORM] Very specialised query
Shouldn't Postgres favour a between index scan over an open-ended one?

On Fri, 27 Mar 2009, Tom Lane wrote:
> Currently the planner only notices that for a range check that involves comparisons of the same variable expression to two constants (or pseudoconstants anyway). In principle it might be reasonable to have a heuristic that reduces the estimated selectivity in the example above, but it looks to me like it'd make clauselist_selectivity() a lot slower and more complicated. When you see (l1.end > l2.start), how do you know which variable to try to match up against others? And if you try to match both, what do you do when you get matches for both?

Arguably, having multiple greater-than constraints on a field should not affect the selectivity much, because the separate constraints will all be throwing away the same set of rows. So a single greater-than will halve the number of rows, while two greater-thans will divide the number of rows by three, etc. That's a vast approximation of course, and should be skewed by the statistics.

However, combining a greater-than with a less-than has a different effect on selectivity. If the numbers were completely random with an identical value spread in each field, then a single greater-than would halve the number of rows and an additional less-than would halve the number again. However, in most real-life situations the values will not be completely random, but will be very skewed, as in our case. Unless the planner starts collecting statistics on the distribution of the difference between two fields, that will be hard to account for though.
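The halve / divide-by-three intuition above is exact for independent, identically distributed values: a row survives k independent greater-than bounds precisely when it is the maximum of k+1 values, which by symmetry happens with probability 1/(k+1). A small enumeration illustrating this (purely illustrative Python, nothing to do with planner internals):

```python
# For i.i.d. values, P(x > a) = 1/2 and P(x > a AND x > b) = 1/3:
# x must be the largest of the values involved, i.e. 1 ordering in k+1.
from fractions import Fraction
from itertools import permutations

def selectivity(k):
    # x and k bounds take the distinct values 0..k in every possible order;
    # count the orderings in which x exceeds every bound.
    total = hits = 0
    for perm in permutations(range(k + 1)):
        x, bounds = perm[0], perm[1:]
        total += 1
        hits += all(x > b for b in bounds)
    return Fraction(hits, total)

assert selectivity(1) == Fraction(1, 2)  # one greater-than halves the rows
assert selectivity(2) == Fraction(1, 3)  # two divide them by three
```

As the post says, real data is skewed, so this symmetry argument is only the uniform-random baseline.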
Have a look at the following EXPLAIN ANALYSE:

SELECT l1.id AS id1, l2.id AS id2
FROM location l1, location l2
WHERE l1.objectid = 22893 AND l2.objectid = 22893
  AND l1.id <> l2.id
  AND l1.start < l2.end AND l1.end > l2.start
  AND l1.start < l2.start
UNION ALL
SELECT l1.id AS id1, l2.id AS id2
FROM location l1, location l2
WHERE l1.objectid = 22893 AND l2.objectid = 22893
  AND l1.id <> l2.id
  AND l1.start < l2.end AND l1.end > l2.start
  AND l1.start >= l2.start;

 Append  (cost=0.00..20479179.74 rows=138732882 width=8) (actual time=0.055..2362748.298 rows=258210 loops=1)
   ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8) (actual time=0.053..627.038 rows=99436 loops=1)
         Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
         ->  Index Scan using location_test_obj_end on location l1  (cost=0.00..55966.07 rows=43277 width=12) (actual time=0.025..58.604 rows=43534 loops=1)
               Index Cond: (objectid = 22893)
         ->  Index Scan using location_test_obj_start on location l2  (cost=0.00..123.10 rows=4809 width=12) (actual time=0.003..0.006 rows=2 loops=43534)
               Index Cond: ((l2.objectid = 22893) AND (l1.end > l2.start) AND (l1.start < l2.start))
   ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8) (actual time=0.041..2361632.540 rows=158774 loops=1)
         Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
         ->  Index Scan using location_test_obj_end on location l1  (cost=0.00..55966.07 rows=43277 width=12) (actual time=0.009..80.814 rows=43534 loops=1)
               Index Cond: (objectid = 22893)
         ->  Index Scan using location_test_obj_start on location l2  (cost=0.00..123.10 rows=4809 width=12) (actual time=0.012..29.823 rows=21768 loops=43534)
               Index Cond: ((l2.objectid = 22893) AND (l1.end > l2.start) AND (l1.start >= l2.start))
 Total runtime: 2363015.959 ms
(14 rows)

Note again the two leaf index scans that really matter for performance. One of them has a range bounded at both ends, and the other is open-ended.
Note the difference between expected and seen rows:

             Expected rows   Seen rows
between:              4809           2
open-ended:           4809       21768

The first query fragment takes 700 ms to execute, and the second takes about forty minutes. It is effectively doing a nested loop join with hardly any benefit gained from the indexes at all, over a simple index on objectid. I may as well make the query a lot simpler, and do:

SELECT l1.id AS id1, l2.id AS id2
FROM location l1, location l2
WHERE l1.objectid = 22893 AND l2.objectid = 22893
  AND l1.id <> l2.id
  AND l1.start < l2.end AND l1.end > l2.start

Unless this particular issue is improved in the planner, I don't think this particular style of query is going to work for us. I know that the database could in theory return a
Re: [PERFORM] Very specialised query
On Mon, 30 Mar 2009, Marc Mamin wrote:
> But I often read that BETWEEN is faster than using 2 comparison operators.

http://www.postgresql.org/docs/current/static/functions-comparison.html says otherwise:

    a BETWEEN x AND y    is equivalent to    a >= x AND a <= y

    There is no difference between the two respective forms apart from the
    CPU cycles required to rewrite the first one into the second one
    internally.

Matthew

--
Don't worry! The world can't end today because it's already tomorrow in Australia.
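Since the documented rewrite is purely syntactic, the two forms are the same predicate in any language. A trivial truth-table check (illustrative Python, mirroring the SQL semantics for non-NULL values):

```python
# a BETWEEN x AND y is sugar for a >= x AND a <= y: identical truth tables.
def between(a, x, y):
    return x <= a <= y

def expanded(a, x, y):
    return a >= x and a <= y

# Exhaustive comparison over a small domain, including empty ranges (x > y).
assert all(between(a, x, y) == expanded(a, x, y)
           for a in range(5) for x in range(5) for y in range(5))
```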
Re: [PERFORM] Very specialised query
On Mon, 30 Mar 2009, Віталій Тимчишин wrote:
> select case when n = 1 then id1 else id2 end,
>        case when n = 2 then id1 else id2 end
> from ( ... ) a, (values (1),(2)) b(n)

Yeah, that's nice. However, it is still the case that we can't trust the database to choose the correct plan. It is currently choosing the correct plan only by chance, and some time later it may by chance switch to one that takes 40 minutes. I'll certainly add it to my list of possibilities.

Matthew

--
Jadzia: Don't forget the 34th rule of acquisition: Peace is good for business.
Quark: That's the 35th.
Jadzia: Oh yes, that's right. What's the 34th again?
Quark: War is good for business. It's easy to get them mixed up.
Re: [PERFORM] Forcing seq_scan off for large table joined with tiny table yields improved performance
On Mon, Mar 30, 2009 at 9:34 AM, Mario Splivalo mario.spliv...@megafon.hr wrote:
> ->  Bitmap Heap Scan on photo_info_data u  (cost=2193.50..26798.74 rows=109024 width=9) (actual time=0.025..0.030 rows=3 loops=2)
>       Recheck Cond: ((u.field_name)::text = (t.key)::text)
>       ->  Bitmap Index Scan on photo_info_data_ix__field_name  (cost=0.00..2166.24 rows=109024 width=0) (actual time=0.019..0.019 rows=3 loops=2)
>
> So, I guess I solved my problem! :) The explain analyze still shows that the row estimate is 'quite off' (109024 estimated vs only 3 actual), but the query is light-speeded :)

It's not really solved; it's just a happy coincidence that the current plan runs well. In order to keep the query planner making good choices you need to increase the stats target for the field in the index above. The easiest way to do so is this:

alter database mydb set default_statistics_target=100;

and run analyze again:

analyze;

> I thought that having a primary key (and auto-index because of the primary key) on (photo_id, field_name) should be enough. Now I have two indexes on field_name, but that seems to do good.

Nope, it's the collected stats that let the planner make the right choice.
Re: [PERFORM] Very specialised query
Hi.

Look, what I meant by symmetric is that you don't need the second part of the query, because you will get just the same results simply by

select case when n = 1 then id1 else id2 end,
       case when n = 2 then id1 else id2 end
from (
    SELECT l1.id AS id1, l2.id AS id2
    FROM location l1, location l2
    WHERE l1.objectid = 22893 AND l2.objectid = 22893
      AND l1.id <> l2.id
      AND l1.start < l2.end AND l1.end > l2.start
      AND l1.start < l2.start
) a, (values (1),(2)) b(n)

(I may miss some border cases, like when l1.start = l2.start and/or l1.end = l2.end, but this can be fixed by adding = to the query.)

Look, you can have 4 types of intersections:

a) 1s 2s 2e 1e - 2 inside 1
b) 2s 1s 1e 2e - 1 inside 2 (symmetric to (a): if you have (1,2) from (a) you can generate (2,1) for (b))
c) 1s 2s 1e 2e - 1 to the left of 2
d) 2s 1s 2e 1e - 2 to the left of 1 (symmetric to (c): if you have (1,2) from (c) you can generate (2,1) for (d))

The query above gives you the results for (a) and (c), and you don't need any second part - simply add the symmetric results. Correct me if I miss something.

Best Regards, Vitalii Tymchyshyn
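Vitalii's symmetry argument in miniature: compute only the half of the self-join in which l1 sorts before l2, then emit each pair in both orders, and you recover exactly the full symmetric overlap join. A brute-force sketch (illustrative Python; ordering by the (start, id) pair is one way to handle the equal-start border cases he mentions):

```python
# The full symmetric overlap self-join equals the asymmetric half-join
# plus the swapped copy of every pair it finds.
from itertools import product

# Synthetic rows: (id, start, end), many overlapping intervals.
rows = [(i, s, s + d) for i, (s, d) in
        enumerate(product(range(5), range(1, 4)))]

def overlaps(a, b):
    # Half-open overlap: a.start < b.end AND b.start < a.end.
    return a[1] < b[2] and b[1] < a[2]

# Full symmetric join: every ordered pair of distinct overlapping rows.
full = {(a[0], b[0]) for a, b in product(rows, repeat=2)
        if a[0] != b[0] and overlaps(a, b)}

# Half-join: keep only pairs where l1 sorts strictly before l2 by (start, id).
half = [(a, b) for a, b in product(rows, repeat=2)
        if (a[1], a[0]) < (b[1], b[0]) and overlaps(a, b)]

# Doubling: emit each half-join pair in both orders.
doubled = {(a[0], b[0]) for a, b in half} | {(b[0], a[0]) for a, b in half}

assert doubled == full
print(len(full), "pairs recovered from", len(half), "half-join rows")
```

The database does half the join work; the doubling step is a cheap cross join against (values (1),(2)), exactly as in the SQL above.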
Re: [PERFORM] Very specialised query
Hello Matthew,

Another idea:

Are your objects limited to some smaller ranges of your whole interval? If yes, you may possibly reduce the ranges to search while using an additional table with the min(start) / max(end) of each object...

Marc Mamin
[PERFORM] Trying to track down weird query stalls
I'm running a 64-bit build of Postgres 8.3.5 on AIX 5.3, and have a really strange, annoying transient problem with one particular query stalling.

The symptom here is that when this query is made with X or more records in a temp table involved in the join (where X is constant when the problem manifests, but is different between manifestations) the query takes about 20 minutes. When the query is made with X-1 records it takes less than a second. It's just this one query -- for everything else the system's nice and snappy. The machine load's never above 9 (it's a 32-CPU box) and hasn't had less than 60G of free system memory on it.

An EXPLAIN ANALYZE of the two queries (with X-1 and X records) is even more bizarre. Not only are the two query plans identical (save trivial differences because of the record count differences) but the EXPLAIN ANALYZE total runtimes are near-identical -- the fast case showed 259 ms, the slow case 265 ms. When the slow query was actually run, though, it took 20 minutes. There were no blocked back ends shown in pg_stat_activity, and the back end itself was definitely moving. I trussed the back end stuck running the slow query and it spent nearly all its time doing kread() and kwrite() calls. The DB log didn't show anything interesting in it. I checked to make sure the SQL statement that was EXPLAIN ANALYZEd was the one actually executed, and I pulled the client code into the debugger and single-stepped through it just to make sure it was getting stuck on that one SQL statement and it wasn't the client doing something unexpected.

Just to be even more annoying, the problem goes away. Right now I can't trigger the problem. Last Friday it happened reliably feeding 58 records into this query. The week before it was 38 records. Today? Nothing, the system's full of snappy.

At the moment I'm at a loss as to what's going wrong, and the fact that I can't duplicate it right now, when I actually have time to look into the problem, is damned annoying.
What I'm looking for are some hints as to where to look the next time it does happen, or things I can flip on to catch more in the log. (The logging parameters are all set at their defaults) I'm going to try attaching with dbx to get a stack trace (I assume this is relatively safe?) but past that I'm kind of stumped. Help?
Re: [PERFORM] Very specialised query
> Yeah, that's nice. However, it is still the case that we can't trust the database to choose the correct plan. It is currently choosing the correct plan only by chance, and some time later it may by chance switch to one that takes 40 minutes.

What is the bad plan? Is it like the first plan from your first message?

You can sometimes tweak the optimizer to make sure it will pick the correct plan. E.g. when your database fits in memory, you can tweak the page access costs. Also don't forget to raise the statistics target.

BTW, about aggregates: they can return arrays, but I can't imagine what you could group by on... Maybe windowing functions from 8.4 could help.

Also, if your maximum length (select max(end - start) from location) is low enough, you can try adding some more constraints to make the optimizer happy (giving it a more precise row count, so it selects the correct plan).
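The extra constraint hinted at here is sound because an interval can only overlap l1 if it starts within the maximum interval length of l1.start. Adding a bound such as l2.start >= l1.start - max_len (a hypothetical rewrite, with max_len taken from select max(end - start) from location) cannot lose any matching pair, while it turns the open-ended scan into a two-sided range. A brute-force check of that claim (illustrative Python):

```python
# If every interval has length <= MAX_LEN, then any l2 overlapping l1
# satisfies l2.start >= l1.start - MAX_LEN, so the extra range bound
# never excludes a genuine match.
from itertools import product

MAX_LEN = 3
# All intervals (start, end) with end - start <= MAX_LEN.
intervals = [(s, s + d) for s in range(8) for d in range(MAX_LEN + 1)]

def overlaps(a, b):
    # Half-open overlap test, matching l1.start < l2.end AND l1.end > l2.start.
    return a[0] < b[1] and b[0] < a[1]

for l1, l2 in product(intervals, repeat=2):
    if overlaps(l1, l2):
        assert l2[0] >= l1[0] - MAX_LEN  # the added bound is always implied
print("bound never excludes an overlapping pair")
```

The same reasoning is why the thread keeps returning to bounded ("between") index conditions: with the extra bound the planner can scan only the slice of starts near l1.start instead of everything before it.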
Re: [PERFORM] Trying to track down weird query stalls
On Mon, Mar 30, 2009 at 1:50 PM, d...@sidhe.org wrote:
> I'm running a 64-bit build of Postgres 8.3.5 on AIX 5.3, and have a really strange, annoying transient problem with one particular query stalling.
>
> The symptom here is that when this query is made with X or more records in a temp table involved in the join (where X is constant when the problem manifests, but is different between manifestations) the query takes about 20 minutes. When the query is made with X-1 records it takes less than a second. It's just this one query -- for everything else the system's nice and snappy. The machine load's never above 9 (it's a 32 CPU box) and hasn't had less than 60G of free system memory on it.
>
> An EXPLAIN ANALYZE of the two queries (with X-1 and X records) is even more bizarre. Not only are the two query plans identical (save trivial differences because of the record count differences) but the EXPLAIN ANALYZE total runtimes are near-identical -- the fast case showed 259ms, the slow case 265ms. When the slow query was actually run, though, it took 20 minutes. There were no blocked back ends shown in pg_stat_activity, and the back end itself was definitely moving.
>
> I trussed the back end stuck running the slow query and it spent nearly all its time doing kread() and kwrite() calls. The DB log didn't show anything interesting in it. I checked to make sure the SQL statement that was EXPLAIN ANALYZEd was the one actually executed, and I pulled the client code into the debugger and single-stepped through just to make sure it was getting stuck on that one SQL statement and it wasn't the client doing something unexpected.
>
> Just to be even more annoying, the problem goes away. Right now I can't trigger the problem. Last Friday it happened reliably feeding 58 records into this query. The week before it was 38 records. Today? Nothing, the system's full of snappy.
> At the moment I'm at a loss as to what's going wrong, and the fact that I can't duplicate it right now when I actually have time to look into the problem's damned annoying. What I'm looking for are some hints as to where to look the next time it does happen, or things I can flip on to catch more in the log. (The logging parameters are all set at their defaults) I'm going to try attaching with dbx to get a stack trace (I assume this is relatively safe?) but past that I'm kind of stumped.

log_min_duration_statement is a good place to start, but it sounds like the query plan you're getting when you test it by hand isn't the same one that's actually executing, so I'm not sure how far that's going to get you. contrib/auto_explain sounds like it would be just the thing, but unfortunately that's an 8.4 feature which hasn't been released yet. I'm not sure whether it could be built and run against 8.3.

...Robert
Re: [PERFORM] Trying to track down weird query stalls
> On Mon, Mar 30, 2009 at 1:50 PM, d...@sidhe.org wrote:
>> I'm running a 64-bit build of Postgres 8.3.5 on AIX 5.3, and have a really strange, annoying transient problem with one particular query stalling.
>>
>> The symptom here is that when this query is made with X or more records in a temp table involved in the join (where X is constant when the problem manifests, but is different between manifestations) the query takes about 20 minutes. When the query is made with X-1 records it takes less than a second. It's just this one query -- for everything else the system's nice and snappy. The machine load's never above 9 (it's a 32 CPU box) and hasn't had less than 60G of free system memory on it.
>>
>> An EXPLAIN ANALYZE of the two queries (with X-1 and X records) is even more bizarre. Not only are the two query plans identical (save trivial differences because of the record count differences) but the EXPLAIN ANALYZE total runtimes are near-identical -- the fast case showed 259ms, the slow case 265ms.
>
> log_min_duration_statement is a good place to start, but it sounds like the query plan you're getting when you test it by hand isn't the same one that's actually executing, so I'm not sure how far that's going to get you. contrib/auto_explain sounds like it would be just the thing, but unfortunately that's an 8.4 feature which hasn't been released yet. I'm not sure whether it could be built and run against 8.3.

I'm not executing any of the EXPLAINs by hand, because I didn't want to have to worry about typos or filling in temp tables with test data. Inside the app the SQL for the problematic query is stored in a variable -- when the task runs with debugging enabled it first executes the query with EXPLAIN ANALYZE prepended and dumps the output, then it executes the query itself. It's possible something's going wrong in that, but the code's pretty simple.

Arguably in this case the actual query should run faster than the EXPLAIN ANALYZE version, since the cache is hot.
(Though that'd only likely shave a few dozen ms off the runtime)

-Dan
Re: [PERFORM] Trying to track down weird query stalls
On Mon, Mar 30, 2009 at 2:42 PM, d...@sidhe.org wrote:
>> I'm not executing any of the EXPLAINs by hand, because I didn't want to have to worry about typos or filling in temp tables with test data. Inside the app the SQL for the problematic query is stored in a variable -- when the task runs with debugging enabled it first executes the query with EXPLAIN ANALYZE prepended and dumps the output, then it executes the query itself. It's possible something's going wrong in that, but the code's pretty simple. Arguably in this case the actual query should run faster than the EXPLAIN ANALYZE version, since the cache is hot. (Though that'd only likely shave a few dozen ms off the runtime)
>
> Well... yeah. Also EXPLAIN ANALYZE has a non-trivial amount of overhead, so that is quite bizarre. I have to suspect there is some subtle difference between the way the EXPLAIN ANALYZE is done and the way the actual query is done... like maybe one uses parameter substitution and the other doesn't or, well, I don't know. But I don't see how turning on debugging (which is essentially what EXPLAIN ANALYZE is) can prevent the query from being slow.

Hence the query to the list. *Something* is going on, and it beats me what. I'm assuming I'm triggering some bug in the Postgres back end, or there's some completely bizarre edge case that this tickles. (The massive kread/kwrite activity that truss showed me when I checked seemed rather unusual, to say the least.)

EXPLAIN ANALYZE is my normal means of diagnosing performance problems, but that isn't helping, as it shows perfectly sane results. That leaves abnormal means, and outside of trussing the back end or attaching with dbx to get a stack trace I just don't have any of those. I'm not even sure what I should be looking for when I do get a stack trace.
-Dan
Re: [PERFORM] Trying to track down weird query stalls
On Mon, Mar 30, 2009 at 12:42 PM, d...@sidhe.org wrote:
>> On Mon, Mar 30, 2009 at 1:50 PM, d...@sidhe.org wrote:
>>> I'm running a 64-bit build of Postgres 8.3.5 on AIX 5.3, and have a really strange, annoying transient problem with one particular query stalling. The symptom here is that when this query is made with X or more records in a temp table involved in the join (where X is constant when the problem manifests, but is different between manifestations) the query takes about 20 minutes. When the query is made with X-1 records it takes less than a second. It's just this one query -- for everything else the system's nice and snappy. The machine load's never above 9 (it's a 32 CPU box) and hasn't had less than 60G of free system memory on it.
>>>
>>> An EXPLAIN ANALYZE of the two queries (with X-1 and X records) is even more bizarre. Not only are the two query plans identical (save trivial differences because of the record count differences) but the EXPLAIN ANALYZE total runtimes are near-identical -- the fast case showed 259ms, the slow case 265ms.
>>
>> log_min_duration_statement is a good place to start, but it sounds like the query plan you're getting when you test it by hand isn't the same one that's actually executing, so I'm not sure how far that's going to get you. contrib/auto_explain sounds like it would be just the thing, but unfortunately that's an 8.4 feature which hasn't been released yet. I'm not sure whether it could be built and run against 8.3.
>
> I'm not executing any of the EXPLAINs by hand, because I didn't want to have to worry about typos or filling in temp tables with test data. Inside the app the SQL for the problematic query is stored in a variable -- when the task runs with debugging enabled it first executes the query with EXPLAIN ANALYZE prepended and dumps the output, then it executes the query itself. It's possible something's going wrong in that, but the code's pretty simple.
Arguably in this case the actual query should run faster than the EXPLAIN ANALYZE version, since the cache is hot. (Though that'd only likely shave a few dozen ms off the runtime) Joining a lot of tables together? Could be GEQO kicking in.
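For reference, the contrib/auto_explain module mentioned above is enabled roughly like this once you are on 8.4 or later (a sketch using the module's documented settings; it does not help the 8.3 server in this thread):

```sql
-- Load the module into the current session, then log the plan of any
-- statement that runs longer than the threshold:
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '250ms';
SET auto_explain.log_analyze = on;
```

This captures the plan actually used by the running application, rather than the plan you get when retyping the query by hand.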
Re: [PERFORM] Trying to track down weird query stalls
On Mon, Mar 30, 2009 at 1:42 PM, d...@sidhe.org wrote: On Mon, Mar 30, 2009 at 12:42 PM, d...@sidhe.org wrote: Arguably in this case the actual query should run faster than the EXPLAIN ANALYZE version, since the cache is hot. (Though that'd only likely shave a few dozen ms off the runtime) Joining a lot of tables together? Could be GEQO kicking in. Only if I get different query plans for the query depending on whether it's being EXPLAIN ANALYZEd or not. That seems unlikely... Yes, you can. In fact you often will. Not because it's being explained or not, just because that's how GEQO works. Ouch. I did *not* know that was possible -- I assumed that the plan was deterministic and independent of explain analyze. The query has seven tables (one of them a temp table) and my geqo_threshold is set to 12. If I'm reading the docs right GEQO shouldn't kick in. -Dan
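For what it's worth, the GEQO question can be checked directly from any session (these are stock PostgreSQL settings, nothing specific to this installation):

```sql
-- GEQO only kicks in when the number of FROM items reaches
-- geqo_threshold (default 12); the query in this thread joins 7 tables.
SHOW geqo;
SHOW geqo_threshold;

-- To take GEQO out of the picture entirely while investigating:
SET geqo = off;
```

With geqo off, any remaining plan nondeterminism cannot be blamed on the genetic optimizer.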
Re: [PERFORM] Trying to track down weird query stalls
On Mon, Mar 30, 2009 at 4:02 PM, d...@sidhe.org wrote: On Mon, Mar 30, 2009 at 1:42 PM, d...@sidhe.org wrote: On Mon, Mar 30, 2009 at 12:42 PM, d...@sidhe.org wrote: Arguably in this case the actual query should run faster than the EXPLAIN ANALYZE version, since the cache is hot. (Though that'd only likely shave a few dozen ms off the runtime) Joining a lot of tables together? Could be GEQO kicking in. Only if I get different query plans for the query depending on whether it's being EXPLAIN ANALYZEd or not. That seems unlikely... Yes, you can. In fact you often will. Not because it's being explained or not, just because that's how GEQO works. Ouch. I did *not* know that was possible -- I assumed that the plan was deterministic and independent of explain analyze. The query has seven tables (one of them a temp table) and my geqo_threshold is set to 12. If I'm reading the docs right GEQO shouldn't kick in. Any chance we could see the actual query? Right now I think we are shooting in the dark. ...Robert
Re: [PERFORM] Trying to track down weird query stalls
On Mon, Mar 30, 2009 at 4:02 PM, d...@sidhe.org wrote: On Mon, Mar 30, 2009 at 1:42 PM, d...@sidhe.org wrote: On Mon, Mar 30, 2009 at 12:42 PM, d...@sidhe.org wrote: Arguably in this case the actual query should run faster than the EXPLAIN ANALYZE version, since the cache is hot. (Though that'd only likely shave a few dozen ms off the runtime) Joining a lot of tables together? Could be GEQO kicking in. Only if I get different query plans for the query depending on whether it's being EXPLAIN ANALYZEd or not. That seems unlikely... Yes, you can. In fact you often will. Not because it's being explained or not, just because that's how GEQO works. Ouch. I did *not* know that was possible -- I assumed that the plan was deterministic and independent of explain analyze. The query has seven tables (one of them a temp table) and my geqo_threshold is set to 12. If I'm reading the docs right GEQO shouldn't kick in. Any chance we could see the actual query? Right now I think we are shooting in the dark. The query is:

select distinct temp_symbol.entityid, temp_symbol.libname, temp_symbol.objid,
  temp_symbol.objname, temp_symbol.fromsymid, temp_symbol.fromsymtype,
  temp_symbol.objinstance, NULL, temp_symbol.csid, libinstance.entityid, NULL,
  libobject.objid, NULL, provide_symbol.symbolid, provide_symbol.symboltype,
  libobject.objinstance, libobject.libinstanceid, objectinstance.csid, NULL,
  provide_symbol.is_weak, NULL, provide_symbol.is_local, NULL,
  provide_symbol.is_template, NULL, provide_symbol.is_common
from libinstance, library, libobject, provide_symbol, temp_symbol,
  objectinstance, attributes
where libinstance.libdate <= 1238445044
  and libinstance.enddate > 1238445044
  and libinstance.libinstanceid = libobject.libinstanceid
  and libinstance.architecture = ?
  and attributes.entityid = libinstance.entityid
  and attributes.branchid = libinstance.branchid
  and attributes.architecture = libinstance.architecture
  and library.libid = libinstance.libid
  and not secondary
  and attribute in ('notoffline', 'notoffline')
  and (provide_symbol.symboltype = 'T')
  and libobject.objinstance = provide_symbol.objinstance
  and libinstance.branchid = ?
  and provide_symbol.symbolid = temp_symbol.symbolid
  and objectinstance.objinstance = libobject.objinstance
  and libinstance.istemp = 0

The explain analyze for the query's attached in a (possibly hopeless) attempt to keep it from being word-wrapped into unreadability. -Dan

Unique (cost=25323267.82..25350114.59 rows=397730 width=192) (actual time=277.963..278.054 rows=35 loops=1)
  -> Sort (cost=25323267.82..25324262.14 rows=397730 width=192) (actual time=277.960..277.967 rows=35 loops=1)
     Sort Key: temp_symbol.entityid, temp_symbol.libname, temp_symbol.objid, temp_symbol.objname, temp_symbol.fromsymid, temp_symbol.fromsymtype, temp_symbol.objinstance, temp_symbol.csid, libinstance.entityid, libobject.objid, provide_symbol.symbolid, libobject.objinstance, libobject.libinstanceid, objectinstance.csid, provide_symbol.is_weak, provide_symbol.is_local, provide_symbol.is_template, provide_symbol.is_common
     Sort Method: quicksort Memory: 27kB
     -> Merge Join (cost=25234932.60..25286276.09 rows=397730 width=192) (actual time=276.217..277.397 rows=35 loops=1)
        Merge Cond: (libinstance.libinstanceid = libobject.libinstanceid)
        -> Sort (cost=7950.07..7958.88 rows=3523 width=8) (actual time=256.190..256.655 rows=1838 loops=1)
           Sort Key: libinstance.libinstanceid
           Sort Method: quicksort Memory: 135kB
           -> Merge Join (cost=7604.97..7742.52 rows=3523 width=8) (actual time=231.880..252.275 rows=1838 loops=1)
              Merge Cond: (libinstance.libid = library.libid)
              -> Sort (cost=5951.50..5960.32 rows=3529 width=12) (actual time=156.866..157.638 rows=1849 loops=1)
                 Sort Key: libinstance.libid
                 Sort Method: quicksort Memory: 135kB
                 -> Merge Join (cost=5657.33..5743.55 rows=3529 width=12) (actual time=141.311..152.022 rows=1849 loops=1)
                    Merge Cond: (libinstance.entityid = attributes.entityid)
                    -> Sort (cost=4143.42..4159.22 rows=6319 width=18) (actual time=126.355..128.805 rows=5532 loops=1)
                       Sort Key: libinstance.entityid
                       Sort Method: quicksort Memory: 625kB
                       -> Seq Scan on libinstance (cost=0.00..3744.52 rows=6319 width=18) (actual time=0.074..107.409 rows=5533
Re: [PERFORM] Trying to track down weird query stalls
d...@sidhe.org escribió: where libinstance.libdate <= 1238445044 and libinstance.enddate > 1238445044 and libinstance.libinstanceid = libobject.libinstanceid and libinstance.architecture = ? How are you generating the explain? My bet is that you're just substituting a literal in the architecture condition, but if the driver is smart then maybe it's preparing the query beforehand. You'll get a different plan in that case. Try something like this:

prepare foo(smallint) as
  ... where libinstance.architecture = $1 ...
explain analyze execute foo(1);

If the plan you get from that is bad (and it often is), you should look at avoiding a query prepare. -- Alvaro Herrera  http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [PERFORM] Trying to track down weird query stalls
d...@sidhe.org escribió: where libinstance.libdate <= 1238445044 and libinstance.enddate > 1238445044 and libinstance.libinstanceid = libobject.libinstanceid and libinstance.architecture = ? How are you generating the explain? My bet is that you're just substituting a literal in the architecture condition, but if the driver is smart then maybe it's preparing the query beforehand. You'll get a different plan in that case. I don't think so. Perl's DBI is involved, but the statement's in a variable. The code in question is:

if ($db->{debug}) {
    $db->debug("SQL is: $sql\n");
    my $rows = $db->{dbh}->selectall_arrayref("explain analyze $sql", undef,
                                              $db->{arch}, $db->{basebranch});
    foreach my $row (@$rows) {
        $db->debug(join(" ", @$row) . "\n");
    }
    $db->debug_stamp("Initial query done\n");
}
$rows = $db->{dbh}->selectall_arrayref($sql, undef, $db->{arch}, $db->{basebranch});

There's no transform of the sql variable between the two statements, just a quick loop over the returned rows from the explain analyze to print them out. (I did try to make sure that the debugging bits were done the same way as the mainline code, but I may have bobbled something) -Dan
Re: [PERFORM] Trying to track down weird query stalls
d...@sidhe.org escribió: d...@sidhe.org escribió: where libinstance.libdate <= 1238445044 and libinstance.enddate > 1238445044 and libinstance.libinstanceid = libobject.libinstanceid and libinstance.architecture = ? How are you generating the explain? My bet is that you're just substituting a literal in the architecture condition, but if the driver is smart then maybe it's preparing the query beforehand. You'll get a different plan in that case. I don't think so. Perl's DBI is involved, but the statement's in a variable. So what's the ? in the query you pasted earlier? -- Alvaro Herrera  http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [PERFORM] Trying to track down weird query stalls
d...@sidhe.org escribió: d...@sidhe.org escribió: where libinstance.libdate <= 1238445044 and libinstance.enddate > 1238445044 and libinstance.libinstanceid = libobject.libinstanceid and libinstance.architecture = ? How are you generating the explain? My bet is that you're just substituting a literal in the architecture condition, but if the driver is smart then maybe it's preparing the query beforehand. You'll get a different plan in that case. I don't think so. Perl's DBI is involved, but the statement's in a variable. So what's the ? in the query you pasted earlier? The first ? (for architecture) is 1, the second ? (for branchid) is 0. They both should get passed to Postgres as $1 and $2, respectively, assuming DBD::Pg does its substitution right. (They're both supposed to go in as placeholders) -Dan
Re: [PERFORM] Trying to track down weird query stalls
d...@sidhe.org escribió: So what's the ? in the query you pasted earlier? The first ? (for architecture) is 1, the second ? (for branchid) is 0. They both should get passed to Postgres as $1 and $2, respectively, assuming DBD::Pg does its substitution right. (They're both supposed to go in as placeholders) Right, so how about you reread what I wrote above? Oh, hmm, so to be more clear: I don't think DBD::Pg is actually sending EXECUTE PREPARE. You need to do this over psql. -- Alvaro Herrera  http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
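A minimal psql session illustrating the suggestion (a hypothetical demo table stands in for the real schema; the actual query's tables, types, and predicates would be substituted):

```sql
-- Stand-in table; the real query's columns/types are assumptions here.
CREATE TEMP TABLE demo (architecture smallint, branchid smallint, payload text);

-- Plan chosen for the prepared statement with placeholders,
-- i.e. what the app sees through PQexecPrepared:
PREPARE foo(smallint, smallint) AS
  SELECT * FROM demo WHERE architecture = $1 AND branchid = $2;
EXPLAIN ANALYZE EXECUTE foo(1, 0);

-- Plan chosen when the same values appear as literals,
-- i.e. what a hand-typed EXPLAIN ANALYZE shows:
EXPLAIN ANALYZE SELECT * FROM demo WHERE architecture = 1 AND branchid = 0;

DEALLOCATE foo;
```

If the two plans differ, the hand-run EXPLAIN ANALYZE was never describing what the application actually executes.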
Re: [PERFORM] Trying to track down weird query stalls
d...@sidhe.org escribió: So what's the ? in the query you pasted earlier? The first ? (for architecture) is 1, the second ? (for branchid) is 0. They both should get passed to Postgres as $1 and $2, respectively, assuming DBD::Pg does its substitution right. (They're both supposed to go in as placeholders) Right, so how about you reread what I wrote above? Oh, hmm, so to be more clear: I don't think DBD::Pg is actually sending EXECUTE PREPARE. You need to do this over psql. Fair enough. (And sorry about the mis-read) Next time this occurs I'll try and duplicate this in psql. FWIW, a quick read of the C underlying the DBD::Pg module shows it using PQexecPrepared, so I'm pretty sure it is using prepared statements with placeholders, but double-checking seems prudent. -Dan
Re: [PERFORM] Trying to track down weird query stalls
d...@sidhe.org escribió: Fair enough. (And sorry about the mis-read) Next time this occurs I'll try and duplicate this in psql. FWIW, a quick read of the C underlying the DBD::Pg module shows it using PQexecPrepared, so I'm pretty sure it is using prepared statements with placeholders, but double-checking seems prudent. Yes, but I doubt that it'll be smart enough to work for EXPLAIN in the way we need here. -- Alvaro Herrera  http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Re: [PERFORM] Trying to track down weird query stalls
On 3/30/09 2:34 PM, d...@sidhe.org wrote: d...@sidhe.org escribió: So what's the ? in the query you pasted earlier? The first ? (for architecture) is 1, the second ? (for branchid) is 0. They both should get passed to Postgres as $1 and $2, respectively, assuming DBD::Pg does its substitution right. (They're both supposed to go in as placeholders) Right, so how about you reread what I wrote above? Oh, hmm, so to be more clear: I don't think DBD::Pg is actually sending EXECUTE PREPARE. You need to do this over psql. Fair enough. (And sorry about the mis-read) Next time this occurs I'll try and duplicate this in psql. FWIW, a quick read of the C underlying the DBD::Pg module shows it using PQexecPrepared, so I'm pretty sure it is using prepared statements with placeholders, but double-checking seems prudent. -Dan Regardless, it's always a good idea to do a manual explain analyze with and without parameterization in psql if prepared statements are involved. The query planner functions very differently with and without them, almost always with a performance detriment to query execution times when parameterized.