Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0
Jim C. Nasby wrote: I'm not sure who's responsible for DBI::Pg (Josh?), but would it make sense to add better support for bytea to DBI::Pg? ISTM there should be a better way of doing this than adding gobs of \'s. It has support for binding a bytea parameter, but in this case we're trying to build up an array and pass that into a stored procedure. The $dbh-quote() method for DBD::Pg lacks the ability to quote bytea types. There is actually a TODO note in the code about adding support for quoting Pg specific types. Presumabliy the difficulties we are having with this would be solved by that, once it has been implemented. In the meantime, I believe it's just a matter of getting the right escapes happen so that the procedure is inserting values that we can later get via a select and using bind_param() with the PG_BYTEA type. Michael signature.asc Description: OpenPGP digital signature
[PERFORM] Planner incorrectly choosing seq scan over index scan
Hi all, We're using 8.0.3 and we're seeing a problem where the planner is choosing a seq scan and hash join over an index scan. If I set enable_hashjoin to off, then I get the plan I'm expecting and the query runs a lot faster. I've also tried lowering the random page cost (even to 1) but the planner still chooses to use the hash join. Does anyone have any thoughts/suggestions? I saw that there was a thread recently in which the planner wasn't correctly estimating the cost for queries using LIMIT. Is it possible that something similar is happening here (perhaps because of the sort) and that the patch Tom proposed would fix it? Thanks. Here are the various queries and plans: Normal settings explain analyze select c.sourceId, c.targetId, abs(c.tr - c.sr) as xmy, (c.sr - s.ar) * (c.tr - t.ar) as xy, (c.sr - s.ar) * (c.sr - s.ar) as x2, (c.tr - t.ar) * (c.tr - t.ar) as y2 from candidates617004 c, lte_user s, lte_user t where c.sourceId = s.user_id and c.targetId = t.user_id order by c.sourceId, c.targetId; QUERY PLAN Sort (cost=13430.57..13439.24 rows=3467 width=48) (actual time=1390.000..1390.000 rows=3467 loops=1) Sort Key: c.sourceid, c.targetid - Merge Join (cost=9912.07..13226.72 rows=3467 width=48) (actual time=1344.000..1375.000 rows=3467 loops=1) Merge Cond: (outer.user_id = inner.sourceid) - Index Scan using lte_user_pkey on lte_user s (cost=0.00..16837.71 rows=279395 width=16) (actual time=0.000..95.000 rows=50034 loops=1) - Sort (cost=9912.07..9920.73 rows=3467 width=40) (actual time=1156.000..1156.000 rows=3467 loops=1) Sort Key: c.sourceid - Hash Join (cost=8710.44..9708.21 rows=3467 width=40) (actual time=1125.000..1156.000 rows=3467 loops=1) Hash Cond: (outer.targetid = inner.user_id) - Seq Scan on candidates617004 c (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..0.000 rows=3467 loops=1) - Hash (cost=8011.95..8011.95 rows=279395 width=16) (actual time=1125.000..1125.000 rows=0 loops=1) - Seq Scan on lte_user t (cost=0.00..8011.95 rows=279395 width=16) (actual time=0.000..670.000 rows=279395 loops=1) Total runtime: 1406.000 ms enable_hashjoin disabled QUERY PLAN Sort (cost=14355.37..14364.03 rows=3467 width=48) (actual time=391.000..391.000 rows=3467 loops=1) Sort Key: c.sourceid, c.targetid - Nested Loop (cost=271.52..14151.51 rows=3467 width=48) (actual time=203.000..359.000 rows=3467 loops=1) - Merge Join (cost=271.52..3490.83 rows=3467 width=40) (actual time=203.000..218.000 rows=3467 loops=1) Merge Cond: (outer.user_id = inner.sourceid) - Index Scan using lte_user_pkey on lte_user s (cost=0.00..16837.71 rows=279395 width=16) (actual time=0.000..126.000 rows=50034 loops=1) - Sort (cost=271.52..280.19 rows=3467 width=32) (actual time=15.000..30.000 rows=3467 loops=1) Sort Key: c.sourceid - Seq Scan on candidates617004 c (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..0.000 rows=3467 loops=1) - Index Scan using lte_user_pkey on lte_user t (cost=0.00..3.03 rows=1 width=16) (actual time=0.031..0.036 rows=1 loops=3467) Index Cond: (outer.targetid = t.user_id) Total runtime: 406.000 ms random_page_cost set to 1.5 -- QUERY PLAN Sort (cost=12702.62..12711.29 rows=3467 width=48) (actual time=1407.000..1407.000 rows=3467 loops=1) Sort Key: c.sourceid, c.targetid - Merge Join (cost=9912.07..12498.77 rows=3467 width=48) (actual time=1391.000..1407.000 rows=3467 loops=1) Merge Cond: (outer.user_id = inner.sourceid) - Index Scan using lte_user_pkey on lte_user s (cost=0.00..12807.34 rows=279395 width=16) (actual time=0.000..46.000 rows=50034 loops=1) - Sort (cost=9912.07..9920.73 rows=3467 width=40) (actual time=1188.000..1188.000 rows=3467 loops=1) Sort Key: c.sourceid - Hash Join (cost=8710.44..9708.21 rows=3467 width=40) (actual time=1157.000..1188.000 rows=3467 loops=1) Hash Cond: (outer.targetid = inner.user_id) - Seq Scan on candidates617004 c (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..15.000 rows=3467 loops=1) - Hash (cost=8011.95..8011.95 rows=279395 width=16) (actual time=1157.000..1157.000 rows=0 loops=1) - Seq Scan on lte_user t (cost=0.00..8011.95 rows=279395 width=16) (actual time=0.000..750.000 rows=279395 loops=1) Total runtime: 1422.000 ms random_page_cost set to 1.5 and enable_hashjoin set to false -- QUERY PLAN Sort (cost=13565.58..13574.25 rows=3467 width=48) (actual time=390.000..390.000 rows=3467 loops=1) Sort Key: c.sourceid, c.targetid - Nested Loop (cost=271.52..13361.73 rows=3467 width=48) (actual time=203.000..360.000 rows=3467 loops=1) - Merge Join (cost=271.52..2762.88 rows=3467 width=40) (actual time=203.000..250.000 rows=3467 loops=1) Merge Cond: (outer.user_id = inner.sourceid) - Index Scan using lte_user_pkey on lte_user s (cost=0.00..12807.34 rows=279395 width=16) (actual
Re: [PERFORM] Planner incorrectly choosing seq scan over index scan
[Meetesh Karia - Tue at 12:19:27AM +0200] We're using 8.0.3 and we're seeing a problem where the planner is choosing a seq scan and hash join over an index scan. If I set enable_hashjoin to off, then I get the plan I'm expecting and the query runs a lot faster. I've also tried lowering the random page cost (even to 1) but the planner still chooses to use the hash join. Have you tried increasing the statistics collection? -- Tobias Brox, +47-91700050 Nordicbet, IT dept ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Planner incorrectly choosing seq scan over index scan
Meetesh Karia wrote: Hi all, We're using 8.0.3 and we're seeing a problem where the planner is choosing a seq scan and hash join over an index scan. If I set enable_hashjoin to off, then I get the plan I'm expecting and the query runs a lot faster. I've also tried lowering the random page cost (even to 1) but the planner still chooses to use the hash join. Does anyone have any thoughts/suggestions? I saw that there was a thread recently in which the planner wasn't correctly estimating the cost for queries using LIMIT. Is it possible that something similar is happening here (perhaps because of the sort) and that the patch Tom proposed would fix it? Thanks. Here are the various queries and plans: Normal settings ... QUERY PLAN Sort (cost=13430.57..13439.24 rows=3467 width=48) (actual time=1390.000..1390.000 rows=3467 loops=1) Sort Key: c.sourceid, c.targetid - Merge Join (cost=9912.07..13226.72 rows=3467 width=48) (actual time=1344.000..1375.000 rows=3467 loops=1) Merge Cond: (outer.user_id = inner.sourceid) - Index Scan using lte_user_pkey on lte_user s (cost=0.00..16837.71 rows=279395 width=16) (actual time=0.000..95.000 rows=50034 loops=1) This is where the planner is messing up, and mis-estimating the selectivity. It is expecting to get 280k rows, but only needs to get 50k. I assume lte_user is the bigger table, and that candidates617004 has some subset. Has lte_user and candidates617004 been recently ANALYZEd? All estimates, except for the expected number of rows from lte_user seem to be okay. Is user_id the primary key for lte_user? I'm trying to figure out how you can get 50k rows, by searching a primary key, against a 3.5k rows. Is user_id only part of the primary key for lte_user? Can you give us the output of: \d lte_user \d candidates617004 So that we have the description of the tables, and what indexes you have defined? Also, if you could describe the table layouts, that would help. John =:- - Sort (cost=9912.07..9920.73 rows=3467 width=40) (actual time=1156.000..1156.000 rows=3467 loops=1) Sort Key: c.sourceid - Hash Join (cost=8710.44..9708.21 rows=3467 width=40) (actual time=1125.000..1156.000 rows=3467 loops=1) Hash Cond: (outer.targetid = inner.user_id) - Seq Scan on candidates617004 c (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..0.000 rows=3467 loops=1) - Hash (cost=8011.95..8011.95 rows=279395 width=16) (actual time=1125.000..1125.000 rows=0 loops=1) - Seq Scan on lte_user t (cost=0.00..8011.95 rows=279395 width=16) (actual time=0.000..670.000 rows=279395 loops=1) Total runtime: 1406.000 ms enable_hashjoin disabled QUERY PLAN Sort (cost=14355.37..14364.03 rows=3467 width=48) (actual time=391.000..391.000 rows=3467 loops=1) Sort Key: c.sourceid, c.targetid - Nested Loop (cost=271.52..14151.51 rows=3467 width=48) (actual time=203.000..359.000 rows=3467 loops=1) - Merge Join (cost=271.52..3490.83 rows=3467 width=40) (actual time=203.000..218.000 rows=3467 loops=1) Merge Cond: (outer.user_id = inner.sourceid) - Index Scan using lte_user_pkey on lte_user s (cost=0.00..16837.71 rows=279395 width=16) (actual time=0.000..126.000 rows=50034 loops=1) - Sort (cost=271.52..280.19 rows=3467 width=32) (actual time=15.000..30.000 rows=3467 loops=1) Sort Key: c.sourceid - Seq Scan on candidates617004 c (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..0.000 rows=3467 loops=1) - Index Scan using lte_user_pkey on lte_user t (cost=0.00..3.03 rows=1 width=16) (actual time=0.031..0.036 rows=1 loops=3467) Index Cond: (outer.targetid = t.user_id) Total runtime: 406.000 ms random_page_cost set to 1.5 -- QUERY PLAN Sort (cost=12702.62..12711.29 rows=3467 width=48) (actual time=1407.000..1407.000 rows=3467 loops=1) Sort Key: c.sourceid, c.targetid - Merge Join (cost=9912.07..12498.77 rows=3467 width=48) (actual time=1391.000..1407.000 rows=3467 loops=1) Merge Cond: (outer.user_id = inner.sourceid) - Index Scan using lte_user_pkey on lte_user s (cost=0.00..12807.34 rows=279395 width=16) (actual time=0.000..46.000 rows=50034 loops=1) - Sort (cost=9912.07..9920.73 rows=3467 width=40) (actual time=1188.000..1188.000 rows=3467 loops=1) Sort Key: c.sourceid - Hash Join (cost=8710.44..9708.21 rows=3467 width=40) (actual time=1157.000..1188.000 rows=3467 loops=1) Hash Cond: (outer.targetid = inner.user_id) - Seq Scan on candidates617004 c (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..15.000 rows=3467 loops=1)
Re: [PERFORM] Planner incorrectly choosing seq scan over index scan
Are you referring to the statistics gathering target for ANALYZE? Based on your email, I just tried the following and then re-ran the explain analyze but got the same incorrect plan: alter table candidates617004 alter column sourceId set statistics 1000, alter column targetId set statistics 1000; analyze candidates617004; alter table lte_user alter column user_id set statistics 1000; analyze lte_user; Thanks for your suggestion, Meetesh On 8/2/05, Tobias Brox [EMAIL PROTECTED] wrote: [Meetesh Karia - Tue at 12:19:27AM +0200] We're using 8.0.3 and we're seeing a problem where the planner is choosing a seq scan and hash join over an index scan. If I set enable_hashjoin to off, then I get the plan I'm expecting and the query runs a lot faster. I've also tried lowering the random page cost (even to 1) but the planner still chooses to use the hash join.Have you tried increasing the statistics collection?--Tobias Brox, +47-91700050Nordicbet, IT dept
Re: [PERFORM] Planner incorrectly choosing seq scan over index scan
Thanks John. I've answered your questions below: Has lte_user and candidates617004 been recently ANALYZEd? All estimates,except for the expected number of rows from lte_user seem to be okay. I ANALYZEd both tables just before putting together my first email. And, unfortunately, modifying the statistics target didn't help either. Is user_id the primary key for lte_user? Yes I'm trying to figure out how you can get 50k rows, by searching aprimary key, against a 3.5k rows. Is user_id only part of the primarykey for lte_user? Hmmm ... I missed that before. But, that surprises me too. Especially since sourceId in the candidates table has only 1 value. Also, user_id is the complete primary key for lte_user. Can you give us the output of:\d lte_user\d candidates617004 Sure, here they are: lte=# \d lte_user Table public.lte_user Column | Type | Modifiers ---+-+--- user_id | bigint | not null firstname | character varying(255) | lastname | character varying(255) | address1 | character varying(255) | address2 | character varying(255) | city | character varying(255) | state | character varying(255) | zip | character varying(255) | phone1 | character varying(255) | phone2 | character varying(255) | username | character varying(255) | password | character varying(255) | deleted | boolean | not null ext_cust_id | character varying(255) | aboutme | character varying(255) | birthday | timestamp without time zone | fm_id | bigint | ar | double precision | Indexes: lte_user_pkey PRIMARY KEY, btree (user_id) idx_user_extid btree (ext_cust_id) idx_user_username btree (username) Foreign-key constraints: fk_user_fm FOREIGN KEY (fm_id) REFERENCES fm(fm_id) lte=# \d candidates617004 Table public.candidates617004 Column | Type | Modifiers --+--+--- fmid | bigint | sourceid | bigint | sr | double precision | targetid | bigint | tr | double precision | Also, if you could describe the table layouts, that would help. Sure. The lte_user table is just a collection of users. user_id is assigned uniquely using a sequence. During some processing, we create a candidates table (candidates617004 in our case). This table is usually a temp table. sourceid is a user_id (in this case it is always 617004) and targetid is also a user_id (2860 distinct values out of 3467). The rest of the information is either only used in the select clause or not used at all during this processing. Did I miss something in the table layout description that would be helpful? Thanks for your help! Meetesh -Sort(cost=9912.07..9920.73 rows=3467 width=40) (actual time=1156.000..1156.000 rows=3467 loops=1) Sort Key: c.sourceid -Hash Join(cost=8710.44..9708.21 rows=3467 width=40) (actual time=1125.000..1156.000 rows=3467 loops=1) Hash Cond: (outer.targetid = inner.user_id) -Seq Scan on candidates617004 c (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..0.000 rows=3467 loops=1) -Hash(cost=8011.95..8011.95 rows=279395 width=16) (actual time=1125.000..1125.000 rows=0 loops=1) -Seq Scan on lte_user t (cost=0.00..8011.95 rows=279395 width=16) (actual time=0.000..670.000 rows=279395 loops=1) Total runtime: 1406.000 ms enable_hashjoin disabled QUERY PLAN Sort(cost=14355.37..14364.03 rows=3467 width=48) (actual time=391.000..391.000 rows=3467 loops=1) Sort Key: c.sourceid, c.targetid -Nested Loop(cost=271.52..14151.51 rows=3467 width=48) (actual time=203.000..359.000 rows=3467 loops=1) -Merge Join(cost=271.52..3490.83 rows=3467 width=40) (actual time=203.000..218.000 rows=3467 loops=1) Merge Cond: (outer.user_id = inner.sourceid) -Index Scan using lte_user_pkey on lte_user s (cost=0.00..16837.71 rows=279395 width=16) (actual time=0.000..126.000 rows=50034 loops=1) -Sort(cost=271.52..280.19 rows=3467 width=32) (actual time=15.000..30.000 rows=3467 loops=1) Sort Key: c.sourceid -Seq Scan on candidates617004 c (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..0.000 rows=3467 loops=1) -Index Scan using lte_user_pkey on lte_user t (cost=0.00..3.03 rows=1 width=16) (actual time=0.031..0.036 rows=1 loops=3467) Index Cond: (outer.targetid = t.user_id) Total runtime: 406.000 ms random_page_cost set to 1.5 -- QUERY PLAN Sort(cost= 12702.62..12711.29 rows=3467 width=48) (actual time=1407.000..1407.000 rows=3467 loops=1) Sort Key: c.sourceid, c.targetid -Merge Join(cost=9912.07..12498.77 rows=3467 width=48) (actual time=1391.000..1407.000 rows=3467 loops=1) Merge Cond: (outer.user_id = inner.sourceid) -Index Scan using lte_user_pkey on lte_user s (cost=0.00..12807.34 rows=279395 width=16) (actual time=0.000..46.000 rows=50034 loops=1) -Sort(cost=9912.07..9920.73 rows=3467 width=40) (actual time=1188.000..1188.000 rows=3467 loops=1) Sort Key: c.sourceid -Hash Join(cost=8710.44..9708.21 rows=3467 width=40) (actual time=1157.000..1188.000 rows=3467 loops=1) Hash Cond:
Re: [PERFORM] Planner incorrectly choosing seq scan over index scan
Meetesh Karia [EMAIL PROTECTED] writes: Sure. The lte_user table is just a collection of users. user_id is assigned= uniquely using a sequence. During some processing, we create a candidates= table (candidates617004 in our case). This table is usually a temp table.= sourceid is a user_id (in this case it is always 617004) and targetid is=20 also a user_id (2860 distinct values out of 3467). The rest of the=20 information is either only used in the select clause or not used at all=20 during this processing. If you know that sourceid has only a single value, it'd probably be helpful to call out that value in the query, ie, where ... AND c.sourceId = 617004 ... regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings