Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-01 Thread Michael Parker
Jim C. Nasby wrote:

I'm not sure who's responsible for DBI::Pg (Josh?), but would it make
sense to add better support for bytea to DBI::Pg? ISTM there should be a
better way of doing this than adding gobs of \'s.
  

It has support for binding a bytea parameter, but in this case we're
trying to build up an array and pass that into a stored procedure.  The
$dbh-quote() method for DBD::Pg lacks the ability to quote bytea
types.  There is actually a TODO note in the code about adding support
for quoting Pg specific types.  Presumabliy the difficulties we are
having with this would be solved by that, once it has been implemented. 
In the meantime, I believe it's just a matter of getting the right
escapes happen so that the procedure is inserting values that we can
later get via a select and using bind_param() with the PG_BYTEA type.

Michael


signature.asc
Description: OpenPGP digital signature


[PERFORM] Planner incorrectly choosing seq scan over index scan

2005-08-01 Thread Meetesh Karia
Hi all,

We're using 8.0.3 and we're seeing a problem where the planner is
choosing a seq scan and hash join over an index scan. If I set
enable_hashjoin to off, then I get the plan I'm expecting and the query
runs a lot faster. I've also tried lowering the random page cost
(even to 1) but the planner still chooses to use the hash join.

Does anyone have any thoughts/suggestions? I saw that there was a
thread recently in which the planner wasn't correctly estimating the
cost for queries using LIMIT. Is it possible that something
similar is happening here (perhaps because of the sort) and that the
patch Tom proposed would fix it?

Thanks. Here are the various queries and plans:

Normal settings

explain analyze
 select
 c.sourceId,
 c.targetId,
 abs(c.tr - c.sr) as xmy,
 (c.sr - s.ar) * (c.tr - t.ar) as xy,
 (c.sr - s.ar) * (c.sr - s.ar) as x2,
 (c.tr - t.ar) * (c.tr - t.ar) as y2
 from
 candidates617004 c,
 lte_user s,
 lte_user t
 where
 c.sourceId = s.user_id
 and c.targetId = t.user_id
 order by
 c.sourceId,
 c.targetId;

QUERY PLAN
Sort (cost=13430.57..13439.24 rows=3467 width=48) (actual time=1390.000..1390.000 rows=3467 loops=1)
 Sort Key: c.sourceid, c.targetid
 - Merge Join (cost=9912.07..13226.72 rows=3467
width=48) (actual time=1344.000..1375.000 rows=3467 loops=1)
 Merge Cond: (outer.user_id = inner.sourceid)
 - Index Scan using
lte_user_pkey on lte_user s (cost=0.00..16837.71 rows=279395
width=16) (actual time=0.000..95.000 rows=50034 loops=1)
 - Sort
(cost=9912.07..9920.73 rows=3467 width=40) (actual
time=1156.000..1156.000 rows=3467 loops=1)
 Sort Key: c.sourceid

- Hash Join (cost=8710.44..9708.21 rows=3467 width=40)
(actual time=1125.000..1156.000 rows=3467 loops=1)

Hash Cond: (outer.targetid = inner.user_id)

- Seq Scan on candidates617004 c (cost=0.00..67.67
rows=3467 width=32) (actual time=0.000..0.000 rows=3467 loops=1)

- Hash (cost=8011.95..8011.95 rows=279395 width=16)
(actual time=1125.000..1125.000 rows=0 loops=1)

- Seq Scan on lte_user t (cost=0.00..8011.95
rows=279395 width=16) (actual time=0.000..670.000 rows=279395 loops=1)
Total runtime: 1406.000 ms

enable_hashjoin disabled

QUERY PLAN
Sort (cost=14355.37..14364.03 rows=3467 width=48) (actual time=391.000..391.000 rows=3467 loops=1)
 Sort Key: c.sourceid, c.targetid
 - Nested Loop (cost=271.52..14151.51 rows=3467 width=48) (actual time=203.000..359.000 rows=3467 loops=1)
 - Merge Join
(cost=271.52..3490.83 rows=3467 width=40) (actual time=203.000..218.000
rows=3467 loops=1)
 Merge Cond: (outer.user_id = inner.sourceid)

- Index Scan using lte_user_pkey on lte_user s
(cost=0.00..16837.71 rows=279395 width=16) (actual time=0.000..126.000
rows=50034 loops=1)

- Sort (cost=271.52..280.19 rows=3467 width=32) (actual
time=15.000..30.000 rows=3467 loops=1)

Sort Key: c.sourceid

- Seq Scan on candidates617004 c (cost=0.00..67.67
rows=3467 width=32) (actual time=0.000..0.000 rows=3467 loops=1)
 - Index Scan using
lte_user_pkey on lte_user t (cost=0.00..3.03 rows=1 width=16)
(actual time=0.031..0.036 rows=1 loops=3467)
 Index Cond: (outer.targetid = t.user_id)
Total runtime: 406.000 ms

random_page_cost set to 1.5
--
QUERY PLAN
Sort (cost=12702.62..12711.29 rows=3467 width=48) (actual time=1407.000..1407.000 rows=3467 loops=1)
 Sort Key: c.sourceid, c.targetid
 - Merge Join (cost=9912.07..12498.77 rows=3467
width=48) (actual time=1391.000..1407.000 rows=3467 loops=1)
 Merge Cond: (outer.user_id = inner.sourceid)
 - Index Scan using
lte_user_pkey on lte_user s (cost=0.00..12807.34 rows=279395
width=16) (actual time=0.000..46.000 rows=50034 loops=1)
 - Sort
(cost=9912.07..9920.73 rows=3467 width=40) (actual
time=1188.000..1188.000 rows=3467 loops=1)
 Sort Key: c.sourceid

- Hash Join (cost=8710.44..9708.21 rows=3467 width=40)
(actual time=1157.000..1188.000 rows=3467 loops=1)

Hash Cond: (outer.targetid = inner.user_id)

- Seq Scan on candidates617004 c (cost=0.00..67.67
rows=3467 width=32) (actual time=0.000..15.000 rows=3467 loops=1)

- Hash (cost=8011.95..8011.95 rows=279395 width=16)
(actual time=1157.000..1157.000 rows=0 loops=1)

- Seq Scan on lte_user t (cost=0.00..8011.95
rows=279395 width=16) (actual time=0.000..750.000 rows=279395 loops=1)
Total runtime: 1422.000 ms

random_page_cost set to 1.5 and enable_hashjoin set to false
--
QUERY PLAN
Sort (cost=13565.58..13574.25 rows=3467 width=48) (actual time=390.000..390.000 rows=3467 loops=1)
 Sort Key: c.sourceid, c.targetid
 - Nested Loop (cost=271.52..13361.73 rows=3467 width=48) (actual time=203.000..360.000 rows=3467 loops=1)
 - Merge Join
(cost=271.52..2762.88 rows=3467 width=40) (actual time=203.000..250.000
rows=3467 loops=1)
 Merge Cond: (outer.user_id = inner.sourceid)

- Index Scan using lte_user_pkey on lte_user s
(cost=0.00..12807.34 rows=279395 width=16) (actual 

Re: [PERFORM] Planner incorrectly choosing seq scan over index scan

2005-08-01 Thread Tobias Brox
[Meetesh Karia - Tue at 12:19:27AM +0200]
 We're using 8.0.3 and we're seeing a problem where the planner is choosing a 
 seq scan and hash join over an index scan. If I set enable_hashjoin to off, 
 then I get the plan I'm expecting and the query runs a lot faster. I've also 
 tried lowering the random page cost (even to 1) but the planner still 
 chooses to use the hash join.

Have you tried increasing the statistics collection?

-- 
Tobias Brox, +47-91700050
Nordicbet, IT dept

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Planner incorrectly choosing seq scan over index scan

2005-08-01 Thread John Arbash Meinel
Meetesh Karia wrote:
 Hi all,
 
 We're using 8.0.3 and we're seeing a problem where the planner is
 choosing a seq scan and hash join over an index scan.  If I set
 enable_hashjoin to off, then I get the plan I'm expecting and the query
 runs a lot faster.  I've also tried lowering the random page cost (even
 to 1) but the planner still chooses to use the hash join.
 
 Does anyone have any thoughts/suggestions?  I saw that there was a
 thread recently in which the planner wasn't correctly estimating the
 cost for queries using LIMIT.  Is it possible that something similar is
 happening here (perhaps because of the sort) and that the patch Tom
 proposed would fix it?
 
 Thanks.  Here are the various queries and plans:
 
 Normal settings

...

 QUERY PLAN
 Sort  (cost=13430.57..13439.24 rows=3467 width=48) (actual
 time=1390.000..1390.000 rows=3467 loops=1)
   Sort Key: c.sourceid, c.targetid
   -  Merge Join  (cost=9912.07..13226.72 rows=3467 width=48) (actual
 time=1344.000..1375.000 rows=3467 loops=1)
 Merge Cond: (outer.user_id = inner.sourceid)
 -  Index Scan using lte_user_pkey on lte_user s 
 (cost=0.00..16837.71 rows=279395 width=16) (actual time=0.000..95.000
 rows=50034 loops=1)

This is where the planner is messing up, and mis-estimating the
selectivity. It is expecting to get 280k rows, but only needs to get 50k.
I assume lte_user is the bigger table, and that candidates617004 has
some subset.

Has lte_user and candidates617004 been recently ANALYZEd? All estimates,
except for the expected number of rows from lte_user seem to be okay.

Is user_id the primary key for lte_user?
I'm trying to figure out how you can get 50k rows, by searching a
primary key, against a 3.5k rows. Is user_id only part of the primary
key for lte_user?

Can you give us the output of:
\d lte_user
\d candidates617004

So that we have the description of the tables, and what indexes you have
defined?

Also, if you could describe the table layouts, that would help.

John
=:-


 -  Sort  (cost=9912.07..9920.73 rows=3467 width=40) (actual
 time=1156.000..1156.000 rows=3467 loops=1)
   Sort Key: c.sourceid
   -  Hash Join  (cost=8710.44..9708.21 rows=3467 width=40)
 (actual time=1125.000..1156.000 rows=3467 loops=1)
 Hash Cond: (outer.targetid = inner.user_id)
 -  Seq Scan on candidates617004 c 
 (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..0.000
 rows=3467 loops=1)
 -  Hash  (cost=8011.95..8011.95 rows=279395
 width=16) (actual time=1125.000..1125.000 rows=0 loops=1)
   -  Seq Scan on lte_user t 
 (cost=0.00..8011.95 rows=279395 width=16) (actual time=0.000..670.000
 rows=279395 loops=1)
 Total runtime: 1406.000 ms
 
 enable_hashjoin disabled
 
 QUERY PLAN
 Sort  (cost=14355.37..14364.03 rows=3467 width=48) (actual
 time=391.000..391.000 rows=3467 loops=1)
   Sort Key: c.sourceid, c.targetid
   -  Nested Loop  (cost=271.52..14151.51 rows=3467 width=48) (actual
 time=203.000..359.000 rows=3467 loops=1)
 -  Merge Join  (cost=271.52..3490.83 rows=3467 width=40)
 (actual time=203.000..218.000 rows=3467 loops=1)
   Merge Cond: (outer.user_id = inner.sourceid)
   -  Index Scan using lte_user_pkey on lte_user s 
 (cost=0.00..16837.71 rows=279395 width=16) (actual time=0.000..126.000
 rows=50034 loops=1)
   -  Sort  (cost=271.52..280.19 rows=3467 width=32) (actual
 time=15.000..30.000 rows=3467 loops=1)
 Sort Key: c.sourceid
 -  Seq Scan on candidates617004 c 
 (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..0.000
 rows=3467 loops=1)
 -  Index Scan using lte_user_pkey on lte_user t 
 (cost=0.00..3.03 rows=1 width=16) (actual time=0.031..0.036 rows=1
 loops=3467)
   Index Cond: (outer.targetid = t.user_id)
 Total runtime: 406.000 ms
 
 random_page_cost set to 1.5
 --
 QUERY PLAN
 Sort  (cost=12702.62..12711.29 rows=3467 width=48) (actual
 time=1407.000..1407.000 rows=3467 loops=1)
   Sort Key: c.sourceid, c.targetid
   -  Merge Join  (cost=9912.07..12498.77 rows=3467 width=48) (actual
 time=1391.000..1407.000 rows=3467 loops=1)
 Merge Cond: (outer.user_id = inner.sourceid)
 -  Index Scan using lte_user_pkey on lte_user s 
 (cost=0.00..12807.34 rows=279395 width=16) (actual time=0.000..46.000
 rows=50034 loops=1)
 -  Sort  (cost=9912.07..9920.73 rows=3467 width=40) (actual
 time=1188.000..1188.000 rows=3467 loops=1)
   Sort Key: c.sourceid
   -  Hash Join  (cost=8710.44..9708.21 rows=3467 width=40)
 (actual time=1157.000..1188.000 rows=3467 loops=1)
 Hash Cond: (outer.targetid = inner.user_id)
 -  Seq Scan on candidates617004 c 
 (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..15.000
 rows=3467 loops=1)
  

Re: [PERFORM] Planner incorrectly choosing seq scan over index scan

2005-08-01 Thread Meetesh Karia
Are you referring to the statistics gathering target for ANALYZE?
Based on your email, I just tried the following and then re-ran the
explain analyze but got the same incorrect plan:

alter table candidates617004
 alter column sourceId set statistics 1000,
 alter column targetId set statistics 1000;
analyze candidates617004;

alter table lte_user
 alter column user_id set statistics 1000;
analyze lte_user;
Thanks for your suggestion,
Meetesh
On 8/2/05, Tobias Brox [EMAIL PROTECTED] wrote:
[Meetesh Karia - Tue at 12:19:27AM +0200] We're using 8.0.3 and we're seeing a problem where the planner is choosing a seq scan and hash join over an index scan. If I set enable_hashjoin to off, then I get the plan I'm expecting and the query runs a lot faster. I've also
 tried lowering the random page cost (even to 1) but the planner still chooses to use the hash join.Have you tried increasing the statistics collection?--Tobias Brox, +47-91700050Nordicbet, IT dept



Re: [PERFORM] Planner incorrectly choosing seq scan over index scan

2005-08-01 Thread Meetesh Karia
Thanks John. I've answered your questions below:
Has lte_user and candidates617004 been recently ANALYZEd? All estimates,except for the expected number of rows from lte_user seem to be okay.
I ANALYZEd both tables just before putting together my first
email. And, unfortunately, modifying the statistics target didn't
help either.
Is user_id the primary key for lte_user?
Yes 
I'm trying to figure out how you can get 50k rows, by searching aprimary key, against a 
3.5k rows. Is user_id only part of the primarykey for lte_user?
Hmmm ... I missed that before. But, that surprises me too.
Especially since sourceId in the candidates table has only 1
value. Also, user_id is the complete primary key for lte_user.
Can you give us the output of:\d lte_user\d candidates617004

Sure, here they are:

lte=# \d lte_user
 Table public.lte_user
 Column
|
Type
| Modifiers
---+-+---
user_id |
bigint
| not null
firstname | character varying(255) |
lastname | character varying(255) |
address1 | character varying(255) |
address2 | character varying(255) |
city | character varying(255) |
state | character varying(255) |
zip | character varying(255) |
phone1 | character varying(255) |
phone2 | character varying(255) |
username | character varying(255) |
password | character varying(255) |
deleted |
boolean
| not null
ext_cust_id | character varying(255) |
aboutme | character varying(255) |
birthday | timestamp without time zone |
fm_id |
bigint
|
ar  | double
precision
|
Indexes:
 lte_user_pkey PRIMARY KEY, btree (user_id)
 idx_user_extid btree (ext_cust_id)
 idx_user_username btree (username)
Foreign-key constraints:
 fk_user_fm FOREIGN KEY (fm_id) REFERENCES fm(fm_id)

lte=# \d candidates617004
 Table public.candidates617004
 Column
|
Type | Modifiers
--+--+---
fmid | bigint |
sourceid | bigint |
sr | double precision |
targetid | bigint |
tr | double precision | 
Also, if you could describe the table layouts, that would help.

Sure. The lte_user table is just a collection of users.
user_id is assigned uniquely using a sequence. During some
processing, we create a candidates table (candidates617004 in our
case). This table is usually a temp table. sourceid is a
user_id (in this case it is always 617004) and targetid is also a
user_id (2860 distinct values out of 3467). The rest of the
information is either only used in the select clause or not used at all
during this processing.

Did I miss something in the table layout description that would be helpful?

Thanks for your help!
Meetesh

-Sort(cost=9912.07..9920.73 rows=3467
width=40) (actual time=1156.000..1156.000 rows=3467 loops=1) Sort Key: c.sourceid
-Hash Join(cost=8710.44..9708.21 rows=3467
width=40) (actual time=1125.000..1156.000 rows=3467 loops=1)
Hash Cond: (outer.targetid = inner.user_id)
-Seq Scan on candidates617004 c (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..0.000 rows=3467 loops=1)
-Hash(cost=8011.95..8011.95 rows=279395 width=16) (actual time=1125.000..1125.000 rows=0 loops=1)
-Seq Scan on lte_user t (cost=0.00..8011.95 rows=279395 width=16) (actual time=0.000..670.000 rows=279395 loops=1) Total runtime: 1406.000 ms enable_hashjoin disabled 
 QUERY PLAN Sort(cost=14355.37..14364.03 rows=3467 width=48) (actual time=391.000..391.000 rows=3467 loops=1) Sort Key: c.sourceid, c.targetid -Nested Loop(cost=271.52..14151.51
 rows=3467 width=48) (actual time=203.000..359.000 rows=3467 loops=1)
-Merge Join(cost=271.52..3490.83 rows=3467
width=40) (actual time=203.000..218.000 rows=3467 loops=1)
Merge Cond: (outer.user_id = inner.sourceid)
-Index Scan using lte_user_pkey on lte_user s (cost=0.00..16837.71 rows=279395 width=16) (actual time=0.000..126.000 rows=50034 loops=1)
-Sort(cost=271.52..280.19 rows=3467
width=32) (actual time=15.000..30.000 rows=3467 loops=1)
Sort Key: c.sourceid
-Seq Scan on candidates617004 c (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..0.000 rows=3467 loops=1) -Index Scan using lte_user_pkey on lte_user t (cost=0.00..3.03
 rows=1 width=16) (actual time=0.031..0.036 rows=1 loops=3467)
Index Cond: (outer.targetid = t.user_id) Total runtime: 406.000 ms random_page_cost set to 1.5 -- QUERY PLAN Sort(cost=
12702.62..12711.29 rows=3467 width=48) (actual time=1407.000..1407.000 rows=3467 loops=1) Sort Key: c.sourceid, c.targetid -Merge Join(cost=9912.07..12498.77 rows=3467 width=48) (actual
 time=1391.000..1407.000 rows=3467 loops=1) Merge Cond: (outer.user_id = inner.sourceid) -Index Scan using lte_user_pkey on lte_user s (cost=0.00..12807.34
 rows=279395 width=16) (actual time=0.000..46.000 rows=50034 loops=1)
-Sort(cost=9912.07..9920.73 rows=3467
width=40) (actual time=1188.000..1188.000 rows=3467 loops=1) Sort Key: c.sourceid
-Hash Join(cost=8710.44..9708.21 rows=3467
width=40) (actual time=1157.000..1188.000 rows=3467 loops=1)
Hash Cond: 

Re: [PERFORM] Planner incorrectly choosing seq scan over index scan

2005-08-01 Thread Tom Lane
Meetesh Karia [EMAIL PROTECTED] writes:
 Sure. The lte_user table is just a collection of users. user_id is assigned=
 uniquely using a sequence. During some processing, we create a candidates=
 table (candidates617004 in our case). This table is usually a temp table.=
 sourceid is a user_id (in this case it is always 617004) and targetid is=20
 also a user_id (2860 distinct values out of 3467). The rest of the=20
 information is either only used in the select clause or not used at all=20
 during this processing.

If you know that sourceid has only a single value, it'd probably be
helpful to call out that value in the query, ie,
where ... AND c.sourceId = 617004 ...

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings