Re: [PERFORM] Performance and IN clauses

2008-11-20 Thread Tomas Vondra

Mark Roberts wrote:

On Tue, 2008-11-18 at 17:38 +0100, [EMAIL PROTECTED] wrote:

I bet there is no 'critical' length - this is just another case of index
scan vs. seqscan. The efficiency depends on the size of the table / row,
amount of data in the table, variability of the column used in the IN
clause, etc.

Splitting a query with 1000 items into 10 separate queries, the smaller
queries may be faster, but the total time consumed may actually be higher.
Something like

10 * (time of small query) + (time to combine them) > (time of large query)

If the performance of the 'split' solution is actually better than the
original query, it just means that the planner does not use an index scan
when it should. That means that either

(a) the planner is not smart enough
(b) it does not have current statistics of the table (run ANALYZE on the table)
(c) the statistics are not detailed enough (ALTER TABLE ... SET STATISTICS)
(d) the cost variables are not set properly (they do not match the hardware -
decrease the index scan cost / increase the seq scan cost)

regards
Tomas


I know that it's much faster (for us) to run many smaller queries than
one large query, and I think that it's primarily because of your reason
(a).  Most of our problems come from Pg misunderstanding the results of a
join and making a bad plan decision.  Batching dramatically reduces the
likelihood of this.


As I already said - even the smartest planner won't work without correct 
input data. Have you tried fixing the points (b), (c) and (d)?


Fixing them might improve the planner's performance so that you don't
need the batching at all.
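
A minimal sketch of those fixes, assuming the edge table and node1 column
from the original post (the statistics target and the cost value are only
illustrative - tune them to the actual data and hardware):

-- (b) refresh the planner's statistics for the table
ANALYZE edge;

-- (c) collect a finer histogram on the column used in the IN clause,
--     then re-analyze so the new target takes effect
ALTER TABLE edge ALTER COLUMN node1 SET STATISTICS 100;
ANALYZE edge;

-- (d) make index scans look cheaper relative to sequential scans;
--     2.0 is only an example, the right value depends on the hardware
SET random_page_cost = 2.0;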


regards
Tomas



[PERFORM] Performance and IN clauses

2008-11-18 Thread Kynn Jones
Hi.  I have a Perl script whose main loop generates thousands of SQL updates
of the form
UPDATE edge SET keep = true WHERE node1 IN ( $node_list ) AND node2 =
$node_id;

...where $node_list stands for a comma-separated list of integers, and
$node_id stands for some integer.

The list represented by $node_list can be fairly long (on average it has
around 900 entries, and can be as long as 30K entries), and I'm concerned
about the performance cost of testing for inclusion in such a long list.  Is
this done by a sequential search?  If so, is there a better way to write
this query?  (FWIW, I have two indexes on the edge table using btree( node1
) and btree( node2 ), respectively.)
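
A quick way to check is EXPLAIN, which shows whether Postgres picks an
index scan or a sequential scan for a given list length. A minimal sketch,
with placeholder ids and a placeholder node2 value; BEGIN/ROLLBACK is
included because EXPLAIN ANALYZE really executes the update:

BEGIN;
EXPLAIN ANALYZE
UPDATE edge SET keep = true
WHERE node1 IN (101, 102, 103 /* ... roughly 900 ids ... */)
  AND node2 = 42;
ROLLBACK;  -- undo the update that EXPLAIN ANALYZE performed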

Also, assuming that the optimal way to write the query depends on the length
of $node_list, how can I estimate the critical length at which I should
switch from one form of the query to the other?

TIA!

Kynn


Re: [PERFORM] Performance and IN clauses

2008-11-18 Thread Matthew Wakeling

On Tue, 18 Nov 2008, Kynn Jones wrote:

Also, assuming that the optimal way to write the query depends on the
length of $node_list, how can I estimate the critical length at which I
should switch from one form of the query to the other?


In the past, I have found the fastest way to do this was to operate on
groups of a bit less than a thousand values, and issue one query per
group. Of course, Postgres may have improved since then, so I'll let more
knowledgeable people cover that for me.
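
A sketch of that batching at the SQL level (the ids and the node2 value
are placeholders; splitting the full list into groups would be done by
the calling script):

-- one statement per group of a bit less than 1000 ids
UPDATE edge SET keep = true
WHERE node1 IN (1, 2, 3 /* ... up to ~1000 ids ... */) AND node2 = 42;

UPDATE edge SET keep = true
WHERE node1 IN (1001, 1002 /* ... the next ~1000 ids ... */) AND node2 = 42;

-- ... and so on, until the whole list has been processed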


Matthew

--
Heat is work, and work's a curse. All the heat in the universe, it's
going to cool down, because it can't increase, then there'll be no
more work, and there'll be perfect peace.  -- Michael Flanders



Re: [PERFORM] Performance and IN clauses

2008-11-18 Thread tv
I bet there is no 'critical' length - this is just another case of index
scan vs. seqscan. The efficiency depends on the size of the table / row,
amount of data in the table, variability of the column used in the IN
clause, etc.

Splitting a query with 1000 items into 10 separate queries, the smaller
queries may be faster, but the total time consumed may actually be higher.
Something like

10 * (time of small query) + (time to combine them) > (time of large query)

If the performance of the 'split' solution is actually better than the
original query, it just means that the planner does not use an index scan
when it should. That means that either

(a) the planner is not smart enough
(b) it does not have current statistics of the table (run ANALYZE on the table)
(c) the statistics are not detailed enough (ALTER TABLE ... SET STATISTICS)
(d) the cost variables are not set properly (they do not match the hardware -
decrease the index scan cost / increase the seq scan cost)

regards
Tomas

 On Tue, 18 Nov 2008, Kynn Jones wrote:
 Also, assuming that the optimal way to write the query depends on the
 length of $node_list, how can I estimate the
 critical length at which I should switch from one form of the query to
 the other?

 In the past, I have found the fastest way to do this was to operate on
 groups of a bit less than a thousand values, and issue one query per
 group. Of course, Postgres may have improved since then, so I'll let more
 knowledgeable people cover that for me.

 Matthew

 --
  Heat is work, and work's a curse. All the heat in the universe, it's
  going to cool down, because it can't increase, then there'll be no
  more work, and there'll be perfect peace.  -- Michael Flanders







Re: [PERFORM] Performance and IN clauses

2008-11-18 Thread Mark Roberts

On Tue, 2008-11-18 at 17:38 +0100, [EMAIL PROTECTED] wrote:
 I bet there is no 'critical' length - this is just another case of index
 scan vs. seqscan. The efficiency depends on the size of the table / row,
 amount of data in the table, variability of the column used in the IN
 clause, etc.

 Splitting a query with 1000 items into 10 separate queries, the smaller
 queries may be faster, but the total time consumed may actually be higher.
 Something like

 10 * (time of small query) + (time to combine them) > (time of large query)
 
 If the performance of the 'split' solution is actually better than the
 original query, it just means that the planner does not use an index scan
 when it should. That means that either

 (a) the planner is not smart enough
 (b) it does not have current statistics of the table (run ANALYZE on the table)
 (c) the statistics are not detailed enough (ALTER TABLE ... SET STATISTICS)
 (d) the cost variables are not set properly (they do not match the hardware -
 decrease the index scan cost / increase the seq scan cost)
 
 regards
 Tomas

I know that it's much faster (for us) to run many smaller queries than
one large query, and I think that it's primarily because of your reason
(a).  Most of our problems come from Pg misunderstanding the results of a
join and making a bad plan decision.  Batching dramatically reduces the
likelihood of this.
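
A hedged illustration of that misestimation: in EXPLAIN ANALYZE output,
compare the planner's estimated row count with the actual one on the join
node (the table names here are made up for the example):

EXPLAIN ANALYZE
SELECT *
FROM   fact_table f            -- hypothetical table names
JOIN   dim_table d ON d.id = f.dim_id;

-- in each plan node, compare the estimated rows=... in the first
-- parenthesis with the actual rows=... in the second; a large gap on
-- the join is what leads the planner to a bad downstream plan choice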

-Mark

