Re: [PERFORM] Optimizer: limit not taken into account

2006-05-17 Thread Bruno Wolff III
Please don't reply to previous messages to start new threads. This makes it
harder to find stuff in the archives and may keep people from noticing your
message.

On Wed, May 17, 2006 at 08:54:52 -0700,
  Craig A. James [EMAIL PROTECTED] wrote:
 Here's a corner case that might interest someone.  It tripped up one of 
 our programmers.
 
 We have a table with > 10 million rows.  The ID column is indexed, the 
 table has been vacuum/analyzed.  Compare these two queries:
 
   select * from tbl where id >= 1000 limit 1;
   select * from tbl where id >= 1000 order by id limit 1;
 
 The first takes 4 seconds, and uses a full table scan.  The second takes 32 
 msec and uses the index.  Details are below.

I suspect it wasn't intended to be a full table scan, but rather a sequential
scan until it found a matching row. If the data in the table is ordered by
id, this strategy may not work out well, whereas if the data is randomly
ordered, it would be expected to find a match quickly.

Have you analyzed the table recently? If the planner has bad stats on the
table, that is going to make it more likely to choose a bad plan.
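
For instance (just a sketch, using the table name from the plan quoted
below), you can re-gather stats and see what the planner currently believes
about the table's size:

  ANALYZE url_queue;
  -- reltuples/relpages are the planner's idea of the row count and table size
  SELECT reltuples, relpages FROM pg_class WHERE relname = 'url_queue';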


 I understand why the planner makes the choices it does -- the id > 
 1000 isn't very selective and under normal circumstances a full table 
 scan is probably the right choice.  But the limit 1 apparently doesn't 
 alter the planner's strategy at all.  We were surprised by this.
 
 Adding the order by was a simple solution.
 
 Craig
 
 
 
 pg=> explain analyze select url, url_digest from url_queue where priority 
 >= 1000 limit 1;
   QUERY PLAN
 --
 Limit  (cost=0.00..0.65 rows=1 width=108) (actual time=4036.113..4036.117 
 rows=1 loops=1)
   ->  Seq Scan on url_queue  (cost=0.00..391254.35 rows=606176 width=108) 
   (actual time=4036.101..4036.101 rows=1 loops=1)
  Filter: (priority >= 1000)
 Total runtime: 4036.200 ms
 (4 rows)
 
 pg=> explain analyze select url, url_digest from url_queue where priority 
 >= 1000 order by priority limit 1;
   QUERY PLAN
 --
 Limit  (cost=0.00..2.38 rows=1 width=112) (actual time=32.445..32.448 
 rows=1 loops=1)
   ->  Index Scan using url_queue_priority on url_queue  
   (cost=0.00..1440200.41 rows=606176 width=112) (actual time=32.434..32.434 
   rows=1 loops=1)
  Index Cond: (priority >= 1000)
 Total runtime: 32.566 ms
 


Re: [PERFORM] Optimizer: limit not taken into account

2006-05-17 Thread Simon Riggs
On Wed, 2006-05-17 at 08:54 -0700, Craig A. James wrote:
 Here's a corner case that might interest someone.  It tripped up one of our 
 programmers.
 
 We have a table with > 10 million rows.  The ID column is indexed, the table 
 has been vacuum/analyzed.  Compare these two queries:
 
select * from tbl where id >= 1000 limit 1;
select * from tbl where id >= 1000 order by id limit 1;
 
 The first takes 4 seconds, and uses a full table scan.  The second takes 32 
 msec and uses the index.  
 Details are below.

The rows are not randomly distributed, so the SeqScan takes longer to
find 1 row than the index scan.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com




Re: [PERFORM] Optimizer: limit not taken into account

2006-05-17 Thread Tom Lane
Bruno Wolff III [EMAIL PROTECTED] writes:
 I suspect it wasn't intended to be a full table scan, but rather a sequential
 scan until it found a matching row. If the data in the table is ordered by
 id, this strategy may not work out well, whereas if the data is randomly
 ordered, it would be expected to find a match quickly.

Right.  You can see from the differential in the estimates for the
SeqScan and the Limit nodes that the planner is not expecting the
seqscan to run to completion, but rather to find a matching row quite
quickly.
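
(As a rough illustration with the numbers from the plan quoted upthread: the
seqscan is costed at 391254.35 to return an estimated 606176 rows, and
391254.35 / 606176 is about 0.65 -- which is just the Limit node's cost,
i.e. the planner expects to satisfy the LIMIT after scanning only a tiny
fraction of the table.)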

There is not anything in there that considers whether the table's
physical order is so nonrandom that the search will take much longer
than it would given uniform distribution.  It might be possible to do
something with the correlation statistic in simple cases ...
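
For what it's worth, that statistic is already collected; a quick way to see
how nonrandom the physical order is (a sketch, using the names from the plan
upthread):

    SELECT tablename, attname, correlation
      FROM pg_stats
     WHERE tablename = 'url_queue' AND attname = 'priority';

A correlation near +1 or -1 means the heap order tracks the column closely,
which is exactly the situation that defeats the find-a-match-quickly
assumption.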

regards, tom lane



Re: [PERFORM] Optimizer: limit not taken into account

2006-05-17 Thread Craig A. James

Tom Lane wrote:

There is not anything in there that considers whether the table's
physical order is so nonrandom that the search will take much longer
than it would given uniform distribution.  It might be possible to do
something with the correlation statistic in simple cases ...


In this case, the rows are not random at all; in fact, they're inserted from a sequence, then deleted as they are processed.  If the planner is hoping for random physical distribution, this particular case is exactly wrong. 
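
For anyone who wants to reproduce the effect, something along these lines
(hypothetical table name and sizes, not our actual schema) builds the same
kind of layout -- live rows only at the physical end of the table:

    CREATE TABLE queue_demo (id serial PRIMARY KEY, payload text);
    INSERT INTO queue_demo (payload)
        SELECT 'row ' || n FROM generate_series(1, 1000000) AS g(n);
    -- "processed" rows are removed from the front
    DELETE FROM queue_demo WHERE id <= 900000;
    VACUUM ANALYZE queue_demo;
    -- the seqscan has to wade through the emptied pages before finding a match
    EXPLAIN ANALYZE SELECT * FROM queue_demo WHERE id >= 1000 LIMIT 1;
    -- the index scan goes straight to the first qualifying row
    EXPLAIN ANALYZE SELECT * FROM queue_demo WHERE id >= 1000 ORDER BY id LIMIT 1;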


Craig



Re: [PERFORM] Optimizer: limit not taken into account

2006-05-17 Thread Jim C. Nasby
On Wed, May 17, 2006 at 08:54:52AM -0700, Craig A. James wrote:
 Here's a corner case that might interest someone.  It tripped up one of 
 our programmers.
 
 We have a table with > 10 million rows.  The ID column is indexed, the 
 table has been vacuum/analyzed.  Compare these two queries:
 
   select * from tbl where id >= 1000 limit 1;
   select * from tbl where id >= 1000 order by id limit 1;
 
 The first takes 4 seconds, and uses a full table scan.  The second takes 32 
 msec and uses the index.  Details are below.
 
 I understand why the planner makes the choices it does -- the id > 
 1000 isn't very selective and under normal circumstances a full table 
 scan is probably the right choice.  But the limit 1 apparently doesn't 
 alter the planner's strategy at all.  We were surprised by this.

Is it really not very selective? If there's 1000 rows in the table,
and id starts at 1 with very few gaps, then >= 1000 should actually
be very selective...
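
A quick sanity check (sketch; it compares the planner's guess against
reality for the predicate in question):

    EXPLAIN SELECT * FROM url_queue WHERE priority >= 1000;   -- estimated rows
    SELECT count(*) FROM url_queue WHERE priority >= 1000;    -- actual rows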

Also, I hope you understand there's a big difference between a limit
query that does and doesn't have an order by.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461
