[PERFORM] Adding LIMIT 1 kills performance.

2008-05-29 Thread Chris Shoemaker
[Attn list-queue maintainers: Please drop the earlier version
of this email that I accidentally sent from an unsubscribed address. ]

Hi, 

I'm having a strange problem with a slow-running select query.  The
query I use in production ends in LIMIT 1, and it runs very slowly.
But when I remove the LIMIT 1, the query runs quite quickly.  This
behavior has stumped a couple smart DBAs.

The full queries and EXPLAIN ANALYZE plans are included below, but by
way of explanation/observation:

1) The LIMIT 1 case will sometimes be quicker (but still much slower
than the non-LIMIT 1 case) for different values of
calendar_group_id.

2) The query below is a slightly simplified version of the one I
actually use.  The real one includes more conditions which explain why
each table is joined.  For reference, the original query is quoted at
the end [1].  The original query exhibits the same behavior as the
simplified versions w.r.t. the LIMIT 1 case taking _much_ longer
(even longer than the simplified version) than the non-LIMIT 1 case,
and uses the same plans.


Can anyone explain why such a slow plan is chosen when the LIMIT 1
is present?  Is there anything I can do to speed this query up?
Thanks.

-chris


production= select version();
  version   

--
 PostgreSQL 8.2.6 on x86_64-pc-linux-gnu, compiled by GCC 
x86_64-pc-linux-gnu-gcc (GCC) 4.1.2 (Gentoo 4.1.2 p1.0.2)
(1 row)

production= analyze calendar_groups;
ANALYZE
production= analyze calendar_links;
ANALYZE
production= analyze calendars;
ANALYZE
production= analyze event_updates;
ANALYZE
production= EXPLAIN ANALYZE SELECT event_updates.*
 FROM event_updates
 INNER JOIN calendars ON event_updates.feed_id = calendars.id
 INNER JOIN calendar_links ON calendars.id = 
calendar_links.source_tracker_id
 WHERE (calendar_links.calendar_group_id = 3640)
 ORDER BY event_updates.id DESC
 LIMIT 1;

   QUERY PLAN   

---
 Limit  (cost=16.55..91.73 rows=1 width=2752) (actual time=27810.058..27810.059 
rows=1 loops=1)
   -  Nested Loop  (cost=16.55..695694.18 rows=9254 width=2752) (actual 
time=27810.054..27810.054 rows=1 loops=1)
 Join Filter: (event_updates.feed_id = calendars.id)
 -  Index Scan Backward using event_updates_pkey on event_updates  
(cost=0.00..494429.30 rows=8944370 width=2752) (actual time=0.030..7452.142 
rows=5135706 loops=1)
 -  Materialize  (cost=16.55..16.56 rows=1 width=8) (actual 
time=0.001..0.002 rows=1 loops=5135706)
   -  Nested Loop  (cost=0.00..16.55 rows=1 width=8) (actual 
time=0.029..0.034 rows=1 loops=1)
 -  Index Scan using 
index_calendar_links_on_calendar_group_id_and_source_tracker_id on 
calendar_links  (cost=0.00..8.27 rows=1 width=4) (actual time=0.012..0.013 
rows=1 loops=1)
   Index Cond: (calendar_group_id = 3640)
 -  Index Scan using harvest_trackers_pkey on calendars  
(cost=0.00..8.27 rows=1 width=4) (actual time=0.012..0.013 rows=1 loops=1)
   Index Cond: (calendars.id = 
calendar_links.source_tracker_id)
 Total runtime: 27810.161 ms
(11 rows)

production= EXPLAIN ANALYZE SELECT event_updates.* FROM event_updates
 INNER JOIN calendars ON event_updates.feed_id = calendars.id
 INNER JOIN calendar_links ON calendars.id = 
calendar_links.source_tracker_id
 WHERE (calendar_links.calendar_group_id = 3640)
 ORDER BY event_updates.id DESC;

QUERY PLAN  
  
-
 Sort  (cost=43376.36..43399.50 rows=9256 width=2752) (actual 
time=10.178..10.205 rows=36 loops=1)
   Sort Key: event_updates.id
   -  Nested Loop  (cost=249.86..31755.56 rows=9256 width=2752) (actual 
time=9.957..10.098 rows=36 loops=1)
 -  Nested Loop  (cost=0.00..16.55 rows=1 width=8) (actual 
time=9.868..9.873 rows=1 loops=1)
   -  Index Scan using 
index_calendar_links_on_calendar_group_id_and_source_tracker_id on 
calendar_links  (cost=0.00..8.27 rows=1 width=4) (actual time=9.824..9.825 
rows=1 loops=1)
 Index Cond: (calendar_group_id = 3640)
   -  Index Scan using harvest_trackers_pkey on calendars  
(cost=0.00..8.27 rows=1 

Re: [PERFORM] Adding LIMIT 1 kills performance.

2008-05-29 Thread Shane Ambler

Chris Shoemaker wrote:

[Attn list-queue maintainers: Please drop the earlier version
of this email that I accidentally sent from an unsubscribed address. ]

Hi, 


I'm having a strange problem with a slow-running select query.  The
query I use in production ends in LIMIT 1, and it runs very slowly.
But when I remove the LIMIT 1, the query runs quite quickly.  This
behavior has stumped a couple smart DBAs.




Can anyone explain why such a slow plan is chosen when the LIMIT 1
is present?  Is there anything I can do to speed this query up?
Thanks.



From what I know using an ORDER BY and a LIMIT can often prevent 
*shortening* the query as it still needs to find all rows to perform the 
order by before it limits.

The difference in plans eludes me.


production= EXPLAIN ANALYZE SELECT event_updates.*
 FROM event_updates
 INNER JOIN calendars ON event_updates.feed_id = calendars.id
 INNER JOIN calendar_links ON calendars.id = 
calendar_links.source_tracker_id
 WHERE (calendar_links.calendar_group_id = 3640)
 ORDER BY event_updates.id DESC
 LIMIT 1;


Does removing the DESC from the order by give the same variation in 
plans? Or is this only when using ORDER BY ... DESC LIMIT 1?



One thing that interests me is try -

EXPLAIN ANALYZE SELECT * FROM (

SELECT event_updates.*
FROM event_updates
INNER JOIN calendars ON event_updates.feed_id = calendars.id
INNER JOIN calendar_links ON calendars.id = calendar_links.source_tracker_id
WHERE (calendar_links.calendar_group_id = 3640)
ORDER BY event_updates.id DESC
) AS foo

LIMIT 1;



--

Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Adding LIMIT 1 kills performance.

2008-05-29 Thread Chris Shoemaker
On Fri, May 30, 2008 at 02:23:46AM +0930, Shane Ambler wrote:
 Chris Shoemaker wrote:
 [Attn list-queue maintainers: Please drop the earlier version
 of this email that I accidentally sent from an unsubscribed address. ]

 Hi, 
 I'm having a strange problem with a slow-running select query.  The
 query I use in production ends in LIMIT 1, and it runs very slowly.
 But when I remove the LIMIT 1, the query runs quite quickly.  This
 behavior has stumped a couple smart DBAs.


 Can anyone explain why such a slow plan is chosen when the LIMIT 1
 is present?  Is there anything I can do to speed this query up?
 Thanks.


 From what I know using an ORDER BY and a LIMIT can often prevent 
 *shortening* the query as it still needs to find all rows to perform the 
 order by before it limits.

That makes complete sense, of course.

 The difference in plans eludes me.

 production= EXPLAIN ANALYZE SELECT event_updates.*
  FROM event_updates
  INNER JOIN calendars ON event_updates.feed_id = calendars.id
  INNER JOIN calendar_links ON calendars.id = 
 calendar_links.source_tracker_id
  WHERE (calendar_links.calendar_group_id = 3640)
  ORDER BY event_updates.id DESC
  LIMIT 1;

 Does removing the DESC from the order by give the same variation in plans? 
 Or is this only when using ORDER BY ... DESC LIMIT 1?

Except for using Index Scan instead of Index Scan Backward, the plan
is the same with ORDER BY ... or ORDER BY ... ASC as with ORDER BY
... DESC.  In case you're wondering what would happen without the
ORDER BY at all:

production= EXPLAIN SELECT event_updates.*
FROM event_updates
INNER JOIN calendars ON event_updates.feed_id = calendars.id
INNER JOIN calendar_links ON calendars.id = calendar_links.source_tracker_id
WHERE (calendar_links.calendar_group_id = 3640)
LIMIT 1; 
   QUERY 
PLAN   

 Limit  (cost=0.00..3.95 rows=1 width=2752)
   -  Nested Loop  (cost=0.00..36992.38 rows=9362 width=2752)
 -  Nested Loop  (cost=0.00..16.55 rows=1 width=8)
   -  Index Scan using 
index_calendar_links_on_calendar_group_id_and_source_tracker_id on 
calendar_links  (cost=0.00..8.27 rows=1 width=4)
 Index Cond: (calendar_group_id = 3640)
   -  Index Scan using harvest_trackers_pkey on calendars  
(cost=0.00..8.27 rows=1 width=4)
 Index Cond: (calendars.id = 
calendar_links.source_tracker_id)
 -  Index Scan using index_event_updates_on_feed_id_and_feed_type on 
event_updates  (cost=0.00..36858.50 rows=9386 width=2752)
   Index Cond: (event_updates.feed_id = calendars.id)
(9 rows)




 One thing that interests me is try -

 EXPLAIN ANALYZE SELECT * FROM (

 SELECT event_updates.*
 FROM event_updates
 INNER JOIN calendars ON event_updates.feed_id = calendars.id
 INNER JOIN calendar_links ON calendars.id = calendar_links.source_tracker_id
 WHERE (calendar_links.calendar_group_id = 3640)
 ORDER BY event_updates.id DESC
 ) AS foo

 LIMIT 1;

That's an interesting experiment.  Here are the results:

   QUERY PLAN   


 Limit  (cost=16.55..91.74 rows=1 width=6027) (actual 
time=490709.355..490709.357 rows=1 loops=1)
   -  Nested Loop  (cost=16.55..703794.95 rows=9361 width=2752) (actual 
time=490709.352..490709.352 rows=1 loops=1)
 Join Filter: (event_updates.feed_id = calendars.id)
 -  Index Scan Backward using event_updates_pkey on event_updates  
(cost=0.00..500211.53 rows=9047416 width=2752) (actual time=0.222..469082.071 
rows=5251179 loops=1)
 -  Materialize  (cost=16.55..16.56 rows=1 width=8) (actual 
time=0.001..0.002 rows=1 loops=5251179)
   -  Nested Loop  (cost=0.00..16.55 rows=1 width=8) (actual 
time=0.240..0.246 rows=1 loops=1)
 -  Index Scan using 
index_calendar_links_on_calendar_group_id_and_source_tracker_id on 
calendar_links  (cost=0.00..8.27 rows=1 width=4) (actual time=0.108..0.109 
rows=1 loops=1)
   Index Cond: (calendar_group_id = 3640)
 -  Index Scan using harvest_trackers_pkey on calendars  
(cost=0.00..8.27 rows=1 width=4) (actual time=0.127..0.129 rows=1 loops=1)
   Index Cond: (calendars.id = 
calendar_links.source_tracker_id)
 Total runtime: 490709.576 ms
(11 rows)


That is, no real change in the performance.

Still stumped,

Re: [PERFORM] Adding LIMIT 1 kills performance.

2008-05-29 Thread Tom Lane
Chris Shoemaker [EMAIL PROTECTED] writes:
 Still stumped,

The short answer here is that the planner is guessing that scanning the
index in ID order will come across the desired row (ie, the first one
matching the join condition) in less time than it will take to select
all the joinable rows, sort them by ID, and take the first one.  It's
wrong in this case, but the plan is not unreasonable on its face.
The problem boils down to a misestimate of how many join rows there are.
You might get better results by increasing the statistics targets.

There are plenty of similar cases in the list archives.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance