[PERFORM] Planner wrongly shuns multi-column index for select .. order by col1, col2 limit 1

2011-03-13 Thread John Surcombe
Hello,

 

We are using PostgreSQL 9.0.3, compiled by Visual C++ build 1500,
32-bit, installed on Windows 2003 R2 32-bit.

 

We have an 'aisposition' table used for a GPS tracking application,
containing ~30 million rows and a number of indexes.  Two of these are:

 

idx_receiveddatetime: indexes aisposition(receiveddatetime timestamp)

 

idx_userid_receiveddatetime: indexes aisposition(userid int4 desc,
receiveddatetime timestamp desc)
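
(In DDL terms these correspond roughly to the following; the exact original
statements aren't shown here, so treat this as a reconstruction:)

CREATE INDEX idx_receiveddatetime ON aisposition (receiveddatetime);
CREATE INDEX idx_userid_receiveddatetime ON aisposition (userid DESC, receiveddatetime DESC);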

 

The problem we get is that the following query is taking many minutes to
run:

 

select * from aisposition where userid = 311369000 order by userid desc,
receiveddatetime desc limit 1

 

When we 'EXPLAIN' this query, PostgreSQL says it is using the index
idx_receiveddatetime.  The way the application is designed means that in
virtually all cases the query will have to scan a very long way into
idx_receiveddatetime to find the first record where userid = 311369000.
If however we delete the idx_receiveddatetime index, the query uses the
idx_userid_receiveddatetime index, and the query only takes a few
milliseconds.

 

The EXPLAIN ANALYZE output with idx_receiveddatetime in place is:

 

Limit  (cost=0.00..1.30 rows=1 width=398) (actual time=1128097.540..1128097.541 rows=1 loops=1)
  ->  Index Scan Backward using idx_receiveddatetime on aisposition  (cost=0.00..2433441.05 rows=1875926 width=398) (actual time=1128097.532..1128097.532 rows=1 loops=1)
        Filter: (userid = 311369000)
Total runtime: 1128097.609 ms

 

And with that index deleted:

 

Limit  (cost=0.00..4.01 rows=1 width=398) (actual time=60.633..60.634 rows=1 loops=1)
  ->  Index Scan using idx_userid_receiveddatetime on aisposition  (cost=0.00..7517963.47 rows=1875926 width=398) (actual time=60.629..60.629 rows=1 loops=1)
        Index Cond: (userid = 311369000)
Total runtime: 60.736 ms

 

We would obviously prefer PostgreSQL to use the
idx_userid_receiveddatetime index in all cases, because we know that
will return results in a timely manner, whereas using
idx_receiveddatetime will usually require scanning much of the table,
which makes our application unusable.  What are we doing wrong?

 

Cheers now,

John



Re: [PERFORM] Planner wrongly shuns multi-column index for select .. order by col1, col2 limit 1

2011-03-13 Thread Jeremy
The row estimate is way off.  Is autovacuum disabled?
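
A quick way to check that and refresh the statistics would be something along
these lines (a sketch; adjust the table name to yours):

SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'aisposition';

ANALYZE aisposition;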


Re: [PERFORM] Planner wrongly shuns multi-column index for select .. order by col1, col2 limit 1

2011-03-13 Thread Tom Lane
John Surcombe <john.surco...@digimap.gg> writes:
> When we 'EXPLAIN' this query, PostgreSQL says it is using the index
> idx_receiveddatetime.  The way the application is designed means that in
> virtually all cases the query will have to scan a very long way into
> idx_receiveddatetime to find the first record where userid = 311369000.
> If however we delete the idx_receiveddatetime index, the query uses the
> idx_userid_receiveddatetime index, and the query only takes a few
> milliseconds.

That's just bizarre ... it knows the index is applicable, and the cost
estimates clearly favor the better index, so why did it pick the worse
one?

I tried to duplicate this locally, without success, so there's some
contributing factor you've neglected to mention.  Can you put together a
self-contained test case that acts like this?
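
For what it's worth, a skeleton for such a test case might look like the
following; the row count, the data distribution, and the userid probed at the
end are placeholders that would need tuning before it reproduces anything:

CREATE TABLE aisposition (userid int4, receiveddatetime timestamp, payload text);

-- synthetic data: many userids, timestamps interleaved across them
INSERT INTO aisposition
SELECT (g % 1000) + 1, now() - g * interval '1 second', repeat('x', 300)
FROM generate_series(1, 1000000) AS g;

CREATE INDEX idx_receiveddatetime ON aisposition (receiveddatetime);
CREATE INDEX idx_userid_receiveddatetime ON aisposition (userid DESC, receiveddatetime DESC);
ANALYZE aisposition;

EXPLAIN ANALYZE
SELECT * FROM aisposition WHERE userid = 500
ORDER BY userid DESC, receiveddatetime DESC LIMIT 1;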

regards, tom lane



Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-13 Thread runner
> Don't insert data into an indexed table. A very important point with
> bulk-loading is that you should load all the data first, then create
> the indexes. Running multiple (different) CREATE INDEX queries in
> parallel can additionally save a lot of time. Also don't move data
> back and forth between the tables, just drop the original when you're
> done.
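
(Concretely, that load-then-index pattern looks roughly like this; the table,
column, and file names below are made up:)

COPY bigtable (col1, col2) FROM '/path/to/dump.csv' WITH (FORMAT csv);

-- build the indexes only after the data is in; the separate CREATE INDEX
-- statements can be run concurrently from different sessions
CREATE INDEX bigtable_col1_idx ON bigtable (col1);
CREATE INDEX bigtable_col2_idx ON bigtable (col2);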

I just saw your post and it looks similar to what I'm doing.
We're going to be loading 12G of data from a MySQL dump into our 
pg 9.0.3 database next weekend.  I've been testing this for the last
two weeks.  I tried removing the indexes and other constraints just for
the import, but for a noob like me that was too much to take on.  Maybe
when I get more experience.  So I *WILL* be importing all of my data
into indexed tables.  I timed it and it will take eight hours.

I'm sure I could get it down to two or three hours for the import
if I really knew more about postgres but that's the price you pay when
you slam dunk a project and your staff isn't familiar with the 
database back-end.  Other than being very inefficient and consuming
more time than necessary, is there any other downside to importing
into an indexed table?  In the four test imports I've done,
everything seems to work fine; it just takes a long time.

Sorry for hijacking your thread here!

 


Re: [PERFORM] Planner wrongly shuns multi-column index for select .. order by col1, col2 limit 1

2011-03-13 Thread Tom Lane
I wrote:
> John Surcombe <john.surco...@digimap.gg> writes:
>> When we 'EXPLAIN' this query, PostgreSQL says it is using the index
>> idx_receiveddatetime.  The way the application is designed means that in
>> virtually all cases the query will have to scan a very long way into
>> idx_receiveddatetime to find the first record where userid = 311369000.
>> If however we delete the idx_receiveddatetime index, the query uses the
>> idx_userid_receiveddatetime index, and the query only takes a few
>> milliseconds.

> That's just bizarre ... it knows the index is applicable, and the cost
> estimates clearly favor the better index, so why did it pick the worse
> one?

No, scratch that, I misread the plans.  It *is* picking the plan it
thinks has lower cost; it's just a mistaken cost estimate.  It's strange
though that the less selective indexscan is getting a lower cost
estimate.  I wonder whether your table is (almost) perfectly ordered by
receiveddatetime, such that the one-column index has correlation close
to 1.0.  That could possibly lower the cost estimate to the point where
it'd appear to dominate the other index.  It'd be useful to see the
pg_stats.correlation value for both the userid and receiveddatetime
columns.
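
Something along these lines should pull those numbers out (assuming the table
has been analyzed at all):

SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'aisposition'
  AND attname IN ('userid', 'receiveddatetime');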

regards, tom lane
