Re: [PERFORM] SELECT ignoring index even though ORDER BY and LIMIT present

2010-06-03 Thread Matthew Wakeling

On Wed, 2 Jun 2010, Jori Jovanovich wrote:

(2) Making the query faster by making the string match LESS specific (odd,
seems like it should be MORE)


No, that's the way round it should be. The LIMIT changes it all. Consider 
if you have a huge table, and half of the entries match your WHERE clause. 
To fetch the ORDER BY ... LIMIT 20 using an index scan would involve 
accessing only on average 40 entries from the table referenced by the 
index. Therefore, the index is quick. However, consider a huge table that 
only has twenty matching entries. The index scan would need to touch every 
single row in the table to return the matching rows, so a sequential scan, 
filter, and sort would be much faster. Of course, if you had an index 
capable of answering the WHERE clause, that would be even better for that 
case.
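Matthew's back-of-envelope arithmetic can be sketched as a tiny model (my own
illustration, not from the thread; it assumes matches are spread uniformly
through the index order):

```python
def expected_index_entries_scanned(total_rows, matching_rows, limit):
    """Expected number of index entries a backward index scan must read
    before it has produced `limit` matching rows, assuming matches are
    distributed uniformly in index order."""
    selectivity = matching_rows / total_rows
    # On average one match appears every 1/selectivity entries, so about
    # limit/selectivity entries are read -- capped at the table size.
    return min(total_rows, limit / selectivity)

# Half of a 10M-row table matches: the scan stops after ~40 entries.
print(expected_index_entries_scanned(10_000_000, 5_000_000, 20))   # 40.0
# Only 20 rows match: the scan walks essentially the whole index.
print(expected_index_entries_scanned(10_000_000, 20, 20))          # 10000000
```

The crossover between these two extremes is exactly what the planner's row
estimate is trying to locate.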


Matthew

--
Don't criticise a man until you have walked a mile in his shoes; and if
you do at least he will be a mile behind you and bare footed.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SELECT ignoring index even though ORDER BY and LIMIT present

2010-06-03 Thread Jori Jovanovich
hi,

I'm sorry for not posting this first.

The server, which is used exclusively for this PostgreSQL instance, is the
following:

PostgreSQL 8.4.2 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.2.real (GCC)
4.2.4 (Ubuntu 4.2.4-1ubuntu4), 64-bit
Amazon EC2 Large Instance, 7.5GB memory, 64-bit

This is what is set in my postgresql.conf file:

max_connections = 100
ssl = true
shared_buffers = 24MB
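For what it's worth, shared_buffers = 24MB is near the conservative stock
default and is very small for a dedicated 7.5GB machine. Commonly circulated
8.4-era guidance looks roughly like the sketch below -- the values are
illustrative assumptions, not settings recommended anywhere in this thread:

```
shared_buffers = 1800MB          # often ~25% of RAM on a dedicated box
effective_cache_size = 5GB       # roughly what the OS page cache can hold
work_mem = 16MB                  # per sort/hash operation; raise with care
default_statistics_target = 100  # 8.4 default; raise per column if estimates drift
```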

ANALYZE VERBOSE EVENTS;
INFO:  analyzing public.events
INFO:  events: scanned 3 of 211312 pages, containing 1725088 live rows
and 0 dead rows; 3 rows in sample, 12151060 estimated total rows

Updating statistics did not affect the results -- the planner still chooses
full table scans. (I had also run ANALYZE before posting here, so this was
expected.)

thank you

On Wed, Jun 2, 2010 at 8:49 PM, Bob Lunney bob_lun...@yahoo.com wrote:

 Jori,

 What is the PostgreSQL
 version/shared_buffers/work_mem/effective_cache_size/default_statistics_target?
 Are the statistics for the table up to date?  (Run analyze verbose
 tablename to update them.)  Table and index structure would be nice to
 know, too.

 If all else fails you can set enable_seqscan = off for the session, but
 that is a Big Hammer for what is probably a smaller problem.

 Bob Lunney

 --- On *Wed, 6/2/10, Jori Jovanovich j...@dimensiology.com* wrote:


 From: Jori Jovanovich j...@dimensiology.com
 Subject: [PERFORM] SELECT ignoring index even though ORDER BY and LIMIT
 present
 To: pgsql-performance@postgresql.org
 Date: Wednesday, June 2, 2010, 4:28 PM


[CUT]

[PERFORM] SELECT ignoring index even though ORDER BY and LIMIT present

2010-06-02 Thread Jori Jovanovich
hi,

I have a problem space where the main goal is to search backward in time for
events.  Time can go back very far into the past, and so the
table can get quite large.  However, the vast majority of queries are all
satisfied by relatively recent data.  I have an index on the row creation
date and I would like almost all of my queries to have a query plan looking
something like:

 Limit ...
   ->  Index Scan Backward using server_timestamp_idx on events  (cost=0.00..623055.91 rows=8695 width=177)
         ...

However, PostgreSQL frequently chooses a full table scan instead.  Whether the
scan is chosen often depends on the size of the LIMIT and on how selective the
WHERE clause is.  In practice, the scan is always the wrong answer for my use
cases (where always is defined to be 99.9%).

Some examples:

(1) A sample query that devolves to a full table scan

  EXPLAIN
   SELECT events.id, events.client_duration, events.message,
          events.created_by, events.source, events.type, events.event,
          events.environment, events.server_timestamp, events.session_id,
          events.reference, events.client_uuid
     FROM events
    WHERE client_uuid ~* E'^foo bar so what'
    ORDER BY server_timestamp DESC
    LIMIT 20;

                           QUERY PLAN (BAD!)
--------------------------------------------------------------------------
 Limit  (cost=363278.56..363278.61 rows=20 width=177)
   ->  Sort  (cost=363278.56..363278.62 rows=24 width=177)
         Sort Key: server_timestamp
         ->  Seq Scan on events  (cost=0.00..363278.01 rows=24 width=177)
               Filter: (client_uuid ~* '^foo bar so what'::text)


(2) Making the query faster by making the string match LESS specific (odd,
seems like it should be MORE)

  EXPLAIN
   SELECT events.id, events.client_duration, events.message,
          events.created_by, events.source, events.type, events.event,
          events.environment, events.server_timestamp, events.session_id,
          events.reference, events.client_uuid
     FROM events
    WHERE client_uuid ~* E'^foo'
    ORDER BY server_timestamp DESC
    LIMIT 20;

                           QUERY PLAN (GOOD!)
--------------------------------------------------------------------------
 Limit  (cost=0.00..1433.14 rows=20 width=177)
   ->  Index Scan Backward using server_timestamp_idx on events  (cost=0.00..623055.91 rows=8695 width=177)
         Filter: (client_uuid ~* '^foo'::text)


(3) Alternatively making the query faster by using a smaller limit

  EXPLAIN
   SELECT events.id, events.client_duration, events.message,
          events.created_by, events.source, events.type, events.event,
          events.environment, events.server_timestamp, events.session_id,
          events.reference, events.client_uuid
     FROM events
    WHERE client_uuid ~* E'^foo bar so what'
    ORDER BY server_timestamp DESC
    LIMIT 10;

                           QUERY PLAN (GOOD!)
--------------------------------------------------------------------------
 Limit  (cost=0.00..259606.63 rows=10 width=177)
   ->  Index Scan Backward using server_timestamp_idx on events  (cost=0.00..623055.91 rows=24 width=177)
         Filter: (client_uuid ~* '^foo bar so what'::text)


I find myself wishing I could just put a SQL HINT on the query to force the
index to be used, but I understand that HINTs are considered harmful and are
therefore not provided in PostgreSQL -- so what is the recommended way to
solve this?
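(For reference, one approach commonly suggested for left-anchored pattern
matches on 8.4 is an expression index with text_pattern_ops, with the query
rewritten to match it. This is an illustrative sketch, not something proposed
in the thread; the index name is made up:)

```sql
-- Case-insensitive anchored match served via a case-folded expression index.
CREATE INDEX events_client_uuid_lower_idx
    ON events (lower(client_uuid) text_pattern_ops);

-- The predicate must be rewritten to use the indexed expression,
-- with a case-sensitive (~) anchored pattern in lower case:
SELECT id, server_timestamp
  FROM events
 WHERE lower(client_uuid) ~ '^foo bar so what'
 ORDER BY server_timestamp DESC
 LIMIT 20;
```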

thank you very much


Re: [PERFORM] SELECT ignoring index even though ORDER BY and LIMIT present

2010-06-02 Thread Kevin Grittner
Jori Jovanovich j...@dimensiology.com wrote:
 
 what is the recommended way to solve this?
 
The recommended way is to adjust your costing configuration to
better reflect your environment.  What version of PostgreSQL is
this?  What do you have set in your postgresql.conf file?  What does
the hardware look like?  How big is the active (frequently
referenced) portion of your database?
 
-Kevin



Re: [PERFORM] SELECT ignoring index even though ORDER BY and LIMIT present

2010-06-02 Thread Szymon Guz
2010/6/2 Jori Jovanovich j...@dimensiology.com

 hi,

 I have a problem space where the main goal is to search backward in time
 for events.  Time can go back very far into the past, and so the
 table can get quite large.  However, the vast majority of queries are all
 satisfied by relatively recent data.  I have an index on the row creation
 date and I would like almost all of my queries to have a query plan looking
 something like:



[CUT]

Do you have autovacuum running? Have you tried updating statistics?

regards
Szymon Guz


Re: [PERFORM] SELECT ignoring index even though ORDER BY and LIMIT present

2010-06-02 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Jori Jovanovich j...@dimensiology.com wrote:
 what is the recommended way to solve this?
 
 The recommended way is to adjust your costing configuration to
 better reflect your environment.

Actually, it's probably not the costs so much as the row estimates.
For instance, that first query was estimated to select 20 out of a
possible 24 rows.  If 24 is indeed the right number of matches, then
the planner is right and the OP is wrong: the indexscan is going to
have to traverse almost all of the table and therefore it will be a
lot slower than seqscan + sort.  Now, if the real number of matches
is a lot more than that, then the indexscan would make sense because it
could be expected to get stopped by the LIMIT before it has to traverse
too much of the table.  So the true problem is to get the rowcount
estimate to line up with reality.

Unfortunately the estimates for ~* are typically not very good.
If you could convert that to plain ~ (case sensitive) it'd probably
work better.  Also, if this isn't a particularly modern version of
Postgres, a newer version might do a bit better with the estimate.
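(Illustrative follow-ups to the two suggestions above -- my sketch, not part
of the original message:)

```sql
-- Case-sensitive variant of the predicate, if the column's case is known;
-- plain ~ is estimated somewhat better than ~*:
EXPLAIN
 SELECT id
   FROM events
  WHERE client_uuid ~ E'^foo bar so what'
  ORDER BY server_timestamp DESC
  LIMIT 20;

-- A larger per-column statistics sample, which may sharpen the row estimate:
ALTER TABLE events ALTER COLUMN client_uuid SET STATISTICS 1000;
ANALYZE events;
```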

regards, tom lane



Re: [PERFORM] SELECT ignoring index even though ORDER BY and LIMIT present

2010-06-02 Thread Bob Lunney
Jori,

What is the PostgreSQL 
version/shared_buffers/work_mem/effective_cache_size/default_statistics_target? 
 Are the statistics for the table up to date?  (Run analyze verbose tablename 
to update them.)  Table and index structure would be nice to know, too.

If all else fails you can set enable_seqscan = off for the session, but that is 
a Big Hammer for what is probably a smaller problem.
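(A minimal sketch of that session-local hammer, useful as a diagnostic rather
than a fix -- note the planner will still use a sequential scan when no other
path exists, since "off" only raises its cost:)

```sql
SET enable_seqscan = off;   -- affects the current session only
EXPLAIN
 SELECT id FROM events
  WHERE client_uuid ~* E'^foo bar so what'
  ORDER BY server_timestamp DESC
  LIMIT 20;                 -- re-check which plan is chosen
RESET enable_seqscan;
```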

Bob Lunney

--- On Wed, 6/2/10, Jori Jovanovich j...@dimensiology.com wrote:

From: Jori Jovanovich j...@dimensiology.com
Subject: [PERFORM] SELECT ignoring index even though ORDER BY and LIMIT present
To: pgsql-performance@postgresql.org
Date: Wednesday, June 2, 2010, 4:28 PM

[CUT]