[PERFORM] Seq Scan used instead of Index Scan

2011-11-23 Thread Gary Warner
Very Fast Version:

Recently my database stopped respecting one of my indexes, which took a query 
that should run in subsecond response time and turning it into something that 
with small data sets runs in the 7-10 minute range and with large data sets 
runs in the 30 minute - eternity range.

Explain Analyze tells me that what used to be an Index Scan has become a Seq 
Scan, doing a full table scan through 140 million records.

Any thoughts on why that happens?

+++
Longer version:

I have a very large database (billion+ records) that has begun not respecting 
indexes on some queries, resulting in what used to be instant answers now 
taking many minutes and I'm trying to figure out why.

I'll try to simplify the request this way:

I have a database of many hundreds of millions of email messages, and 
information about those messages.  There is a table that tells me when I 
received emails and from what source.  There is another table that tells what 
URLs were seen in the bodies of those emails and there is a table that links 
those two tables together.  Since many emails can contain the same URL, and 
many URLs can be seen in each email, we divide like this:

Tables
===
emails

link_urls

email_links

email.message_id = link_url.message_id
linkurl.urlid= email_links.urlid


One attribute of the URL is the hostname portion.

In my puzzle, I have a table of hostnames, and I want to get statistics about 
which emails contained URLs that pointed to those hostnames.

A very simplified version of the query could be:

select email.stuff from email natural join link_url natural join email_links 
where hostname = 'foo.bar.com';


We know that there were two emails that advertised foo.bar.com, so the 
objective is to link the presence of foo.bar.com to a URLID, use the link_url 
table to find the respective message_id and then go ask the email table for 
details about those two messages.


My explain analyze of that query takes seven minutes to complete, despite the 
fact that I have an index on message_id, and urlid, and machine where they 
occur in each of the three tables.

We broke the query down into it's components, and confirmed that what we 
consider the three natural pieces of the query each run blazing fast, each 
using the proper indexes.  But when we put them all together, the index gets 
ignored, and we sequential scan a 140 million row table.

Which is slow.


===
If we break it into pieces ... a human might think of the pieces as:




explain analyze select urlid from email_links where machine = 'foo.bar.com';

   QUERY PLAN  
---
 Bitmap Heap Scan on email_links  (cost=97.10..9650.58 rows=2457 width=33) 
(actual time=0.049..0.049 rows=1 loops=1)
   Recheck Cond: (machine = 'foo.bar.com'::text)
   -  Bitmap Index Scan on hostdex  (cost=0.00..96.49 rows=2457 width=0) 
(actual time=0.039..0.039 rows=1 loops=1)
 Index Cond: (machine = 'foo.bar.com'::text)
 Total runtime: 0.066 ms
(5 rows)


=

= explain analyze select message_id from link_url where urlid = 
'9de6440fcc089c654806bd4c853c76f1';
  QUERY PLAN
---
 Bitmap Heap Scan on link_url  (cost=65.20..5680.14 rows=1437 width=4) (actual 
time=0.074..0.083 rows=2 loops=1)
   Recheck Cond: (urlid = '9de6440fcc089c654806bd4c853c76f1'::text)
   -  Bitmap Index Scan on link_url_hkey  (cost=0.00..64.84 rows=1437 width=0) 
(actual time=0.021..0.021 rows=2 loops=1)
 Index Cond: (urlid = '9de6440fcc089c654806bd4c853c76f1'::text)
 Total runtime: 0.109 ms
(5 rows)

=

explain analyze select stuff from email where message_id in (78085350, 
78088168);
 QUERY PLAN 
---
 Bitmap Heap Scan on email  (cost=17.19..25.21 rows=2 width=8) (actual 
time=0.068..0.077 rows=2 loops=1)
   Recheck Cond: (message_id = ANY ('{78085350,78088168}'::integer[]))
   -  Bitmap Index Scan on email_pkey  (cost=0.00..17.19 rows=2 width=0) 
(actual time=0.054..0.054 rows=2 loops=1)
 Index Cond: (message_id = ANY ('{78085350,78088168}'::integer[]))
 Total runtime: 0.100 ms
(5 rows)

++


In my simple mind that should take (.1 + .109 + .066) = FAST.

Here's the Explain Analyze for the combined version:


explain analyze select email.stuff from email natural join link_url natural 
join email_link where machine = 'foo.bar.com';

 QUERY PLAN   
---
 Merge Join  

Re: [PERFORM] Seq Scan used instead of Index Scan

2011-11-23 Thread Claudio Freire
On Wed, Nov 23, 2011 at 7:24 PM, Gary Warner g...@cis.uab.edu wrote:
 See that Seq Scan on link_url?  We can't figure out why that is there!  We 
 should be scanning for a matching urlid and we have an index on urlid?

 When this is happening in a two table version of this problem, we can get 
 temporary relief by giving the statement:

 set enable_seqscan = false;

Obviously, because it thinks the index scan will perform worse.

It would be interesting to see the explain analyze with
enable_seqscan=false to see why

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Seq Scan used instead of Index Scan

2011-11-23 Thread Mark Kirkwood
Can you post your non-default postgresql.conf settings? (I'd hazard a 
guess that you have effective_cache_size set to the default 128MB).


Best wishes

Mark

On 24/11/11 11:24, Gary Warner wrote:

Very Fast Version:

Recently my database stopped respecting one of my indexes, which took a query that should 
run in subsecond response time and turning it into something that with small 
data sets runs in the 7-10 minute range and with large data sets runs in the 30 minute - 
eternity range.

Explain Analyze tells me that what used to be an Index Scan has become a Seq 
Scan, doing a full table scan through 140 million records.

Any thoughts on why that happens?




--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Seq Scan used instead of Index Scan

2011-11-23 Thread Tom Lane
Gary Warner g...@cis.uab.edu writes:
 Recently my database stopped respecting one of my indexes, which took a query 
 that should run in subsecond response time and turning it into something 
 that with small data sets runs in the 7-10 minute range and with large data 
 sets runs in the 30 minute - eternity range.

 Explain Analyze tells me that what used to be an Index Scan has become a Seq 
 Scan, doing a full table scan through 140 million records.

 Any thoughts on why that happens?

I'd bet it has a lot to do with the nigh-three-orders-of-magnitude
overestimates of the numbers of matching rows.  You might find that
increasing the statistics targets for the indexed columns helps ---
I'm guessing that these particular key values are out in the long
tail of a highly skewed distribution, and the planner needs a larger MCV
list to convince it that non-MCV values will not occur very many times.

If that is an accurate guess, then trying to force the matter with
something like enable_seqscan = off is not a good production solution,
because it will result in horrid plans whenever you decide to query
a not-so-infrequent value.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance