[PERFORM] annoying query/planner choice

2004-01-11 Thread Andrew Rawnsley
I have a situation that is giving me small fits, and would like to see  
if anyone can shed any light on it.

I have a modest table (@1.4 million rows, and growing), that has a  
variety of queries run against it. One is
a very straightforward one - pull a set of distinct rows out based on  
two columns, with a simple where clause
based on one of the indexed columns. For illustration here, I've  
removed the distinct and order-by clauses, as
they are not the culprits.

Before I go on - v7.4.1, currently on a test box, dual P3, 1G ram, 10K  
scsi, Slackware 9 or so. The table has been
vacuumed and analyzed. Even offered pizza and beer. Production box will  
be a dual Xeon with 2G ram and RAID 5.

When the query is run with a where clause that returns small number of  
rows, the query uses the index and is quite speedy:

rav=# explain analyze select casno, parameter from hai.results where  
site_id = 9982;
  QUERY PLAN
 
--
 Index Scan using hai_res_siteid_ndx on results  (cost=0.00..7720.87  
rows=2394 width=30) (actual time=12.118..12.933 rows=50 loops=1)
   Index Cond: (site_id = 9982)
 Total runtime: 13.145 ms

When a query is run that returns a much larger set, the index is not  
used, I assume because the planner thinks that a sequential scan
would work just as well with a large result set:

rav=# explain analyze select casno, parameter from hai.results where  
site_id = 18;
  QUERY PLAN
 
--
 Seq Scan on results  (cost=0.00..73396.39 rows=211205 width=30)  
(actual time=619.020..15012.807 rows=186564 loops=1)
   Filter: (site_id = 18)
 Total runtime: 15279.789 ms
(3 rows)

Unfortunately, its way off:

rav=# set enable_seqscan=off;
SET
rav=# explain analyze select casno, parameter from hai.results where  
site_id = 18;
  QUERY  
PLAN
 
---
 Index Scan using hai_res_siteid_ndx on results  (cost=0.00..678587.01  
rows=211205 width=30) (actual time=9.575..3569.387 rows=186564 loops=1)
   Index Cond: (site_id = 18)
 Total runtime: 3872.292 ms
(3 rows)

I would like, of course, for it to use the index, given that it takes  
20-25% of the time. Fiddling with CPU_TUPLE_COST doesn't do anything  
until I exceed
0.5, which strikes me as a bit high (though please correct me if I am  
assuming too much...). RANDOM_PAGE_COST seems to have no effect. I  
suppose I could
cluster it, but it is constantly being added to, and would have to be  
re-done on a daily basis (if not more).

Any suggestions?





Andrew Rawnsley
President
The Ravensfield Digital Resource Group, Ltd.
(740) 587-0114
www.ravensfield.com
---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PERFORM] annoying query/planner choice

2004-01-11 Thread Dennis Bjorklund
On Sun, 11 Jan 2004, Andrew Rawnsley wrote:

 20-25% of the time. Fiddling with CPU_TUPLE_COST doesn't do anything
 until I exceed 0.5, which strikes me as a bit high (though please
 correct me if I am assuming too much...). RANDOM_PAGE_COST seems to have
 no effect.

What about the effective cache size, is that set properly?

-- 
/Dennis Björklund


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] annoying query/planner choice

2004-01-11 Thread Andrew Rawnsley
Low (1000). I'll fiddle with that. I just noticed that the machine only 
has 512MB of ram in it, and not 1GB. I must
have raided it for some other machine...

On Jan 11, 2004, at 10:50 PM, Dennis Bjorklund wrote:

On Sun, 11 Jan 2004, Andrew Rawnsley wrote:

20-25% of the time. Fiddling with CPU_TUPLE_COST doesn't do anything
until I exceed 0.5, which strikes me as a bit high (though please
correct me if I am assuming too much...). RANDOM_PAGE_COST seems to 
have
no effect.
What about the effective cache size, is that set properly?

--
/Dennis Björklund
---(end of 
broadcast)---
TIP 1: subscribe and unsubscribe commands go to 
[EMAIL PROTECTED]



Andrew Rawnsley
President
The Ravensfield Digital Resource Group, Ltd.
(740) 587-0114
www.ravensfield.com
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] annoying query/planner choice

2004-01-11 Thread Christopher Browne
Centuries ago, Nostradamus foresaw when [EMAIL PROTECTED] (Andrew Rawnsley) would 
write:
 I would like, of course, for it to use the index, given that it
 takes 20-25% of the time. Fiddling with CPU_TUPLE_COST doesn't do
 anything until I exceed 0.5, which strikes me as a bit high (though
 please correct me if I am assuming too much...). RANDOM_PAGE_COST
 seems to have no effect. I suppose I could cluster it, but it is
 constantly being added to, and would have to be re-done on a daily
 basis (if not more).

 Any suggestions?

The apparent problem is a bad query plan, and for clustering to fix
it seems a disturbing answer.

A problem I saw last week with some query plans pointed to the issue
that the statistics were inadequate.

We had some queries where indexing on customer is extremely
worthwhile in nearly all cases, but it often wasn't happening.  The
problem was that the 10 bins in the default stats table would
collect up stats about a few _highly_ active customers, and pretty
much ignore the less active ones.  Because the bins were highly
dominated by the few common values, stats for the others were missing
and pretty useless.

I upped the size of the histogram from 10 to 100, and that allowed
stats to be kept for less active customers, GREATLY improving the
quality of the queries.

The point that falls out is that if you have a column which has a
bunch of discrete values (rather more than 10) that aren't near-unique
(e.g. - on a table with a million transactions, you have a only few
hundred customers), that's a good candidate for upping column stats.

Thus, you might try:
  ALTER TABLE MY_TABLE ALTER COLUMN SOME_COLUMN SET STATISTICS 50;
  ANALYZE MY_TABLE;
-- 
let name=cbbrowne and tld=ntlug.org in name ^ @ ^ tld;;
http://www.ntlug.org/~cbbrowne/postgresql.html
There's  no  longer  a  boycott  of  Apple.  But  MacOS  is  still  a
proprietary OS. -- RMS - June 13, 1998

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] annoying query/planner choice

2004-01-11 Thread Tom Lane
Andrew Rawnsley [EMAIL PROTECTED] writes:
 I have a situation that is giving me small fits, and would like to see  
 if anyone can shed any light on it.

In general, pulling 10% of a table *should* be faster as a seqscan than
an indexscan, except under the most extreme assumptions about clustering
(is the table clustered on site_id, by any chance?).  What I suspect is
that the table is a bit larger than your available RAM, so that a
seqscan ends up flushing all of the kernel's cache and forcing a lot of
I/O, whereas an indexscan avoids the cache flush by not touching (quite)
all of the table.  The trouble with this is that the index only looks
that good under test conditions, ie, when you repeat it just after an
identical query that pulled all of the needed pages into RAM.  Under
realistic load conditions where different site_ids are being hit, the
indexscan is not going to be as good as you think, because it will incur
substantial I/O.

You should try setting up a realistic test load hitting different random
site_ids, and see whether it's really a win to force seqscan off for
this query or not.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster