Re: [PERFORM] Too Many OR's?

2005-11-15 Thread Scott Marlowe
On Tue, 2005-11-15 at 13:12, Bill McGonigle wrote:
> I have a query that's making the planner do the wrong thing (for my 
> definition of wrong) and I'm looking for advice on what to tune to make 
> it do what I want.
> 
> The query consists of SELECT'ing a few fields from a table for a large 
> number of rows.  The table has about seventy thousand rows and the user 
> is selecting some subset of them.  I first do a SELECT...WHERE to 
> determine the unique identifiers I want (works fine) and then I do a 
> SELECT WHERE IN giving the list of id's I need additional data on 
> (which I see from EXPLAIN just gets translated into a very long list of 
> OR's).
> 
> Everything works perfectly until I get to 65301 rows.  At 65300 rows, 
> it does an index scan and takes 2197.193 ms.  At 65301 rows it switches 
> to a sequential scan and takes 778951.556 ms.  Values known not to 
> affect this are: work_mem, effective_cache_size.  Setting 
> random_page_cost from 4 to 1 helps (79543.214 ms) but I'm not really 
> sure what '1' means, except it's relative.  Of course, setting 
> 'enable_seqscan false' helps immensely (2337.289 ms) but that's as 
> inelegant of a solution as I've found - if there were other databases 
> on this install that wouldn't be the right approach.
> 
> Now I can break this down into multiple SELECT's in code, capping each 
> query at 65300 rows, and that's a usable workaround, but academically 
> I'd like to know how to convince the planner to do it my way.  It's 
> making a bad guess about something but I'm not sure what.  I didn't see 
> any hard-coded limits grepping through the source (though it is close 
> to the 16-bit unsigned boundary - probably coincidental) so if anyone 
> has ideas or pointers to how I might figure out what's going wrong that 
> would be helpful.

OK, there IS a point at which switching to a sequential scan will be
faster, i.e. when you're retrieving most or all of the table.  But the
database is picking a switchover point that is too low.

First, we need to know if the statistics are giving the query planner a
good enough idea of how many rows it's really gonna get versus how many
it expects.

Do an EXPLAIN on the query and see how many rows it thinks it's gonna
get.  Since you've actually run it, you know how many it really is going
to get, so there's no need for an EXPLAIN ANALYZE just yet.
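As a sketch (the table and column names here are made up for illustration, and the plan output is only a shape to look for):

```sql
-- Hypothetical table "widgets"; the numbers shown are illustrative.
EXPLAIN SELECT id, payload FROM widgets WHERE id IN (1, 2, 3);

-- Look for the rows= figure in the top line of the output, e.g.:
--   Seq Scan on widgets  (cost=0.00..3200.00 rows=65301 width=36)
-- If rows= is close to the count the query really returns, the stats
-- are fine; if it's off by a large factor, raise the statistics target.
```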

Now, as long as the approximation is pretty close, fine.  But if it's
off by a large factor, then we need to increase the statistics target on
that column, with:

ALTER TABLE name ALTER columnname SET STATISTICS xxx

where xxx is the new target.  The default is set in your postgresql.conf
file (default_statistics_target) and is usually pretty low, say 10.  You
can go up to 1000, but that makes query planning take longer.  Try
incremental increases, say to 20, 40, or even 100; after each change run
ANALYZE on the table, then EXPLAIN again, until the estimate is close.
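Concretely, the loop looks something like this (table/column names and the target value of 100 are just examples):

```sql
-- Raise the per-column statistics target, then refresh the stats.
ALTER TABLE widgets ALTER COLUMN id SET STATISTICS 100;
ANALYZE widgets;

-- Re-check the estimate; repeat with a higher target if it's still far off.
EXPLAIN SELECT id, payload FROM widgets WHERE id IN (1, 2, 3);
```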

Once the estimate is close, you can change random_page_cost to get the
query planner to switch at the "right" time.  Vary the number of values
in the IN() list and play with random_page_cost to see where that sweet
spot is.  Note that what seems right on a single table for a single user
may not be best as you increase load or access other tables.
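You can experiment per-session before touching postgresql.conf; the value 2 here is only a starting point to try, not a recommendation:

```sql
-- Session-local experiment: lower the random-access cost estimate.
SET random_page_cost = 2;
EXPLAIN SELECT id, payload FROM widgets WHERE id IN (1, 2, 3);
-- If the plan flips back to an index scan at your real IN() sizes,
-- persist the setting in postgresql.conf:  random_page_cost = 2
```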

random_page_cost represents the cost of a random page access relative to
a sequential one.  As long as your data fits into RAM, the difference is
pretty much none (i.e. random_page_cost=1), so don't set it too low, or
accessing REALLY large data sets could become REALLY slow, because the
planner will use indexes when it should have been sequentially scanning.

Also, check what you've got effective_cache_size set to.  This tells
PostgreSQL how much memory your kernel is using for disk cache, and so
lets it know how likely it is that the data your current query needs is
already in there.
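To check and set it (the figure below is illustrative; on 8.x the value is in 8 kB pages, so 65536 is roughly 512 MB):

```sql
SHOW effective_cache_size;          -- current value, from psql

-- In postgresql.conf, size it to roughly what the OS reports as
-- kernel-cached memory, e.g.:
--   effective_cache_size = 65536   -- 8 kB pages; illustrative only
```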

Also, read this:

http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html

---(end of broadcast)---
TIP 6: explain analyze is your friend


[PERFORM] Too Many OR's?

2005-11-15 Thread Bill McGonigle
I have a query that's making the planner do the wrong thing (for my 
definition of wrong) and I'm looking for advice on what to tune to make 
it do what I want.


The query consists of SELECT'ing a few fields from a table for a large 
number of rows.  The table has about seventy thousand rows and the user 
is selecting some subset of them.  I first do a SELECT...WHERE to 
determine the unique identifiers I want (works fine) and then I do a 
SELECT WHERE IN giving the list of id's I need additional data on 
(which I see from EXPLAIN just gets translated into a very long list of 
OR's).


Everything works perfectly until I get to 65301 rows.  At 65300 rows, 
it does an index scan and takes 2197.193 ms.  At 65301 rows it switches 
to a sequential scan and takes 778951.556 ms.  Values known not to 
affect this are: work_mem, effective_cache_size.  Setting 
random_page_cost from 4 to 1 helps (79543.214 ms) but I'm not really 
sure what '1' means, except it's relative.  Of course, setting 
'enable_seqscan false' helps immensely (2337.289 ms) but that's as 
inelegant of a solution as I've found - if there were other databases 
on this install that wouldn't be the right approach.


Now I can break this down into multiple SELECT's in code, capping each 
query at 65300 rows, and that's a usable workaround, but academically 
I'd like to know how to convince the planner to do it my way.  It's 
making a bad guess about something but I'm not sure what.  I didn't see 
any hard-coded limits grepping through the source (though it is close 
to the 16-bit unsigned boundary - probably coincidental) so if anyone 
has ideas or pointers to how I might figure out what's going wrong that 
would be helpful.


Thanks,
-Bill

-
Bill McGonigle, Owner   Work: 603.448.4440
BFC Computing, LLC  Home: 603.448.1668
[EMAIL PROTECTED]   Mobile: 603.252.2606
http://www.bfccomputing.com/    Pager: 603.442.1833
Jabber: [EMAIL PROTECTED]  Text: [EMAIL PROTECTED]
Blog: http://blog.bfccomputing.com/

