Re: [PERFORM] Incorrect estimates on correlated filters

2008-08-16 Thread Decibel!

On Aug 13, 2008, at 1:45 PM, Chris Kratz wrote:
Yes, I know hints are frowned upon around here.  Though, I'd love  
to have them or something equivalent on this particular query just  
so the customer can run their important reports.  As it is, it's  
unrunnable.



Actually, now that I think about it, the last time this was brought up
there was discussion about something that doesn't force a particular  
execution method, but instead provides improved information to the  
planner. It might be worth pursuing that, as I think there was less  
opposition to it.

--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828






Re: [PERFORM] Incorrect estimates on correlated filters

2008-08-14 Thread Gregory Stark
Craig Ringer [EMAIL PROTECTED] writes:

 It strikes me that there are really two types of query hint possible here.

 One tells the planner, e.g., "prefer a merge join here".

 The other gives the planner more information that it might not otherwise
 have to work with, so it can improve its decisions: "The values used in
 this join condition are highly correlated."

This sounds familiar:

http://article.gmane.org/gmane.comp.db.postgresql.devel.general/55730/match=hints

Plus ça change...

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning



Re: [PERFORM] Incorrect estimates on correlated filters

2008-08-13 Thread Decibel!

On Aug 12, 2008, at 4:59 PM, Chris Kratz wrote:
Ran into a recurring performance problem with some report queries
again today.  In a nutshell, we have filters on either multiple  
joined tables, or multiple columns on a single table that are  
highly correlated.  So, the estimates come out grossly incorrect  
(the planner has no way to know they are correlated).  2000:1 for  
one I'm looking at right now.  Generally this doesn't matter,  
except in complex reporting queries like these when this is the  
first join of 40 other joins.  Because the estimate is wrong at the  
lowest level, it snowballs up through the rest of the joins causing  
the query to run very, very slowly.   In many of these cases,  
forcing nested loops off for the duration of the query fixes the  
problem.  But I have a couple that still are painfully slow and  
shouldn't be.


I've been reading through the archives with others having similar  
problems (including myself a year ago).  Am I right in assuming  
that at this point there is still little we can do in postgres to  
speed up this kind of query?  Right now the planner has no way to  
know the correlation between different columns in the same table,  
let alone columns in different tables.  So, it just assumes no  
correlation and returns incorrectly low estimates in cases like these.


The only solution I've come up with so far is to materialize  
portions of the larger query into subqueries with these correlated  
filters which are indexed and analyzed before joining into the  
larger query.  This would keep the incorrect estimates from  
snowballing up through the chain of joins.


Are there any other solutions to this problem?



Well... you could try and convince certain members of the community  
that we actually do need some kind of a query hint mechanism... ;)


I did make a suggestion a few months ago that involved sorting a  
table on different columns and recording the correlation of other  
columns. The scheme isn't perfect, but it would help detect cases  
like a field populated by a sequence and another field that's an insert
timestamp; those two fields would correlate highly, and you should  
even be able to correlate the two histograms; that would allow you to  
infer that most of the insert times for _id's between 100 and 200  
will be between 2008-01-01 00:10 and 2008-01-01 00:20, for example.
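
For a quick sanity check of whether two columns really track each other
like that, the built-in corr() aggregate works today. A minimal sketch,
with made-up table and column names:

    -- Pearson correlation between a sequence-populated id and its
    -- insert timestamp; values near 1.0 mean the two columns rise
    -- together, i.e. they are strongly correlated.
    SELECT corr(id::double precision,
                extract(epoch FROM inserted_at)) AS id_vs_time_corr
    FROM some_table;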

--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828






Re: [PERFORM] Incorrect estimates on correlated filters

2008-08-13 Thread Chris Kratz
On Wed, Aug 13, 2008 at 10:59 AM, Decibel! [EMAIL PROTECTED] wrote:

 On Aug 12, 2008, at 4:59 PM, Chris Kratz wrote:

 [original problem description trimmed; see the first message in the thread below]

 Well... you could try and convince certain members of the community that we
 actually do need some kind of a query hint mechanism... ;)

 I did make a suggestion a few months ago that involved sorting a table on
 different columns and recording the correlation of other columns. The scheme
 isn't perfect, but it would help detect cases like a field populated by a
 sequence and another field that's an insert timestamp; those two fields would
 correlate highly, and you should even be able to correlate the two
 histograms; that would allow you to infer that most of the insert times for
 _id's between 100 and 200 will be between 2008-01-01 00:10 and 2008-01-01
 00:20, for example.


Thanks for the reply,

Yes, I know hints are frowned upon around here.  Though, I'd love to have
them or something equivalent on this particular query just so the customer
can run their important reports.  As it is, it's unrunnable.

Unfortunately, I don't think the sorting idea would help in the one case
I'm looking at which involves filters on two tables that are joined
together.  The filters happen to be correlated such that about 95% of the
rows from each filtered table are actually returned after the join.
Unfortunately, the planner thinks we will get 1 row back.

I do have to find a way to make these queries runnable.  I'll keep looking.

Thanks,

-Chris


Re: [PERFORM] Incorrect estimates on correlated filters

2008-08-13 Thread Alvaro Herrera
Chris Kratz wrote:

 Unfortunately, I don't think the sorting idea would help in the one case
 I'm looking at which involves filters on two tables that are joined
 together.  The filters happen to be correlated such that about 95% of the
 rows from each filtered table are actually returned after the join.
 Unfortunately, the planner thinks we will get 1 row back.

Maybe you can wrap that part of the query in a SQL function and set its
estimated row count to the real value with ALTER FUNCTION ... ROWS.
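
A minimal sketch of that idea, with made-up names and row counts:

    -- Wrap the badly-estimated, correlated-filter part of the query
    -- in a set-returning SQL function.
    CREATE FUNCTION filtered_part()
    RETURNS SETOF some_table AS $$
        SELECT * FROM some_table
        WHERE col_a = 'x' AND col_b = 'y'  -- the correlated filters
    $$ LANGUAGE sql STABLE ROWS 9500;      -- planner now assumes ~9500 rows

    -- If the estimate drifts, correct it without rewriting the function:
    ALTER FUNCTION filtered_part() ROWS 12000;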

-- 
Alvaro Herrera        http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [PERFORM] Incorrect estimates on correlated filters

2008-08-13 Thread Craig Ringer
Decibel! wrote:

 Well... you could try and convince certain members of the community that
 we actually do need some kind of a query hint mechanism... ;)

It strikes me that there are really two types of query hint possible here.

One tells the planner, e.g., "prefer a merge join here".

The other gives the planner more information that it might not otherwise
have to work with, so it can improve its decisions: "The values used in
this join condition are highly correlated."

Is there anything wrong with the second approach? It shouldn't tend to
suppress planner bug reports etc. Well, not unless people use it to lie
to the planner, and I expect results from that would be iffy at best. It
just provides information to supplement Pg's existing stats system to
handle cases where it's not able to reasonably collect the required
information.
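
The closest existing knob in that spirit is the per-column statistics
target, though it can't express cross-column correlation. With
illustrative names:

    -- Gather a bigger histogram/MCV sample for a troublesome column,
    -- then re-analyze so the planner actually sees it.
    ALTER TABLE some_table ALTER COLUMN col_a SET STATISTICS 1000;
    ANALYZE some_table;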

--
Craig Ringer



[PERFORM] Incorrect estimates on correlated filters

2008-08-12 Thread Chris Kratz
Hello All,

Ran into a recurring performance problem with some report queries again
today.  In a nutshell, we have filters on either multiple joined tables, or
multiple columns on a single table that are highly correlated.  So, the
estimates come out grossly incorrect (the planner has no way to know they
are correlated).  2000:1 for one I'm looking at right now.  Generally this
doesn't matter, except in complex reporting queries like these when this is
the first join of 40 other joins.  Because the estimate is wrong at the
lowest level, it snowballs up through the rest of the joins causing the
query to run very, very slowly.   In many of these cases, forcing nested
loops off for the duration of the query fixes the problem.  But I have a
couple that still are painfully slow and shouldn't be.
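
By "forcing nested loops off" I mean the usual GUC toggle, scoped so it
only affects the problem query:

    BEGIN;
    SET LOCAL enable_nestloop = off;  -- lasts until COMMIT/ROLLBACK
    -- ... run the reporting query here ...
    COMMIT;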

I've been reading through the archives with others having similar problems
(including myself a year ago).  Am I right in assuming that at this point
there is still little we can do in postgres to speed up this kind of query?
Right now the planner has no way to know the correlation between different
columns in the same table, let alone columns in different tables.  So, it
just assumes no correlation and returns incorrectly low estimates in cases
like these.

The only solution I've come up with so far is to materialize portions of the
larger query into subqueries with these correlated filters which are indexed
and analyzed before joining into the larger query.  This would keep the
incorrect estimates from snowballing up through the chain of joins.
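
Roughly, with made-up names, that workaround looks like:

    -- Materialize the correlated-filter portion first ...
    CREATE TEMP TABLE pre_filtered AS
        SELECT t.id, t.payload
        FROM some_table t
        JOIN other_table o ON o.id = t.other_id
        WHERE t.col_a = 'x' AND o.col_b = 'y';  -- correlated filters

    CREATE INDEX pre_filtered_id_idx ON pre_filtered (id);
    ANALYZE pre_filtered;  -- planner now sees the true row count

    -- ... then join pre_filtered into the rest of the reporting query.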

Are there any other solutions to this problem?

Thanks,

-Chris