Re: [HACKERS] WIP: cross column correlation, 2nd shot

2012-03-14 Thread Robert Haas
On Tue, Mar 13, 2012 at 9:56 AM, Hans-Jürgen Schönig
postg...@cybertec.at wrote:
 Here's the cross-col patch against today's master branch.

Please add your patch here, so it doesn't get forgotten:

https://commitfest.postgresql.org/action/commitfest_view/open

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: cross column correlation ...

2011-02-28 Thread Robert Haas
On Sun, Feb 27, 2011 at 5:17 PM, Josh Berkus j...@agliodbs.com wrote:

 I think there would be value in giving the DBA an easier way to see
 which tables are hot, but I am really leery about the idea of trying
 to feed that directly into the query planner.  I think this is one of
 those cases where we let people tune it manually for starters, and
 then wait for feedback.  Eventually someone will say "oh, I never tune
 that by hand any more, ever since I wrote this script which does the
 following computation... and I just run it out of cron."  And then we
 will get out the party hats.  But we will never get the experience we
 need to say what that auto-tuning algorithm will be unless we first
 provide the knob for someone to fiddle with manually.

 I'm not disagreeing with that.  I'm saying first, we give DBAs a way to
 see which tables are currently hot.  Such a feature has multiple
 benefits, making it worth the overhead and/or coding effort.

 Whether we're shooting for autotuning or manual tuning, it starts with
 having the data.

Well, what we have now is a bunch of counters in pg_stat_all_tables
and pg_statio_all_tables.  Making that easier for the DBA almost seems
like more of a job for a third-party tool that, say, graphs it, than a
job for PG itself.  But if you have an idea, I'm all ears.
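
For what it's worth, a minimal sketch of the kind of delta tracking such a
tool would do, using a hypothetical snapshot table (all names here are
illustrative, not an existing tool):

    CREATE TABLE hot_table_snapshot AS
        SELECT now() AS taken_at, relid, relname,
               seq_scan, coalesce(idx_scan, 0) AS idx_scan,
               n_tup_ins + n_tup_upd + n_tup_del AS writes
        FROM pg_stat_all_tables;

    -- ...some time later, diff the snapshot against a fresh read of the view:
    SELECT t.relname,
           t.seq_scan - s.seq_scan              AS seq_scans_since,
           coalesce(t.idx_scan, 0) - s.idx_scan AS idx_scans_since,
           t.n_tup_ins + t.n_tup_upd + t.n_tup_del - s.writes AS writes_since
    FROM pg_stat_all_tables t
    JOIN hot_table_snapshot s USING (relid)
    ORDER BY (t.seq_scan - s.seq_scan)
           + (coalesce(t.idx_scan, 0) - s.idx_scan)
           + (t.n_tup_ins + t.n_tup_upd + t.n_tup_del - s.writes) DESC
    LIMIT 10;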

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: cross column correlation ...

2011-02-28 Thread Alvaro Herrera
Excerpts from Robert Haas's message of Sat Feb 26 02:24:26 -0300 2011:
 On Fri, Feb 25, 2011 at 2:26 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
  Excerpts from Rod Taylor's message of Fri Feb 25 14:03:58 -0300 2011:
 
  How practical would it be for analyze to keep a record of response times 
  for
  given sections of a table as it randomly accesses them and generate some
  kind of a map for expected response times for the pieces of data it is
  analysing?
 
  I think what you want is random_page_cost that can be tailored per
  tablespace.
 
 We have that.

Oh, right.

 But it's not the same as tracking *sections of a table*.

I dunno.  I imagine if you have a section of a table in different
storage than other sections, you created a tablespace and moved the
partition holding that section there.  Otherwise, how do you prevent the
tuples from moving to other sections?  (We don't really have a concept
of sections of a table.)

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] WIP: cross column correlation ...

2011-02-28 Thread Rod Taylor
  But it's not the same as tracking *sections of a table*.

 I dunno.  I imagine if you have a section of a table in different
 storage than other sections, you created a tablespace and moved the
 partition holding that section there.  Otherwise, how do you prevent the
 tuples from moving to other sections?  (We don't really have a concept
 of sections of a table.)


Section could be as simple as being on the inner or outer part of a single
disk, or as complicated as being on the SSD cache of a spinning disk, or in
the multi-gigabyte cache on the raid card or SAN due to being consistently
accessed.

Section is the wrong word. If primary key values under 10 million are
consistently accessed, they will be cached even if they do get moved through
the structure. Values over 10M may be fast if on the same page as the other
value but probably aren't.

This is very evident when dealing with time based data in what can be a very
large structure. 1% may be very hot and in memory while 99% is not.

Partitioning only helps if you can predict what will be hot in the future.
Sometimes an outside source (world events) impacts what section of the
structure is hot.

regards,

Rod


Re: [HACKERS] WIP: cross column correlation ...

2011-02-27 Thread Bruce Momjian
Rod Taylor wrote:
 On Fri, Feb 25, 2011 at 14:26, Alvaro Herrera 
 alvhe...@commandprompt.com wrote:
 
  Excerpts from Rod Taylor's message of Fri Feb 25 14:03:58 -0300 2011:
 
   How practical would it be for analyze to keep a record of response times
  for
   given sections of a table as it randomly accesses them and generate some
   kind of a map for expected response times for the pieces of data it is
   analysing?
 
  I think what you want is random_page_cost that can be tailored per
  tablespace.
 
 
 Yes, that can certainly help but does nothing to help with finding typical
 hot-spots or cached sections of the table and sending that information to
 the planner.
 
 Between Analyze random sampling and perhaps some metric during actual IO of
 random of queries we should be able to determine and record which pieces of
 data tend to be hot/in cache, or readily available and what data tends not
 to be.
 
 
 If the planner knew that the value 1 tends to have a much lower cost to
 fetch than any other value in the table (it is cached or otherwise readily
 available), it can choose a plan better suited toward that.

Well, one idea I have always had is feeding things the executor finds
back to the optimizer for use in planning future queries.  One argument
against that is that a planned query might run with different data
behavior than seen by the executor in the past, but we know if the
optimizer is planning something for immediate execution or later
execution, so we could use executor stats only when planning for
immediate execution.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] WIP: cross column correlation ...

2011-02-27 Thread Bruce Momjian
Grzegorz Jaskiewicz wrote:
 
 On 25 Feb 2011, at 13:18, Robert Haas wrote:
 
   People coming from Oracle are not favorably
  impressed either by the amount of monitoring data PostgreSQL can
  gather or by the number of knobs that are available to fix problems
  when they occur.  We don't need to have as many knobs as Oracle and we
  probably don't want to, and for that matter we probably couldn't if we
  did want to for lack of manpower, but that doesn't mean we should have
  none.
 
 Still, having more data a user can probe would be nice. 
 
 I wonder why everyone avoids Microsoft's approach to the subject. Apparently, 
 they go in the 'auto-tune as much as possible' direction. 
 Tests we did a while ago, which involved asking a team from Microsoft and a
 team from Oracle to optimise the same set of queries over the same data
 (bookies' data, loads of it), showed that the auto-tuning Microsoft has in
 their SQL Server performed much better than a team of over-sweating Oracle DBAs.
 
 In my current workplace we have many deployments of the same system,
 on different types of machines, each with customer data that varies
 so much that queries need to be rather generic.
 PostgreSQL shows its strength with the planner doing a good job for different
 variants of data; however, we do very little tweaking of the configuration
 parameters, simply because it is too hard to keep track of all of them.
 I guess the systems could behave much better, but no one is going to
 tweak settings for 50 different installations over 50 different types of data
 and 50 different sets of hardware.
 If there were even a tiny amount of automation provided in PostgreSQL, I
 would welcome it with open arms.

I totally agree.  If we add a tuning parameter that does 10x better than
automatic, but only 1% of our users use it, we would be better off,
overall, with the automatic tuning.  See my blog post which talks about
the same tradeoff when adding configuration variables:

http://momjian.us/main/blogs/pgblog/2009.html#January_10_2009

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] WIP: cross column correlation ...

2011-02-27 Thread Bruce Momjian
Robert Haas wrote:
 On Sat, Feb 26, 2011 at 1:57 AM, Bruce Momjian br...@momjian.us wrote:
  Robert Haas wrote:
   Actually, we *do* have some idea which tables are hot.  Or at least, we
   could.  Currently, pg_stats for tables are timeless; they just
   accumulate from the last reset, which has always been a problem in
   general for monitoring.  If we could make top-level table and index
   stats time-based, even in some crude way, we would know which tables
   were currently hot.  That would also have the benefit of making server
   performance analysis and autotuning easier.
 
  I think there would be value in giving the DBA an easier way to see
  which tables are hot, but I am really leery about the idea of trying
  to feed that directly into the query planner.  I think this is one of
  those cases where we let people tune it manually for starters, and
  then wait for feedback.  Eventually someone will say "oh, I never tune
  that by hand any more, ever since I wrote this script which does the
  following computation... and I just run it out of cron."  And then we
  will get out the party hats.  But we will never get the experience we
  need to say what that auto-tuning algorithm will be unless we first
  provide the knob for someone to fiddle with manually.
 
  It is also possible we will implement a manual way and never get around
  to automating it. ? :-(
 
 You make it sound as if we know how but are just too lazy to write the
 code.  That is not one of the weaknesses that this community has.

Well, several automatic ideas have been floated, but rejected because
they don't work well for queries that are planned and executed later. 
Perhaps we should consider auto-tuning of queries that are planned for
immediate execution.  I just posed that idea in an email to this thread.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] WIP: cross column correlation ...

2011-02-27 Thread Robert Haas
On Sun, Feb 27, 2011 at 3:01 AM, Bruce Momjian br...@momjian.us wrote:
 Grzegorz Jaskiewicz wrote:

 On 25 Feb 2011, at 13:18, Robert Haas wrote:

   People coming from Oracle are not favorably
  impressed either by the amount of monitoring data PostgreSQL can
  gather or by the number of knobs that are available to fix problems
  when they occur.  We don't need to have as many knobs as Oracle and we
  probably don't want to, and for that matter we probably couldn't if we
  did want to for lack of manpower, but that doesn't mean we should have
  none.

 Still, having more data a user can probe would be nice.

 I wonder why everyone avoids Microsoft's approach to the subject. 
 Apparently, they go in the 'auto-tune as much as possible' direction.
 Tests we did a while ago, which involved asking a team from Microsoft and a
 team from Oracle to optimise the same set of queries over the same data
 (bookies' data, loads of it), showed that the auto-tuning Microsoft has in
 their SQL Server performed much better than a team of over-sweating Oracle DBAs.

 In my current workplace we have many deployments of the same system,
 on different types of machines, each with customer data that varies
 so much that queries need to be rather generic.
 PostgreSQL shows its strength with the planner doing a good job for different
 variants of data; however, we do very little tweaking of the configuration
 parameters, simply because it is too hard to keep track of all of them.
 I guess the systems could behave much better, but no one is going to
 tweak settings for 50 different installations over 50 different types of data
 and 50 different sets of hardware.
 If there were even a tiny amount of automation provided in PostgreSQL, I
 would welcome it with open arms.

 I totally agree.  If we add a tuning parameter that does 10x better than
 automatic, but only 1% of our users use it, we would be better off,
 overall, with the automatic tuning.

It's not an either/or proposition.  There is no reason why we can't
let things be tuned automatically, but provide overrides for cases
where the automatic tuning does not work well, of which there will
always be some.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: cross column correlation ...

2011-02-27 Thread Robert Haas
On Sun, Feb 27, 2011 at 3:03 AM, Bruce Momjian br...@momjian.us wrote:
  You make it sound as if we know how but are just too lazy to write the
 code.  That is not one of the weaknesses that this community has.

 Well, several automatic ideas have been floated, but rejected because
 they don't work well for queries that are planned and executed later.
 Perhaps we should consider auto-tuning of queries that are planned for
 immediate execution.  I just posed that idea in an email to this thread.

Which ideas were rejected for that reason?  If we're talking about the
idea of using the current contents of the buffer cache and perhaps the
OS cache to plan queries, I think that's not likely to work well even
if we do restrict it to queries that we're going to execute
immediately.  Upthread I listed four problems with the idea of
planning queries based on the current contents of shared_buffers, and
this certainly doesn't address all four.

http://archives.postgresql.org/pgsql-hackers/2011-02/msg02206.php

To reiterate my basic theme here one more time, we have a very good
query planner, but it can fall on its face very badly when it is
unable to correctly estimate selectivity, or due to caching effects,
and we have very little to recommend to people who run afoul of those
problems right now.  The problems are real, significant, and affect a
large number of users, some of whom give up on PostgreSQL as a direct
result.  I am glad that we are committed to having a system that is
auto-tuning to the greatest degree possible, but I think it is very
short-sighted of us not to provide workarounds for the cases where
they are legitimately needed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: cross column correlation ...

2011-02-27 Thread Josh Berkus

 I think there would be value in giving the DBA an easier way to see
 which tables are hot, but I am really leery about the idea of trying
 to feed that directly into the query planner.  I think this is one of
 those cases where we let people tune it manually for starters, and
 then wait for feedback.  Eventually someone will say "oh, I never tune
 that by hand any more, ever since I wrote this script which does the
 following computation... and I just run it out of cron."  And then we
 will get out the party hats.  But we will never get the experience we
 need to say what that auto-tuning algorithm will be unless we first
 provide the knob for someone to fiddle with manually.

I'm not disagreeing with that.  I'm saying first, we give DBAs a way to
see which tables are currently hot.  Such a feature has multiple
benefits, making it worth the overhead and/or coding effort.

Whether we're shooting for autotuning or manual tuning, it starts with
having the data.


-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] WIP: cross column correlation ...

2011-02-26 Thread Grzegorz Jaskiewicz

On 25 Feb 2011, at 13:18, Robert Haas wrote:

  People coming from Oracle are not favorably
 impressed either by the amount of monitoring data PostgreSQL can
 gather or by the number of knobs that are available to fix problems
 when they occur.  We don't need to have as many knobs as Oracle and we
 probably don't want to, and for that matter we probably couldn't if we
 did want to for lack of manpower, but that doesn't mean we should have
 none.

Still, having more data a user can probe would be nice. 

I wonder why everyone avoids Microsoft's approach to the subject. Apparently, 
they go in the 'auto-tune as much as possible' direction. 
Tests we did a while ago, which involved asking a team from Microsoft and a team
from Oracle to optimise the same set of queries over the same data (bookies'
data, loads of it), showed that the auto-tuning Microsoft has in their
SQL Server performed much better than a team of over-sweating Oracle DBAs.

In my current workplace we have many deployments of the same system, on
different types of machines, each with customer data that varies so
much that queries need to be rather generic.
PostgreSQL shows its strength with the planner doing a good job for different
variants of data; however, we do very little tweaking of the configuration
parameters, simply because it is too hard to keep track of all of them.
I guess the systems could behave much better, but no one is going to tweak
settings for 50 different installations over 50 different types of data and 50
different sets of hardware.
If there were even a tiny amount of automation provided in PostgreSQL, I
would welcome it with open arms.






Re: [HACKERS] WIP: cross column correlation ...

2011-02-26 Thread Rod Taylor
On Fri, Feb 25, 2011 at 14:26, Alvaro Herrera alvhe...@commandprompt.com wrote:

 Excerpts from Rod Taylor's message of Fri Feb 25 14:03:58 -0300 2011:

  How practical would it be for analyze to keep a record of response times
 for
  given sections of a table as it randomly accesses them and generate some
  kind of a map for expected response times for the pieces of data it is
  analysing?

 I think what you want is random_page_cost that can be tailored per
 tablespace.


Yes, that can certainly help but does nothing to help with finding typical
hot-spots or cached sections of the table and sending that information to
the planner.

Between Analyze random sampling and perhaps some metric during actual IO of
random of queries we should be able to determine and record which pieces of
data tend to be hot/in cache, or readily available and what data tends not
to be.


If the planner knew that the value 1 tends to have a much lower cost to
fetch than any other value in the table (it is cached or otherwise readily
available), it can choose a plan better suited toward that.


Re: [HACKERS] WIP: cross column correlation ...

2011-02-26 Thread Robert Haas
On Sat, Feb 26, 2011 at 4:33 AM, Grzegorz Jaskiewicz
g...@pointblue.com.pl wrote:

 On 25 Feb 2011, at 13:18, Robert Haas wrote:

  People coming from Oracle are not favorably
 impressed either by the amount of monitoring data PostgreSQL can
 gather or by the number of knobs that are available to fix problems
 when they occur.  We don't need to have as many knobs as Oracle and we
 probably don't want to, and for that matter we probably couldn't if we
 did want to for lack of manpower, but that doesn't mean we should have
 none.

 Still, having more data a user can probe would be nice.

 I wonder why everyone avoids Microsoft's approach to the subject. Apparently, 
 they go in the 'auto-tune as much as possible' direction.
 Tests we did a while ago, which involved asking a team from Microsoft and a
 team from Oracle to optimise the same set of queries over the same data
 (bookies' data, loads of it), showed that the auto-tuning Microsoft has in
 their SQL Server performed much better than a team of over-sweating Oracle DBAs.

I don't think *anyone* is avoiding that approach.  There is almost
universal consensus here that auto-tuning is better than manual
tuning, even to the extent of being unwilling to add knobs to allow
manual tuning of settings we have no idea how to auto-tune and no
plans to auto-tune.

 In my current workplace we have many deployments of the same system,
 on different types of machines, each with customer data that varies
 so much that queries need to be rather generic.
 PostgreSQL shows its strength with the planner doing a good job for different
 variants of data; however, we do very little tweaking of the configuration
 parameters, simply because it is too hard to keep track of all of them.
 I guess the systems could behave much better, but no one is going to
 tweak settings for 50 different installations over 50 different types of data
 and 50 different sets of hardware.
 If there were even a tiny amount of automation provided in PostgreSQL, I
 would welcome it with open arms.

What do you have in mind?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: cross column correlation ...

2011-02-26 Thread Robert Haas
On Sat, Feb 26, 2011 at 1:57 AM, Bruce Momjian br...@momjian.us wrote:
 Robert Haas wrote:
  Actually, we *do* have some idea which tables are hot.  Or at least, we
  could.  Currently, pg_stats for tables are timeless; they just
  accumulate from the last reset, which has always been a problem in
  general for monitoring.  If we could make top-level table and index
  stats time-based, even in some crude way, we would know which tables
  were currently hot.  That would also have the benefit of making server
  performance analysis and autotuning easier.

 I think there would be value in giving the DBA an easier way to see
 which tables are hot, but I am really leery about the idea of trying
 to feed that directly into the query planner.  I think this is one of
 those cases where we let people tune it manually for starters, and
 then wait for feedback.  Eventually someone will say "oh, I never tune
 that by hand any more, ever since I wrote this script which does the
 following computation... and I just run it out of cron."  And then we
 will get out the party hats.  But we will never get the experience we
 need to say what that auto-tuning algorithm will be unless we first
 provide the knob for someone to fiddle with manually.

 It is also possible we will implement a manual way and never get around
 to automating it.   :-(

You make it sound as if we know how but are just too lazy to write the
code.  That is not one of the weaknesses that this community has.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: cross column correlation ...

2011-02-26 Thread PostgreSQL - Hans-Jürgen Schönig
 
 
 Still, having more data a user can probe would be nice.
 
 I wonder why everyone avoids Microsoft's approach to the subject. 
 Apparently, they go in the 'auto-tune as much as possible' direction.
 Tests we did a while ago, which involved asking a team from Microsoft and a
 team from Oracle to optimise the same set of queries over the same data
 (bookies' data, loads of it), showed that the auto-tuning Microsoft has in
 their SQL Server performed much better than a team of over-sweating Oracle DBAs.
 
 I don't think *anyone* is avoiding that approach.  There is almost
 universal consensus here that auto-tuning is better than manual
 tuning, even to the extent of being unwilling to add knobs to allow
 manual tuning of settings we have no idea how to auto-tune and no
 plans to auto-tune.
 
 In my current workplace we have many deployments of the same system,
 on different types of machines, each with customer data that varies
 so much that queries need to be rather generic.
 PostgreSQL shows its strength with the planner doing a good job for different
 variants of data; however, we do very little tweaking of the configuration
 parameters, simply because it is too hard to keep track of all of them.
 I guess the systems could behave much better, but no one is going to
 tweak settings for 50 different installations over 50 different types of data
 and 50 different sets of hardware.
 If there were even a tiny amount of automation provided in PostgreSQL, I
 would welcome it with open arms.
 
 What do you have in mind?
 



what we are trying to do is to explicitly store column correlations. so, a 
histogram for (a, b) correlation and so on.
the planner code then goes through its restrictions in the query and finds the 
best / longest combination it can find and which has some statistics defined.
it seems we can also do this for join selectivity and expressions. the planner 
code for raw column correlation without expression ( cos(id) or so)  and 
joins is there (WIP, no ANALYZE support and so on so far).
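
A rough illustration of the kind of per-column-group statistics object being
discussed, using a hypothetical table and the multi-column CREATE STATISTICS
syntax that later landed in core PostgreSQL (version 10); this is not the
syntax of the patch itself:

    CREATE TABLE addresses (city text, zip text, street text);

    CREATE STATISTICS addresses_city_zip (ndistinct, dependencies)
        ON city, zip FROM addresses;

    ANALYZE addresses;   -- populates the extended-statistics catalogs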

i think auto tuning is a good thing to have and the door to actually do it is 
wide open with our approach.
all it takes is a mechanism to see which conditions are used how often and 
somebody could write a job which automatically tells the system which stats to 
collect / sample.
i think for an average user this is the most simplistic thing then. but, to 
get there we have to get the bloody sampling and the rest of the planner code 
right in the first place.
auto tuning in this area is still something which is far in the future - but at 
least the road to it is clear.

some people suggested some approach dealing with effective_cache_size and so on 
... there are many good approaches here but they don't address the actual 
problem of wrong size-estimates.

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de




Re: [HACKERS] WIP: cross column correlation ...

2011-02-26 Thread Greg Stark
2011/2/26 PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at:
 what we are trying to do is to explicitly store column correlations. so, a 
 histogram for (a, b) correlation and so on.


The problem is that we haven't figured out how to usefully store a
histogram for (a, b). Consider the oft-quoted example of
(city, postal code) -- or (city, zip code) for Americans. A histogram
of the tuple is just the same as a histogram on the city. It doesn't
tell you how much extra selectivity the postal code or zip code gives
you. And if you happen to store a histogram of (postal code, city) by
mistake then it doesn't tell you anything at all.

We need a data structure that lets us answer the Bayesian question
"given a city of New York, how selective is zip-code = 02139?".  I don't
know what that data structure would be.
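
A minimal sketch of that question asked directly of the data, assuming a
hypothetical addresses(city, zip) table; the hard part is a statistics
structure that can answer it without scanning:

    SELECT count(*) FILTER (WHERE zip = '02139')::numeric
           / nullif(count(*), 0) AS selectivity_of_zip_given_city
    FROM addresses
    WHERE city = 'New York';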

Heikki and I had a wacky hand-crafted 2D histogram data structure that
I suspect doesn't actually work. And someone else did some research on
list and came up with a fancy sounding name of a statistics concept
that might be what we want.

-- 
greg



Re: [HACKERS] WIP: cross column correlation ...

2011-02-26 Thread Martijn van Oosterhout
On Sat, Feb 26, 2011 at 06:44:52PM +, Greg Stark wrote:
 2011/2/26 PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at:
  what we are trying to do is to explicitly store column correlations. so, a 
  histogram for (a, b) correlation and so on.
 
 The problem is that we haven't figured out how to usefully store a
 histogram for a,b. Consider the oft-quoted example of a
 city,postal-code  -- or city,zip code for Americans. A histogram
 of the tuple is just the same as a histogram on the city. 

But there are cases where it can work. Frankly the example you mention
is odd because we can't even build useful 1D histograms for city
and zip code, so the fact that 2D is hard is not surprising.

The histograms we do build work fine for < and >, just not equality. The
2D ones will handle the same.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Patriotism is when love of your own people comes first; nationalism,
 when hate for people other than your own comes first. 
   - Charles de Gaulle




Re: [HACKERS] WIP: cross column correlation ...

2011-02-26 Thread Grzegorz Jaskiewicz

On 26 Feb 2011, at 14:45, Robert Haas wrote:

 On Sat, Feb 26, 2011 at 4:33 AM, Grzegorz Jaskiewicz
 
 
 I don't think *anyone* is avoiding that approach.  There is almost
 universal consensus here that auto-tuning is better than manual
 tuning, even to the extent of being unwilling to add knobs to allow
 manual tuning of settings we have no idea how to auto-tune and no
 plans to auto-tune.
 
Perhaps one step further is required: changing some settings so that the system
can be auto-tuned better. There are some even more drastic steps that would have
to be taken, and I believe Microsoft's engineers had to take them: steps
backwards, even. For instance, if there is an issue with the inability to find
out how much of a table is in the cache, perhaps PostgreSQL should have an
option to turn off cached reads/writes completely and thus allow the DBA to
regulate that using the shared_buffers setting. It doesn't sound great, but if
you think about it, I'm sure there are people willing to use it if it adds a bit
more auto-tuning to the server. I would even go a step further and say that I
believe some people will embrace it on the basis that they can constrain the
amount of memory PostgreSQL uses on their server as a whole, and that includes
caches.


 In my current workplace we have many deployments of the same system,
 on different types of machines, each with customer data that varies
 so much that queries need to be rather generic.
 PostgreSQL shows its strength with the planner doing a good job for different
 variants of data; however, we do very little tweaking of the configuration
 parameters, simply because it is too hard to keep track of all of them.
 I guess the systems could behave much better, but no one is going to
 tweak settings for 50 different installations over 50 different types of data
 and 50 different sets of hardware.
 If there were even a tiny amount of automation provided in PostgreSQL, I
 would welcome it with open arms.
 
 What do you have in mind?

All I'm trying to say is that whilst you guys focus mostly on single database
server installations, PostgreSQL also has a great user base that uses it as part
of a product deployed on differently sized machines, with the same data model
but different data variations. We don't sell the product to people and let them
take care of it; rather, we sell the service, you could say. But we also don't
have a DBA per customer who would look solely at the knob-tweaking side of
things. So my argument here is that there isn't always a person who knows the
tables and databases by their characteristics and is thus able to tweak
settings manually.
That is probably just one of many examples where it makes sense, and probably
their common property is that there's no DBA overseeing the whole database and
thus being able to tune it.





Re: [HACKERS] WIP: cross column correlation ...

2011-02-26 Thread Kevin Grittner
 Grzegorz Jaskiewicz  wrote:
 
  I guess the systems could behave much better, but no one is
  going to tweak settings for 50 different installations over 50
  different types of data and 50 different sets of hardware.
  If there were even a tiny amount of automation provided in
  PostgreSQL, I would welcome it with open arms.
 
Hmmm...  Well, we have about 100 pieces of hardware with about 200
databases, and we *do* tune them individually, but it's not as
onerous as it might seem.  For our 72 production circuit court
servers, for example, we have one standard configuration which has as
its last line an include file for overrides.  For some counties that
override file is empty.  For many we override effective_cache_size
based on the RAM installed in the machine.  Since most of these
servers have the database fully cached, the standard file uses
equal, low settings for seq_page_cost and random_page_cost, but we
override that where necessary.  We don't generally tune anything else
differently among these servers.  (Maybe work_mem, I'd have to
check.)
 
Which leads me to think that these might be the key items to
autotune.  It's not actually that hard for me to imagine timing a
small percentage of randomly selected page accesses and developing
costing factors for the page costs on the fly.  It might be a bit
trickier to autotune effective_cache_size, but I can think of two or
three heuristics which might work.  Automatically generating sane
values for these three things would eliminate a significant fraction
of problems posted to the performance list.
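
A minimal sketch of that override pattern expressed as per-database settings
rather than a config include file; the database names and values are
placeholders, not recommendations:

    -- Server whose database is fully cached: cheap, equal page costs.
    ALTER DATABASE county_a SET seq_page_cost = 1.0;
    ALTER DATABASE county_a SET random_page_cost = 1.0;
    ALTER DATABASE county_a SET effective_cache_size = '6GB';

    -- Server whose data mostly lives on disk: keep a higher random-access cost.
    ALTER DATABASE county_b SET random_page_cost = 4.0;
    ALTER DATABASE county_b SET effective_cache_size = '2GB';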
 
-Kevin



Re: [HACKERS] WIP: cross column correlation ...

2011-02-26 Thread Kevin Grittner
Greg Stark  wrote:
 
 Consider the oft-quoted example of a (city, postal code) -- or
 (city, zip code) for Americans.
 
I'm not sure everyone realizes just how complicated this particular
issue is.  If we can do a good job with U.S. city, state, zip code we
will have something which will handle a lot of cases.
 
Consider:
 
(1)  Municipality name isn't unique in the U.S.  Many states besides
Wisconsin have a municipality called Madison (I seem to remember
there were over 20 of them).  So city without state doesn't
necessarily get you anywhere near having a unique zip code or range.
 
(2)  A large city has a set of zip codes, all starting with the same
first three digits.  So identifying the municipality doesn't always
identify the zip code, although for small cities it often does. 
Madison, Wisconsin has thirty-some zip codes, some of which are
rather specialized and don't see much use.
 
(3)  Small municipalities surrounded by or adjacent to a large city
may not get their own zip code.  53704 not only covers a large swath
of the northern end of the City of Madison, but is also the zip code
for the Village of Maple Bluff and at least parts of the Township of
Westport.
 
I guess what I'm saying is that this use case has enough complexity
to make an interesting problem to solve.  It may even be more
challenging than you would want for an initial trial of a technique.
 
-Kevin



Re: [HACKERS] WIP: cross column correlation ...

2011-02-25 Thread Robert Haas
On Fri, Feb 25, 2011 at 1:33 AM, Bruce Momjian br...@momjian.us wrote:
 Josh Berkus wrote:
 On 2/23/11 7:10 AM, Robert Haas wrote:
  IME, most bad query plans are caused by either incorrect
  estimates of selectivity, or wrongheaded notions about what's likely
  to be cached.  If we could find a way, automated or manual, of
  providing the planner some better information about the facts of life
  in those areas, I think we'd be way better off.  I'm open to ideas
  about what the best way to do that is.

 As previously discussed, I'm fine with approaches which involve
 modifying database objects.  These are auditable and centrally manageable
 and aren't devastating to upgrades.

 So things like the proposed CREATE SELECTIVITY would be OK in a way
 that decorating queries would not.

 Similarly, I would love to be able to set cache % on a per-relation
 basis, and override the whole dubious calculation involving
 random_page_cost for scans of that table.

 We should just find a way of checking what percentage of a table is
 already in the shared buffers.  That doesn't help us with the kernel
 cache, but it would be a good start and something that doesn't require
 user tuning.

You're reinventing a wheel that's already been discarded multiple
times.  There are at least four separate problems:

1. The percentage of the table which is cached in shared_buffers at
plan time need not match the percentage that is cached at execution
time.  A delay of even a few seconds between planning and execution
could make the numbers totally different, and plans can be cached for
much longer than that.

2. Because shared_buffers can turn over quite quickly, planning the
statement multiple times in relatively quick succession could give
different results each time.  Previous discussions on this topic have
concluded that DBAs hate plan instability, and hate GEQO because it
causes plan instability, and this would inject plan instability into
the main planner.

3. The percentage of the table which is cached in shared_buffers is
not necessarily representative of the percentage which is cached in
general.  On a large machine, shared_buffers may be less than 10% of
the total cache.  It would be unwise to make guesses about what is and
is not cached based on a small percentage of the cache.

4. Even if we could accurately estimate the percentage of the table
that is cached, what then?  For example, suppose that a user issues a
query which retrieves 1% of a table, and we know that 1% of that table
is cached.  How much of the data that the user asked for is cached?
Hard to say, right?  It could be none of it or all of it.  The second
scenario is easy to imagine - just suppose the query's been executed
twice.  The first scenario isn't hard to imagine either.
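
As an aside, a rough way to see how much of one table currently sits in
shared_buffers, assuming the contrib pg_buffercache extension and a
hypothetical table name (and measuring only shared_buffers, which is exactly
the limitation point 3 describes):

    CREATE EXTENSION IF NOT EXISTS pg_buffercache;

    SELECT c.relname,
           count(b.bufferid) * current_setting('block_size')::bigint
                                   AS bytes_in_shared_buffers,
           pg_relation_size(c.oid) AS relation_bytes
    FROM pg_class c
    LEFT JOIN pg_buffercache b
           ON b.relfilenode = pg_relation_filenode(c.oid::regclass)
          AND b.reldatabase = (SELECT oid FROM pg_database
                               WHERE datname = current_database())
    WHERE c.relname = 'some_table'
    GROUP BY c.relname, c.oid;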

One idea Tom and I kicked around previously is to set an assumed
caching percentage for each table based on its size relative to
effective_cache_size - in other words, assume that the smaller a table
is, the more of it will be cached.  Consider a system with 8GB of RAM,
and a table which is 64kB.  It is probably unwise to make any plan
based on the assumption that that table is less than fully cached.  If
it isn't before the query executes, it soon will be.  Going to any
amount of work elsewhere in the plan to avoid the work of reading that
table in from disk is probably a dumb idea.  Of course, one downside
of this approach is that it doesn't know which tables are hot and
which tables are cold, but it would probably still be an improvement
over the status quo.
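
A crude sketch of that heuristic, just to make its shape concrete; the formula
(effective_cache_size divided by table size, capped at 1.0) is an assumption
for illustration, not anything the planner actually computes:

    SELECT c.relname,
           pg_table_size(c.oid) AS table_bytes,
           round(least(1.0,
                 pg_size_bytes(current_setting('effective_cache_size'))::numeric
                     / greatest(pg_table_size(c.oid), 1)), 2)
               AS assumed_cached_fraction
    FROM pg_class c
    WHERE c.relkind = 'r'
    ORDER BY pg_table_size(c.oid) DESC;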

All that having been said, I think that while Josh is thinking fuzzily
about the mathematics of his proposal, the basic idea is pretty
sensible.  It is not easy - likely not possible - for the system to
have a good idea which things will be in some kind of cache at the
time the query is executed; it could even change mid-query.  The
execution of one part of the query could evict from the cache data
which some other part of the plan assumed would be cached.  But DBAs
frequently have a very good idea of which stuff is in cache - they can
make observations over a period of time and then adjust settings and
then observe some more and adjust some more.

PostgreSQL is extremely easy to administer compared with some of its
competitors, and it's frequently necessary to change very little.  But
there's a difference between what you absolutely have to change to
make it work and what you have the option to change when necessary.
We need to decrease the amount of stuff in the first category (as we
recently did with wal_buffers) and increase the amount of stuff in the
second category.  People coming from Oracle are not favorably
impressed either by the amount of monitoring data PostgreSQL can
gather or by the number of knobs that are available to fix problems
when they occur.  We don't need to have as many knobs as Oracle and we
probably don't want to, and for that matter we probably couldn't if we
did want to for lack of manpower, but that doesn't mean we should have none.

Re: [HACKERS] WIP: cross column correlation ...

2011-02-25 Thread Cédric Villemain
2011/2/25 Robert Haas robertmh...@gmail.com:
 On Fri, Feb 25, 2011 at 1:33 AM, Bruce Momjian br...@momjian.us wrote:
 Josh Berkus wrote:
 On 2/23/11 7:10 AM, Robert Haas wrote:
  IME, most bad query plans are caused by either incorrect
  estimates of selectivity, or wrongheaded notions about what's likely
  to be cached.  If we could find a way, automated or manual, of
  providing the planner some better information about the facts of life
  in those areas, I think we'd be way better off.  I'm open to ideas
  about what the best way to do that is.

 As previously discussed, I'm fine with approaches which involve
 modifying database objects.  These are auditable and centrally manageable
 and aren't devastating to upgrades.

 So things like the proposed CREATE SELECTIVITY would be OK in a way
 that decorating queries would not.

 Similarly, I would love to be able to set cache % on a per-relation
 basis, and override the whole dubious calculation involving
 random_page_cost for scans of that table.

 We should just find a way of checking what percentage of a table is
 already in the shared buffers.  That doesn't help us with the kernel
 cache, but it would be a good start and something that doesn't require
 user tuning.

 You're reinventing a wheel that's already been discarded multiple
 times.  There are at least four separate problems:

 1. The percentage of the table which is cached in shared_buffers at
 plan time need not match the percentage that is cached at execution
 time.  A delay of even a few seconds between planning and execution
 could make the numbers totally different, and plans can be cached for
 much longer than that.

 2. Because shared_buffers can turn over quite quickly, planning the
 statement multiple times in relatively quick succession could give
 different results each time.  Previous discussions on this topic have
 concluded that DBAs hate plan instability, and hate GEQO because it
 causes plan instability, and this would inject plan instability into
 the main planner.

 3. The percentage of the table which is cached in shared_buffers is
 not necessarily representative of the percentage which is cached in
 general.  On a large machine, shared_buffers may be less than 10% of
 the total cache.  It would be unwise to make guesses about what is and
 is not cached based on a small percentage of the cache.

 4. Even if we could accurately estimate the percentage of the table
 that is cached, what then?  For example, suppose that a user issues a
 query which retrieves 1% of a table, and we know that 1% of that table
 is cached.  How much of the data that the user asked for is cached?
 Hard to say, right?  It could be none of it or all of it.  The second
 scenario is easy to imagine - just suppose the query's been executed
 twice.  The first scenario isn't hard to imagine either.

 One idea Tom and I kicked around previously is to set an assumed
 caching percentage for each table based on its size relative to
 effective_cache_size - in other words, assume that the smaller a table
 is, the more of it will be cached.  Consider a system with 8GB of RAM,
 and a table which is 64kB.  It is probably unwise to make any plan
 based on the assumption that that table is less than fully cached.  If
 it isn't before the query executes, it soon will be.  Going to any
 amount of work elsewhere in the plan to avoid the work of reading that
 table in from disk is probably a dumb idea.  Of course, one downside
 of this approach is that it doesn't know which tables are hot and
 which tables are cold, but it would probably still be an improvement
 over the status quo.

Yes, good idea.


 All that having been said, I think that while Josh is thinking fuzzily
 about the mathematics of his proposal, the basic idea is pretty
 sensible.  It is not easy - likely not possible - for the system to
 have a good idea which things will be in some kind of cache at the
 time the query is executed; it could even change mid-query.  The
 execution of one part of the query could evict from the cache data
 which some other part of the plan assumed would be cached.  But DBAs
 frequently have a very good idea of which stuff is in cache - they can
 make observations over a period of time and then adjust settings and
 then observe some more and adjust some more.

I believe we can maintain a small map of the areas of a relation which are
in the OS buffer cache (shared buffers move more), or at least a
percentage of the relation in OS cache. Getting the autovacuum daemon
to update those maps/counters might be enough and easy to
do; it is really close to what auto-analyze does. My observation is that
the percentage in cache is stable on a production workload after some
tens of minutes needed to warm the server.

What should really help here is to have hooks in the cost functions to
test those ideas without the need to patch postgresql-core a lot. Will
it be OK to have hooks, or will it add too much CPU consumption in a
sensitive part of the code?

Re: [HACKERS] WIP: cross column correlation ...

2011-02-25 Thread Rod Taylor
 4. Even if we could accurately estimate the percentage of the table
 that is cached, what then?  For example, suppose that a user issues a
 query which retrieves 1% of a table, and we know that 1% of that table
 is cached.  How much of the data that the user asked for is cached?
 Hard to say, right?  It could be none of it or all of it.  The second
 scenario is easy to imagine - just suppose the query's been executed
 twice.  The first scenario isn't hard to imagine either.


I have a set of slow disks, and whether data sits on them or on the fast disks
can impact performance nearly as much as whether it is cached in memory.

How practical would it be for analyze to keep a record of response times for
given sections of a table as it randomly accesses them and generate some
kind of a map for expected response times for the pieces of data it is
analysing?

It may well discover, on its own, that recent data (1 month old or less)
has a random read response time of N, older data (1 year old) in a different
section of the relation tends to have a response time of 1000N, and really
old data (5 year old) tends to have a response time of 3000N.
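
A hand-run approximation of that idea, assuming track_io_timing is available
and a hypothetical time-based events(created_at) table; the I/O timing lines
in the two plans give the hot-versus-cold response-time contrast described
above:

    SET track_io_timing = on;

    EXPLAIN (ANALYZE, BUFFERS)          -- recent, probably cached section
    SELECT count(*) FROM events
    WHERE created_at >= now() - interval '1 month';

    EXPLAIN (ANALYZE, BUFFERS)          -- old, probably cold section
    SELECT count(*) FROM events
    WHERE created_at <  now() - interval '5 years';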


Re: [HACKERS] WIP: cross column correlation ...

2011-02-25 Thread Robert Haas
2011/2/25 Cédric Villemain cedric.villemain.deb...@gmail.com:
 All that having been said, I think that while Josh is thinking fuzzily
 about the mathematics of his proposal, the basic idea is pretty
 sensible.  It is not easy - likely not possible - for the system to
 have a good idea which things will be in some kind of cache at the
 time the query is executed; it could even change mid-query.  The
 execution of one part of the query could evict from the cache data
 which some other part of the plan assumed would be cached.  But DBAs
 frequently have a very good idea of which stuff is in cache - they can
 make observations over a period of time and then adjust settings and
 then observe some more and adjust some more.

 I believe we can maintain a small map of the areas of a relation which are
 in the OS buffer cache (shared buffers move more), or at least a
 percentage of the relation in OS cache. Getting the autovacuum daemon
 to update those maps/counters might be enough and easy to
 do; it is really close to what auto-analyze does.  My observation is that
 the percentage in cache is stable on a production workload after some
 tens of minutes needed to warm the server.

I don't think we can assume that will be true in all workloads.
Imagine a server doing batch processing.  People submit large batches
of work that take, say, an hour to complete.  Not all batches use the
same set of tables - maybe they even run in different databases.
After a big batch process finishes crunching numbers in database A,
very little of database B will be cached.  But it's not necessarily
right to assume that when we start queries for a new batch in database
B, although it's more likely to be right for large tables (which will
take a long time to get cached meaningfully, if they ever do) than
small ones.  Also, it could lead to strange issues where batches run
much faster or slower depending on which batch immediately preceded
them.  If we're going to do something a lot of times, it'd be better
to bite the bullet and read it all in rather than going to more work
elsewhere, but if we're only going to touch it once, then not so much.

You might also have this issue on systems that run OLTP workloads all
day and then do some batch processing at night to get ready for the
next business day.  Kevin Grittner wrote previously about those jobs
needing some different settings in his environment (I'm not
remembering which settings at the moment).  Suppose that the batch
process is going to issue a query that can be planned in one of two
possible ways.  One way involves reading 10% of a relation, and the
other way involves reading the whole thing.  The first plan takes 200
s to execute if the relation is not cached, and 180 s if the relevant
portion is cached.  The second plan takes 300 s to execute if the
relation is not cached, and 100 s if it is cached.  At the start of
the batch run, the relation won't be cached, because it's used *only*
by the overnight job and not by the daily OLTP traffic.  Which way
should we execute the query?

The answer is that if the batch job only needs to execute that query
*once*, we should do it the first way.  But if it needs to execute it
three or more times, the second way is better, but only if we use the
second plan every time.  If we start out with the first plan, we're
always better off sticking with it *unless* we know that we're going
to repeat the query at least twice more after the iteration we're
currently planning.  To make the right decision, the query planner
needs a crystal ball.  Or, a little help from the DBA.
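
The break-even arithmetic for that example, worked out (first execution
uncached, later executions cached, each plan used consistently):

    SELECT n AS executions,
           200 + 180 * (n - 1) AS plan1_total_seconds,   -- reads 10% of the relation
           300 + 100 * (n - 1) AS plan2_total_seconds    -- reads the whole relation
    FROM generate_series(1, 4) AS n;
    -- n = 1: 200 vs 300;  n = 2: 380 vs 400;  n = 3: 560 vs 500;  n = 4: 740 vs 600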

 What should really help here is to have hooks in the cost functions to
 test those ideas without the need to patch postgresql-core a lot. Will
 it be OK to have hooks, or will it add too much CPU consumption in a
 sensitive part of the code?

Depends on where you put them, I guess.  Hooks are pretty cheap, but
they're also pretty hard to use.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: cross column correlation ...

2011-02-25 Thread Alvaro Herrera
Excerpts from Rod Taylor's message of Fri Feb 25 14:03:58 -0300 2011:

 How practical would it be for analyze to keep a record of response times for
 given sections of a table as it randomly accesses them and generate some
 kind of a map for expected response times for the pieces of data it is
 analysing?

I think what you want is random_page_cost that can be tailored per
tablespace.
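
For reference, a minimal example of that per-tablespace setting (which Robert
notes elsewhere in the thread already exists), with hypothetical tablespace
names and placeholder values:

    ALTER TABLESPACE fast_ssd  SET (random_page_cost = 1.1, seq_page_cost = 1.0);
    ALTER TABLESPACE slow_sata SET (random_page_cost = 6.0);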

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] WIP: cross column correlation ...

2011-02-25 Thread Cédric Villemain
2011/2/25 Robert Haas robertmh...@gmail.com:
 2011/2/25 Cédric Villemain cedric.villemain.deb...@gmail.com:
 All that having been said, I think that while Josh is thinking fuzzily
 about the mathematics of his proposal, the basic idea is pretty
 sensible.  It is not easy - likely not possible - for the system to
 have a good idea which things will be in some kind of cache at the
 time the query is executed; it could even change mid-query.  The
 execution of one part of the query could evict from the cache data
 which some other part of the plan assumed would be cached.  But DBAs
 frequently have a very good idea of which stuff is in cache - they can
 make observations over a period of time and then adjust settings and
 then observe some more and adjust some more.

 I believe we can maintain a small map of the areas of a relation which are
 in the OS buffer cache (shared buffers move more), or at least a
 percentage of the relation in OS cache. Getting the autovacuum daemon
 to update those maps/counters might be enough and easy to
 do; it is really close to what auto-analyze does.  My observation is that
 the percentage in cache is stable on a production workload after some
 tens of minutes needed to warm the server.

 I don't think we can assume that will be true in all workloads.
 Imagine a server doing batch processing.  People submit large batches
 of work that take, say, an hour to complete.  Not all batches use the
 same set of tables - maybe they even run in different databases.
 After a big batch process finishes crunching numbers in database A,
 very little of database B will be cached.  But it's not necessarily
 right to assume that when we start queries for a new batch in database
 B, although it's more likely to be right for large tables (which will
 take a long time to get cached meaningfully, if they ever do) than
 small ones.  Also, it could lead to strange issues where batches run
 much faster or slower depending on which batch immediately preceded
 them.  If we're going to do something a lot of times, it'd be better
 to bite the bullet and read it all in rather than going to more work
 elsewhere, but if we're only going to touch it once, then not so much.

 You might also have this issue on systems that run OLTP workloads all
 day and then do some batch processing at night to get ready for the
 next business day.  Kevin Grittner wrote previously about those jobs
 needing some different settings in his environment (I'm not
 remembering which settings at the moment).  Suppose that the batch
 process is going to issue a query that can be planned in one of two
 possible ways.  One way involves reading 10% of a relation, and the
 other way involves reading the whole thing.  The first plan takes 200
 s to execute if the relation is not cached, and 180 s if the relevant
 portion is cached.  The second plan takes 300 s to execute if the
 relation is not cached, and 100 s if it is cached.  At the start of
 the batch run, the relation won't be cached, because it's used *only*
 by the overnight job and not by the daily OLTP traffic.  Which way
 should we execute the query?

 The answer is that if the batch job only needs to execute that query
 *once*, we should do it the first way.  But if it needs to execute it
 three or more times, the second way is better, but only if we use the
 second plan every time.  If we start out with the first plan, we're
 always better off sticking with it *unless* we know that we're going
 to repeat the query at least twice more after the iteration we're
 currently planning.  To make the right decision, the query planner
 needs a crystal ball.  Or, a little help from the DBA.

Yes, we are talking of improving some part of the model.
Some workloads are dramatic  and need special customization. This is true.

Still there is a path of improvement, and probably it will remain a
path of improvement after the current model is updated.

I am not proposing something that solves all the issues, but it is way
more interesting IMHO than just letting the DBA say: 'this table is in
cache at XX%'.

Btw, pgfincore already solves the use case you raise by helping the
DBA prepare the batch processing, so in some sense I am familiar
with what you describe (take the second plan, let pgfincore preload
the relation in the background, and your query will be done in 100 s
from the first run).


 What should really help here is to have hooks in the cost functions to
 test those ideas without the need to patch postgresql-core a lot. Would
 it be OK to have hooks, or would they add too much CPU consumption in a
 sensitive part of the code?

 Depends on where you put them, I guess.  Hooks are pretty cheap, but
 they're also pretty hard to use.

Yes, it will be easier to make an extension, have people test it,
and validate (or not) the 'new' model.

-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: cross column correlation ...

2011-02-25 Thread Josh Berkus

 4. Even if we could accurately estimate the percentage of the table
 that is cached, what then?  For example, suppose that a user issues a
 query which retrieves 1% of a table, and we know that 1% of that table
 is cached.  How much of the data that the user asked for is cached?

FWIW, for a manual override setting, I was thinking that the % would
convert to a probability.  In that way, it wouldn't be different from
the existing random_page_cost calculation; we're just estimating how
*likely* it is
that the data the user wants is cached.
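
Read that way, the override would just blend the two page costs by the stated
probability - a sketch of the idea only, not actual planner arithmetic
(cached_page_cost stands in for some near-zero per-page cost):

    estimated_page_cost ~= p_cached * cached_page_cost
                           + (1 - p_cached) * random_page_cost

At 100% the table is treated as fully cached; at 0% it degrades to the plain
random_page_cost behaviour we have today.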

 One idea Tom and I kicked around previously is to set an assumed
 caching percentage for each table based on its size relative to
 effective_cache_size - in other words, assume that the smaller a table
 is, the more of it will be cached.  Consider a system with 8GB of RAM,
 and a table which is 64kB.  It is probably unwise to make any plan
 based on the assumption that that table is less than fully cached.  If
 it isn't before the query executes, it soon will be.  Going to any
 amount of work elsewhere in the plan to avoid the work of reading that
 table in from disk is probably a dumb idea.  Of course, one downside
 of this approach is that it doesn't know which tables are hot and
 which tables are cold, but it would probably still be an improvement
 over the status quo.

Actually, we *do* have some idea which tables are hot.  Or at least, we
could.   Currently, pg_stats for tables are timeless; they just
accumulate from the last reset, which has always been a problem in
general for monitoring.  If we could make top-level table and index
stats time-based, even in some crude way, we would know which tables
were currently hot.  That would also have the benefit of making server
performance analysis and autotuning easier.
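
A crude way to approximate that today is to snapshot the cumulative counters
and diff them over an interval, for example (a sketch; the snapshot table name
is made up, the counters come from pg_stat_user_tables):

    -- take a snapshot of the cumulative per-table counters
    CREATE TABLE stats_snapshot AS
        SELECT now() AS snapped_at, relid, relname, seq_tup_read, idx_tup_fetch
        FROM pg_stat_user_tables;

    -- some time later, rank tables by how much the counters moved
    SELECT c.relname,
           c.seq_tup_read - p.seq_tup_read AS seq_tup_read_delta,
           coalesce(c.idx_tup_fetch, 0) - coalesce(p.idx_tup_fetch, 0) AS idx_tup_fetch_delta
    FROM pg_stat_user_tables c
    JOIN stats_snapshot p USING (relid)
    ORDER BY (c.seq_tup_read - p.seq_tup_read)
           + (coalesce(c.idx_tup_fetch, 0) - coalesce(p.idx_tup_fetch, 0)) DESC
    LIMIT 10;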

 But DBAs
 frequently have a very good idea of which stuff is in cache - they can
 make observations over a period of time and then adjust settings and
 then observe some more and adjust some more.

Agreed.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-25 Thread Robert Haas
On Fri, Feb 25, 2011 at 2:26 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Rod Taylor's message of vie feb 25 14:03:58 -0300 2011:

 How practical would it be for analyze to keep a record of response times for
 given sections of a table as it randomly accesses them and generate some
 kind of a map for expected response times for the pieces of data it is
 analysing?

 I think what you want is random_page_cost that can be tailored per
 tablespace.

We have that.
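
For reference, that per-tablespace knob looks like this (tablespace names are
made up):

    ALTER TABLESPACE fast_ssd  SET (random_page_cost = 1.5, seq_page_cost = 1.0);
    ALTER TABLESPACE slow_sata SET (random_page_cost = 6.0);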

But it's not the same as tracking *sections of a table*.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-25 Thread Robert Haas
On Fri, Feb 25, 2011 at 6:41 PM, Josh Berkus j...@agliodbs.com wrote:
 One idea Tom and I kicked around previously is to set an assumed
 caching percentage for each table based on its size relative to
 effective_cache_size - in other words, assume that the smaller a table
 is, the more of it will be cached.  Consider a system with 8GB of RAM,
 and a table which is 64kB.  It is probably unwise to make any plan
 based on the assumption that that table is less than fully cached.  If
 it isn't before the query executes, it soon will be.  Going to any
 amount of work elsewhere in the plan to avoid the work of reading that
 table in from disk is probably a dumb idea.  Of course, one downside
 of this approach is that it doesn't know which tables are hot and
 which tables are cold, but it would probably still be an improvement
 over the status quo.

 Actually, we *do* have some idea which tables are hot.  Or at least, we
 could.   Currently, pg_stats for tables are timeless; they just
 accumulate from the last reset, which has always been a problem in
 general for monitoring.  If we could make top-level table and index
 stats time-based, even in some crude way, we would know which tables
 were currently hot.  That would also have the benefit of making server
 performance analysis and autotuning easier.

I think there would be value in giving the DBA an easier way to see
which tables are hot, but I am really leery about the idea of trying
to feed that directly into the query planner.  I think this is one of
those cases where we let people tune it manually for starters, and
then wait for feedback.  Eventually someone will say oh, I never tune
that by hand any more, ever since I wrote this script which does the
following computation... and I just run it out of cron.  And then we
will get out the party hats.  But we will never get the experience we
need to say what that auto-tuning algorithm will be unless we first
provide the knob for someone to fiddle with manually.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-25 Thread Bruce Momjian
Robert Haas wrote:
  Actually, we *do* have some idea which tables are hot.  Or at least, we
  could.  Currently, pg_stats for tables are timeless; they just
  accumulate from the last reset, which has always been a problem in
  general for monitoring.  If we could make top-level table and index
  stats time-based, even in some crude way, we would know which tables
  were currently hot.  That would also have the benefit of making server
  performance analysis and autotuning easier.
 
 I think there would be value in giving the DBA an easier way to see
 which tables are hot, but I am really leery about the idea of trying
 to feed that directly into the query planner.  I think this is one of
 those cases where we let people tune it manually for starters, and
 then wait for feedback.  Eventually someone will say oh, I never tune
 that by hand any more, ever since I wrote this script which does the
 following computation... and I just run it out of cron.  And then we
 will get out the party hats.  But we will never get the experience we
 need to say what that auto-tuning algorithm will be unless we first
 provide the knob for someone to fiddle with manually.

It is also possible we will implement a manual way and never get around
to automating it.   :-(

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-24 Thread Bruce Momjian
Robert Haas wrote:
 On Wed, Feb 23, 2011 at 10:30 PM, Bruce Momjian br...@momjian.us wrote:
  Robert Haas wrote:
  If you want to take the above as in any way an exhaustive survey of
  the landscape (which it isn't), C seems like a standout, maybe
  augmented by making the planner able to notice that A1 = x1 AND A2
  = x2 is equivalent to (A1,A2) = (x1, x2) so you don't have to rewrite
  queries as much.
 
  I don't really know how to handle the join selectivity problem.  I am
  not convinced that there is a better solution to that than decorating
  the query.  After all the join selectivity depends not only on the
  join clause itself, but also on what you've filtered out of each table
  in the meantime.
 
  Thinking some more, I think another downside to the decorate the query
  idea is that many queries use constants that are supplied only at
  runtime, so there would be no way to hard-code a selectivity value into
  a query when you don't know the value.  Could a selectivity function
  handle that?
 
 Beats me.  What do you have in mind?

My point is just that many queries have constants whose values are not
known at the time the query is written, so any system should have a way
to handle that somehow.  This is why query decoration is usually not a
good solution, and why something more flexible that is stored as part of
the column is preferred.

Perhaps a selectivity function that has easy access to the computed
selectivity of the constant involved might be a win.  For example, for
the zip code/state code case we could have something like:

function mysel(zip, state) { pgsel(zip)}

meaning we would still use the selectivities found in the optimizer
statistics (pgsel), but modify them in some way.  In the case above, the
selectivity only comes from the zip code.  You could also do things like:

function mysel(x, y) { pgsel(x) * pgsel(y) * 0.001}

Such functions have a higher probability of working for all queries
involving that column.
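
There is already a narrow hook in this spirit: an operator can carry its own
restriction (and join) selectivity estimator, though it has to be written in C
against the internal APIs.  A sketch, where my_zip_state_sel is a hypothetical
C function with the standard (internal, oid, internal, integer) -> float8
signature:

    CREATE OPERATOR === (
        LEFTARG   = text,
        RIGHTARG  = text,
        PROCEDURE = texteq,           -- reuse the built-in text equality
        RESTRICT  = my_zip_state_sel  -- hypothetical custom estimator
    );

The limitation is the one already discussed: the estimator still only sees one
clause at a time.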

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-24 Thread Josh Berkus
On 2/23/11 7:10 AM, Robert Haas wrote:
 IME, most bad query plans are caused by either incorrect
 estimates of selectivity, or wrongheaded notions about what's likely
 to be cached.  If we could find a way, automated or manual, of
 providing the planner some better information about the facts of life
 in those areas, I think we'd be way better off.  I'm open to ideas
 about what the best way to do that is.

As previously discussed, I'm fine with approaches which involve
modifying database objects.  These are auditable and centrally manageable
and aren't devastating to upgrades.

So things like the proposed CREATE SELECTIVITY would be OK in a way
that decorating queries would not.

Similarly, I would love to be able to set cache % on a per-relation
basis, and override the whole dubious calculation involving
random_page_cost for scans of that table.

The great thing about object decorations is that we could then collect
data on which ones worked and which didn't through the performance list
and then use those to improve the query planner.  I doubt such would
work with query decorations.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-24 Thread Bruce Momjian
Josh Berkus wrote:
 On 2/23/11 7:10 AM, Robert Haas wrote:
  IME, most bad query plans are caused by either incorrect
  estimates of selectivity, or wrongheaded notions about what's likely
  to be cached.  If we could find a way, automated or manual, of
  providing the planner some better information about the facts of life
  in those areas, I think we'd be way better off.  I'm open to ideas
  about what the best way to do that is.
 
 As previously discussed, I'm fine with approaches which involve
 modifying database objects.  These are auditable and centrally manageable
 and aren't devastating to upgrades.
 
 So things like the proposed CREATE SELECTIVITY would be OK in a way
 that decorating queries would not.
 
 Similarly, I would love to be able to set cache % on a per-relation
 basis, and override the whole dubious calculation involving
 random_page_cost for scans of that table.

We should just find a way of checking what percentage of a table is
already in the shared buffers.  That doesn't help us with the kernel
cache, but it would be a good start and something that doesn't require
user tuning.
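
The pg_buffercache contrib module can already answer that for shared buffers;
the usual query is roughly this (a sketch, assuming the module is installed in
the current database):

    SELECT c.relname,
           count(*) AS buffers,
           round(100.0 * count(*) * current_setting('block_size')::int
                 / greatest(pg_relation_size(c.oid), 1), 1) AS pct_in_shared_buffers
    FROM pg_buffercache b
    JOIN pg_class c
      ON b.relfilenode = pg_relation_filenode(c.oid)
     AND b.reldatabase = (SELECT oid FROM pg_database
                          WHERE datname = current_database())
    GROUP BY c.relname, c.oid
    ORDER BY buffers DESC
    LIMIT 20;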

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-23 Thread PostgreSQL - Hans-Jürgen Schönig

On Feb 23, 2011, at 2:58 AM, Robert Haas wrote:

 2011/2/22 PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at:
 how does it work? we try to find suitable statistics for an arbitrary length 
 list of conditions so that the planner can use it directly rather than 
 multiplying all the selectivities. this should make estimates a lot more 
 precise.
 the current approach can be extended to work with expressions as well as 
 straight conditions.
 
 /me prepares to go down in flames.
 
 Personally, I think the first thing we ought to do is add a real, bona
 fide planner hint to override the selectivity calculation manually,
 maybe something like this:
 
 WHERE (x > 5 AND y = 1) SELECTIVITY (0.1);


i thought there was an agreement that we don't want planner hints?
as tom pointed out - many broken queries come out of some query generator where 
even the design to make the design is broken by design.
personally i like query generators as long as other people use them ... telling 
people that this is the wrong way to go is actually financing my holiday next 
week ... ;).  in general - hibernate and stuff like that is a no-go.

personally i like the type of planner hints oleg and teodor came up with - i 
think we should do more of those hooks they are using but hiding it in some 
syntax is not a good idea.
it does not change the query and it still gives a lot of room to toy around. it 
looks like a compromise.

however, oleg's contrib module does not fix the core problem of cross column 
statistics because a hint is usually static but you want flexible selectivity.

regards,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-23 Thread PostgreSQL - Hans-Jürgen Schönig
 Those are real problems, but I still want it.  The last time I hit
 this problem I spent two days redesigning my schema and adding
 triggers all over the place to make things work.  If I had been
 dealing with a 30TB database instead of a 300MB database I would have
 been royally up a creek.
 
 To put that another way, it's true that some people can't adjust their
 queries, but also some people can.  It's true that nonstandard stuff
 sucks, but queries that don't work suck, too.  And as for better
 solutions, how many major release cycles do we expect people to wait
 for them?  Even one major release cycle is an eternity when you're
 trying to get the application working before your company runs out of
 money, and this particular problem has had a lot of cycles expended on
 it without producing anything very tangible (proposed patch, which
 like you I can't spare a lot of cycles to look at just now, possibly
 excepted).



cheapest and easiest solution if you run into this: add fake functions which 
the planner cannot estimate properly.
use OR to artificially prop up estimates or use AND to artificially lower them. 
there is actually no need to redesign the schema to get around it but it is 
such an ugly solution that it does not even deserve to be called ugly ...
however, fast and reliable way to get around it.
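
For the archives, the trick looks roughly like this (table and function names
are made up; the point is only that the planner falls back to a default
selectivity for a clause it cannot see through):

    -- plpgsql functions are not inlined, so the planner has no statistics here
    CREATE FUNCTION opaque_int(i int) RETURNS int
        LANGUAGE plpgsql IMMUTABLE AS $$ BEGIN RETURN i; END; $$;

    -- ANDing in an opaque clause multiplies in a default selectivity: estimate drops
    EXPLAIN SELECT * FROM t WHERE x > 5 AND opaque_int(x) > 5;

    -- ORing one in adds it instead: estimate rises
    EXPLAIN SELECT * FROM t WHERE x > 5 OR opaque_int(x) > 5;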

regards,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-23 Thread Robert Haas
On Wed, Feb 23, 2011 at 12:50 AM, Nathan Boley npbo...@gmail.com wrote:
 Personally, I think the first thing we ought to do is add a real, bona
 fide planner hint to override the selectivity calculation manually,
 maybe something like this:

 WHERE (x > 5 AND y = 1) SELECTIVITY (0.1);

 If you're going to go that far, why not just collect statistics on
 that specific predicate?

 ie,  ANALYZE SELECTIVITY ON tablename (x, y) WHERE (x > 5 AND y = 1);

 Then it won't fall subject to all of the pitfalls that Tom outlines below.

 Selectivities are easy to estimate if we know the predicate. They only
 become hard when they have to work for every possible predicate.

Fair point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-23 Thread Robert Haas
2011/2/23 PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at:
 Those are real problems, but I still want it.  The last time I hit
 this problem I spent two days redesigning my schema and adding
 triggers all over the place to make things work.  If I had been
 dealing with a 30TB database instead of a 300MB database I would have
 been royally up a creek.

 To put that another way, it's true that some people can't adjust their
 queries, but also some people can.  It's true that nonstandard stuff
 sucks, but queries that don't work suck, too.  And as for better
 solutions, how many major release cycles do we expect people to wait
 for them?  Even one major release cycle is an eternity when you're
 trying to get the application working before your company runs out of
 money, and this particular problem has had a lot of cycles expended on
 it without producing anything very tangible (proposed patch, which
 like you I can't spare a lot of cycles to look at just now, possibly
 excepted).

 cheapest and easiest solution if you run into this: add fake functions 
 which the planner cannot estimate properly.
 use OR to artificially prop up estimates or use AND to artificially lower 
 them. there is actually no need to redesign the schema to get around it but 
 it is such an ugly solution that it does not even deserve to be called ugly 
 ...
 however, fast and reliable way to get around it.

We couldn't possibly design a hint mechanism that would be uglier or
less future-proof than this workaround (which, by the way, I'll keep
in mind for the next time I get bitten by this).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-23 Thread PostgreSQL - Hans-Jürgen Schönig
 
 
 cheapest and easiest solution if you run into this: add fake functions 
 which the planner cannot estimate properly.
 use OR to artificially prop up estimates or use AND to artificially lower 
 them. there is actually no need to redesign the schema to get around it but 
 it is such an ugly solution that it does not even deserve to be called 
 ugly ...
 however, fast and reliable way to get around it.
 
 We couldn't possibly design a hint mechanism that would be uglier or
 less future-proof than this workaround (which, by the way, I'll keep
 in mind for the next time I get bitten by this).
 
 -- 
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company
 


i think the main issue is: what we do is ugly because of despair and a lack of 
alternative ... what you proposed is ugly by design ;).
overall: the workaround will win the ugliness contest, however ;).

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-23 Thread PostgreSQL - Hans-Jürgen Schönig

On Feb 23, 2011, at 3:46 PM, Robert Haas wrote:

 On Wed, Feb 23, 2011 at 12:50 AM, Nathan Boley npbo...@gmail.com wrote:
 Personally, I think the first thing we ought to do is add a real, bona
 fide planner hint to override the selectivity calculation manually,
 maybe something like this:
 
 WHERE (x > 5 AND y = 1) SELECTIVITY (0.1);
 
 If you're going to go that far, why not just collect statistics on
 that specific predicate?
 
 ie,  ANALYZE SELECTIVITY ON tablename (x, y) WHERE (x > 5 AND y = 1);
 
 Then it won't fall subject to all of the pitfalls that Tom outlines below.
 
 Selectivities are easy to estimate if we know the predicate. They only
 become hard when they have to work for every possible predicate.
 
 Fair point.
 
 -- 
 Robert Haas


basically we got the idea of allowing expressions in cross column stuff. i 
think this can be very useful. it would fix the problem of a query like that:

SELECT * FROM table WHERE cos(field) = some_number;

the planner estimates this as a constant fraction of the table, which is
usually plain wrong as well (and the error tends to multiply inside the plan).
i am just not sure if i have understood all corner cases of that already.
ultimate goal: get it right for join estimates (this is why a syntax extension 
is definitely needed - you cannot track all of them automatically).
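
For that specific example there is at least a manual workaround already: an
expression index makes ANALYZE collect statistics on the expression, and the
planner will use them for matching clauses.  A sketch, assuming a table t with
a double precision column field:

    CREATE INDEX t_cos_idx ON t (cos(field));
    ANALYZE t;
    -- the estimate now comes from real MCVs/histogram instead of a flat default
    EXPLAIN SELECT * FROM t WHERE cos(field) = 0.5;

It does nothing for the join estimates, of course, which is where a syntax
extension would come in.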

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-23 Thread Robert Haas
2011/2/23 PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at:
 i thought there was an agreement that we don't want planner hints?

Well, I want them.  I think some other people do, too.  Whether those
people are more numerous than the people who don't want them, and
how much that matters either way, is another question.  I don't want
to have to use them very often, but I like to have an out when I get
desperate.

 as tom pointed out - many broken queries come out of some query generator 
 where even the design to make the design is broken by design.
 personally i like query generators as long as other people use them ... 
 telling people that this is the wrong way to go is actually financing my 
 holiday next week ... ;).  in general - hibernate and stuff like that is a 
 no-go.

 personally i like the type of planner hints oleg and teodor came up with - i 
 think we should do more of those hooks they are using but hiding it in some 
 syntax is not a good idea.
 it does not change the query and it still gives a lot of room to toy around. 
 it looks like a compromise.

 however, oleg's contrib module does not fix the core problem of cross column 
 statistics because a hint is usually static but you want flexible selectivity.

IIRC, what Teodor and Oleg did was a contrib module that excluded a
certain index from consideration based on a GUC.  That to me is a
little more hacky than just wiring the selectivity estimate.  You're
going to need to set that just before each query that needs it, and
reset it afterwards, so it's actually worse than just decorating the
queries, IMHO.  Also, I haven't run into any actual problems in the
field that would be solved by this approach, though I am sure others
have.  IME, most bad query plans are caused by either incorrect
estimates of selectivity, or wrongheaded notions about what's likely
to be cached.  If we could find a way, automated or manual, of
providing the planner some better information about the facts of life
in those areas, I think we'd be way better off.  I'm open to ideas
about what the best way to do that is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-23 Thread Bruce Momjian
PostgreSQL - Hans-Jürgen Schönig wrote:
  Those are real problems, but I still want it.  The last time I hit
  this problem I spent two days redesigning my schema and adding
  triggers all over the place to make things work.  If I had been
  dealing with a 30TB database instead of a 300MB database I would have
  been royally up a creek.
 
  To put that another way, it's true that some people can't adjust their
  queries, but also some people can.  It's true that nonstandard stuff
  sucks, but queries that don't work suck, too.  And as for better
  solutions, how many major release cycles do we expect people to wait
  for them?  Even one major release cycle is an eternity when you're
  trying to get the application working before your company runs out of
  money, and this particular problem has had a lot of cycles expended on
  it without producing anything very tangible (proposed patch, which
  like you I can't spare a lot of cycles to look at just now, possibly
  excepted).
 
 
 
 cheapest and easiest solution if you run into this: add fake functions
 which the planner cannot estimate properly.  use OR to artificially
 prop up estimates or use AND to artificially lower them. there is
 actually no need to redesign the schema to get around it but it is such
 an ugly solution that it does not even deserve to be called ugly ...
 however, fast and reliable way to get around it.

I agree that is super-ugly and we do need to address the cross-column
statistics better.  I personally like the 2-D histogram idea:

http://archives.postgresql.org/pgsql-hackers/2010-12/msg00913.php

--
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-23 Thread Bruce Momjian
Robert Haas wrote:
 2011/2/23 PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at:
  i thought there was an agreement that we don't want planner hints?
 
 Well, I want them.  I think some other people do, too.  Whether those
  people are more numerous than the people who don't want them, and
 how much that matters either way, is another question.  I don't want
 to have to use them very often, but I like to have an out when I get
 desperate.
 
  as tom pointed out - many broken queries come out of some query generator 
  where even the design to make the design is broken by design.
  personally i like query generators as long as other people use them ... 
  telling people that this is the wrong way to go is actually financing my 
  holiday next week ... ;).  in general - hibernate and stuff like that is a 
  no-go.
 
  personally i like the type of planner hints oleg and teodor came up with - 
  i think we should do more of those hooks they are using but hiding it in 
  some syntax is not a good idea.
  it does not change the query and it still gives a lot of room to toy 
  around. it looks like a compromise.
 
  however, oleg's contrib module does not fix the core problem of cross 
  column statistics because a hint is usually static but you want flexible 
  selectivity.
 
 IIRC, what Teodor and Oleg did was a contrib module that excluded a
 certain index from consideration based on a GUC.  That to me is a
 little more hacky than just wiring the selectivity estimate.  You're
 going to need to set that just before each query that needs it, and
 reset it afterwards, so it's actually worse than just decorating the
 queries, IMHO.  Also, I haven't run into any actual problems in the
 field that would be solved by this approach, though I am sure others
 have.  IME, most bad query plans are caused by either incorrect
 estimates of selectivity, or wrongheaded notions about what's likely
 to be cached.  If we could find a way, automated or manual, of
 providing the planner some better information about the facts of life
 in those areas, I think we'd be way better off.  I'm open to ideas
 about what the best way to do that is.

For me the key is finding a way to get that information to the planner
so all queries can benefit, not just the queries we decorate.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-23 Thread Josh Berkus

 Personally, I think the first thing we ought to do is add a real, bona
 fide planner hint to override the selectivity calculation manually,
 maybe something like this:
 
 WHERE (x > 5 AND y = 1) SELECTIVITY (0.1);
 
 Then, having provided a method for the DBA to extinguish the raging
 flames of searing agony which are consuming them while a crocodile
 chews off their leg and their boss asks them why they didn't use
 Oracle, we can continue bikeshedding about the best way of fixing this
 problem in a more user-transparent fashion.

Is there some way we can do that without adding the selectivity hint to
the query itself?  That's the biggest issue with hints.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-23 Thread Robert Haas
On Wed, Feb 23, 2011 at 8:09 PM, Josh Berkus j...@agliodbs.com wrote:
 Personally, I think the first thing we ought to do is add a real, bona
 fide planner hint to override the selectivity calculation manually,
 maybe something like this:

 WHERE (x > 5 AND y = 1) SELECTIVITY (0.1);

 Then, having provided a method for the DBA to extinguish the raging
 flames of searing agony which are consuming them while a crocodile
 chews off their leg and their boss asks them why they didn't use
 Oracle, we can continue bikeshedding about the best way of fixing this
 problem in a more user-transparent fashion.

 Is there some way we can do that without adding the selectivity hint to
 the query itself?  That's the biggest issue with hints.

I've been mulling this issue over a bit more - Nathan Boley raised a
similar point upthread.  I think it's useful to consider some concrete
cases which can occur.

1. Default estimate.  The planner tends to estimate that the
selectivity of something = something is 0.005, and that the
selectivity of something != something is 0.995, when it doesn't
know any better.  This estimate often sucks.  Sometimes it sucks
because it's too high, other times because it's too low, and of course
sometimes it is close enough for government work.

2. One special customer.  Suppose we have a database that contains
lots and lots of people and associates different attributes to those
people, including customer_id.  We put all of our employees in the
table too, and assign them customer_id = 1, since the record with
customer.id = 1 represents us.  I've built this kind of system for
several different employers over the years.  Turns out, the subset of
the person table with customer_id = 1 looks very different, in terms
of the MCVs on the remaining columns and the distribution of the
values otherwise, than the records with customer_id != 1.  I'm sure
this problem comes up in different forms in other domains; this is
just where I've seen it the most.

3. The mostly-redundant condition.  Something like creation_date >
'some timestamp' AND active.  Turns out, most of the not active stuff
is also... old.  A variant of this is creation_date > 'some timestamp'
AND customer_id = 1, which overlaps #2.  For extra fun the creation
date and customer_id may be in different tables, with some
intermediate join muddying the waters.

4. The condition that's redundant except when it isn't.  The classic
example here is WHERE zipcode = constant AND state = constant.
Most of the time, the selectivity of the two clauses together is much
higher than the product of their individually selectivities; you might
as well ignore the second part altogether.  But if some numbskull user
enters a state that doesn't match the zipcode, then suddenly it
matters a lot - the selectivity drops to zero when the second part is
added.

5. The bitfield.  Conditions like (x & 64) != 0.  I know disk is
cheap, but people keep doing this.

There are probably some others I'm missing, too.  That's just off the
top of my head.  Now here are some possible approaches to fixing it:

A. Decorate the query.  This would often be useful for case #1, and
some instances of #3 and #5.  It's useless for #2 and #4.

B. Specify a particular predicate and the selectivity thereof.  Like,
whenever you see (x & 64) = 0, assume the selectivity is 0.5.  Upon
reflection, this seems pretty terrible in every respect.  Unless you
only ever issue an extremely limited range of queries, you're going to
be hardwiring a lot of selectivities.  I think this really only
handles case #5 well, and maybe some instances of case #1.

C. Specify an expression and gather statistics on it as if it were a
column: i.e. ALTER TABLE tab ADD VIRTUAL STATISTICS COLUMN x & 64.
This is pretty good.  It is pretty much ideal for #2 and also handles
#5 and some cases of #3 and #1 well.  You could even make it handle
some instances of #4 if you made the virtual column ROW(state,
zipcode) and rewrote the query as a row comparison.

D. N x N implicativeness matrix.  Record for each pair of attributes
the extent to which a given value for A implies a value for B, and
derate the selectivity multipliers based on this information.  This is
an idea of Heikki's.  It seemed good to me when he proposed it, and I
think he proposed it in regards to #4, but I'm not sure we really ever
figured out how to make it work.

E. Given a set of columns (A1, .., An), collect MCVs and make a
histogram for ROW(A1, ..., An), and then use it to handle cases like
#4.  This is similar to C and is intended to handle the zipcode
problem, but it's not as flexible (because you are only specifying
columns, not expressions).  However, it's intended to work without
rewriting the state/zipcode comparisons as a rowcompare.

If you want to take the above as in any way an exhaustive survey of
the landscape (which it isn't), C seems like a standout, maybe
augmented by making the planner able to notice that A1 = x1 AND A2
= x2 is equivalent to (A1,A2) = (x1, x2) so you don't have to rewrite
queries as much.
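
As a point of reference for option C, an expression index already behaves like
a poor man's virtual statistics column for the single-table cases; case #5, for
instance (hypothetical table t with an integer column x):

    -- ANALYZE will now keep MCVs for x & 64 (only the values 0 and 64 occur),
    -- so (x & 64) != 0 stops getting a flat default estimate
    CREATE INDEX t_flag_idx ON t ((x & 64));
    ANALYZE t;
    EXPLAIN SELECT * FROM t WHERE (x & 64) != 0;

The cost is maintaining an index nobody needs for lookups, which is roughly
what a dedicated syntax would avoid.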

Re: [HACKERS] WIP: cross column correlation ...

2011-02-23 Thread Bruce Momjian
Robert Haas wrote:
 If you want to take the above as in any way an exhaustive survey of
 the landscape (which it isn't), C seems like a standout, maybe
 augmented by making the planner able to notice that A1 = x1 AND A2
 = x2 is equivalent to (A1,A2) = (x1, x2) so you don't have to rewrite
 queries as much.
 
 I don't really know how to handle the join selectivity problem.  I am
 not convinced that there is a better solution to that than decorating
 the query.  After all the join selectivity depends not only on the
 join clause itself, but also on what you've filtered out of each table
 in the meantime.

Thinking some more, I think another downside to the decorate the query
idea is that many queries use constants that are supplied only at
runtime, so there would be no way to hard-code a selectivity value into
a query when you don't know the value.  Could a selectivity function
handle that?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-23 Thread Robert Haas
On Wed, Feb 23, 2011 at 10:30 PM, Bruce Momjian br...@momjian.us wrote:
 Robert Haas wrote:
 If you want to take the above as in any way an exhaustive survey of
 the landscape (which it isn't), C seems like a standout, maybe
 augmented by the making the planner able to notice that A1 = x1 AND A2
 = x2 is equivalent to (A1,A2) = (x1, x2) so you don't have to rewrite
 queries as much.

 I don't really know how to handle the join selectivity problem.  I am
 not convinced that there is a better solution to that than decorating
 the query.  After all the join selectivity depends not only on the
 join clause itself, but also on what you've filtered out of each table
 in the meantime.

 Thinking some more, I think another downside to the decorate the query
 idea is that many queries use constants that are supplied only at
 runtime, so there would be no way to hard-code a selectivity value into
 a query when you don't know the value.  Could a selectivity function
 handle that?

Beats me.  What do you have in mind?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-23 Thread PostgreSQL - Hans-Jürgen Schönig
On Feb 24, 2011, at 2:09 AM, Josh Berkus wrote:

 
 Personally, I think the first thing we ought to do is add a real, bona
 fide planner hint to override the selectivity calculation manually,
 maybe something like this:
 
 WHERE (x > 5 AND y = 1) SELECTIVITY (0.1);
 
 Then, having provided a method for the DBA to extinguish the raging
 flames of searing agony which are consuming them while a crocodile
 chews off their leg and their boss asks them why they didn't use
 Oracle, we can continue bikeshedding about the best way of fixing this
 problem in a more user-transparent fashion.
 
 Is there some way we can do that without adding the selectivity hint to
 the query itself?  That's the biggest issue with hints.
 



well, you could hide this hint in a system table - say, instead of decorating
the query you could store the decoration in some system relation ... but, if
you get it right, you call this decoration a histogram ;).
i think the patch with a multi-dim histogram is good (i have seen something 
similar for PostGIS).
what is still needed in our patch is a.) multi-dim sampling (no idea how to get 
it right) and b.) investigating how to deal with joins and expressions (e.g. 
cos(id) ).
hints in the right direction are highly welcome.

many thanks,

hans


--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-22 Thread Robert Haas
2011/2/22 PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at:
 how does it work? we try to find suitable statistics for an arbitrary length 
 list of conditions so that the planner can use it directly rather than 
 multiplying all the selectivities. this should make estimates a lot more 
 precise.
 the current approach can be extended to work with expressions as well as 
 straight conditions.

/me prepares to go down in flames.

Personally, I think the first thing we ought to do is add a real, bona
fide planner hint to override the selectivity calculation manually,
maybe something like this:

WHERE (x > 5 AND y = 1) SELECTIVITY (0.1);

Then, having provided a method for the DBA to extinguish the raging
flames of searing agony which are consuming them while a crocodile
chews off their leg and their boss asks them why they didn't use
Oracle, we can continue bikeshedding about the best way of fixing this
problem in a more user-transparent fashion.

As to the approach you've proposed here, I'm not sure I understand
what this is actually doing.  Selectivity estimates aren't made
directly for predicates; they're derived from the MCV and histogram
information for the columns the predicates reference.
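
Those per-column inputs are what you see in the pg_stats view, e.g. (substitute
a real table and column):

    SELECT attname, n_distinct, most_common_vals, most_common_freqs,
           histogram_bounds, correlation
    FROM pg_stats
    WHERE schemaname = 'public' AND tablename = 'some_table';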

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-22 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 /me prepares to go down in flames.

 Personally, I think the first thing we ought to do is add a real, bona
 fide planner hint to override the selectivity calculation manually,
 maybe something like this:

 WHERE (x > 5 AND y = 1) SELECTIVITY (0.1);

One of the criteria we've always had for a suitable hint-or-whatever-
you-call-it design is that it *not* involve decorating the queries.
There are a number of reasons for that, some of the killer ones being

(1) People frequently *can't* adjust their queries that way, because
they're coming out of some broken query generator or other.  (Crappy
query generators are of course one of the prime reasons for
poor-performing queries in the first place, so you can't write this off
as not being a key use case.)

(2) Anything we do like that, we'd be locked into supporting forever,
even after we think of better solutions.

(3) People don't like decorating their queries with nonstandard stuff;
it smells of vendor lock-in.  Especially if it's actually SQL syntax
and not comments.  Once you put something into the DML it's just too
hard to fix applications to get rid of it (the inverse case of point
#1).

I haven't looked at Hans' patch in any detail, and don't intend to
do so while the CF is still running; but at least he got this point
right.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-22 Thread Robert Haas
On Tue, Feb 22, 2011 at 9:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 /me prepares to go down in flames.

 Personally, I think the first thing we ought to do is add a real, bona
 fide planner hint to override the selectivity calculation manually,
 maybe something like this:

 WHERE (x > 5 AND y = 1) SELECTIVITY (0.1);

 One of the criteria we've always had for a suitable hint-or-whatever-
 you-call-it design is that it *not* involve decorating the queries.
 There are a number of reasons for that, some of the killer ones being

 (1) People frequently *can't* adjust their queries that way, because
 they're coming out of some broken query generator or other.  (Crappy
 query generators are of course one of the prime reasons for
 poor-performing queries in the first place, so you can't write this off
 as not being a key use case.)

 (2) Anything we do like that, we'd be locked into supporting forever,
 even after we think of better solutions.

 (3) People don't like decorating their queries with nonstandard stuff;
 it smells of vendor lock-in.  Especially if it's actually SQL syntax
 and not comments.  Once you put something into the DML it's just too
 hard to fix applications to get rid of it (the inverse case of point
 #1).

Those are real problems, but I still want it.  The last time I hit
this problem I spent two days redesigning my schema and adding
triggers all over the place to make things work.  If I had been
dealing with a 30TB database instead of a 300MB database I would have
been royally up a creek.

To put that another way, it's true that some people can't adjust their
queries, but also some people can.  It's true that nonstandard stuff
sucks, but queries that don't work suck, too.  And as for better
solutions, how many major release cycles do we expect people to wait
for them?  Even one major release cycle is an eternity when you're
trying to get the application working before your company runs out of
money, and this particular problem has had a lot of cycles expended on
it without producing anything very tangible (proposed patch, which
like you I can't spare a lot of cycles to look at just now, possibly
excepted).

I agree that if we can get something that actually works that doesn't
involve decorating the queries, that is better.  But I would surely
rather decorate the queries than rewrite the entire application around
the problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-22 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Feb 22, 2011 at 9:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 One of the criteria we've always had for a suitable hint-or-whatever-
 you-call-it design is that it *not* involve decorating the queries.

 [ snip ]
 To put that another way, it's true that some people can't adjust their
 queries, but also some people can.  It's true that nonstandard stuff
 sucks, but queries that don't work suck, too.  And as for better
 solutions, how many major release cycles do we expect people to wait
 for them?

Well, a decorating-the-queries solution that isn't utter crap is not
going to be a small amount of work, either.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: cross column correlation ...

2011-02-22 Thread Nathan Boley
 Personally, I think the first thing we ought to do is add a real, bona
 fide planner hint to override the selectivity calculation manually,
 maybe something like this:

 WHERE (x > 5 AND y = 1) SELECTIVITY (0.1);



If you're going to go that far, why not just collect statistics on
that specific predicate?

ie,  ANALYZE SELECTIVITY ON tablename (x, y) WHERE (x > 5 AND y = 1);

Then it won't fall subject to all of the pitfalls that Tom outlines below.

Selectivities are easy to estimate if we know the predicate. They only
become hard when they have to work for every possible predicate.

Best,
Nathan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers