Re: [HACKERS] Simple postgresql.conf wizard -- Statistics idea...

2008-11-26 Thread Joshua Tolley
On Tue, Nov 25, 2008 at 06:59:25PM -0800, Dann Corbit wrote:
 I do have a statistics idea/suggestion (possibly useful with some future
 PostgreSQL 9.x or something):
 It is a simple matter to calculate lots of interesting univariate summary
 statistics with a single pass over the data (perhaps during a vacuum
 full).
 For instance with numerical columns, you can calculate mean, min, max,
 standard deviation, skew, kurtosis and things like that with a single
 pass over the data.

Calculating interesting univariate summary statistics and having
something useful to do with them are two different things entirely. Note
also that whereas this is simple for numeric columns, it's a very
different story for non-numeric data types that don't come from a
metric space. That said, the idea of a probability metric space is well
explored in the literature, and may have valuable application. The
current histogram implementation is effectively a description of the
probability metric space the column data live in.

 Now, if you store a few numbers calculated in this way, it can be used
 to augment your histogram data when you want to estimate the volume of a
 request. So (for instance) if someone asks for a scalar that is greater
 than some value, you can look to see what percentage of the tail will
 hang out in that neck of the woods using the standard deviation and the
 mean.

Only if you know that the data follow a distribution that can be
described accurately with a standard deviation and a mean. If your data
don't follow a Gaussian distribution, this will give you bad estimates.

- Josh / eggyknap




Re: [HACKERS] Simple postgresql.conf wizard -- Statistics idea...

2008-11-26 Thread Decibel!

On Nov 25, 2008, at 8:59 PM, Dann Corbit wrote:
It is a simple matter to calculate lots of interesting univariate
summary statistics with a single pass over the data (perhaps during a
vacuum full).

I don't think that the problem we have is how to collect statistics  
(well, except for cross-field stuff); the problem is what to actually  
do with them. What we need people to look at is how we can improve  
query plan estimates across the board. Row count estimates, page  
access estimates, the cost estimates for accessing those pages, etc.  
This isn't a coding problem; it's an algorithm problem. It needs
someone with an advanced (if not expert) grasp of statistics who can  
come up with better ways of estimating these things.


So, if you have a statistics hammer to wield, I think you'll find a  
lot of nails sticking up in the planner code. Hammer on those before  
worrying about additional stats to collect. :)

--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828





Re: [HACKERS] Simple postgresql.conf wizard -- Statistics idea...

2008-11-25 Thread Dann Corbit
 -Original Message-
 From: Tom Lane [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, November 25, 2008 5:33 PM
 To: Dann Corbit
 Cc: Gregory Stark; Decibel!; Robert Haas; Bruce Momjian; Mark Wong;
 Heikki Linnakangas; Josh Berkus; Greg Smith; pgsql-[EMAIL PROTECTED]
 Subject: Re: [HACKERS] Simple postgresql.conf wizard
 
 Dann Corbit [EMAIL PROTECTED] writes:
  I also do not believe that there is any value that will be the right
  answer.  But a table of data might be useful both for people who want
  to toy with altering the values and also for those who want to set the
  defaults.  I guess that at one time such a table was generated to
  produce the initial estimates for default values.
 
 Sir, you credit us too much :-(.  The actual story is that the current
 default of 10 was put in when we first implemented stats histograms,
 replacing code that kept track of only a *single* most common value
 (and not very well, at that).  So it was already a factor of 10 more
 stats than we had experience with keeping, and accordingly conservatism
 suggested not boosting the default much past that.

 So we really don't have any methodically-gathered evidence about the
 effects of different stats settings.  It wouldn't take a lot to convince
 us to switch to a different default, I think, but it would be nice to
 have more than none.

I do have a statistics idea/suggestion (possibly useful with some future
PostgreSQL 9.x or something):
It is a simple matter to calculate lots of interesting univariate summary
statistics with a single pass over the data (perhaps during a vacuum
full).
For instance with numerical columns, you can calculate mean, min, max,
standard deviation, skew, kurtosis and things like that with a single
pass over the data.
Here is a C++ template I wrote to do that:
http://cap.connx.com/tournament_software/STATS.HPP
It also uses this template:
http://cap.connx.com/tournament_software/Kahan.Hpp
which is a high-accuracy adder.  These things could easily be rewritten
in C instead of C++.
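
For anyone who would rather not chase the links, a minimal sketch of the
idea (not the actual code at those URLs) could look like the following;
skew and kurtosis follow the same pattern from the third and fourth
moments:

#include <algorithm>
#include <cmath>
#include <cstddef>

/* Kahan-compensated adder: carries the low-order bits that a plain
   "sum += x" would lose when adding many small values to a large sum. */
struct KahanSum {
    double sum = 0.0, c = 0.0;          /* c = running compensation */
    void add(double x) {
        double y = x - c;
        double t = sum + y;
        c = (t - sum) - y;              /* recover what the addition dropped */
        sum = t;
    }
};

/* Single-pass univariate accumulator: count, min, max and the first two
   raw moments.  Extending with sums of x^3 and x^4 gives skew/kurtosis. */
struct Stats {
    std::size_t n = 0;
    double lo = INFINITY, hi = -INFINITY;
    KahanSum s1, s2;                    /* compensated sums of x and x*x */
    void add(double x) {
        ++n;
        lo = std::min(lo, x);
        hi = std::max(hi, x);
        s1.add(x);
        s2.add(x * x);
    }
    double mean() const { return s1.sum / n; }
    double stddev() const {             /* population standard deviation */
        double m = mean();
        return std::sqrt(std::max(0.0, s2.sum / n - m * m));
    }
};

The compensated sums matter because E[x^2] - mean^2 cancels badly in
plain doubles when the variance is small relative to the mean.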

Now, if you store a few numbers calculated in this way, it can be used
to augment your histogram data when you want to estimate the volume of a
request. So (for instance) if someone asks for a scalar that is greater
than some value, you can look to see what percentage of the tail will
hang out in that neck of the woods using the standard deviation and the
mean.
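
Concretely -- and assuming a normal approximation is defensible for the
column, which is the big caveat -- that tail estimate is a one-liner
(the function name here is just for illustration):

#include <cmath>

/* Fraction of the column expected to exceed v, i.e. the upper-tail
   probability P(X > v) for X ~ N(mean, sd^2).  Only as good as the
   normality assumption behind it. */
double upper_tail_fraction(double mean, double sd, double v) {
    double z = (v - mean) / sd;                  /* standardize */
    return 0.5 * std::erfc(z / std::sqrt(2.0));  /* = 1 - Phi(z) */
}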

I have another, similar idea (possibly useful someday far off in the
future) that I think may have some merit.  The idea is to create a
statistical index.  This index is updated whenever data values are
modified in any way.  For scalar/ordinal values such as float, integer,
numeric it would simply store and update a statistics accumulator (a
small vector of a few items holding statistical moments, counts and
sums) for the column of interest.   These indexes would be very small
and inexpensive to {for instance} memory map.

For categorical values (things like 'color' or 'department') we might
store, in our statistical index, a count of the items that map to each
hash code. That would give you a crude count-distinct for any column --
the only caveat being that more than one distinct value could hash to
the same code (we would also keep a count of null items). If the count
needed to be exact, we could generate a perfect hash for the data, or
store each distinct column value in the categorical index alongside its
hash. The size of such an index would depend on the data: a bit column,
or a char column holding 'm'/'f' for male/female or 'y'/'n' for yes/no,
would contain just two counts, while a column that is unique would have
one hash paired with a count of one for each row of the table (a regular
unique index would clearly be better in that case, but such
distributions could easily arise).

The value of an index like this is that it is good exactly where the
normal index types are bad (e.g. it is useless to create a btree index
on a bit column or a male/female or yes/no column, but a statistics
counter would work nicely there -- for whole-table measures only, of
course). The odd thing about these statistics indexes is that they do
not even bother to point back to the data they represent -- they are
just abstract measures of the general contents.
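
As an illustration only (the names and layout are hypothetical, not a
proposed on-disk format), the categorical index could be as little as a
hash table of counters:

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

/* Sketch of a categorical statistics index: per-hash-code counters with
   no pointers back to the rows they summarize.  Distinct values that
   collide on a hash share one counter, so approx_distinct() is a crude
   lower-bound estimate, as described above. */
struct CategoricalStats {
    std::unordered_map<std::size_t, std::size_t> counts;
    std::size_t nulls = 0;              /* nulls counted separately */

    void add(const std::string *value) {
        if (value == nullptr) { ++nulls; return; }
        ++counts[std::hash<std::string>{}(*value)];
    }
    std::size_t approx_distinct() const { return counts.size(); }
};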

It seems to me that this kind of information could be used to improve
the query plans for both categorical values and scalars and also be used
to generate instant answers for some kinds of statistical queries.  If
used only for query planning, the values would not need to be exact, and
could be updated only at vacuum full time or some other convenient time.
The notion behind creating a stats index would be that we may not need
to maintain this detailed information for every column, but we can
maintain it for the columns we often filter on in our queries, to get an
idea of the cardinality of a subset.
