Re: [HACKERS] proposal : cross-column stats

2010-12-27 Thread Tomas Vondra
On 24.12.2010 13:37, Florian Pflug wrote:
 On Dec24, 2010, at 11:23 , Nicolas Barbier wrote:
 
 2010/12/24 Florian Pflug f...@phlo.org:

 On Dec23, 2010, at 20:39 , Tomas Vondra wrote:

   I guess we could use the highest possible value (equal to the number
   of tuples) - according to wiki you need about 10 bits per element
   with 1% error, i.e. about 10MB of memory for each million of
   elements.

 Drat. I had expected these numbers to come out quite a bit lower than
 that, at least for a higher error target. But even with 10% false
 positive rate, it's still 4.5MB per 1e6 elements. Still too much to
 assume the filter will always fit into memory, I fear :-(

 I have the impression that both of you are forgetting that there are 8
 bits in a byte. 10 bits per element = 1.25MB per million elements.
 
 Uh, of course. So in the real universe, the numbers are
 
 ~1.2MB per 1e6 elements for a false positive rate of 1%
 ~0.5MB per 1e6 elements for a false positive rate of 10%
 
 Hm. So for a table with a billion distinct elements, we'd need half
 a gigabyte per column for the filter. A tuple with two int columns
 takes at least 24+2*4 = 32bytes to store I think, making such a table
 at least 32GB in size. The filter size would thus be 1/64 of the table
 size in the worst case. 

I've read several papers about estimating the number of distinct values,
and I have some bad as well as good news. I don't want this thread to
grow infinitely, and it's a rather separate topic so I'll start a new
thread for ndistinct estimates.

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Nicolas Barbier
2010/12/24 Florian Pflug f...@phlo.org:

 On Dec23, 2010, at 20:39 , Tomas Vondra wrote:

   I guess we could use the highest possible value (equal to the number
   of tuples) - according to wiki you need about 10 bits per element
   with 1% error, i.e. about 10MB of memory for each million of
   elements.

 Drat. I had expected these numbers to come out quite a bit lower than
 that, at least for a higher error target. But even with 10% false
 positive rate, it's still 4.5MB per 1e6 elements. Still too much to
 assume the filter will always fit into memory, I fear :-(

I have the impression that both of you are forgetting that there are 8
bits in a byte. 10 bits per element = 1.25MB per million elements.

Nicolas



Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread tv
 2010/12/24 Florian Pflug f...@phlo.org:

 On Dec23, 2010, at 20:39 , Tomas Vondra wrote:

   I guess we could use the highest possible value (equal to the number
   of tuples) - according to wiki you need about 10 bits per element
   with 1% error, i.e. about 10MB of memory for each million of
   elements.

 Drat. I had expected these numbers to come out quite a bit lower than
 that, at least for a higher error target. But even with 10% false
 positive rate, it's still 4.5MB per 1e6 elements. Still too much to
 assume the filter will always fit into memory, I fear :-(

 I have the impression that both of you are forgetting that there are 8
 bits in a byte. 10 bits per element = 1.25MB per million elements.

We are aware of that, but we really needed to do some very rough estimates,
and it's much easier to do the calculations with 10. Actually, according to
Wikipedia it's not 10 bits per element but about 9.6. But it really does not
matter whether it is 10MB or 20MB of data - it's still a lot of data ...

Tomas




Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Florian Pflug
On Dec24, 2010, at 11:23 , Nicolas Barbier wrote:

 2010/12/24 Florian Pflug f...@phlo.org:
 
 On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
 
   I guess we could use the highest possible value (equal to the number
   of tuples) - according to wiki you need about 10 bits per element
   with 1% error, i.e. about 10MB of memory for each million of
   elements.
 
 Drat. I had expected these numbers to come out quite a bit lower than
 that, at least for a higher error target. But even with 10% false
 positive rate, it's still 4.5MB per 1e6 elements. Still too much to
 assume the filter will always fit into memory, I fear :-(
 
 I have the impression that both of you are forgetting that there are 8
 bits in a byte. 10 bits per element = 1.25MB per million elements.

Uh, of course. So in the real universe, the numbers are

~1.2MB per 1e6 elements for a false positive rate of 1%
~0.5MB per 1e6 elements for a false positive rate of 10%

Hm. So for a table with a billion distinct elements, we'd need half
a gigabyte per column for the filter. A tuple with two int columns
takes at least 24+2*4 = 32 bytes to store I think, making such a table
at least 32GB in size. The filter size would thus be 1/64 of the table
size in the worst case. 
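
For reference, a quick sketch (Python, purely illustrative) of where these
numbers come from - the standard Bloom filter sizing formula
m = -n*ln(p)/(ln 2)^2:

    import math

    def bloom_bits_per_element(p):
        # optimal bits per element for a target false positive rate p
        return -math.log(p) / (math.log(2) ** 2)

    for p in (0.01, 0.10):
        bits = bloom_bits_per_element(p)          # ~9.6 for 1%, ~4.8 for 10%
        mb_per_million = bits * 1e6 / 8 / 1e6     # megabytes per 1e6 elements
        gb_per_billion = bits * 1e9 / 8 / 1e9     # gigabytes per 1e9 elements
        print("p=%.2f: %.1f bits/element, %.2f MB/1e6, %.2f GB/1e9"
              % (p, bits, mb_per_million, gb_per_billion))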

best regards,
Florian Pflug




Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Tomas Vondra
On 24.12.2010 13:15, t...@fuzzy.cz wrote:
 2010/12/24 Florian Pflug f...@phlo.org:

 On Dec23, 2010, at 20:39 , Tomas Vondra wrote:

   I guess we could use the highest possible value (equal to the number
   of tuples) - according to wiki you need about 10 bits per element
   with 1% error, i.e. about 10MB of memory for each million of
   elements.

 Drat. I had expected these numbers to come out quite a bit lower than
 that, at least for a higher error target. But even with 10% false
 positive rate, it's still 4.5MB per 1e6 elements. Still too much to
 assume the filter will always fit into memory, I fear :-(

 I have the impression that both of you are forgetting that there are 8
 bits in a byte. 10 bits per element = 1.25MB per million elements.
 
 We are aware of that, but we really needed to do some very rough estimates
 and it's much easier to do the calculations with 10. Actually according to
 wikipedia it's not 10bits per element but 9.6, etc. But it really does not
 matter if there is 10MB or 20MB of data, it's still a lot of data ...

Oooops, now I see what the problem is. I thought you were pointing out
something else, but I've actually used 1B = 1b (which is obviously
wrong). But Florian already noticed that and fixed the estimates.

Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Tomas Vondra
On 24.12.2010 13:37, Florian Pflug wrote:
 On Dec24, 2010, at 11:23 , Nicolas Barbier wrote:
 
 2010/12/24 Florian Pflug f...@phlo.org:

 On Dec23, 2010, at 20:39 , Tomas Vondra wrote:

   I guess we could use the highest possible value (equal to the number
   of tuples) - according to wiki you need about 10 bits per element
   with 1% error, i.e. about 10MB of memory for each million of
   elements.

 Drat. I had expected these numbers to come out quite a bit lower than
 that, at least for a higher error target. But even with 10% false
 positive rate, it's still 4.5MB per 1e6 elements. Still too much to
 assume the filter will always fit into memory, I fear :-(

 I have the impression that both of you are forgetting that there are 8
 bits in a byte. 10 bits per element = 1.25MB per million elements.
 
 Uh, of course. So in the real universe, the numbers are
 
 ~1.2MB per 1e6 elements for a false positive rate of 1%
 ~0.5MB per 1e6 elements for a false positive rate of 10%
 
 Hm. So for a table with a billion distinct elements, we'd need half
 a gigabyte per column for the filter. A tuple with two int columns
 takes at least 24+2*4 = 32bytes to store I think, making such a table
 at least 32GB in size. The filter size would thus be 1/64 of the table
 size in the worst case. 

Yes, but in reality you need three such filters - one for each column,
one for the combination. So that is 1.5GB (with 10% error rate) or 3.6GB
(with 1% error rate).

But this is severely excessive compared to the real needs, as there are
usually far fewer distinct values (not nearly equal to the number of tuples,
as we assume in these computations).

I was thinking about a simple heuristic to scale the filter properly,
something like this:

1) sample a small portion of the table and count the distinct values
2) compute (number of distinct values) / (number of sampled tuples)
3) scale this to the whole table and size the filter accordingly

Say the sample contains 1,000 rows with 50 distinct values (and, say,
another 20 distinct values are missing from the sample entirely). That
gives 5% in step (2), and if the table has 1,000,000 tuples you'll get
50,000 in step (3). So the filter needs just 60kB, which is a huge
improvement compared to the previous approach (1.2MB).
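
A rough sketch of this heuristic (Python, illustrative only - the function
name and the simple linear scale-up are my assumptions, not a worked-out
proposal):

    import math

    def estimate_filter_size(sample_rows, sample_distinct, total_rows, p=0.01):
        # step 2: ratio of distinct values to sampled tuples
        distinct_ratio = sample_distinct / float(sample_rows)
        # step 3: scale linearly to the whole table (usually an overestimate)
        est_ndistinct = distinct_ratio * total_rows
        # size the Bloom filter for that many elements
        bits = -est_ndistinct * math.log(p) / (math.log(2) ** 2)
        return est_ndistinct, bits / 8            # (ndistinct estimate, bytes)

    # the example above: 50 distinct values in a 1,000-row sample of a
    # 1,000,000-row table -> ~50,000 estimated distinct values, ~60 kB filter
    print(estimate_filter_size(1000, 50, 1000000))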

Obviously this will still lead to overestimates in most cases, and there
are probably some other fail cases, but I think it's a reasonable
solution. I don't think this can result in an underestimate (which is
the case where you lose precision).

And in case we want to build this incrementally (from a VACUUM) we
really need to use a bit larger filter, because rescaling the filter is
not possible AFAIK (without rebuilding it from scratch).

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Tomas Vondra
On 24.12.2010 04:41, Florian Pflug wrote:
 The filter size could be derived from the table's statistics target, or
 be otherwise user-definable. We could also auto-resize once it gets too
 full. But still, that all seems awfully complex :-(

Using a statistics target is a good idea, I think - we could use
it to determine the error rate of the filter. Something like

   error rate = 10 - (statistics_target - 100) / 100

which gives

   1%  for statistics target 1000
   10% for statistics target 100

or maybe something like this (where the error rate grows faster for
smaller statistic target values)

   error rate = 1 + 91000 / (statistics_target^2)

which gives about

   1%  for statistics target 1000
   10% for statistics target 100
   36% for statistics target 50

But I guess a 10% error rate is the minimum we need, so it does not make
much sense to use lower values.
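
A tiny sketch (Python) of such a mapping - the linear formula is simply the
first one above, i.e. one possible mapping that hits the two quoted
endpoints, not a concrete proposal:

    import math

    def filter_error_rate(statistics_target):
        # 10% at statistics target 100, 1% at statistics target 1000,
        # clamped to 1% for even larger targets
        return max(1.0, 10.0 - (statistics_target - 100) / 100.0)

    for target in (50, 100, 500, 1000):
        p = filter_error_rate(target) / 100.0
        bits = -math.log(p) / (math.log(2) ** 2)  # bits per filter element
        print(target, round(p * 100, 1), round(bits, 1))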

Another possibility is to collect the data from just a small portion
of a table and then use the result to estimate the number of distinct
values for the whole table. But I'm not sure we can do this reliably,
I see many traps in this.
 This is how it works currently. The problem with this approach is that
 it gives you very few guarantees about how precise the result will be.
 Extrapolating works very well for things like MCVs and histograms, because
 there you're by definition interested mostly in values which occur often -
 and thus with a high probability in the relative few rows you sample. For
 the number of distinct values, however, this isn't true - if ndistinct
 is an order of magnitude smaller than the number of rows, relatively few
 rows can account for a large percentage of the distinct values...

That basically means we need to sample a large portion of the table :-(

 Another idea would be to obtain the ndistinct values from an index somehow.
 Postgres cannot currently scan an index in physical order, only in logical
 order, due to locking considerations. But since we'd only be interested in
 an estimate, maybe a scan in physical block order would work for ndistinct
 estimates? Just a wild idea, mind you, I haven't checked at all if that'd
 be even remotely feasible.

I was thinking about that too, and I think we could do this using
pageinspect contrib module. Sure, there might be a problem with bloated
indexes.

And relying on this actually means it's required to have a multi-column
index on all the columns. Individual indexes are not enough as we need
to get the number of distinct combinations too.

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-23 Thread Tomas Vondra
On 21.12.2010 16:54, Florian Pflug wrote:
 I think that maybe it'd be acceptable to scan a large portion of the
 table to estimate dist(A,B), *if* we only do so once in a while. But even
 with a full scan, how would we arrive at an estimate for dist(A,B)?
 Keeping all values in memory, and spilling them into, say, an on-disk hash
 table adds even more overhead to the already expensive full scan. Maybe
 using a bloom filter instead of a hash table could avoid the spilling to
 disk, in exchange for a slightly less precise result...

I've read some basics about Bloom filters, and I'm not sure we can use
it in this case. I see two problems with this approach:

1) impossibility to predict the number of elements in advance

   To build a Bloom filter with limited error rate, you need to size it
   properly (number of hash functions and size of the bit array). But
   that's exactly the information we're looking for.

   I guess we could use the highest possible value (equal to the number
   of tuples) - according to wiki you need about 10 bits per element
   with 1% error, i.e. about 10MB of memory for each million of
   elements.

   That's not much but this needs to be done for each column separately
   and for the combination - for N columns you need (at least) N+1
   filters.

   Sure - increasing the error rate and using a different estimate to
   build the bloom filter would significantly decrease the memory
   requirements.

2) sampling the whole table

   A naive approach would be to sample the whole table each time, but
   for large tables that might be a bit of a problem. So I'm thinking
   about what alternatives there are ...

   One possibility is to build the Bloom filter incrementally and store
   it on disk, or something like that. I guess this could be done
   during VACUUM or something like that. Anyway there's an issue how to
   set the filter size initially (estimate the number of elements),
   just as in the previous section.

   Another possibility is to collect the data from just a small portion
   of a table and then use the result to estimate the number of distinct
   values for the whole table. But I'm not sure we can do this reliably,
   I see many traps in this.

But maybe I'm just missing something important.
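
For what it's worth, here is a toy sketch (Python, purely illustrative, not
a proposal for the actual implementation) of the idea being discussed -
counting distinct values approximately by feeding them through a Bloom
filter instead of an exact hash table:

    import hashlib

    class BloomDistinctCounter(object):
        def __init__(self, expected_elements, bits_per_element=10):
            self.m = expected_elements * bits_per_element  # filter size in bits
            self.k = 7                   # ~0.7 * bits_per_element hash functions
            self.bits = bytearray(self.m // 8 + 1)
            self.ndistinct = 0

        def _positions(self, value):
            for i in range(self.k):
                h = hashlib.md5(("%d:%s" % (i, value)).encode()).hexdigest()
                yield int(h, 16) % self.m

        def add(self, value):
            new = False
            for pos in self._positions(value):
                byte, bit = pos // 8, pos % 8
                if not (self.bits[byte] >> bit) & 1:
                    new = True           # at least one bit was unset -> new value
                    self.bits[byte] |= (1 << bit)
            if new:
                self.ndistinct += 1

    c = BloomDistinctCounter(expected_elements=1000)
    for v in (1, 2, 2, 3, 3, 3):
        c.add(v)
    print(c.ndistinct)   # 3 here; may undercount slightly due to false positives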

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-23 Thread Tomas Vondra
On 21.12.2010 19:03, Tomas Vondra wrote:
 My plan for the near future (a few weeks) is to build a simple 'module'
 with the ability to estimate the number of rows for a given condition.
 This could actually be quite useful as a stand-alone contrib module, as
 the users often ask how to get a number of rows fast (usually for paging).

Hm, I've been thinking about where to place this new module - I thought
pgfoundry might be the right place, but I've noticed it supports just
CVS. Well, that's a bit antiquated SCM I think - I'm very happy with the
comfort provided by Git or SVN.

Is there some other place commonly used for pg-related projects? I've
been thinking about github or something like that ...

And I've finally set up the wiki-page about this effort - it's just a
summary of this whole thread.

http://wiki.postgresql.org/wiki/Cross_Columns_Stats

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-23 Thread Florian Pflug
On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
 On 21.12.2010 16:54, Florian Pflug wrote:
 I think that maybe it'd be acceptable to scan a large portion of the
  table to estimate dist(A,B), *if* we only do so once in a while. But even
  with a full scan, how would we arrive at an estimate for dist(A,B)?
  Keeping all values in memory, and spilling them into, say, an on-disk hash
  table adds even more overhead to the already expensive full scan. Maybe
  using a bloom filter instead of a hash table could avoid the spilling to
  disk, in exchange for a slightly less precise result...
 
 I've read some basics about Bloom filters, and I'm not sure we can use
 it in this case. I see two problems with this approach:
 
 1) impossibility to predict the number of elements in advance
 
   To build a Bloom filter with limited error rate, you need to size it
   properly (number of hash functions and size of the bit array). But
   that's exactly the information we're looking for.
 
   I guess we could use the highest possible value (equal to the number
   of tuples) - according to wiki you need about 10 bits per element
   with 1% error, i.e. about 10MB of memory for each million of
   elements.
Drat. I had expected these numbers to come out quite a bit lower than
that, at least for a higher error target. But even with 10% false
positive rate, it's still 4.5MB per 1e6 elements. Still too much to
assume the filter will always fit into memory, I fear :-(

 2) sampling the whole table
 
   A naive approach would be to sample the whole table each time, but
   for large tables that might be a bit of a problem. So I'm thinking
   about what alternatives are there ..


   One possibility is to build the Bloom filter incrementally and store
   it on disk, or something like that. I guess this could be done
   during VACUUM or something like that. Anyway there's an issue how to
   set the filter size initially (estimate the number of elements),
   just as in the previous section.
The filter size could be derived from the table's statistics target, or
be otherwise user-definable. We could also auto-resize once it gets too
full. But still, that all seems awfully complex :-(

   Another possibility is to collect the data from just a small portion
   of a table and then use the result to estimate the number of distinct
   values for the whole table. But I'm not sure we can do this reliably,
   I see many traps in this.
This is how it works currently. The problem with this approach is that
it gives you very few guarantees about how precise the result will be.
Extrapolating works very well for things like MCVs and histograms, because
there you're by definition interested mostly in values which occur often -
and thus with a high probability in the relatively few rows you sample. For
the number of distinct values, however, this isn't true - if ndistinct
is an order of magnitude smaller than the number of rows, relatively few
rows can account for a large percentage of the distinct values...

Another idea would be to obtain the ndistinct values from an index somehow.
Postgres cannot currently scan an index in physical order, only in logical
order, due to locking considerations. But since we'd only be interested in
an estimate, maybe a scan in physical block order would work for ndistinct
estimates? Just a wild idea, mind you, I haven't checked at all if that'd
be even remotely feasible.

best regards,
Florian Pflug





Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread tv
 On Dec18, 2010, at 17:59 , Tomas Vondra wrote:
 It seems to me you're missing one very important thing - this was not
 meant as a new default way to do estimates. It was meant as an option
 when the user (DBA, developer, ...) realizes the current solution gives
 really bad estimates (due to correlation). In that case he could create
 'cross-column' statistics on those columns, and the optimizer would use
 that info to do the estimates.

 I do understand that. I just have the nagging feeling that there is a
 way to judge from dist(A), dist(B) and dist(A,B) whether it makes sense
 to apply the uniform bayesian approach or to assume the columns are
 unrelated.

I doubt there is a way to make this decision with just dist(A), dist(B) and
dist(A,B) values. Well, we could go with a rule

  if [dist(A) == dist(A,B)] then [A => B]

but that's very fragile. Think about estimates (we're not going to work
with exact values of dist(?)), and then about data errors (e.g. a city
matched to an incorrect ZIP code or something like that).

So for a real-world dataset, the condition [dist(A)==dist(A,B)] will
almost never hold. And about the same holds for the uniform correlation
assumption which is the basis for the formula I posted.

So actually we're looking for a formula that does reasonable estimates and
is robust even in cases where the correlation is not uniform or the
estimates are somewhat imprecise.

 This motivates the definition

 F(A,B) = [ dist(A)*dist(B) - dist(A,B) ] / [dist(A,B) * ( dist(B) - 1)]

 (You can probably drop the -1, it doesn't make much of a difference
 for larger values of dist(B).

 F(A,B) specifies where dist(A) lies relative to dist(A,B)/dist(B) and
 dist(A,B) - a value of 0 indicates dist(A) = dist(A,B)/dist(B) while
 a value of 1 indicates that dist(A) == dist(A,B).

 So F(A,B) is a suitable measure of Implicativeness - it's higher
 if the table (A,B) looks more like a function A -> B.

 You might use that to decide if either A->B or B->A looks function-like
 enough to use the uniform bayesian approach. Or you might even go further,
 and decide *which* bayesian formula to use - the paper you cited always
 averages

   P(A=x|B=y)*P(B=y) and
   P(B=y|A=x)*P(A=x)

 but they offer no convincing reason for that other than We don't know
 which to pick.

Well, the reason why they chose the sum/2 approach is they were unable to
infer which of the values is 'better' and the sum/2 limits the errors.

I haven't studied this thoroughly, but my impression is that you are going
in the same direction as they did, i.e. while they've done

   P(A,B) = (P(A|B)*P(B) + P(B|A)*P(A)) / 2

with P(B|A) = dist(A) / dist(A,B), you've chosen P(A|B) ~ F(B,A) or
something like that.

You'll probably object that you could compute F(A,B) and F(B,A) and then
use only the part corresponding to the larger value, but what if the
F(A,B) and F(B,A) are about the same?

This is the reason why they choose to always combine the values (with
varying weights).

 I'd like to find a statistical explanation for that definition of
 F(A,B), but so far I couldn't come up with any. I created a Maple 14
 worksheet while playing around with this - if you happen to have a
 copy of Maple available I'd be happy to send it to you..

No, I don't have Maple. Have you tried Maxima
(http://maxima.sourceforge.net) or Sage (http://www.sagemath.org/). Sage
even has an online notebook - that seems like a very comfortable way to
exchange this kind of data.

regards
Tomas




Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread Robert Haas
On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug f...@phlo.org wrote:
 I tried to pick up Robert's idea of quantifying Implicativeness -
 i.e., finding a number between 0 and 1 that describes how close the
 (A,B) are to representing a function A -> B.

Actually Heikki's idea...

 Observe that dist(A),dist(B) <= dist(A,B) <= dist(A)*dist(B) if the
 estimates of dist(?) are consistent. From that you easily get

  dist(A,B)/dist(B) <= dist(A) <= dist(A,B) and
  dist(A,B)/dist(A) <= dist(B) <= dist(A,B)

 If dist(A) == dist(A,B), then there is a functional dependency
 A -> B, and conversely if dist(B) == dist(A,B) there is a functional
 dependency B -> A. Note that you can have both at the same time!

 On the other hand, if dist(B) = dist(A,B)/dist(A), then B has the
 smallest number of distinct values possible for a given combination
 of dist(A,B) and dist(A). This is the anti-function case.

 This motivates the definition

  F(A,B) = [ dist(A)*dist(B) - dist(A,B) ] / [ dist(A,B) * ( dist(B) - 1) ]

 (You can probably drop the -1, it doesn't make much of a difference
 for larger values of dist(B).

 F(A,B) specifies where dist(A) lies relative to dist(A,B)/dist(B) and
 dist(A,B) - a value of 0 indicates dist(A) = dist(A,B)/dist(B) while
 a value of 1 indicates that dist(A) == dist(A,B).

 So F(A,B) is a suitable measure of Implicativeness - it's higher
 if the table (A,B) looks more like a function A -> B.

 You might use that to decide if either A->B or B->A looks function-like
 enough to use the uniform bayesian approach. Or you might even go further,
 and decide *which* bayesian formula to use - the paper you cited always
 averages

  P(A=x|B=y)*P(B=y) and
  P(B=y|A=x)*P(A=x)

 but they offer no convincing reason for that other than We don't know
 which to pick.

Ideally you want to somehow make this a continuous transition between
the available formulas rather than a discrete transition, I think.  If
F(A,B) = 1 then the selectivity of A = x AND B = y is just P(A=x), and
if it's 0, then it's P(A=x)*P(B=y).  But suppose F(A,B)=0.5.  Then
what?  A naive approach would be to estimate P(A=x AND B=y) = P(A=x) *
(1 - (1 - F(A,B))*(1 - P(B = y))), so that if, say, P(A=x) = 0.1 and
P(B=y) = 0.1, then when F(A,B) = 0 we estimate 0.01, when F(A,B) = 1
we estimate 0.1, and when F(A,B) = 0.5 we estimate (0.1)(1 - 0.5*0.9)
= 0.055.  Of course I'm just hand-waving here, and this is without any
mathematical basis, being just the simplest formula I could think of
that gets the endpoints right and plots some sort of smooth curve
between them in the middle.  A similar formula with a believable
argument to back it up seems like it would be a big step forward for
this method.
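
A quick check of that hand-waved blend (Python, just reproducing the
arithmetic above):

    def blended_selectivity(p_a, p_b, f_ab):
        # P(A=x AND B=y) ~ P(A=x) * (1 - (1 - F(A,B)) * (1 - P(B=y)))
        return p_a * (1 - (1 - f_ab) * (1 - p_b))

    for f in (0.0, 0.5, 1.0):
        print(f, blended_selectivity(0.1, 0.1, f))   # 0.01, 0.055, 0.1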

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread tv
 On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug f...@phlo.org wrote:
 You might use that to decide if either A->B or B->A looks function-like
 enough to use the uniform bayesian approach. Or you might even go
 further,
 and decide *which* bayesian formula to use - the paper you cited always
 averages

  P(A=x|B=y)*P(B=y) and
  P(B=y|A=x)*P(A=x)

 but they offer no convincing reason for that other than We don't know
 which to pick.

 Ideally you want to somehow make this a continuous transition between
 the available formulas rather than a discrete transition, I think.  If
 F(A,B) = 1 then the selectivity of A = x AND B = y is just P(A=x), and
 if it's 0, then it's P(A=x)*P(B=y).  But suppose F(A,B)=0.5.  Then
 what?  A naive approach would be to estimate P(A=x AND B=y) = P(A=x) *
 (1 - (1 - F(A,B))*(1 - P(B = y))), so that if, say, P(A=x) = 0.1 and
 P(B=y) = 0.1, then when F(A,B) = 0 we estimate 0.01, when F(A,B) = 1
 we estimate 0.1, and when F(A,B) = 0.5 we estimate (0.1)(1 - 0.5*0.9)
 = 0.055.  Of course I'm just hand-waving here, and this is without any
 mathematical basis, being just the simplest formula I could think of
 that gets the endpoints right and plots some sort of smooth curve
 between them in the middle.  A similar formula with a believable
 argument to back it up seems like it would be a big step forward for
 this method.

This somehow reminds me of how the various t-norms in fuzzy logic evolved.
I'm not saying we should use fuzzy logic here, but the requirements are
very similar so it might be an interesting inspiration. See for example
this http://plato.stanford.edu/entries/logic-fuzzy (chapter 4).

And there's one additional - IMHO very important - requirement. The whole
thing should easily extend to more than two columns. This IF (F(A,B) >
F(B,A)) THEN ... probably is not a good solution regarding this.

For example given 3 columns A,B,C, would you do that comparison for each
pair of columns, or would you do that for A vs (B,C)? Or maybe a
completely different approach? Because that would require collecting a lot
more data (number of distinct values in each combination) etc.

Say, for example, there is a table with (C=A+B)

  A | B | C
 ===
  1 | 1 | 2
  1 | 2 | 3
  1 | 3 | 4
  2 | 1 | 3
  2 | 2 | 4
  2 | 3 | 5
  3 | 1 | 4
  3 | 2 | 5
  3 | 3 | 6

So that dist(A)=dist(B)=3, dist(C)=5 and dist(A,B,C)=dist(A,B)=9. Given
the paper, you get something like

 P(A,B,C) = [dist(A)*P(A) + dist(B)*P(B) + dist(C)*P(C)] / [3*dist(A,B,C)]
          = [P(A) + P(B) + (5/3)*P(C)] / 9

so for example

 P(A=3,B=2,C=5) = [1/3 + 1/3 + (5/3)*(2/9)] / 9 = (28/27)/9 = 28/243

which is almost correct (the exact value is 1/9 = 27/243).
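
A small sketch (Python) that recomputes this example directly from the
table above:

    from itertools import product

    rows = [(a, b, a + b) for a, b in product((1, 2, 3), (1, 2, 3))]  # C = A+B

    def dist(cols):
        return len(set(tuple(r[c] for c in cols) for r in rows))

    def prob(col, val):
        return sum(1 for r in rows if r[col] == val) / float(len(rows))

    # extrapolated formula: sum(dist(X) * P(X)) / (n_columns * dist(A,B,C))
    est = (dist([0]) * prob(0, 3) + dist([1]) * prob(1, 2)
           + dist([2]) * prob(2, 5)) / (3.0 * dist([0, 1, 2]))
    true = sum(1 for r in rows if r == (3, 2, 5)) / float(len(rows))
    print(est, true)   # ~0.115 (28/243) vs. the exact 1/9 (27/243)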

Don't get me wrong - I'm not a fanatic who thinks this particular formula
is the best formula possible. I'm just saying we could end up with a
formula that works beautifully in 2D, but once we get to 3 columns it
fails miserably.

Hmmm, maybe we could give this possibility (to identify two separate
groups of columns) to the user. So instead of 'build stats for (A,B,C)' the
user would say 'build stats for (A,B) and (C)' - this actually represents
apriori knowledge of dependencies supplied by the user.

In that case we could search for 'implicativeness' between those two
groups (and not within the groups), and we could restrict ourselves to 2D
(and thus use a more sophisticated formula).

But we should be able to do some basic analysis even when the user
supplies a list of columns without such apriori knowledge.

regards
Tomas




Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread Florian Pflug
On Dec21, 2010, at 11:37 , t...@fuzzy.cz wrote:
 I doubt there is a way to make this decision with just dist(A), dist(B) and
 dist(A,B) values. Well, we could go with a rule
 
  if [dist(A) == dist(A,B)] then [A => B]
 
 but that's very fragile. Think about estimates (we're not going to work
 with exact values of dist(?)), and then about data errors (e.g. a city
 matched to an incorrect ZIP code or something like that).

Huh? The whole point of the F(A,B)-exercise is to avoid precisely this
kind of fragility without penalizing the non-correlated case...

 This is the reason why they choose to always combine the values (with
 varying weights).

There are no varying weights involved there. What they do is to express
P(A=x,B=y) once as

  P(A=x,B=y) = P(B=y|A=x)*P(A=x) and then as
  P(A=x,B=y) = P(A=x|B=y)*P(B=y).

Then they assume

  P(B=y|A=x) ~= dist(A)/dist(A,B) and
  P(A=x|B=y) ~= dist(B)/dist(A,B),

and go on to average the two different ways of computing P(A=x,B=y), which
finally gives

  P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
   = dist(A)*P(A=x)/(2*dist(A,B)) + dist(B)*P(B=y)/(2*dist(A,B))
  = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))

That averaging step adds *no* further data-dependent weights.
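
In code form, the whole estimate is just this (a Python sketch, using the
notation above):

    def uniform_bayesian_selectivity(p_a, p_b, dist_a, dist_b, dist_ab):
        # P(A=x,B=y) ~= (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))
        return (dist_a * p_a + dist_b * p_b) / (2.0 * dist_ab)

    # uncorrelated case, dist(A,B) ~= dist(A)*dist(B): this reduces to roughly
    # P(A=x)/(2*dist(B)) + P(B=y)/(2*dist(A))
    print(uniform_bayesian_selectivity(0.01, 0.1, 100, 10, 1000))   # 0.001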

 I'd like to find a statistical explanation for that definition of
 F(A,B), but so far I couldn't come up with any. I created a Maple 14
 worksheet while playing around with this - if you happen to have a
 copy of Maple available I'd be happy to send it to you..
 
 No, I don't have Maple. Have you tried Maxima
 (http://maxima.sourceforge.net) or Sage (http://www.sagemath.org/). Sage
 even has an online notebook - that seems like a very comfortable way to
 exchange this kind of data.

I haven't tried them, but I will. That Java-based GUI of Maple is driving
me nuts anyway... Thanks for the pointers!

best regards,
Florian Pflug





Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread tv
 On Dec21, 2010, at 11:37 , t...@fuzzy.cz wrote:
 I doubt there is a way to make this decision with just dist(A), dist(B) and
 dist(A,B) values. Well, we could go with a rule

  if [dist(A) == dist(A,B)] then [A => B]

 but that's very fragile. Think about estimates (we're not going to work
 with exact values of dist(?)), and then about data errors (e.g. a city
 matched to an incorrect ZIP code or something like that).

 Huh? The whole point of the F(A,B)-exercise is to avoid precisely this
 kind of fragility without penalizing the non-correlated case...

Yes, I understand the intention, but I'm not sure how exactly you want
to use the F(?,?) function to compute P(A,B) - which is the value
we're looking for.

If I understand it correctly, you proposed something like this

   IF (F(A,B) > F(B,A)) THEN
     P(A,B) := c*P(A);
   ELSE
     P(A,B) := d*P(B);
   END IF;

or something like that (I guess c=dist(A)/dist(A,B) and
d=dist(B)/dist(A,B)). But what if F(A,B)=0.6 and F(B,A)=0.59? This may
easily happen due to data errors / imprecise estimate.

And this actually matters, because P(A) and P(B) may be actually
significantly different. So this would be really vulnerable to slight
changes in the estimates etc.

 This is the reason why they choose to always combine the values (with
 varying weights).

 There are no varying weights involved there. What they do is to express
 P(A=x,B=y) once as

 ...

   P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
   = dist(A)*P(A=x)/(2*dist(A,B)) +
 dist(B)*P(B=x)/(2*dist(A,B))
   = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))

 That averaging steps add *no* further data-dependent weights.

Sorry, by 'varying weights' I didn't mean that the weights are different
for each value of A or B. What I meant is that they combine the values
with different weights (just as you explained).

regards
Tomas




Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread Florian Pflug
On Dec21, 2010, at 15:51 , t...@fuzzy.cz wrote:
 This is the reason why they choose to always combine the values (with
 varying weights).
 
 There are no varying weights involved there. What they do is to express
 P(A=x,B=y) once as
 
 ...
 
  P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
  = dist(A)*P(A=x)/(2*dist(A,B)) +
 dist(B)*P(B=x)/(2*dist(A,B))
  = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))
 
 That averaging steps add *no* further data-dependent weights.
 
 Sorry, by 'varying weights' I didn't mean that the weights are different
 for each value of A or B. What I meant is that they combine the values
 with different weights (just as you explained).

I'm still not sure we're on the same page here. The resulting formula
is *not* a weighted average of P(A=x) and P(B=y), since in general
dist(A) + dist(B) = 2*dist(A,B) does *not* hold. It may look like one
syntactically, but that's about it.

The resulting formula instead is an *unweighted* (weights 1) average of
the two estimates P(B=y|A=x)*P(A=x) and P(A=x|B=y)*P(B=y). You might just
as well estimate P(A=x,B=y) with

  P(B=y|A=x)*P(A=x) = dist(A)*P(A=x)/dist(A,B)

and it'd *still* be the very same uniform bayesian approach, just no
longer symmetric in A and B. Which may easily be preferable if you
have reasons to believe that this estimate is more correct than the
one obtained by swapping A and B. The original paper doesn't deal with
that case simply because they don't mention how P(A=x) and P(B=y)
are obtained at all. The postgres estimator, on the other hand,
knows quite well how it derived P(A=x) and P(B=y) and may have much
higher confidence in one value than in the other.

Assume for example that you're preparing the statement

  SELECT * FROM T WHERE A = ? AND B = 1

We'll then estimate P(A=?) as 1/dist(A), since we cannot do any better
without an actual value for the parameter ?. The estimate for P(B=1),
on the other hand, can use the histogram, and will thus very likely be
much more precise. The two estimates for P(A=?,B=1) in this case are

   P(A=?|B=1)*P(B=1) = dist(B)*P(B=1)/dist(A,B), and
   P(B=1|A=?)*P(A=?) = dist(A)*P(A=?)/dist(A,B).

There's a good chance that the former beats the latter, and thus also
the average, then.

best regards,
Florian Pflug




Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread Florian Pflug
On Dec21, 2010, at 13:25 , t...@fuzzy.cz wrote:
 And there's one additional - IMHO very important - requirement. The whole
 thing should easily extend to more than two columns. This IF (F(A,B) >
 F(B,A)) THEN ... probably is not a good solution regarding this.
 
 For example given 3 columns A,B,C, would you do that comparison for each
 pair of columns, or would you do that for A vs (B,C)? Or maybe a
 completely different approach? Because that would require to collect a lot
 more data (number of distinct values in each combination) etc.

That's certainly a valid concern. The uniform bayesian approach avoids that
quite beautifully...

 Hmmm, maybe we could give this possibility (to identify two separate
 groups of columns) to the user. So instead of 'build stats for (A,B,C)' the
 user would say 'build stats for (A,B) and (C)' - this actually represents
 apriori knowledge of dependencies supplied by the user.
 
 In that case we could search for 'implicativeness' between those two
 groups (and not within the groups), and we could restrict ourselves to 2D
 (and thus use a more sophisticated formula).

Hm, I hated this idea at first, but I'm starting to like it more and more.
It *does* seem rather unrealistic that a user would know that a bunch of
columns are correlated, but have no idea in what way... 

Any examples where this would be the case would be very much appreciated -
maybe we should ask around on -general about this?

 But we should be able to do some basic analysis even when the user
 supplies a list of columns without such apriori knowledge.

That, I think, overcomplicates things, at least for a first cut.

To summarize, I think you've shown quite nicely that the uniform bayesian
approach is a very sensible first step towards better estimates in the case
of correlated columns. It's statistically sound, and the dist(A,B) estimates
it requires are probably a necessary ingredient of any solution to the problem.
If we can make it degrade more gracefully if the columns are uncorrelated we
should do that, but if we can't, that's still no reason to drop the whole idea.

So I guess we should turn our attention to how we'd obtain reasonably good
estimates of dist(A,B), and return to the current discussion once the other
pieces are in place.

I think that maybe it'd be acceptable to scan a large portion of the
table to estimate dist(A,B), *if* we only do so once in a while. But even
with a full scan, how would we arrive at an estimate for dist(A,B)?
Keeping all values in memory, and spilling them into, say, an on-disk hash
table adds even more overhead to the already expensive full scan. Maybe
using a bloom filter instead of a hash table could avoid the spilling to
disk, in exchange for a slightly less precise result...

best regards,
Florian Pflug




Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread tv
 On Dec21, 2010, at 15:51 , t...@fuzzy.cz wrote:
 This is the reason why they choose to always combine the values (with
 varying weights).

 There are no varying weights involved there. What they do is to express
 P(A=x,B=y) once as

 ...

  P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
  = dist(A)*P(A=x)/(2*dist(A,B)) +
 dist(B)*P(B=x)/(2*dist(A,B))
  = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))

 That averaging steps add *no* further data-dependent weights.

 Sorry, by 'varying weights' I didn't mean that the weights are different
 for each value of A or B. What I meant is that they combine the values
 with different weights (just as you explained).

 I'm still not sure we're on the same page here. The resulting formula
 is *not* a weighted average of P(A=x) and P(B=y), since in general
 dist(A) + dist(B) = 2*dist(A,B) does *not* hold. It may look like one
 syntactically, but that's about it.

OK, another crazy usage of 'weights' on my side :-(

What I meant is that in the end you have two equations of P(A,B):

  P(A=x|B=y)*P(B=y) = dist(B)*P(B=y)/dist(A,B)
  P(B=y|A=x)*P(A=x) = dist(A)*P(A=x)/dist(A,B)

and you need to combine those two estimates. They did that by averaging,
as they don't know which of the estimates is better.

Generally I think that is a good solution, unless you know one of the
estimates is much more reliable (although I'm not sure we should
completely omit the other estimate).

 The resulting formula instead is an *unweighted* (weights 1) average of
 the two estimates P(B=y|A=x)*P(A=x) and P(A=x|B=y)*P(B=y). You might just
 as well estimate P(A=x,B=y) with

   P(B=y|A=x)*P(A=x) = dist(A)*P(A=x)/dist(A,B)

 and it's *still* be the very same uniform bayesian approach, just no
 longer symmetric in A and B. Which may easily be preferable if you
 have reasons to believe that this estimate is more correct than the
 one obtained by swapping A and B. The original paper doesn't deal with
 that case simply because they don't mention how P(A=x) and P(B=y)
 are obtained at all. The postgres estimator, on the other hand,
 knows quite well how it derived P(A=x) and P(B=y) and may have much
 higher confidence in one value than in the other.

OK, good point. I haven't realized that one of the estimates may be much
more reliable.

But let's assume both estimates are about the same (regarding reliability)
and let's see the following example

 A | B
 =
 1 | 1
 1 | 1
 1 | 1
 1 | 2
 2 | 1
 2 | 2
 2 | 2
 2 | 2

Thus dist(A)=dist(B)=2, dist(A,B)=4 and

  P(A=1)=P(A=2)=P(B=1)=P(B=2)=1/2
  P(A=1,B=1)=P(A=2,B=2)=3/8
   P(A=1,B=2)=P(A=2,B=1)=1/8

According to the formula presented in the paper, the partial estimates for
P(A=1,B=2) are

  P(A=1|B=2)*P(B=2) = dist(A)/dist(A,B) * P(B=2) = 2/4 * 1/2 = 1/4
  P(B=2|A=1)*P(A=1) = dist(B)/dist(A,B) * P(A=1) = 2/4 * 1/2 = 1/4

Thus P(A=1,B=2) = (1/4 + 1/4)/2 = 1/4, so it's overestimated (2x)

 A | B
 =
 1 | 1
 1 | 2
 1 | 2
 1 | 2
 2 | 1
 2 | 1
 2 | 1
 2 | 2

This obviously has exactly the same features (regarding number of distinct
values), and the produced estimate is exactly the same. But in this case

  P(A=1,B=2)=P(A=2,B=1)=3/8
  P(A=1,B=1)=P(A=2,B=2)=1/8

and thus the 1/4 is an underestimate (compared to 3/8).
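
A short sketch (Python) making the same point - both tables give identical
dist() values and hence an identical estimate, while the true frequencies
of (A=1,B=2) differ:

    def paper_estimate(p_a, p_b, dist_a, dist_b, dist_ab):
        return (dist_a * p_a + dist_b * p_b) / (2.0 * dist_ab)

    table1 = [(1, 1)] * 3 + [(1, 2)] + [(2, 1)] + [(2, 2)] * 3
    table2 = [(1, 1)] + [(1, 2)] * 3 + [(2, 1)] * 3 + [(2, 2)]

    for rows in (table1, table2):
        n = float(len(rows))
        p_a1 = sum(1 for a, b in rows if a == 1) / n
        p_b2 = sum(1 for a, b in rows if b == 2) / n
        dist_a = len(set(a for a, b in rows))
        dist_b = len(set(b for a, b in rows))
        dist_ab = len(set(rows))
        est = paper_estimate(p_a1, p_b2, dist_a, dist_b, dist_ab)
        true = sum(1 for r in rows if r == (1, 2)) / n
        print(est, true)   # 0.25 vs 0.125, then 0.25 vs 0.375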

The problem is that F(A,B) does not change at all. It's very simple to
construct examples (just use more rows) where F(A,B) returns exactly the
same value, but the estimates are off. The averaging somehow smooths
this out ...

But I think I'm missing something about how to use the F(?,?) to derive
the final estimate. So maybe the resulting estimate would be better.

Say there are two tables

   A | B | number of such rows
  ==
   1 | 1 | 1000
   1 | 2 | 1000
   2 | 1 | 1000
   2 | 2 | 1000

   A | B | number of such rows
  ==
   1 | 1 | 1
   1 | 2 | 1999
   2 | 1 | 1999
   2 | 2 | 1

How would you estimate the P(A=1,B=1) in those cases? Assume that both
estimates are equally reliable - i.e. deduced from a histogram or MCV.


 Assume for example that you're preparing the statement

   SELECT * FROM T WHERE A = ? AND B = 1

 We'll then estimate P(A=?) as 1/dist(A), since we cannot do any better
 without an actual value for the parameter ?. The estimate for P(B=1),
 on the other hand, can use the histogram, and will thus very likely be
 much more precise. The two estimates for P(A=?,B=1) in this case are

   P(A=?|B=1)*P(B=1) = dist(B)*P(B=1)/dist(A,B), and
   P(B=1|A=?)*P(A=?) = dist(A)*P(A=?)/dist(A,B).

 There's a good chance that the former beats the latter, and thus also
 the average, then.

OK, good point. I was not thinking about prepared statements. In this case
it makes sense to use only one of the estimates ...

regards
Tomas




Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread Tomas Vondra
On 21.12.2010 16:54, Florian Pflug wrote:
 Hmmm, maybe we could give this possibility (to identify two separate
 groups of columns) to the user. So instead of 'build stats for (A,B,C)' the
 user would say 'build stats for (A,B) and (C)' - this actually represents
 apriori knowledge of dependencies supplied by the user.

 In that case we could search for 'implicativeness' between those two
 groups (and not within the groups), and we could restrict ourselves to 2D
 (and thus use a more sophisticated formula).
 
 Hm, I hated this idea at first, but I'm starting to like it more and more.
 It *does* seem rather unrealistic that a user would know that a bunch of
 columns are correlated, but have no idea in what way... 

Yes, that's true. Although sometimes the dependency may be very
complicated - but let's restrict to 2D for now, build something that
solves this simplified case and then we can discuss higher dimensions.

 Any examples when this's be the case would be very much appreciated - Maybe
 we should ask around on -general about this?

Well, I think the ZIP code example is a typical case of this - the users
know about the dependency between ZIP codes and cities. A natural
workaround would be to omit the dependent column from the query, but
that's not always possible (e.g. when an ORM is involved, building the
queries automatically).

 But we should be able to do some basic analysis even when the user
 supplies a list of columns without such apriori knowledge.
 
 That, I think, overcomplicates things, at least for a first cut.
 
 To summarize, I think you've shown quite nicely that the uniform bayesian
 approach is a very sensible first step towards better estimates in the case
 of correlated columns. It's statistically sound, and the dist(A,B) estimates
 it requires are probably a necessary ingredient of any solution to the
 problem. If we can make it degrade more gracefully if the columns are
 uncorrelated we should do that, but if we can't, that's still no reason to
 drop the whole idea.

Agreed. IMHO the uncorrelated case is not a big concern, as the users
usually know something's wrong with the columns. But we should introduce
some 'autodetect' eventually, but let's leave that for the future.

 So I guess we should turn our attention to how we'd obtain reasonably good
 estimates of dist(A,B), and return to the current discussion once the other
 pieces are in place.
 
 I think that maybe it'd be acceptable to scan a large portion of the
 table to estimate dist(A,B), *if* we only do so once in a while. But even
 with a full scan, how would we arrive at an estimate for dist(A,B)?
 Keeping all values in memory, and spilling them into, say, an on-disk hash
 table adds even more overhead to the already expensive full scan. Maybe
 using a bloom filter instead of a hash table could avoid the spilling to
 disk, in exchange for a slightly less precise result...

I have no idea what a Bloom filter is (shame on me). I was not thinking
about collecting the stats, I was interested primarily in what data we
actually need. And my knowledge about the algorithms currently used
is very limited :-(

But I agree we should at least discuss the possible solutions. Until now
I've done something like this

   SELECT COUNT(DISTINCT a) AS dist_a,
  COUNT(DISTINCT b) AS dist_b,
  COUNT(DISTINCT a || ':' || b) AS dist_ab FROM my_table;

but that's not very efficient.

My plan for the near future (a few weeks) is to build a simple 'module'
with the ability to estimate the number of rows for a given condition.
This could actually be quite useful as a stand-alone contrib module, as
the users often ask how to get a number of rows fast (usually for paging).

Counting the rows exactly may be quite slow when the query returns too
many rows, even when there is an index. It may even be much slower than
the actual query (as that usually contains a small LIMIT).

An estimate is often sufficient, but 'pg_class.reltuples' does not
really work with conditions. So this might be an improvement ...
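
As an aside, for the "fast approximate row count" use case one common
workaround is to read the planner's own estimate from EXPLAIN. A hedged
sketch (Python; psycopg2, the connection string and the regex over the
text EXPLAIN output are my choices for illustration, not part of the
proposal):

    import re
    import psycopg2

    def estimated_rows(conn, query):
        # read the row estimate from the top line of the plan, e.g.
        # "Seq Scan on my_table  (cost=0.00..35.50 rows=2550 width=4)"
        cur = conn.cursor()
        cur.execute("EXPLAIN " + query)
        top_line = cur.fetchone()[0]
        match = re.search(r"rows=(\d+)", top_line)
        return int(match.group(1)) if match else None

    conn = psycopg2.connect("dbname=test")   # hypothetical connection string
    print(estimated_rows(conn, "SELECT * FROM my_table WHERE a = 1 AND b = 2"))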

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-20 Thread Florian Pflug
On Dec18, 2010, at 17:59 , Tomas Vondra wrote:
 It seems to me you're missing one very important thing - this was not
 meant as a new default way to do estimates. It was meant as an option
 when the user (DBA, developer, ...) realizes the current solution gives
 really bad estimates (due to correlation). In that case he could create
 'cross-column' statistics on those columns, and the optimizer would use
 that info to do the estimates.

I do understand that. I just have the nagging feeling that there is a
way to judge from dist(A), dist(B) and dist(A,B) whether it makes sense
to apply the uniform bayesian approach or to assume the columns are
unrelated.

I played with this for a bit over the weekend, but unfortunately ran out
of time. So I'm writing up what I found, to prevent it from getting lost.

I tried to pick up Robert's idea of quantifying Implicativeness -
i.e., finding a number between 0 and 1 that describes how close the
(A,B) are to representing a function A -> B.

Observe that dist(A),dist(B) <= dist(A,B) <= dist(A)*dist(B) if the
estimates of dist(?) are consistent. From that you easily get

  dist(A,B)/dist(B) <= dist(A) <= dist(A,B) and
  dist(A,B)/dist(A) <= dist(B) <= dist(A,B)

If dist(A) == dist(A,B), then there is a functional dependency
A -> B, and conversely if dist(B) == dist(A,B) there is a functional
dependency B -> A. Note that you can have both at the same time!

On the other hand, if dist(B) = dist(A,B)/dist(A), then B has the
smallest number of distinct values possible for a given combination
of dist(A,B) and dist(A). This is the anti-function case.

This motivates the definition

  F(A,B) = [ dist(A)*dist(B) - dist(A,B) ] / [ dist(A,B) * ( dist(B) - 1) ]

(You can probably drop the -1, it doesn't make much of a difference
for larger values of dist(B).)

F(A,B) specifies where dist(A) lies relative to dist(A,B)/dist(B) and
dist(A,B) - a value of 0 indicates dist(A) = dist(A,B)/dist(B) while
a value of 1 indicates that dist(A) == dist(A,B).

So F(A,B) is a suitable measure of Implicativeness - it's higher
if the table (A,B) looks more like a function A -> B.

You might use that to decide if either A->B or B->A looks function-like
enough to use the uniform bayesian approach. Or you might even go further,
and decide *which* bayesian formula to use - the paper you cited always
averages

  P(A=x|B=y)*P(B=y) and
  P(B=y|A=x)*P(A=x)

but they offer no convincing reason for that other than "we don't know
which to pick".

I'd like to find a statistical explanation for that definition of
F(A,B), but so far I couldn't come up with any. I created a Maple 14
worksheet while playing around with this - if you happen to have a
copy of Maple available I'd be happy to send it to you..

This is what I got so far - I hope it may prove to be of use somehow.

best regards,
Florian Pflug




Re: [HACKERS] proposal : cross-column stats

2010-12-19 Thread Simon Riggs
On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
  The proposed solution is based on contingency tables, built for selected
  groups of columns (not for each possible group). And the contingency
  table gives you the ability to estimate the probabilities needed to
  compute the selectivity. Or am I missing something?
 
  Well, I'm not real familiar with contingency tables, but it seems like
  you could end up needing to store a huge amount of data to get any
  benefit out of it, in some cases.
 
 The reason that this wasn't done years ago is precisely that nobody's
 figured out how to do it with a tolerable amount of stats data and a
 tolerable amount of processing time (both at ANALYZE time and during
 query planning).  It's not hard to see what we'd ideally like to do;
 it's getting from there to something useful in production that's hard.

I think we have to face up to the fact that attempting to derive
meaningful cross-column stats will require larger sample sizes.

If we collect everything we're going to have ~10^9 stats slots with
default stats_target 100 and a 100 column table.

We should disconnect sample size from histogram size, and we need to
make the initial column pairings vastly fewer than all combinations.
Manual specification seems like it will be required for the cases where
we decide not to include it automatically, so it seems we'll need manual
specification anyway. In that case, we should do manual specification
first.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 




Re: [HACKERS] proposal : cross-column stats

2010-12-19 Thread Tomas Vondra
On 19.12.2010 21:21, Simon Riggs wrote:
 On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
 The proposed solution is based on contingency tables, built for selected
 groups of columns (not for each possible group). And the contingency
 table gives you the ability to estimate the probabilities needed to
 compute the selectivity. Or am I missing something?

 Well, I'm not real familiar with contingency tables, but it seems like
 you could end up needing to store a huge amount of data to get any
 benefit out of it, in some cases.

 The reason that this wasn't done years ago is precisely that nobody's
 figured out how to do it with a tolerable amount of stats data and a
 tolerable amount of processing time (both at ANALYZE time and during
 query planning).  It's not hard to see what we'd ideally like to do;
 it's getting from there to something useful in production that's hard.
 
 I think we have to face up to the fact that attempting to derive
 meaningful cross-column stats will require larger sample sizes.

Amen.

 If we collect everything we're going to have ~10^9 stats slots with
 default stats_target 100 and a 100 column table.

 We should disconnect sample size from histogram size, and we need to
 make the initial column pairings vastly fewer than all combinations.
 Manual specification seems like it will be required for the cases where
 we decide not to include it automatically, so it seems we'll need manual
 specification anyway. In that case, we should do manual specification
 first.

Well, not really. The more bins you have, the larger the sample you need
to get a representative picture of the data. So the histogram and
sample size are naturally connected.

And there are some (mostly heuristic) rules to determine how large the
sample should be. E.g. when building a contingency table for a
chi-squared test, a common rule is that each bin should contain at least
5 values. So the more bins you have, the larger sample you need.

I like the way Oracle does this - you can either let it decide what the
proper sample size is, or you can specify how large the sample should
be (what portion of the table).

So the tricky part here is determining the number of bins in the
histogram. In the one-dimensional case, stats_target=100 actually means
each bin contains about 1% of data. So I guess we should use a similar
approach in the multi-dimensional case too, i.e. let the user determine
a desired precision and then derive the number of bins automatically
(which is a bit tricky because of the multiple dimensions).

But yeah, in general you're right - this will require larger samples,
more complex stats to collect, we'll need some space to store the
collected stats ...

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-18 Thread Florian Pflug
On Dec18, 2010, at 08:10 , t...@fuzzy.cz wrote:

 On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
 Well, not really - I haven't done any experiments with it. For two
 columns selectivity equation is
 
 (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
 
 where A and B are columns, dist(X) is number of distinct values in
 column X and sel(X) is selectivity of column X.
 
 Huh? This is the selectivity estimate for A = x AND B = y? Surely,
 if A and B are independent, the formula must reduce to sel(A) * sel(B),
 and I cannot see how that'd work with the formula above.
 
 Yes, it's a selectivity estimate for P(A=a and B=b). It's based on
 conditional probability, as
 
   P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a)
 
 and uniform correlation assumption so that it's possible to replace the
 conditional probabilities with constants. And those constants are then
 estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B).

Ok, I think I understand this now. The selectivity equation actually
*does* reduce to sel(A) * sel(B), *if* we pick a very simple estimate
for sel(A).

Take the clause A = x AND B = y for example. Without knowing anything
about x and y, reasonable guesses for sel(A=x) and sel(B=y) are

  sel(A=x) = 1 / dist(A)
  sel(B=y) = 1 / dist(B).

This is also what we do currently, according to var_eq_non_const() in
src/backend/utils/adt/selfuncs.c, if we don't have any additional
knowledge about x (Actually, we also factor the probability of A being
NULL into this).

With these estimates, your formula becomes

  sel(A=x,B=y) = 1 / dist(A,B).

and if A and B are uncorrelated, dist(A,B) ~= dist(A) * dist(B), thus

  sel(A=x,B=y) = sel(A=x) * sel(B=y).

If, however, y is a constant, then we use the MCVs to estimate sel(B=y)
(var_eq_const() in src/backend/utils/adt/selfuncs.c). If

  sel(B=y) ~= 0,

we'd currently also conclude that

  sel(A=x,B=y) ~= 0.

With the uniform correlation approach, we'd instead estimate

  sel(A=x,B=y) ~= sel(A=x) / dist(B)

assuming that dist(A,B) ~= dist(A)*dist(B), meaning A,B are uncorrelated.
If dist(B) is small, this estimate is much worse than what we'd currently
get, since we've effectively ignored the information that the restriction
B=y alone guarantees that only very few rows will match.
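A small numeric sketch of both points (Python, invented numbers - just to
check the arithmetic, not a claim about the real code paths):

    # dist(X) = number of distinct values, sel_* = per-clause estimates
    def uniform_corr(sel_a, sel_b, dist_a, dist_b, dist_ab):
        return (dist_a * sel_a + dist_b * sel_b) / (2.0 * dist_ab)

    dist_a, dist_b = 10, 100
    dist_ab = dist_a * dist_b    # uncorrelated: dist(A,B) ~ dist(A)*dist(B)

    # 1) with sel = 1/dist the formula reduces to sel(A)*sel(B) = 0.001
    print(uniform_corr(1.0/dist_a, 1.0/dist_b, dist_a, dist_b, dist_ab))

    # 2) if the MCV list says sel(B=y) is practically zero, the combined
    #    estimate only drops to about sel(A)/(2*dist(B)) = 0.0005, far
    #    above the ~1e-6 that the B=y clause alone would suggest
    print(uniform_corr(1.0/dist_a, 1e-6, dist_a, dist_b, dist_ab))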

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-18 Thread Tomas Vondra
Dne 18.12.2010 16:59, Florian Pflug napsal(a):
 On Dec18, 2010, at 08:10 , t...@fuzzy.cz wrote:
 
 On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
 Well, not really - I haven't done any experiments with it. For two
 columns selectivity equation is

 (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))

 where A and B are columns, dist(X) is number of distinct values in
 column X and sel(X) is selectivity of column X.

 Huh? This is the selectivity estimate for A = x AND B = y? Surely,
 if A and B are independent, the formula must reduce to sel(A) * sel(B),
 and I cannot see how that'd work with the formula above.

 Yes, it's a selectivity estimate for P(A=a and B=b). It's based on
 conditional probability, as

   P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a)

 and uniform correlation assumption so that it's possible to replace the
 conditional probabilities with constants. And those constants are then
 estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B).
 
 Ok, I think I understand this now. The selectivity equation actually
 *does* reduce to sel(A) * sel(B), *if* we pick a very simple estimate
 for sel(A).
 
 Take the clause A = x AND B = y for example. Without knowing anything
 about x and y, reasonable guesses for sel(A=x) and sel(B=y) are
 
   sel(A=x) = 1 / dist(A)
   sel(B=y) = 1 / dist(B).
 
 This is also what we do currently, according to var_eq_non_const() in
 src/backend/utils/adt/selfuncs.c, if we don't have any additional
 knowledge about x (Actually, we also factor the probability of A being
 NULL into this).
 
 With these estimates, your formula becomes
 
   sel(A=x,B=y) = 1 / dist(A,B).
 
 and if A and B are uncorrelated, dist(A,B) ~= dist(A) * dist(B), thus
 
   sel(A=x,B=y) = sel(A=x) * sel(B=y).
 
 If, however, y is a constant, then we use the MCVs to estimate sel(B=y)
 (var_eq_const() in src/backend/utils/adt/selfuncs.c). If
 
   sel(B=y) ~= 0,
 
 we'd currently also conclude that
 
   sel(A=x,B=y) ~= 0.
 
 With the uniform correlation approach, we'd instead estimate
 
   sel(A=x,B=y) ~= sel(A=x) / dist(B)
 
 assuming that dist(A,B) ~= dist(A)*dist(B), meaning A,B are uncorrelated.
 If dist(B) is small, this estimate is much worse than what we'd currently
 get, since we've effectively ignored the information that the restriction
 B=y alone guarantees that only very few rows will match.

Well, I guess you're right. Let's see two examples - uncorrelated and
highly correlated data (and then a few comments on what I think you're
missing).

uncorrelated
------------
Let there be 10 distinct values in A, 100 distinct values in B, and let
all possible combinations occur (10x100 = 1000). All the values (and
combinations) are uniformly distributed.

The current estimate is correct, i.e.

  sel(A=a) = 1/10 = 1/dist(A)
  sel(B=b) = 1/100 = 1/dist(B)
  sel(A=a, B=b) = sel(A) * sel(B) = 1/1000  /* due to AVI */

the proposed estimate is

  sel(A=a, B=b) = (dist(A)*sel(A=a) + dist(B)*sel(B=b)) / (2*dist(A,B))
= (10 * 1/10 + 100 * 1/100) / 2000 = 1/1000

so actually it produces the same estimate, but that's due to the uniform
distribution assumption.

Let's say the selectivities for A and B are very different - A is 10x
less common, B is 10x more common (let's say we know this). Then the
current estimate is still correct, i.e. it gives

  sel(A=a) = 1/100
  sel(B=b) = 1/10

  => sel(A=a,B=b) = 1/1000

but the new estimate is

  (10 * 1/100 + 100 * 1/10) / 2000 = (101/10) / 2000 ~ 1/200

So yes, in case of uncorrelated data this overestimates the result.


highly correlated
-----------------
Again, let dist(A)=10, dist(B)=100. But in this case let A be a function
of B, i.e. for each value in B there is a unique value in A. Say
A=mod(B,10) or something like that. So now dist(A,B)=100.

Assuming uniform distribution, the correct estimate is

  sel(A=a,B=b) = sel(B=b) = 1/100

and the current estimate is

  sel(A=a,B=b) = sel(A=a) * sel(B=b) = 1/10 * 1/100 = 1/1000

The proposed estimate is

  (10 * 1/10 + 100 * 1/100) / 200 = 2/200 = 1/100

which is correct.

Without the uniform distribution (again, sel(A)=1/100, sel(B)=1/10), we
currently get (just as before)

  sel(A=a,B=b) = 1/10 * 1/100 = 1/1000

which is 100x less than the correct value (sel(B)=1/10). The new
estimate yields

  (10*1/100 + 100*1/10) / 200 = (1010/100) / 200 ~ 1/20

which is not as accurate as with uniform distribution, but it's
considerably more accurate than the current estimate (1/1000).
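(The arithmetic above, as a Python sketch with the same invented numbers,
in case anyone wants to play with other values:)

    # compares the current AVI estimate with the proposed one
    def avi(sel_a, sel_b):
        return sel_a * sel_b

    def uniform_corr(sel_a, sel_b, dist_a, dist_b, dist_ab):
        return (dist_a * sel_a + dist_b * sel_b) / (2.0 * dist_ab)

    cases = [
        # label,                  sel_a, sel_b, dist_a, dist_b, dist_ab
        ("uncorrelated, uniform", 0.10,  0.01,  10,     100,    1000),
        ("uncorrelated, skewed",  0.01,  0.10,  10,     100,    1000),
        ("correlated, uniform",   0.10,  0.01,  10,     100,    100),
        ("correlated, skewed",    0.01,  0.10,  10,     100,    100),
    ]
    for label, sa, sb, da, db, dab in cases:
        print(label, avi(sa, sb), uniform_corr(sa, sb, da, db, dab))
    # uncorrelated, uniform -> AVI 0.001, proposed 0.001   (both correct)
    # uncorrelated, skewed  -> AVI 0.001 (correct), proposed ~1/200
    # correlated, uniform   -> AVI 0.001, proposed 0.01    (true value 0.01)
    # correlated, skewed    -> AVI 0.001, proposed ~1/20   (true value 0.1)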


comments
--------
I really don't think this is a perfect solution giving 100% accurate
results in all cases. But the examples above illustrate that it gives
much better estimates in case of correlated data.

It seems to me you're missing one very important thing - this was not
meant as a new default way to do estimates. It was meant as an option
when the user (DBA, developer, ...) realizes the current solution gives
really bad estimates (due to correlation). In that case he could create
'cross-column' statistics on those columns, 

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
Dne 12.12.2010 15:43, Heikki Linnakangas napsal(a):
 On 12.12.2010 15:17, Martijn van Oosterhout wrote:
 On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
 Very cool that you're working on this.
 
 +1
 
 Lets talk about one special case - I'll explain how the proposed
 solution works, and then I'll explain how to make it more general, what
 improvements are possible, what issues are there. Anyway this is by no
 means a perfect or complete solution - it's just a starting point.

 It looks like you handled most of the issues. Just a few points:

 - This is obviously applicable to more than just integers, probably
anything with a b-tree operator class. What you've coded seems rely
on calculations on the values. Have you thought about how it could
work for, for example, strings?

 The classic failure case has always been: postcodes and city names.
 Strongly correlated, but in a way that the computer can't easily see.
 
 Yeah, and that's actually analogous to the example I used in my
 presentation.
 
 The way I think of that problem is that once you know the postcode,
 knowing the city name doesn't add any information. The postcode implies
 the city name. So the selectivity for postcode = ? AND city = ? should
 be the selectivity of postcode = ? alone. The measurement we need is
 implicativeness: How strongly does column A imply a certain value for
 column B. Perhaps that could be measured by counting the number of
 distinct values of column B for each value of column A, or something
 like that. I don't know what the statisticians call that property, or if
 there's some existing theory on how to measure that from a sample.
 
 That's assuming the combination has any matches. It's possible that the
 user chooses a postcode and city combination that doesn't exist, but
 that's no different from a user doing city = 'fsdfsdfsd' on a single
 column, returning no matches. We should assume that the combination
 makes sense.

Well, I've read several papers this week, in an attempt to find out what
possibilities are there when implementing cross-column statistics. One
of the papers seems like a reasonable solution to this particular
problem - discrete data + equality queries.

The article is A Bayesian Approach to Estimating The Selectivity of
Conjunctive Predicates (written by Heimel, Markl and Murthy). It's
freely available as a PDF

http://subs.emis.de/LNI/Proceedings/Proceedings144/52.pdf

I do have a PDF with my own notes in it (as annotations), let me know if
you're interested ...

The basic principle is that instead of 'attribute value independence'
(which is the problem with correlated columns) they assume 'uniform
correlation' which is a much weaker assumption.

In the end, all they need to compute an estimate is number of distinct
values for each of the columns (we already have that in pg_stats) and a
number of distinct values for the group of columns in a query. They
really don't need any multidimensional histogram or something like that.

The funny thing is I've implemented most of the PoC before reading the
article - the only difference was the way to combine the estimates (

-- pros and cons --

So let's see the pros:

* it's reasonably simple, the overhead should be minimal I guess (we're
  already estimating distinct values for the columns, so I guess we can
  do that for multiple columns with reasonable impact)

* there are no 'advanced data structures' such as multi-dimensional
   histograms that are expensive to build and maintain

* it consistently produces better estimates than the current estimator
  based on the attribute value independence assumption, and it's
  reasonably accurate (I've been playing with it for some time, but we'll
  need more tests of course)

* it's easily extensible to more columns

* I guess we could improve the estimates by our own heuristics, to
  catch the special case when one column is perfectly implied by
  another one (e.g. when state is implied by ZIP code) - we can catch
  that by 'distinct for combination = distinct for column' and use just
  the estimate for one of the columns

but there are some cons too:

* this really is not designed to work with real-valued data, it's a
  very nice solution for discrete data (namely 'labels' such as
  city/state names, ZIP codes etc.)

* you need to know the list of columns in advance (there's nothing like
  'let's build' the stats for columns (a,b,c) and then query just (a,b))
  as you need to know the count of distinct values for the queried
  columns (well, maybe it's possible - I guess we could 'integrate'
  over all possible values of the omitted column)

* it's built for 'equality' queries, although maybe we could modify it
  for inequalities too (but it won't be as elegant), but I guess the
  equality queries are much more common in case of discrete data (OK,
  you can run something like

WHERE (zip_code BETWEEN '49389' AND '54980') AND state = '48'

  

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Robert Haas
On Fri, Dec 17, 2010 at 12:58 PM, Tomas Vondra t...@fuzzy.cz wrote:
 In the end, all they need to compute an estimate is number of distinct
 values for each of the columns (we already have that in pg_stats) and a
 number of distinct values for the group of columns in a query. They
 really don't need any multidimensional histogram or something like that.

I haven't read the paper yet (sorry) but just off the top of my head,
one possible problem here is that our n_distinct estimates aren't
always very accurate, especially for large tables.  As we've discussed
before, making them accurate requires sampling a significant
percentage of the table, whereas all of our other statistics can be
computed reasonably accurately by sampling a fixed amount of an
arbitrarily large table.  So it's possible that relying more heavily
on n_distinct could turn out worse overall even if the algorithm is
better.  Not sure if that's an issue here, just throwing it out
there...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
Hi,

I've read about 10 papers about estimation this week, and I have gained
some insight. I'm not going to describe each of the papers here, I plan
to set up a wiki page about cross column statistics

  http://wiki.postgresql.org/wiki/Cross_Columns_Stats

and I'll put the list of papers and some basic info there. There was a
lot of discussion in this thread, and it's difficult to follow it, so
I'll put there some details about the proposal etc.

So I'll just briefly describe the interesting things I've learned from
those papers.

-- instances of the problem --

Generally, there are four quite different cases of the problem.

First, the columns may be either real-valued or discrete. And by
'discrete' I mean the value is a label rather than a number. It does not
make sense to add or subtract them, etc. So for example city names or
zip codes are discrete values (although zip codes are numbers etc).

Second, the queries consist of equality or inequality (range)
conditions.

So actually there are four instances of the problem:

                    | equality conditions | inequality conditions |
   -----------------+---------------------+-----------------------+
   discrete values  |          A          |           D           |
   numeric values   |          C          |           B           |
   -----------------+---------------------+-----------------------+

I think those four problems should be treated separately, although some
of the techniques may be common.

A) discrete values and equality conditions

   One of the papers (A Bayesian Approach to Estimating The Selectivity
   of Conjunctive Predicates) describes a quite interesting solution to
   this problem - I've already posted a description on how to apply it
   to the ZIP code / city name problem - see this

   http://archives.postgresql.org/pgsql-hackers/2010-12/msg01576.php

B) numeric values and inequality conditions

   Most of the papers dealing with this problem are based on
   discretization and multi-dimensional histograms to approximate the
   joint distribution. So I believe my initial proposal was not
   complete nonsense.

   Once we have a working histogram-based solution, we can work on
   precision and efficiency (how much space is needed to store it, how
   long does it take to compute an estimate, etc.). There are two
   ways to do that.

   First, most of the papers offer an enhanced type of histogram
   (although the histogram proposed in the paper is always evaluated as
   the most efficient one, which is a bit suspicious). Anyway there are
   advanced quite-promising ways to build multi-dimensional histograms.

   Second, the paper (Selectivity estimation using probabilistic
   models) describes a solution based on Bayesian networks. That seems
   really promising (it actually addresses join selectivity
   estimation too).

   So yes, I'm quite confident we can solve this issue and improve the
   efficiency - either by an advanced histogram or using Bayesian
   networks. (A toy histogram sketch follows at the end of this list.)

C) numeric values and equality conditions

   OK, I'm not sure how to solve this case. But the queries against
   numeric columns are range queries in most cases I guess, so maybe
   that's not that big a deal.

D) discrete values and inequality conditions

   Basically, this can be handled just like numeric values after
   discretization, i.e. using a histogram. But this is usually

E) a combination of discrete / numeric columns

   I'm not sure how to deal with this. Obviously it's possible to
   build a multi-dimensional histogram, and estimate as many queries as
   possible.
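
To illustrate what the simplest histogram-based estimator for case B) above
could look like, here is a toy sketch (Python/numpy, equi-width 2-D histogram
built from a random sample; the data and the sample size are made up, and
partially covered cells are simply ignored):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100000
    a = rng.normal(0, 1, n)
    b = a + rng.normal(0, 0.1, n)              # strongly correlated columns

    sample = rng.choice(n, 3000, replace=False)    # ANALYZE-style sample
    hist, xedges, yedges = np.histogram2d(a[sample], b[sample], bins=20)
    hist = hist / hist.sum()                       # joint frequency per cell

    def est_selectivity(x, y):
        # selectivity of "a < x AND b < y": sum the cells that lie
        # entirely below (x, y), ignoring the partially covered ones
        xi = np.searchsorted(xedges, x) - 1
        yi = np.searchsorted(yedges, y) - 1
        return hist[:max(xi, 0), :max(yi, 0)].sum()

    true_sel = np.mean((a < 0) & (b < 0))
    avi_sel = np.mean(a < 0) * np.mean(b < 0)      # independence assumption
    print(true_sel, avi_sel, est_selectivity(0.0, 0.0))
    # the histogram estimate tracks the true selectivity (~0.5) much more
    # closely than the AVI estimate (~0.25)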

---

The papers describe some interesting alternative approaches, e.g. based
on SVD (singular value decomposition) or random sampling (or kernel
estimators, which are an enhanced version of sampling).

But there are various issues associated with those solutions. SVD is
applicable to 2D only, so we'd be stuck with 2 columns, and random
sampling sounds a bit suspicious to me.

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
Dne 17.12.2010 19:58, Robert Haas napsal(a):
 I haven't read the paper yet (sorry) but just off the top of my head,
 one possible problem here is that our n_distinct estimates aren't
 always very accurate, especially for large tables.  As we've discussed
 before, making them accurate requires sampling a significant
 percentage of the table, whereas all of our other statistics can be
 computed reasonably accurately by sampling a fixed amount of an
 arbitrarily large table.  So it's possible that relying more heavily
 on n_distinct could turn out worse overall even if the algorithm is
 better.  Not sure if that's an issue here, just throwing it out
 there...

Yes, you're right - the paper really is based on (estimates of) number
of distinct values for each of the columns as well as for the group of
columns.

AFAIK it will work with reasonably precise estimates, but the point is
you need an estimate of distinct values of the whole group of columns.
So when you want to get an estimate for queries on columns (a,b), you
need the number of distinct value combinations of these two columns.

And I think we're not collecting this right now, so this solution
requires scanning the table (or some part of it).

I know this is a weak point of the whole solution, but the truth is
every cross-column stats solution will have to do something like this. I
don't think we'll find a solution with zero performance impact, without the
need to scan a sufficient part of the table.

That's why I want to make this optional so that the users will use it
only when really needed.

Anyway one possible solution might be to allow the user to set these
values manually (as in case when ndistinct estimates are not precise).
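
Just to sketch what the manual route could look like from the outside
(hypothetical table and column names, plain Python + psycopg2; this is not
a proposal for the actual mechanism, only a way to measure the numbers the
estimator needs):

    # ndistinct for two columns and for the column group, on a ~1% sample
    import psycopg2

    conn = psycopg2.connect("dbname=test")      # hypothetical database
    cur = conn.cursor()
    cur.execute("""
        SELECT count(DISTINCT zip),
               count(DISTINCT city),
               count(DISTINCT (zip, city))
        FROM addresses
        WHERE random() < 0.01
    """)
    d_zip, d_city, d_both = cur.fetchone()
    print(d_zip, d_city, d_both)
    # if d_both is close to max(d_zip, d_city), one column nearly implies
    # the other and the cross-column stats should pay off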

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 AFAIK it will work with reasonably precise estimates, but the point is
 you need an estimate of distinct values of the whole group of columns.
 So when you want to get an estimate for queries on columns (a,b), you
 need the number of distinct value combinations of these two columns.

That seems likely to be even more unreliable than our existing
single-column estimates :-(

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Heikki Linnakangas

On 17.12.2010 23:13, Tomas Vondra wrote:

Dne 17.12.2010 19:58, Robert Haas napsal(a):

I haven't read the paper yet (sorry) but just off the top of my head,
one possible problem here is that our n_distinct estimates aren't
always very accurate, especially for large tables.  As we've discussed
before, making them accurate requires sampling a significant
percentage of the table, whereas all of our other statistics can be
computed reasonably accurately by sampling a fixed amount of an
arbitrarily large table.  So it's possible that relying more heavily
on n_distinct could turn out worse overall even if the algorithm is
better.  Not sure if that's an issue here, just throwing it out
there...


Yes, you're right - the paper really is based on (estimates of) number
of distinct values for each of the columns as well as for the group of
columns.

AFAIK it will work with reasonably precise estimates, but the point is
you need an estimate of distinct values of the whole group of columns.
So when you want to get an estimate for queries on columns (a,b), you
need the number of distinct value combinations of these two columns.

And I think we're not collecting this right now, so this solution
requires scanning the table (or some part of it).


Any idea how sensitive it is to the accuracy of that estimate on 
distinct value combinations? If we get that off by a factor of ten or a 
hundred, what kind of an effect does it have on the final cost estimates?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
Dne 17.12.2010 22:24, Tom Lane napsal(a):
 That seems likely to be even more unreliable than our existing
 single-column estimates :-(
 
   regards, tom lane

Well, yes :-(

I guess this is a place where we could use a multi-column index, if it
contains all the interesting columns. We could scan the index, not the
whole table.

That could save us a tremendous amount of I/O and should be quite precise
(unless it's severely bloated).

Another thing is those 'discrete' columns are usually quite stable, i.e.
there's usually a limited list of values and it does not change very
often. Think about ZIP codes, states, etc. And the combinations are
quite stable too (counties do not move to other states, etc.).

So I think scanning a reasonable part of a table should be enough in
these cases.

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
Dne 17.12.2010 22:41, Heikki Linnakangas napsal(a):
 On 17.12.2010 23:13, Tomas Vondra wrote:
 Dne 17.12.2010 19:58, Robert Haas napsal(a):
 I haven't read the paper yet (sorry) but just off the top of my head,
 one possible problem here is that our n_distinct estimates aren't
 always very accurate, especially for large tables.  As we've discussed
 before, making them accurate requires sampling a significant
 percentage of the table, whereas all of our other statistics can be
 computed reasonably accurately by sampling a fixed amount of an
 arbitrarily large table.  So it's possible that relying more heavily
 on n_distinct could turn out worse overall even if the algorithm is
 better.  Not sure if that's an issue here, just throwing it out
 there...

 Yes, you're right - the paper really is based on (estimates of) number
 of distinct values for each of the columns as well as for the group of
 columns.

 AFAIK it will work with reasonably precise estimates, but the point is
 you need an estimate of distinct values of the whole group of columns.
 So when you want to get an estimate for queries on columns (a,b), you
 need the number of distinct value combinations of these two columns.

 And I think we're not collecting this right now, so this solution
 requires scanning the table (or some part of it).
 
 Any idea how sensitive it is to the accuracy of that estimate on
 distinct value combinations? If we get that off by a factor of ten or a
 hundred, what kind of an effect does it have on the final cost estimates?

Well, not really - I haven't done any experiments with it. For two
columns selectivity equation is

  (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))

where A and B are columns, dist(X) is number of distinct values in
column X and sel(X) is selectivity of column X.

dependency on dist(A), dist(B) and dist(A,B)
--------------------------------------------

So it seems to be proportional to dist(A) and dist(B), and inversely
proportional to dist(A,B). So when increasing dist(A) and dist(B)
10x, you'll get 10x the original estimate. Similarly, by increasing
dist(A,B) 10x, you'll get a 10x lower estimate.

upper bound
---

Unless dist(A) or dist(B) is greater than dist(A,B), the estimated
selectivity is bounded by

  (sel(A) + sel(B)) / 2

I guess we can say that (sel(A)  sel(A,B)) is generally the same as

  sel(A) = sel(A,B)

  i.e. we can use the heuristic I've already offered in the PoC.

lower bound
---

Not sure what the lower bound is, but it might be 0 if the dist(A) and
dist(B) are very small and/or dist(A,B) is huge.
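
Or, the same thing in code form (invented numbers): a 10x error in the
dist(A,B) estimate simply shifts the final estimate by 10x, which is
probably the most important thing to keep in mind here.

    def uniform_corr(sel_a, sel_b, dist_a, dist_b, dist_ab):
        return (dist_a * sel_a + dist_b * sel_b) / (2.0 * dist_ab)

    for dist_ab in (100, 1000, 10000):    # "true" value, 10x off, 100x off
        print(dist_ab, uniform_corr(0.1, 0.01, 10, 100, dist_ab))
    # 100   -> 0.01
    # 1000  -> 0.001
    # 10000 -> 0.0001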

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Florian Pflug
On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
 Well, not really - I haven't done any experiments with it. For two
 columns selectivity equation is
 
  (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
 
 where A and B are columns, dist(X) is number of distinct values in
 column X and sel(X) is selectivity of column X.

Huh? This is the selectivity estimate for A = x AND B = y? Surely,
if A and B are independent, the formula must reduce to sel(A) * sel(B),
and I cannot see how that'd work with the formula above.

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread tv
 On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
 Well, not really - I haven't done any experiments with it. For two
 columns selectivity equation is

  (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))

 where A and B are columns, dist(X) is number of distinct values in
 column X and sel(X) is selectivity of column X.

 Huh? This is the selectivity estimate for A = x AND B = y? Surely,
 if A and B are independent, the formula must reduce to sel(A) * sel(B),
 and I cannot see how that'd work with the formula above.

Yes, it's a selectivity estimate for P(A=a and B=b). It's based on
conditional probability, as

   P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a)

and uniform correlation assumption so that it's possible to replace the
conditional probabilities with constants. And those constants are then
estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B).

So it does not reduce to sel(A)*sel(B) exactly, as the dist(A)/dist(A,B)
is just an estimate of P(B|A). The paper states that this works best for
highly correlated data, while for low correlated data it (at least)
matches the usual estimates.

I don't say it's perfect, but it seems to produce reasonable estimates.

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Yeb Havinga

On 2010-12-13 03:28, Robert Haas wrote:

Well, I'm not real familiar with contingency tables, but it seems like
you could end up needing to store a huge amount of data to get any
benefit out of it, in some cases.  For example, in the United States,
there are over 40,000 postal codes, and some even larger number of
city names, and doesn't the number of entries go as O(m*n)?  Now maybe
this is useful enough anyway that we should Just Do It, but it'd be a
lot cooler if we could find a way to give the planner a meaningful
clue out of some more compact representation.
A sparse matrix that holds only 'implicative' (P(A|B)  P(A*B)?) 
combinations? Also, some information might be deduced from others. For 
Heikki's city/region example, for each city it would be known that it is 
100% in one region. In that case it suffices to store only that 
information, since 0% in all other regions can be deduced. I wouldn't be
surprised if storing implicatures like this would reduce the size to O(n).


regards,
Yeb Havinga


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread tv
 On 2010-12-13 03:28, Robert Haas wrote:
 Well, I'm not real familiar with contingency tables, but it seems like
 you could end up needing to store a huge amount of data to get any
 benefit out of it, in some cases.  For example, in the United States,
 there are over 40,000 postal codes, and some even larger number of
 city names, and doesn't the number of entries go as O(m*n)?  Now maybe
 this is useful enough anyway that we should Just Do It, but it'd be a
 lot cooler if we could find a way to give the planner a meaningful
 clue out of some more compact representation.
 A sparse matrix that holds only 'implicative' (P(A|B)  P(A*B)?)
 combinations? Also, some information might be deduced from others. For
 Heikki's city/region example, for each city it would be known that it is
 100% in one region. In that case it suffices to store only that
 information, since 0% in all other regions can be deduced. I wouldn't be
 surprised if storing implicatures like this would reduce the size to O(n).

OK, but I'll leave this for the future. My plan is to build a small PoC,
just to see whether the contingency-table + probability-estimates approach
works for the failure case mentioned by Heikki. I'd like to do this by
the end of this week, if possible.

I'll read the articles mentioned by Nathan Boley (thanks for those links,
if you have more of them just let me know).

Once we have a solution that solves (or significantly improves) these
failure cases, we can make further plans (how to actually do that in the
code etc.).

BTW thanks for all the comments!

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 Well, until this point we've discussed failure cases involving 'AND'
 conditions. What about 'OR' conditions? I think the current optimizer
 computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
 found in backend/optimizer/path/clausesel.c:630).

If you can solve the AND case, the OR case falls out of that.  Just
replace s1*s2 with a more accurate AND calculation.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
 The proposed solution is based on contingency tables, built for selected
 groups of columns (not for each possible group). And the contingency
 table gives you the ability to estimate the probabilities needed to
 compute the selectivity. Or am I missing something?

 Well, I'm not real familiar with contingency tables, but it seems like
 you could end up needing to store a huge amount of data to get any
 benefit out of it, in some cases.

The reason that this wasn't done years ago is precisely that nobody's
figured out how to do it with a tolerable amount of stats data and a
tolerable amount of processing time (both at ANALYZE time and during
query planning).  It's not hard to see what we'd ideally like to do;
it's getting from there to something useful in production that's hard.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Joshua Tolley
On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote:
 Another quick note: I think that storing the full contingency table is
 wasteful since the marginals are already stored in the single column
 statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was
 looking at this a couple years back ).

Josh Tolley still looks at it occasionally, though time hasn't permitted any
sort of significant work for quite some time. The multicolstat branch on my
git.postgresql.org repository will create an empirical copula for each
multi-column index, and stick it in pg_statistic. It doesn't yet do anything
useful with that information, nor am I convinced it's remotely bug-free. In a
brief PGCon discussion with Tom a while back, it was suggested a good place
for the planner to use these stats would be clausesel.c, which is responsible
for handling code such as ...WHERE foo > 4 AND foo < 5.

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com




Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tomas Vondra
Dne 13.12.2010 16:34, Tom Lane napsal(a):
 Tomas Vondra t...@fuzzy.cz writes:
 Well, until this point we've discussed failure cases involving 'AND'
 conditions. What about 'OR' conditions? I think the current optimizer
 computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
 found in backend/optimizer/path/clausesel.c:630).
 
 If you can solve the AND case, the OR case falls out of that.  Just
 replace s1*s2 with a more accurate AND calculation.

Oh yeah, now I see - it's just the usual equation

   P(A or B) = P(A) + P(B) - P(A and B)

and we're estimating P(A and B) as P(A)*P(B).
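
(A one-line sketch of how a better AND estimate would feed into the OR
case - the selectivities are invented:)

    # clausesel-style OR estimate, with the AND part swapped out for a
    # hypothetically better cross-column estimate
    s1, s2 = 0.01, 0.10              # invented per-clause selectivities
    s_and_avi = s1 * s2              # current independence assumption
    s_and_better = 0.01              # e.g. when clause 1 implies clause 2
    print(s1 + s2 - s_and_avi)       # ~0.109
    print(s1 + s2 - s_and_better)    # ~0.1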

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tomas Vondra
Dne 13.12.2010 16:38, Tom Lane napsal(a):
 The reason that this wasn't done years ago is precisely that nobody's
 figured out how to do it with a tolerable amount of stats data and a
 tolerable amount of processing time (both at ANALYZE time and during
 query planning).  It's not hard to see what we'd ideally like to do;
 it's getting from there to something useful in production that's hard.

OK, I fully realize that. My plan is to simply

  (a) find out what statistics do we need to collect and how to use it
  (b) implement a really stupid inefficient solution
  (c) optimize in iterations, i.e. making it faster, consuming less
  space etc.

It will take time, it won't be perfect for a long time, but every
journey starts somewhere.

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tomas Vondra
Dne 13.12.2010 18:59, Joshua Tolley napsal(a):
 On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote:
 Another quick note: I think that storing the full contingency table is
 wasteful since the marginals are already stored in the single column
 statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was
 looking at this a couple years back ).
 
 Josh Tolley still looks at it occasionally, though time hasn't permitted any
 sort of significant work for quite some time. The multicolstat branch on my
 git.postgresql.org repository will create an empirical copula for each
 multi-column index, and stick it in pg_statistic. It doesn't yet do anything
 useful with that information, nor am I convinced it's remotely bug-free. In a
 brief PGCon discussion with Tom a while back, it was suggested a good place
 for the planner to use these stats would be clausesel.c, which is responsible
 for handling code such as ...WHERE foo > 4 AND foo < 5.

Well, that's good news ;-)

I've done a bit of research today, and I've found some quite interesting
papers on this topic (probably interesting, anyway - I did not have time
to read them; in most cases I've read just the title and abstract).

[1] Selectivity estimators for multidimensional range queries over real
attributes
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.122.914

This seems like a very good starting point. AFAIK it precisely
describes what data need to be collected, how to do the math etc.

[2] Selectivity Estimation Without the Attribute Value Independence
Assumption
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.8126

This obviously deals with the independence problem. Haven't
investigated it further, but it seems worth to read.

[3] On Analyzing the Cost of Queries with Multi-Attribute Restrictions
and Sort Operations (A Cost Function for Uniformly Partitioned
UB-Trees)
http://mistral.in.tum.de/results/publications/MB00_ideas.pdf

Describes something called UB-Tree, and shows how it may be used to
do estimates. Might be interesting as an alternative to the
traditional histograms.

There are more details about UB-Trees at

http://mistral.in.tum.de/results/publications/

[4] http://www.dbnet.ece.ntua.gr/~nikos/edith/qopt_bibl/

A rather nice collection of papers related to estimation (including
some of the papers listed above).

Hm, I planned to finally read the Understanding MySQL Internals over
the Xmas ... that obviously won't happen.

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Josh Berkus
Tomas,

   (a) find out what statistics do we need to collect and how to use it
   (b) implement a really stupid inefficient solution
   (c) optimize in iterations, i.e. making it faster, consuming less
   space etc.

I'll suggest again how to decide *which* columns to cross: whichever
columns are combined in composite indexes.  In version 2, allow the DBA
to specify combinations.

In the unlikely event that correlation could be reduced to a single
float number, it would be conceivable for each column to have an array
of correlation stats for every other column where correlation was
non-random; on most tables (i.e. ones with less than 100 columns) we're
not talking about that much storage space.

The main cost would be the time spent collecting that info ...

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tomas Vondra
Dne 13.12.2010 22:50, Josh Berkus napsal(a):
 Tomas,
 
   (a) find out what statistics do we need to collect and how to use it
   (b) implement a really stupid inefficient solution
   (c) optimize in iterations, i.e. making it faster, consuming less
   space etc.
 
 I'll suggest again how to decide *which* columns to cross: whichever
 columns are combined in composite indexes.  In version 2, allow the DBA
 to specify combinations.
 
 In the unlikely event that correlation could be reduced to a single
 float number, it would be conceivable for each column to have an array
 of correlation stats for every other column where correlation was
 non-random; on most tables (i.e. ones with less than 100 columns) we're
 not talking about that much storage space.
 
 The main cost would be the time spent collecting that info ...

I think this is a bit early to discuss this, given the fact that we
don't have a working solution yet. But OK, let's discuss these options
anyway

1) collecting them automatically for composite indexes

   I don't think this is wise idea. The first versions definitely won't
   be very efficient, and collecting the data for each composite
   index means everyone will be hit by this inefficiency, even if he
   actually does not need that (e.g. the columns are independent so the
   current estimates are quite accurate or he's not using those columns
   very often in the same WHERE clause).

   Another reason against this is that many DBAs don't actually use
   composite indexes - they simply create indexes on each column and let
   the bitmap index scan work it out. And this would not work for
   this case.

   And actually it's not very complicated to allow the DBA to do this,
   this can be a quite simple PL/pgSQL procedure.

2) collecting correlation for each pair of columns

   Again, you're effectively forcing everyone to pay the price even
   though he may not need the feature. Maybe we'll get there one day,
   but it's not a good idea to do that from the beginning.

   And the correlation itself has a very limited use in real life, as
   it's not possible to compute it for character columns and is not
   very useful in case of some numerical columns (e.g. ZIP codes).

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Pavel Stehule
2010/12/13 Josh Berkus j...@agliodbs.com:
 Tomas,

   (a) find out what statistics do we need to collect and how to use it
   (b) implement a really stupid inefficient solution
   (c) optimize in iterations, i.e. making it faster, consuming less
       space etc.

 I'll suggest again how to decide *which* columns to cross: whichever
 columns are combined in composite indexes.  In version 2, allow the DBA
 to specify combinations.

Is it really a good idea? A composite index can be created when the
single columns are not unique enough - (name, surname). DBA
specification can be cheaper. Can we set options for a relation? Then
this could be used.

Pavel


 In the unlikely event that correlation could be reduced to a single
 float number, it would be conceivable for each column to have an array
 of correlation stats for every other column where correlation was
 non-random; on most tables (i.e. ones with less than 100 columns) we're
 not talking about that much storage space.

 The main cost would be the time spent collecting that info ...

 --
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Robert Haas
On Mon, Dec 13, 2010 at 5:45 PM, Tomas Vondra t...@fuzzy.cz wrote:
 I'll suggest again how to decide *which* columns to cross: whichever
 columns are combined in composite indexes.  In version 2, allow the DBA
 to specify combinations.
 I think this is a bit early to discuss this, given the fact that we
 don't have a working solution yet.

Agreed.  Not to mention that adding syntax is pretty easy.  The hard
part is figuring out what data you want to collect and what you want
to do with it.  If we can figure that out, the syntax will be simple
by comparison.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Martijn van Oosterhout
On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
 Hi everyone,
 
 one of the sessions I attended at PgDay last week was Heikki's session
 about statistics in PostgreSQL. One of the issues he mentioned (and one
 I regularly run into) is the absence of cross-column stats. When the
 columns are not independent, this usually results in poor estimates (and
 then in suboptimal plans).

Very cool that you're working on this.

 Lets talk about one special case - I'll explain how the proposed
 solution works, and then I'll explain how to make it more general, what
 improvements are possible, what issues are there. Anyway this is by no
 means a perfect or complete solution - it's just a starting point.

It looks like you handled most of the issues. Just a few points:

- This is obviously applicable to more than just integers, probably
  anything with a b-tree operator class. What you've coded seems to rely
  on calculations on the values. Have you thought about how it could
  work for, for example, strings?

The classic failure case has always been: postcodes and city names.
Strongly correlated, but in a way that the computer can't easily see.
Not that I suggest you fix this, but it's food for thought. Though
strictly speaking this is a different kind of correlation than what
you're looking at.

 2) I really don't think we should collect stats for all combinations of
columns of a table - I do like the Oracle-like approach where a DBA
has to enable cross-column stats using an ALTER TABLE (for a
particular list of columns).
 
The only exception might be columns from a multi-column index. It
might be quite efficient I guess?

In the past it has been suggested to only do it for multi-column
indexes, but I find these days I find in some situations I prefer to
make individual indexes and let the bitmap scan code combine them. So
perhaps it would be best to let it be configured by the DBA.

 3) There are independence tests for contingency tables (e.g. Pearson's
Chi-squared test), so that it's easy to find out whether the columns
are independent. In that case we can just throw away these stats and
use the simple estimation.
 
http://mathworld.wolfram.com/Chi-SquaredTest.html

I think this would be good to include, if possible. 
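
For what it's worth, the test itself is cheap once the contingency table
exists; a sketch with scipy and a made-up 2x2 table, just to show the
shape of the check:

    # drop the cross-column stats if the columns look independent
    from scipy.stats import chi2_contingency

    table = [[480, 20],     # made-up sampled counts for two
             [20, 480]]     # two-valued column groupings
    chi2, p, dof, expected = chi2_contingency(table)
    if p > 0.05:
        print("looks independent - single-column stats are enough")
    else:
        print("dependence detected - cross-column stats are worth keeping")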

Actually, I wonder if the existing stats collection code could be
altered to attempt to calculate the correlation between columns as part
of its other work.

 4) Or we could store just those cells where expected and observed values
   differ significantly (may help if most of the values are independent,
but there's a small glitch somewhere).

Compressing that grid would be useful, given that for many dimensions
most of the grid will not be interesting. In fact, storing the 20
largest values may be enough. Worth an experiment.

Hope this helps,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Patriotism is when love of your own people comes first; nationalism,
 when hate for people other than your own comes first. 
   - Charles de Gaulle




Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Heikki Linnakangas

On 12.12.2010 15:17, Martijn van Oosterhout wrote:

On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
Very cool that you're working on this.


+1


Lets talk about one special case - I'll explain how the proposed
solution works, and then I'll explain how to make it more general, what
improvements are possible, what issues are there. Anyway this is by no
means a perfect or complete solution - it's just a starting point.


It looks like you handled most of the issues. Just a few points:

- This is obviously applicable to more than just integers, probably
   anything with a b-tree operator class. What you've coded seems rely
   on calculations on the values. Have you thought about how it could
   work for, for example, strings?

The classic failure case has always been: postcodes and city names.
Strongly correlated, but in a way that the computer can't easily see.


Yeah, and that's actually analogous to the example I used in my 
presentation.


The way I think of that problem is that once you know the postcode, 
knowing the city name doesn't add any information. The postcode implies 
the city name. So the selectivity for postcode = ? AND city = ? should 
be the selectivity of postcode = ? alone. The measurement we need is 
implicativeness: How strongly does column A imply a certain value for 
column B. Perhaps that could be measured by counting the number of 
distinct values of column B for each value of column A, or something 
like that. I don't know what the statisticians call that property, or if 
there's some existing theory on how to measure that from a sample.


That's assuming the combination has any matches. It's possible that the 
user chooses a postcode and city combination that doesn't exist, but 
that's no different from a user doing city = 'fsdfsdfsd' on a single 
column, returning no matches. We should assume that the combination 
makes sense.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
Hi!

Dne 12.12.2010 15:17, Martijn van Oosterhout napsal(a):
 Lets talk about one special case - I'll explain how the proposed
 solution works, and then I'll explain how to make it more general, what
 improvements are possible, what issues are there. Anyway this is by no
 means a perfect or complete solution - it's just a starting point.
 
 It looks like you handled most of the issues. Just a few points:
 
 - This is obviously applicable to more than just integers, probably
   anything with a b-tree operator class. What you've coded seems rely
   on calculations on the values. Have you thought about how it could
   work for, for example, strings?

Yes, I know, I just forgot to address this in my previous e-mail. The
contingency tables have a really nice feature - they are based on
splitting the sets into groups (~ bins of the histograms for each
column). And this can be done whenever you can sort the values - you really
don't need any calculations. So it should work with strings.

And another thing I somehow forgot is handling the case when there is no
histogram, just an MCV list. That's mostly the same - each of the values
might be a separate group, or the values might be grouped to form fewer
groups, etc.

 The classic failure case has always been: postcodes and city names.
 Strongly correlated, but in a way that the computer can't easily see.
 Not that I suggest you fix this, but it's food for though. Though
 strictly speaking this is a different kind of correlation than what
 you're looking at.

Hmmm, I see. I think the proposal does not fix this particular case,
although it might improve the situation a little bit (limit the error
between expected and observed number of rows).

The problem is that once we get to a cell-level of the contingency
table, there is no additional (more detailed) information. So we're
stuck with the multiplication estimate, or something like that.

I was thinking about it actually, and I think we could collect some more
info - a correlation coefficient for each bin, or something like that.
But that was not part of my proposal, and I'm not sure how to do that.

 2) I really don't think we should collect stats for all combinations of
columns of a table - I do like the Oracle-like approach where a DBA
has to enable cross-column stats using an ALTER TABLE (for a
particular list of columns).

The only exception might be columns from a multi-column index. It
might be quite efficient I guess?
 
 In the past it has been suggested to only do it for multi-column
 indexes, but I find these days I find in some situations I prefer to
 make individual indexes and let the bitmap scan code combine them. So
 perhaps it would be best to let it be configured by the DBA.

Yes, I prefer individual indexes too.

The idea behind collecting cross-column stats for multi-column indexes
was that maybe we could 'append' this to the current functionality
(building the index or something like that) so that it does not
introduce significant performance problems.

 3) There are independence tests for contingency tables (e.g. Pearson's
Chi-squared test), so that it's easy to find out whether the columns
are independent. In that case we can just throw away these stats and
use the simple estimation.

http://mathworld.wolfram.com/Chi-SquaredTest.html
 
 I think this would be good to include, if possible. 
 
 Actually, I wonder if the existing stats collection code could be
 altered to attempt to calculate the correlation between columns as part
 of its other work.

I guess that would be rather expensive - to compute correlation you need
two passes, and you need to do that for each pair of columns. So I'd be
surprised if it is possible (and effective).

Another thing is that you can compute correlation only for numeric
columns, so it's not possible to do that for city/ZIP code mentioned
above. More precisely - it's possible to do that (if you map strings to
numbers somehow), but I doubt you'll get useful results as the
assignment is rather random.

Well, you could ask the governments to assign the ZIP codes to cities in
strictly alphabetical order, but I guess they'll say no.

 4) Or we could store just those cells where expected and observed values
    differ significantly (may help if most of the values are independent,
but there's a small glitch somewhere).
 
 Compressing that grid would be useful, given that for many dimensions
 most of the grid will not be interesting. In fact, storing the 20
 largest values may be enough. Worth an experiment.

Not exactly just the largest values - rather values that are
significantly different from the expected values. Generally there are
two interesting cases

expected < observed - The optimizer may choose index scan, although
   the seq scan would be better.

expected > observed - The optimizer may choose seq scan, although
   the index scan would be better.
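
A tiny sketch of that idea (Python, made-up contingency counts): keep only
the cells whose observed count deviates from the AVI-expected one by more
than some threshold, and fall back to the independence estimate everywhere
else.

    # counts[(a_bin, b_bin)] = observed rows in the sample (made-up data)
    counts = {(0, 0): 400, (0, 1): 80, (0, 2): 120,
              (1, 0): 150, (1, 1): 90, (1, 2): 60,
              (2, 0): 50,  (2, 1): 30, (2, 2): 20}
    total = sum(counts.values())

    row, col = {}, {}                  # marginal counts per bin
    for (a, b), c in counts.items():
        row[a] = row.get(a, 0) + c
        col[b] = col.get(b, 0) + c

    kept = {}
    for (a, b), c in counts.items():
        expected = row[a] * col[b] / total        # AVI expectation
        if abs(c - expected) / expected > 0.25:   # "significantly different"
            kept[(a, b)] = c / total
    print(kept)
    # only 3 of the 9 cells get stored; the other 6 are close enough to
    # the independence assumption to be left out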

regards
Tomas

-- 
Sent via pgsql-hackers mailing 

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
Dne 12.12.2010 15:43, Heikki Linnakangas napsal(a):
 The classic failure case has always been: postcodes and city names.
 Strongly correlated, but in a way that the computer can't easily see.
 
 Yeah, and that's actually analogous to the example I used in my
 presentation.
 
 The way I think of that problem is that once you know the postcode,
 knowing the city name doesn't add any information. The postcode implies
 the city name. So the selectivity for postcode = ? AND city = ? should
 be the selectivity of postcode = ? alone. The measurement we need is
 implicativeness: How strongly does column A imply a certain value for
 column B. Perhaps that could be measured by counting the number of
 distinct values of column B for each value of column A, or something
 like that. I don't know what the statisticians call that property, or if
 there's some existing theory on how to measure that from a sample.

Yes, those issues are a righteous punishment for breaking BCNF rules ;-)

I'm not sure it's solvable using the contingency tables, as it requires
knowledge about dependencies between individual values (working with
cells is not enough, although it might improve the estimates).

Well, maybe we could collect these stats (number of cities for a given
ZIP code and number of ZIP codes for a given city). Collecting a good
stats about this is a bit tricky, but possible. What about collecting
this for the MCVs from both columns?

Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Florian Pflug
On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote:
 The way I think of that problem is that once you know the postcode, knowing 
 the city name doesn't add any information. The postcode implies the city 
 name. So the selectivity for postcode = ? AND city = ? should be the 
 selectivity of postcode = ? alone. The measurement we need is 
 implicativeness: How strongly does column A imply a certain value for 
 column B. Perhaps that could be measured by counting the number of distinct 
 values of column B for each value of column A, or something like that. I 
 don't know what the statisticians call that property, or if there's some 
 existing theory on how to measure that from a sample.

The statistical term for this is conditional probability, written P(A|B), 
meaning the probability of A under the assumption or knowledge of B. The basic 
tool for working with conditional probabilities is Bayes' theorem which states 
that

P(A|B) = P(A and B) / P(B).

Currently, we assume that P(A|B) = P(A), meaning the probability (or 
selectivity as we call it) of an event (like a=3) does not change under 
additional assumptions like b=4. Bayes' theorem thus becomes

P(A) = P(A and B) / P(B) <=>
P(A and B) = P(A)*P(B)

which is how we currently compute the selectivity of a clause such as WHERE 
a=3 AND b=4.

I believe that measuring this by counting the number of distinct values of 
column B for each A is basically the right idea. Maybe we could count the 
number of distinct values of b for every one of the most common values of 
a, and compare that to the overall number of distinct values of b...
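
Something along those lines is easy to eyeball on a concrete table - a
rough sketch, assuming a hypothetical addresses(city, zip) table:

    -- distinct city names seen for each of the 100 most frequent
    -- postcodes, compared to the total number of distinct city names;
    -- a count close to 1 means the postcode (almost) implies the city
    SELECT zip,
           count(DISTINCT city) AS cities_for_zip,
           (SELECT count(DISTINCT city) FROM addresses) AS cities_total
      FROM addresses
     GROUP BY zip
     ORDER BY count(*) DESC
     LIMIT 100;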

A (very) quick search on scholar.google.com for "estimate conditional 
probability" didn't turn up anything useful, but it's hard to believe that 
there isn't at least some literature on the subject.

best regards,
Florian Pflug
 
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
Dne 12.12.2010 17:33, Florian Pflug napsal(a):
 On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote:
 The way I think of that problem is that once you know the postcode, knowing 
 the city name doesn't add any information. The postcode implies the city 
 name. So the selectivity for postcode = ? AND city = ? should be the 
 selectivity of postcode = ? alone. The measurement we need is 
 implicativeness: How strongly does column A imply a certain value for 
 column B. Perhaps that could be measured by counting the number of distinct 
 values of column B for each value of column A, or something like that. I 
 don't know what the statisticians call that property, or if there's some 
 existing theory on how to measure that from a sample.
 
 The statistical term for this is conditional probability, written P(A|B), 
 meaning the probability of A under the assumption or knowledge of B. The 
 basic tool for working with conditional probabilities is Bayes' theorem which 
 states that
 
 P(A|B) = P(A and B) / P(B).
 
 Currently, we assume that P(A|B) = P(A), meaning the probability (or 
 selectivity as we call it) of an event (like a=3) does not change under 
 additional assumptions like b=4. Bayes' theorem thus becomes
 
 P(A) = P(A and B) / P(B) <=>
 P(A and B) = P(A)*P(B)
 
 which is how we currently compute the selectivity of a clause such as WHERE 
 a=3 AND b=4.
 
 I believe that measuring this by counting the number of distinct values of 
 column B for each A is basically the right idea. Maybe we could count the 
 number of distinct values of b for every one of the most common values of 
 a, and compare that to the overall number of distinct values of b...

Good point!

Well, I was thinking about this too - generally this means creating a
contingency table with the MCV as bins. Then you can compute these
interesting probabilities P(A and B). (OK, now I definitely look like
some contingency table weirdo, who tries to solve everything with a
contingency table. OMG!)

The question is - what are we going to do when the values in the query
are not in the MCV list? Is there some heuristic to estimate the
probability from the MCVs, or something like that? Could we use some
average probability or what?

Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Robert Haas
On Sun, Dec 12, 2010 at 9:43 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 The way I think of that problem is that once you know the postcode, knowing
 the city name doesn't add any information. The postcode implies the city
 name. So the selectivity for postcode = ? AND city = ? should be the
 selectivity of postcode = ? alone. The measurement we need is
 implicativeness: How strongly does column A imply a certain value for
 column B. Perhaps that could be measured by counting the number of distinct
 values of column B for each value of column A, or something like that. I
 don't know what the statisticians call that property, or if there's some
 existing theory on how to measure that from a sample.

This is a good idea, but I guess the question is what you do next.  If
you know that the applicability is 100%, you can disregard the
restriction clause on the implied column.  And if it has no
implicatory power, then you just do what we do now.  But what if it
has some intermediate degree of implicability?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
Dne 13.12.2010 01:05, Robert Haas napsal(a):
 This is a good idea, but I guess the question is what you do next.  If
 you know that the applicability is 100%, you can disregard the
 restriction clause on the implied column.  And if it has no
 implicatory power, then you just do what we do now.  But what if it
 has some intermediate degree of implicability?

Well, I think you've missed the e-mail from Florian Pflug - he actually
pointed out that the 'implicativeness' Heikki mentioned is called
conditional probability. And conditional probability can be used to
express the AND probability we are looking for (selectiveness).

For two columns, this is actually pretty straightforward - as Florian
wrote, the equation is

   P(A and B) = P(A|B) * P(B) = P(B|A) * P(A)

where P(B) may be estimated from the current histogram, and P(A|B) may
be estimated from the contingency (see the previous mails). And P(A and
B) is actually the value we're looking for.

Anyway, there really is no intermediate degree of applicability - it just
gives you the right estimate.

And AFAIR this is easily extensible to more than two columns, as

  P(A and B and C) = P(A and (B and C)) = P(A|(B and C)) * P(B and C)

so it's basically a recursion.
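
To make the two-column case a bit more concrete, a rough sketch in SQL,
assuming a hypothetical mcv_pairs(val_a, val_b, joint_freq) contingency
table built over the MCV lists and a hypothetical column_freq(val, freq)
table standing in for the single-column stats on column b:

    -- estimate P(a = 3 AND b = 4) as P(a = 3 | b = 4) * P(b = 4):
    -- P(a = 3 | b = 4) is taken from the sampled contingency table,
    -- P(b = 4) from the existing single-column statistics
    SELECT p.joint_freq / m.marginal_freq * c.freq AS estimated_selectivity
      FROM mcv_pairs p
      JOIN (SELECT val_b, sum(joint_freq) AS marginal_freq
              FROM mcv_pairs
             GROUP BY val_b) m USING (val_b)
      JOIN column_freq c ON c.val = p.val_b
     WHERE p.val_a = 3 AND p.val_b = 4;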

Well, I hope my statements are really correct - it's been a few years
since I gained my degree in statistics ;-)

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Robert Haas
On Sun, Dec 12, 2010 at 8:46 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Dne 13.12.2010 01:05, Robert Haas napsal(a):
 This is a good idea, but I guess the question is what you do next.  If
 you know that the applicability is 100%, you can disregard the
 restriction clause on the implied column.  And if it has no
 implicatory power, then you just do what we do now.  But what if it
 has some intermediate degree of implicability?

 Well, I think you've missed the e-mail from Florian Pflug - he actually
 pointed out that the 'implicativeness' Heikki mentioned is called
 conditional probability. And conditional probability can be used to
 express the AND probability we are looking for (selectiveness).

 For two columns, this is actually pretty straightforward - as Florian
 wrote, the equation is

   P(A and B) = P(A|B) * P(B) = P(B|A) * P(A)

Well, the question is what data you are actually storing.  It's
appealing to store a measure of the extent to which a constraint on
column X constrains column Y, because you'd only need to store
O(ncolumns^2) values, which would be reasonably compact and would
potentially handle the zip code problem - a classic hard case rather
neatly.  But that wouldn't be sufficient to use the above equation,
because there A and B need to be things like "column X has value x",
and it's not going to be practical to store a complete set of MCVs for
column X for each possible value that could appear in column Y.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
 P(A|B) = P(A and B) / P(B).

Well, until this point we've discussed failure cases involving 'AND'
conditions. What about 'OR' conditions? I think the current optimizer
computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
found in backend/optimizer/path/clausesel.c:630).

Sometimes that may return nearly 2x the actual selectivity, but in
general it's a reasonable estimate. Are there any severe failure cases
that produce much worse estimates?
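
FWIW, the 'nearly 2x' case is easy to reproduce with two perfectly
correlated columns - a quick sketch on a throwaway table:

    CREATE TABLE corr_test AS
      SELECT i % 100 AS col_a, i % 100 AS col_b
        FROM generate_series(1,100000) i;
    ANALYZE corr_test;

    -- each clause selects ~10% and one implies the other, so the real
    -- selectivity of the OR is ~10%, while s1+s2 - s1*s2 gives ~19%
    EXPLAIN ANALYZE SELECT * FROM corr_test
     WHERE col_a < 10 OR col_b < 10;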

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
Dne 13.12.2010 03:00, Robert Haas napsal(a):
 Well, the question is what data you are actually storing.  It's
 appealing to store a measure of the extent to which a constraint on
 column X constrains column Y, because you'd only need to store
 O(ncolumns^2) values, which would be reasonably compact and would
 potentially handle the zip code problem - a classic hard case rather
 neatly.  But that wouldn't be sufficient to use the above equation,
 because there A and B need to be things like "column X has value x",
 and it's not going to be practical to store a complete set of MCVs for
 column X for each possible value that could appear in column Y.

O(ncolumns^2) values? You mean collecting such stats for each possible
pair of columns? Well, I meant something different.

The proposed solution is based on contingency tables, built for selected
groups of columns (not for each possible group). And the contingency
table gives you the ability to estimate the probabilities needed to
compute the selectivity. Or am I missing something?

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Robert Haas
On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Dne 13.12.2010 03:00, Robert Haas napsal(a):
 Well, the question is what data you are actually storing.  It's
 appealing to store a measure of the extent to which a constraint on
 column X constrains column Y, because you'd only need to store
 O(ncolumns^2) values, which would be reasonably compact and would
 potentially handle the zip code problem - a classic hard case rather
 neatly.  But that wouldn't be sufficient to use the above equation,
 because there A and B need to be things like "column X has value x",
 and it's not going to be practical to store a complete set of MCVs for
 column X for each possible value that could appear in column Y.

 O(ncolumns^2) values? You mean collecting such stats for each possible
 pair of columns? Well, I meant something different.

 The proposed solution is based on contingency tables, built for selected
 groups of columns (not for each possible group). And the contingency
 table gives you the ability to estimate the probabilities needed to
 compute the selectivity. Or am I missing something?

Well, I'm not real familiar with contingency tables, but it seems like
you could end up needing to store a huge amount of data to get any
benefit out of it, in some cases.  For example, in the United States,
there are over 40,000 postal codes, and some even larger number of
city names, and doesn't the number of entries go as O(m*n)?  Now maybe
this is useful enough anyway that we should Just Do It, but it'd be a
lot cooler if we could find a way to give the planner a meaningful
clue out of some more compact representation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Nathan Boley
 The proposed solution is based on contingency tables, built for selected
 groups of columns (not for each possible group). And the contingency
 table gives you the ability to estimate the probabilities needed to
 compute the selectivity. Or am I missing something?

 Well, I'm not real familiar with contingency tables, but it seems like
 you could end up needing to store a huge amount of data to get any
 benefit out of it, in some cases.  For example, in the United States,
 there are over 40,000 postal codes, and some even larger number of
 city names, and doesn't the number of entries go as O(m*n)?

Not to mention that the model parameters will be impossible to estimate well.

( I've only scanned the thread, so sorry if I'm repeating something
that's already been said )

My intuition is that storing the covariance structure will be
unworkable both technically and statistically for all but the simplest
of cases. That being said, I don't think the problem is a lack of ways
to parameterize the covariance structure ( there are several papers on
multi-dim histogram estimators, at least one of which talks about
databases explicitly, not to mention approaches like CART[1] ), but a
complete lack of infrastructure to do anything with the estimates. So
keep working on it ;-) ( but try to make the framework general enough
to allow better estimators ).

I wonder if a good first step might be to focus on the AND and OR
operators since they seem like the most common cases and union and
intersection commute ( although it's important to remember that the
estimate variances do NOT ). That is, what about estimating the
selectivity of the condition

WHERE X=A and Y=B

by

f(A,B) = x_1*(selectivity(X = A) + selectivity( Y = B )) +
x_2*selectivity(X = A)*selectivity( Y = B ) + x_3

where x_{1,2,3} are parameters to be estimated from the data.

Another quick note: I think that storing the full contingency table is
wasteful since the marginals are already stored in the single column
statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was
looking at this a couple years back ).

Best,
Nathan

[1] http://en.wikipedia.org/wiki/Classification_and_regression_tree
[2] http://en.wikipedia.org/wiki/Copula_%28statistics%29

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread tv
 On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Dne 13.12.2010 03:00, Robert Haas napsal(a):
 Well, the question is what data you are actually storing.  It's
 appealing to store a measure of the extent to which a constraint on
 column X constrains column Y, because you'd only need to store
 O(ncolumns^2) values, which would be reasonably compact and would
 potentially handle the zip code problem - a classic hard case rather
 neatly.  But that wouldn't be sufficient to use the above equation,
 because there A and B need to be things like "column X has value x",
 and it's not going to be practical to store a complete set of MCVs for
 column X for each possible value that could appear in column Y.

 O(ncolumns^2) values? You mean collecting such stats for each possible
 pair of columns? Well, I meant something different.

 The proposed solution is based on contingency tables, built for selected
 groups of columns (not for each possible group). And the contingency
 table gives you the ability to estimate the probabilities needed to
 compute the selectivity. Or am I missing something?

 Well, I'm not real familiar with contingency tables, but it seems like
 you could end up needing to store a huge amount of data to get any
 benefit out of it, in some cases.  For example, in the United States,
 there are over 40,000 postal codes, and some even larger number of
 city names, and doesn't the number of entries go as O(m*n)?  Now maybe
 this is useful enough anyway that we should Just Do It, but it'd be a
 lot cooler if we could find a way to give the planner a meaningful
 clue out of some more compact representation.

Yes, storing a complete contingency table is not feasible in such cases.
My original proposal actually did not address this particular issue
(cities and ZIP codes) as it was based on simplified contingency tables
(with bins corresponding to current histograms, not to individual values).
So the number of values to store would grow much slower.

On the other hand, this generalization really makes it unusable in some
cases, and the issue we're discussing here (cities and ZIP codes) is one
of them. I think in such cases we could build a contingency table for MCV
and then use it to estimate those conditional probabilities we need, but I
expect it to be very tricky.

Thanks for the comments.

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] proposal : cross-column stats

2010-12-11 Thread Tomas Vondra
Hi everyone,

one of the sessions I attended at PgDay last week was Heikki's session
about statistics in PostgreSQL. One of the issues he mentioned (and one
I regularly run into) is the absence of cross-column stats. When the
columns are not independent, this usually results in poor estimates (and
then in suboptimal plans).

I was thinking about this issue before, but that session was the last
impulse that pushed me to try to hack up a proof of concept. So here it
is ;-)

Let's talk about one special case - I'll explain how the proposed
solution works, and then I'll explain how to make it more general, what
improvements are possible, what issues are there. Anyway this is by no
means a perfect or complete solution - it's just a starting point.

 Short introduction 

Say we have a table with two INT columns - col_a and col_b, and we want
to estimate the number of rows for a condition involving both columns:

   WHERE (col_a BETWEEN m AND n) AND (col_b BETWEEN p AND q)

When the columns are independent, doing the estimate is just a matter of
multiplication. When the columns are dependent, the estimate may be way off.

Let's assume there are histograms with 5 bins for both columns. What I
propose is basically building a 2D histogram. It kind of resembles a
contingency table.

So we do have a table like this

 col_b \ col_a || 20% | 20% | 20% | 20% | 20% |
===============||=====|=====|=====|=====|=====|
      20%      ||  4% |  4% |  4% |  4% |  4% |
      20%      ||  4% |  4% |  4% |  4% |  4% |
      20%      ||  4% |  4% |  4% |  4% |  4% |
      20%      ||  4% |  4% |  4% |  4% |  4% |
      20%      ||  4% |  4% |  4% |  4% |  4% |
===============||=====|=====|=====|=====|=====|

where each column / row represents a bin in the original histograms, and
each cell represents an expected number of rows in it (for really
independent columns, it's 4%).

For dependent columns the actual values may be very different,
of course - e.g. for strongly dependent columns it might be like this

 col_b \ col_a || 20% | 20% | 20% | 20% | 20% |
===============||=====|=====|=====|=====|=====|
      20%      || 20% |  0% |  0% |  0% |  0% |
      20%      ||  0% | 20% |  0% |  0% |  0% |
      20%      ||  0% |  0% | 20% |  0% |  0% |
      20%      ||  0% |  0% |  0% | 20% |  0% |
      20%      ||  0% |  0% |  0% |  0% | 20% |
===============||=====|=====|=====|=====|=====|

To estimate the number of rows matching the condition, you'd sum
estimates for cells matching the condition. I.e. when the condition on
col_a matches the lowest 20% (the first histogram bin) and the condition
on col_b matches the lowest 20% of values, this corresponds to the first
cell (20% of rows).

Current optimizer estimates this to be 4% as it believes the columns are
independent.

I'm not sure whether I've explained it well enough, but the essence of
the proposal is to build N-dimensional histograms (where N is the number
of columns covered by the statistics) just as we are building histograms
today.
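
Just to illustrate how the cells might be combined for a range condition
(this is not the attached PoC, just a sketch over a hypothetical
cross_hist(lo_a, hi_a, lo_b, hi_b, frac) table with one row per cell,
frac being the fraction of rows falling into that cell):

    -- estimate for WHERE col_a BETWEEN 30 AND 129
    --              AND col_b BETWEEN 234 AND 484
    -- sum the fractions of all intersecting cells, scaling cells only
    -- partially covered by the ranges (assumes uniformity within a cell
    -- and bins of non-zero width)
    SELECT sum(frac
           * (least(hi_a,129) - greatest(lo_a,30))::numeric / (hi_a - lo_a)
           * (least(hi_b,484) - greatest(lo_b,234))::numeric / (hi_b - lo_b))
           AS estimated_fraction
      FROM cross_hist
     WHERE lo_a <= 129 AND hi_a >= 30
       AND lo_b <= 484 AND hi_b >= 234;

(The ranges are the ones used in the example queries below.)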

--- Proof of concept ---

I've hacked a nasty proof of concept in PL/pgSQL (see the attachment).
It creates two tables - test_table and cross_stats. The former contains
the data (in two columns), the latter is used to store cross-column
statistics of test_table.

Then there are several PL/pgSQL functions - two of them are really
important:

collect_stats(p_bins_a INT, p_bins_b INT)
  - this one is used to collect the stats, the parameters represent
number of bins for the columns (sides of the contingency table)
  - collect_stats(10,10) will build contingency table with 100 cells

get_estimate(p_from_a INT, p_to_a INT, p_from_b INT, p_to_b INT)
  - computes estimate for the condition listed above (ranges on both
columns)

So to run the PoC, just do something like this:

1) fill the table with some data

   INSERT INTO test_table SELECT round(random()*1000),
   round(random()*1000) FROM generate_series(1,100000);

2) collect the cross-column statistics

   SELECT collect_stats(10,10);

3) see current estimated and actual number of rows

   EXPLAIN ANALYZE SELECT * FROM test_table
   WHERE (col_a BETWEEN 30 AND 129)
 AND (col_b BETWEEN 234 AND 484);

4) see the estimate based on contingency table

   SELECT get_estimate(30, 129, 234, 484);

Just two very simple tests for now - col_a/col_b contain the ranges from
the query, then there are the actual number of matching rows and the
current estimate, and finally the new estimates based on contingency
tables with various numbers of bins.

A) independent columns (proof that it produces just as good estimates
   as the current code)

   col_a   |   col_b   | actual | expected | 10x10 | 20x20 |
  [50,70]  |  [50,70]  |    41  |    40    |   41  |   47  |
  [50,250] |  [50,250] |  4023  |   4024   | 4436  |  3944 |
  [50,250] | [750,950] |  4023  |   3955   | 4509  |