Re: [HACKERS] Cross-column statistics revisited

2008-10-19 Thread Nathan Boley
 I still need to go through backend/utils/adt/selfuncs.c
 to figure out exactly how we use the one-dimensional values.


Here's a page that helped me figure all this out.

http://www.postgresql.org/docs/8.1/static/planner-stats-details.html


 2) Do we want to fold the MCV's into the dependence histogram? That
 will cause problems in our copula approach but I'd hate to have to
 keep an N^d histogram dependence relation in addition to the copula.

 Yeah, if we're already trying to figure out how to compress copulae,
 having also to compress MCV matrices seems painful and error-prone.
 But I'm not sure why it would cause problems to keep them in the
 copula -- is that just because we are most interested in the copula
 modeling the parts of the distribution that are most sparsely
 populated?


The problem I was thinking of is that we don't currently store the
true marginal distribution. As it stands, histograms only include
non-MCV values. So we would either need to treat the MCVs separately
(which would assume independence between MCV and non-MCV values) or
store multiple histograms.
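
For concreteness, that split is visible in the pg_stats view: the MCV list
and the histogram sit side by side, and histogram_bounds covers only the
non-MCV values. A quick look (the table name below is a placeholder):

    -- most_common_vals/most_common_freqs hold the MCV list;
    -- histogram_bounds describes only the remaining, non-MCV values.
    -- 'some_table' is a placeholder name.
    SELECT attname, most_common_vals, most_common_freqs, histogram_bounds
    FROM pg_stats
    WHERE tablename = 'some_table';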

 4) How will this approach deal with histogram buckets that have
 scaling count sizes ( ie -0.4 )?

 I'm not sure what you mean here.


That was more a note to myself, and should have been numbered 3.5.
ndistinct estimates currently switch to scaling with the table size
(stored as a negative fraction of the row count) once the sampled
ndistinct is large relative to the number of rows. If we try to model
ndistinct, we need to deal with those scaling ndistinct counts somehow.
But that's way off in the future, so it was probably pointless to
mention it.
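
(For reference, this is the negative-n_distinct convention: a value like
-0.4 in pg_stats means an estimated 0.4 distinct values per row rather than
an absolute count, so the estimate scales with the table. The names below
are placeholders.)

    -- n_distinct > 0: an absolute count of distinct values.
    -- n_distinct < 0: minus the fraction of rows that are distinct,
    --                 e.g. -0.4 means ~0.4 * reltuples distinct values.
    SELECT attname, n_distinct
    FROM pg_stats
    WHERE tablename = 'some_table' AND attname = 'some_column';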



Re: [HACKERS] Cross-column statistics revisited

2008-10-18 Thread Joshua Tolley
On Fri, Oct 17, 2008 at 7:54 PM, Nathan Boley [EMAIL PROTECTED] wrote:
 I'm still working my way around the math, but copulas sound better
 than anything else I've been playing with.

 I think the easiest way to think of them is, in 2-D finite spaces,
 they are just a plot of the order statistics against one another. Feel
 free to mail me off list if you have any math questions.

I'm still working out what a copula is, and I've got to figure out
the stuff below, too! :) /me has reading to do... The treatment of
copulae in one of my stats texts is much better than Wikipedia's, I
thought, so I'm progressing. They sound like an excellent potential
solution the more I figure out what they are, for whatever my opinion
is worth, but making sure they work decently for all data types,
including those where there's no real notion of distance, may be
interesting. I still need to go through backend/utils/adt/selfuncs.c
to figure out exactly how we use the one-dimensional values.

 I've previously thought that, at least in the 2D case, we could use
 image compression algorithms to compress the copula, but recently I've
 realized that this is a change point problem. In terms of compression,
 we want to decompose the copula into regions that are as homogenous as
 possible.  I'm not familiar with change point problems in multiple
 dimensions, but I'll try and ask someone that is, probably late next
 week. If you decide to go the copula route, I'd be happy to write the
 decomposition algorithm - or at least work on the theory.

 Finally, a couple points that I hadn't seen mentioned earlier that
 should probably be considered-

 1) NULL's need to be treated specially - I suspect the assumption of
 NULL independence is worse than other independence assumptions. Maybe
 dealing with NULL dependence could be a half step towards full
 dependence calculations?

Agreed, though how to treat them I have no idea.


 2) Do we want to fold the MCV's into the dependence histogram? That
 will cause problems in our copula approach but I'd hate to have to
 keep an N^d histogram dependence relation in addition to the copula.

Yeah, if we're already trying to figure out how to compress copulae,
having also to compress MCV matrices seems painful and error-prone.
But I'm not sure why it would cause problems to keep them in the
copula -- is that just because we are most interested in the copula
modeling the parts of the distribution that are most sparsely
populated?

 3) For equality selectivity estimates, I believe the assumption that
 the ndistinct value distribution is uniform in the histogram will
 become worse as the dimension increases. I proposed keeping track of
 ndistinct per histogram bucket earlier in the marginal case, partially
 motivated by this exact scenario. Does that proposal make more sense
 in this case? If so we'd need to store two distributions - one of the
 counts and one of ndistinct.

It's definitely worth investigating (or seeing if someone else has
already published an investigation). There are probably several
similar stats we could track and make use of, provided it didn't slow
the planner too much to make use of them.

 4) How will this approach deal with histogram buckets that have
 scaling count sizes ( ie -0.4 )?

I'm not sure what you mean here.

- Josh / eggyknap



Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Martijn van Oosterhout
On Thu, Oct 16, 2008 at 09:17:03PM -0600, Joshua Tolley wrote:
 Because I'm trying to picture geometrically how this might work for
 the two-column case, and hoping to extend that to more dimensions, and
 am finding that picturing a quantile-based system like the one we have
 now in multiple dimensions is difficult. 

Just a note: using multidimensional histograms will work well for
cases like (startdate,enddate) where the histogram will show a
clustering of values along the diagonal. But it will fail for the case
(zipcode,state) where one implies the other. Histogram-wise you're not
going to see any correlation at all, but what you want to know is:

count(distinct zipcode,state) = count(distinct zipcode)

So you might need to think about storing/searching for different kinds
of correlation.
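
A hypothetical way to check for that kind of functional dependence directly
(the table and column names are assumptions):

    -- If zipcode functionally determines state, the joint ndistinct
    -- collapses to the ndistinct of zipcode alone.
    -- 'addresses' and its columns are placeholders.
    SELECT count(DISTINCT (zipcode, state)) AS joint_ndistinct,
           count(DISTINCT zipcode)          AS zipcode_ndistinct
    FROM addresses;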

Secondly, my feeling about multidimensional histograms is that you're
not going to need the matrix to have 100 bins along each axis, but that
it'll be enough to have 1000 bins total. The cases where we get it
wrong enough for people to notice will probably be the same cases where
the histogram will have noticeable variation even for a small number of
bins.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.




Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Martijn van Oosterhout
On Fri, Oct 17, 2008 at 12:20:58AM +0200, Greg Stark wrote:
 Correlation is the wrong tool. In fact zip codes and city have nearly  
 zero correlation.  Zip codes near 0 are no more likely to be in  
 cities starting with A than Z.

I think we need to define our terms better. In terms of linear
correlation you are correct. However, you can define invertible mappings
from zip codes and cities onto the integers which will then have an
almost perfect correlation.

According to a paper I found this is related to the principle of
maximum entropy. The fact that you can't determine such functions
easily in practice doesn't change the fact that zip codes and city
names are highly correlated.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.




Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Gregory Stark
Martijn van Oosterhout [EMAIL PROTECTED] writes:

 On Fri, Oct 17, 2008 at 12:20:58AM +0200, Greg Stark wrote:
 Correlation is the wrong tool. In fact zip codes and city have nearly  
 zero correlation.  Zip codes near 0 are no more likely to be in  
 cities starting with A than Z.

 I think we need to define our terms better. In terms of linear
 correlation you are correct. However, you can define invertable mappings
 from zip codes and cities onto the integers which will then have an
 almost perfect correlation.

 According to a paper I found this is related to the principle of
 maximum entropy. The fact that you can't determine such functions
 easily in practice doesn't change the fact that zip codes and city
 names are highly correlated.

They're certainly very much not independent variables. There are lots of ways
of measuring how much dependence there is between them. I don't know enough
about the math to know if your maps are equivalent to any of them.

In any case as I described it's not enough information to know that the two
data sets are heavily dependent. You need to know for which pairs (or ntuples)
that dependency results in a higher density and for which it results in lower
density and how much higher or lower. That seems like a lot of information to
encode (and a lot to find in the sample).

Perhaps just knowing that there's a dependence between two data sets
might be somewhat useful if the planner kept a confidence value for all its
estimates. It would know to have a lower confidence value for estimates coming
from highly dependent clauses? It wouldn't be very easy for the planner to
distinguish safe plans for low confidence estimates and risky plans which
might blow up if the estimates are wrong though. And of course that's a lot
less interesting than just getting better estimates :)

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!



Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Richard Huxton
Gregory Stark wrote:
 They're certainly very much not independent variables. There are lots of ways
 of measuring how much dependence there is between them. I don't know enough
 about the math to know if your maps are equivalent to any of them.

I think dependency captures the way I think about it rather than
correlation (although I can see there must be a function that could map
that dependency onto how we think of correlations).

 In any case as I described it's not enough information to know that the two
 data sets are heavily dependent. You need to know for which pairs (or ntuples)
 that dependency results in a higher density and for which it results in lower
 density and how much higher or lower. That seems like a lot of information to
 encode (and a lot to find in the sample).

Like Josh Berkus mentioned a few points back, it's the handful of
plan-changing values you're looking for.

So, it seems like we've got:
1. Implied dependencies: zip-code=city
2. Implied+constraint: start-date < end-date and the difference between
the two is usually less than a week
3. Top-heavy foreign-key stats.

#1 and #2 obviously need new infrastructure.

From a non-dev point of view it looks like #3 could use the existing
stats on each side of the join. I'm not sure whether you could do
anything meaningful for joins that don't explicitly specify one side of
the join though.

 Perhaps just knowing whether that there's a dependence between two data sets
 might be somewhat useful if the planner kept a confidence value for all its
 estimates. It would know to have a lower confidence value for estimates coming
 from highly dependent clauses? It wouldn't be very easy for the planner to
 distinguish safe plans for low confidence estimates and risky plans which
 might blow up if the estimates are wrong though. And of course that's a lot
 less interesting than just getting better estimates :)

If we could abort a plan and restart then we could just try the
quick-but-risky plan and, if we reach 50 rows rather than the expected 10,
try a different approach. That way we'd not need to gather stats, just
react to the situation in individual queries.

-- 
  Richard Huxton
  Archonet Ltd



Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Tom Lane
Martijn van Oosterhout [EMAIL PROTECTED] writes:
 Just a note: using a multidimensional histograms will work well for the
 cases like (startdate,enddate) where the histogram will show a
 clustering of values along the diagonal. But it will fail for the case
 (zipcode,state) where one implies the other. Histogram-wise you're not
 going to see any correlation at all

Huh?  Sure you are.  What the histogram will show is that there is only
one state value per zipcode, and only a limited subset of zipcodes per
state.  The nonempty cells won't cluster along the diagonal but we
don't particularly care about that.

What we really want from this is to not think that
WHERE zip = '80210' AND state = 'CA'
is significantly more selective than just
WHERE zip = '80210'
A histogram is certainly capable of telling us that.  Whether it's the
most compact representation is another question of course --- in an
example like this, only about 1/50th of the cells would contain nonzero
counts ...

regards, tom lane



Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Ron Mayer

Tom Lane wrote:

A bad estimate for physical-position correlation has only limited
impact,


Ah!  This seems very true with 8.3 but much less true with 8.0.

On a legacy 8.0 system I have a hard time avoiding cases where
a query like
 select * from addresses where add_state_or_province = 'CA';
does a 2-second full-table scan instead of a 300ms index scan
thanks to a poor physical order guess.

I just sucked the table into 8.3 and am pleased to say that
it picks a 200ms bitmap scan even with the misleading correlation.

Thanks for bitmap scans guys!

I'll shut up about this physical ordering stuff now
and try to do better upgrading before posting.




Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Nathan Boley
 Right now our
 histogram values are really quantiles; the statistics_target T for a
 column determines a number of quantiles we'll keep track of, and we
 grab values from into an ordered list L so that approximately 1/T of
 the entries in that column fall between values L[n] and L[n+1]. I'm
 thinking that multicolumn statistics would instead divide the range of
 each column up into T equally sized segments,

 Why would you not use the same histogram bin bounds derived for the
 scalar stats (along each axis of the matrix, of course)?  This seems to
 me to be arbitrarily replacing something proven to work with something
 not proven.  Also, the above forces you to invent a concept of equally
 sized ranges, which is going to be pretty bogus for a lot of datatypes.

 Because I'm trying to picture geometrically how this might work for
 the two-column case, and hoping to extend that to more dimensions, and
 am finding that picturing a quantile-based system like the one we have
 now in multiple dimensions is difficult. I believe those are the same
 difficulties Gregory Stark mentioned having in his first post in this
 thread. But of course that's an excellent point, that what we do now
 is proven. I'm not sure which problem will be harder to solve -- the
 weird geometry or the equally sized ranges for data types where that
 makes no sense.


Look at copulas. They are a completely general method of describing
the dependence between two marginal distributions. It seems silly to
rewrite the stats table in terms of joint distributions when we'll
still need the marginals anyway. Also, it might be easier to think of
the dimension-reduction problem in that form.
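
As a rough sketch of the empirical copula idea (hypothetical table t(a, b);
ntile() needs window-function support, i.e. an 8.4-era or later server):
cutting each marginal into equal-frequency buckets makes the marginals
uniform by construction, so the remaining cell counts describe only the
dependence. Under independence every cell holds about the same number of rows.

    -- 10 x 10 empirical copula grid over table t(a, b) (placeholder names).
    SELECT a_q, b_q, count(*) AS freq
    FROM (SELECT ntile(10) OVER (ORDER BY a) AS a_q,
                 ntile(10) OVER (ORDER BY b) AS b_q
          FROM t) ranks
    GROUP BY a_q, b_q
    ORDER BY a_q, b_q;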



Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Joshua Tolley
On Fri, Oct 17, 2008 at 3:47 PM, Nathan Boley [EMAIL PROTECTED] wrote:
 Right now our
 histogram values are really quantiles; the statistics_target T for a
 column determines a number of quantiles we'll keep track of, and we
 grab values from into an ordered list L so that approximately 1/T of
 the entries in that column fall between values L[n] and L[n+1]. I'm
 thinking that multicolumn statistics would instead divide the range of
 each column up into T equally sized segments,

 Why would you not use the same histogram bin bounds derived for the
 scalar stats (along each axis of the matrix, of course)?  This seems to
 me to be arbitrarily replacing something proven to work with something
 not proven.  Also, the above forces you to invent a concept of equally
 sized ranges, which is going to be pretty bogus for a lot of datatypes.

 Because I'm trying to picture geometrically how this might work for
 the two-column case, and hoping to extend that to more dimensions, and
 am finding that picturing a quantile-based system like the one we have
 now in multiple dimensions is difficult. I believe those are the same
 difficulties Gregory Stark mentioned having in his first post in this
 thread. But of course that's an excellent point, that what we do now
 is proven. I'm not sure which problem will be harder to solve -- the
 weird geometry or the equally sized ranges for data types where that
 makes no sense.


 Look at copulas. They are a completely general method of describing
 the dependence between two marginal distributions. It seems silly to
 rewrite the stats table in terms of joint distributions when we'll
 still need the marginals anyways. Also, It might be easier to think of
 the dimension reduction problem in that form.


I'm still working my way around the math, but copulas sound better
than anything else I've been playing with. What's more, there are
plenty of existing implementations to refer to, provided it's done in
a licensing-friendly way. A multidimensional extension of our existing
stuff, at least in the ways I've been thinking of it, quickly becomes
a recursive problem -- perhaps some dynamic programming solution would
solve it, but copulas seem the more common solution for similar
problems in other fields. Thanks.

- Josh / eggyknap



Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Nathan Boley
 I'm still working my way around the math, but copulas sound better
 than anything else I've been playing with.

I think the easiest way to think of them is, in 2-D finite spaces,
they are just a plot of the order statistics against one another. Feel
free to mail me off list if you have any math questions.

I've previously thought that, at least in the 2D case, we could use
image compression algorithms to compress the copula, but recently I've
realized that this is a change-point problem. In terms of compression,
we want to decompose the copula into regions that are as homogeneous as
possible.  I'm not familiar with change-point problems in multiple
dimensions, but I'll try to ask someone who is, probably late next
week. If you decide to go the copula route, I'd be happy to write the
decomposition algorithm - or at least work on the theory.

Finally, a couple of points that I hadn't seen mentioned earlier that
should probably be considered:

1) NULLs need to be treated specially - I suspect the assumption of
NULL independence is worse than other independence assumptions. Maybe
dealing with NULL dependence could be a half step towards full
dependence calculations?

2) Do we want to fold the MCVs into the dependence histogram? That
will cause problems in our copula approach but I'd hate to have to
keep an N^d histogram dependence relation in addition to the copula.

3) For equality selectivity estimates, I believe the assumption that
the ndistinct value distribution is uniform across the histogram will
become worse as the dimension increases. I proposed keeping track of
ndistinct per histogram bucket earlier in the marginal case, partially
motivated by this exact scenario (see the sketch after point 4). Does
that proposal make more sense in this case? If so we'd need to store
two distributions - one of the counts and one of ndistinct.

4) How will this approach deal with histogram buckets that have
scaling count sizes (i.e. -0.4)?
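
To make point 3 concrete, here is a rough sketch of what "ndistinct per
histogram bucket" would collect, over a hypothetical table t(a) with 10
buckets (ntile() needs window-function support):

    -- ndistinct per equal-frequency bucket of column a: if distinct values
    -- really were uniform across the histogram, ndistinct would be roughly
    -- equal in every bucket.  't' and 'a' are placeholders.
    SELECT bucket,
           count(*)          AS n_rows,
           count(DISTINCT a) AS ndistinct
    FROM (SELECT a, ntile(10) OVER (ORDER BY a) AS bucket FROM t) s
    GROUP BY bucket
    ORDER BY bucket;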



Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Martijn van Oosterhout
On Wed, Oct 15, 2008 at 04:53:10AM -0600, Joshua Tolley wrote:
 I've been interested in what it would take to start tracking
 cross-column statistics. A review of the mailing lists as linked from
 the TODO item on the subject [1] suggests the following concerns:
 
 1) What information exactly would be tracked?
 2) How would it be kept from exploding in size?
 3) For which combinations of columns would statistics be kept?

I think you need to go a step back: how are you going to use this data?
Whatever structure you choose, the eventual goal is to take a description
of the columns (a,b) and a clause like 'a < 5' and be able to
generate an estimate of the distribution of b.

Secondly, people aren't going to ask for multi-column stats on columns
that aren't correlated in some way. So you need to work out what kinds
of correlation people are interested in and see how you can store them.

One potential use case is the (startdate,enddate) columns. Here what
you want is to detect somehow that the distribution of (enddate-startdate)
is constant.
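
As a hypothetical illustration of the (startdate, enddate) case (the
bookings table and its columns are assumptions): if the gap distribution
really is stable, a one-dimensional histogram of the gap captures most of
the cross-column structure.

    -- Distribution of the gap between the two date columns.
    -- 'bookings', 'startdate' and 'enddate' are placeholders.
    SELECT enddate - startdate AS gap_days,
           count(*)            AS freq
    FROM bookings
    GROUP BY gap_days
    ORDER BY gap_days;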

I think the real question is: what other kinds of correlation might
people be interested in representing?

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.




Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Tom Lane
Martijn van Oosterhout [EMAIL PROTECTED] writes:
 I think you need to go a step back: how are you going to use this data?

The fundamental issue as the planner sees it is not having to assume
independence of WHERE clauses.  For instance, given

WHERE a < 5 AND b < 10

our current approach is to estimate the fraction of rows with a  5
(using stats for a), likewise estimate the fraction with b  10
(using stats for b), and then multiply these fractions together.
This is correct if a and b are independent, but can be very bad if
they aren't.  So if we had joint statistics on a and b, we'd want to
somehow match that up to clauses for a and b and properly derive
the joint probability.

(I'm not certain of how to do that efficiently, even if we had the
right stats :-()
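
As a toy illustration of that multiplication, over a hypothetical table
t(a, b) (the table and the float8 casts are assumptions, not anything the
planner actually does):

    -- Independence-assumption estimate vs. the actual joint selectivity
    -- for "a < 5 AND b < 10".  't', 'a' and 'b' are placeholders.
    SELECT (SELECT count(*) FROM t WHERE a < 5)::float8 / count(*)
           * (SELECT count(*) FROM t WHERE b < 10)::float8 / count(*)
               AS independence_estimate,
           (SELECT count(*) FROM t WHERE a < 5 AND b < 10)::float8 / count(*)
               AS actual_selectivity
    FROM t;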

regards, tom lane



Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Greg Stark

[sorry for top posting - damn phone]

It's pretty straightforward to do a chi-squared test on all the pairs.
But that tells you that the product is more likely to be wrong. It  
doesn't tell you whether it's going to be too high or too low...
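
For reference, a rough sketch of such a test for a single pair of columns,
written over a hypothetical table t(a, b) (the CTE syntax needs an 8.4-era
or later server). As noted above, it only measures how far the observed
cell counts are from the independence expectation; it says nothing about
which direction a given estimate will be off.

    -- Pearson chi-squared statistic for independence of a and b:
    -- sum over all cells of (observed - expected)^2 / expected,
    -- where expected = row_total * column_total / N.
    WITH obs AS (
        SELECT a, b, count(*)::float8 AS o FROM t GROUP BY a, b
    ), marg_a AS (
        SELECT a, sum(o) AS na FROM obs GROUP BY a
    ), marg_b AS (
        SELECT b, sum(o) AS nb FROM obs GROUP BY b
    ), tot AS (
        SELECT sum(o) AS n FROM obs
    )
    SELECT sum((coalesce(o, 0) - na * nb / n) ^ 2 / (na * nb / n)) AS chi_squared
    FROM marg_a
    CROSS JOIN marg_b
    CROSS JOIN tot
    LEFT JOIN obs USING (a, b);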


greg

On 16 Oct 2008, at 07:20 PM, Tom Lane [EMAIL PROTECTED] wrote:


Martijn van Oosterhout [EMAIL PROTECTED] writes:
I think you need to go a step back: how are you going to use this  
data?


The fundamental issue as the planner sees it is not having to assume
independence of WHERE clauses.  For instance, given

   WHERE a < 5 AND b < 10

our current approach is to estimate the fraction of rows with a < 5
(using stats for a), likewise estimate the fraction with b < 10
(using stats for b), and then multiply these fractions together.
This is correct if a and b are independent, but can be very bad if
they aren't.  So if we had joint statistics on a and b, we'd want to
somehow match that up to clauses for a and b and properly derive
the joint probability.

(I'm not certain of how to do that efficiently, even if we had the
right stats :-()

   regards, tom lane



Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Robert Haas
 I think the real question is: what other kinds of correlation might
 people be interested in representing?

Yes, or to phrase that another way: What kinds of queries are being
poorly optimized now and why?

I suspect that a lot of the correlations people care about are
extreme.  For example, it's fairly common for me to have a table where
column B is only used at all for certain values of column A.  Like,
atm_machine_id is usually or always NULL unless transaction_type is
ATM, or something.  So a clause of the form transaction_type = 'ATM'
and atm_machine_id > 1 looks more selective than it really is
(because the first half is redundant).

The other half of this is that bad selectivity estimates only matter
if they're bad enough to change the plan, and I'm not sure whether
cases like this are actually a problem in practice.

...Robert



Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Martijn van Oosterhout
On Thu, Oct 16, 2008 at 01:34:59PM -0400, Robert Haas wrote:
 I suspect that a lot of the correlations people care about are
 extreme.  For example, it's fairly common for me to have a table where
 column B is only used at all for certain values of column A.  Like,
 atm_machine_id is usually or always NULL unless transaction_type is
 ATM, or something.  So a clause of the form transaction_type = 'ATM'
 and atm_machine_id  1 looks more selective than it really is
 (because the first half is redundant).

That case is easily done by simply considering the indexed values of a
column with a partial index. This should be fairly easy to do I think.

It might be worthwhile someone trawling through the archives looking
for examples where we estimate the correlation wrong.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.




Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Ron Mayer

Robert Haas wrote:

I think the real question is: what other kinds of correlation might
people be interested in representing?


Yes, or to phrase that another way: What kinds of queries are being
poorly optimized now and why?


The one that affects our largest tables is where we
have an address (or other geo-data) clustered by zip, but
with other columns (city, county, state, school-zone, police
beat, etc.) used in queries.

Postgres considers those unclustered (correlation 0 in the stats),
despite all rows for a given value residing on the same few pages.


I could imagine that this could be handled by either some cross-column
correlation (each zip has only 1-2 cities); or by an enhanced
single-column statistic (even though cities aren't sorted alphabetically,
all rows on a page tend to refer to the same city).





Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Josh Berkus
Tom,

 (I'm not certain of how to do that efficiently, even if we had the
 right stats :-()

I was actually talking to someone about this at pgWest.  Apparently there's
a fair number of academic algorithms devoted to this topic.  Josh, do you
remember who was talking about this?

-- 
--Josh

Josh Berkus
PostgreSQL
San Francisco



Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Joshua Tolley
On Thu, Oct 16, 2008 at 2:54 PM, Josh Berkus [EMAIL PROTECTED] wrote:
 Tom,

 (I'm not certain of how to do that efficiently, even if we had the
 right stats :-()

 I was actually talking to someone about this at pgWest.  Apparently there's
 a fair amount of academic algorithms devoted to this topic.  Josh, do you
 remember who was talking about this?

Actually, it was me :) My biggest concern at the time was finding ways
to compress the correlation data, on the apparently fairly tenuous
assumption that we'd somehow be able to make use of it. As it turns
out, the particular set of algorithms I had in mind don't compress
anything, but there are other methods that do.

Most of the comments on this thread have centered around the questions
of what we'd store and how we'd use it, which might be better
phrased as, "The database assumes columns are independent, but we know
that's not always true. Does this cause enough problems to make it
worth fixing? How might we fix it?" I have to admit an inability to
show that it causes problems, though Neil Conway pointed me to some
literature[1] on the subject I've not yet had time to go through. My
basic idea is that if we have a cross-column frequency count, it will
help us get better plans, but I don't know the internals enough to
have details further than that.

- Josh / eggyknap

[1] http://www.cs.umd.edu/~amol/papers/paper-dep.pdf



Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Josh Berkus

 Yes, or to phrase that another way: What kinds of queries are being
 poorly optimized now and why?

Well, we have two different correlation problems.  One is the problem of
dependent correlation, such as the 1.0 correlation of ZIP and CITY fields,
which is a common problem.  This could in fact be fixed, I believe, via a linear
math calculation based on the sampled level of correlation, assuming we
have enough samples.  And it's really only an issue if the correlation is
> 0.5.

The second type of correlation issue we have is correlating values in a 
parent table with *rows* in child table (i.e. FK joins).  Currently, the 
planner assumes that all rows in the child table are evenly distributed 
against keys in the parent table.  But many real-world databases have this 
kind of problem:

A   B
1   1 rows
2   1 rows
3   1000 rows
4 .. 1000   0 to 1 rows

For queries which cover values between 4..1000 on A, the misestimate won't 
be much of a real execution problem.  But for values 1,2,3, the query will 
bomb.

 The other half of this is that bad selectivity estimates only matter
 if they're bad enough to change the plan, and I'm not sure whether
 cases like this are actually a problem in practice.

My experience is that any estimate which is more than 5x wrong (i.e. < .2
or > 5.0) usually causes problems, and 3x sometimes causes problems.

-- 
--Josh

Josh Berkus
PostgreSQL
San Francisco



Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Greg Stark
Correlation is the wrong tool. In fact zip codes and city have nearly  
zero correlation.  Zip codes near 0 are no more likely to be in  
cities starting with A than Z.


Even if you use an appropriate tool I'm not clear what to do with the  
information. Consider the case of WHERE city='boston' and zip='02139'  
and another query with WHERE city='boston' and zip='90210'. One will  
produce many more records than the separate histograms would predict  
and the other would produce zero. How do you determine which category  
a given pair of constants falls into?


Separately you mention cross-table stats - but that's a whole other
kettle of worms. I'm not sure which is easier but let's do one at a  
time?



greg

On 17 Oct 2008, at 12:12 AM, Josh Berkus [EMAIL PROTECTED] wrote:




Yes, or to phrase that another way: What kinds of queries are being
poorly optimized now and why?


Well, we have two different correlation problems.  One is the  
problem of
dependant correlation, such as the 1.0 correlation of ZIP and CITY  
fields
as a common problem.  This could in fact be fixed, I believe, via a  
linear
math calculation based on the sampled level of correlation, assuming  
we
have enough samples.  And it's really only an issue if the  
correlation is > 0.5.


The second type of correlation issue we have is correlating values  
in a
parent table with *rows* in child table (i.e. FK joins).  Currently,  
the
planner assumes that all rows in the child table are evenly  
distributed
against keys in the parent table.  But many real-world databases  
have this

kind of problem:

A           B
1           1 rows
2           1 rows
3           1000 rows
4 .. 1000   0 to 1 rows

For queries which cover values between 4..1000 on A, the misestimate  
won't
be much of a real execution problem.  But for values 1,2,3, the  
query will

bomb.


The other half of this is that bad selectivity estimates only matter
if they're bad enough to change the plan, and I'm not sure whether
cases like this are actually a problem in practice.


My experience is that any estimate which is more than 5x wrong (i.e.
< .2 or > 5.0) usually causes problems, and 3x sometimes causes problems.

--
--Josh

Josh Berkus
PostgreSQL
San Francisco



Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Ron Mayer

Josh Berkus wrote:

Yes, or to phrase that another way: What kinds of queries are being
poorly optimized now and why?


Well, we have two different correlation problems.  One is the problem of 
dependant correlation, such as the 1.0 correlation of ZIP and CITY fields 
as a common problem.  This could in fact be fixed, I believe, via a linear 
math calculation based on the sampled level of correlation, assuming we 
have enough samples.  And it's really only an issue if the correlation is
> 0.5.


I'd note that this can be an issue even without 2 columns involved.

I've seen a number of tables where the data is loaded in batches,
so similar values from a batch tend to be packed into relatively few pages.

Think of a database for a retailer that nightly aggregates data from
each of many stores.  Each incoming batch inserts the store's data
into tightly packed disk pages where almost all rows on the page are for
that store.  But those pages are interspersed with pages from other
stores.

I think I like the ideas Greg Stark had a couple years ago:
http://archives.postgresql.org/pgsql-hackers/2006-09/msg01040.php
   ...sort the sampled values by value
   and count up the average number of distinct blocks per value ... Or
   perhaps we need a second histogram where the quantities are of
   distinct pages rather than total records ... We might also need a
   separate average number of n-block spans per value ...
since those seem to me to lead more directly to values like the number of
blocks that need to be read.
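
Purely as an illustration of the quoted idea, here is one way to compute
"distinct blocks per value" by hand over a hypothetical sales(store_id)
table; the ctid-to-point cast is a common trick for pulling out the heap
block number and needs a reasonably recent server.

    -- Distinct heap pages touched by each value: a high count relative to
    -- the row count means the value is scattered across the table.
    -- 'sales' and 'store_id' are placeholders.
    SELECT store_id,
           count(DISTINCT (ctid::text::point)[0]) AS distinct_blocks,
           count(*)                               AS n_rows
    FROM sales
    GROUP BY store_id;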





Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Greg Stark
This is yet another issue entirely. This is about estimating how much
I/O will be random I/O if we do an index-order scan.  Correlation is a
passable tool for this but we might be able to do better.


But it has nothing to do with the cross-column stats problem.

greg

On 17 Oct 2008, at 01:29 AM, Ron Mayer [EMAIL PROTECTED]  
wrote:



Josh Berkus wrote:

Yes, or to phrase that another way: What kinds of queries are being
poorly optimized now and why?
Well, we have two different correlation problems.  One is the  
problem of dependant correlation, such as the 1.0 correlation of  
ZIP and CITY fields as a common problem.  This could in fact be  
fixed, I believe, via a linear math calculation based on the  
sampled level of correlation, assuming we have enough samples.  And  
it's really only an issue if the correlation is > 0.5.


I'd note that this can be an issue even without 2 columns involved.

I've seen a number of tables where the data is loaded in batches
so similar-values from a batch tend to be packed into relatively few  
pages.


Thinks a database for a retailer that nightly aggregates data from
each of many stores.  Each incoming batch inserts the store's data
into tightly packed disk pages where most all rows on the page are for
that store.   But those pages are interspersed with pages from other
stores.

I think I like the ideas Greg Stark had a couple years ago:
http://archives.postgresql.org/pgsql-hackers/2006-09/msg01040.php
  ...sort the sampled values by value
  and count up the average number of distinct blocks per value Or
  perhaps we need a second histogram where the quantities are of
  distinct pages rather than total records We might also need a
  separate average number of n-block spans per value
since those seem to me to lead more directly to values like blocks
that need to be read.





Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Tom Lane
Joshua Tolley [EMAIL PROTECTED] writes:
 Most of the comments on this thread have centered around the questions
 of what we'd store and how we'd use it, which might be better
 phrased as, The database assumes columns are independent, but we know
 that's not always true. Does this cause enough problems to make it
 worth fixing? How might we fix it? I have to admit an inability to
 show that it causes problems,

Any small amount of trolling in our archives will turn up plenty of
examples.

It appears to me that a lot of people in this thread are confusing
correlation in the sense of statistical correlation between two
variables with correlation in the sense of how well physically-ordered
a column is.  (The latter is actually the same kind of animal, but
always taking one of the two variables to be physical position.)
A bad estimate for physical-position correlation has only limited
impact, as Josh B said upthread; but the other case leads to very
bad rowcount estimates which have *huge* impact on plan choices.

regards, tom lane



Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Joshua Tolley
On Thu, Oct 16, 2008 at 6:32 PM, Tom Lane [EMAIL PROTECTED] wrote:
 It appears to me that a lot of people in this thread are confusing
 correlation in the sense of statistical correlation between two
 variables with correlation in the sense of how well physically-ordered
 a column is.

For what it's worth, neither version of correlation was what I had in
mind. Statistical correlation between two variables is a single
number, is fairly easy to calculate, and probably wouldn't help query
plans much at all. I'm more interested in more complex data
gathering. The data model I have in mind (which I note I have *not*
proven to actually help a large number of query plans -- that's
obviously an important part of what I'd need to do in all this)
involves instead a matrix of frequency counts. Right now our
histogram values are really quantiles; the statistics_target T for a
column determines a number of quantiles we'll keep track of, and we
grab values from the column into an ordered list L so that approximately 1/T of
the entries in that column fall between values L[n] and L[n+1]. I'm
thinking that multicolumn statistics would instead divide the range of
each column up into T equally sized segments, to form in the
two-column case a matrix, where the values of the matrix are frequency
counts -- the number of rows whose values for each column fall within
the particular segments of their respective ranges represented by the
boundaries of that cell in the matrix. I just realized while writing
this that this might not extend to situations where the two columns
are from different tables and don't necessarily have the same row
count, but I'll have to think about that.
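
For illustration, a sketch of that equal-width frequency matrix with T = 10,
over a hypothetical table t(a, b) with numeric-ish columns (width_bucket
puts each column's maximum into an extra overflow bucket, which is fine for
a sketch):

    -- Two-column frequency matrix over equal-width ranges of a and b.
    -- 't', 'a' and 'b' are placeholders.
    SELECT width_bucket(a, lo_a, hi_a, 10) AS a_bucket,
           width_bucket(b, lo_b, hi_b, 10) AS b_bucket,
           count(*)                        AS freq
    FROM t,
         (SELECT min(a) AS lo_a, max(a) AS hi_a,
                 min(b) AS lo_b, max(b) AS hi_b
          FROM t) bounds
    GROUP BY 1, 2
    ORDER BY 1, 2;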

Anyway, the size of such a matrix grows as T raised to the number of
columns involved, so cross-column statistics involving just a few columns
could easily involve millions of values for fairly normal statistics_targets.
That's where the compression ideas come into play. This would
obviously need a fair bit of testing, but it's certainly conceivable
that modern regression techniques could reduce that frequency matrix
to a set of functions with a small number of parameters. Whether that
would result in values the planner can look up for a given set of
columns without spending more time than it's worth is another question
that will need exploring.

I started this thread knowing that past discussions have posed the
following questions:

1. What sorts of cross-column data can we really use?
2. Can we collect that information?
3. How do we know what columns to track?

For what it's worth, my original question was whether anyone had
concerns beyond these, and I think that has been fairly well answered
in this thread.

- Josh / eggyknap



Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Tom Lane
Joshua Tolley [EMAIL PROTECTED] writes:
 For what it's worth, neither version of correlation was what I had in
 mind. Statistical correlation between two variables is a single
 number, is fairly easy to calculate, and probably wouldn't help query
 plans much at all. I'm more interested in a more complex data
 gathering. The data model I have in mind (which I note I have *not*
 proven to actually help a large number of query plans -- that's
 obviously an important part of what I'd need to do in all this)
 involves instead a matrix of frequency counts.

Oh, I see the distinction you're trying to draw.  Agreed on both points:
a simple correlation number is pretty useless to us, and we don't have
hard evidence that a histogram-matrix will solve the problem.  However,
we do know that one-dimensional histograms of the sort currently
collected by ANALYZE work pretty well (if they're detailed enough).
It seems reasonable to me to extrapolate that same concept to two or
more dimensions.  The issue then becomes that a detailed enough
matrix might be impracticably bulky, so you need some kind of lossy
compression, and we don't have hard evidence about how well that will
work.  Nonetheless the road map seems clear enough to me.

 Right now our
 histogram values are really quantiles; the statistics_target T for a
 column determines a number of quantiles we'll keep track of, and we
 grab values from into an ordered list L so that approximately 1/T of
 the entries in that column fall between values L[n] and L[n+1]. I'm
 thinking that multicolumn statistics would instead divide the range of
 each column up into T equally sized segments,

Why would you not use the same histogram bin bounds derived for the
scalar stats (along each axis of the matrix, of course)?  This seems to
me to be arbitrarily replacing something proven to work with something
not proven.  Also, the above forces you to invent a concept of equally
sized ranges, which is going to be pretty bogus for a lot of datatypes.

regards, tom lane



Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Joshua Tolley
On Thu, Oct 16, 2008 at 8:38 PM, Tom Lane [EMAIL PROTECTED] wrote:
 Joshua Tolley [EMAIL PROTECTED] writes:
 For what it's worth, neither version of correlation was what I had in
 mind. Statistical correlation between two variables is a single
 number, is fairly easy to calculate, and probably wouldn't help query
 plans much at all. I'm more interested in a more complex data
 gathering. The data model I have in mind (which I note I have *not*
 proven to actually help a large number of query plans -- that's
 obviously an important part of what I'd need to do in all this)
 involves instead a matrix of frequency counts.

 Oh, I see the distinction you're trying to draw.  Agreed on both points:
 a simple correlation number is pretty useless to us, and we don't have
 hard evidence that a histogram-matrix will solve the problem.  However,
 we do know that one-dimensional histograms of the sort currently
 collected by ANALYZE work pretty well (if they're detailed enough).
 It seems reasonable to me to extrapolate that same concept to two or
 more dimensions.  The issue then becomes that a detailed enough
 matrix might be impracticably bulky, so you need some kind of lossy
 compression, and we don't have hard evidence about how well that will
 work.  Nonetheless the road map seems clear enough to me.

Oh goodie -- I feel so validated :)


 Right now our
 histogram values are really quantiles; the statistics_target T for a
 column determines a number of quantiles we'll keep track of, and we
 grab values from into an ordered list L so that approximately 1/T of
 the entries in that column fall between values L[n] and L[n+1]. I'm
 thinking that multicolumn statistics would instead divide the range of
 each column up into T equally sized segments,

 Why would you not use the same histogram bin bounds derived for the
 scalar stats (along each axis of the matrix, of course)?  This seems to
 me to be arbitrarily replacing something proven to work with something
 not proven.  Also, the above forces you to invent a concept of equally
 sized ranges, which is going to be pretty bogus for a lot of datatypes.

Because I'm trying to picture geometrically how this might work for
the two-column case, and hoping to extend that to more dimensions, and
am finding that picturing a quantile-based system like the one we have
now in multiple dimensions is difficult. I believe those are the same
difficulties Gregory Stark mentioned having in his first post in this
thread. But of course that's an excellent point, that what we do now
is proven. I'm not sure which problem will be harder to solve -- the
weird geometry or the equally sized ranges for data types where that
makes no sense.

One option that came to mind recently would be to have statistics for
a subset of the rows in a single column A, where that subset is
defined by conditions on some other column B with a defined
relationship to A (for instance, that they're in the same table).
There are all kinds of complexities possible with that idea, but one
bright spot is that the values in the catalog and the way the planner
makes use of them would stay essentially the same as they are now.
That one's interesting to think about, but probably too complex for
real use. Anyway, I'll keep pondering.

- Josh / eggyknap



Re: [HACKERS] Cross-column statistics revisited

2008-10-15 Thread Gregory Stark
Joshua Tolley [EMAIL PROTECTED] writes:

 I've been interested in what it would take to start tracking
 cross-column statistics. A review of the mailing lists as linked from
 the TODO item on the subject [1] suggests the following concerns:

 1) What information exactly would be tracked?
 2) How would it be kept from exploding in size?
 3) For which combinations of columns would statistics be kept?

I think then you have 

4) How would we form estimates from these stats?

 The major concern in #1 seemed to be that the most suitable form for
 keeping most common value lists, histograms, etc. is in an array, and
 at the time of the posts I read, arrays of composite types weren't
 possible. This seems much less of a concern now -- perhaps in greatest
 part because a test I just did against a recent 8.4devel sure makes it
 look like stats on composite type columns aren't even kept. The most
 straightforward is that we'd keep a simple multi-dimensional
 histogram, but that leads to a discussion of #2.

"multi-dimensional histogram" isn't such a simple concept, at least not to me.

Histograms aren't a bar chart of equal widths and various heights like I was
taught in school. They're actually bars of various widths arranged such that
they all have the same height.

It's not clear how to extend that concept into two dimensions. I imagine
there's research on this though. What do the GIST statistics functions store?

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!



Re: [HACKERS] Cross-column statistics revisited

2008-10-15 Thread Joshua Tolley
On Wed, Oct 15, 2008 at 7:51 AM, Gregory Stark [EMAIL PROTECTED] wrote:
 Joshua Tolley [EMAIL PROTECTED] writes:

 I've been interested in what it would take to start tracking
 cross-column statistics. A review of the mailing lists as linked from
 the TODO item on the subject [1] suggests the following concerns:

 1) What information exactly would be tracked?
 2) How would it be kept from exploding in size?
 3) For which combinations of columns would statistics be kept?

 I think then you have

 4) How would we form estimates from these stats

That too, but see below.

 The major concern in #1 seemed to be that the most suitable form for
 keeping most common value lists, histograms, etc. is in an array, and
 at the time of the posts I read, arrays of composite types weren't
 possible. This seems much less of a concern now -- perhaps in greatest
 part because a test I just did against a recent 8.4devel sure makes it
 look like stats on composite type columns aren't even kept. The most
 straightforward is that we'd keep a simple multi-dimensional
 histogram, but that leads to a discussion of #2.

 multi-dimensional histogram isn't such a simple concept, at least not to me.

 Histograms aren't a bar chart of equal widths and various heights like I was
 taught in school. They're actually bars of various widths arranged such that
 they all of the same heights.

That depends on whose definition you've got in mind. Wikipedia's
version (for whatever that's worth) defines it as variable width
columns of not-necessarily-uniform heights, where the area of each bar
is the frequency. The default behavior of the hist() function in the R
statistics package is uniform bar widths, which generates complaints
from purists. Semantics aside, it looks like pgsql's stats stuff
(which I realize I'll need to get to know lots better to undertake
something like this) uses a list of quantiles, which still only makes
sense, as I see it, when a distance measurement is available for the
data type. The multidimensional case might do better with a frequency
count, where we'd store a matrix/cube/etc. with uniformly-sized units,
probably compressed via some regression function. That said, a
multidimensional quantile list would also be possible, compressed
similarly.

Provided a frequency chart or quantile set, it seems estimates can be
made just as they are in one dimension, by finding the right value in
the quantile or frequency matrix.

 It's not clear how to extend that concept into two dimensions. I imagine
 there's research on this though. What do the GIST statistics functions store?

No idea.

- Josh / eggyknap



[HACKERS] Cross-column statistics revisited

2008-10-15 Thread Joshua Tolley
I've been interested in what it would take to start tracking
cross-column statistics. A review of the mailing lists as linked from
the TODO item on the subject [1] suggests the following concerns:

1) What information exactly would be tracked?
2) How would it be kept from exploding in size?
3) For which combinations of columns would statistics be kept?

The major concern in #1 seemed to be that the most suitable form for
keeping most common value lists, histograms, etc. is in an array, and
at the time of the posts I read, arrays of composite types weren't
possible. This seems much less of a concern now -- perhaps in greatest
part because a test I just did against a recent 8.4devel sure makes it
look like stats on composite-type columns aren't even kept. The most
straightforward option is that we'd keep a simple multi-dimensional
histogram, but that leads to a discussion of #2.

#2 is an interesting problem, and possibly the most difficult from a
theoretical perspective. Provided a fair expectation that the results
would be useful, I'd like to play with various forms of statistical
regressions to compress large cross-column histograms. But I'm not
sure I want to spend the time if there are other serious issues with
the whole idea of cross-column stats. Another possibility involves not
storing a histogram at all, but instead storing an array of quantiles
for each column. In other words, if S represents the statistics target
of a column, we'd have an array A of S+1 entries of the type in
question, where the beginning and end of the array are the minimum and
maximum values in the column, and the range between any two
consecutive array elements A[n] and A[n+1] contains 1/S of the total
entries in the column. For a uniformly distributed column, these array
elements would be evenly spaced across the total range of the column;
the difference between consecutive array elements would indicate
deviations from this distribution. Unfortunately this only works if
there's a valid distance metric for the data type in question.
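
As a hypothetical illustration with S = 4 over a table t(a) (ntile() needs
window-function support): the overall minimum plus each bucket's maximum
give the S+1 entries of A, with roughly 1/S of the rows between consecutive
entries.

    -- Equal-frequency bucket bounds for column a with S = 4.
    -- 't' and 'a' are placeholders.
    SELECT bucket,
           min(a)   AS bucket_low,
           max(a)   AS bucket_high,
           count(*) AS n_rows
    FROM (SELECT a, ntile(4) OVER (ORDER BY a) AS bucket FROM t) s
    GROUP BY bucket
    ORDER BY bucket;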

In the archives, #3 was raised frequently as a concern -- obviously
tracking stats on all permutations of the columns in a database is a
non-starter. However there seemed to be little argument over sensible
defaults, namely that columns involved in a multi-column index would
have cross-column statistics kept automatically, and the user would be
allowed to instruct the server to keep stats on other arbitrary groups
of columns. There was some discussion about the fact that the user
would have to be smart about how (s)he used this feature, but there
are enough other potential foot guns in the system of similar
magnitude that worrying too much about that seems of little use.

I bring all this up because I'm interested in looking at how this
might be done, and perhaps doing it if 1) time permits, and 2) someone
else doesn't beat me to it. Comments, anyone? Am I missing something
obvious/serious/etc.? Thanks in advance.

- Josh / eggyknap

[1] Allow accurate statistics to be collected on indexes with more
than one column or expression indexes, perhaps using per-index
statistics, http://wiki.postgresql.org/wiki/Todo



Re: [HACKERS] Cross column statistics

2005-02-08 Thread Christopher Kings-Lynne
B) gather a full matrix of the level of correlation between each column and
   each other column. If this were a single floating point number per pair
   then it might be feasible. It would still obviously be n^2 in the number of
   columns though, so there would have to be some way to limit on how many
   columns would be analyzed this way.
Use foreign keys to just record those cross-correlations.
Chris


Re: [HACKERS] Cross column statistics

2005-02-08 Thread Greg Stark
Christopher Kings-Lynne [EMAIL PROTECTED] writes:

  B) gather a full matrix of the level of correlation between each column 
  and
 each other column. If this were a single floating point number per pair
 then it might be feasible. It would still obviously be n^2 in the number 
  of
 columns though, so there would have to be some way to limit on how many
 columns would be analyzed this way.
 
 Use foreign keys to just record those cross-correlations.

My email was about cross-column intra-table correlations. Inter-table
correlations are a whole other ball of wax.

-- 
greg




[HACKERS] Cross column statistics

2005-02-05 Thread Greg Stark


Just brain storming a bit here. It seems to me there are two approaches for
cross-column statistics within a single table:

A) Treat an index on a,b,c the same way Postgres treats an expression index
   on ROW(a,b,c). That is, gather full statistics on the distribution of that
   ntuple of columns. I think this would be the easiest option for the
   analyzer. But:
   
   a) The optimizer would then need to do work to detect when all columns are
   present in the constraints and deduce the ntuple to look for in the
   statistics.

   b) It would only help if all the columns are used. I'm not sure how easy it
   would be to generalize this to queries using a,b or worse, b,c.

   c) It would only work if you create an index on the set of columns being
   queried. Often people have things like 

   SELECT * FROM tab WHERE indexed_col = ? AND deleted_flag IS false

   where deleted_flag *isn't* indexed or is a where clause on a partial index.

B) gather a full matrix of the level of correlation between each column and
   each other column. If this were a single floating point number per pair
   then it might be feasible. It would still obviously be n^2 in the number of
   columns though, so there would have to be some way to limit on how many
   columns would be analyzed this way.

   The problem is that it's *very* unclear how to gather this information
   using a sample in any remotely efficient manner. It's not even clear what
   this number would be measuring.

   It's not actually correlation that Postgres usually needs. It's "How many
   distinct values of b do we expect to find given a=a_0?" Or rather, how many
   do we expect to find relative to how many we would normally expect to find
   if the columns were independent.
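
A hypothetical way to compute that quantity directly, over an assumed table
t(a, b):

    -- Distinct values of b seen for each value of a, next to the table-wide
    -- ndistinct of b; a large gap means the columns are far from independent.
    -- 't', 'a' and 'b' are placeholders.
    SELECT a,
           count(DISTINCT b)                 AS ndistinct_b_given_a,
           (SELECT count(DISTINCT b) FROM t) AS ndistinct_b_overall
    FROM t
    GROUP BY a;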

-- 
greg

