Re: [HACKERS] proposal : cross-column stats

2010-12-27 Thread Tomas Vondra
On 24.12.2010 13:37, Florian Pflug wrote: On Dec24, 2010, at 11:23 , Nicolas Barbier wrote: 2010/12/24 Florian Pflug f...@phlo.org: On Dec23, 2010, at 20:39 , Tomas Vondra wrote: I guess we could use the highest possible value (equal to the number of tuples) - according to wiki

Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Nicolas Barbier
2010/12/24 Florian Pflug f...@phlo.org: On Dec23, 2010, at 20:39 , Tomas Vondra wrote:   I guess we could use the highest possible value (equal to the number   of tuples) - according to wiki you need about 10 bits per element   with 1% error, i.e. about 10MB of memory for each million of  
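
The "about 10 bits per element with 1% error" figure quoted throughout this sub-thread comes from the standard Bloom filter sizing formula. A minimal sketch (Python, not code from the thread) that reproduces it:

    import math

    def bloom_bits(n, p):
        """Bits m and hash count k for n elements at false-positive rate p."""
        m = -n * math.log(p) / (math.log(2) ** 2)  # total bits
        k = (m / n) * math.log(2)                  # optimal number of hash functions
        return m, k

    m, k = bloom_bits(1000000, 0.01)
    print(m / 1000000, round(k))  # ~9.6 bits per element, 7 hash functions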

Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread tv
2010/12/24 Florian Pflug f...@phlo.org: On Dec23, 2010, at 20:39 , Tomas Vondra wrote:   I guess we could use the highest possible value (equal to the number   of tuples) - according to wiki you need about 10 bits per element   with 1% error, i.e. about 10MB of memory for each million of  

Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Florian Pflug
On Dec24, 2010, at 11:23 , Nicolas Barbier wrote: 2010/12/24 Florian Pflug f...@phlo.org: On Dec23, 2010, at 20:39 , Tomas Vondra wrote: I guess we could use the highest possible value (equal to the number of tuples) - according to wiki you need about 10 bits per element with 1%

Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Tomas Vondra
On 24.12.2010 13:15, t...@fuzzy.cz wrote: 2010/12/24 Florian Pflug f...@phlo.org: On Dec23, 2010, at 20:39 , Tomas Vondra wrote: I guess we could use the highest possible value (equal to the number of tuples) - according to wiki you need about 10 bits per element with 1% error,

Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Tomas Vondra
On 24.12.2010 13:37, Florian Pflug wrote: On Dec24, 2010, at 11:23 , Nicolas Barbier wrote: 2010/12/24 Florian Pflug f...@phlo.org: On Dec23, 2010, at 20:39 , Tomas Vondra wrote: I guess we could use the highest possible value (equal to the number of tuples) - according to wiki

Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Tomas Vondra
On 24.12.2010 04:41, Florian Pflug wrote: The filter size could be derived from the table's statistics target, or be otherwise user-definable. We could also auto-resize once it gets too full. But still, that all seems awfully complex :-( Using a statistics target is a good idea I think. I

Re: [HACKERS] proposal : cross-column stats

2010-12-23 Thread Tomas Vondra
On 21.12.2010 16:54, Florian Pflug wrote: I think that maybe it'd be acceptable to scan a large portion of the table to estimate dist(A,B), *if* we must do so only once in a while. But even with a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all values in

Re: [HACKERS] proposal : cross-column stats

2010-12-23 Thread Tomas Vondra
On 21.12.2010 19:03, Tomas Vondra wrote: My plan for the near future (a few weeks) is to build a simple 'module' with the ability to estimate the number of rows for a given condition. This could actually be quite useful as a stand-alone contrib module, as the users often ask how to get a

Re: [HACKERS] proposal : cross-column stats

2010-12-23 Thread Florian Pflug
On Dec23, 2010, at 20:39 , Tomas Vondra wrote: On 21.12.2010 16:54, Florian Pflug wrote: I think that maybe it'd be acceptable to scan a large portion of the table to estimate dist(A,B), *if* we must do so only once in a while. But even with a full scan, how would we arrive at an

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread tv
On Dec18, 2010, at 17:59 , Tomas Vondra wrote: It seems to me you're missing one very important thing - this was not meant as a new default way to do estimates. It was meant as an option when the user (DBA, developer, ...) realizes the current solution gives really bad estimates (due to

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread Robert Haas
On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug f...@phlo.org wrote: I tried to pick up Robert's idea of quantifying Implicativeness - i.e., finding a number between 0 and 1 that describes how close the (A,B) are to representing a function A -> B. Actually Heikki's idea... Observe that

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread tv
On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug f...@phlo.org wrote: You might use that to decide if either A -> B or B -> A looks function-like enough to use the uniform Bayesian approach. Or you might even go further, and decide *which* Bayesian formula to use - the paper you cited always averages

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread Florian Pflug
On Dec21, 2010, at 11:37 , t...@fuzzy.cz wrote: I doubt there is a way to make this decision with just dist(A), dist(B) and dist(A,B) values. Well, we could go with a rule if [dist(A) == dist(A,B)] then [A = B] but that's very fragile. Think about estimates (we're not going to work with exact

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread tv
On Dec21, 2010, at 11:37 , t...@fuzzy.cz wrote: I doubt there is a way to make this decision with just dist(A), dist(B) and dist(A,B) values. Well, we could go with a rule if [dist(A) == dist(A,B)] then [A = B] but that's very fragile. Think about estimates (we're not going to work with exact

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread Florian Pflug
On Dec21, 2010, at 15:51 , t...@fuzzy.cz wrote: This is the reason why they choose to always combine the values (with varying weights). There are no varying weights involved there. What they do is to express P(A=x,B=y) once as ... P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 +

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread Florian Pflug
On Dec21, 2010, at 13:25 , t...@fuzzy.cz wrote: And there's one additional - IMHO very important - requirement. The whole thing should easily extend to more than two columns. This IF (F(A,B) > F(B,A)) THEN ... probably is not a good solution regarding this. For example given 3 columns A,B,C,

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread tv
On Dec21, 2010, at 15:51 , t...@fuzzy.cz wrote: This is the reason why they choose to always combine the values (with varying weights). There are no varying weights involved there. What they do is to express P(A=x,B=y) once as ... P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
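
Florian's point about the weights can be checked directly: both Bayes expansions of the joint probability are exact, so averaging them with fixed 1/2 weights is still exact; the two halves only diverge once each conditional is replaced by an approximation. A tiny sanity check (Python, with assumed toy numbers, not from the thread):

    # Assumed toy marginals and joint probability, for illustration only.
    p_a, p_b, p_ab = 0.20, 0.10, 0.08
    p_b_given_a = p_ab / p_a             # P(B=y|A=x)
    p_a_given_b = p_ab / p_b             # P(A=x|B=y)
    estimate = p_b_given_a * p_a / 2 + p_a_given_b * p_b / 2
    assert abs(estimate - p_ab) < 1e-12  # both terms equal P(A=x,B=y) exactly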

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread Tomas Vondra
On 21.12.2010 16:54, Florian Pflug wrote: Hmmm, maybe we could give this possibility (to identify two separate groups of columns) to the user. So instead of 'build stats for (A,B,C)' the user would say 'build stats for (A,B) and (C)' - this actually represents a priori knowledge of

Re: [HACKERS] proposal : cross-column stats

2010-12-20 Thread Florian Pflug
On Dec18, 2010, at 17:59 , Tomas Vondra wrote: It seems to me you're missing one very important thing - this was not meant as a new default way to do estimates. It was meant as an option when the user (DBA, developer, ...) realizes the current solution gives really bad estimates (due to

Re: [HACKERS] proposal : cross-column stats

2010-12-19 Thread Simon Riggs
On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra t...@fuzzy.cz wrote: The proposed solution is based on contingency tables, built for selected groups of columns (not for each possible group). And the

Re: [HACKERS] proposal : cross-column stats

2010-12-19 Thread Tomas Vondra
On 19.12.2010 21:21, Simon Riggs wrote: On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra t...@fuzzy.cz wrote: The proposed solution is based on contingency tables, built for selected groups of

Re: [HACKERS] proposal : cross-column stats

2010-12-18 Thread Florian Pflug
On Dec18, 2010, at 08:10 , t...@fuzzy.cz wrote: On Dec17, 2010, at 23:12 , Tomas Vondra wrote: Well, not really - I haven't done any experiments with it. For two columns selectivity equation is (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B)) where A and B are columns, dist(X)
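
For reference, the two-column estimate quoted above written out as a small sketch (Python; function and argument names are mine, not from the thread). With the postcode/city example discussed elsewhere in the thread, dist(A,B) equals dist(A) and the estimate collapses to the postcode selectivity:

    def cross_sel(sel_a, sel_b, dist_a, dist_b, dist_ab):
        """sel(A=a AND B=b) ~ (dist(A)*sel(A=a) + dist(B)*sel(B=b)) / (2*dist(A,B))"""
        return (dist_a * sel_a + dist_b * sel_b) / (2.0 * dist_ab)

    # 40,000 postcodes, 500 cities, postcode implies city: dist(A,B) == dist(A)
    print(cross_sel(1 / 40000.0, 1 / 500.0, 40000, 500, 40000))  # 2.5e-05 = sel(postcode)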

Re: [HACKERS] proposal : cross-column stats

2010-12-18 Thread Tomas Vondra
On 18.12.2010 16:59, Florian Pflug wrote: On Dec18, 2010, at 08:10 , t...@fuzzy.cz wrote: On Dec17, 2010, at 23:12 , Tomas Vondra wrote: Well, not really - I haven't done any experiments with it. For two columns selectivity equation is (dist(A) * sel(A) + dist(B) * sel(B)) / (2 *

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
On 12.12.2010 15:43, Heikki Linnakangas wrote: On 12.12.2010 15:17, Martijn van Oosterhout wrote: On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote: Very cool that you're working on this. +1 Let's talk about one special case - I'll explain how the proposed solution works,

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Robert Haas
On Fri, Dec 17, 2010 at 12:58 PM, Tomas Vondra t...@fuzzy.cz wrote: In the end, all they need to compute an estimate is number of distinct values for each of the columns (we already have that in pg_stats) and a number of distinct values for the group of columns in a query. They really don't
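
To make the data requirement concrete: the per-column distinct counts are already in pg_stats, while the per-group count is the new piece ANALYZE would have to collect. A rough sketch of where each number would come from (Python/psycopg2; the table and column names are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()
    # dist(A), dist(B): already collected (negative values are fractions of reltuples)
    cur.execute("""SELECT attname, n_distinct FROM pg_stats
                   WHERE tablename = 'addresses' AND attname IN ('zip', 'city')""")
    per_column = dict(cur.fetchall())
    # dist(A,B): not collected today; measured here with a (costly) full scan
    cur.execute("SELECT count(DISTINCT (zip, city)) FROM addresses")
    dist_ab = cur.fetchone()[0]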

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
Hi, I've read about 10 papers about estimation this week, and I have gained some insight. I'm not going to describe each of the papers here, I plan to set up a wiki page about cross column statistics http://wiki.postgresql.org/wiki/Cross_Columns_Stats and I'll put the list of papers and some

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
On 17.12.2010 19:58, Robert Haas wrote: I haven't read the paper yet (sorry) but just off the top of my head, one possible problem here is that our n_distinct estimates aren't always very accurate, especially for large tables. As we've discussed before, making them accurate requires

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes: AFAIK it will work with reasonably precise estimates, but the point is you need an estimate of distinct values of the whole group of columns. So when you want to get an estimate for queries on columns (a,b), you need the number of distinct value combinations

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Heikki Linnakangas
On 17.12.2010 23:13, Tomas Vondra wrote: On 17.12.2010 19:58, Robert Haas wrote: I haven't read the paper yet (sorry) but just off the top of my head, one possible problem here is that our n_distinct estimates aren't always very accurate, especially for large tables. As we've discussed

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
On 17.12.2010 22:24, Tom Lane wrote: That seems likely to be even more unreliable than our existing single-column estimates :-( regards, tom lane Well, yes :-( I guess this is a place where we could use a multi-column index, if it contains all the interesting

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
On 17.12.2010 22:41, Heikki Linnakangas wrote: On 17.12.2010 23:13, Tomas Vondra wrote: On 17.12.2010 19:58, Robert Haas wrote: I haven't read the paper yet (sorry) but just off the top of my head, one possible problem here is that our n_distinct estimates aren't always very

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Florian Pflug
On Dec17, 2010, at 23:12 , Tomas Vondra wrote: Well, not really - I haven't done any experiments with it. For two columns selectivity equation is (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B)) where A and B are columns, dist(X) is number of distinct values in column X and

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread tv
On Dec17, 2010, at 23:12 , Tomas Vondra wrote: Well, not really - I haven't done any experiments with it. For two columns selectivity equation is (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B)) where A and B are columns, dist(X) is number of distinct values in column X and

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Yeb Havinga
On 2010-12-13 03:28, Robert Haas wrote: Well, I'm not real familiar with contingency tables, but it seems like you could end up needing to store a huge amount of data to get any benefit out of it, in some cases. For example, in the United States, there are over 40,000 postal codes, and some

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread tv
On 2010-12-13 03:28, Robert Haas wrote: Well, I'm not real familiar with contingency tables, but it seems like you could end up needing to store a huge amount of data to get any benefit out of it, in some cases. For example, in the United States, there are over 40,000 postal codes, and some

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes: Well, until this point we've discussed failure cases involving 'AND' conditions. What about 'OR' conditions? I think the current optimizer computes the selectivity as 's1+s2 - s1*s2' (at least that's what I found in backend/optimizer/path/clausesel.c:630).
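
The quoted formula is inclusion-exclusion with sel(A AND B) replaced by s1*s2, i.e. the independence assumption again; a cross-column AND estimate would slot straight into the same expression. A small sketch (Python, not PostgreSQL source):

    def or_sel_independent(s1, s2):
        return s1 + s2 - s1 * s2   # current planner behaviour for OR clauses

    def or_sel(s1, s2, s_and):
        return s1 + s2 - s_and     # exact, given a real estimate of sel(A AND B)

    print(or_sel_independent(0.01, 0.01))  # 0.0199
    print(or_sel(0.01, 0.01, 0.01))        # 0.01 when one clause implies the other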

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes: On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra t...@fuzzy.cz wrote: The proposed solution is based on contingency tables, built for selected groups of columns (not for each possible group). And the contingency table gives you the ability to estimate the

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Joshua Tolley
On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote: Another quick note: I think that storing the full contingency table is wasteful since the marginals are already stored in the single column statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was looking at this a couple

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tomas Vondra
On 13.12.2010 16:34, Tom Lane wrote: Tomas Vondra t...@fuzzy.cz writes: Well, until this point we've discussed failure cases involving 'AND' conditions. What about 'OR' conditions? I think the current optimizer computes the selectivity as 's1+s2 - s1*s2' (at least that's what I found in

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tomas Vondra
On 13.12.2010 16:38, Tom Lane wrote: The reason that this wasn't done years ago is precisely that nobody's figured out how to do it with a tolerable amount of stats data and a tolerable amount of processing time (both at ANALYZE time and during query planning). It's not hard to see what

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tomas Vondra
On 13.12.2010 18:59, Joshua Tolley wrote: On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote: Another quick note: I think that storing the full contingency table is wasteful since the marginals are already stored in the single column statistics. Look at copulas [2] ( FWIW I

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Josh Berkus
Tomas, (a) find out what statistics do we need to collect and how to use it (b) implement a really stupid inefficient solution (c) optimize in iterations, i.e. making it faster, consuming less space etc. I'll suggest again how to decide *which* columns to cross: whichever columns

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tomas Vondra
On 13.12.2010 22:50, Josh Berkus wrote: Tomas, (a) find out what statistics do we need to collect and how to use it (b) implement a really stupid inefficient solution (c) optimize in iterations, i.e. making it faster, consuming less space etc. I'll suggest again how to

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Pavel Stehule
2010/12/13 Josh Berkus j...@agliodbs.com: Tomas,   (a) find out what statistics do we need to collect and how to use it   (b) implement a really stupid inefficient solution   (c) optimize in iterations, i.e. making it faster, consuming less       space etc. I'll suggest again how to decide

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Robert Haas
On Mon, Dec 13, 2010 at 5:45 PM, Tomas Vondra t...@fuzzy.cz wrote: I'll suggest again how to decide *which* columns to cross: whichever columns are combined in composite indexes.  In version 2, allow the DBA to specify combinations. I think this is a bit early to discuss this, given the fact

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Martijn van Oosterhout
On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote: Hi everyone, one of the sessions I attended at PgDay last week was Heikki's session about statistics in PostgreSQL. One of the issues he mentioned (and one I regularly run into) is the absence of cross-column stats. When the

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Heikki Linnakangas
On 12.12.2010 15:17, Martijn van Oosterhout wrote: On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote: Very cool that you're working on this. +1 Let's talk about one special case - I'll explain how the proposed solution works, and then I'll explain how to make it more general, what

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
Hi! On 12.12.2010 15:17, Martijn van Oosterhout wrote: Let's talk about one special case - I'll explain how the proposed solution works, and then I'll explain how to make it more general, what improvements are possible, what issues are there. Anyway this is by no means a perfect or

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
On 12.12.2010 15:43, Heikki Linnakangas wrote: The classic failure case has always been: postcodes and city names. Strongly correlated, but in a way that the computer can't easily see. Yeah, and that's actually analogous to the example I used in my presentation. The way I think of

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Florian Pflug
On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote: The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any information. The postcode implies the city name. So the selectivity for postcode = ? AND city = ? should be the selectivity of

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
On 12.12.2010 17:33, Florian Pflug wrote: On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote: The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any information. The postcode implies the city name. So the selectivity for postcode = ?

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Robert Haas
On Sun, Dec 12, 2010 at 9:43 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any information. The postcode implies the city name. So the selectivity for postcode = ? AND city

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
On 13.12.2010 01:05, Robert Haas wrote: This is a good idea, but I guess the question is what you do next. If you know that the applicability is 100%, you can disregard the restriction clause on the implied column. And if it has no implicatory power, then you just do what we do now.

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Robert Haas
On Sun, Dec 12, 2010 at 8:46 PM, Tomas Vondra t...@fuzzy.cz wrote: On 13.12.2010 01:05, Robert Haas wrote: This is a good idea, but I guess the question is what you do next. If you know that the applicability is 100%, you can disregard the restriction clause on the implied column. And

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
P(A|B) = P(A and B) / P(B). Well, until this point we've discussed failure cases involving 'AND' conditions. What about 'OR' conditions? I think the current optimizer computes the selectivity as 's1+s2 - s1*s2' (at least that's what I found in backend/optimizer/path/clausesel.c:630). Sometimes

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
On 13.12.2010 03:00, Robert Haas wrote: Well, the question is what data you are actually storing. It's appealing to store a measure of the extent to which a constraint on column X constrains column Y, because you'd only need to store O(ncolumns^2) values, which would be reasonably

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Robert Haas
On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra t...@fuzzy.cz wrote: On 13.12.2010 03:00, Robert Haas wrote: Well, the question is what data you are actually storing. It's appealing to store a measure of the extent to which a constraint on column X constrains column Y, because you'd only

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Nathan Boley
The proposed solution is based on contingency tables, built for selected groups of columns (not for each possible group). And the contingency table gives you the ability to estimate the probabilities needed to compute the selectivity. Or am I missing something? Well, I'm not real familiar

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread tv
On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra t...@fuzzy.cz wrote: On 13.12.2010 03:00, Robert Haas wrote: Well, the question is what data you are actually storing. It's appealing to store a measure of the extent to which a constraint on column X constrains column Y, because you'd only

[HACKERS] proposal : cross-column stats

2010-12-11 Thread Tomas Vondra
Hi everyone, one of the sessions I attended at PgDay last week was Heikki's session about statistics in PostgreSQL. One of the issues he mentioned (and one I regularly run into) is the absence of cross-column stats. When the columns are not independent, this usually results in poor estimates