Re: [HACKERS] Cross-column statistics revisited
> I still need to go through backend/utils/adt/selfuncs.c to figure out
> exactly how we use the one-dimensional values.

Here's a page that helped me figure all this out:

http://www.postgresql.org/docs/8.1/static/planner-stats-details.html

>> 2) Do we want to fold the MCVs into the dependence histogram? That
>> will cause problems in our copula approach but I'd hate to have to
>> keep an N^d histogram dependence relation in addition to the copula.
>
> Yeah, if we're already trying to figure out how to compress copulae,
> having also to compress MCV matrices seems painful and error-prone.
> But I'm not sure why it would cause problems to keep them in the
> copula -- is that just because we are most interested in the copula
> modeling the parts of the distribution that are most sparsely
> populated?

The problem I was thinking of is that we don't currently store the true
marginal distribution. As it stands, histograms only include non-MCV
values. So we would either need to treat the MCVs separately (which
would assume independence between MCV and non-MCV values) or store
multiple histograms.

>> 4) How will this approach deal with histogram buckets that have
>> scaling count sizes (i.e. -0.4)?
>
> I'm not sure what you mean here.

That was more a note to myself, and should have been numbered 3.5.
ndistinct estimates currently start to scale after a large enough
row/ndistinct ratio. If we try to model ndistinct, we need to deal with
scaling ndistinct counts somehow. But that's way off in the future; it
was probably pointless to mention it.
Re: [HACKERS] Cross-column statistics revisited
On Fri, Oct 17, 2008 at 7:54 PM, Nathan Boley <[EMAIL PROTECTED]> wrote:
>> I'm still working my way around the math, but copulas sound better
>> than anything else I've been playing with.
>
> I think the easiest way to think of them is, in 2-D finite spaces,
> they are just a plot of the order statistics against one another.
> Feel free to mail me off list if you have any math questions.

I'm still working out what a copula is, and I've got to figure out the
stuff below, too! :) /me has reading to do...

The treatment of copulae in one of my stats texts is much better than
Wikipedia's, I thought, so I'm progressing. They sound like an
excellent potential solution the more I figure out what they are, for
whatever my opinion is worth, but making sure they work decently for
all data types, including those where there's no real notion of
distance, may be interesting.

I still need to go through backend/utils/adt/selfuncs.c to figure out
exactly how we use the one-dimensional values.

> I've previously thought that, at least in the 2D case, we could use
> image compression algorithms to compress the copula, but recently I've
> realized that this is a change point problem. In terms of compression,
> we want to decompose the copula into regions that are as homogeneous
> as possible. I'm not familiar with change point problems in multiple
> dimensions, but I'll try and ask someone who is, probably late next
> week. If you decide to go the copula route, I'd be happy to write the
> decomposition algorithm - or at least work on the theory.
>
> Finally, a couple of points that I hadn't seen mentioned earlier that
> should probably be considered:
>
> 1) NULLs need to be treated specially - I suspect the assumption of
> NULL independence is worse than other independence assumptions. Maybe
> dealing with NULL dependence could be a half step towards full
> dependence calculations?

Agreed, though how to treat them I have no idea.

> 2) Do we want to fold the MCVs into the dependence histogram? That
> will cause problems in our copula approach but I'd hate to have to
> keep an N^d histogram dependence relation in addition to the copula.

Yeah, if we're already trying to figure out how to compress copulae,
having also to compress MCV matrices seems painful and error-prone.
But I'm not sure why it would cause problems to keep them in the
copula -- is that just because we are most interested in the copula
modeling the parts of the distribution that are most sparsely
populated?

> 3) For equality selectivity estimates, I believe the assumption that
> the ndistinct value distribution is uniform in the histogram will
> become worse as the dimension increases. I proposed keeping track of
> ndistinct per histogram bucket earlier in the marginal case, partially
> motivated by this exact scenario. Does that proposal make more sense
> in this case? If so we'd need to store two distributions - one of the
> counts and one of ndistinct.

It's definitely worth investigating (or seeing if someone else has
already published an investigation). There are probably several
similar stats we could track and make use of, provided it didn't slow
the planner too much to make use of them.

> 4) How will this approach deal with histogram buckets that have
> scaling count sizes (i.e. -0.4)?

I'm not sure what you mean here.

- Josh / eggyknap
Re: [HACKERS] Cross-column statistics revisited
> I'm still working my way around the math, but copulas sound better
> than anything else I've been playing with.

I think the easiest way to think of them is, in 2-D finite spaces, they
are just a plot of the order statistics against one another. Feel free
to mail me off list if you have any math questions.

I've previously thought that, at least in the 2D case, we could use
image compression algorithms to compress the copula, but recently I've
realized that this is a change point problem. In terms of compression,
we want to decompose the copula into regions that are as homogeneous as
possible. I'm not familiar with change point problems in multiple
dimensions, but I'll try and ask someone who is, probably late next
week. If you decide to go the copula route, I'd be happy to write the
decomposition algorithm - or at least work on the theory.

Finally, a couple of points that I hadn't seen mentioned earlier that
should probably be considered:

1) NULLs need to be treated specially - I suspect the assumption of
NULL independence is worse than other independence assumptions. Maybe
dealing with NULL dependence could be a half step towards full
dependence calculations?

2) Do we want to fold the MCVs into the dependence histogram? That will
cause problems in our copula approach but I'd hate to have to keep an
N^d histogram dependence relation in addition to the copula.

3) For equality selectivity estimates, I believe the assumption that
the ndistinct value distribution is uniform in the histogram will
become worse as the dimension increases. I proposed keeping track of
ndistinct per histogram bucket earlier in the marginal case, partially
motivated by this exact scenario. Does that proposal make more sense in
this case? If so we'd need to store two distributions - one of the
counts and one of ndistinct.

4) How will this approach deal with histogram buckets that have scaling
count sizes (i.e. -0.4)?
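To make the "plot of the order statistics against one another"
description concrete, here is a rough sketch in Python of building an
empirical copula grid from a sampled column pair. The sample data, bin
count and function names are all invented for illustration; this is a
sketch of the math, not proposed PostgreSQL code.

    # Sketch: an empirical copula (dependence grid) built from sampled
    # (a, b) pairs.  The grid depends only on the ranks of the values,
    # so the marginal distributions are factored out entirely.
    from collections import Counter

    def ranks(values):
        # map each value to its normalized rank in (0, 1]
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank / len(values)
        return r

    def empirical_copula(sample_a, sample_b, bins=4):
        ra, rb = ranks(sample_a), ranks(sample_b)
        grid = Counter()
        for ua, ub in zip(ra, rb):
            cell = (min(int(ua * bins), bins - 1),
                    min(int(ub * bins), bins - 1))
            grid[cell] += 1
        return grid

    # toy example: b depends strongly on a, so the mass piles up on the
    # diagonal cells; independent columns would give roughly uniform
    # counts across all bins * bins cells
    sample_a = list(range(100))
    sample_b = [a * 2 + (a % 7) for a in sample_a]
    print(empirical_copula(sample_a, sample_b))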
Re: [HACKERS] Cross-column statistics revisited
On Fri, Oct 17, 2008 at 3:47 PM, Nathan Boley <[EMAIL PROTECTED]> wrote:
>>>> Right now our "histogram" values are really quantiles; the
>>>> statistics_target T for a column determines a number of quantiles
>>>> we'll keep track of, and we grab values from it into an ordered
>>>> list L so that approximately 1/T of the entries in that column fall
>>>> between values L[n] and L[n+1]. I'm thinking that multicolumn
>>>> statistics would instead divide the range of each column up into T
>>>> equally sized segments,
>>>
>>> Why would you not use the same histogram bin bounds derived for the
>>> scalar stats (along each axis of the matrix, of course)? This seems
>>> to me to be arbitrarily replacing something proven to work with
>>> something not proven. Also, the above forces you to invent a concept
>>> of "equally sized" ranges, which is going to be pretty bogus for a
>>> lot of datatypes.
>>
>> Because I'm trying to picture geometrically how this might work for
>> the two-column case, and hoping to extend that to more dimensions,
>> and am finding that picturing a quantile-based system like the one we
>> have now in multiple dimensions is difficult. I believe those are the
>> same difficulties Gregory Stark mentioned having in his first post in
>> this thread. But of course that's an excellent point, that what we do
>> now is proven. I'm not sure which problem will be harder to solve --
>> the weird geometry or the "equally sized ranges" for data types where
>> that makes no sense.
>
> Look at copulas. They are a completely general method of describing
> the dependence between two marginal distributions. It seems silly to
> rewrite the stats table in terms of joint distributions when we'll
> still need the marginals anyway. Also, it might be easier to think of
> the dimension reduction problem in that form.

I'm still working my way around the math, but copulas sound better than
anything else I've been playing with. What's more, there are plenty of
existing implementations to refer to, provided it's done in a
licensing-friendly way. A multidimensional extension of our existing
stuff, at least in the ways I've been thinking of it, quickly becomes a
recursive problem -- perhaps some dynamic programming solution would
solve it, but copulas seem the more common solution for similar
problems in other fields.

Thanks.

- Josh / eggyknap
Re: [HACKERS] Cross-column statistics revisited
>>> Right now our "histogram" values are really quantiles; the
>>> statistics_target T for a column determines a number of quantiles
>>> we'll keep track of, and we grab values from it into an ordered list
>>> L so that approximately 1/T of the entries in that column fall
>>> between values L[n] and L[n+1]. I'm thinking that multicolumn
>>> statistics would instead divide the range of each column up into T
>>> equally sized segments,
>>
>> Why would you not use the same histogram bin bounds derived for the
>> scalar stats (along each axis of the matrix, of course)? This seems
>> to me to be arbitrarily replacing something proven to work with
>> something not proven. Also, the above forces you to invent a concept
>> of "equally sized" ranges, which is going to be pretty bogus for a
>> lot of datatypes.
>
> Because I'm trying to picture geometrically how this might work for
> the two-column case, and hoping to extend that to more dimensions, and
> am finding that picturing a quantile-based system like the one we have
> now in multiple dimensions is difficult. I believe those are the same
> difficulties Gregory Stark mentioned having in his first post in this
> thread. But of course that's an excellent point, that what we do now
> is proven. I'm not sure which problem will be harder to solve -- the
> weird geometry or the "equally sized ranges" for data types where that
> makes no sense.

Look at copulas. They are a completely general method of describing the
dependence between two marginal distributions. It seems silly to
rewrite the stats table in terms of joint distributions when we'll
still need the marginals anyway. Also, it might be easier to think of
the dimension reduction problem in that form.
Re: [HACKERS] Cross-column statistics revisited
Tom Lane wrote:
> A bad estimate for physical-position correlation has only limited
> impact,

Ah! This seems very true with 8.3 but much less true with 8.0.

On a legacy 8.0 system I have a hard time avoiding cases where a query
like "select * from addresses where add_state_or_province = 'CA';" does
a 2-second full-table scan instead of a 300ms index scan thanks to a
poor physical order guess.

I just sucked the table into 8.3 and am pleased to say that it picks a
200ms bitmap scan even with the misleading correlation. Thanks for
bitmap scans guys!

I'll shut up about this physical ordering stuff now and try to do
better upgrading before posting.
Re: [HACKERS] Cross-column statistics revisited
Martijn van Oosterhout <[EMAIL PROTECTED]> writes:
> Just a note: using multidimensional histograms will work well for
> cases like (startdate, enddate) where the histogram will show a
> clustering of values along the diagonal. But it will fail for the case
> (zipcode, state) where one implies the other. Histogram-wise you're
> not going to see any correlation at all

Huh? Sure you are. What the histogram will show is that there is only
one state value per zipcode, and only a limited subset of zipcodes per
state. The nonempty cells won't cluster along the "diagonal" but we
don't particularly care about that. What we really want from this is to
not think that

    WHERE zip = '80210' AND state = 'CA'

is significantly more selective than just

    WHERE zip = '80210'

A histogram is certainly capable of telling us that.

Whether it's the most compact representation is another question of
course --- in an example like this, only about 1/50th of the cells
would contain nonzero counts ...

			regards, tom lane
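A toy numeric version of that argument, in Python with an invented
zip/state sample (not planner code): once a joint frequency table is
available, the estimate for the conjunction stops being the product of
the two marginal fractions.

    # Toy illustration: a joint frequency table over (zip, state) makes
    # "zip = Z AND state = S" come out about as selective as "zip = Z"
    # alone, instead of being multiplied down by the frequency of S.
    from collections import Counter

    rows = [("80210", "CO")] * 50 + [("94101", "CA")] * 30 + \
           [("02139", "MA")] * 20
    n = len(rows)
    joint = Counter(rows)
    zip_marg = Counter(z for z, _ in rows)
    state_marg = Counter(s for _, s in rows)

    # what the planner effectively does today: multiply the marginals
    independent_est = (zip_marg["80210"] / n) * (state_marg["CO"] / n)
    # what a joint histogram-matrix would give
    joint_est = joint[("80210", "CO")] / n

    print(independent_est, joint_est)   # 0.25 vs the true 0.5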
Re: [HACKERS] Cross-column statistics revisited
Gregory Stark wrote:
> They're certainly very much not independent variables. There are lots
> of ways of measuring how much dependence there is between them. I
> don't know enough about the math to know if your maps are equivalent
> to any of them.

I think "dependency" captures the way I think about it rather than
correlation (although I can see there must be a function that could map
that dependency onto how we think of correlations).

> In any case as I described it's not enough information to know that
> the two data sets are heavily dependent. You need to know for which
> pairs (or ntuples) that dependency results in a higher density and for
> which it results in lower density and how much higher or lower. That
> seems like a lot of information to encode (and a lot to find in the
> sample).

Like Josh Berkus mentioned a few posts back, it's the handful of
plan-changing values you're looking for.

So, it seems like we've got:

1. Implied dependencies: zip-code => city
2. Implied + constraint: start-date < end-date, and the difference
   between the two is usually less than a week
3. "Top-heavy" foreign-key stats

#1 and #2 obviously need new infrastructure. From a non-dev point of
view it looks like #3 could use the existing stats on each side of the
join. I'm not sure whether you could do anything meaningful for joins
that don't explicitly specify one side of the join though.

> Perhaps just knowing that there's a dependence between two data sets
> might be somewhat useful if the planner kept a confidence value for
> all its estimates. It would know to have a lower confidence value for
> estimates coming from highly dependent clauses? It wouldn't be very
> easy for the planner to distinguish "safe" plans for low confidence
> estimates and "risky" plans which might blow up if the estimates are
> wrong though. And of course that's a lot less interesting than just
> getting better estimates :)

If we could abort a plan and restart then we could just try the
quick-but-risky plan and, if we reach 50 rows rather than the expected
10, try a different approach. That way we'd not need to gather stats,
just react to the situation in individual queries.

-- 
  Richard Huxton
  Archonet Ltd
Re: [HACKERS] Cross-column statistics revisited
Martijn van Oosterhout <[EMAIL PROTECTED]> writes:
> On Fri, Oct 17, 2008 at 12:20:58AM +0200, Greg Stark wrote:
>> Correlation is the wrong tool. In fact zip codes and city have nearly
>> zero correlation. Zip codes near 0 are no more likely to be in
>> cities starting with A than Z.
>
> I think we need to define our terms better. In terms of linear
> correlation you are correct. However, you can define invertible
> mappings from zip codes and cities onto the integers which will then
> have an almost perfect correlation.
>
> According to a paper I found this is related to the "principle of
> maximum entropy". The fact that you can't determine such functions
> easily in practice doesn't change the fact that zip codes and city
> names are highly correlated.

They're certainly very much not independent variables. There are lots
of ways of measuring how much dependence there is between them. I don't
know enough about the math to know if your maps are equivalent to any
of them.

In any case as I described it's not enough information to know that the
two data sets are heavily dependent. You need to know for which pairs
(or ntuples) that dependency results in a higher density and for which
it results in lower density, and how much higher or lower. That seems
like a lot of information to encode (and a lot to find in the sample).

Perhaps just knowing that there's a dependence between two data sets
might be somewhat useful if the planner kept a confidence value for all
its estimates. It would know to have a lower confidence value for
estimates coming from highly dependent clauses? It wouldn't be very
easy for the planner to distinguish "safe" plans for low confidence
estimates and "risky" plans which might blow up if the estimates are
wrong though. And of course that's a lot less interesting than just
getting better estimates :)

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!
Re: [HACKERS] Cross-column statistics revisited
On Fri, Oct 17, 2008 at 12:20:58AM +0200, Greg Stark wrote:
> Correlation is the wrong tool. In fact zip codes and city have nearly
> zero correlation. Zip codes near 0 are no more likely to be in
> cities starting with A than Z.

I think we need to define our terms better. In terms of linear
correlation you are correct. However, you can define invertible
mappings from zip codes and cities onto the integers which will then
have an almost perfect correlation.

According to a paper I found this is related to the "principle of
maximum entropy". The fact that you can't determine such functions
easily in practice doesn't change the fact that zip codes and city
names are highly correlated.

Have a nice day,
-- 
Martijn van Oosterhout <[EMAIL PROTECTED]>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
Re: [HACKERS] Cross-column statistics revisited
On Thu, Oct 16, 2008 at 09:17:03PM -0600, Joshua Tolley wrote:
> Because I'm trying to picture geometrically how this might work for
> the two-column case, and hoping to extend that to more dimensions, and
> am finding that picturing a quantile-based system like the one we have
> now in multiple dimensions is difficult.

Just a note: using multidimensional histograms will work well for cases
like (startdate, enddate) where the histogram will show a clustering of
values along the diagonal. But it will fail for the case
(zipcode, state) where one implies the other. Histogram-wise you're not
going to see any correlation at all, but what you want to know is:

    count(distinct zipcode, state) = count(distinct zipcode)

So you might need to think about storing/searching for different kinds
of correlation.

Secondly, my feeling about multidimensional histograms is that you're
not going to need the matrix to have 100 bins along each axis, but that
it'll be enough to have 1000 bins total. The cases where we get it
wrong enough for people to notice will probably be the same cases where
the histogram will have noticeable variation even for a small number of
bins.

Have a nice day,
-- 
Martijn van Oosterhout <[EMAIL PROTECTED]>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
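As a purely illustrative sketch of that check, in Python over a made-up
sample of pairs (not ANALYZE code):

    # If the number of distinct (zipcode, state) pairs in the sample
    # equals the number of distinct zipcodes, zipcode functionally
    # determines state in the sample, whatever the histograms look like.
    sample = [("80210", "CO"), ("80211", "CO"),
              ("94101", "CA"), ("80210", "CO")]

    distinct_pairs = len(set(sample))
    distinct_zips = len({z for z, _ in sample})

    if distinct_pairs == distinct_zips:
        print("zipcode appears to determine state in this sample")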
Re: [HACKERS] Cross-column statistics revisited
On Thu, Oct 16, 2008 at 8:38 PM, Tom Lane <[EMAIL PROTECTED]> wrote:
> "Joshua Tolley" <[EMAIL PROTECTED]> writes:
>> For what it's worth, neither version of correlation was what I had in
>> mind. Statistical correlation between two variables is a single
>> number, is fairly easy to calculate, and probably wouldn't help query
>> plans much at all. I'm more interested in a more complex data
>> gathering. The data model I have in mind (which I note I have *not*
>> proven to actually help a large number of query plans -- that's
>> obviously an important part of what I'd need to do in all this)
>> involves instead a matrix of frequency counts.
>
> Oh, I see the distinction you're trying to draw. Agreed on both
> points: a simple correlation number is pretty useless to us, and we
> don't have hard evidence that a histogram-matrix will solve the
> problem. However, we do know that one-dimensional histograms of the
> sort currently collected by ANALYZE work pretty well (if they're
> detailed enough). It seems reasonable to me to extrapolate that same
> concept to two or more dimensions. The issue then becomes that a
> "detailed enough" matrix might be impracticably bulky, so you need
> some kind of lossy compression, and we don't have hard evidence about
> how well that will work. Nonetheless the road map seems clear enough
> to me.

Oh goodie -- I feel so validated :)

>> Right now our "histogram" values are really quantiles; the
>> statistics_target T for a column determines a number of quantiles
>> we'll keep track of, and we grab values from it into an ordered list
>> L so that approximately 1/T of the entries in that column fall
>> between values L[n] and L[n+1]. I'm thinking that multicolumn
>> statistics would instead divide the range of each column up into T
>> equally sized segments,
>
> Why would you not use the same histogram bin bounds derived for the
> scalar stats (along each axis of the matrix, of course)? This seems to
> me to be arbitrarily replacing something proven to work with something
> not proven. Also, the above forces you to invent a concept of "equally
> sized" ranges, which is going to be pretty bogus for a lot of
> datatypes.

Because I'm trying to picture geometrically how this might work for the
two-column case, and hoping to extend that to more dimensions, and am
finding that picturing a quantile-based system like the one we have now
in multiple dimensions is difficult. I believe those are the same
difficulties Gregory Stark mentioned having in his first post in this
thread. But of course that's an excellent point, that what we do now is
proven. I'm not sure which problem will be harder to solve -- the weird
geometry or the "equally sized ranges" for data types where that makes
no sense.

One option that came to mind recently would be to have statistics for a
subset of the rows in a single column A, where that subset is defined
by conditions on some other column B with a defined relationship to A
(for instance, that they're in the same table). There are all kinds of
complexities possible with that idea, but one bright spot is that the
values in the catalog and the way the planner makes use of them would
stay essentially the same as they are now. That one's interesting to
think about, but probably too complex for real use.

Anyway, I'll keep pondering.

- Josh / eggyknap
Re: [HACKERS] Cross-column statistics revisited
"Joshua Tolley" <[EMAIL PROTECTED]> writes: > For what it's worth, neither version of correlation was what I had in > mind. Statistical correlation between two variables is a single > number, is fairly easy to calculate, and probably wouldn't help query > plans much at all. I'm more interested in a more complex data > gathering. The data model I have in mind (which I note I have *not* > proven to actually help a large number of query plans -- that's > obviously an important part of what I'd need to do in all this) > involves instead a matrix of frequency counts. Oh, I see the distinction you're trying to draw. Agreed on both points: a simple correlation number is pretty useless to us, and we don't have hard evidence that a histogram-matrix will solve the problem. However, we do know that one-dimensional histograms of the sort currently collected by ANALYZE work pretty well (if they're detailed enough). It seems reasonable to me to extrapolate that same concept to two or more dimensions. The issue then becomes that a "detailed enough" matrix might be impracticably bulky, so you need some kind of lossy compression, and we don't have hard evidence about how well that will work. Nonetheless the road map seems clear enough to me. > Right now our > "histogram" values are really quantiles; the statistics_target T for a > column determines a number of quantiles we'll keep track of, and we > grab values from into an ordered list L so that approximately 1/T of > the entries in that column fall between values L[n] and L[n+1]. I'm > thinking that multicolumn statistics would instead divide the range of > each column up into T equally sized segments, Why would you not use the same histogram bin bounds derived for the scalar stats (along each axis of the matrix, of course)? This seems to me to be arbitrarily replacing something proven to work with something not proven. Also, the above forces you to invent a concept of "equally sized" ranges, which is going to be pretty bogus for a lot of datatypes. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cross-column statistics revisited
On Thu, Oct 16, 2008 at 6:32 PM, Tom Lane <[EMAIL PROTECTED]> wrote:
> It appears to me that a lot of people in this thread are confusing
> correlation in the sense of statistical correlation between two
> variables with correlation in the sense of how well physically-ordered
> a column is.

For what it's worth, neither version of correlation was what I had in
mind. Statistical correlation between two variables is a single number,
is fairly easy to calculate, and probably wouldn't help query plans
much at all. I'm more interested in a more complex data gathering. The
data model I have in mind (which I note I have *not* proven to actually
help a large number of query plans -- that's obviously an important
part of what I'd need to do in all this) involves instead a matrix of
frequency counts.

Right now our "histogram" values are really quantiles; the
statistics_target T for a column determines a number of quantiles we'll
keep track of, and we grab values from it into an ordered list L so
that approximately 1/T of the entries in that column fall between
values L[n] and L[n+1]. I'm thinking that multicolumn statistics would
instead divide the range of each column up into T equally sized
segments, to form in the two-column case a matrix, where the values of
the matrix are frequency counts -- the number of rows whose values for
each column fall within the particular segments of their respective
ranges represented by the boundaries of that cell in the matrix. I just
realized while writing this that this might not extend to situations
where the two columns are from different tables and don't necessarily
have the same row count, but I'll have to think about that.

Anyway, the size of such a matrix grows as T raised to the number of
columns involved, so cross-column statistics on just a few columns
could easily involve millions of values for fairly normal
statistics_targets. That's where the compression ideas come into play.
This would obviously need a fair bit of testing, but it's certainly
conceivable that modern regression techniques could reduce that
frequency matrix to a set of functions with a small number of
parameters. Whether that would result in values the planner can look up
for a given set of columns without spending more time than it's worth
is another question that will need exploring.

I started this thread knowing that past discussions have posed the
following questions:

1. What sorts of cross-column data can we really use?
2. Can we collect that information?
3. How do we know what columns to track?

For what it's worth, my original question was whether anyone had
concerns beyond these, and I think that has been fairly well answered
in this thread.

- Josh / eggyknap
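Here's a rough sketch of the frequency-count matrix described above, in
Python over a made-up startdate/enddate-style sample. The equal-width
bins, the value of T, and the data are all just for illustration; real
bin bounds would come from whatever ANALYZE ends up using.

    # Divide each column's range into T equal-width segments and count
    # the sampled rows falling in each (segment_a, segment_b) cell.
    def freq_matrix(pairs, t=5):
        amin, amax = min(a for a, _ in pairs), max(a for a, _ in pairs)
        bmin, bmax = min(b for _, b in pairs), max(b for _, b in pairs)
        matrix = [[0] * t for _ in range(t)]
        for a, b in pairs:
            i = min(int((a - amin) / (amax - amin) * t), t - 1)
            j = min(int((b - bmin) / (bmax - bmin) * t), t - 1)
            matrix[i][j] += 1
        return matrix

    # "enddate" is always a little larger than "startdate", so the
    # counts cluster in cells just above the diagonal
    pairs = [(d, d + (d % 5)) for d in range(100)]
    for row in freq_matrix(pairs):
        print(row)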
Re: [HACKERS] Cross-column statistics revisited
"Joshua Tolley" <[EMAIL PROTECTED]> writes: > Most of the comments on this thread have centered around the questions > of "what we'd store" and "how we'd use it", which might be better > phrased as, "The database assumes columns are independent, but we know > that's not always true. Does this cause enough problems to make it > worth fixing? How might we fix it?" I have to admit an inability to > show that it causes problems, Any small amount of trolling in our archives will turn up plenty of examples. It appears to me that a lot of people in this thread are confusing correlation in the sense of statistical correlation between two variables with correlation in the sense of how well physically-ordered a column is. (The latter is actually the same kind of animal, but always taking one of the two variables to be physical position.) A bad estimate for physical-position correlation has only limited impact, as Josh B said upthread; but the other case leads to very bad rowcount estimates which have *huge* impact on plan choices. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cross-column statistics revisited
This is yet another issue entirely. This is about estimating how much
I/O will be random I/O if we do an index order scan. Correlation is a
passable tool for this but we might be able to do better.

But it has nothing to do with the cross-column stats problem.

greg

On 17 Oct 2008, at 01:29 AM, Ron Mayer <[EMAIL PROTECTED]> wrote:

> Josh Berkus wrote:
>>> Yes, or to phrase that another way: What kinds of queries are being
>>> poorly optimized now and why?
>>
>> Well, we have two different correlation problems. One is the problem
>> of dependent correlation, such as the 1.0 correlation of ZIP and CITY
>> fields as a common problem. This could in fact be fixed, I believe,
>> via a linear math calculation based on the sampled level of
>> correlation, assuming we have enough samples. And it's really only an
>> issue if the correlation is > 0.5.
>
> I'd note that this can be an issue even without 2 columns involved.
>
> I've seen a number of tables where the data is loaded in batches so
> similar values from a batch tend to be packed into relatively few
> pages. Think a database for a retailer that nightly aggregates data
> from each of many stores. Each incoming batch inserts the store's data
> into tightly packed disk pages where most all rows on the page are for
> that store. But those pages are interspersed with pages from other
> stores.
>
> I think I like the ideas Greg Stark had a couple years ago:
> http://archives.postgresql.org/pgsql-hackers/2006-09/msg01040.php
>
>   "...sort the sampled values by value and count up the average number
>   of distinct blocks per value
>   Or perhaps we need a second histogram where the quantities are of
>   distinct pages rather than total records
>   We might also need a separate "average number of n-block spans per
>   value" since those seem to me to lead more directly to values like
>   "blocks that need to be read"."
Re: [HACKERS] Cross-column statistics revisited
Josh Berkus wrote:
>> Yes, or to phrase that another way: What kinds of queries are being
>> poorly optimized now and why?
>
> Well, we have two different correlation problems. One is the problem
> of dependent correlation, such as the 1.0 correlation of ZIP and CITY
> fields as a common problem. This could in fact be fixed, I believe,
> via a linear math calculation based on the sampled level of
> correlation, assuming we have enough samples. And it's really only an
> issue if the correlation is > 0.5.

I'd note that this can be an issue even without 2 columns involved.

I've seen a number of tables where the data is loaded in batches so
similar values from a batch tend to be packed into relatively few
pages. Think a database for a retailer that nightly aggregates data
from each of many stores. Each incoming batch inserts the store's data
into tightly packed disk pages where most all rows on the page are for
that store. But those pages are interspersed with pages from other
stores.

I think I like the ideas Greg Stark had a couple years ago:
http://archives.postgresql.org/pgsql-hackers/2006-09/msg01040.php

  "...sort the sampled values by value and count up the average number
  of distinct blocks per value
  Or perhaps we need a second histogram where the quantities are of
  distinct pages rather than total records
  We might also need a separate "average number of n-block spans per
  value" since those seem to me to lead more directly to values like
  "blocks that need to be read"."
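A rough sketch of the statistic from that old post, in Python over an
invented (value, block) sample -- the store names and block numbers are
made up, and this is only the counting step, nothing resembling what
ANALYZE would actually do:

    # For each sampled value, count how many distinct heap blocks it was
    # seen on, then average.  A value whose rows are packed into a few
    # pages scores low even when the table-wide "correlation" statistic
    # says the column looks unordered.
    from collections import defaultdict

    sample = [("store_17", 40), ("store_17", 40), ("store_17", 41),
              ("store_23", 900), ("store_23", 900), ("store_23", 901)]

    blocks_per_value = defaultdict(set)
    for value, block in sample:
        blocks_per_value[value].add(block)

    avg = sum(len(b) for b in blocks_per_value.values()) / len(blocks_per_value)
    print(avg)   # 2.0 distinct blocks per value: index scans stay cheap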
Re: [HACKERS] Cross-column statistics revisited
Correlation is the wrong tool. In fact zip codes and city have nearly
zero correlation. Zip codes near 0 are no more likely to be in cities
starting with A than Z.

Even if you use an appropriate tool I'm not clear what to do with the
information. Consider the case of WHERE city='boston' and zip='02139'
and another query with WHERE city='boston' and zip='90210'. One will
produce many more records than the separate histograms would predict
and the other would produce zero. How do you determine which category a
given pair of constants falls into?

Separately you mention cross-table stats - but that's a whole other
kettle of worms. I'm not sure which is easier but let's do one at a
time?

greg

On 17 Oct 2008, at 12:12 AM, Josh Berkus <[EMAIL PROTECTED]> wrote:

>> Yes, or to phrase that another way: What kinds of queries are being
>> poorly optimized now and why?
>
> Well, we have two different correlation problems. One is the problem
> of dependent correlation, such as the 1.0 correlation of ZIP and CITY
> fields as a common problem. This could in fact be fixed, I believe,
> via a linear math calculation based on the sampled level of
> correlation, assuming we have enough samples. And it's really only an
> issue if the correlation is > 0.5.
>
> The second type of correlation issue we have is correlating values in
> a parent table with *rows* in a child table (i.e. FK joins).
> Currently, the planner assumes that all rows in the child table are
> evenly distributed against keys in the parent table. But many
> real-world databases have this kind of problem:
>
>   A          B
>   1          1 row
>   2          1 row
>   3          1000 rows
>   4 .. 1000  0 to 1 rows
>
> For queries which cover values between 4..1000 on A, the misestimate
> won't be much of a real execution problem. But for values 1, 2, 3, the
> query will bomb.
>
>>> The other half of this is that bad selectivity estimates only matter
>>> if they're bad enough to change the plan, and I'm not sure whether
>>> cases like this are actually a problem in practice.
>>
>> My experience is that any estimate which is more than 5x wrong (i.e.
>> < .2 or > 5.0) usually causes problems, and 3x sometimes causes
>> problems.
>
> -- 
> --Josh
>
> Josh Berkus
> PostgreSQL
> San Francisco
Re: [HACKERS] Cross-column statistics revisited
> Yes, or to phrase that another way: What kinds of queries are being
> poorly optimized now and why?

Well, we have two different correlation problems. One is the problem of
dependent correlation, such as the 1.0 correlation of ZIP and CITY
fields as a common problem. This could in fact be fixed, I believe, via
a linear math calculation based on the sampled level of correlation,
assuming we have enough samples. And it's really only an issue if the
correlation is > 0.5.

The second type of correlation issue we have is correlating values in a
parent table with *rows* in a child table (i.e. FK joins). Currently,
the planner assumes that all rows in the child table are evenly
distributed against keys in the parent table. But many real-world
databases have this kind of problem:

  A          B
  1          1 row
  2          1 row
  3          1000 rows
  4 .. 1000  0 to 1 rows

For queries which cover values between 4..1000 on A, the misestimate
won't be much of a real execution problem. But for values 1, 2, 3, the
query will bomb.

> The other half of this is that bad selectivity estimates only matter
> if they're bad enough to change the plan, and I'm not sure whether
> cases like this are actually a problem in practice.

My experience is that any estimate which is more than 5x wrong (i.e.
< .2 or > 5.0) usually causes problems, and 3x sometimes causes
problems.

-- 
--Josh

Josh Berkus
PostgreSQL
San Francisco
Re: [HACKERS] Cross-column statistics revisited
On Thu, Oct 16, 2008 at 2:54 PM, Josh Berkus <[EMAIL PROTECTED]> wrote:
> Tom,
>
>> (I'm not certain of how to do that efficiently, even if we had the
>> right stats :-()
>
> I was actually talking to someone about this at pgWest. Apparently
> there's a fair amount of academic algorithms devoted to this topic.
> Josh, do you remember who was talking about this?

Actually, it was me :) My biggest concern at the time was finding ways
to compress the correlation data, on the apparently fairly tenuous
assumption that we'd somehow be able to make use of it. As it turns
out, the particular set of algorithms I had in mind don't compress
anything, but there are other methods that do.

Most of the comments on this thread have centered around the questions
of "what we'd store" and "how we'd use it", which might be better
phrased as, "The database assumes columns are independent, but we know
that's not always true. Does this cause enough problems to make it
worth fixing? How might we fix it?" I have to admit an inability to
show that it causes problems, though Neil Conway pointed me to some
literature[1] on the subject I've not yet had time to go through.

My basic idea is that if we have a cross-column frequency count, it
will help us get better plans, but I don't know the internals well
enough to have details beyond that.

- Josh / eggyknap

[1] http://www.cs.umd.edu/~amol/papers/paper-dep.pdf
Re: [HACKERS] Cross-column statistics revisited
Tom,

> (I'm not certain of how to do that efficiently, even if we had the
> right stats :-()

I was actually talking to someone about this at pgWest. Apparently
there's a fair amount of academic algorithms devoted to this topic.
Josh, do you remember who was talking about this?

-- 
--Josh

Josh Berkus
PostgreSQL
San Francisco
Re: [HACKERS] Cross-column statistics revisited
Robert Haas wrote:
>> I think the real question is: what other kinds of correlation might
>> people be interested in representing?
>
> Yes, or to phrase that another way: What kinds of queries are being
> poorly optimized now and why?

The one that affects our largest tables is the case where we have an
address (or other geo-data) clustered by zip, but with other columns
(city, county, state, school-zone, police beat, etc) used in queries.
Postgres considers those unclustered (correlation 0 in the stats),
despite all rows for a given value residing on the same few pages.

I could imagine that this could be handled by either some cross-column
correlation (each zip has only 1-2 cities); or by an enhanced
single-column statistic (even though cities aren't sorted
alphabetically, all rows on a page tend to refer to the same city).
Re: [HACKERS] Cross-column statistics revisited
On Thu, Oct 16, 2008 at 01:34:59PM -0400, Robert Haas wrote:
> I suspect that a lot of the correlations people care about are
> extreme. For example, it's fairly common for me to have a table where
> column B is only used at all for certain values of column A. Like,
> atm_machine_id is usually or always NULL unless transaction_type is
> ATM, or something. So a clause of the form transaction_type = 'ATM'
> and atm_machine_id < 1 looks more selective than it really is
> (because the first half is redundant).

That case is easily handled by simply considering the indexed values of
a column with a partial index. This should be fairly easy to do, I
think.

It might be worthwhile for someone to trawl through the archives
looking for examples where we estimate the correlation wrong.

Have a nice day,
-- 
Martijn van Oosterhout <[EMAIL PROTECTED]>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
Re: [HACKERS] Cross-column statistics revisited
> I think the real question is: what other kinds of correlation might
> people be interested in representing?

Yes, or to phrase that another way: What kinds of queries are being
poorly optimized now and why?

I suspect that a lot of the correlations people care about are extreme.
For example, it's fairly common for me to have a table where column B
is only used at all for certain values of column A. Like,
atm_machine_id is usually or always NULL unless transaction_type is
ATM, or something. So a clause of the form transaction_type = 'ATM' and
atm_machine_id < 1 looks more selective than it really is (because the
first half is redundant).

The other half of this is that bad selectivity estimates only matter if
they're bad enough to change the plan, and I'm not sure whether cases
like this are actually a problem in practice.

...Robert
Re: [HACKERS] Cross-column statistics revisited
[Sorry for top posting - damn phone]

It's pretty straightforward to do a chi-squared test on all the pairs.
But that only tells you that the product is more likely to be wrong. It
doesn't tell you whether it's going to be too high or too low...

greg

On 16 Oct 2008, at 07:20 PM, Tom Lane <[EMAIL PROTECTED]> wrote:

> Martijn van Oosterhout <[EMAIL PROTECTED]> writes:
>> I think you need to go a step back: how are you going to use this
>> data?
>
> The fundamental issue as the planner sees it is not having to assume
> independence of WHERE clauses. For instance, given
>
>     WHERE a < 5 AND b > 10
>
> our current approach is to estimate the fraction of rows with a < 5
> (using stats for a), likewise estimate the fraction with b > 10 (using
> stats for b), and then multiply these fractions together. This is
> correct if a and b are independent, but can be very bad if they
> aren't.
>
> So if we had joint statistics on a and b, we'd want to somehow match
> that up to clauses for a and b and properly derive the joint
> probability. (I'm not certain of how to do that efficiently, even if
> we had the right stats :-()
>
>			regards, tom lane
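For anyone who wants to see what that test looks like, here's a small
Python sketch run on a contingency table of sampled value pairs (toy
data, invented names). As Greg says, a large statistic only flags that
the independence assumption is suspect; it doesn't say which way a
particular pair of constants will be off.

    # Pearson chi-squared statistic comparing observed pair counts with
    # what independence of the two columns would predict.
    from collections import Counter

    pairs = [("boston", "02139")] * 40 + [("denver", "80210")] * 40 + \
            [("boston", "80210")] * 2
    n = len(pairs)

    joint = Counter(pairs)
    a_marg = Counter(a for a, _ in pairs)
    b_marg = Counter(b for _, b in pairs)

    chi2 = 0.0
    for a in a_marg:
        for b in b_marg:
            expected = a_marg[a] * b_marg[b] / n
            observed = joint[(a, b)]
            chi2 += (observed - expected) ** 2 / expected
    print(chi2)   # large: don't trust marginal * marginal for these columns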
Re: [HACKERS] Cross-column statistics revisited
Martijn van Oosterhout <[EMAIL PROTECTED]> writes:
> I think you need to go a step back: how are you going to use this
> data?

The fundamental issue as the planner sees it is not having to assume
independence of WHERE clauses. For instance, given

    WHERE a < 5 AND b > 10

our current approach is to estimate the fraction of rows with a < 5
(using stats for a), likewise estimate the fraction with b > 10 (using
stats for b), and then multiply these fractions together. This is
correct if a and b are independent, but can be very bad if they aren't.

So if we had joint statistics on a and b, we'd want to somehow match
that up to clauses for a and b and properly derive the joint
probability. (I'm not certain of how to do that efficiently, even if we
had the right stats :-()

			regards, tom lane
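A toy illustration of that failure mode, in Python with an invented,
fully dependent pair of columns (nothing here is planner code):

    # When b is a function of a, multiplying the two single-clause
    # fractions gets "a < 5 AND b > 10" badly wrong -- here a 4x
    # overestimate, though dependence can push it either way.
    rows = [(a, a * 3) for a in range(20)]          # b = 3 * a
    n = len(rows)

    frac_a = sum(1 for a, _ in rows if a < 5) / n            # 0.25
    frac_b = sum(1 for _, b in rows if b > 10) / n           # 0.80
    actual = sum(1 for a, b in rows if a < 5 and b > 10) / n # 0.05

    print(frac_a * frac_b, actual)   # estimated 0.20 vs actual 0.05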
Re: [HACKERS] Cross-column statistics revisited
On Wed, Oct 15, 2008 at 04:53:10AM -0600, Joshua Tolley wrote:
> I've been interested in what it would take to start tracking
> cross-column statistics. A review of the mailing lists as linked from
> the TODO item on the subject [1] suggests the following concerns:
>
> 1) What information exactly would be tracked?
> 2) How would it be kept from exploding in size?
> 3) For which combinations of columns would statistics be kept?

I think you need to go a step back: how are you going to use this data?
Whatever structure you choose, the eventual goal is to take a
description of the column pair (a,b) plus a clause like 'a < 5' and be
able to generate an estimate of the distribution of b.

Secondly, people aren't going to ask for multi-column stats on columns
that aren't correlated in some way. So you need to work out what kinds
of correlation people are interested in and see how you can store them.

One potential use case is the (startdate, enddate) columns. Here what
you want to detect somehow is that the distribution of
(enddate - startdate) is constant.

I think the real question is: what other kinds of correlation might
people be interested in representing?

Have a nice day,
-- 
Martijn van Oosterhout <[EMAIL PROTECTED]>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
Re: [HACKERS] Cross-column statistics revisited
On Wed, Oct 15, 2008 at 7:51 AM, Gregory Stark <[EMAIL PROTECTED]> wrote:
> "Joshua Tolley" <[EMAIL PROTECTED]> writes:
>> I've been interested in what it would take to start tracking
>> cross-column statistics. A review of the mailing lists as linked from
>> the TODO item on the subject [1] suggests the following concerns:
>>
>> 1) What information exactly would be tracked?
>> 2) How would it be kept from exploding in size?
>> 3) For which combinations of columns would statistics be kept?
>
> I think then you have
>
> 4) How would we form estimates from these stats

That too, but see below.

>> The major concern in #1 seemed to be that the most suitable form for
>> keeping most common value lists, histograms, etc. is in an array, and
>> at the time of the posts I read, arrays of composite types weren't
>> possible. This seems much less of a concern now -- perhaps in
>> greatest part because a test I just did against a recent 8.4devel
>> sure makes it look like stats on composite type columns aren't even
>> kept. The most straightforward is that we'd keep a simple
>> multi-dimensional histogram, but that leads to a discussion of #2.
>
> "multi-dimensional histogram" isn't such a simple concept, at least
> not to me.
>
> Histograms aren't a bar chart of equal widths and various heights like
> I was taught in school. They're actually bars of various widths
> arranged such that they all have the same height.

That depends on whose definition you've got in mind. Wikipedia's
version (for whatever that's worth) defines it as variable-width
columns of not-necessarily-uniform heights, where the area of each bar
is the frequency. The default behavior of the hist() function in the R
statistics package is uniform bar widths, which generates complaints
from purists.

Semantics aside, it looks like pgsql's stats stuff (which I realize
I'll need to get to know lots better to undertake something like this)
uses a list of quantiles, which still only makes sense, as I see it,
when a distance measurement is available for the data type. The
multidimensional case might do better with a frequency count, where
we'd store a matrix/cube/etc. with uniformly-sized units, probably
compressed via some regression function. That said, a multidimensional
quantile list would also be possible, compressed similarly.

Provided a frequency chart or quantile set, it seems estimates can be
made just as they are in one dimension, by finding the right value in
the quantile or frequency matrix.

> It's not clear how to extend that concept into two dimensions. I
> imagine there's research on this though. What do the GIST statistics
> functions store?

No idea.

- Josh / eggyknap
Re: [HACKERS] Cross-column statistics revisited
"Joshua Tolley" <[EMAIL PROTECTED]> writes: > I've been interested in what it would take to start tracking > cross-column statistics. A review of the mailing lists as linked from > the TODO item on the subject [1] suggests the following concerns: > > 1) What information exactly would be tracked? > 2) How would it be kept from exploding in size? > 3) For which combinations of columns would statistics be kept? I think then you have 4) How would we form estimates from these stats > The major concern in #1 seemed to be that the most suitable form for > keeping most common value lists, histograms, etc. is in an array, and > at the time of the posts I read, arrays of composite types weren't > possible. This seems much less of a concern now -- perhaps in greatest > part because a test I just did against a recent 8.4devel sure makes it > look like stats on composite type columns aren't even kept. The most > straightforward is that we'd keep a simple multi-dimensional > histogram, but that leads to a discussion of #2. "multi-dimensional histogram" isn't such a simple concept, at least not to me. Histograms aren't a bar chart of equal widths and various heights like I was taught in school. They're actually bars of various widths arranged such that they all of the same heights. It's not clear how to extend that concept into two dimensions. I imagine there's research on this though. What do the GIST statistics functions store? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Cross-column statistics revisited
I've been interested in what it would take to start tracking
cross-column statistics. A review of the mailing lists as linked from
the TODO item on the subject [1] suggests the following concerns:

1) What information exactly would be tracked?
2) How would it be kept from exploding in size?
3) For which combinations of columns would statistics be kept?

The major concern in #1 seemed to be that the most suitable form for
keeping most common value lists, histograms, etc. is in an array, and
at the time of the posts I read, arrays of composite types weren't
possible. This seems much less of a concern now -- perhaps in greatest
part because a test I just did against a recent 8.4devel sure makes it
look like stats on composite type columns aren't even kept. The most
straightforward approach is that we'd keep a simple multi-dimensional
histogram, but that leads to a discussion of #2.

#2 is an interesting problem, and possibly the most difficult from a
theoretical perspective. Provided a fair expectation that the results
would be useful, I'd like to play with various forms of statistical
regression to compress large cross-column histograms. But I'm not sure
I want to spend the time if there are other serious issues with the
whole idea of cross-column stats.

Another possibility involves not storing a histogram at all, but
instead storing an array of quantiles for each column. In other words,
if S represents the statistics target of a column, we'd have an array A
of S+1 entries of the type in question, where the beginning and end of
the array are the minimum and maximum values in the column, and the
range between any two consecutive array elements A[n] and A[n+1]
contains 1/S of the total entries in the column. For a uniformly
distributed column, these array elements would be evenly spaced across
the total range of the column; the difference between consecutive array
elements would indicate deviations from this distribution.
Unfortunately this only works if there's a valid distance metric for
the data type in question.

In the archives, #3 was raised frequently as a concern -- obviously
tracking stats on all permutations of the columns in a database is a
non-starter. However there seemed to be little argument over sensible
defaults, namely that columns involved in a multi-column index would
have cross-column statistics kept automatically, and the user would be
allowed to instruct the server to keep stats on other arbitrary groups
of columns. There was some discussion about the fact that the user
would have to be smart about how (s)he used this feature, but there are
enough other potential foot guns in the system of similar magnitude
that worrying too much about that seems of little use.

I bring all this up because I'm interested in looking at how this might
be done, and perhaps doing it if 1) time permits, and 2) someone else
doesn't beat me to it. Comments, anyone? Am I missing something
obvious/serious/etc.? Thanks in advance.

- Josh / eggyknap

[1] "Allow accurate statistics to be collected on indexes with more
    than one column or expression indexes, perhaps using per-index
    statistics", http://wiki.postgresql.org/wiki/Todo