Re: MCV lists for highly skewed distributions

2018-03-22 Thread Dean Rasheed
On 19 March 2018 at 16:59, John Naylor wrote: > On 3/19/18, Dean Rasheed wrote: >> As promised, here is a new patch, with comment updates, per John and >> Tomas' suggestions, plus the continuity correction, which seemed just >> about worthwhile. > > Great. I'm happy with the behavior of the patch

Re: MCV lists for highly skewed distributions

2018-03-19 Thread John Naylor
On 3/19/18, Dean Rasheed wrote: > As promised, here is a new patch, with comment updates, per John and > Tomas' suggestions, plus the continuity correction, which seemed just > about worthwhile. Great. I'm happy with the behavior of the patch. I've marked it ready for committer. > I repeated the

Re: MCV lists for highly skewed distributions

2018-03-19 Thread Dean Rasheed
On 18 March 2018 at 22:52, Dean Rasheed wrote: > I'll post something tomorrow, once I've finished wordsmithing the comments. > As promised, here is a new patch, with comment updates, per John and Tomas' suggestions, plus the continuity correction, which seemed just about worthwhile. Regards, Dea

Re: MCV lists for highly skewed distributions

2018-03-18 Thread Dean Rasheed
On 18 March 2018 at 12:24, John Naylor wrote: > Over the weekend I tried out a test to measure: > -The size of the MCV list > -The ratio between actual and estimated cardinality of the least > common value in the MCV list. > -The ratio between actual and estimated cardinality of the most common >

Re: MCV lists for highly skewed distributions

2018-03-18 Thread John Naylor
I wrote: > Looks good. I'll run some tests of my own this week, trying to find > corner cases different from yours. Over the weekend I tried out a test to measure: -The size of the MCV list -The ratio between actual and estimated cardinality of the least common value in the MCV list. -The ratio b

Re: MCV lists for highly skewed distributions

2018-03-17 Thread Tomas Vondra
On 03/17/2018 08:32 PM, Dean Rasheed wrote: > On 17 March 2018 at 18:40, Tomas Vondra wrote: >> Currently, analyze_mcv_list only checks if the frequency of the >> current item is significantly higher than the non-MCV selectivity. >> My question is if it shouldn't also consider if removing the item

Re: MCV lists for highly skewed distributions

2018-03-17 Thread Dean Rasheed
On 17 March 2018 at 18:40, Tomas Vondra wrote: > Currently, analyze_mcv_list only checks if the frequency of the current > item is significantly higher than the non-MCV selectivity. My question > is if it shouldn't also consider if removing the item from MCV would not > increase the non-MCV select

Re: MCV lists for highly skewed distributions

2018-03-17 Thread Tomas Vondra
On 03/17/2018 07:28 PM, Dean Rasheed wrote: > On 16 March 2018 at 15:26, Tomas Vondra wrote: >> Actually, one question - when deciding whether to keep the item in the >> MCV list, analyze_mcv_list only compares it's frequency with an average >> of the rest. But as we're removing items from the M

Re: MCV lists for highly skewed distributions

2018-03-17 Thread Dean Rasheed
On 16 March 2018 at 15:26, Tomas Vondra wrote: > Actually, one question - when deciding whether to keep the item in the > MCV list, analyze_mcv_list only compares it's frequency with an average > of the rest. But as we're removing items from the MCV list, the average > frequency of the non-MCV ite

Re: MCV lists for highly skewed distributions

2018-03-17 Thread Dean Rasheed
On 17 March 2018 at 17:16, Dean Rasheed wrote: > Using the calculator above, you can see that the distribution is > fairly normal-like, but with a noticeable positive skew. The 2-stddev > interval is 0.6 to 9.4, and according to the calculator the > probability of the value being less than or equa

Re: MCV lists for highly skewed distributions

2018-03-17 Thread Dean Rasheed
On 13 March 2018 at 08:39, John Naylor wrote: >> Also, this is common enough that in fact that distribution >> can be reasonably approximated by a normal distribution. > > For the archives, because it's typically seen 10 times in the sample, > per the rule of thumb mention upthread? > Actually, I

Re: MCV lists for highly skewed distributions

2018-03-17 Thread Dean Rasheed
> On 03/11/2018 10:23 AM, Dean Rasheed wrote: >> I'm moving this back to a status of "Needs review" partly because the >> code has changed significantly, but also because I want to do more >> testing, particularly with larger datasets. >> John, Tomas, Thanks for looking at this, and sorry for my

Re: MCV lists for highly skewed distributions

2018-03-16 Thread Tomas Vondra
On 03/11/2018 10:23 AM, Dean Rasheed wrote: > ... > > I'm moving this back to a status of "Needs review" partly because the > code has changed significantly, but also because I want to do more > testing, particularly with larger datasets. > Thanks for working on this. The code seems fine to me, a

Re: MCV lists for highly skewed distributions

2018-03-13 Thread John Naylor
Hi Dean, I've looked over your patch briefly, but not tested it yet. > [construction of a pathological data set] > So the table has around 31 million rows, and the stats target is maxed > out -- we're sampling around 10% of the table, and it's not possible > to sample more. Looking at the value a

Re: MCV lists for highly skewed distributions

2018-03-11 Thread Dean Rasheed
On 6 March 2018 at 16:48, Dean Rasheed wrote: > On 6 March 2018 at 08:51, John Naylor wrote: >> On 3/5/18, Dean Rasheed wrote: >>> Attached is an updated patch. >> Nice. The results look good. > > Thanks for the review. > So I was about ready to commit this, but decided to do more testing, beca

Re: MCV lists for highly skewed distributions

2018-03-06 Thread Dean Rasheed
On 6 March 2018 at 08:51, John Naylor wrote: > On 3/5/18, Dean Rasheed wrote: >> Attached is an updated patch. > Nice. The results look good. Thanks for the review. > I agree it should be in a separate function. As for that large > comment, I spent some time pouring over it to verify the math

Re: MCV lists for highly skewed distributions

2018-03-06 Thread John Naylor
On 3/5/18, Dean Rasheed wrote: > Attached is an updated patch. I decided to move the code that > determines the minimum number of occurrences for an MCV list item out > to a separate function. It's not much code, but there's a large > comment to explain its statistical justification, and I didn't

Re: MCV lists for highly skewed distributions

2018-03-05 Thread Dean Rasheed
On 7 February 2018 at 15:58, Dean Rasheed wrote: > On 7 February 2018 at 15:25, Robert Haas wrote: >> Do you plan to press forward with this, then, or what's >> the next step? > > I plan to do more testing TL;DR: I tested this new algorithm using 9 different relative standard error cutoffs (10%,

Re: MCV lists for highly skewed distributions

2018-03-01 Thread Dean Rasheed
On 1 March 2018 at 21:01, Andres Freund wrote: > This sounds like the patch's status of "waiting on author" isn't right, > and it should more be ready for committer? > Yes, I'll take a look at it this weekend. Regards, Dean

Re: MCV lists for highly skewed distributions

2018-03-01 Thread Andres Freund
Hi Dean, On 2018-02-07 15:58:14 +, Dean Rasheed wrote: > On 7 February 2018 at 15:25, Robert Haas wrote: > > Do you plan to press forward with this, then, or what's > > the next step? > > > > Yes, I think the results are pretty good so far, especially for the > more non-uniform distributions

Re: MCV lists for highly skewed distributions

2018-02-07 Thread Dean Rasheed
On 7 February 2018 at 15:25, Robert Haas wrote: > Do you plan to press forward with this, then, or what's > the next step? > Yes, I think the results are pretty good so far, especially for the more non-uniform distributions. AFAICS it solves the 2 original complaints, and I've not yet seen a case

Re: MCV lists for highly skewed distributions

2018-02-07 Thread Robert Haas
On Wed, Feb 7, 2018 at 10:20 AM, Dean Rasheed wrote: > One thing this new algorithm does do is improve the user's ability to > get more MCVs by increasing the stats target. I'm not yet convinced > there should be a separate knob for the RSE cutoff. For that to be > useful, there would need to be s

Re: MCV lists for highly skewed distributions

2018-02-07 Thread Dean Rasheed
On 7 February 2018 at 13:29, Robert Haas wrote: > On Wed, Feb 7, 2018 at 3:51 AM, Dean Rasheed wrote: >> Thanks for testing. I agree, this new algorithm seems to stand up >> pretty well in the testing I've done too. One thing about it that can >> be tuned is the cutoff threshold for the relative

Re: MCV lists for highly skewed distributions

2018-02-07 Thread Robert Haas
On Wed, Feb 7, 2018 at 3:51 AM, Dean Rasheed wrote: > Thanks for testing. I agree, this new algorithm seems to stand up > pretty well in the testing I've done too. One thing about it that can > be tuned is the cutoff threshold for the relative standard error -- I > chose 10% completely arbitrarily

Re: MCV lists for highly skewed distributions

2018-02-07 Thread Dean Rasheed
On 4 February 2018 at 12:18, John Naylor wrote: > I did the same basic eyeball testing I did on earlier patches, and > this is the best implementation so far. I've attached some typical > pg_stats contents for HEAD and this patch. More rigorous testing, > including of planner estimates on real dat

Re: MCV lists for highly skewed distributions

2018-02-07 Thread Dean Rasheed
On 1 February 2018 at 17:49, Robert Haas wrote: > One point which I want to emphasize is that the length of the MCV list > bounds the estimated frequency of non-MCVs in two ways: no non-MCV is > ever thought to be more frequent than the least-common MCVs, and > however many non-MCVs we think we ha

Re: MCV lists for highly skewed distributions

2018-02-04 Thread John Naylor
On 2/2/18, Dean Rasheed wrote: > I think it would be better to try to come up with an alternative > algorithm that has a better theoretical basis, and then test that to > see how it holds up in practice. > > With that in mind, attached is a patch based on the idea of setting a > bound on the relat

Re: MCV lists for highly skewed distributions

2018-02-01 Thread Robert Haas
On Thu, Feb 1, 2018 at 12:21 PM, Dean Rasheed wrote: > For a highly skewed distribution, it is possible for there to be > hardly any values (maybe only one) that appears more than 1.25 times > the average frequency, and so lots of otherwise perfectly good common > values get discarded. For such a

Re: MCV lists for highly skewed distributions

2018-02-01 Thread Dean Rasheed
On 1 February 2018 at 13:16, Simon Riggs wrote: > On 25 January 2018 at 22:19, Tom Lane wrote: >> In any case, since it looks like the next step is for someone to come >> up with a new proposal, I'm going to set this to Waiting on Author. > > Dean and John's results show that different algorithms

Re: MCV lists for highly skewed distributions

2018-02-01 Thread Simon Riggs
On 25 January 2018 at 22:19, Tom Lane wrote: > Dean Rasheed writes: >> It occurs to me that maybe a better test to exclude a value from the >> MCV list would be to demand that its relative standard error not be >> too high. Such a test, in addition to the existing tests, might be >> sufficient to

Re: MCV lists for highly skewed distributions

2018-01-25 Thread Tom Lane
Dean Rasheed writes: > It occurs to me that maybe a better test to exclude a value from the > MCV list would be to demand that its relative standard error not be > too high. Such a test, in addition to the existing tests, might be > sufficient to solve the opposite problem of too many values in th

Re: stricter MCV tests for uniform distributions (was Re: MCV lists for highly skewed distributions)

2018-01-22 Thread Dean Rasheed
On 22 January 2018 at 08:07, John Naylor wrote: > On 1/21/18, Dean Rasheed wrote: >> It occurs to me that maybe a better test to exclude a value from the >> MCV list would be to demand that its relative standard error not be >> too high. Such a test, in addition to the existing tests, might be >>

stricter MCV tests for uniform distributions (was Re: MCV lists for highly skewed distributions)

2018-01-22 Thread John Naylor
(Starting a new thread so as not to distract review) On 1/21/18, Dean Rasheed wrote: > On 21 January 2018 at 07:26, John Naylor wrote: >> I spent a few hours hacking on this, and it turns out calculating the >> right number of MCVs taking into account both uniform and highly >> non-uniform distr

Re: MCV lists for highly skewed distributions

2018-01-21 Thread Dean Rasheed
On 21 January 2018 at 07:26, John Naylor wrote: > I spent a few hours hacking on this, and it turns out calculating the > right number of MCVs taking into account both uniform and highly > non-uniform distributions is too delicate a problem for me to solve > right now. The logic suggested by Dean

Re: MCV lists for highly skewed distributions

2018-01-20 Thread John Naylor
I wrote: >> I have a slight reservaton about whether 1.25x is still a sensible >> heuristic. > > This was also discussed in [1], but no patch came out of it. I was > just now turning the formulas discussed there into code, but I'll > defer to someone with more expertise. FWIW, I suspect that a sol

Re: MCV lists for highly skewed distributions

2018-01-19 Thread John Naylor
I wrote: > FWIW, I suspect that a solution > that doesn't take into account a metric like coefficient of variation > will have the wrong behavior sometimes, whether for highly uniform or > highly non-uniform distributions. By this I meant the coefficient of variation of the class size in the sampl

Re: MCV lists for highly skewed distributions

2018-01-19 Thread John Naylor
>> [1] Occasionally it will store a much longer MCV list, because no values >> was >> sampled exactly once, which triggers a different code path in which all >> seen >> values are put in the MCV and the histogram is NULL. This is not >> reliable, >> as whether the least-sample value is present in

Re: MCV lists for highly skewed distributions

2018-01-18 Thread Simon Riggs
On 28 December 2017 at 01:45, Jeff Janes wrote: > If we stored just a few more values, their inclusion in the MCV would mean > they are depleted from the residual count, correctly lowering the estimate > we would get for very rare values not included in the sample. I agree with this thought. >

Re: MCV lists for highly skewed distributions

2018-01-07 Thread John Naylor
I wrote: > I'll be travelling for a few days, but I'll do some testing on some > data sets soon. I've attached some preliminary test methods and results. Probably not detailed or rigorous enough, but I wanted to share this much before digging deeper. I created some test data, ran analyze a few ti

Re: MCV lists for highly skewed distributions

2017-12-31 Thread John Naylor
I wrote: > On 12/28/17, Jeff Janes wrote: >> I think that perhaps maxmincount should also use the dynamic >> values_cnt_remaining rather than the static one. After all, things >> included in the MCV don' get represented in the histogram. When I've >> seen >> planning problems due to skewed dist

Re: MCV lists for highly skewed distributions

2017-12-29 Thread John Naylor
On 12/28/17, Jeff Janes wrote: > I want to revive a patch I sent couple years ago to the performance list, > as I have seen the issue pop up repeatedly since then. > If we stored just a few more values, their inclusion in the MCV would mean > they are depleted from the residual count, correctly

MCV lists for highly skewed distributions

2017-12-27 Thread Jeff Janes
I want to revive a patch I sent couple years ago to the performance list, as I have seen the issue pop up repeatedly since then. Original thread is here: https://www.postgresql.org/message-id/CAK_s-G2tc8r0N3AqPr8fW5QsRQMbZNurgAQ%3D_ME1aaf4vOmnnA%40mail.gmail.com The problem is that if you have a