On 19 March 2018 at 16:59, John Naylor wrote:
> On 3/19/18, Dean Rasheed wrote:
>> As promised, here is a new patch, with comment updates, per John and
>> Tomas' suggestions, plus the continuity correction, which seemed just
>> about worthwhile.
>
> Great. I'm happy with the behavior of the patch
On 3/19/18, Dean Rasheed wrote:
> As promised, here is a new patch, with comment updates, per John and
> Tomas' suggestions, plus the continuity correction, which seemed just
> about worthwhile.
Great. I'm happy with the behavior of the patch. I've marked it ready
for committer.
> I repeated the
On 18 March 2018 at 22:52, Dean Rasheed wrote:
> I'll post something tomorrow, once I've finished wordsmithing the comments.
>
As promised, here is a new patch, with comment updates, per John and
Tomas' suggestions, plus the continuity correction, which seemed just
about worthwhile.
Regards,
Dean
On 18 March 2018 at 12:24, John Naylor wrote:
> Over the weekend I tried out a test to measure:
> -The size of the MCV list
> -The ratio between actual and estimated cardinality of the least
> common value in the MCV list.
> -The ratio between actual and estimated cardinality of the most common
>
I wrote:
> Looks good. I'll run some tests of my own this week, trying to find
> corner cases different from yours.
Over the weekend I tried out a test to measure:
-The size of the MCV list
-The ratio between actual and estimated cardinality of the least
common value in the MCV list.
-The ratio b
On 03/17/2018 08:32 PM, Dean Rasheed wrote:
> On 17 March 2018 at 18:40, Tomas Vondra wrote:
>> Currently, analyze_mcv_list only checks if the frequency of the
>> current item is significantly higher than the non-MCV selectivity.
>> My question is if it shouldn't also consider if removing the item
On 17 March 2018 at 18:40, Tomas Vondra wrote:
> Currently, analyze_mcv_list only checks if the frequency of the current
> item is significantly higher than the non-MCV selectivity. My question
> is if it shouldn't also consider if removing the item from MCV would not
> increase the non-MCV select
On 03/17/2018 07:28 PM, Dean Rasheed wrote:
> On 16 March 2018 at 15:26, Tomas Vondra wrote:
>> Actually, one question - when deciding whether to keep the item in the
>> MCV list, analyze_mcv_list only compares its frequency with an average
>> of the rest. But as we're removing items from the M
On 16 March 2018 at 15:26, Tomas Vondra wrote:
> Actually, one question - when deciding whether to keep the item in the
> MCV list, analyze_mcv_list only compares its frequency with an average
> of the rest. But as we're removing items from the MCV list, the average
> frequency of the non-MCV ite
On 17 March 2018 at 17:16, Dean Rasheed wrote:
> Using the calculator above, you can see that the distribution is
> fairly normal-like, but with a noticeable positive skew. The 2-stddev
> interval is 0.6 to 9.4, and according to the calculator the
> probability of the value being less than or equa
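The arithmetic in this entry can be reproduced with a short sketch. The mean of 5 and standard deviation of 2.2 are inferred from the quoted 2-stddev interval of 0.6 to 9.4; the exact probability being discussed is cut off above, so the lower-tail value computed here is only illustrative:

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu, sigma = 5.0, 2.2               # inferred from the 2-stddev interval [0.6, 9.4]
lo, hi = mu - 2 * sigma, mu + 2 * sigma
print(lo, hi)                      # 0.6 9.4
print(normal_cdf(lo, mu, sigma))   # ~0.0228: probability of falling below the interval
```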
On 13 March 2018 at 08:39, John Naylor wrote:
>> Also, this is common enough that, in fact, the distribution
>> can be reasonably approximated by a normal distribution.
>
> For the archives, because it's typically seen 10 times in the sample,
> per the rule of thumb mentioned upthread?
>
Actually, I
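A quick numerical check of that rule of thumb (an illustrative binomial model, not taken from the thread): once a value's expected sample count reaches about 10, the normal curve already tracks the exact binomial probabilities closely.

```python
import math

n, p = 10000, 0.001     # sample size and value frequency; expected count n*p = 10
mu = n * p
sigma = math.sqrt(n * p * (1 - p))

def binom_pmf(k):
    """Exact binomial probability of seeing the value k times in the sample."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x):
    """Normal approximation with matching mean and variance."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

for k in (5, 10, 15):
    print(k, binom_pmf(k), normal_pdf(k))   # agree to roughly two decimal places
```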
> On 03/11/2018 10:23 AM, Dean Rasheed wrote:
>> I'm moving this back to a status of "Needs review" partly because the
>> code has changed significantly, but also because I want to do more
>> testing, particularly with larger datasets.
>>
John, Tomas,
Thanks for looking at this, and sorry for my
On 03/11/2018 10:23 AM, Dean Rasheed wrote:
> ...
>
> I'm moving this back to a status of "Needs review" partly because the
> code has changed significantly, but also because I want to do more
> testing, particularly with larger datasets.
>
Thanks for working on this. The code seems fine to me, a
Hi Dean,
I've looked over your patch briefly, but not tested it yet.
> [construction of a pathological data set]
> So the table has around 31 million rows, and the stats target is maxed
> out -- we're sampling around 10% of the table, and it's not possible
> to sample more. Looking at the value a
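For context on the sampling figures above: ANALYZE samples 300 rows per unit of statistics target (the Chaudhuri et al. heuristic used by std_typanalyze), so the maximum target of 10000 yields a 3-million-row sample, which is roughly 10% of a 31-million-row table:

```python
# Sample-size arithmetic behind "sampling around 10% of the table".
stats_target = 10000                  # maximum allowed statistics target
sample_rows = 300 * stats_target      # ANALYZE's sample-size formula
table_rows = 31_000_000
print(sample_rows)                    # 3000000
print(sample_rows / table_rows)       # ~0.097, i.e. about 10% of the table
```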
On 6 March 2018 at 16:48, Dean Rasheed wrote:
> On 6 March 2018 at 08:51, John Naylor wrote:
>> On 3/5/18, Dean Rasheed wrote:
>>> Attached is an updated patch.
>> Nice. The results look good.
>
> Thanks for the review.
>
So I was about ready to commit this, but decided to do more testing,
beca
On 6 March 2018 at 08:51, John Naylor wrote:
> On 3/5/18, Dean Rasheed wrote:
>> Attached is an updated patch.
> Nice. The results look good.
Thanks for the review.
> I agree it should be in a separate function. As for that large
> comment, I spent some time pouring over it to verify the math
On 3/5/18, Dean Rasheed wrote:
> Attached is an updated patch. I decided to move the code that
> determines the minimum number of occurrences for an MCV list item out
> to a separate function. It's not much code, but there's a large
> comment to explain its statistical justification, and I didn't
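The statistical idea being referred to can be sketched as follows. This is a simplified model, not the patch's actual code: it treats the sample count as hypergeometric (sampling without replacement) and keeps a value only while the relative standard error of its count stays under a cutoff; the 10% figure is the arbitrary starting point mentioned later in the thread.

```python
import math

def rse(count, sample_rows, total_rows):
    """Relative standard error of an MCV item's sample count,
    under a hypergeometric sampling model with finite-population correction."""
    p = count / sample_rows
    fpc = (total_rows - sample_rows) / (total_rows - 1)
    stderr = math.sqrt(count * (1 - p) * fpc)
    return stderr / count

def keep_in_mcv(count, sample_rows, total_rows, cutoff=0.10):
    """Keep a value only if its frequency estimate is reasonably precise."""
    return rse(count, sample_rows, total_rows) <= cutoff
```

For example, with a 30000-row sample from a million-row table, a value seen 100 times has an RSE just under 10% and is kept, while one seen 10 times has an RSE around 31% and is dropped.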
On 7 February 2018 at 15:58, Dean Rasheed wrote:
> On 7 February 2018 at 15:25, Robert Haas wrote:
>> Do you plan to press forward with this, then, or what's
>> the next step?
>
> I plan to do more testing
TL;DR: I tested this new algorithm using 9 different relative standard
error cutoffs (10%,
On 1 March 2018 at 21:01, Andres Freund wrote:
> This sounds like the patch's status of "waiting on author" isn't right,
> and it should more be ready for committer?
>
Yes, I'll take a look at it this weekend.
Regards,
Dean
Hi Dean,
On 2018-02-07 15:58:14 +0000, Dean Rasheed wrote:
> On 7 February 2018 at 15:25, Robert Haas wrote:
> > Do you plan to press forward with this, then, or what's
> > the next step?
> >
>
> Yes, I think the results are pretty good so far, especially for the
> more non-uniform distributions
On 7 February 2018 at 15:25, Robert Haas wrote:
> Do you plan to press forward with this, then, or what's
> the next step?
>
Yes, I think the results are pretty good so far, especially for the
more non-uniform distributions. AFAICS it solves the 2 original
complaints, and I've not yet seen a case
On Wed, Feb 7, 2018 at 10:20 AM, Dean Rasheed wrote:
> One thing this new algorithm does do is improve the user's ability to
> get more MCVs by increasing the stats target. I'm not yet convinced
> there should be a separate knob for the RSE cutoff. For that to be
> useful, there would need to be s
On 7 February 2018 at 13:29, Robert Haas wrote:
> On Wed, Feb 7, 2018 at 3:51 AM, Dean Rasheed wrote:
>> Thanks for testing. I agree, this new algorithm seems to stand up
>> pretty well in the testing I've done too. One thing about it that can
>> be tuned is the cutoff threshold for the relative
On Wed, Feb 7, 2018 at 3:51 AM, Dean Rasheed wrote:
> Thanks for testing. I agree, this new algorithm seems to stand up
> pretty well in the testing I've done too. One thing about it that can
> be tuned is the cutoff threshold for the relative standard error -- I
> chose 10% completely arbitrarily
On 4 February 2018 at 12:18, John Naylor wrote:
> I did the same basic eyeball testing I did on earlier patches, and
> this is the best implementation so far. I've attached some typical
> pg_stats contents for HEAD and this patch. More rigorous testing,
> including of planner estimates on real dat
On 1 February 2018 at 17:49, Robert Haas wrote:
> One point which I want to emphasize is that the length of the MCV list
> bounds the estimated frequency of non-MCVs in two ways: no non-MCV is
> ever thought to be more frequent than the least-common MCVs, and
> however many non-MCVs we think we ha
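The two bounds described above can be made concrete with a small sketch, simplified from the non-MCV branch of var_eq_const in selfuncs.c (NULL fractions and histogram interaction are ignored here):

```python
def non_mcv_eq_selectivity(mcv_freqs, ndistinct):
    """Estimate selectivity of `col = const` when const is not in the MCV list.
    Simplified from PostgreSQL's var_eq_const; ignores NULLs."""
    sum_mcv = sum(mcv_freqs)
    other_distinct = ndistinct - len(mcv_freqs)
    # spread the residual frequency evenly over the remaining distinct values
    selec = (1.0 - sum_mcv) / other_distinct
    # a non-MCV is never estimated as more frequent than the least-common MCV
    return min(selec, min(mcv_freqs))
```

The first bound comes from dividing the residual frequency among the remaining distinct values; the second is the clamp to the least-common MCV's frequency, which is why a longer MCV list constrains non-MCV estimates in both ways.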
On 2/2/18, Dean Rasheed wrote:
> I think it would be better to try to come up with an alternative
> algorithm that has a better theoretical basis, and then test that to
> see how it holds up in practice.
>
> With that in mind, attached is a patch based on the idea of setting a
> bound on the relat
On Thu, Feb 1, 2018 at 12:21 PM, Dean Rasheed wrote:
> For a highly skewed distribution, it is possible for there to be
> hardly any values (maybe only one) that appears more than 1.25 times
> the average frequency, and so lots of otherwise perfectly good common
> values get discarded. For such a
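The failure mode described here is easy to demonstrate with the old-style heuristic (a simplified stand-in for the pre-patch analyze.c test, not its exact code):

```python
def keep_by_old_heuristic(count, sample_rows, ndistinct):
    """Old-style test (simplified): keep a value only if it appears
    at least 1.25x as often as the average value would."""
    avg = sample_rows / ndistinct
    return count >= 1.25 * avg and count >= 2

# Skewed sample: one dominant value plus many moderately common ones
counts = [5000] + [50] * 100          # 10000 rows, 101 distinct values
kept = [c for c in counts if keep_by_old_heuristic(c, 10000, 101)]
print(kept)                           # only the single dominant value survives
```

Only the dominant value clears the 1.25x bar (the average is ~99, so the cutoff is ~124), even though the 50-occurrence values are perfectly good MCV candidates.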
On 1 February 2018 at 13:16, Simon Riggs wrote:
> On 25 January 2018 at 22:19, Tom Lane wrote:
>> In any case, since it looks like the next step is for someone to come
>> up with a new proposal, I'm going to set this to Waiting on Author.
>
> Dean and John's results show that different algorithms
On 25 January 2018 at 22:19, Tom Lane wrote:
> Dean Rasheed writes:
>> It occurs to me that maybe a better test to exclude a value from the
>> MCV list would be to demand that its relative standard error not be
>> too high. Such a test, in addition to the existing tests, might be
>> sufficient to
Dean Rasheed writes:
> It occurs to me that maybe a better test to exclude a value from the
> MCV list would be to demand that its relative standard error not be
> too high. Such a test, in addition to the existing tests, might be
> sufficient to solve the opposite problem of too many values in th
On 22 January 2018 at 08:07, John Naylor wrote:
> On 1/21/18, Dean Rasheed wrote:
>> It occurs to me that maybe a better test to exclude a value from the
>> MCV list would be to demand that its relative standard error not be
>> too high. Such a test, in addition to the existing tests, might be
>>
(Starting a new thread so as not to distract from review)
On 1/21/18, Dean Rasheed wrote:
> On 21 January 2018 at 07:26, John Naylor wrote:
>> I spent a few hours hacking on this, and it turns out calculating the
>> right number of MCVs taking into account both uniform and highly
>> non-uniform distr
On 21 January 2018 at 07:26, John Naylor wrote:
> I spent a few hours hacking on this, and it turns out calculating the
> right number of MCVs taking into account both uniform and highly
> non-uniform distributions is too delicate a problem for me to solve
> right now. The logic suggested by Dean
I wrote:
>> I have a slight reservaton about whether 1.25x is still a sensible
>> heuristic.
>
> This was also discussed in [1], but no patch came out of it. I was
> just now turning the formulas discussed there into code, but I'll
> defer to someone with more expertise. FWIW, I suspect that a sol
I wrote:
> FWIW, I suspect that a solution
> that doesn't take into account a metric like coefficient of variation
> will have the wrong behavior sometimes, whether for highly uniform or
> highly non-uniform distributions.
By this I meant the coefficient of variation of the class size in the
sampl
>> [1] Occasionally it will store a much longer MCV list, because no value
>> was sampled exactly once, which triggers a different code path in which
>> all seen values are put in the MCV list and the histogram is NULL. This
>> is not reliable, as whether the least-sampled value is present in
On 28 December 2017 at 01:45, Jeff Janes wrote:
> If we stored just a few more values, their inclusion in the MCV would mean
> they are depleted from the residual count, correctly lowering the estimate
> we would get for very rare values not included in the sample.
I agree with this thought.
>
I wrote:
> I'll be travelling for a few days, but I'll do some testing on some
> data sets soon.
I've attached some preliminary test methods and results. Probably not
detailed or rigorous enough, but I wanted to share this much before
digging deeper. I created some test data, ran analyze a few ti
I wrote:
> On 12/28/17, Jeff Janes wrote:
>> I think that perhaps maxmincount should also use the dynamic
>> values_cnt_remaining rather than the static one. After all, things
>> included in the MCV don't get represented in the histogram. When I've
>> seen planning problems due to skewed dist
On 12/28/17, Jeff Janes wrote:
> I want to revive a patch I sent a couple of years ago to the performance list,
> as I have seen the issue pop up repeatedly since then.
> If we stored just a few more values, their inclusion in the MCV would mean
> they are depleted from the residual count, correctly
I want to revive a patch I sent a couple of years ago to the performance list,
as I have seen the issue pop up repeatedly since then.
Original thread is here:
https://www.postgresql.org/message-id/CAK_s-G2tc8r0N3AqPr8fW5QsRQMbZNurgAQ%3D_ME1aaf4vOmnnA%40mail.gmail.com
The problem is that if you have a