### Re: MCV lists for highly skewed distributions

On 19 March 2018 at 16:59, John Naylorwrote: > On 3/19/18, Dean Rasheed wrote: >> As promised, here is a new patch, with comment updates, per John and >> Tomas' suggestions, plus the continuity correction, which seemed just >> about worthwhile. > > Great. I'm happy with the behavior of the patch. I've marked it ready > for committer. > Thanks for testing. I also did more testing with tables 10-20 times as large and I was happy with the results, so I have pushed this. Regards, Dean

### Re: MCV lists for highly skewed distributions

On 3/19/18, Dean Rasheedwrote: > As promised, here is a new patch, with comment updates, per John and > Tomas' suggestions, plus the continuity correction, which seemed just > about worthwhile. Great. I'm happy with the behavior of the patch. I've marked it ready for committer. > I repeated these tests with a 2-SD continuity-corrected threshold and > the results fell somewhere between the 2-SD and 2.5-SD cases. My > inclination is to go with that, as something that strikes a reasonable > balance, and represents a distinct improvement over master in a number > of different areas. I also ran some tests with a hand-edited continuity correction last night, but just now got around to looking at the results (queries attached). I ran largely as before, but did 20 analyze runs with each file instead of 10, using the V2 patch, the CC patch, and the V2 patch with 2.5 threshold. I took out a bunch of the uniform tests to save time since they largely behave the same. - Non-uniform: To reduce noise for analysis, I tried filtering out the least and greatest distinct value before taking the average for the underestimation ratio for most common non-MCV. I also removed results where all 3 patches had a full MCV list. There was still enough noise that it's impossible to discern differences between patches that are very similar. It's not like against HEAD where there were clear differences in this test, so I won't post those numbers. Looking at the number of MCVs, it's a little clearer. With few exceptions, the number either stays the same or decreases slightly going left to right: title | v20_num_mcvs | cc05_num_mcvs | v25_num_mcvs --+--+---+-- Exp. dist. (N=500k, sd=0.25 (beta)) |3 | 3 |3 Exp. dist. (N=500k, sd=0.50 (beta)) |5 | 6 |5 Exp. dist. (N=500k, sd=1.00 (beta)) |9 | 9 |9 Exp. dist. (N=500k, sd=2.00 (beta)) | 15 | 15 | 15 Exp. dist. (N=500k, sd=3.00 (beta)) | 22 | 21 | 21 Exp. dist. (N=500k, sd=4.00 (beta)) | 27 | 27 | 26 Exp. dist. (N=500k, sd=5.00 (beta)) | 33 | 32 | 30 Exp. dist. (N=500k, sd=10.00 (beta)) | 60 | 58 | 56 Exp. dist. (N=500k, sd=20.00 (beta)) | 100 | 99 | 97 Laplace dist. (N=500k, b=0.25 (lambda)) |5 | 6 |5 Laplace dist. (N=500k, b=0.50 (lambda)) |9 | 8 |7 Laplace dist. (N=500k, b=1.00 (lambda)) | 15 | 15 | 15 Laplace dist. (N=500k, b=2.00 (lambda)) | 27 | 26 | 26 Laplace dist. (N=500k, b=3.00 (lambda)) | 38 | 37 | 36 Laplace dist. (N=500k, b=4.00 (lambda)) | 48 | 47 | 45 Laplace dist. (N=500k, b=5.00 (lambda)) | 58 | 57 | 55 Laplace dist. (N=500k, b=10.00 (lambda)) | 100 | 99 | 97 MNM (N=2000, peaks=20, sep=20, sd=15)| 100 | 100 | 62 Normal dist. (N=500k, sd=1) |7 | 7 |7 Normal dist. (N=500k, sd=2) | 15 | 15 | 14 Normal dist. (N=500k, sd=3) | 21 | 21 | 20 Normal dist. (N=500k, sd=4) | 28 | 26 | 26 Normal dist. (N=500k, sd=5) | 33 | 33 | 33 Normal dist. (N=500k, sd=6) | 39 | 38 | 38 Normal dist. (N=500k, sd=7) | 45 | 45 | 44 Normal dist. (N=500k, sd=8) | 51 | 49 | 49 Normal dist. (N=500k, sd=9) | 56 | 56 | 54 Normal dist. (N=500k, sd=10) | 63 | 62 | 60 Normal dist. (N=500k, sd=15) | 89 | 89 | 86 Pareto dist. (N=500, a=1.00) | 70 | 65 | 56 Pareto dist. (N=500, a=2.00) | 20 | 19 | 17 Pareto dist. (N=500, a=3.00) | 10 | 9 |9 Pareto dist. (N=500, a=4.00) |6 | 6 |6 Pareto dist. (N=500, a=5.00) |5 | 5 |4 (34 rows) - Uniform: Ideally, we want an empty MCV list unless we're sampling a large portion of the table. As Dean mentioned, this is not as important as getting the non-uniform case right. That said, it's nice to be confident we can keep out flotsam. The number generally gets smaller as we go left to right. title | v20_num_mcvs | cc05_num_mcvs | v25_num_mcvs ---+--+---+-- Corr. unif. dist. (N=1000k, Nd=1000) | 13 |

### Re: MCV lists for highly skewed distributions

On 18 March 2018 at 22:52, Dean Rasheedwrote: > I'll post something tomorrow, once I've finished wordsmithing the comments. > As promised, here is a new patch, with comment updates, per John and Tomas' suggestions, plus the continuity correction, which seemed just about worthwhile. Regards, Dean mcv-stats-v2cc.patch Description: Binary data

### Re: MCV lists for highly skewed distributions

On 18 March 2018 at 12:24, John Naylorwrote: > Over the weekend I tried out a test to measure: > -The size of the MCV list > -The ratio between actual and estimated cardinality of the least > common value in the MCV list. > -The ratio between actual and estimated cardinality of the most common > value not in the MCV list. Nice tests! > Summary for non-uniform distributions: > Note: There's a lot of variation in the estimation of the most common > non-MCV. About 1% of all ANALYZE runs apparently failed to sample one > of the good candidates for the MCV list, resulting in huge > underestimates. It didn't matter what patch was applied. For future > tests, I'll do more runs and measure a truncated mean. Looking at the > number of MCVs is much more stable, but unlike the uniform > distribution case, we don't have an intuitive feel for what that > number should be. That said, > > -The V2 patch is somewhat better than RSE and much better than master > at estimating the most common value not in the MCV list. Yes, that matches my observations too. It stems from the fact that the V2 patch tends to produce more MCVs than the RSE patch (which in turn tends to produce more MCVs than master) for non-uniform distributions. More MCVs then improve the estimates for non-MCVs. > -Bumping the sample stddev threshold to 3 seems to behave similarly to > the RSE patch, maybe slightly better. In a way, that's probably not surprising. The two algorithms are closely related. If the non-MCV frequencies are quite low, the v2 patch with a 3-SD threshold behaves like the RSE patch with a 33% RSE cutoff. The 20% RSE cutoff patch is mathematically equivalent to the v2 patch with a 5-SD threshold and the non-MCV frequencies all set to zero, if that makes sense. > Summary for uniform distributions: > Note 1: Using the average is not really adequate for testing > under/overestimation of values in my test setup, since I set the > estimation ratio to zero if there was either an empty MCV list or a > completely full list. In future runs, I'll drill down a bit more > precisely. > That said, there didn't seem to be a huge amount variation between the > patches as far as either of the ratios I was testing. Looking at the > number of MCVs is much better anyway, since we know we want either > none or everything. > > Note 2: All patches fully populated the list when all the values fit > in the MCV list. For the rest of the cases: > > -The RSE patch is not much better than master at excluding MCVs. To make sense of that, you need to look at the errors in the estimates. The RSE patch won't exclude MCVs if it thinks they will give reasonably accurate estimates, so may produce fully populated MCV lists for uniform distributions even when all the distinct values don't fit. That's not ideal, but it isn't necessarily bad. In my previous testing I found that it was good at excluding MCVs in cases where they gave inaccurate estimates. > -The V2 patch is much better than either of these, but still not ideal > (up to 30) > -The V2 patch with a cutoff of 2.5 severely cut down the MCVs (at most 3). > -With a cutoff of 3.0, all are empty. Makes sense, but I don't think we should necessarily put too much emphasis on the uniform tests. Real-world datasets tend to be non-uniform, and the consequences of too few MCVs tend to be worse than too many. I repeated these tests with a 2-SD continuity-corrected threshold and the results fell somewhere between the 2-SD and 2.5-SD cases. My inclination is to go with that, as something that strikes a reasonable balance, and represents a distinct improvement over master in a number of different areas. I'll post something tomorrow, once I've finished wordsmithing the comments. Regards, Dean

### Re: MCV lists for highly skewed distributions

I wrote: > Looks good. I'll run some tests of my own this week, trying to find > corner cases different from yours. Over the weekend I tried out a test to measure: -The size of the MCV list -The ratio between actual and estimated cardinality of the least common value in the MCV list. -The ratio between actual and estimated cardinality of the most common value not in the MCV list. The script, results, and some starter queries are attached. I ran a bunch more queries to drill down in certain areas, but they're all based on the ones here. I ran against master, the RSE patch, and 3 different variations of the V2 patch (with stddev cutoff of 2, 2.5, and 3.0). For each file, I loaded into each DB, ran ANALYZE 10 times, and took the average for each of the three metrics above. I only tested with the default stats target. For this test, I used the same distributions as Dean's original script, but I whacked around the script to enable inserting results into a table. -- Summary for non-uniform distributions: Note: There's a lot of variation in the estimation of the most common non-MCV. About 1% of all ANALYZE runs apparently failed to sample one of the good candidates for the MCV list, resulting in huge underestimates. It didn't matter what patch was applied. For future tests, I'll do more runs and measure a truncated mean. Looking at the number of MCVs is much more stable, but unlike the uniform distribution case, we don't have an intuitive feel for what that number should be. That said, -The V2 patch is somewhat better than RSE and much better than master at estimating the most common value not in the MCV list. -Bumping the sample stddev threshold to 3 seems to behave similarly to the RSE patch, maybe slightly better. -- Summary for uniform distributions: Note 1: Using the average is not really adequate for testing under/overestimation of values in my test setup, since I set the estimation ratio to zero if there was either an empty MCV list or a completely full list. In future runs, I'll drill down a bit more precisely. That said, there didn't seem to be a huge amount variation between the patches as far as either of the ratios I was testing. Looking at the number of MCVs is much better anyway, since we know we want either none or everything. Note 2: All patches fully populated the list when all the values fit in the MCV list. For the rest of the cases: -The RSE patch is not much better than master at excluding MCVs. -The V2 patch is much better than either of these, but still not ideal (up to 30) -The V2 patch with a cutoff of 2.5 severely cut down the MCVs (at most 3). -With a cutoff of 3.0, all are empty. -- Here's one interesting case. It's a normal distribution, but highly dispersed (N=50, sd=1000), so it resembles a uniform one. Looking at the number of MCVs, it jumps around a lot between patches, but the estimates don't vary much: Master: 100 RSE: 1 V2: 100 V2 with 2.5 cutoff: 100 V2 with 3.0 cutoff: 32 -- Thoughts and future testing: I think the V2 patch strikes a good balance. I'm partial to the V2 patch with the 2.5 cutoff in sample standard deviation, since it's much more aggressive about excluding values for uniform distributions. It's almost as good as V2 at including values in non-uniform distributions, but possibly worse enough that it's not the best choice. Time is running out, but some things I'd like to look at: -As mentioned above, better noise reduction in my data analysis. -The continuity-corrected Wald interval Dean mentioned [1] -I wonder if we can actually untie the max size of the MCV list from the stats target, and use a hard-coded maximum like 1000. That might allow us to bump the max stat target without hurting planning time. Perhaps next cycle. -John Naylor [1] https://www.postgresql.org/message-id/CAEZATCVHEEg%2BCrP%2B-0JUsVeNPu5rs_S23oJVeH4VF%3DfgDwhfLQ%40mail.gmail.com #!/usr/bin/python import datetime import math import numpy import psycopg2 import random import re import sys HEAD_PORT = 5430 RSE_PORT = 5431 SSD_20_PORT = 5432 SSD_25_PORT = 5425 SSD_30_PORT = 5433 TEST_DATA_FILE = '/tmp/mcv-test-data' NUMBER_OF_ANAYLYSE_TESTS = 10 conn0 = psycopg2.connect('port=%d dbname=john user=john' % HEAD_PORT) conn0.set_session(autocommit = True) conn1 = psycopg2.connect('port=%d dbname=john user=john' % RSE_PORT) conn1.set_session(autocommit = True) conn20 = psycopg2.connect('port=%d dbname=john user=john' % SSD_20_PORT) conn20.set_session(autocommit = True) conn25 = psycopg2.connect('port=%d dbname=john user=john' % SSD_25_PORT) conn25.set_session(autocommit = True) conn30 = psycopg2.connect('port=%d dbname=john user=john' % SSD_30_PORT) conn30.set_session(autocommit = True) branches = {} branches['master'] = conn0.cursor() branches['RSE'] = conn1.cursor() branches['V2_20'] = conn20.cursor() branches['V2_25'] = conn25.cursor() branches['V2_30'] = conn30.cursor() # Store test results in the master DB. result_store = branches['master'] # Create one results

### Re: MCV lists for highly skewed distributions

On 03/17/2018 08:32 PM, Dean Rasheed wrote: > On 17 March 2018 at 18:40, Tomas Vondrawrote: >> Currently, analyze_mcv_list only checks if the frequency of the >> current item is significantly higher than the non-MCV selectivity. >> My question is if it shouldn't also consider if removing the item >> from MCV would not increase the non-MCV selectivity too much. >> > > Oh, I see what you're saying. In theory, each MCV item we remove is > not significantly more common than the non-MCV items at that point, > so removing it shouldn't significantly increase the non-MCV > selectivity. It's possible the cumulative effect of removing multiple > items might start to add up, but I think it would necessarily be a > slow effect, and I think it would keep getting slower and slower as > more items are removed -- isn't this equivalent to constructing a > sequence of numbers where each number is a little greater than the > average of all the preceding numbers, and ends up virtually > flat-lining. > Yes, I think you're right. Another thing that occurred to me is that we're pretty much guaranteed to misestimate things at the tail end, and in my experience under-estimates have far worse consequences. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

### Re: MCV lists for highly skewed distributions

On 17 March 2018 at 18:40, Tomas Vondrawrote: > Currently, analyze_mcv_list only checks if the frequency of the current > item is significantly higher than the non-MCV selectivity. My question > is if it shouldn't also consider if removing the item from MCV would not > increase the non-MCV selectivity too much. > Oh, I see what you're saying. In theory, each MCV item we remove is not significantly more common than the non-MCV items at that point, so removing it shouldn't significantly increase the non-MCV selectivity. It's possible the cumulative effect of removing multiple items might start to add up, but I think it would necessarily be a slow effect, and I think it would keep getting slower and slower as more items are removed -- isn't this equivalent to constructing a sequence of numbers where each number is a little greater than the average of all the preceding numbers, and ends up virtually flat-lining. Regards, Dean

### Re: MCV lists for highly skewed distributions

On 03/17/2018 07:28 PM, Dean Rasheed wrote: > On 16 March 2018 at 15:26, Tomas Vondrawrote: >> Actually, one question - when deciding whether to keep the item in the >> MCV list, analyze_mcv_list only compares it's frequency with an average >> of the rest. But as we're removing items from the MCV list, the average >> frequency of the non-MCV items is growing (we're removing items with >> higher and higher frequencies). That means the estimates for the least >> common items will get higher and higher - essentially overestimates. So, >> couldn't/shouldn't analyze_mcv_list consider this too? >> > > Yes, that's the intention. At the start, sumcount is set to the count > of all but the last (least common) MCV item, so it can estimate the > frequency of the non-MCV items if the last MCV item were to be > removed. Then each time through the loop, sumcount is decreased by the > removed item's count, increasing the estimated frequency of the > non-MCV items accordingly. > I know it's updated like that, but that's not quite what I meant. Sorry for not making the question clearer, so let me rephrase it. Currently, analyze_mcv_list only checks if the frequency of the current item is significantly higher than the non-MCV selectivity. My question is if it shouldn't also consider if removing the item from MCV would not increase the non-MCV selectivity too much. But perhaps that would be a silly idea, because the non-MCV items may also be seen as a random noise. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

### Re: MCV lists for highly skewed distributions

On 16 March 2018 at 15:26, Tomas Vondrawrote: > Actually, one question - when deciding whether to keep the item in the > MCV list, analyze_mcv_list only compares it's frequency with an average > of the rest. But as we're removing items from the MCV list, the average > frequency of the non-MCV items is growing (we're removing items with > higher and higher frequencies). That means the estimates for the least > common items will get higher and higher - essentially overestimates. So, > couldn't/shouldn't analyze_mcv_list consider this too? > Yes, that's the intention. At the start, sumcount is set to the count of all but the last (least common) MCV item, so it can estimate the frequency of the non-MCV items if the last MCV item were to be removed. Then each time through the loop, sumcount is decreased by the removed item's count, increasing the estimated frequency of the non-MCV items accordingly. Regards, Dean

### Re: MCV lists for highly skewed distributions

On 17 March 2018 at 17:16, Dean Rasheedwrote: > Using the calculator above, you can see that the distribution is > fairly normal-like, but with a noticeable positive skew. The 2-stddev > interval is 0.6 to 9.4, and according to the calculator the > probability of the value being less than or equal to 1 is in the > ballpark of the 2.5% figure expected. So even with just 5 occurrences > in the sample, it's fairly close to a normal distribution. > One thing this does illustrate is that the hypergeometric distribution is a discrete distribution and there can be quite large jumps in the probability from one value to the next, so care may be needed when approximating it with a continuous distribution. The standard technique used to handle this is to apply what is known as a continuity correction factor. Suppose that X is a random variable with a discrete hypergeometric distribution, and Y is a continuous normal distribution, with the same mean and variance, being used to approximate X. Then P(X>i) for some integer i is the same as P(X>=i+1), because X can only be integer-valued. The idea is then that you can use P(Y>i+0.5) to get a fairly good approximation to P(X>i). That would correspond to adding 0.5 to the right-hand side of the current test, i.e., if (mcv_counts[num_mcv - 1] > selec * samplerows + 2 * stddev + 0.5) => Common enough to be included in MCV-list A lot of the time that won't make much difference, except when dealing with the smaller counts at the tail end of the MCV list, where it might help avoid the too-many-mcvs problem, so I think it's worth trying out. Apparently, in textbooks, an interval like the mean +/- 2*stddev is known as a Wald confidence interval, and the mean +/- (2*devdev+0.5) is the continuity-corrected Wald interval, so it's probably worth mentioning that in the comments. They are generally regarded as quite crude approximations for hypergeometric distributions, and there's quite a bit of research around establishing more accurate confidence intervals for this kind of distribution, but the formulae involved tend to be quite complicated, whereas the Wald interval has the advantage of being very simple. In this context, I don't think we need to establish a particularly accurate confidence interval. We just want to be able to say that the value is probably more common than the non-mcv values, without being too rigorous about what "probably" means, as long as it works in practice to discount values that just happen to be a bit more common in the sample, but aren't actually more common in the table as a whole. Regards, Dean

### Re: MCV lists for highly skewed distributions

On 13 March 2018 at 08:39, John Naylorwrote: >> Also, this is common enough that in fact that distribution >> can be reasonably approximated by a normal distribution. > > For the archives, because it's typically seen 10 times in the sample, > per the rule of thumb mention upthread? > Actually, I think that the rule of thumb of at least 10 instances in the sample for a normal approximation is probably too strict. Looking around, values as low as 5 seem to be quite commonly used. Of course there is no right answer here, it depends on what you're doing with it, and how close you want the normal distribution to be to the hypergeometric one. For our purposes, I don't think we really need it to be that close. We just want to be able to justify statements like "the value is *probably* more common than X, and it was unlikely that that was just random sampling error". >> Making use of >> the non-MCV-based estimate allows us to do better -- if the MCV-based >> estimate is statistically significantly higher than the non-MCV-based >> estimate, then it would almost certainly be better to include that >> value in the MCV list, even if its spread is quite large. > > Because we can treat the sample as normal, testing for > 2 standard > deviations means that we're "~95% sure" it's more common in the table > than the non-MCV selectivity would suggest, right? (I realize I'm not > using technically accurate language) > Actually, it's more like 97.5% (remember the normal distribution is symmetric, so there's around 2.5% below the 2-stddev interval, around 95% in it, and another 2.5% above it), but that's just nit-picking. There are several nice online calculators for playing around with hypergeometric distributions, such as http://keisan.casio.com/exec/system/1180573201 Consider this distribution: N = 100 n = 1 K = 500 Mean = n * K / N = 5 Variance = (K / N) * (1 - K / N) * n * (N - n) / (N - 1) = 4.9 Stddev = sqrt(Variance) = 2.2 Using the calculator above, you can see that the distribution is fairly normal-like, but with a noticeable positive skew. The 2-stddev interval is 0.6 to 9.4, and according to the calculator the probability of the value being less than or equal to 1 is in the ballpark of the 2.5% figure expected. So even with just 5 occurrences in the sample, it's fairly close to a normal distribution. Regards, Dean

### Re: MCV lists for highly skewed distributions

> On 03/11/2018 10:23 AM, Dean Rasheed wrote: >> I'm moving this back to a status of "Needs review" partly because the >> code has changed significantly, but also because I want to do more >> testing, particularly with larger datasets. >> John, Tomas, Thanks for looking at this, and sorry for my delayed reply. The demands of my day job keep getting in the way. I'm feeling pretty good about this patch though. There may still be some fine tuning to do, but it's looking like a definite improvement over HEAD, so I'm optimistic that I'll be able to finish it soonish. I'll post back more detailed responses to your specific points shortly. Regards, Dean

### Re: MCV lists for highly skewed distributions

On 03/11/2018 10:23 AM, Dean Rasheed wrote: > ... > > I'm moving this back to a status of "Needs review" partly because the > code has changed significantly, but also because I want to do more > testing, particularly with larger datasets. > Thanks for working on this. The code seems fine to me, although it might be a good idea to add comments explaining why analyze_mcv_list() starts with full MCV lists and then removes items (essentially the explanation why Jeff's original idea would not work so well). Actually, one question - when deciding whether to keep the item in the MCV list, analyze_mcv_list only compares it's frequency with an average of the rest. But as we're removing items from the MCV list, the average frequency of the non-MCV items is growing (we're removing items with higher and higher frequencies). That means the estimates for the least common items will get higher and higher - essentially overestimates. So, couldn't/shouldn't analyze_mcv_list consider this too? I've also done a bunch of testing regarding join cardinality estimates, because eqjoinsel_inner() is one of the places using MCV lists too. So I've generated tables with different sizes and data distributions, and observed how the patch affects the join estimates. The scripts and results are available here: https://github.com/tvondra/join-estimates-tests The tables were not particularly huge at this point (1000 to 100k rows), I've also varied number of distinct values (100 - 10k), statistics target (10 - 10k) and distribution (for each table independently): 1) uniform insert into t1 select random() * $ndistinct1, i from generate_series(1,$nrows1) s(i) 2) skewed insert into t2 select (1 - pow(random(),2)) * $ndistinct2, i from generate_series(1,$nrows2) s(i) 3) skewed-inverse insert into t2 select (1 - pow(random(),2)) * $ndistinct2, i from generate_series(1,$nrows2) s(i) 4) skewed insert into t2 select (1 - pow(random(),4)) * $ndistinct2, i from generate_series(1,$nrows2) s(i) 5) skewed-strong-inverse insert into t2 select (1 - pow(random(),4)) * $ndistinct2, i from generate_series(1,$nrows2) s(i) The results are a bit too large to attach them here - see for example https://github.com/tvondra/join-estimates-tests/blob/master/bench/summary.ods. Looking at Mean Relative Error, i.e. essentially MRE = AVERAGE(MAX(estimate/actual,actual/estimate)) over the 20 runs for each combination of parameters, The "average" sheet shows (MRE for patched) / (MRE for master) and the patch seems to clearly improve accuracy in this case. There are a few cases where the estimate gets slightly worse (say, by 10%) compared to current master. So I think that's nice. I'm open to running additional tests with other distributions, table sizes etc. if needed. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

### Re: MCV lists for highly skewed distributions

Hi Dean, I've looked over your patch briefly, but not tested it yet. > [construction of a pathological data set] > So the table has around 31 million rows, and the stats target is maxed > out -- we're sampling around 10% of the table, and it's not possible > to sample more. Looking at the value a=5, it appears 102 times in the > table, so we can expect it to be seen around 10 times in any sample, > but the minimum count for the new algorithm in this case is 23, so it > won't be included in the MCV list. This leads to it having the same > estimate as all other non-MCV values, which turns out to be rows=5, a > considerable under-estimate. > So arguably, the new algorithm is still better than the current one > for this data, but the fact that it doesn't include a=5 in the list > bugs me, because the estimate for that single value is consistently > worse. Looking at the distribution of a=5 in the sample, it is a > hypergeometric distribution with N=3100, n=300 and K=102. This > gives a sample mean of 9.8 and a variance of 8.9 (a standard deviation > of 3.0). Mean = n * K / N = 9.8 Var = (K / N) * (1 - K / N) * n * (N - n) / (N - 1) = 8.9 Stdev = sqrt(Var) = 3.0 Got it. > Also, this is common enough that in fact that distribution > can be reasonably approximated by a normal distribution. For the archives, because it's typically seen 10 times in the sample, per the rule of thumb mention upthread? > The value is > excluded from the MCV list because the spread is deemed too large, > relative to the mean, so the relative error of the MCV-based estimate > would be too high. However, not including it in the MCV list causes it > to have an estimate of 5, which would correspond to a sample mean of > 0.5, and the observed sample mean is more than 3 standard deviations > higher than that. So we do actually have enough information to > conclude that the value is almost certainly more common than the > non-MCV selectivity would suggest, and I think that's sufficient > justification for including it in the MCV list. > The crux of this is that the relative-standard-error algorithm is > making use of 2 pieces of information -- the sample frequency, and an > estimate of its statistical spread. However, there is a 3rd piece of > information that we know, that is not being made use of -- the > selectivity that would be assigned to the value if it were not > included in the MCV list. Looking at just the first 2 pieces of > information ensures that we get decent estimates for the values in the > MCV list (where what constitutes "decent" depends on an arbitrarily > chosen RSE cutoff), but it doesn't take into account the fact that the > values not in the list may have much worse estimates. Making use of > the non-MCV-based estimate allows us to do better -- if the MCV-based > estimate is statistically significantly higher than the non-MCV-based > estimate, then it would almost certainly be better to include that > value in the MCV list, even if its spread is quite large. Because we can treat the sample as normal, testing for > 2 standard deviations means that we're "~95% sure" it's more common in the table than the non-MCV selectivity would suggest, right? (I realize I'm not using technically accurate language) In your v1 patch, we didn't need a clamp on 10 values in the sample to treat it as normal, because it was mathematically impossible to pass the RSE test with less than 10 unless we were sampling most of the table, in which case it didn't matter. Is there a similar justification with the new algorithm? > [results summary] > > So HEAD and P1 are both producing too many MCVs. The latest 2 patches > cure that problem, but the new v2 patch is better at picking out all > the common values. Looks good. I'll run some tests of my own this week, trying to find corner cases different from yours. As far as the code goes, I haven't studied it very closely yet, but I did want to call attention to this comment: + /* +* If the value is kept in the MCV list, its population frequency is +* assumed to equal its sample frequency, and the distribution of the +* value's count in the sample is a hypergeomtric distribution with +* the following standard deviation. +*/ The part after "and" doesn't seem to follow from the first part. Would you mind clarifying? -John Naylor

### Re: MCV lists for highly skewed distributions

On 6 March 2018 at 16:48, Dean Rasheedwrote: > On 6 March 2018 at 08:51, John Naylor wrote: >> On 3/5/18, Dean Rasheed wrote: >>> Attached is an updated patch. >> Nice. The results look good. > > Thanks for the review. > So I was about ready to commit this, but decided to do more testing, because I worry that there may be instances that a user could point to where it makes estimates worse. Maybe that's inevitable for any change we might make, and I don't think that should stop us from at least trying to improve things, but it does make me wary about committing this without a lot of testing. In all the testing I've done so far, this looks to be a definite improvement over the current algorithm, but it looks like it's possible to do better. Consider the following test case: drop table if exists foo; create temp table foo(a int); insert into foo select * from ( select x/2 from generate_series(0,1999) g(x) union all select 0 from generate_series(0,999) union all select 1 from generate_series(0,99) union all select 2 from generate_series(0,9) union all select 3 from generate_series(0,) union all select 4 from generate_series(0,999) union all select 5 from generate_series(0,99) ) t order by random(); alter table foo alter column a set statistics 1; analyse foo; So the table has around 31 million rows, and the stats target is maxed out -- we're sampling around 10% of the table, and it's not possible to sample more. Looking at the value a=5, it appears 102 times in the table, so we can expect it to be seen around 10 times in any sample, but the minimum count for the new algorithm in this case is 23, so it won't be included in the MCV list. This leads to it having the same estimate as all other non-MCV values, which turns out to be rows=5, a considerable under-estimate. By comparison, the current algorithm in HEAD does include this value in the MCV list, so it gets a good estimate. On the other hand, it generates a complete MCV-list with 1 entries, most of which are just random values from the list that appear twice in the table and also happen to appear twice in the sample. So there are nearly 1 values that are significantly over-estimated in this case. So arguably, the new algorithm is still better than the current one for this data, but the fact that it doesn't include a=5 in the list bugs me, because the estimate for that single value is consistently worse. Looking at the distribution of a=5 in the sample, it is a hypergeometric distribution with N=3100, n=300 and K=102. This gives a sample mean of 9.8 and a variance of 8.9 (a standard deviation of 3.0). Also, this is common enough that in fact that distribution can be reasonably approximated by a normal distribution. The value is excluded from the MCV list because the spread is deemed too large, relative to the mean, so the relative error of the MCV-based estimate would be too high. However, not including it in the MCV list causes it to have an estimate of 5, which would correspond to a sample mean of 0.5, and the observed sample mean is more than 3 standard deviations higher than that. So we do actually have enough information to conclude that the value is almost certainly more common than the non-MCV selectivity would suggest, and I think that's sufficient justification for including it in the MCV list. The crux of this is that the relative-standard-error algorithm is making use of 2 pieces of information -- the sample frequency, and an estimate of its statistical spread. However, there is a 3rd piece of information that we know, that is not being made use of -- the selectivity that would be assigned to the value if it were not included in the MCV list. Looking at just the first 2 pieces of information ensures that we get decent estimates for the values in the MCV list (where what constitutes "decent" depends on an arbitrarily chosen RSE cutoff), but it doesn't take into account the fact that the values not in the list may have much worse estimates. Making use of the non-MCV-based estimate allows us to do better -- if the MCV-based estimate is statistically significantly higher than the non-MCV-based estimate, then it would almost certainly be better to include that value in the MCV list, even if its spread is quite large. This is somewhat similar to the initial idea from Jeff Janes, except that patch didn't take into account the spread, so it ended up making the too-many-mcvs problem worse for uniform distributions. There's also another subtlety with that patch -- when comparing frequencies of values not in the MCV list, you really have to start from a fully populated MCV list and work down, rather than starting with a empty one and working up, because if all the common values have about the same frequency, the most common amongst them may not be much more

### Re: MCV lists for highly skewed distributions

On 6 March 2018 at 08:51, John Naylorwrote: > On 3/5/18, Dean Rasheed wrote: >> Attached is an updated patch. > Nice. The results look good. Thanks for the review. > I agree it should be in a separate function. As for that large > comment, I spent some time pouring over it to verify the math and make > sure I understand it. I see a couple points where it might be a bit > confusing to someone reading this code for the first time: > > +* This bound is at most 25, and approaches 0 as n approaches 0 or N. > The > +* case where n approaches 0 cannot happen in practice, since the > sample > +* size is at least 300. > > I think the implication is that the bound cannot dip below 10 (the > stated minimum to justify using techniques intended for a normal > distributed sample), so there's no need to code a separate clamp to > ensure 10 values. That might be worth spelling out explicitly in the > comment. > Perhaps I should really say that n can't be less than 300 unless N is too, in which case n=N. So the only case where we need to worry about the bound approaching zero is when n --> N, and we're sampling the entire table, or almost all of it. I'll see if I can re-word that to be clearer. > +* size is at least 300. The case where n approaches N corresponds to > +* sampling the whole the table, in which case it is reasonable to > keep > +* the whole MCV list (have no lower bound), so it makes sense to > apply > +* this formula for all inputs, even though the above derivation is > +* technically only valid when the right hand side is at least around > 10. > > It occurred to me that if the table cardinality is just barely above > the sample size, we could get a case where a value sampled only once > could slip into the MCV list. With the default stat target, this would > be tables with between 30,000 and 31,500 rows. (actually I think that's 31250 rows, or more generally when we sample more than around 96% of the table) > I couldn't induce this > behavior, so I looked at the logic that identifies MCV candidates, and > saw a test for > > if (dups_cnt > 1) > > If I'm reading that block correctly, than a value sampled once will > never even be considered for the MCV list, so it seems that the > defacto lower bound for these tables is two. It might be worth > mentioning that in the comment. > > It also means that this clamp on HEAD > > - if (mincount < 2) > - mincount = 2; > > is dead code. > Yes, in compute_scalar_stats(), each value is guaranteed to have been seen at least twice. However, looking at compute_distinct_stats(), that's not the case. I'm not sure if that matters. When we have sampled 96% of the table, any estimate we produce based on the number of times a value has been seen in the sample is likely to be almost exact, even if it has only been seen once or twice. So the estimates from the MCV list will all be good, but I suspect in this case the estimates for all other values that didn't fit in the MCV list will also be good, so it may not matter whether or not we keep those MCV items. My instinct is to keep them, on the grounds that the more information the planner has, the better. >> mcv_cutoff - The relative standard error cutoff percentage used (10, >> 15, 20, ... 50), or -10, -15, ... -50 for the corresponding tests >> against HEAD. > > I'm not quite following the negative numbers for HEAD. They're all the > equivalent, right? Just a label for convenience to make sure you ran > the same number of tests? > Yes, they should all be equivalent. They just reflect the way I ran the tests -- HEAD vs the patch with a cutoff of 10%, HEAD vs the patch with a cutoff of 15%, and so on. So I ended up running it against HEAD 9 times, and I didn't want to throw any of that data away. It's useful for getting a feel for the scope of random variations between test runs. Regards, Dean

### Re: MCV lists for highly skewed distributions

On 3/5/18, Dean Rasheedwrote: > Attached is an updated patch. I decided to move the code that > determines the minimum number of occurrences for an MCV list item out > to a separate function. It's not much code, but there's a large > comment to explain its statistical justification, and I didn't want to > duplicate that in 2 different places (possibly to become 3, with the > multivariate MCV stats patch). Nice. The results look good. I agree it should be in a separate function. As for that large comment, I spent some time pouring over it to verify the math and make sure I understand it. I see a couple points where it might be a bit confusing to someone reading this code for the first time: +* This bound is at most 25, and approaches 0 as n approaches 0 or N. The +* case where n approaches 0 cannot happen in practice, since the sample +* size is at least 300. I think the implication is that the bound cannot dip below 10 (the stated minimum to justify using techniques intended for a normal distributed sample), so there's no need to code a separate clamp to ensure 10 values. That might be worth spelling out explicitly in the comment. +* size is at least 300. The case where n approaches N corresponds to +* sampling the whole the table, in which case it is reasonable to keep +* the whole MCV list (have no lower bound), so it makes sense to apply +* this formula for all inputs, even though the above derivation is +* technically only valid when the right hand side is at least around 10. It occurred to me that if the table cardinality is just barely above the sample size, we could get a case where a value sampled only once could slip into the MCV list. With the default stat target, this would be tables with between 30,000 and 31,500 rows. I couldn't induce this behavior, so I looked at the logic that identifies MCV candidates, and saw a test for if (dups_cnt > 1) If I'm reading that block correctly, than a value sampled once will never even be considered for the MCV list, so it seems that the defacto lower bound for these tables is two. It might be worth mentioning that in the comment. It also means that this clamp on HEAD - if (mincount < 2) - mincount = 2; is dead code. > I think this is ready for commit now, so barring objections, I will do > so in the next day or so. Fine by me. I've skimmed your test methodology and results and have just one question below. > There's a lot of random variation between tests, but by running a lot > of them, the general pattern can be seen. I ran the test 9 times, with > different MCV cutoff values in the patched code of 10%, 15%, ... 50%, > then collected all the results from the test runs into a single CSV > file to analyse using SQL. The columns in the CSV are: > > test_name - A textual description of the data distribution used in the > test. > > mcv_cutoff - The relative standard error cutoff percentage used (10, > 15, 20, ... 50), or -10, -15, ... -50 for the corresponding tests > against HEAD. I'm not quite following the negative numbers for HEAD. They're all the equivalent, right? Just a label for convenience to make sure you ran the same number of tests? -John Naylor

### Re: MCV lists for highly skewed distributions

On 7 February 2018 at 15:58, Dean Rasheedwrote: > On 7 February 2018 at 15:25, Robert Haas wrote: >> Do you plan to press forward with this, then, or what's >> the next step? > > I plan to do more testing TL;DR: I tested this new algorithm using 9 different relative standard error cutoffs (10%, 15%, ... 50%) against a variety of different data distributions, and it looks like a definite improvement over the current algorithm, for a wide range of cutoff values. Overall, it isn't that sensitive to the exact cutoff chosen, but it looks like a value of 20% is a good general-purpose choice. One of the properties of the new algorithm is that the size of the MCV list responds better to changing the stats target, so I don't think it's worth having a separate knob to control the cutoff percentage. Such a knob would have a similar, but weaker effect than the stats target knob, and thus doesn't seem worthwhile. Attached is an updated patch. I decided to move the code that determines the minimum number of occurrences for an MCV list item out to a separate function. It's not much code, but there's a large comment to explain its statistical justification, and I didn't want to duplicate that in 2 different places (possibly to become 3, with the multivariate MCV stats patch). I think this is ready for commit now, so barring objections, I will do so in the next day or so. --- I tested using the attached python script, which tests a large number of different data distributions, comparing the results with the patch vs those from HEAD. It uses EXPLAIN ANALYSE to compare the estimates against actual row counts, both for values in the MCV list, and non-MCV values, making it possible to tell whether it would have been better if the MCV list had been bigger or smaller. There's a lot of random variation between tests, but by running a lot of them, the general pattern can be seen. I ran the test 9 times, with different MCV cutoff values in the patched code of 10%, 15%, ... 50%, then collected all the results from the test runs into a single CSV file to analyse using SQL. The columns in the CSV are: test_name - A textual description of the data distribution used in the test. mcv_cutoff - The relative standard error cutoff percentage used (10, 15, 20, ... 50), or -10, -15, ... -50 for the corresponding tests against HEAD. stats_tgt - The stats target used (100 or 1000). num_mcvs - The size of the resulting MCV list. avg_mcv_err - The average percentage difference between estimated and actual row counts for values in the MCV list. (There's a bit of a fudge factor in calculating these percentages, to avoid them becoming too large in cases where the row counts are small, but I think it's still useful for comparison purposes.) max_mcv_err - The maximum percentage difference between estimated and actual row counts for values in the MCV list. avg_non_mcv_err - The average percentage difference between estimated and actual row counts for a bunch of non-MCV values. max_non_mcv_err - The maximum percentage difference between estimated and actual row counts for a bunch of non-MCV values. avg_err - The average percentage difference between estimated and actual row counts for all values tested. max_err - The maximum percentage difference between estimated and actual row counts for all values tested. >From this, it's possible to look for cases where the MCV list is too large or too small. For example, looking for too-many-mcv cases (generally the more uniform data distributions): SELECT mcv_cutoff, count(*) FROM results WHERE max_mcv_err - max_non_mcv_err > 50 GROUP BY 1 ORDER BY 2 DESC; mcv_cutoff | count +--- -40 |41 -50 |40 -25 |39 -15 |38 -20 |38 -30 |38 -35 |38 -45 |37 -10 |37 50 |35 40 |35 45 |34 35 |33 30 |33 25 |25 20 |13 15 | 6 10 | 3 (18 rows) SELECT mcv_cutoff, count(*) FROM results WHERE max_mcv_err - max_non_mcv_err > 100 GROUP BY 1 ORDER BY 2 DESC; mcv_cutoff | count +--- -30 |17 -40 |16 -15 |15 -25 |14 -10 |14 -35 |14 -45 |13 -50 |13 -20 |12 35 |12 45 |11 40 |10 30 |10 50 |10 25 | 6 10 | 2 (16 rows) ( and implicitly: 15 | 0 20 | 0 ) So the patched code is generally better at avoiding the too-many-mcvs problem, but an mcv_cutoff of 30% or more may be too high. Looking for too-few-mcv cases: SELECT mcv_cutoff, count(*) FROM results WHERE (max_non_mcv_err - max_mcv_err > 50 AND num_mcvs < stats_tgt) GROUP BY 1 ORDER BY 2 DESC; mcv_cutoff | count

### Re: MCV lists for highly skewed distributions

On 1 March 2018 at 21:01, Andres Freundwrote: > This sounds like the patch's status of "waiting on author" isn't right, > and it should more be ready for committer? > Yes, I'll take a look at it this weekend. Regards, Dean

### Re: MCV lists for highly skewed distributions

Hi Dean, On 2018-02-07 15:58:14 +, Dean Rasheed wrote: > On 7 February 2018 at 15:25, Robert Haaswrote: > > Do you plan to press forward with this, then, or what's > > the next step? > > > > Yes, I think the results are pretty good so far, especially for the > more non-uniform distributions. AFAICS it solves the 2 original > complaints, and I've not yet seen a case where it makes things worse. > > I plan to do more testing (and if anyone else wants to, that would be > most welcome), and then barring any unexpected issues/objections, I'll > commit it. Probably not this week though. This sounds like the patch's status of "waiting on author" isn't right, and it should more be ready for committer? Greetings, Andres Freund

### Re: MCV lists for highly skewed distributions

On 7 February 2018 at 15:25, Robert Haaswrote: > Do you plan to press forward with this, then, or what's > the next step? > Yes, I think the results are pretty good so far, especially for the more non-uniform distributions. AFAICS it solves the 2 original complaints, and I've not yet seen a case where it makes things worse. I plan to do more testing (and if anyone else wants to, that would be most welcome), and then barring any unexpected issues/objections, I'll commit it. Probably not this week though. Regards, Dean

### Re: MCV lists for highly skewed distributions

On Wed, Feb 7, 2018 at 10:20 AM, Dean Rasheedwrote: > One thing this new algorithm does do is improve the user's ability to > get more MCVs by increasing the stats target. I'm not yet convinced > there should be a separate knob for the RSE cutoff. For that to be > useful, there would need to be some data distributions for which 10% > (say) was clearly better, and some for which 20% was better. So far, > there doesn't appear to be a massive difference between the two, and > it's nothing that can't compensated for using the existing stats > target knob. Fair enough. Do you plan to press forward with this, then, or what's the next step? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

### Re: MCV lists for highly skewed distributions

On 7 February 2018 at 13:29, Robert Haaswrote: > On Wed, Feb 7, 2018 at 3:51 AM, Dean Rasheed wrote: >> Thanks for testing. I agree, this new algorithm seems to stand up >> pretty well in the testing I've done too. One thing about it that can >> be tuned is the cutoff threshold for the relative standard error -- I >> chose 10% completely arbitrarily, but it could just as easily be 20%, >> 30% or even higher (in the patch s/0.01/0.04/ or s/0.01/0.09). > > Maybe it'd be worth having a separate GUC for this, and a reloption to > override the GUC. It seems to me that it would be very reasonable to > want to separately control (a) how much sampling you want to do and > (b) how aggressively you want to be about including things in the MCV > list. Of course, even if we do that, it doesn't excuse us from > needing to set a good default value. And it might not be necessary to > do the extra work for this. > > Looking at your data, it definitely seems like 10% would be too strict > -- if I'm reading this correctly, with a stats target in the 10-50 > range, your normally-distributed table gets no MCVs at all, rather > than a number equal to the statistics target. That doesn't seem good. > (Actually it was the 10-30 stats target range that gave no MCVs at all for 10%) The fact that 10% leads to no MCVs for a deliberately lowered stats target of 10 isn't necessarily bad, and might actually have been an improvement -- e.g., with a stats target of 1, you get no MCVs even with a 20% RSE cutoff, whereas with HEAD you typically get 1 completely random MCV, and a wildly inaccurate estimate for that value. The reason I think 10% is too low is that the extra MCVs you get by increasing it to 20% generally seem to give better estimates (for a range of stats targets) than if they weren't included (although it's sometimes a bit marginal). So 20% seems to strike about the right balance between too many and too few MCVs. One thing this new algorithm does do is improve the user's ability to get more MCVs by increasing the stats target. I'm not yet convinced there should be a separate knob for the RSE cutoff. For that to be useful, there would need to be some data distributions for which 10% (say) was clearly better, and some for which 20% was better. So far, there doesn't appear to be a massive difference between the two, and it's nothing that can't compensated for using the existing stats target knob. Regards, Dean

### Re: MCV lists for highly skewed distributions

On Wed, Feb 7, 2018 at 3:51 AM, Dean Rasheedwrote: > Thanks for testing. I agree, this new algorithm seems to stand up > pretty well in the testing I've done too. One thing about it that can > be tuned is the cutoff threshold for the relative standard error -- I > chose 10% completely arbitrarily, but it could just as easily be 20%, > 30% or even higher (in the patch s/0.01/0.04/ or s/0.01/0.09). Maybe it'd be worth having a separate GUC for this, and a reloption to override the GUC. It seems to me that it would be very reasonable to want to separately control (a) how much sampling you want to do and (b) how aggressively you want to be about including things in the MCV list. Of course, even if we do that, it doesn't excuse us from needing to set a good default value. And it might not be necessary to do the extra work for this. Looking at your data, it definitely seems like 10% would be too strict -- if I'm reading this correctly, with a stats target in the 10-50 range, your normally-distributed table gets no MCVs at all, rather than a number equal to the statistics target. That doesn't seem good. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

### Re: MCV lists for highly skewed distributions

On 4 February 2018 at 12:18, John Naylorwrote: > I did the same basic eyeball testing I did on earlier patches, and > this is the best implementation so far. I've attached some typical > pg_stats contents for HEAD and this patch. More rigorous testing, > including of planner estimates on real data, is still needed of > course, but I think this is definitely in the right direction. > Thanks for testing. I agree, this new algorithm seems to stand up pretty well in the testing I've done too. One thing about it that can be tuned is the cutoff threshold for the relative standard error -- I chose 10% completely arbitrarily, but it could just as easily be 20%, 30% or even higher (in the patch s/0.01/0.04/ or s/0.01/0.09). Looking at the too-many-MCVs example from [1], the table has 100 million rows, and each distinct value occurs 10 times. With the default stats target of 100, the MCV-list is fully populated with values, most having been seen two (or occasionally three) times. Thus, if you query for one of those MCVs, the estimate is 6667 or 10,000 rather than 10, which is a pretty bad over-estimate. Increasing the stats target improves things, because the sample frequencies go down correspondingly. The problem is also lessened a bit if you randomise the order of the values in the second column so that it isn't correlated to the storage order: insert into qqq select a, random()*10^7 from generate_series(1, (10^8)::int) a; However in a sample of 30,000 rows it remains reasonably likely that the same value will be seen twice (I typically observed 40 to 50 MCVs in this case, c.f. the birthday paradox), and in a sample of 300,000 rows it goes back to giving 1000 MCVs, each seen 2 or 3 times. So this isn't really the fault of the non-uniform ANALYSE sample, but rather of the current algorithm's belief that seeing a value just twice is sufficient for it to qualify as an MCV. The new MCV algorithm solves this by demanding that a value be seen many more times before being included in cases where the table size is much larger than the sample size. The exact threshold depends on the table size, the sample size and the relative error cutoff value. In this example, with a table size of 100 million rows, the sample size makes little difference, because its always going to be much smaller than the table size, so the number of instances required in the sample depends mostly on the RSE cutoff chosen: rse_cutoff | sample=3 | sample=30 | sample=300 +--+---+ 10%| 100 | 100 | 97 20%| 25 |25 | 25 30%| 12 |12 | 11 40%|7 | 7 | 7 50%|4 | 4 | 4 So any of those RSE cutoff's solves the too-many-MCVs problem in this particular case, giving no MCVs, although 50% is pushing it a bit, and in any case, the theoretical basis for this algorithm breaks down well before then. The advantage of a larger RSE cutoff is that it gives more MCVs for a non-uniform data distribution, where the current algorithm tends to give too few MCVs. In the attached example, the table has 1 million rows of integer data in a normal distribution with a standard deviation of 50. This typically gives somewhere in the region of 430 to 440 distinct values. Running against HEAD and with this patch, for varying sample sizes and RSE cutoffs gives the following MCV-list sizes: stats_target mcvs (HEAD) mcvs (10% RSE) mcvs (20% RSE) mcvs (30% RSE) 1010 0 10 10 2020 0 20 20 3030 0 30 30 4040 17 40 40 5050 50 50 50 100 100 100 100 100 300 132 204 261 290 1000 205 266 315 338 3000 252 364 400 411 1 296 413 414 412 One thing to note is that the HEAD algorithm never includes more than around 300 values, because of its requirement for a value to be more common than average. The new algorithm has no such requirement, so it will include nearly every value in the MCV list (except the most uncommon ones) if the stats target is set high enough. Also, the detailed output from the test shows that the resulting estimates based on those MCVs are pretty decent. (Note: this example shows that the too-few-MCVs problem can occur in any non-uniform distribution, not just highly skewed ones.) I've also tried this out against some real-world data, and in the testing I've done so far, the new algorithm is not actually that sensitive to

### Re: MCV lists for highly skewed distributions

On 1 February 2018 at 17:49, Robert Haaswrote: > One point which I want to emphasize is that the length of the MCV list > bounds the estimated frequency of non-MCVs in two ways: no non-MCV is > ever thought to be more frequent than the least-common MCVs, and > however many non-MCVs we think we have (probably fewer than we > actually have) have to fit into whatever percentage of the table is > consumed by MCVs. This would be less important if we had reliable > n_distinct estimates, but we don't. So, even throwing things into the > MCV list that are no more common than the average item can improve > planning in some cases. > That's a good point, and a nice explanation. I think that lends more weight to the argument that we should be including as many MCVs as possible, provided there's enough evidence to justify their inclusion. Regards, Dean

### Re: MCV lists for highly skewed distributions

On 2/2/18, Dean Rasheedwrote: > I think it would be better to try to come up with an alternative > algorithm that has a better theoretical basis, and then test that to > see how it holds up in practice. > > With that in mind, attached is a patch based on the idea of setting a > bound on the relative standard error on the sample frequency I did the same basic eyeball testing I did on earlier patches, and this is the best implementation so far. I've attached some typical pg_stats contents for HEAD and this patch. More rigorous testing, including of planner estimates on real data, is still needed of course, but I think this is definitely in the right direction. -John Naylor test_mcvstats_v1.sql Description: application/sql

### Re: MCV lists for highly skewed distributions

On Thu, Feb 1, 2018 at 12:21 PM, Dean Rasheedwrote: > For a highly skewed distribution, it is possible for there to be > hardly any values (maybe only one) that appears more than 1.25 times > the average frequency, and so lots of otherwise perfectly good common > values get discarded. For such a distribution, I don't think that the > average frequency is particularly interesting, and it shouldn't be > used to filter the MCV list. Although I don't think I have enough math background to analyze this as rigorously as you, I agree with you on this point: the average frequency doesn't seem very interesting. One point which I want to emphasize is that the length of the MCV list bounds the estimated frequency of non-MCVs in two ways: no non-MCV is ever thought to be more frequent than the least-common MCVs, and however many non-MCVs we think we have (probably fewer than we actually have) have to fit into whatever percentage of the table is consumed by MCVs. This would be less important if we had reliable n_distinct estimates, but we don't. So, even throwing things into the MCV list that are no more common than the average item can improve planning in some cases. > Testing it with the example from [1], it does indeed appear to solve > the too-many-MCVs problem in that particular case (actually generating > no MCVs). > > Testing it with the highly skewed example at the start of this thread, > it typically generates a couple more MCVs, producing somewhat better > estimates, but to get really good estimates for who=17, you need to > crank up the stats target. It does at least appear to no longer be the > case that cranking up the stats target has a weak effect. Hmm, nice result. I agree with you that it would be nice if we can come up with a good general-purpose algorithm for this, rather than making it pluggable. I don't know whether we can succeed in doing that but we should try hard, because it's always better when things just work automatically... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

### Re: MCV lists for highly skewed distributions

On 1 February 2018 at 13:16, Simon Riggswrote: > On 25 January 2018 at 22:19, Tom Lane wrote: >> In any case, since it looks like the next step is for someone to come >> up with a new proposal, I'm going to set this to Waiting on Author. > > Dean and John's results show that different algorithms work better for > different cases. > > How about we make ANALYZE's MCV algorithm pluggable? And then include, > say, 2 additional algorithms. > I don't think we've yet proved that that's needed. I'd rather attempt to come up with a decent general-purpose algorithm, if possible. The main problem I have with the originally proposed patch is the lack of any kind of rigorous justification for the approach of changing the algorithm to include values that are "significantly more common than average frequency for values which will not be represented in the MCV list". So there's no guarantee that the MCVs produced will be backed by sufficient evidence, and it risks making the too-many-MCVs case worse. Of course the current code suffers from both the too-many-MCVs and too-few-MCVs problems, depending on the data distribution: For a reasonably uniform distribution with quite a large number of distinct values, it is possible to generate MCVs on the basis of having seen only a few instances of the values. Those few values seen are then not sufficiently statistically significant, and extrapolating to the whole table produces wildly inaccurate estimates. For a highly skewed distribution, it is possible for there to be hardly any values (maybe only one) that appears more than 1.25 times the average frequency, and so lots of otherwise perfectly good common values get discarded. For such a distribution, I don't think that the average frequency is particularly interesting, and it shouldn't be used to filter the MCV list. There is also another variant of the too-many-MCVs problem that I believe is also possible -- if the sample contains a large number of NULLs or too-wide values, values_cnt could be reduced to the point where maxmincount is quite small, and again MCVs could get included on the basis of a very small number of occurrences. I think it would be better to try to come up with an alternative algorithm that has a better theoretical basis, and then test that to see how it holds up in practice. With that in mind, attached is a patch based on the idea of setting a bound on the relative standard error on the sample frequency -- so we include values in the MCV list if and only if they are seen enough times in the sample that the standard error on the sample frequency is small compared to the sample frequency itself, and thus we expect that the relative error resulting from extrapolating that to the whole table is also small. In theory then, it should never generate too many MCVs, although tuning of the relative error threshold might be required to prevent it from generating too few. Note, this is not a finished patch (e.g., I haven't touched compute_distinct_stats()). It's merely a possible starting point from which a lot more testing will be required. Testing it with the example from [1], it does indeed appear to solve the too-many-MCVs problem in that particular case (actually generating no MCVs). Testing it with the highly skewed example at the start of this thread, it typically generates a couple more MCVs, producing somewhat better estimates, but to get really good estimates for who=17, you need to crank up the stats target. It does at least appear to no longer be the case that cranking up the stats target has a weak effect. Regards, Dean [1] https://www.postgresql.org/message-id/flat/20170522132017.29944.48...@wrigleys.postgresql.org diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c new file mode 100644 index 5f21fcb..af92835 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -2557,25 +2557,21 @@ compute_scalar_stats(VacAttrStatsP stats * Decide how many values are worth storing as most-common values. If * we are able to generate a complete MCV list (all the values in the * sample will fit, and we think these are all the ones in the table), - * then do so. Otherwise, store only those values that are - * significantly more common than the (estimated) average. We set the - * threshold rather arbitrarily at 25% more than average, with at - * least 2 instances in the sample. Also, we won't suppress values - * that have a frequency of at least 1/K where K is the intended - * number of histogram bins; such values might otherwise cause us to - * emit duplicate histogram bin boundaries. (We might end up with - * duplicate histogram entries anyway, if the distribution is skewed; - * but we prefer to treat such values as MCVs if at all possible.) + * then do so. Otherwise, keep only those values that appear + * sufficiently often in the sample that it is reasonable to + *

### Re: MCV lists for highly skewed distributions

On 25 January 2018 at 22:19, Tom Lanewrote: > Dean Rasheed writes: >> It occurs to me that maybe a better test to exclude a value from the >> MCV list would be to demand that its relative standard error not be >> too high. Such a test, in addition to the existing tests, might be >> sufficient to solve the opposite problem of too many values in the MCV >> list, because the real problem there is including a value after having >> seen relatively few occurrences of it in the sample, and thus having a >> wildly inaccurate estimate for it. Setting a bound on the relative >> standard error would mean that we could have a reasonable degree of >> confidence in estimates produced from the sample. > > This patch is marked Ready for Committer, but that seems wildly optimistic > based on the state of the discussion. It doesn't look to me like we > even have consensus on an algorithm, let alone code for it. Certainly, > whatever we do needs to address the too-many-MCVs issue from [1] as > well as Jeff's too-few-MCVs case. > > Even if we had a plausible patch, I'm not sure how we get to the > point of having enough consensus to commit. In the previous thread, > it seemed that some people would object to any change whatsoever :-( > > In any case, since it looks like the next step is for someone to come > up with a new proposal, I'm going to set this to Waiting on Author. Dean and John's results show that different algorithms work better for different cases. How about we make ANALYZE's MCV algorithm pluggable? And then include, say, 2 additional algorithms. -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

### Re: MCV lists for highly skewed distributions

Dean Rasheedwrites: > It occurs to me that maybe a better test to exclude a value from the > MCV list would be to demand that its relative standard error not be > too high. Such a test, in addition to the existing tests, might be > sufficient to solve the opposite problem of too many values in the MCV > list, because the real problem there is including a value after having > seen relatively few occurrences of it in the sample, and thus having a > wildly inaccurate estimate for it. Setting a bound on the relative > standard error would mean that we could have a reasonable degree of > confidence in estimates produced from the sample. This patch is marked Ready for Committer, but that seems wildly optimistic based on the state of the discussion. It doesn't look to me like we even have consensus on an algorithm, let alone code for it. Certainly, whatever we do needs to address the too-many-MCVs issue from [1] as well as Jeff's too-few-MCVs case. Even if we had a plausible patch, I'm not sure how we get to the point of having enough consensus to commit. In the previous thread, it seemed that some people would object to any change whatsoever :-( In any case, since it looks like the next step is for someone to come up with a new proposal, I'm going to set this to Waiting on Author. regards, tom lane [1] https://www.postgresql.org/message-id/flat/32261.1496611829%40sss.pgh.pa.us

### Re: stricter MCV tests for uniform distributions (was Re: MCV lists for highly skewed distributions)

On 22 January 2018 at 08:07, John Naylorwrote: > On 1/21/18, Dean Rasheed wrote: >> It occurs to me that maybe a better test to exclude a value from the >> MCV list would be to demand that its relative standard error not be >> too high. Such a test, in addition to the existing tests, might be >> sufficient to solve the opposite problem of too many values in the MCV >> list, because the real problem there is including a value after having >> seen relatively few occurrences of it in the sample, and thus having a >> wildly inaccurate estimate for it. Setting a bound on the relative >> standard error would mean that we could have a reasonable degree of >> confidence in estimates produced from the sample. > > If you don't mind, what would the math look like for that? Using the same syntax as before: N = Total rows in table (population size) n = Number of rows sampled x = Number of times a particular value appears in the sample p = x/n = Frequency of the value in the sample So that the standard error of the proportion is SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1)) Then the relative standard error (which is usually expressed as a percentage) is RSE = 100 * SE / p So if we were to demand that the relative standard error was less than, say, 10%, then the constraint would just be SE < 0.1 * p Note: * This formula not valid if x is very small (see what I wrote about being able to approximate the distribution of p with a normal distribution). So we should also enforce the "rule of thumb" x >= 10. * The frequency p that we're storing is the count divided by the overall sample size. So the values for N and n above should not be the counts that exclude NULLs. As far as this logic is concerned, NULLs (and too-wide values) are just values not equal to value under consideration. Thus it appears, from just a quick glance at your patch, that you're using the wrong values for N and n. * The RSE is a monotonically decreasing function of p in the range [0,1], so an upper bound on the RSE equates to a lower bound on the number of occurrences of the value. This last point gives me a different idea. Instead of applying this test *in addition to* the existing tests, how about applying it *instead of* the existing tests. That is, we keep all MCVs that appear sufficiently often that we can be reasonably confident in the estimates produced from their sample frequencies, regardless of whether or not they are more common than average (the average being fairly meaningless for a highly skewed distribution anyway). This is still keeping the most common values, but now we'd be saying that we keep any value that appears sufficiently often in the sample that we can extrapolate its sample frequency to produce a reasonably accurate estimate of the population frequency, and discard values for which the estimate is likely to be inaccurate. I have not tested this idea at all, but it seems like it might be worth considering. It has the nice property that the test depends entirely on how often the value appears, rather than on other previously computed statistics, such as Ndistinct. Doing a quick test in pure SQL, using the highly skewed distribution Jeff Janes gave in the other thread, with the default sample size of 30,000: with samples as ( select floor(-log(random())/log(2))::int as who from generate_series(1,3) ), freqs as ( select who, count(*) as x, count(*)/3::float8 as p from samples group by who ), stats as ( select *, sqrt(p*(1-p)/3) * sqrt((1000-3)::float8/(1000-1)) as se from freqs ) select *, (1000*p)::int||'+/-'||(2*se*1000)::int as "95% interval", case when x >=10 and se < 0.1*p then 'KEEP' else 'DISCARD' end from stats order by p desc limit 100; it pretty consistently keeps the 8 most common values: who | x | p | se | 95% interval | case -+---+--+--+-+- 0 | 15017 |0.5005667 | 0.00288241625942075 | 5005667+/-57648 | KEEP 1 | 7607 |0.2535667 | 0.00250800597590887 | 2535667+/-50160 | KEEP 2 | 3713 |0.1237667 | 0.0018984483 | 1237667+/-37969 | KEEP 3 | 1855 | 0.0618 | 0.0013884757600711 | 618333+/-27770 | KEEP 4 | 914 | 0.03046667 | 0.000990788179299791 | 304667+/-19816 | KEEP 5 | 448 | 0.0149 | 0.000699194759916533 | 149333+/-13984 | KEEP 6 | 229 | 0.00763 | 0.000501741670388358 | 76333+/-10035 | KEEP 7 | 108 | 0.0036 | 0.000345267009604061 | 36000+/-6905| KEEP 8 |46 | 0.00153 | 0.000225565173744715 | 15333+/-4511| DISCARD 9 |34 | 0.00113 | 0.000193963300230354 | 11333+/-3879| DISCARD 10 |17 | 0.000567 | 0.000137191663419411 | 5667+/-2744 | DISCARD 11 |11 |

### stricter MCV tests for uniform distributions (was Re: MCV lists for highly skewed distributions)

(Starting a new thread so as not to distract review) On 1/21/18, Dean Rasheedwrote: > On 21 January 2018 at 07:26, John Naylor wrote: >> I spent a few hours hacking on this, and it turns out calculating the >> right number of MCVs taking into account both uniform and highly >> non-uniform distributions is too delicate a problem for me to solve >> right now. The logic suggested by Dean Rasheed in [1] always produces >> no MCVs for a perfectly uniform distribution (which is good), but very >> often also for other distributions, which is not good. My efforts to >> tweak that didn't work, so I didn't get as far as adapting it for the >> problem Jeff is trying to solve. > > Hmm, Tom suggested that the test based on the average frequency over > all values might be too strict because the estimated number of > distinct values is often too low, so that might explain what you're > seeing. In my test tables, I've noticed that our Ndistinct estimator is most inaccurate for geometric distributions, so that's certainly possible, but confusingly, it occasionally gave an empty MCV list along with a histogram with a boundary duplicated 5 times, which I thought I was guarding against. I'm thinking my implementation of your logic is flawed somehow. In case you're curious I've attached my rough (complier warnings and all) test patch. > It occurs to me that maybe a better test to exclude a value from the > MCV list would be to demand that its relative standard error not be > too high. Such a test, in addition to the existing tests, might be > sufficient to solve the opposite problem of too many values in the MCV > list, because the real problem there is including a value after having > seen relatively few occurrences of it in the sample, and thus having a > wildly inaccurate estimate for it. Setting a bound on the relative > standard error would mean that we could have a reasonable degree of > confidence in estimates produced from the sample. If you don't mind, what would the math look like for that? -John Naylor diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 5f21fcb..da21333 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -2318,6 +2318,8 @@ compute_scalar_stats(VacAttrStatsP stats, int num_mcv = stats->attr->attstattarget; int num_bins = stats->attr->attstattarget; StdAnalyzeData *mystats = (StdAnalyzeData *) stats->extra_data; + double N, +n; values = (ScalarItem *) palloc(samplerows * sizeof(ScalarItem)); tupnoLink = (int *) palloc(samplerows * sizeof(int)); @@ -2525,10 +2527,10 @@ compute_scalar_stats(VacAttrStatsP stats, */ int f1 = ndistinct - nmultiple + toowide_cnt; int d = f1 + nmultiple; - double n = samplerows - null_cnt; - double N = totalrows * (1.0 - stats->stanullfrac); double stadistinct; + n = samplerows - null_cnt; + N = totalrows * (1.0 - stats->stanullfrac); /* N == 0 shouldn't happen, but just in case ... */ if (N > 0) stadistinct = (n * d) / ((n - f1) + f1 * n / N); @@ -2558,9 +2560,44 @@ compute_scalar_stats(VacAttrStatsP stats, * we are able to generate a complete MCV list (all the values in the * sample will fit, and we think these are all the ones in the table), * then do so. Otherwise, store only those values that are - * significantly more common than the (estimated) average. We set the - * threshold rather arbitrarily at 25% more than average, with at - * least 2 instances in the sample. Also, we won't suppress values + * significantly more common than the (estimated) average. + * + * Note: For this POC patch, the implementation and comments + * were copied from an email from Dean Rasheed, which contains further references: + * https://www.postgresql.org/message-id/CAEZATCVu9zK0N%3Dnd9ufavabbM8YZiyWYJca0oiE8F31GAY%2B_XA%40mail.gmail.com + * + * We calculate the threshold from the table and sample sizes. + + * The initial rule of thumb is that the value should occur at + * least 10 times in the sample. + * + * Suppose that N is the population size (total number of rows in the + * table), and n is the sample size, and that some particular candidate + * value appears x times in the sample. Then the "sample proportion" is + * given by p = x/n. + * + * It is reasonable to treat p as having a normal distribution, which + * then allows the margin of error to be analysed using standard + * techniques. We calculate the standard error of the sample proportion: + * + * SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1)) + * + * The second term is a finite population correction. There is a 95% + * probability that the total population proportion lies in the range + * + * [ pmin = p-2*SE, pmax = p+2*SE ] + * + * If there are Nd distinct values in the table, so that the average + * frequency of occurrence of any particular value is 1/Nd, then the test + *

### Re: MCV lists for highly skewed distributions

On 21 January 2018 at 07:26, John Naylorwrote: > I spent a few hours hacking on this, and it turns out calculating the > right number of MCVs taking into account both uniform and highly > non-uniform distributions is too delicate a problem for me to solve > right now. The logic suggested by Dean Rasheed in [1] always produces > no MCVs for a perfectly uniform distribution (which is good), but very > often also for other distributions, which is not good. My efforts to > tweak that didn't work, so I didn't get as far as adapting it for the > problem Jeff is trying to solve. Hmm, Tom suggested that the test based on the average frequency over all values might be too strict because the estimated number of distinct values is often too low, so that might explain what you're seeing. It occurs to me that maybe a better test to exclude a value from the MCV list would be to demand that its relative standard error not be too high. Such a test, in addition to the existing tests, might be sufficient to solve the opposite problem of too many values in the MCV list, because the real problem there is including a value after having seen relatively few occurrences of it in the sample, and thus having a wildly inaccurate estimate for it. Setting a bound on the relative standard error would mean that we could have a reasonable degree of confidence in estimates produced from the sample. Regards, Dean

### Re: MCV lists for highly skewed distributions

I wrote: >> I have a slight reservaton about whether 1.25x is still a sensible >> heuristic. > > This was also discussed in [1], but no patch came out of it. I was > just now turning the formulas discussed there into code, but I'll > defer to someone with more expertise. FWIW, I suspect that a solution > that doesn't take into account a metric like coefficient of variation > will have the wrong behavior sometimes, whether for highly uniform or > highly non-uniform distributions. I spent a few hours hacking on this, and it turns out calculating the right number of MCVs taking into account both uniform and highly non-uniform distributions is too delicate a problem for me to solve right now. The logic suggested by Dean Rasheed in [1] always produces no MCVs for a perfectly uniform distribution (which is good), but very often also for other distributions, which is not good. My efforts to tweak that didn't work, so I didn't get as far as adapting it for the problem Jeff is trying to solve. I have not been able to come up with a more compelling alternative, so I have nothing further to say about the patch under review. > [1] > https://www.postgresql.org/message-id/flat/32261.1496611829%40sss.pgh.pa.us#32261.1496611...@sss.pgh.pa.us -John Naylor

### Re: MCV lists for highly skewed distributions

I wrote: > FWIW, I suspect that a solution > that doesn't take into account a metric like coefficient of variation > will have the wrong behavior sometimes, whether for highly uniform or > highly non-uniform distributions. By this I meant the coefficient of variation of the class size in the sample, as denoted by gamma in the Haas and Stokes paper on page 7. -John Naylor

### Re: MCV lists for highly skewed distributions

>> [1] Occasionally it will store a much longer MCV list, because no values >> was >> sampled exactly once, which triggers a different code path in which all >> seen >> values are put in the MCV and the histogram is NULL. This is not >> reliable, >> as whether the least-sample value is present in the sample once or twice >> is >> pretty brittle. > > And we need a better discussion of risk: Before we generated too few > MCV entried. To what extent might me now generate too many? Which > would be a problem in increased planning time. (My apologies, I just now found some time to test this further, but I don't have additional results yet) Simon, Earlier, I referenced a thread [1] complaining that we currently already have too many MCVs in the case of uniform distributions, with worse consequences than planning time. Based on my (admittedly quick and dirty) preliminary testing (see attachment from a couple weeks ago), this patch exacerbates that problem, and I was hoping to find a way to fix that. > I have a slight reservaton about whether 1.25x is still a sensible > heuristic. This was also discussed in [1], but no patch came out of it. I was just now turning the formulas discussed there into code, but I'll defer to someone with more expertise. FWIW, I suspect that a solution that doesn't take into account a metric like coefficient of variation will have the wrong behavior sometimes, whether for highly uniform or highly non-uniform distributions. [1] https://www.postgresql.org/message-id/flat/32261.1496611829%40sss.pgh.pa.us#32261.1496611...@sss.pgh.pa.us -John Naylor

### Re: MCV lists for highly skewed distributions

On 28 December 2017 at 01:45, Jeff Janeswrote: > If we stored just a few more values, their inclusion in the MCV would mean > they are depleted from the residual count, correctly lowering the estimate > we would get for very rare values not included in the sample. I agree with this thought. > So instead of having the threshold of 1.25x the average frequency over all > values, I think we should use 1.25x the average frequency of only those > values not already included in the MCV, as in the attached. It looks like this might even have been the original intention of that code. Patch looks OK, but I think the comments need minor changes above line 2575 > As it is, you can partially overcome the too short MCV list by cranking up > the default statistics target, but this is a weak effect and comes at a high > cost of CPU time. In some of the real distributions I've looked at, > cranking up default statistics target is almost entirely ineffective. Agreed, not a good solution. > [1] Occasionally it will store a much longer MCV list, because no values was > sampled exactly once, which triggers a different code path in which all seen > values are put in the MCV and the histogram is NULL. This is not reliable, > as whether the least-sample value is present in the sample once or twice is > pretty brittle. And we need a better discussion of risk: Before we generated too few MCV entried. To what extent might me now generate too many? Which would be a problem in increased planning time. I have a slight reservaton about whether 1.25x is still a sensible heuristic. Should we add a parameter for that to allow testing during beta? Marking as Ready For Committer. -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

### Re: MCV lists for highly skewed distributions

I wrote: > I'll be travelling for a few days, but I'll do some testing on some > data sets soon. I've attached some preliminary test methods and results. Probably not detailed or rigorous enough, but I wanted to share this much before digging deeper. I created some test data, ran analyze a few times and eyeballed the results, of which I give a typical example. I used a low stat target to make it easier to see the general picture. Suggestions welcome. -John Naylor test_analyze_highly_skewed_v1.sql Description: application/sql

### Re: MCV lists for highly skewed distributions

I wrote: > On 12/28/17, Jeff Janeswrote: >> I think that perhaps maxmincount should also use the dynamic >> values_cnt_remaining rather than the static one. After all, things >> included in the MCV don' get represented in the histogram. When I've >> seen >> planning problems due to skewed distributions I also usually see >> redundant >> values in the histogram boundary list which I think should be in the MCV >> list instead. But I have not changed that here, pending discussion. > > I think this is also a good idea, but I haven't thought it through. If > you don't go this route, I would move this section back out of the > loop as well. I did some quick and dirty testing of this, and I just want to note that in this case, setting mincount to its hard-coded minimum must come after setting it to maxmincount, since now maxmincount can go arbitrarily low. I'll be travelling for a few days, but I'll do some testing on some data sets soon. While looking through the archives for more info, I saw this thread https://www.postgresql.org/message-id/32261.1496611829%40sss.pgh.pa.us which showcases the opposite problem: For more uniform distributions, there are too many MCVs. Not relevant to your problem, but if I have time I'll try my hand at testing an approach suggested in that thread at the same time I test your patch and see how it interacts. -John Naylor

### Re: MCV lists for highly skewed distributions

On 12/28/17, Jeff Janeswrote: > I want to revive a patch I sent couple years ago to the performance list, > as I have seen the issue pop up repeatedly since then. > If we stored just a few more values, their inclusion in the MCV would mean > they are depleted from the residual count, correctly lowering the estimate > we would get for very rare values not included in the sample. > > So instead of having the threshold of 1.25x the average frequency over all > values, I think we should use 1.25x the average frequency of only those > values not already included in the MCV, as in the attached. After reading the thread you mentioned, I think that's a good approach. On master, the numerator of avg_count is nonnull_cnt, but you've (indirectly) set it to values_cnt. You didn't mention it here, but I think you're right and master is wrong, since in this function only sortable values go through the MCV path. If I understand the code correctly, if there are enough too-wide values in the sample, they could bias the average and prevent normal values from being considered for the MCV list. Am I off base here? Anyway, since you're now overwriting ndistinct_table, I would rename it to ndistinct_remaining for clarity's sake. This part doesn't use any loop variables, so should happen before the loop: + if (num_mcv > track_cnt) + num_mcv = track_cnt; I think this comment /* estimate # of occurrences in sample of a typical nonnull value */ belongs just above the calculation of avg_count. > I think that perhaps maxmincount should also use the dynamic > values_cnt_remaining rather than the static one. After all, things > included in the MCV don' get represented in the histogram. When I've seen > planning problems due to skewed distributions I also usually see redundant > values in the histogram boundary list which I think should be in the MCV > list instead. But I have not changed that here, pending discussion. I think this is also a good idea, but I haven't thought it through. If you don't go this route, I would move this section back out of the loop as well. -John Naylor