subject:"stricter MCV tests for uniform distributions \(was Re\: MCV lists for highly skewed distributions\)"

Re: stricter MCV tests for uniform distributions (was Re: MCV lists for highly skewed distributions)

2018-01-22 Thread Dean Rasheed

On 22 January 2018 at 08:07, John Naylor  wrote:
> On 1/21/18, Dean Rasheed  wrote:
>> It occurs to me that maybe a better test to exclude a value from the
>> MCV list would be to demand that its relative standard error not be
>> too high. Such a test, in addition to the existing tests, might be
>> sufficient to solve the opposite problem of too many values in the MCV
>> list, because the real problem there is including a value after having
>> seen relatively few occurrences of it in the sample, and thus having a
>> wildly inaccurate estimate for it. Setting a bound on the relative
>> standard error would mean that we could have a reasonable degree of
>> confidence in estimates produced from the sample.
>
> If you don't mind, what would the math look like for that?

Using the same syntax as before:

N = Total rows in table (population size)
n = Number of rows sampled
x = Number of times a particular value appears in the sample
p = x/n = Frequency of the value in the sample

So that the standard error of the proportion is

SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))

Then the relative standard error (which is usually expressed as a percentage) is

RSE = 100 * SE / p

So if we were to demand that the relative standard error was less
than, say, 10%, then the constraint would just be

SE < 0.1 * p

Note:

* This formula not valid if x is very small (see what I wrote about
being able to approximate the distribution of p with a normal
distribution). So we should also enforce the "rule of thumb" x >= 10.

* The frequency p that we're storing is the count divided by the
overall sample size. So the values for N and n above should not be the
counts that exclude NULLs. As far as this logic is concerned, NULLs
(and too-wide values) are just values not equal to value under
consideration. Thus it appears, from just a quick glance at your
patch, that you're using the wrong values for N and n.

* The RSE is a monotonically decreasing function of p in the range
[0,1], so an upper bound on the RSE equates to a lower bound on the
number of occurrences of the value.

This last point gives me a different idea. Instead of applying this
test *in addition to* the existing tests, how about applying it
*instead of* the existing tests. That is, we keep all MCVs that appear
sufficiently often that we can be reasonably confident in the
estimates produced from their sample frequencies, regardless of
whether or not they are more common than average (the average being
fairly meaningless for a highly skewed distribution anyway).

This is still keeping the most common values, but now we'd be saying
that we keep any value that appears sufficiently often in the sample
that we can extrapolate its sample frequency to produce a reasonably
accurate estimate of the population frequency, and discard values for
which the estimate is likely to be inaccurate.

I have not tested this idea at all, but it seems like it might be
worth considering. It has the nice property that the test depends
entirely on how often the value appears, rather than on other
previously computed statistics, such as Ndistinct.

Doing a quick test in pure SQL, using the highly skewed distribution
Jeff Janes gave in the other thread, with the default sample size of
30,000:

with samples as (
  select floor(-log(random())/log(2))::int  as who
  from generate_series(1,3)
), freqs as (
  select who, count(*) as x, count(*)/3::float8 as p
  from samples group by who
), stats as (
  select *, sqrt(p*(1-p)/3) *
sqrt((1000-3)::float8/(1000-1)) as se
  from freqs
)
select *, (1000*p)::int||'+/-'||(2*se*1000)::int as "95% interval",
   case when x >=10 and se < 0.1*p then 'KEEP' else 'DISCARD' end
from stats order by p desc limit 100;

it pretty consistently keeps the 8 most common values:

 who |   x   |  p   |  se  |  95%
interval   |  case
-+---+--+--+-+-
   0 | 15017 |0.5005667 |  0.00288241625942075 |
5005667+/-57648 | KEEP
   1 |  7607 |0.2535667 |  0.00250800597590887 |
2535667+/-50160 | KEEP
   2 |  3713 |0.1237667 | 0.0018984483 |
1237667+/-37969 | KEEP
   3 |  1855 |   0.0618 |   0.0013884757600711 |
618333+/-27770  | KEEP
   4 |   914 |   0.03046667 | 0.000990788179299791 |
304667+/-19816  | KEEP
   5 |   448 |   0.0149 | 0.000699194759916533 |
149333+/-13984  | KEEP
   6 |   229 |  0.00763 | 0.000501741670388358 |
76333+/-10035   | KEEP
   7 |   108 |   0.0036 | 0.000345267009604061 |
36000+/-6905| KEEP
   8 |46 |  0.00153 | 0.000225565173744715 |
15333+/-4511| DISCARD
   9 |34 |  0.00113 | 0.000193963300230354 |
11333+/-3879| DISCARD
  10 |17 | 0.000567 | 0.000137191663419411 |
5667+/-2744 | DISCARD
  11 |11 | 0.000367 | 0.000110367969704201 |
3667+/-2207 | D

stricter MCV tests for uniform distributions (was Re: MCV lists for highly skewed distributions)

2018-01-22 Thread John Naylor

(Starting a new thread so as not to distract review)

On 1/21/18, Dean Rasheed  wrote:
> On 21 January 2018 at 07:26, John Naylor  wrote:
>> I spent a few hours hacking on this, and it turns out calculating the
>> right number of MCVs taking into account both uniform and highly
>> non-uniform distributions is too delicate a problem for me to solve
>> right now. The logic suggested by Dean Rasheed in [1] always produces
>> no MCVs for a perfectly uniform distribution (which is good), but very
>> often also for other distributions, which is not good. My efforts to
>> tweak that didn't work, so I didn't get as far as adapting it for the
>> problem Jeff is trying to solve.
>
> Hmm, Tom suggested that the test based on the average frequency over
> all values might be too strict because the estimated number of
> distinct values is often too low, so that might explain what you're
> seeing.

In my test tables, I've noticed that our Ndistinct estimator is most
inaccurate for geometric distributions, so that's certainly possible,
but confusingly, it occasionally gave an empty MCV list along with a
histogram with a boundary duplicated 5 times, which I thought I was
guarding against. I'm thinking my implementation of your logic is
flawed somehow. In case you're curious I've attached my rough
(complier warnings and all) test patch.

> It occurs to me that maybe a better test to exclude a value from the
> MCV list would be to demand that its relative standard error not be
> too high. Such a test, in addition to the existing tests, might be
> sufficient to solve the opposite problem of too many values in the MCV
> list, because the real problem there is including a value after having
> seen relatively few occurrences of it in the sample, and thus having a
> wildly inaccurate estimate for it. Setting a bound on the relative
> standard error would mean that we could have a reasonable degree of
> confidence in estimates produced from the sample.

If you don't mind, what would the math look like for that?

-John Naylor
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 5f21fcb..da21333 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2318,6 +2318,8 @@ compute_scalar_stats(VacAttrStatsP stats,
 	int			num_mcv = stats->attr->attstattarget;
 	int			num_bins = stats->attr->attstattarget;
 	StdAnalyzeData *mystats = (StdAnalyzeData *) stats->extra_data;
+	double		N,
+n;
 
 	values = (ScalarItem *) palloc(samplerows * sizeof(ScalarItem));
 	tupnoLink = (int *) palloc(samplerows * sizeof(int));
@@ -2525,10 +2527,10 @@ compute_scalar_stats(VacAttrStatsP stats,
 			 */
 			int			f1 = ndistinct - nmultiple + toowide_cnt;
 			int			d = f1 + nmultiple;
-			double		n = samplerows - null_cnt;
-			double		N = totalrows * (1.0 - stats->stanullfrac);
 			double		stadistinct;
 
+			n = samplerows - null_cnt;
+			N = totalrows * (1.0 - stats->stanullfrac);
 			/* N == 0 shouldn't happen, but just in case ... */
 			if (N > 0)
 stadistinct = (n * d) / ((n - f1) + f1 * n / N);
@@ -2558,9 +2560,44 @@ compute_scalar_stats(VacAttrStatsP stats,
 		 * we are able to generate a complete MCV list (all the values in the
 		 * sample will fit, and we think these are all the ones in the table),
 		 * then do so.  Otherwise, store only those values that are
-		 * significantly more common than the (estimated) average. We set the
-		 * threshold rather arbitrarily at 25% more than average, with at
-		 * least 2 instances in the sample.  Also, we won't suppress values
+		 * significantly more common than the (estimated) average.
+		 *
+		 * Note: For this POC patch, the implementation and comments
+		 * were copied from an email from Dean Rasheed, which contains further references:
+		 * https://www.postgresql.org/message-id/CAEZATCVu9zK0N%3Dnd9ufavabbM8YZiyWYJca0oiE8F31GAY%2B_XA%40mail.gmail.com
+		 *
+		 * We calculate the threshold from the table and sample sizes.
+
+		 * The initial rule of thumb is that the value should occur at
+		 * least 10 times in the sample.
+		 *
+		 * Suppose that N is the population size (total number of rows in the
+		 * table), and n is the sample size, and that some particular candidate
+		 * value appears x times in the sample. Then the "sample proportion" is
+		 * given by p = x/n.
+		 *
+		 * It is reasonable to treat p as having a normal distribution, which
+		 * then allows the margin of error to be analysed using standard
+		 * techniques. We calculate the standard error of the sample proportion:
+		 *
+		 * SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))
+		 *
+		 * The second term is a finite population correction. There is a 95%
+		 * probability that the total population proportion lies in the range
+		 *
+		 * [ pmin = p-2*SE, pmax = p+2*SE ]
+		 *
+		 * If there are Nd distinct values in the table, so that the average
+		 * frequency of occurrence of any particular value is 1/Nd, then the test
+		 *
+		 * pmin > 1/Nd
+		 *
+		 * would imply that

Re: stricter MCV tests for uniform distributions (was Re: MCV lists for highly skewed distributions)

stricter MCV tests for uniform distributions (was Re: MCV lists for highly skewed distributions)

2 matches

Site Navigation

Mail list logo

Footer information