Re: MCV lists for highly skewed distributions

2018-03-22 Thread Dean Rasheed
On 19 March 2018 at 16:59, John Naylor  wrote:
> On 3/19/18, Dean Rasheed  wrote:
>> As promised, here is a new patch, with comment updates, per John and
>> Tomas' suggestions, plus the continuity correction, which seemed just
>> about worthwhile.
>
> Great. I'm happy with the behavior of the patch. I've marked it ready
> for committer.
>

Thanks for testing. I also did more testing with tables 10-20 times as
large and I was happy with the results, so I have pushed this.

Regards,
Dean



Re: MCV lists for highly skewed distributions

2018-03-19 Thread John Naylor
On 3/19/18, Dean Rasheed  wrote:
> As promised, here is a new patch, with comment updates, per John and
> Tomas' suggestions, plus the continuity correction, which seemed just
> about worthwhile.

Great. I'm happy with the behavior of the patch. I've marked it ready
for committer.

> I repeated these tests with a 2-SD continuity-corrected threshold and
> the results fell somewhere between the 2-SD and 2.5-SD cases. My
> inclination is to go with that, as something that strikes a reasonable
> balance, and represents a distinct improvement over master in a number
> of different areas.

I also ran some tests with a hand-edited continuity correction last
night, but just now got around to looking at the results (queries
attached). I ran largely as before, but did 20 analyze runs with each
file instead of 10, using the V2 patch, the CC patch, and the V2 patch
with 2.5 threshold. I took out a bunch of the uniform tests to save
time since they largely behave the same.

-
Non-uniform:

To reduce noise for analysis, I tried filtering out the least and
greatest distinct value before taking the average for the
underestimation ratio for most common non-MCV. I also removed results
where all 3 patches had a full MCV list. There was still enough noise
that it's impossible to discern differences between patches this
similar. Unlike the earlier comparison against HEAD, there were no
clear differences in this test, so I won't post those numbers.

Looking at the number of MCVs, it's a little clearer. With few
exceptions, the number either stays the same or decreases slightly
going left to right:

                   title                  | v20_num_mcvs | cc05_num_mcvs | v25_num_mcvs
------------------------------------------+--------------+---------------+--------------
 Exp. dist. (N=500k, sd=0.25 (beta))      |            3 |             3 |            3
 Exp. dist. (N=500k, sd=0.50 (beta))      |            5 |             6 |            5
 Exp. dist. (N=500k, sd=1.00 (beta))      |            9 |             9 |            9
 Exp. dist. (N=500k, sd=2.00 (beta))      |           15 |            15 |           15
 Exp. dist. (N=500k, sd=3.00 (beta))      |           22 |            21 |           21
 Exp. dist. (N=500k, sd=4.00 (beta))      |           27 |            27 |           26
 Exp. dist. (N=500k, sd=5.00 (beta))      |           33 |            32 |           30
 Exp. dist. (N=500k, sd=10.00 (beta))     |           60 |            58 |           56
 Exp. dist. (N=500k, sd=20.00 (beta))     |          100 |            99 |           97
 Laplace dist. (N=500k, b=0.25 (lambda))  |            5 |             6 |            5
 Laplace dist. (N=500k, b=0.50 (lambda))  |            9 |             8 |            7
 Laplace dist. (N=500k, b=1.00 (lambda))  |           15 |            15 |           15
 Laplace dist. (N=500k, b=2.00 (lambda))  |           27 |            26 |           26
 Laplace dist. (N=500k, b=3.00 (lambda))  |           38 |            37 |           36
 Laplace dist. (N=500k, b=4.00 (lambda))  |           48 |            47 |           45
 Laplace dist. (N=500k, b=5.00 (lambda))  |           58 |            57 |           55
 Laplace dist. (N=500k, b=10.00 (lambda)) |          100 |            99 |           97
 MNM (N=2000, peaks=20, sep=20, sd=15)    |          100 |           100 |           62
 Normal dist. (N=500k, sd=1)              |            7 |             7 |            7
 Normal dist. (N=500k, sd=2)              |           15 |            15 |           14
 Normal dist. (N=500k, sd=3)              |           21 |            21 |           20
 Normal dist. (N=500k, sd=4)              |           28 |            26 |           26
 Normal dist. (N=500k, sd=5)              |           33 |            33 |           33
 Normal dist. (N=500k, sd=6)              |           39 |            38 |           38
 Normal dist. (N=500k, sd=7)              |           45 |            45 |           44
 Normal dist. (N=500k, sd=8)              |           51 |            49 |           49
 Normal dist. (N=500k, sd=9)              |           56 |            56 |           54
 Normal dist. (N=500k, sd=10)             |           63 |            62 |           60
 Normal dist. (N=500k, sd=15)             |           89 |            89 |           86
 Pareto dist. (N=500, a=1.00)             |           70 |            65 |           56
 Pareto dist. (N=500, a=2.00)             |           20 |            19 |           17
 Pareto dist. (N=500, a=3.00)             |           10 |             9 |            9
 Pareto dist. (N=500, a=4.00)             |            6 |             6 |            6
 Pareto dist. (N=500, a=5.00)             |            5 |             5 |            4
(34 rows)


-
Uniform:

Ideally, we want an empty MCV list unless we're sampling a large
portion of the table. As Dean mentioned, this is not as important as
getting the non-uniform case right. That said, it's nice to be
confident we can keep out flotsam. The number generally gets smaller
as we go left to right.

                 title                 | v20_num_mcvs | cc05_num_mcvs | v25_num_mcvs
---------------------------------------+--------------+---------------+--------------
 Corr. unif. dist. (N=1000k, Nd=1000)  |   13 | 

Re: MCV lists for highly skewed distributions

2018-03-19 Thread Dean Rasheed
On 18 March 2018 at 22:52, Dean Rasheed  wrote:
> I'll post something tomorrow, once I've finished wordsmithing the comments.
>

As promised, here is a new patch, with comment updates, per John and
Tomas' suggestions, plus the continuity correction, which seemed just
about worthwhile.

Regards,
Dean


mcv-stats-v2cc.patch
Description: Binary data


Re: MCV lists for highly skewed distributions

2018-03-18 Thread Dean Rasheed
On 18 March 2018 at 12:24, John Naylor  wrote:
> Over the weekend I tried out a test to measure:
> -The size of the MCV list
> -The ratio between actual and estimated cardinality of the least
> common value in the MCV list.
> -The ratio between actual and estimated cardinality of the most common
> value not in the MCV list.

Nice tests!


> Summary for non-uniform distributions:
> Note: There's a lot of variation in the estimation of the most common
> non-MCV. About 1% of all ANALYZE runs apparently failed to sample one
> of the good candidates for the MCV list, resulting in huge
> underestimates. It didn't matter what patch was applied. For future
> tests, I'll do more runs and measure a truncated mean. Looking at the
> number of MCVs is much more stable, but unlike the uniform
> distribution case, we don't have an intuitive feel for what that
> number should be. That said,
>
> -The V2 patch is somewhat better than RSE and much better than master
> at estimating the most common value not in the MCV list.

Yes, that matches my observations too. It stems from the fact that the
V2 patch tends to produce more MCVs than the RSE patch (which in turn
tends to produce more MCVs than master) for non-uniform distributions.
More MCVs then improve the estimates for non-MCVs.


> -Bumping the sample stddev threshold to 3 seems to behave similarly to
> the RSE patch, maybe slightly better.

In a way, that's probably not surprising. The two algorithms are
closely related. If the non-MCV frequencies are quite low, the v2
patch with a 3-SD threshold behaves like the RSE patch with a 33% RSE
cutoff. The 20% RSE cutoff patch is mathematically equivalent to the
v2 patch with a 5-SD threshold and the non-MCV frequencies all set to
zero, if that makes sense.
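For what it's worth, the equivalence Dean describes is easy to check
numerically. Here is an illustrative sketch (my own helper functions,
not PostgreSQL code): with the non-MCV frequencies treated as zero, a
test of "count > K * stddev" is the same as an RSE cutoff of 1/K,
because RSE = stddev / count.

```python
def passes_sd_threshold(count, stddev, k):
    """v2-style test with the non-MCV selectivity assumed to be zero."""
    return count > k * stddev

def passes_rse_cutoff(count, stddev, max_rse):
    """RSE-patch-style test: relative standard error below the cutoff."""
    return stddev / count < max_rse

# A 5-SD threshold and a 20% RSE cutoff accept exactly the same counts,
# since count > 5 * stddev  <=>  stddev / count < 0.20:
for count, stddev in [(10, 1.5), (10, 2.5), (50, 9.0), (50, 11.0)]:
    assert passes_sd_threshold(count, stddev, 5) == \
           passes_rse_cutoff(count, stddev, 0.20)
```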


> Summary for uniform distributions:
> Note 1: Using the average is not really adequate for testing
> under/overestimation of values in my test setup, since I set the
> estimation ratio to zero if there was either an empty MCV list or a
> completely full list. In future runs, I'll drill down a bit more
> precisely.
> That said, there didn't seem to be a huge amount of variation between
> the patches in either of the ratios I was testing. Looking at the
> number of MCVs is much better anyway, since we know we want either
> none or everything.
>
> Note 2: All patches fully populated the list when all the values fit
> in the MCV list. For the rest of the cases:
>
> -The RSE patch is not much better than master at excluding MCVs.

To make sense of that, you need to look at the errors in the
estimates. The RSE patch won't exclude MCVs if it thinks they will
give reasonably accurate estimates, so may produce fully populated MCV
lists for uniform distributions even when all the distinct values
don't fit. That's not ideal, but it isn't necessarily bad. In my
previous testing I found that it was good at excluding MCVs in cases
where they gave inaccurate estimates.


> -The V2 patch is much better than either of these, but still not ideal
> (up to 30)
> -The V2 patch with a cutoff of 2.5 severely cut down the MCVs (at most 3).
> -With a cutoff of 3.0, all are empty.

Makes sense, but I don't think we should necessarily put too much
emphasis on the uniform tests. Real-world datasets tend to be
non-uniform, and the consequences of too few MCVs tend to be worse
than too many.

I repeated these tests with a 2-SD continuity-corrected threshold and
the results fell somewhere between the 2-SD and 2.5-SD cases. My
inclination is to go with that, as something that strikes a reasonable
balance, and represents a distinct improvement over master in a number
of different areas.

I'll post something tomorrow, once I've finished wordsmithing the comments.

Regards,
Dean



Re: MCV lists for highly skewed distributions

2018-03-18 Thread John Naylor
I wrote:

> Looks good. I'll run some tests of my own this week, trying to find
> corner cases different from yours.

Over the weekend I tried out a test to measure:
-The size of the MCV list
-The ratio between actual and estimated cardinality of the least
common value in the MCV list.
-The ratio between actual and estimated cardinality of the most common
value not in the MCV list.

The script, results, and some starter queries are attached. I ran a
bunch more queries to drill down in certain areas, but they're all
based on the ones here.

I ran against master, the RSE patch, and 3 different variations of the
V2 patch (with stddev cutoff of 2, 2.5, and 3.0). For each file, I
loaded into each DB, ran ANALYZE 10 times, and took the average for
each of the three metrics above. I only tested with the default stats
target.

For this test, I used the same distributions as Dean's original
script, but I whacked around the script to enable inserting results
into a table.

--
Summary for non-uniform distributions:
Note: There's a lot of variation in the estimation of the most common
non-MCV. About 1% of all ANALYZE runs apparently failed to sample one
of the good candidates for the MCV list, resulting in huge
underestimates. It didn't matter what patch was applied. For future
tests, I'll do more runs and measure a truncated mean. Looking at the
number of MCVs is much more stable, but unlike the uniform
distribution case, we don't have an intuitive feel for what that
number should be. That said,

-The V2 patch is somewhat better than RSE and much better than master
at estimating the most common value not in the MCV list.
-Bumping the sample stddev threshold to 3 seems to behave similarly to
the RSE patch, maybe slightly better.

--
Summary for uniform distributions:
Note 1: Using the average is not really adequate for testing
under/overestimation of values in my test setup, since I set the
estimation ratio to zero if there was either an empty MCV list or a
completely full list. In future runs, I'll drill down a bit more
precisely.
That said, there didn't seem to be a huge amount of variation between
the patches in either of the ratios I was testing. Looking at the
number of MCVs is much better anyway, since we know we want either
none or everything.

Note 2: All patches fully populated the list when all the values fit
in the MCV list. For the rest of the cases:

-The RSE patch is not much better than master at excluding MCVs.
-The V2 patch is much better than either of these, but still not ideal
(up to 30)
-The V2 patch with a cutoff of 2.5 severely cut down the MCVs (at most 3).
-With a cutoff of 3.0, all are empty.

--
Here's one interesting case. It's a normal distribution, but highly
dispersed (N=500000, sd=1000), so it resembles a uniform one. Looking
at the number of MCVs, it jumps around a lot between patches, but the
estimates don't vary much:

Master: 100
RSE: 1
V2: 100
V2 with 2.5 cutoff: 100
V2 with 3.0 cutoff: 32


--
Thoughts and future testing:
I think the V2 patch strikes a good balance.
I'm partial to the V2 patch with the 2.5 cutoff in sample standard
deviation, since it's much more aggressive about excluding values for
uniform distributions. It's almost as good as V2 at including values
in non-uniform distributions, but possibly worse enough that it's not
the best choice.

Time is running out, but some things I'd like to look at:

-As mentioned above, better noise reduction in my data analysis.
-The continuity-corrected Wald interval Dean mentioned [1]
-I wonder if we can actually untie the max size of the MCV list from
the stats target, and use a hard-coded maximum like 1000. That might
allow us to bump the max stat target without hurting planning time.
Perhaps next cycle.


-John Naylor

[1] 
https://www.postgresql.org/message-id/CAEZATCVHEEg%2BCrP%2B-0JUsVeNPu5rs_S23oJVeH4VF%3DfgDwhfLQ%40mail.gmail.com
#!/usr/bin/python

import datetime
import math
import numpy
import psycopg2
import random
import re
import sys

HEAD_PORT = 5430
RSE_PORT = 5431
SSD_20_PORT = 5432
SSD_25_PORT = 5425
SSD_30_PORT = 5433

TEST_DATA_FILE = '/tmp/mcv-test-data'
NUMBER_OF_ANALYSE_TESTS = 10

conn0 = psycopg2.connect('port=%d dbname=john user=john' % HEAD_PORT)
conn0.set_session(autocommit = True)
conn1 = psycopg2.connect('port=%d dbname=john user=john' % RSE_PORT)
conn1.set_session(autocommit = True)
conn20 = psycopg2.connect('port=%d dbname=john user=john' % SSD_20_PORT)
conn20.set_session(autocommit = True)
conn25 = psycopg2.connect('port=%d dbname=john user=john' % SSD_25_PORT)
conn25.set_session(autocommit = True)
conn30 = psycopg2.connect('port=%d dbname=john user=john' % SSD_30_PORT)
conn30.set_session(autocommit = True)

branches = {}
branches['master'] = conn0.cursor()
branches['RSE'] = conn1.cursor()
branches['V2_20'] = conn20.cursor()
branches['V2_25'] = conn25.cursor()
branches['V2_30'] = conn30.cursor()

# Store test results in the master DB.
result_store = branches['master']

# Create one results 

Re: MCV lists for highly skewed distributions

2018-03-17 Thread Tomas Vondra
On 03/17/2018 08:32 PM, Dean Rasheed wrote:
> On 17 March 2018 at 18:40, Tomas Vondra  wrote:
>> Currently, analyze_mcv_list only checks if the frequency of the
>> current item is significantly higher than the non-MCV selectivity.
>> My question is if it shouldn't also consider if removing the item
>> from MCV would not increase the non-MCV selectivity too much.
>>
> 
> Oh, I see what you're saying. In theory, each MCV item we remove is 
> not significantly more common than the non-MCV items at that point,
> so removing it shouldn't significantly increase the non-MCV
> selectivity. It's possible the cumulative effect of removing multiple
> items might start to add up, but I think it would necessarily be a
> slow effect, and I think it would keep getting slower and slower as
> more items are removed -- isn't this equivalent to constructing a
> sequence of numbers where each number is a little greater than the
> average of all the preceding numbers, and ends up virtually
> flat-lining.
> 

Yes, I think you're right. Another thing that occurred to me is that
we're pretty much guaranteed to misestimate things at the tail end, and
in my experience under-estimates have far worse consequences.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: MCV lists for highly skewed distributions

2018-03-17 Thread Dean Rasheed
On 17 March 2018 at 18:40, Tomas Vondra  wrote:
> Currently, analyze_mcv_list only checks if the frequency of the current
> item is significantly higher than the non-MCV selectivity. My question
> is if it shouldn't also consider if removing the item from MCV would not
> increase the non-MCV selectivity too much.
>

Oh, I see what you're saying. In theory, each MCV item we remove is
not significantly more common than the non-MCV items at that point, so
removing it shouldn't significantly increase the non-MCV selectivity.
It's possible the cumulative effect of removing multiple items might
start to add up, but I think it would necessarily be a slow effect,
and I think it would keep getting slower and slower as more items are
removed -- isn't this equivalent to constructing a sequence of numbers
where each number is a little greater than the average of all the
preceding numbers, and ends up virtually flat-lining.

Regards,
Dean



Re: MCV lists for highly skewed distributions

2018-03-17 Thread Tomas Vondra


On 03/17/2018 07:28 PM, Dean Rasheed wrote:
> On 16 March 2018 at 15:26, Tomas Vondra  wrote:
>> Actually, one question - when deciding whether to keep the item in the
>> MCV list, analyze_mcv_list only compares it's frequency with an average
>> of the rest. But as we're removing items from the MCV list, the average
>> frequency of the non-MCV items is growing (we're removing items with
>> higher and higher frequencies). That means the estimates for the least
>> common items will get higher and higher - essentially overestimates. So,
>> couldn't/shouldn't analyze_mcv_list consider this too?
>>
> 
> Yes, that's the intention. At the start, sumcount is set to the count
> of all but the last (least common) MCV item, so it can estimate the
> frequency of the non-MCV items if the last MCV item were to be
> removed. Then each time through the loop, sumcount is decreased by the
> removed item's count, increasing the estimated frequency of the
> non-MCV items accordingly.
> 

I know it's updated like that, but that's not quite what I meant. Sorry
for not making the question clearer, so let me rephrase it.

Currently, analyze_mcv_list only checks if the frequency of the current
item is significantly higher than the non-MCV selectivity. My question
is if it shouldn't also consider if removing the item from MCV would not
increase the non-MCV selectivity too much.

But perhaps that would be a silly idea, because the non-MCV items may
also be seen as a random noise.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: MCV lists for highly skewed distributions

2018-03-17 Thread Dean Rasheed
On 16 March 2018 at 15:26, Tomas Vondra  wrote:
> Actually, one question - when deciding whether to keep the item in the
> MCV list, analyze_mcv_list only compares it's frequency with an average
> of the rest. But as we're removing items from the MCV list, the average
> frequency of the non-MCV items is growing (we're removing items with
> higher and higher frequencies). That means the estimates for the least
> common items will get higher and higher - essentially overestimates. So,
> couldn't/shouldn't analyze_mcv_list consider this too?
>

Yes, that's the intention. At the start, sumcount is set to the count
of all but the last (least common) MCV item, so it can estimate the
frequency of the non-MCV items if the last MCV item were to be
removed. Then each time through the loop, sumcount is decreased by the
removed item's count, increasing the estimated frequency of the
non-MCV items accordingly.
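A rough Python sketch of that loop, as I understand it from the
description above. The variable names and the exact error term are
mine, not the C code in analyze_mcv_list(); it assumes counts sorted in
descending order and ndistinct >= the number of candidate MCVs.

```python
import math

def prune_mcv_list(mcv_counts, samplerows, ndistinct, stddev_threshold=2.0):
    """Walk the candidate MCV list from least to most common, dropping
    items that are not significantly more common than the estimated
    average frequency of the remaining non-MCV values."""
    num_mcv = len(mcv_counts)          # counts sorted descending
    sumcount = sum(mcv_counts[:-1])    # all but the least common item
    while num_mcv > 0:
        # Average count a non-MCV value would get if the last item is removed.
        other_distinct = ndistinct - (num_mcv - 1)
        avg_other = (samplerows - sumcount) / other_distinct
        # Crude stand-in for the sampling-error term; the real test uses a
        # hypergeometric standard deviation here.
        stddev = math.sqrt(avg_other)
        if mcv_counts[num_mcv - 1] > avg_other + stddev_threshold * stddev:
            break                      # this and all more common items stay
        num_mcv -= 1                   # drop it and retest the next item
        if num_mcv > 0:
            sumcount -= mcv_counts[num_mcv - 1]
    return mcv_counts[:num_mcv]

# Two clearly common values survive; the two borderline ones are dropped.
print(prune_mcv_list([300, 200, 5, 4], samplerows=1000, ndistinct=100))
```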

Regards,
Dean



Re: MCV lists for highly skewed distributions

2018-03-17 Thread Dean Rasheed
On 17 March 2018 at 17:16, Dean Rasheed  wrote:
> Using the calculator above, you can see that the distribution is
> fairly normal-like, but with a noticeable positive skew. The 2-stddev
> interval is 0.6 to 9.4, and according to the calculator the
> probability of the value being less than or equal to 1 is in the
> ballpark of the 2.5% figure expected. So even with just 5 occurrences
> in the sample, it's fairly close to a normal distribution.
>

One thing this does illustrate is that the hypergeometric distribution
is a discrete distribution and there can be quite large jumps in the
probability from one value to the next, so care may be needed when
approximating it with a continuous distribution. The standard
technique used to handle this is to apply what is known as a
continuity correction factor. Suppose that X is a random variable with
a discrete hypergeometric distribution, and Y is a continuous normal
distribution, with the same mean and variance, being used to
approximate X. Then P(X>i) for some integer i is the same as
P(X>=i+1), because X can only be integer-valued. The idea is then that
you can use P(Y>i+0.5) to get a fairly good approximation to P(X>i).
That would correspond to adding 0.5 to the right-hand side of the
current test, i.e.,

if (mcv_counts[num_mcv - 1] > selec * samplerows + 2 * stddev + 0.5)
=> Common enough to be included in MCV-list

A lot of the time that won't make much difference, except when dealing
with the smaller counts at the tail end of the MCV list, where it
might help avoid the too-many-mcvs problem, so I think it's worth
trying out.
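The corrected test can be sketched in Python as follows (my own helper
names, not the ones in analyze.c; "selec" is the selectivity a non-MCV
value would be assigned):

```python
import math

def keep_in_mcv_list(count, selec, samplerows, totalrows):
    """Keep the item if its sample count exceeds the count expected from
    the non-MCV selectivity by more than 2 hypergeometric standard
    deviations plus the 0.5 continuity correction."""
    K = selec * totalrows                 # hypothesised table-wide count
    N, n = totalrows, samplerows
    variance = (K / N) * (1 - K / N) * n * (N - n) / (N - 1)
    stddev = math.sqrt(variance)
    return count > selec * samplerows + 2 * stddev + 0.5

# Illustrative numbers: a value seen 10 times in a 1% sample, versus a
# non-MCV selectivity that would predict only ~1 occurrence.
print(keep_in_mcv_list(10, selec=0.0001, samplerows=10000,
                       totalrows=1000000))  # True
```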

Apparently, in textbooks, an interval like the mean +/- 2*stddev is
known as a Wald confidence interval, and the mean +/- (2*stddev + 0.5)
is the continuity-corrected Wald interval, so it's probably worth
mentioning that in the comments. They are generally regarded as quite
crude approximations for hypergeometric distributions, and there's
quite a bit of research around establishing more accurate confidence
intervals for this kind of distribution, but the formulae involved
tend to be quite complicated, whereas the Wald interval has the
advantage of being very simple. In this context, I don't think we need
to establish a particularly accurate confidence interval. We just want
to be able to say that the value is probably more common than the
non-mcv values, without being too rigorous about what "probably"
means, as long as it works in practice to discount values that just
happen to be a bit more common in the sample, but aren't actually more
common in the table as a whole.

Regards,
Dean



Re: MCV lists for highly skewed distributions

2018-03-17 Thread Dean Rasheed
On 13 March 2018 at 08:39, John Naylor  wrote:
>> Also, this is common enough that in fact that distribution
>> can be reasonably approximated by a normal distribution.
>
> For the archives, because it's typically seen 10 times in the sample,
> per the rule of thumb mentioned upthread?
>

Actually, I think that the rule of thumb of at least 10 instances in
the sample for a normal approximation is probably too strict.

Looking around, values as low as 5 seem to be quite commonly used. Of
course there is no right answer here, it depends on what you're doing
with it, and how close you want the normal distribution to be to the
hypergeometric one. For our purposes, I don't think we really need it
to be that close. We just want to be able to justify statements like
"the value is *probably* more common than X, and it was unlikely that
that was just random sampling error".


>> Making use of
>> the non-MCV-based estimate allows us to do better -- if the MCV-based
>> estimate is statistically significantly higher than the non-MCV-based
>> estimate, then it would almost certainly be better to include that
>> value in the MCV list, even if its spread is quite large.
>
> Because we can treat the sample as normal, testing for > 2 standard
> deviations means that we're "~95% sure" it's more common in the table
> than the non-MCV selectivity would suggest, right? (I realize I'm not
> using technically accurate language)
>

Actually, it's more like 97.5% (remember the normal distribution is
symmetric, so there's around 2.5% below the 2-stddev interval, around
95% in it, and another 2.5% above it), but that's just nit-picking.

There are several nice online calculators for playing around with
hypergeometric distributions, such as
http://keisan.casio.com/exec/system/1180573201

Consider this distribution:

N = 1000000
n = 10000
K = 500
Mean = n * K / N = 5
Variance = (K / N) * (1 - K / N) * n * (N - n) / (N - 1) = 4.9
Stddev = sqrt(Variance) = 2.2

Using the calculator above, you can see that the distribution is
fairly normal-like, but with a noticeable positive skew. The 2-stddev
interval is 0.6 to 9.4, and according to the calculator the
probability of the value being less than or equal to 1 is in the
ballpark of the 2.5% figure expected. So even with just 5 occurrences
in the sample, it's fairly close to a normal distribution.
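A quick numeric check of the example above. I take N=1000000, n=10000,
K=500 as the parameters, since they reproduce the quoted mean of 5 and
stddev of 2.2 (the archived message garbles the digits, so treat that
as an inference):

```python
import math

# Standard hypergeometric mean/variance for sampling n rows without
# replacement from N, where K rows carry the value of interest.
N, n, K = 1000000, 10000, 500
mean = n * K / N
variance = (K / N) * (1 - K / N) * n * (N - n) / (N - 1)
stddev = math.sqrt(variance)
print(round(mean, 1), round(variance, 1), round(stddev, 1))  # 5.0 4.9 2.2
```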

Regards,
Dean



Re: MCV lists for highly skewed distributions

2018-03-17 Thread Dean Rasheed
> On 03/11/2018 10:23 AM, Dean Rasheed wrote:
>> I'm moving this back to a status of "Needs review" partly because the
>> code has changed significantly, but also because I want to do more
>> testing, particularly with larger datasets.
>>

John, Tomas,

Thanks for looking at this, and sorry for my delayed reply. The
demands of my day job keep getting in the way. I'm feeling pretty good
about this patch though. There may still be some fine tuning to do,
but it's looking like a definite improvement over HEAD, so I'm
optimistic that I'll be able to finish it soonish.

I'll post back more detailed responses to your specific points shortly.

Regards,
Dean



Re: MCV lists for highly skewed distributions

2018-03-16 Thread Tomas Vondra
On 03/11/2018 10:23 AM, Dean Rasheed wrote:
> ...
>
> I'm moving this back to a status of "Needs review" partly because the
> code has changed significantly, but also because I want to do more
> testing, particularly with larger datasets.
> 

Thanks for working on this. The code seems fine to me, although it might
be a good idea to add comments explaining why analyze_mcv_list() starts
with full MCV lists and then removes items (essentially the explanation
why Jeff's original idea would not work so well).

Actually, one question - when deciding whether to keep the item in the
MCV list, analyze_mcv_list only compares it's frequency with an average
of the rest. But as we're removing items from the MCV list, the average
frequency of the non-MCV items is growing (we're removing items with
higher and higher frequencies). That means the estimates for the least
common items will get higher and higher - essentially overestimates. So,
couldn't/shouldn't analyze_mcv_list consider this too?

I've also done a bunch of testing regarding join cardinality estimates,
because eqjoinsel_inner() is one of the places using MCV lists too. So
I've generated tables with different sizes and data distributions, and
observed how the patch affects the join estimates.

The scripts and results are available here:

   https://github.com/tvondra/join-estimates-tests

The tables were not particularly huge at this point (1000 to 100k rows),
I've also varied number of distinct values (100 - 10k), statistics
target (10 - 10k) and distribution (for each table independently):

1) uniform

insert into t1 select random() * $ndistinct1, i
  from generate_series(1,$nrows1) s(i)

2) skewed

insert into t2 select (1 - pow(random(),2)) * $ndistinct2, i
  from generate_series(1,$nrows2) s(i)

3) skewed-inverse

insert into t2 select (1 - pow(random(),2)) * $ndistinct2, i
  from generate_series(1,$nrows2) s(i)

4) skewed

insert into t2 select (1 - pow(random(),4)) * $ndistinct2, i
  from generate_series(1,$nrows2) s(i)

5) skewed-strong-inverse

insert into t2 select (1 - pow(random(),4)) * $ndistinct2, i
  from generate_series(1,$nrows2) s(i)

The results are a bit too large to attach them here - see for example
https://github.com/tvondra/join-estimates-tests/blob/master/bench/summary.ods.

Looking at Mean Relative Error, i.e. essentially

MRE = AVERAGE(MAX(estimate/actual,actual/estimate))

over the 20 runs for each combination of parameters, The "average" sheet
shows

 (MRE for patched) / (MRE for master)

and the patch seems to clearly improve accuracy in this case. There are
a few cases where the estimate gets slightly worse (say, by 10%)
compared to current master. So I think that's nice.

I'm open to running additional tests with other distributions, table
sizes etc. if needed.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: MCV lists for highly skewed distributions

2018-03-13 Thread John Naylor
Hi Dean,
I've looked over your patch briefly, but not tested it yet.

> [construction of a pathological data set]

> So the table has around 31 million rows, and the stats target is maxed
> out -- we're sampling around 10% of the table, and it's not possible
> to sample more. Looking at the value a=5, it appears 102 times in the
> table, so we can expect it to be seen around 10 times in any sample,
> but the minimum count for the new algorithm in this case is 23, so it
> won't be included in the MCV list. This leads to it having the same
> estimate as all other non-MCV values, which turns out to be rows=5, a
> considerable under-estimate.

> So arguably, the new algorithm is still better than the current one
> for this data, but the fact that it doesn't include a=5 in the list
> bugs me, because the estimate for that single value is consistently
> worse. Looking at the distribution of a=5 in the sample, it is a
> hypergeometric distribution with N=31000000, n=3000000 and K=102. This
> gives a sample mean of 9.8 and a variance of 8.9 (a standard deviation
> of 3.0).

Mean =  n * K / N = 9.8
Var = (K / N) * (1 - K / N) * n * (N - n) / (N - 1) = 8.9
Stdev = sqrt(Var) = 3.0

Got it.

> Also, this is common enough that in fact that distribution
> can be reasonably approximated by a normal distribution.

For the archives, because it's typically seen 10 times in the sample,
per the rule of thumb mentioned upthread?

> The value is
> excluded from the MCV list because the spread is deemed too large,
> relative to the mean, so the relative error of the MCV-based estimate
> would be too high. However, not including it in the MCV list causes it
> to have an estimate of 5, which would correspond to a sample mean of
> 0.5, and the observed sample mean is more than 3 standard deviations
> higher than that. So we do actually have enough information to
> conclude that the value is almost certainly more common than the
> non-MCV selectivity would suggest, and I think that's sufficient
> justification for including it in the MCV list.

> The crux of this is that the relative-standard-error algorithm is
> making use of 2 pieces of information -- the sample frequency, and an
> estimate of its statistical spread. However, there is a 3rd piece of
> information that we know, that is not being made use of -- the
> selectivity that would be assigned to the value if it were not
> included in the MCV list. Looking at just the first 2 pieces of
> information ensures that we get decent estimates for the values in the
> MCV list (where what constitutes "decent" depends on an arbitrarily
> chosen RSE cutoff), but it doesn't take into account the fact that the
> values not in the list may have much worse estimates. Making use of
> the non-MCV-based estimate allows us to do better -- if the MCV-based
> estimate is statistically significantly higher than the non-MCV-based
> estimate, then it would almost certainly be better to include that
> value in the MCV list, even if its spread is quite large.

Because we can treat the sample as normal, testing for > 2 standard
deviations means that we're "~95% sure" it's more common in the table
than the non-MCV selectivity would suggest, right? (I realize I'm not
using technically accurate language)

In your v1 patch, we didn't need a clamp on 10 values in the sample to
treat it as normal, because it was mathematically impossible to pass
the RSE test with less than 10 unless we were sampling most of the
table, in which case it didn't matter. Is there a similar
justification with the new algorithm?

> [results summary]
>
> So HEAD and P1 are both producing too many MCVs. The latest 2 patches
> cure that problem, but the new v2 patch is better at picking out all
> the common values.

Looks good. I'll run some tests of my own this week, trying to find
corner cases different from yours.

As far as the code goes, I haven't studied it very closely yet, but I
did want to call attention to this comment:

+   /*
+    * If the value is kept in the MCV list, its population frequency is
+    * assumed to equal its sample frequency, and the distribution of the
+    * value's count in the sample is a hypergeometric distribution with
+    * the following standard deviation.
+    */

The part after "and" doesn't seem to follow from the first part. Would
you mind clarifying?

-John Naylor



Re: MCV lists for highly skewed distributions

2018-03-11 Thread Dean Rasheed
On 6 March 2018 at 16:48, Dean Rasheed  wrote:
> On 6 March 2018 at 08:51, John Naylor  wrote:
>> On 3/5/18, Dean Rasheed  wrote:
>>> Attached is an updated patch.
>> Nice. The results look good.
>
> Thanks for the review.
>

So I was about ready to commit this, but decided to do more testing,
because I worry that there may be instances that a user could point to
where it makes estimates worse. Maybe that's inevitable for any change
we might make, and I don't think that should stop us from at least
trying to improve things, but it does make me wary about committing
this without a lot of testing.

In all the testing I've done so far, this looks to be a definite
improvement over the current algorithm, but it looks like it's
possible to do better.

Consider the following test case:

drop table if exists foo;
create temp table foo(a int);
insert into foo
  select * from
   (
     select x/2 from generate_series(0,19999999) g(x)
     union all
     select 0 from generate_series(0,9999999)
     union all
     select 1 from generate_series(0,999999)
     union all
     select 2 from generate_series(0,99999)
     union all
     select 3 from generate_series(0,9999)
     union all
     select 4 from generate_series(0,999)
     union all
     select 5 from generate_series(0,99)
   ) t
   order by random();
alter table foo alter column a set statistics 10000;
analyse foo;

So the table has around 31 million rows, and the stats target is maxed
out -- we're sampling around 10% of the table, and it's not possible
to sample more. Looking at the value a=5, it appears 102 times in the
table, so we can expect it to be seen around 10 times in any sample,
but the minimum count for the new algorithm in this case is 23, so it
won't be included in the MCV list. This leads to it having the same
estimate as all other non-MCV values, which turns out to be rows=5, a
considerable under-estimate.

By comparison, the current algorithm in HEAD does include this value
in the MCV list, so it gets a good estimate. On the other hand, it
generates a complete MCV-list with 10000 entries, most of which are
just random values from the list that appear twice in the table and
also happen to appear twice in the sample. So there are nearly 10000
values that are significantly over-estimated in this case.

So arguably, the new algorithm is still better than the current one
for this data, but the fact that it doesn't include a=5 in the list
bugs me, because the estimate for that single value is consistently
worse. Looking at the distribution of a=5 in the sample, it is a
hypergeometric distribution with N=31000000, n=3000000 and K=102. This
gives a sample mean of 9.8 and a variance of 8.9 (a standard deviation
of 3.0). Also, the value is common enough that in fact its distribution
can be reasonably approximated by a normal distribution. The value is
excluded from the MCV list because the spread is deemed too large,
relative to the mean, so the relative error of the MCV-based estimate
would be too high. However, not including it in the MCV list causes it
to have an estimate of 5, which would correspond to a sample mean of
0.5, and the observed sample mean is more than 3 standard deviations
higher than that. So we do actually have enough information to
conclude that the value is almost certainly more common than the
non-MCV selectivity would suggest, and I think that's sufficient
justification for including it in the MCV list.
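For anyone wanting to verify those figures, here is a quick sketch (not part of the patch) of the hypergeometric mean and spread, using N = 31 million rows and n = 3 million sampled rows per the setup above:

```python
from math import sqrt

# Hypergeometric distribution of the sample count of a=5:
# N rows in the table, n sampled, K total occurrences of the value.
N = 31_000_000  # approximate table size from the test case above
n = 3_000_000   # sample size with the stats target maxed out (300 * 10000)
K = 102         # occurrences of a=5 in the table

mean = n * K / N
var = mean * (1 - K / N) * (N - n) / (N - 1)
sd = sqrt(var)

print(round(mean, 2), round(var, 1), round(sd, 2))  # 9.87 8.9 2.99

# The non-MCV estimate of rows=5 corresponds to a sample mean of ~0.5;
# the observed mean is more than 3 standard deviations above that.
print((mean - 0.5) / sd > 3)  # True
```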

The crux of this is that the relative-standard-error algorithm is
making use of 2 pieces of information -- the sample frequency, and an
estimate of its statistical spread. However, there is a 3rd piece of
information that we know, that is not being made use of -- the
selectivity that would be assigned to the value if it were not
included in the MCV list. Looking at just the first 2 pieces of
information ensures that we get decent estimates for the values in the
MCV list (where what constitutes "decent" depends on an arbitrarily
chosen RSE cutoff), but it doesn't take into account the fact that the
values not in the list may have much worse estimates. Making use of
the non-MCV-based estimate allows us to do better -- if the MCV-based
estimate is statistically significantly higher than the non-MCV-based
estimate, then it would almost certainly be better to include that
value in the MCV list, even if its spread is quite large.
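As a sketch of what that decision rule might look like (hypothetical Python with invented names, not the patch's actual C code), a value is kept if its sample count is significantly above the count the non-MCV selectivity would predict:

```python
from math import sqrt

def keep_in_mcv_list(count, non_mcv_freq, n, N, num_sds=2.0):
    """Sketch of the idea above: include a value in the MCV list if its
    sample count exceeds the count predicted by the non-MCV selectivity
    by more than num_sds hypergeometric standard deviations."""
    base_count = non_mcv_freq * n          # count if the value were "typical"
    p = count / n                          # sample frequency estimate
    sd = sqrt(n * p * (1 - p) * (N - n) / (N - 1))
    return count > base_count + num_sds * sd

# The a=5 example: seen ~10 times in the sample, non-MCV estimate ~0.5 rows.
print(keep_in_mcv_list(10, 0.5 / 3_000_000, 3_000_000, 31_000_000))  # True
```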

This is somewhat similar to the initial idea from Jeff Janes, except
that patch didn't take into account the spread, so it ended up making
the too-many-mcvs problem worse for uniform distributions. There's
also another subtlety with that patch -- when comparing frequencies of
values not in the MCV list, you really have to start from a fully
populated MCV list and work down, rather than starting with an empty
one and working up, because if all the common values have about the
same frequency, the most common amongst them may not be much more
Re: MCV lists for highly skewed distributions

2018-03-06 Thread Dean Rasheed
On 6 March 2018 at 08:51, John Naylor  wrote:
> On 3/5/18, Dean Rasheed  wrote:
>> Attached is an updated patch.
> Nice. The results look good.

Thanks for the review.


> I agree it should be in a separate function. As for that large
> comment, I spent some time poring over it to verify the math and make
> sure I understand it. I see a couple points where it might be a bit
> confusing to someone reading this code for the first time:
>
> +* This bound is at most 25, and approaches 0 as n approaches 0 or N. The
> +* case where n approaches 0 cannot happen in practice, since the sample
> +* size is at least 300.
>
> I think the implication is that the bound cannot dip below 10 (the
> stated minimum to justify using techniques intended for a normally
> distributed sample), so there's no need to code a separate clamp to
> ensure 10 values. That might be worth spelling out explicitly in the
> comment.
>

Perhaps I should really say that n can't be less than 300 unless N is
too, in which case n=N. So the only case where we need to worry about
the bound approaching zero is when n --> N, and we're sampling the
entire table, or almost all of it. I'll see if I can re-word that to
be clearer.


> +* size is at least 300.  The case where n approaches N corresponds to
> +* sampling the whole table, in which case it is reasonable to keep
> +* the whole MCV list (have no lower bound), so it makes sense to apply
> +* this formula for all inputs, even though the above derivation is
> +* technically only valid when the right hand side is at least around 10.
>
> It occurred to me that if the table cardinality is just barely above
> the sample size, we could get a case where a value sampled only once
> could slip into the MCV list. With the default stat target, this would
> be tables with between 30,000 and 31,500 rows.

(actually I think that's 31250 rows, or more generally when we sample
more than around 96% of the table)

> I couldn't induce this
> behavior, so I looked at the logic that identifies MCV candidates, and
> saw a test for
>
> if (dups_cnt > 1)
>
> If I'm reading that block correctly, then a value sampled once will
> never even be considered for the MCV list, so it seems that the
> de facto lower bound for these tables is two. It might be worth
> mentioning that in the comment.
>
> It also means that this clamp on HEAD
>
> -   if (mincount < 2)
> -   mincount = 2;
>
> is dead code.
>

Yes, in compute_scalar_stats(), each value is guaranteed to have been
seen at least twice. However, looking at compute_distinct_stats(),
that's not the case. I'm not sure if that matters.

When we have sampled 96% of the table, any estimate we produce based
on the number of times a value has been seen in the sample is likely
to be almost exact, even if it has only been seen once or twice. So
the estimates from the MCV list will all be good, but I suspect in
this case the estimates for all other values that didn't fit in the
MCV list will also be good, so it may not matter whether or not we
keep those MCV items. My instinct is to keep them, on the grounds that
the more information the planner has, the better.


>>   mcv_cutoff - The relative standard error cutoff percentage used (10,
>> 15, 20, ... 50), or -10, -15, ... -50 for the corresponding tests
>> against HEAD.
>
> I'm not quite following the negative numbers for HEAD. They're all the
> equivalent, right? Just a label for convenience to make sure you ran
> the same number of tests?
>

Yes, they should all be equivalent. They just reflect the way I ran
the tests -- HEAD vs the patch with a cutoff of 10%, HEAD vs the patch
with a cutoff of 15%, and so on. So I ended up running it against HEAD
9 times, and I didn't want to throw any of that data away. It's useful
for getting a feel for the scope of random variations between test
runs.

Regards,
Dean



Re: MCV lists for highly skewed distributions

2018-03-06 Thread John Naylor
On 3/5/18, Dean Rasheed  wrote:
> Attached is an updated patch. I decided to move the code that
> determines the minimum number of occurrences for an MCV list item out
> to a separate function. It's not much code, but there's a large
> comment to explain its statistical justification, and I didn't want to
> duplicate that in 2 different places (possibly to become 3, with the
> multivariate MCV stats patch).

Nice. The results look good.

I agree it should be in a separate function. As for that large
comment, I spent some time poring over it to verify the math and make
sure I understand it. I see a couple points where it might be a bit
confusing to someone reading this code for the first time:

+* This bound is at most 25, and approaches 0 as n approaches 0 or N. The
+* case where n approaches 0 cannot happen in practice, since the sample
+* size is at least 300.

I think the implication is that the bound cannot dip below 10 (the
stated minimum to justify using techniques intended for a normally
distributed sample), so there's no need to code a separate clamp to
ensure 10 values. That might be worth spelling out explicitly in the
comment.

+* size is at least 300.  The case where n approaches N corresponds to
+* sampling the whole table, in which case it is reasonable to keep
+* the whole MCV list (have no lower bound), so it makes sense to apply
+* this formula for all inputs, even though the above derivation is
+* technically only valid when the right hand side is at least around 10.

It occurred to me that if the table cardinality is just barely above
the sample size, we could get a case where a value sampled only once
could slip into the MCV list. With the default stat target, this would
be tables with between 30,000 and 31,500 rows. I couldn't induce this
behavior, so I looked at the logic that identifies MCV candidates, and
saw a test for

if (dups_cnt > 1)

If I'm reading that block correctly, then a value sampled once will
never even be considered for the MCV list, so it seems that the
de facto lower bound for these tables is two. It might be worth
mentioning that in the comment.

It also means that this clamp on HEAD

-   if (mincount < 2)
-   mincount = 2;

is dead code.

> I think this is ready for commit now, so barring objections, I will do
> so in the next day or so.

Fine by me. I've skimmed your test methodology and results and have
just one question below.

> There's a lot of random variation between tests, but by running a lot
> of them, the general pattern can be seen. I ran the test 9 times, with
> different MCV cutoff values in the patched code of 10%, 15%, ... 50%,
> then collected all the results from the test runs into a single CSV
> file to analyse using SQL. The columns in the CSV are:
>
>   test_name - A textual description of the data distribution used in the
> test.
>
>   mcv_cutoff - The relative standard error cutoff percentage used (10,
> 15, 20, ... 50), or -10, -15, ... -50 for the corresponding tests
> against HEAD.

I'm not quite following the negative numbers for HEAD. They're all the
equivalent, right? Just a label for convenience to make sure you ran
the same number of tests?

-John Naylor



Re: MCV lists for highly skewed distributions

2018-03-05 Thread Dean Rasheed
On 7 February 2018 at 15:58, Dean Rasheed  wrote:
> On 7 February 2018 at 15:25, Robert Haas  wrote:
>> Do you plan to press forward with this, then, or what's
>> the next step?
>
> I plan to do more testing

TL;DR: I tested this new algorithm using 9 different relative standard
error cutoffs (10%, 15%, ... 50%) against a variety of different data
distributions, and it looks like a definite improvement over the
current algorithm, for a wide range of cutoff values. Overall, it
isn't that sensitive to the exact cutoff chosen, but it looks like a
value of 20% is a good general-purpose choice.

One of the properties of the new algorithm is that the size of the MCV
list responds better to changing the stats target, so I don't think
it's worth having a separate knob to control the cutoff percentage.
Such a knob would have a similar, but weaker effect than the stats
target knob, and thus doesn't seem worthwhile.

Attached is an updated patch. I decided to move the code that
determines the minimum number of occurrences for an MCV list item out
to a separate function. It's not much code, but there's a large
comment to explain its statistical justification, and I didn't want to
duplicate that in 2 different places (possibly to become 3, with the
multivariate MCV stats patch).

I think this is ready for commit now, so barring objections, I will do
so in the next day or so.

---

I tested using the attached python script, which tests a large number
of different data distributions, comparing the results with the patch
vs those from HEAD. It uses EXPLAIN ANALYSE to compare the estimates
against actual row counts, both for values in the MCV list, and
non-MCV values, making it possible to tell whether it would have been
better if the MCV list had been bigger or smaller.

There's a lot of random variation between tests, but by running a lot
of them, the general pattern can be seen. I ran the test 9 times, with
different MCV cutoff values in the patched code of 10%, 15%, ... 50%,
then collected all the results from the test runs into a single CSV
file to analyse using SQL. The columns in the CSV are:

  test_name - A textual description of the data distribution used in the test.

  mcv_cutoff - The relative standard error cutoff percentage used (10,
15, 20, ... 50), or -10, -15, ... -50 for the corresponding tests
against HEAD.

  stats_tgt - The stats target used (100 or 1000).

  num_mcvs - The size of the resulting MCV list.

  avg_mcv_err - The average percentage difference between estimated
and actual row counts for values in the MCV list. (There's a bit of a
fudge factor in calculating these percentages, to avoid them becoming
too large in cases where the row counts are small, but I think it's
still useful for comparison purposes.)

  max_mcv_err - The maximum percentage difference between estimated
and actual row counts for values in the MCV list.

  avg_non_mcv_err - The average percentage difference between
estimated and actual row counts for a bunch of non-MCV values.

  max_non_mcv_err - The maximum percentage difference between
estimated and actual row counts for a bunch of non-MCV values.

  avg_err - The average percentage difference between estimated and
actual row counts for all values tested.

  max_err - The maximum percentage difference between estimated and
actual row counts for all values tested.

From this, it's possible to look for cases where the MCV list is too
large or too small. For example, looking for too-many-mcv cases
(generally the more uniform data distributions):

SELECT mcv_cutoff, count(*)
  FROM results
 WHERE max_mcv_err - max_non_mcv_err > 50
 GROUP BY 1
 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -40 |    41
        -50 |    40
        -25 |    39
        -15 |    38
        -20 |    38
        -30 |    38
        -35 |    38
        -45 |    37
        -10 |    37
         50 |    35
         40 |    35
         45 |    34
         35 |    33
         30 |    33
         25 |    25
         20 |    13
         15 |     6
         10 |     3
(18 rows)

SELECT mcv_cutoff, count(*)
  FROM results
 WHERE max_mcv_err - max_non_mcv_err > 100
 GROUP BY 1
 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -30 |    17
        -40 |    16
        -15 |    15
        -25 |    14
        -10 |    14
        -35 |    14
        -45 |    13
        -50 |    13
        -20 |    12
         35 |    12
         45 |    11
         40 |    10
         30 |    10
         50 |    10
         25 |     6
         10 |     2
(16 rows)
( and implicitly:
         15 |     0
         20 |     0 )

So the patched code is generally better at avoiding the too-many-mcvs
problem, but an mcv_cutoff of 30% or more may be too high.

Looking for too-few-mcv cases:

SELECT mcv_cutoff, count(*)
  FROM results
 WHERE (max_non_mcv_err - max_mcv_err > 50 AND num_mcvs < stats_tgt)
 GROUP BY 1
 ORDER BY 2 DESC;

 mcv_cutoff | count

Re: MCV lists for highly skewed distributions

2018-03-01 Thread Dean Rasheed
On 1 March 2018 at 21:01, Andres Freund  wrote:
> This sounds like the patch's status of "waiting on author" isn't right,
> and it should more be ready for committer?
>

Yes, I'll take a look at it this weekend.

Regards,
Dean



Re: MCV lists for highly skewed distributions

2018-03-01 Thread Andres Freund
Hi Dean,

On 2018-02-07 15:58:14 +, Dean Rasheed wrote:
> On 7 February 2018 at 15:25, Robert Haas  wrote:
> > Do you plan to press forward with this, then, or what's
> > the next step?
> >
> 
> Yes, I think the results are pretty good so far, especially for the
> more non-uniform distributions. AFAICS it solves the 2 original
> complaints, and I've not yet seen a case where it makes things worse.
> 
> I plan to do more testing (and if anyone else wants to, that would be
> most welcome), and then barring any unexpected issues/objections, I'll
> commit it. Probably not this week though.

This sounds like the patch's status of "waiting on author" isn't right,
and it should more be ready for committer?

Greetings,

Andres Freund



Re: MCV lists for highly skewed distributions

2018-02-07 Thread Dean Rasheed
On 7 February 2018 at 15:25, Robert Haas  wrote:
> Do you plan to press forward with this, then, or what's
> the next step?
>

Yes, I think the results are pretty good so far, especially for the
more non-uniform distributions. AFAICS it solves the 2 original
complaints, and I've not yet seen a case where it makes things worse.

I plan to do more testing (and if anyone else wants to, that would be
most welcome), and then barring any unexpected issues/objections, I'll
commit it. Probably not this week though.

Regards,
Dean



Re: MCV lists for highly skewed distributions

2018-02-07 Thread Robert Haas
On Wed, Feb 7, 2018 at 10:20 AM, Dean Rasheed  wrote:
> One thing this new algorithm does do is improve the user's ability to
> get more MCVs by increasing the stats target. I'm not yet convinced
> there should be a separate knob for the RSE cutoff. For that to be
> useful, there would need to be some data distributions for which 10%
> (say) was clearly better, and some for which 20% was better. So far,
> there doesn't appear to be a massive difference between the two, and
> it's nothing that can't be compensated for using the existing stats
> target knob.

Fair enough.  Do you plan to press forward with this, then, or what's
the next step?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: MCV lists for highly skewed distributions

2018-02-07 Thread Dean Rasheed
On 7 February 2018 at 13:29, Robert Haas  wrote:
> On Wed, Feb 7, 2018 at 3:51 AM, Dean Rasheed  wrote:
>> Thanks for testing. I agree, this new algorithm seems to stand up
>> pretty well in the testing I've done too. One thing about it that can
>> be tuned is the cutoff threshold for the relative standard error -- I
>> chose 10% completely arbitrarily, but it could just as easily be 20%,
>> 30% or even higher (in the patch s/0.01/0.04/ or s/0.01/0.09).
>
> Maybe it'd be worth having a separate GUC for this, and a reloption to
> override the GUC.  It seems to me that it would be very reasonable to
> want to separately control (a) how much sampling you want to do and
> (b) how aggressively you want to be about including things in the MCV
> list.  Of course, even if we do that, it doesn't excuse us from
> needing to set a good default value.  And it might not be necessary to
> do the extra work for this.
>
> Looking at your data, it definitely seems like 10% would be too strict
> -- if I'm reading this correctly, with a stats target in the 10-50
> range, your normally-distributed table gets no MCVs at all, rather
> than a number equal to the statistics target.  That doesn't seem good.
>

(Actually it was the 10-30 stats target range that gave no MCVs at all for 10%)

The fact that 10% leads to no MCVs for a deliberately lowered stats
target of 10 isn't necessarily bad, and might actually have been an
improvement -- e.g., with a stats target of 1, you get no MCVs even
with a 20% RSE cutoff, whereas with HEAD you typically get 1
completely random MCV, and a wildly inaccurate estimate for that
value.

The reason I think 10% is too low is that the extra MCVs you get by
increasing it to 20% generally seem to give better estimates (for a
range of stats targets) than if they weren't included (although it's
sometimes a bit marginal). So 20% seems to strike about the right
balance between too many and too few MCVs.

One thing this new algorithm does do is improve the user's ability to
get more MCVs by increasing the stats target. I'm not yet convinced
there should be a separate knob for the RSE cutoff. For that to be
useful, there would need to be some data distributions for which 10%
(say) was clearly better, and some for which 20% was better. So far,
there doesn't appear to be a massive difference between the two, and
it's nothing that can't be compensated for using the existing stats
target knob.

Regards,
Dean



Re: MCV lists for highly skewed distributions

2018-02-07 Thread Robert Haas
On Wed, Feb 7, 2018 at 3:51 AM, Dean Rasheed  wrote:
> Thanks for testing. I agree, this new algorithm seems to stand up
> pretty well in the testing I've done too. One thing about it that can
> be tuned is the cutoff threshold for the relative standard error -- I
> chose 10% completely arbitrarily, but it could just as easily be 20%,
> 30% or even higher (in the patch s/0.01/0.04/ or s/0.01/0.09).

Maybe it'd be worth having a separate GUC for this, and a reloption to
override the GUC.  It seems to me that it would be very reasonable to
want to separately control (a) how much sampling you want to do and
(b) how aggressively you want to be about including things in the MCV
list.  Of course, even if we do that, it doesn't excuse us from
needing to set a good default value.  And it might not be necessary to
do the extra work for this.

Looking at your data, it definitely seems like 10% would be too strict
-- if I'm reading this correctly, with a stats target in the 10-50
range, your normally-distributed table gets no MCVs at all, rather
than a number equal to the statistics target.  That doesn't seem good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: MCV lists for highly skewed distributions

2018-02-07 Thread Dean Rasheed
On 4 February 2018 at 12:18, John Naylor  wrote:
> I did the same basic eyeball testing I did on earlier patches, and
> this is the best implementation so far. I've attached some typical
> pg_stats contents for HEAD and this patch. More rigorous testing,
> including of planner estimates on real data, is still needed of
> course, but I think this is definitely in the right direction.
>

Thanks for testing. I agree, this new algorithm seems to stand up
pretty well in the testing I've done too. One thing about it that can
be tuned is the cutoff threshold for the relative standard error -- I
chose 10% completely arbitrarily, but it could just as easily be 20%,
30% or even higher (in the patch s/0.01/0.04/ or s/0.01/0.09).

Looking at the too-many-MCVs example from [1], the table has 100
million rows, and each distinct value occurs 10 times. With the
default stats target of 100, the MCV-list is fully populated with 100
values, most having been seen two (or occasionally three) times. Thus,
if you query for one of those MCVs, the estimate is 6667 or 10,000
rather than 10, which is a pretty bad over-estimate. Increasing the
stats target improves things, because the sample frequencies go down
correspondingly. The problem is also lessened a bit if you randomise
the order of the values in the second column so that it isn't
correlated to the storage order:

insert into qqq select a, random()*10^7 from generate_series(1, (10^8)::int) a;

However in a sample of 30,000 rows it remains reasonably likely that
the same value will be seen twice (I typically observed 40 to 50 MCVs
in this case, c.f. the birthday paradox), and in a sample of 300,000
rows it goes back to giving 1000 MCVs, each seen 2 or 3 times. So this
isn't really the fault of the non-uniform ANALYSE sample, but rather
of the current algorithm's belief that seeing a value just twice is
sufficient for it to qualify as an MCV.
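The 40-to-50 figure matches a quick birthday-paradox estimate (a back-of-envelope sketch, not from the patch): with roughly 10^7 equally likely values and a 30,000-row sample, the expected number of values seen at least twice is about C(30000, 2) / 10^7:

```python
from math import comb, exp

d = 10**7   # distinct values produced by random()*10^7
n = 30_000  # sample size at the default stats target of 100

# Expected number of colliding pairs (birthday-paradox approximation):
pairs = comb(n, 2) / d
print(round(pairs))  # 45

# Slightly more careful: expected number of values seen >= 2 times,
# treating each value's sample count as Poisson with rate n/d.
lam = n / d
dupes = d * (1 - exp(-lam) * (1 + lam))
print(round(dupes))  # 45
```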

The new MCV algorithm solves this by demanding that a value be seen
many more times before being included in cases where the table size is
much larger than the sample size. The exact threshold depends on the
table size, the sample size and the relative error cutoff value. In
this example, with a table size of 100 million rows, the sample size
makes little difference, because it's always going to be much smaller
than the table size, so the number of instances required in the sample
depends mostly on the RSE cutoff chosen:

 rse_cutoff | sample=30000 | sample=300000 | sample=3000000
------------+--------------+---------------+----------------
 10%        |          100 |           100 |             97
 20%        |           25 |            25 |             25
 30%        |           12 |            12 |             11
 40%        |            7 |             7 |              7
 50%        |            4 |             4 |              4
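Under one reading of the algorithm, the minimum count is ceil((1 - n/N) / rse^2), taking the three sample-size columns above to be 30000, 300000 and 3000000 rows and N = 10^8. This is a hedged reconstruction rather than the patch verbatim, but it reproduces the table exactly:

```python
def min_count(n, N, rse_pct):
    # Assumed RSE-based minimum sample count: ceil((1 - n/N)/rse^2),
    # computed in exact integer arithmetic (rse_pct is the cutoff in %).
    num = (N - n) * 10_000
    den = N * rse_pct * rse_pct
    return -(-num // den)  # ceiling division

N = 100_000_000  # table size in the too-many-MCVs example
for rse_pct in (10, 20, 30, 40, 50):
    row = [min_count(n, N, rse_pct) for n in (30_000, 300_000, 3_000_000)]
    print(f"{rse_pct}% -> {row}")
# 10% -> [100, 100, 97], 20% -> [25, 25, 25], 30% -> [12, 12, 11],
# 40% -> [7, 7, 7], 50% -> [4, 4, 4]
```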

So any of those RSE cutoffs solves the too-many-MCVs problem in this
particular case, giving no MCVs, although 50% is pushing it a bit, and
in any case, the theoretical basis for this algorithm breaks down well
before then.

The advantage of a larger RSE cutoff is that it gives more MCVs for a
non-uniform data distribution, where the current algorithm tends to
give too few MCVs. In the attached example, the table has 1 million
rows of integer data in a normal distribution with a standard
deviation of 50. This typically gives somewhere in the region of 430
to 440 distinct values. Running against HEAD and with this patch, for
varying sample sizes and RSE cutoffs gives the following MCV-list
sizes:

stats_target  mcvs (HEAD)  mcvs (10% RSE)  mcvs (20% RSE)  mcvs (30% RSE)
10            10           0               10              10
20            20           0               20              20
30            30           0               30              30
40            40           17              40              40
50            50           50              50              50
100           100          100             100             100
300           132          204             261             290
1000          205          266             315             338
3000          252          364             400             411
10000         296          413             414             412

One thing to note is that the HEAD algorithm never includes more than
around 300 values, because of its requirement for a value to be more
common than average. The new algorithm has no such requirement, so it
will include nearly every value in the MCV list (except the most
uncommon ones) if the stats target is set high enough. Also, the
detailed output from the test shows that the resulting estimates based
on those MCVs are pretty decent.

(Note: this example shows that the too-few-MCVs problem can occur in
any non-uniform distribution, not just highly skewed ones.)

I've also tried this out against some real-world data, and in the
testing I've done so far, the new algorithm is not actually that
sensitive to 

Re: MCV lists for highly skewed distributions

2018-02-07 Thread Dean Rasheed
On 1 February 2018 at 17:49, Robert Haas  wrote:
> One point which I want to emphasize is that the length of the MCV list
> bounds the estimated frequency of non-MCVs in two ways: no non-MCV is
> ever thought to be more frequent than the least-common MCVs, and
> however many non-MCVs we think we have (probably fewer than we
> actually have) have to fit into whatever percentage of the table is
> consumed by MCVs.  This would be less important if we had reliable
> n_distinct estimates, but we don't.  So, even throwing things into the
> MCV list that are no more common than the average item can improve
> planning in some cases.
>

That's a good point, and a nice explanation. I think that lends more
weight to the argument that we should be including as many MCVs as
possible, provided there's enough evidence to justify their inclusion.

Regards,
Dean



Re: MCV lists for highly skewed distributions

2018-02-04 Thread John Naylor
On 2/2/18, Dean Rasheed  wrote:
> I think it would be better to try to come up with an alternative
> algorithm that has a better theoretical basis, and then test that to
> see how it holds up in practice.
>
> With that in mind, attached is a patch based on the idea of setting a
> bound on the relative standard error on the sample frequency

I did the same basic eyeball testing I did on earlier patches, and
this is the best implementation so far. I've attached some typical
pg_stats contents for HEAD and this patch. More rigorous testing,
including of planner estimates on real data, is still needed of
course, but I think this is definitely in the right direction.

-John Naylor


test_mcvstats_v1.sql


Re: MCV lists for highly skewed distributions

2018-02-01 Thread Robert Haas
On Thu, Feb 1, 2018 at 12:21 PM, Dean Rasheed  wrote:
> For a highly skewed distribution, it is possible for there to be
> hardly any values (maybe only one) that appears more than 1.25 times
> the average frequency, and so lots of otherwise perfectly good common
> values get discarded. For such a distribution, I don't think that the
> average frequency is particularly interesting, and it shouldn't be
> used to filter the MCV list.

Although I don't think I have enough math background to analyze this
as rigorously as you, I agree with you on this point: the average
frequency doesn't seem very interesting.

One point which I want to emphasize is that the length of the MCV list
bounds the estimated frequency of non-MCVs in two ways: no non-MCV is
ever thought to be more frequent than the least-common MCVs, and
however many non-MCVs we think we have (probably fewer than we
actually have) have to fit into whatever percentage of the table is
consumed by MCVs.  This would be less important if we had reliable
n_distinct estimates, but we don't.  So, even throwing things into the
MCV list that are no more common than the average item can improve
planning in some cases.

> Testing it with the example from [1], it does indeed appear to solve
> the too-many-MCVs problem in that particular case (actually generating
> no MCVs).
>
> Testing it with the highly skewed example at the start of this thread,
> it typically generates a couple more MCVs, producing somewhat better
> estimates, but to get really good estimates for who=17, you need to
> crank up the stats target. It does at least appear to no longer be the
> case that cranking up the stats target has a weak effect.

Hmm, nice result.  I agree with you that it would be nice if we can
come up with a good general-purpose algorithm for this, rather than
making it pluggable.  I don't know whether we can succeed in doing
that but we should try hard, because it's always better when things
just work automatically...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: MCV lists for highly skewed distributions

2018-02-01 Thread Dean Rasheed
On 1 February 2018 at 13:16, Simon Riggs  wrote:
> On 25 January 2018 at 22:19, Tom Lane  wrote:
>> In any case, since it looks like the next step is for someone to come
>> up with a new proposal, I'm going to set this to Waiting on Author.
>
> Dean and John's results show that different algorithms work better for
> different cases.
>
> How about we make ANALYZE's MCV algorithm pluggable? And then include,
> say, 2 additional algorithms.
>

I don't think we've yet proved that that's needed. I'd rather attempt
to come up with a decent general-purpose algorithm, if possible.

The main problem I have with the originally proposed patch is the lack
of any kind of rigorous justification for the approach of changing the
algorithm to include values that are "significantly more common than
average frequency for values which will not be represented in the MCV
list". So there's no guarantee that the MCVs produced will be backed
by sufficient evidence, and it risks making the too-many-MCVs case
worse.

Of course the current code suffers from both the too-many-MCVs and
too-few-MCVs problems, depending on the data distribution:

For a reasonably uniform distribution with quite a large number of
distinct values, it is possible to generate MCVs on the basis of
having seen only a few instances of the values. Those few values seen
are then not sufficiently statistically significant, and extrapolating
to the whole table produces wildly inaccurate estimates.

For a highly skewed distribution, it is possible for there to be
hardly any values (maybe only one) that appears more than 1.25 times
the average frequency, and so lots of otherwise perfectly good common
values get discarded. For such a distribution, I don't think that the
average frequency is particularly interesting, and it shouldn't be
used to filter the MCV list.

There is also another variant of the too-many-MCVs problem that I
believe is also possible -- if the sample contains a large number of
NULLs or too-wide values, values_cnt could be reduced to the point
where maxmincount is quite small, and again MCVs could get included on
the basis of a very small number of occurrences.

I think it would be better to try to come up with an alternative
algorithm that has a better theoretical basis, and then test that to
see how it holds up in practice.

With that in mind, attached is a patch based on the idea of setting a
bound on the relative standard error on the sample frequency -- so we
include values in the MCV list if and only if they are seen enough
times in the sample that the standard error on the sample frequency is
small compared to the sample frequency itself, and thus we expect that
the relative error resulting from extrapolating that to the whole
table is also small. In theory then, it should never generate too many
MCVs, although tuning of the relative error threshold might be
required to prevent it from generating too few.

Note, this is not a finished patch (e.g., I haven't touched
compute_distinct_stats()). It's merely a possible starting point from
which a lot more testing will be required.

Testing it with the example from [1], it does indeed appear to solve
the too-many-MCVs problem in that particular case (actually generating
no MCVs).

Testing it with the highly skewed example at the start of this thread,
it typically generates a couple more MCVs, producing somewhat better
estimates, but to get really good estimates for who=17, you need to
crank up the stats target. It does at least appear to no longer be the
case that cranking up the stats target has a weak effect.

Regards,
Dean


[1] 
https://www.postgresql.org/message-id/flat/20170522132017.29944.48...@wrigleys.postgresql.org
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
new file mode 100644
index 5f21fcb..af92835
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2557,25 +2557,21 @@ compute_scalar_stats(VacAttrStatsP stats
 		 * Decide how many values are worth storing as most-common values. If
 		 * we are able to generate a complete MCV list (all the values in the
 		 * sample will fit, and we think these are all the ones in the table),
-		 * then do so.  Otherwise, store only those values that are
-		 * significantly more common than the (estimated) average. We set the
-		 * threshold rather arbitrarily at 25% more than average, with at
-		 * least 2 instances in the sample.  Also, we won't suppress values
-		 * that have a frequency of at least 1/K where K is the intended
-		 * number of histogram bins; such values might otherwise cause us to
-		 * emit duplicate histogram bin boundaries.  (We might end up with
-		 * duplicate histogram entries anyway, if the distribution is skewed;
-		 * but we prefer to treat such values as MCVs if at all possible.)
+		 * then do so.  Otherwise, keep only those values that appear
+		 * sufficiently often in the sample that it is reasonable to
+		 * 

Re: MCV lists for highly skewed distributions

2018-02-01 Thread Simon Riggs
On 25 January 2018 at 22:19, Tom Lane  wrote:
> Dean Rasheed  writes:
>> It occurs to me that maybe a better test to exclude a value from the
>> MCV list would be to demand that its relative standard error not be
>> too high. Such a test, in addition to the existing tests, might be
>> sufficient to solve the opposite problem of too many values in the MCV
>> list, because the real problem there is including a value after having
>> seen relatively few occurrences of it in the sample, and thus having a
>> wildly inaccurate estimate for it. Setting a bound on the relative
>> standard error would mean that we could have a reasonable degree of
>> confidence in estimates produced from the sample.
>
> This patch is marked Ready for Committer, but that seems wildly optimistic
> based on the state of the discussion.  It doesn't look to me like we
> even have consensus on an algorithm, let alone code for it.  Certainly,
> whatever we do needs to address the too-many-MCVs issue from [1] as
> well as Jeff's too-few-MCVs case.
>
> Even if we had a plausible patch, I'm not sure how we get to the
> point of having enough consensus to commit.  In the previous thread,
> it seemed that some people would object to any change whatsoever :-(
>
> In any case, since it looks like the next step is for someone to come
> up with a new proposal, I'm going to set this to Waiting on Author.

Dean and John's results show that different algorithms work better for
different cases.

How about we make ANALYZE's MCV algorithm pluggable? And then include,
say, 2 additional algorithms.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: MCV lists for highly skewed distributions

2018-01-25 Thread Tom Lane
Dean Rasheed  writes:
> It occurs to me that maybe a better test to exclude a value from the
> MCV list would be to demand that its relative standard error not be
> too high. Such a test, in addition to the existing tests, might be
> sufficient to solve the opposite problem of too many values in the MCV
> list, because the real problem there is including a value after having
> seen relatively few occurrences of it in the sample, and thus having a
> wildly inaccurate estimate for it. Setting a bound on the relative
> standard error would mean that we could have a reasonable degree of
> confidence in estimates produced from the sample.

This patch is marked Ready for Committer, but that seems wildly optimistic
based on the state of the discussion.  It doesn't look to me like we
even have consensus on an algorithm, let alone code for it.  Certainly,
whatever we do needs to address the too-many-MCVs issue from [1] as
well as Jeff's too-few-MCVs case.

Even if we had a plausible patch, I'm not sure how we get to the
point of having enough consensus to commit.  In the previous thread,
it seemed that some people would object to any change whatsoever :-(

In any case, since it looks like the next step is for someone to come
up with a new proposal, I'm going to set this to Waiting on Author.

regards, tom lane

[1] https://www.postgresql.org/message-id/flat/32261.1496611829%40sss.pgh.pa.us



Re: stricter MCV tests for uniform distributions (was Re: MCV lists for highly skewed distributions)

2018-01-22 Thread Dean Rasheed
On 22 January 2018 at 08:07, John Naylor  wrote:
> On 1/21/18, Dean Rasheed  wrote:
>> It occurs to me that maybe a better test to exclude a value from the
>> MCV list would be to demand that its relative standard error not be
>> too high. Such a test, in addition to the existing tests, might be
>> sufficient to solve the opposite problem of too many values in the MCV
>> list, because the real problem there is including a value after having
>> seen relatively few occurrences of it in the sample, and thus having a
>> wildly inaccurate estimate for it. Setting a bound on the relative
>> standard error would mean that we could have a reasonable degree of
>> confidence in estimates produced from the sample.
>
> If you don't mind, what would the math look like for that?

Using the same syntax as before:

N = Total rows in table (population size)
n = Number of rows sampled
x = Number of times a particular value appears in the sample
p = x/n = Frequency of the value in the sample

So that the standard error of the proportion is

SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))

Then the relative standard error (which is usually expressed as a percentage) is

RSE = 100 * SE / p

So if we were to demand that the relative standard error was less
than, say, 10%, then the constraint would just be

SE < 0.1 * p

Note:

* This formula is not valid if x is very small (see what I wrote about
being able to approximate the distribution of p with a normal
distribution). So we should also enforce the "rule of thumb" x >= 10.

* The frequency p that we're storing is the count divided by the
overall sample size. So the values for N and n above should not be the
counts that exclude NULLs. As far as this logic is concerned, NULLs
(and too-wide values) are just values not equal to value under
consideration. Thus it appears, from just a quick glance at your
patch, that you're using the wrong values for N and n.

* The RSE is a monotonically decreasing function of p in the range
[0,1], so an upper bound on the RSE equates to a lower bound on the
number of occurrences of the value.


This last point gives me a different idea. Instead of applying this
test *in addition to* the existing tests, how about applying it
*instead of* the existing tests. That is, we keep all MCVs that appear
sufficiently often that we can be reasonably confident in the
estimates produced from their sample frequencies, regardless of
whether or not they are more common than average (the average being
fairly meaningless for a highly skewed distribution anyway).

This is still keeping the most common values, but now we'd be saying
that we keep any value that appears sufficiently often in the sample
that we can extrapolate its sample frequency to produce a reasonably
accurate estimate of the population frequency, and discard values for
which the estimate is likely to be inaccurate.

I have not tested this idea at all, but it seems like it might be
worth considering. It has the nice property that the test depends
entirely on how often the value appears, rather than on other
previously computed statistics, such as Ndistinct.
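That monotonicity can be made concrete: squaring SE < 0.1*p and solving for x = n*p gives a minimum occurrence count. This closed form is my own rearrangement of the formulas above (an assumption, not code from any patch):

```python
def min_mcv_count(n, N, max_rse=0.10):
    """Smallest sample count x for which SE < max_rse * p holds.
    Derived by squaring the inequality and solving for x; for N >> n
    this tends toward 1/max_rse**2, i.e. about 100 for a 10% bound."""
    fpc2 = (N - n) / (N - 1)                 # squared finite population correction
    return n / (1 + max_rse**2 * n / fpc2)

# for a 30,000-row sample of a 10-million-row table the bound is
# roughly "keep values seen about 100+ times":
print(min_mcv_count(30000, 10_000_000))      # about 99.4
```

That matches the worked example below, where the who=7 value (108 occurrences) is kept and who=8 (46 occurrences) is not.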

Doing a quick test in pure SQL, using the highly skewed distribution
Jeff Janes gave in the other thread, with the default sample size of
30,000:

with samples as (
  select floor(-log(random())/log(2))::int as who
  from generate_series(1,30000)
), freqs as (
  select who, count(*) as x, count(*)/30000::float8 as p
  from samples group by who
), stats as (
  select *, sqrt(p*(1-p)/30000) *
sqrt((10000000-30000)::float8/(10000000-1)) as se
  from freqs
)
select *, (10000000*p)::int||'+/-'||(2*se*10000000)::int as "95% interval",
   case when x >= 10 and se < 0.1*p then 'KEEP' else 'DISCARD' end
from stats order by p desc limit 100;

it pretty consistently keeps the 8 most common values:

 who |   x   |     p      |          se          |  95% interval   |  case
-----+-------+------------+----------------------+-----------------+---------
   0 | 15017 |  0.5005667 |  0.00288241625942075 | 5005667+/-57648 | KEEP
   1 |  7607 |  0.2535667 |  0.00250800597590887 | 2535667+/-50160 | KEEP
   2 |  3713 |  0.1237667 |  0.0018984483        | 1237667+/-37969 | KEEP
   3 |  1855 |  0.0618333 |   0.0013884757600711 | 618333+/-27770  | KEEP
   4 |   914 | 0.03046667 | 0.000990788179299791 | 304667+/-19816  | KEEP
   5 |   448 |  0.0149333 | 0.000699194759916533 | 149333+/-13984  | KEEP
   6 |   229 | 0.00763333 | 0.000501741670388358 | 76333+/-10035   | KEEP
   7 |   108 |     0.0036 | 0.000345267009604061 | 36000+/-6905    | KEEP
   8 |    46 | 0.00153333 | 0.000225565173744715 | 15333+/-4511    | DISCARD
   9 |    34 | 0.00113333 | 0.000193963300230354 | 11333+/-3879    | DISCARD
  10 |    17 | 0.00056667 | 0.000137191663419411 | 5667+/-2744     | DISCARD
  11 |    11 | 

stricter MCV tests for uniform distributions (was Re: MCV lists for highly skewed distributions)

2018-01-22 Thread John Naylor
(Starting a new thread so as not to distract review)

On 1/21/18, Dean Rasheed  wrote:
> On 21 January 2018 at 07:26, John Naylor  wrote:
>> I spent a few hours hacking on this, and it turns out calculating the
>> right number of MCVs taking into account both uniform and highly
>> non-uniform distributions is too delicate a problem for me to solve
>> right now. The logic suggested by Dean Rasheed in [1] always produces
>> no MCVs for a perfectly uniform distribution (which is good), but very
>> often also for other distributions, which is not good. My efforts to
>> tweak that didn't work, so I didn't get as far as adapting it for the
>> problem Jeff is trying to solve.
>
> Hmm, Tom suggested that the test based on the average frequency over
> all values might be too strict because the estimated number of
> distinct values is often too low, so that might explain what you're
> seeing.

In my test tables, I've noticed that our Ndistinct estimator is most
inaccurate for geometric distributions, so that's certainly possible,
but confusingly, it occasionally gave an empty MCV list along with a
histogram with a boundary duplicated 5 times, which I thought I was
guarding against. I'm thinking my implementation of your logic is
flawed somehow. In case you're curious I've attached my rough
(compiler warnings and all) test patch.

> It occurs to me that maybe a better test to exclude a value from the
> MCV list would be to demand that its relative standard error not be
> too high. Such a test, in addition to the existing tests, might be
> sufficient to solve the opposite problem of too many values in the MCV
> list, because the real problem there is including a value after having
> seen relatively few occurrences of it in the sample, and thus having a
> wildly inaccurate estimate for it. Setting a bound on the relative
> standard error would mean that we could have a reasonable degree of
> confidence in estimates produced from the sample.

If you don't mind, what would the math look like for that?

-John Naylor
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 5f21fcb..da21333 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2318,6 +2318,8 @@ compute_scalar_stats(VacAttrStatsP stats,
 	int			num_mcv = stats->attr->attstattarget;
 	int			num_bins = stats->attr->attstattarget;
 	StdAnalyzeData *mystats = (StdAnalyzeData *) stats->extra_data;
+	double		N,
+				n;
 
 	values = (ScalarItem *) palloc(samplerows * sizeof(ScalarItem));
 	tupnoLink = (int *) palloc(samplerows * sizeof(int));
@@ -2525,10 +2527,10 @@ compute_scalar_stats(VacAttrStatsP stats,
 			 */
 			int			f1 = ndistinct - nmultiple + toowide_cnt;
 			int			d = f1 + nmultiple;
-			double		n = samplerows - null_cnt;
-			double		N = totalrows * (1.0 - stats->stanullfrac);
 			double		stadistinct;
 
+			n = samplerows - null_cnt;
+			N = totalrows * (1.0 - stats->stanullfrac);
 			/* N == 0 shouldn't happen, but just in case ... */
 			if (N > 0)
 stadistinct = (n * d) / ((n - f1) + f1 * n / N);
@@ -2558,9 +2560,44 @@ compute_scalar_stats(VacAttrStatsP stats,
 		 * we are able to generate a complete MCV list (all the values in the
 		 * sample will fit, and we think these are all the ones in the table),
 		 * then do so.  Otherwise, store only those values that are
-		 * significantly more common than the (estimated) average. We set the
-		 * threshold rather arbitrarily at 25% more than average, with at
-		 * least 2 instances in the sample.  Also, we won't suppress values
+		 * significantly more common than the (estimated) average.
+		 *
+		 * Note: For this POC patch, the implementation and comments
+		 * were copied from an email from Dean Rasheed, which contains further references:
+		 * https://www.postgresql.org/message-id/CAEZATCVu9zK0N%3Dnd9ufavabbM8YZiyWYJca0oiE8F31GAY%2B_XA%40mail.gmail.com
+		 *
+		 * We calculate the threshold from the table and sample sizes.
+
+		 * The initial rule of thumb is that the value should occur at
+		 * least 10 times in the sample.
+		 *
+		 * Suppose that N is the population size (total number of rows in the
+		 * table), and n is the sample size, and that some particular candidate
+		 * value appears x times in the sample. Then the "sample proportion" is
+		 * given by p = x/n.
+		 *
+		 * It is reasonable to treat p as having a normal distribution, which
+		 * then allows the margin of error to be analysed using standard
+		 * techniques. We calculate the standard error of the sample proportion:
+		 *
+		 * SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))
+		 *
+		 * The second term is a finite population correction. There is a 95%
+		 * probability that the total population proportion lies in the range
+		 *
+		 * [ pmin = p-2*SE, pmax = p+2*SE ]
+		 *
+		 * If there are Nd distinct values in the table, so that the average
+		 * frequency of occurrence of any particular value is 1/Nd, then the test
+		 *

Re: MCV lists for highly skewed distributions

2018-01-21 Thread Dean Rasheed
On 21 January 2018 at 07:26, John Naylor  wrote:
> I spent a few hours hacking on this, and it turns out calculating the
> right number of MCVs taking into account both uniform and highly
> non-uniform distributions is too delicate a problem for me to solve
> right now. The logic suggested by Dean Rasheed in [1] always produces
> no MCVs for a perfectly uniform distribution (which is good), but very
> often also for other distributions, which is not good. My efforts to
> tweak that didn't work, so I didn't get as far as adapting it for the
> problem Jeff is trying to solve.

Hmm, Tom suggested that the test based on the average frequency over
all values might be too strict because the estimated number of
distinct values is often too low, so that might explain what you're
seeing.

It occurs to me that maybe a better test to exclude a value from the
MCV list would be to demand that its relative standard error not be
too high. Such a test, in addition to the existing tests, might be
sufficient to solve the opposite problem of too many values in the MCV
list, because the real problem there is including a value after having
seen relatively few occurrences of it in the sample, and thus having a
wildly inaccurate estimate for it. Setting a bound on the relative
standard error would mean that we could have a reasonable degree of
confidence in estimates produced from the sample.

Regards,
Dean



Re: MCV lists for highly skewed distributions

2018-01-20 Thread John Naylor
I wrote:

>> I have a slight reservaton about whether 1.25x is still a sensible
>> heuristic.
>
> This was also discussed in [1], but no patch came out of it. I was
> just now turning the formulas discussed there into code, but I'll
> defer to someone with more expertise. FWIW, I suspect that a solution
> that doesn't take into account a metric like coefficient of variation
> will have the wrong behavior sometimes, whether for highly uniform or
> highly non-uniform distributions.

I spent a few hours hacking on this, and it turns out calculating the
right number of MCVs taking into account both uniform and highly
non-uniform distributions is too delicate a problem for me to solve
right now. The logic suggested by Dean Rasheed in [1] always produces
no MCVs for a perfectly uniform distribution (which is good), but very
often also for other distributions, which is not good. My efforts to
tweak that didn't work, so I didn't get as far as adapting it for the
problem Jeff is trying to solve.

I have not been able to come up with a more compelling alternative, so
I have nothing further to say about the patch under review.

> [1]
> https://www.postgresql.org/message-id/flat/32261.1496611829%40sss.pgh.pa.us#32261.1496611...@sss.pgh.pa.us

-John Naylor



Re: MCV lists for highly skewed distributions

2018-01-19 Thread John Naylor
I wrote:
> FWIW, I suspect that a solution
> that doesn't take into account a metric like coefficient of variation
> will have the wrong behavior sometimes, whether for highly uniform or
> highly non-uniform distributions.

By this I meant the coefficient of variation of the class size in the
sample, as denoted by gamma in the Haas and Stokes paper on page 7.
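For illustration, the sample-level version of that quantity is easy to compute directly (a sketch only; Haas & Stokes actually estimate the population-level CV, which is more involved):

```python
from collections import Counter
from statistics import mean, pstdev

def class_size_cv(sample):
    """Coefficient of variation of the per-value counts ("class sizes")
    observed in a sample -- the sample analogue of gamma."""
    counts = list(Counter(sample).values())
    return pstdev(counts) / mean(counts)

print(class_size_cv([1, 2, 3, 4]))        # 0.0 for a perfectly uniform sample
print(class_size_cv([1]*97 + [2, 3, 4]))  # large for a highly skewed one
```

A near-zero CV signals a uniform distribution (where the average frequency is meaningful), while a large CV signals skew.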

-John Naylor



Re: MCV lists for highly skewed distributions

2018-01-19 Thread John Naylor
>> [1] Occasionally it will store a much longer MCV list, because no value
>> was sampled exactly once, which triggers a different code path in which
>> all seen values are put in the MCV and the histogram is NULL.  This is
>> not reliable, as whether the least-sample value is present in the sample
>> once or twice is pretty brittle.
>
> And we need a better discussion of risk: Before, we generated too few
> MCV entries. To what extent might we now generate too many? That
> would be a problem of increased planning time.

(My apologies, I just now found some time to test this further, but I
don't have additional results yet)

Simon,
Earlier, I referenced a thread [1] complaining that we currently
already have too many MCVs in the case of uniform distributions, with
worse consequences than planning time. Based on my (admittedly quick
and dirty) preliminary testing (see attachment from a couple weeks
ago), this patch exacerbates that problem, and I was hoping to find a
way to fix that.

> I have a slight reservation about whether 1.25x is still a sensible
> heuristic.

This was also discussed in [1], but no patch came out of it. I was
just now turning the formulas discussed there into code, but I'll
defer to someone with more expertise. FWIW, I suspect that a solution
that doesn't take into account a metric like coefficient of variation
will have the wrong behavior sometimes, whether for highly uniform or
highly non-uniform distributions.

[1] 
https://www.postgresql.org/message-id/flat/32261.1496611829%40sss.pgh.pa.us#32261.1496611...@sss.pgh.pa.us

-John Naylor



Re: MCV lists for highly skewed distributions

2018-01-18 Thread Simon Riggs
On 28 December 2017 at 01:45, Jeff Janes  wrote:

> If we stored just a few more values, their inclusion in the MCV would mean
> they are depleted from the residual count, correctly lowering the estimate
> we would get for very rare values not included in the sample.

I agree with this thought.

> So instead of having the threshold of 1.25x the average frequency over all
> values, I think we should use 1.25x the average frequency of only those
> values not already included in the MCV, as in the attached.

It looks like this might even have been the original intention of that code.

Patch looks OK, but I think the comments need minor changes above line 2575

> As it is, you can partially overcome the too short MCV list by cranking up
> the default statistics target, but this is a weak effect and comes at a high
> cost of CPU time.  In some of the real distributions I've looked at,
> cranking up default statistics target is almost entirely ineffective.

Agreed, not a good solution.

> [1] Occasionally it will store a much longer MCV list, because no value was
> sampled exactly once, which triggers a different code path in which all seen
> values are put in the MCV and the histogram is NULL.  This is not reliable,
> as whether the least-sample value is present in the sample once or twice is
> pretty brittle.

And we need a better discussion of risk: Before, we generated too few
MCV entries. To what extent might we now generate too many? That
would be a problem of increased planning time.

I have a slight reservation about whether 1.25x is still a sensible
heuristic. Should we add a parameter for that to allow testing during
beta?

Marking as Ready For Committer.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: MCV lists for highly skewed distributions

2018-01-07 Thread John Naylor
I wrote:

> I'll be travelling for a few days, but I'll do some testing on some
> data sets soon.

I've attached some preliminary test methods and results. Probably not
detailed or rigorous enough, but I wanted to share this much before
digging deeper. I created some test data, ran analyze a few times and
eyeballed the results, of which I give a typical example. I used a low
stat target to make it easier to see the general picture. Suggestions
welcome.

-John Naylor


test_analyze_highly_skewed_v1.sql
Description: application/sql


Re: MCV lists for highly skewed distributions

2017-12-31 Thread John Naylor
I wrote:

> On 12/28/17, Jeff Janes  wrote:
>> I think that perhaps maxmincount should also use the dynamic
>> values_cnt_remaining rather than the static one.  After all, things
>> included in the MCV don't get represented in the histogram.  When I've
>> seen planning problems due to skewed distributions I also usually see
>> redundant values in the histogram boundary list which I think should be
>> in the MCV list instead. But I have not changed that here, pending
>> discussion.
>
> I think this is also a good idea, but I haven't thought it through. If
> you don't go this route, I would move this section back out of the
> loop as well.

I did some quick and dirty testing of this, and I just want to note
that in this case, setting mincount to its hard-coded minimum must
come after setting it to maxmincount, since now maxmincount can go
arbitrarily low.

I'll be travelling for a few days, but I'll do some testing on some
data sets soon. While looking through the archives for more info, I
saw this thread

https://www.postgresql.org/message-id/32261.1496611829%40sss.pgh.pa.us

which showcases the opposite problem: For more uniform distributions,
there are too many MCVs. Not relevant to your problem, but if I have
time I'll try my hand at testing an approach suggested in that thread
at the same time I test your patch and see how it interacts.

-John Naylor



Re: MCV lists for highly skewed distributions

2017-12-29 Thread John Naylor
On 12/28/17, Jeff Janes  wrote:
> I want to revive a patch I sent a couple of years ago to the performance list,
> as I have seen the issue pop up repeatedly since then.

> If we stored just a few more values, their inclusion in the MCV would mean
> they are depleted from the residual count, correctly lowering the estimate
> we would get for very rare values not included in the sample.
>
> So instead of having the threshold of 1.25x the average frequency over all
> values, I think we should use 1.25x the average frequency of only those
> values not already included in the MCV, as in the attached.

After reading the thread you mentioned, I think that's a good approach.
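For concreteness, here is a rough Python model of the change (illustrative names, not the actual analyze.c variables; the 1.25x factor and the 2-instance minimum are from the existing code):

```python
def choose_mcvs(track_counts, values_cnt, n_distinct, num_mcv):
    """Sketch of Jeff's adjusted selection: compare each candidate
    (descending by count) against 1.25x the average frequency of the
    values *not yet* taken into the MCV list, depleting the residual
    pool as entries are accepted."""
    mcvs = []
    values_cnt_remaining = values_cnt
    ndistinct_remaining = n_distinct
    for count in sorted(track_counts, reverse=True)[:num_mcv]:
        if ndistinct_remaining <= 0 or count < 2:
            break
        avg_count = values_cnt_remaining / ndistinct_remaining
        if count <= 1.25 * avg_count:
            break
        mcvs.append(count)
        values_cnt_remaining -= count   # deplete the residual pool
        ndistinct_remaining -= 1
    return mcvs

# hypothetical skewed sample: 100 distinct values, 1000 sampled rows
print(choose_mcvs([500, 250, 120, 60, 30, 15, 8, 4, 2, 1], 1000, 100, 10))
```

With the static threshold on master (1.25x the overall average, 12.5 here), the list would stop after the count-15 entry; the dynamic threshold keeps everything down to count 2, since each accepted entry lowers the residual average.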

On master, the numerator of avg_count is nonnull_cnt, but you've
(indirectly) set it to values_cnt. You didn't mention it here, but I
think you're right and master is wrong, since in this function only
sortable values go through the MCV path. If I understand the code
correctly, if there are enough too-wide values in the sample, they
could bias the average and prevent normal values from being considered
for the MCV list. Am I off base here?

Anyway, since you're now overwriting ndistinct_table, I would rename
it to ndistinct_remaining for clarity's sake.

This part doesn't use any loop variables, so should happen before the loop:

+   if (num_mcv > track_cnt)
+   num_mcv = track_cnt;

I think this comment
/* estimate # of occurrences in sample of a typical nonnull value */
belongs just above the calculation of avg_count.

> I think that perhaps maxmincount should also use the dynamic
> values_cnt_remaining rather than the static one.  After all, things
> included in the MCV don't get represented in the histogram.  When I've seen
> planning problems due to skewed distributions I also usually see redundant
> values in the histogram boundary list which I think should be in the MCV
> list instead. But I have not changed that here, pending discussion.

I think this is also a good idea, but I haven't thought it through. If
you don't go this route, I would move this section back out of the
loop as well.

-John Naylor