Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Simon Riggs
On Sat, 2005-04-23 at 16:39 -0700, Josh Berkus wrote:
> Greg Stark wrote:
> > I looked into this a while back when we were talking about changing the
> > sampling method. The conclusions were discouraging. Fundamentally, using
> > constant sized samples of data for n_distinct is bogus. Constant sized
> > samples only work for things like the histograms that can be analyzed
> > through standard statistics population sampling which depends on the law of
> > large numbers.

ISTM Greg's comments are correct. There is no way to calculate this with
consistent accuracy when using a constant sized sample. (If it were,
then people wouldn't bother to hold large databases...)

> Overall, our formula is inherently conservative of n_distinct.   That is, I 
> believe that it is actually computing the *smallest* number of distinct 
> values which would reasonably produce the given sample, rather than the 
> *median* one.  This is contrary to the notes in analyze.c, which seem to 
> think that we're *overestimating* n_distinct.  

The only information you can determine from a sample is the smallest
number of distinct values that would reasonably produce the given
sample. There is no meaningful concept of a median one... (You do have
an upper bound: the number of rows in the table, but I cannot see any
meaning in taking (Nrows+estimatedN_distinct)/2.)
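
To make that concrete, here is a toy Python sketch (an illustration only: it
assumes, purely for the sake of the example, a uniform distribution over D
values in a 26-million-row table, i.e. Josh's numbers). A true n_distinct
anywhere from half a million up to a fully unique column yields an essentially
identical 1000-row sample, which is why nothing beyond a lower bound can be
read off the sample:

import random

N = 26_000_000                     # table rows, from Josh's example (for context only)
n = 1_000                          # sample size
for D in (500_000, 1_000_000, 3_400_000, 26_000_000):   # candidate true n_distinct
    # Uniform toy table: each value occupies about N/D rows, so for n << N
    # sampling n rows is effectively drawing n values uniformly from D choices.
    sample = [random.randrange(D) for _ in range(n)]
    counts = {}
    for v in sample:
        counts[v] = counts.get(v, 0) + 1
    d = len(counts)                                   # distinct values seen in the sample
    f1 = sum(1 for c in counts.values() if c == 1)    # values seen exactly once
    print(f"true D = {D:>10,}:  d = {d},  f1 = {f1}")

In every case d and f1 come out within a handful of 1000, so the sample alone
gives no way to choose between those candidates.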

Even if you use Zipf's Law to predict the frequency of occurrence, you'd
still need to estimate the parameters for the distribution.

Most other RDBMSs make optimizer statistics collection an unsampled
scan. Some offer this as one of their options, as well as the ability to
define the sample size in terms of a fixed number of rows or a fixed
proportion of the table.

My suggested hack for PostgreSQL is to have an option to *not* sample,
just to scan the whole table and find n_distinct accurately. Then
anybody who doesn't like the estimated statistics has a clear path to
take. 

The problem of poorly derived base statistics is a difficult one. When
considering join estimates we already go to the trouble of including MFV
comparisons to ensure an upper bound on join selectivity is known. If
the statistics derived are themselves inaccurate, the error propagation
touches every other calculation in the optimizer. GIGO.

What price a single scan of a table, however large, when incorrect
statistics could force scans and sorts to occur when they aren't
actually needed?

Best Regards, Simon Riggs




Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Josh Berkus
People,

> Can someone whose math is more recent than calculus in 1989 take a look at
> that paper, and look at the formula toward the bottom of page 10, and see
> if we are correctly interpreting it?    I'm particularly confused as to
> what "q" and "d-sub-n" represent.  Thanks!

Actually, I managed to solve for these and it appears we are using the formula 
correctly.  It's just a bad formula.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Josh Berkus
Greg,

> I looked into this a while back when we were talking about changing the
> sampling method. The conclusions were discouraging. Fundamentally, using
> constant sized samples of data for n_distinct is bogus. Constant sized
> samples only work for things like the histograms that can be analyzed
> through standard statistics population sampling which depends on the law of
> large numbers.

Well, unusual distributions are certainly tough.  But I think the problem 
exists even for relatively well-distributed populations.  Part of it is, I 
believe, the formula we are using:

n*d / (n - f1 + f1*n/N)

This is an estimation formula from Haas and Stokes in IBM Research Report RJ 
10025, and is called the DUJ1 formula. 
(http://www.almaden.ibm.com/cs/people/peterh/jasa3rj.pdf)  It appears to 
suck.   For example, take my table:

rows: 26 million (N)
distinct values: 3.4 million

I took a random sample of 1000 rows (n) from that table.   It contained:
968 values that occurred only once (f1)
981 distinct values (d)

Any human being looking at that sample would assume a large number of distinct 
values; after all, 96.8% of the values occurred only once.   But the formula 
gives us:

1000*981 / ( 1000 - 968 + ( 968 * 1000/26000000 ) ) = 30620

This is obviously dramatically wrong, by a factor of 100.  The math gets worse 
as the sample size goes down:

Sample 250, 248 distinct values, 246 unique values:

250*248 / ( 250 - 246 + ( 246 * 250 / 26000000 ) ) = 15490

Even in a case with an overwhelming majority of unique values, the formula gets 
it wrong:

999 unique values in sample
998 appearing only once:

1000*999 / ( 1000 - 998 + ( 998 * 1000 / 26000000 ) ) = 490093

This means that, with a sample size of 1000, a table of 26 million rows can 
never be estimated by this formula to have more than about half a million 
distinct values, unless the column is a unique column.

Overall, our formula is inherently conservative of n_distinct.   That is, I 
believe that it is actually computing the *smallest* number of distinct 
values which would reasonably produce the given sample, rather than the 
*median* one.  This is contrary to the notes in analyze.c, which seem to 
think that we're *overestimating* n_distinct.  

This formula appears broken everywhere:

Table: 969000 rows
Distinct values: 374000
Sample Size: 1000
Unique values in sample: 938
Values appearing only once: 918

1000*938 / ( 1000 - 918 + ( 918 * 1000 / 969000 ) ) = 11308

Again, too small, this time by a factor of more than 30.
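
For reference, a short Python transcription of the estimator exactly as written
above reproduces all of the figures in this message (N is the quoted table
size, 26,000,000 or 969,000):

def duj1(n, d, f1, N):
    # The Haas & Stokes estimator exactly as quoted above:
    #   n*d / (n - f1 + f1*n/N)
    return n * d / (n - f1 + f1 * n / N)

print(duj1(1000, 981, 968, 26_000_000))   # ~30620   (actual n_distinct ~3.4 million)
print(duj1(250,  248, 246, 26_000_000))   # ~15490
print(duj1(1000, 999, 998, 26_000_000))   # ~490093  (the near-unique case)
print(duj1(1000, 938, 918, 969_000))      # ~11308   (actual ~374,000)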

This is so broken, in fact, that I'm wondering if we've read the paper right.  
I've perused the paper on the Almaden site, and the DUJ1 formula appears 
considerably more complex than the formula we're using.  

Can someone whose math is more recent than calculus in 1989 take a look at 
that paper, and look at the formula toward the bottom of page 10, and see if 
we are correctly interpreting it?  I'm particularly confused as to what "q" 
and "d-sub-n" represent.  Thanks!

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Greg Stark

Josh Berkus  writes:

> ... I think the problem is in our heuristic sampling code.  I'm not the first 
> person to have this kind of a problem.  Will be following up with tests ...

I looked into this a while back when we were talking about changing the
sampling method. The conclusions were discouraging. Fundamentally, using
constant sized samples of data for n_distinct is bogus. Constant sized samples
only work for things like the histograms that can be analyzed through standard
statistics population sampling which depends on the law of large numbers.

n_distinct requires you to estimate how frequently very rare things occur. You
can't apply the law of large numbers because even a single instance of a value
out of a large pool alters the results disproportionately.

To get a valid estimate for n_distinct you would need to sample a fixed
percentage of the table. Not a fixed size sample. That just isn't practical.
Moreover, I think the percentage would have to be quite large. Even if you
sampled half the table I think the confidence interval would be quite wide.
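
A toy Python sketch of why the percentage matters (the table in which every
value appears exactly ten times is invented purely for this illustration): as
the table grows 16x, a fixed 1000-row sample reports almost the same number of
distinct values every time, while a fixed 5% sample keeps seeing a roughly
constant fraction of the true n_distinct and so could at least be extrapolated:

import random

def distinct_in_sample(N, sample_rows):
    # Toy table of N rows in which every value appears exactly 10 times, so
    # the true n_distinct is N // 10.  Picking a row uniformly at random yields
    # a value uniform over those N // 10 values, so we just draw values directly.
    D = N // 10
    return len(set(random.randrange(D) for _ in range(sample_rows)))

for N in (1_000_000, 4_000_000, 16_000_000):
    fixed = distinct_in_sample(N, 1_000)       # fixed-size sample: 1000 rows
    pct   = distinct_in_sample(N, N // 20)     # fixed-percentage sample: 5%
    print(f"N={N:>10,}  true D={N // 10:>9,}  d(1000 rows)={fixed:>4}  d(5%)={pct:>8,}")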

The situation is worsened because it's unclear how to interpolate values for
subsets of the table. If the histogram says you have a million records and
you're adding a clause that has a selectivity of 50% then half a million is a
good guess. But if what you care about is n_distinct and you start with a
million records with 1,000 distinct values and then apply a clause that
filters 50% of them, how do you estimate the resulting n_distinct? 500 is
clearly wrong, but 1,000 is wrong too. You could end up with anywhere from 0
to 1,000 and you have no good way to figure out where the truth lies.
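
A toy illustration of that last point (Python; the million-row, thousand-value
table is invented, and nothing like this runs in the planner): two clauses with
identical 50% selectivity can leave anywhere from half the distinct values to
essentially all of them, depending on how the clause correlates with the column:

import random

D, N = 1_000, 1_000_000
values = [random.randrange(D) for _ in range(N)]   # 1M rows, 1,000 distinct values

# Clause A: discards 50% of the rows, independently of the value.
kept_a = [v for v in values if random.random() < 0.5]
# Clause B: also discards 50% of the rows, but is perfectly correlated with the value.
kept_b = [v for v in values if v < D // 2]

print(len(kept_a), len(set(kept_a)))   # ~500000 rows, ~1000 distinct values survive
print(len(kept_b), len(set(kept_b)))   # ~500000 rows, exactly 500 distinct values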

So I fear this is fundamentally a hopeless situation. It's going to be
difficult to consistently get good plans for any queries that depend on good
estimates for n_distinct.

-- 
greg




Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-22 Thread Marko Ristola
Hi.
Sometimes, if the random number generator, that PostgreSQL uses,
isn't good enough, the randomly selected pages for the statistics
might not be random enough.
Solaris is unknown to me. Maybe the used random number generator there
isn't good enough?
Good statistics depend on good random numbers.
So, for example, if you have one million pages but the upper bound for the
random numbers is one hundred thousand pages, the statistics might come out
skewed.

Or some random number generators have, for example, only 32000 different values.
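
A toy sketch of that failure mode (Python; the 32,768 limit is just the classic
RAND_MAX on some platforms, assumed here only for illustration): if the
generator behind page selection could only produce that many distinct values, a
million-page table would never have its later pages sampled at all:

import random

table_pages  = 1_000_000
weak_rng_max = 32_768        # hypothetical generator that can only yield 0..32767
sample_size  = 3_000

# Page numbers a block sampler would pick if it relied on such a generator.
picked = [random.randrange(weak_rng_max) for _ in range(sample_size)]
print(max(picked), "of", table_pages - 1)   # never reaches any page past 32767
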
Regards,
Marko Ristola
Josh Berkus wrote:
Tom,
Any thoughts?   This is really messing up query execution all across the 
database ...

--Josh
 

Here is the stats = 100 version.   Notice that n_distinct has gone down.

[pg_stats row for public.web_site_activity_fa.session_id snipped; n_distinct
is now 96107 -- the full output appears in Josh's 2005-04-19 message later in
the thread]



Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-22 Thread Josh Berkus

> > Solaris is unknown to me. Maybe the used random number generator there
> > isn't good enough?
>
> Hmmm.  Good point.  Will have to test on Linux.

Nope:

Linux 2.4.20:

test=# select tablename, attname, n_distinct from pg_stats where tablename = 
'web_site_activity_fa';
       tablename       |  attname   | n_distinct
-----------------------+------------+------------
 web_site_activity_fa  | session_id |     626127

test=# select count(distinct session_id) from web_site_activity_fa;
  count
---------
 3174813
(1 row)

... I think the problem is in our heuristic sampling code.  I'm not the first 
person to have this kind of a problem.  Will be following up with tests ...

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-22 Thread Josh Berkus
Marko,

> Sometimes, if the random number generator, that PostgreSQL uses,
> isn't good enough, the randomly selected pages for the statistics
> might not be random enough.
>
> Solaris is unknown to me. Maybe the used random number generator there
> isn't good enough?

Hmmm.  Good point.  Will have to test on Linux.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-20 Thread Josh Berkus
Tom,

Any thoughts?   This is really messing up query execution all across the 
database ...

--Josh

> Here is the stats = 100 version.   Notice that n_distinct has gone down.
>
> [pg_stats row for public.web_site_activity_fa.session_id snipped; n_distinct
> is now 96107 -- the full output appears in the 2005-04-19 message later in
> the thread]

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-19 Thread Josh Berkus
Tom,

> What's the histogram itself look like?  (I'd like to see the whole
> pg_stats row not just part of it ...)  There's probably no point in
> showing the target=1000 version, but maybe target=100 would be
> informative.

Here is the stats = 100 version.   Notice that n_distinct has gone down.

 schemaname        | public
 tablename         | web_site_activity_fa
 attname           | session_id
 null_frac         | 0
 avg_width         | 8
 n_distinct        | 96107
 most_common_vals  | {4393922,6049228,6026260,4394034,60341,4393810,2562999,2573850,3006299,4705488,2561499,4705258,3007378,4705490,60327,60352,2560950,2567640,2569852,3006604,4394329,2570739,2406633,2407292,3006356,4393603,4394121,6449083,2565815,4387881,2406770,2407081,2564340,3007328,2406578,2407295,2562813,2567603,4387835,71014,2566253,2566900,6103079,2289424,2407597,2567627,2568333,3457448,23450,23670,60743,70739,2406818,2406852,2407511,2562816,3007446,6306095,60506,71902,591543,1169136,1447077,2285047,2406830,2573964,6222758,61393,70955,70986,71207,71530,262368,2289213,2406899,2567361,2775952,3006824,4387864,6239825,6244853,6422152,1739,58600,179293,278473,488407,1896390,2286976,2407020,2546720,2677019,2984333,3006133,3007497,3310286,3631413,3801909,4366116,4388025}
 most_common_freqs | {0.0017,0.00146667,0.0013,0.0011,0.00093,0.0009,0.0008,0.0008,0.00073,0.00073,0.0007,0.00063,0.0006,0.0006,0.00057,0.00057,0.00057,0.00057,0.00057,0.00057,0.00057,0.00053,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.00047,0.00047,0.00043,0.00043,0.00043,0.00043,0.0004,0.0004,0.0004,0.0004,0.0004,0.000366667,0.00037,0.00037,0.00037,0.00033,0.00033,0.00033,0.00033,0.00033,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.00027,0.00027,0.00027,0.00027,0.000266667,0.00027,0.00027,0.00027,0.00027,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.000233333,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002}
 histogram_bounds  | {230,58907,88648,156764,216759,240405,264601,289047,312630,339947,364452,386486,409427,434075,455140,475759,500086,521530,544703,680376,981066,1313419,1712592,1860151,1882452,1905328,1927504,1948159,1970054,1990408,2014501,2038573,2062786,2087163,2110129,2132196,2155657,2181058,2204976,2228575,2256229,2283897,2352453,2407153,2457716,2542081,2572119,2624133,2699592,2771254,2832224,2908151,2951500,3005088,3032889,3137244,3158685,3179395,3203681,3261587,3304359,3325577,3566688,3621357,3645094,3718667,3740821,3762386,3783169,3804593,3826503,3904589,3931012,3957675,4141934,4265118,4288568,4316898,4365625,4473965,4535752,4559700,4691802,4749478,5977208,6000272,6021416,6045939,6078912,6111900,6145155,6176422,6206627,6238291,6271270,6303067,6334117,6365200,6395250,6424719,6888329}
 correlation       | 0.41744

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-19 Thread Tom Lane
Josh Berkus  writes:
> As you can see, n_distinct estimation is off by a factor of 10x and it's 
> causing query planning problems.   Any suggested hacks to improve the 
> histogram on this?

What's the histogram itself look like?  (I'd like to see the whole
pg_stats row not just part of it ...)  There's probably no point in
showing the target=1000 version, but maybe target=100 would be
informative.

regards, tom lane



Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-19 Thread Dave Held
> -----Original Message-----
> From: Josh Berkus [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, April 19, 2005 2:09 PM
> To: pgsql-perform
> Subject: [PERFORM] Bad n_distinct estimation; hacks suggested?
> 
> [...]
> (BTW, increasing the stats to 1000 only doubles n_distinct, 
> and doesn't solve the problem)

Speaking of which, is there a reason why statistics are limited
to 1000?  Performance?

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East,  Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
