Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
On Sat, 2005-04-23 at 16:39 -0700, Josh Berkus wrote:
> Greg Stark wrote:
> > I looked into this a while back when we were talking about changing the
> > sampling method. The conclusions were discouraging. Fundamentally, using
> > constant sized samples of data for n_distinct is bogus. Constant sized
> > samples only work for things like the histograms that can be analyzed
> > through standard statistics population sampling which depends on the law of
> > large numbers.

ISTM Greg's comments are correct. There is no way to calculate this with
consistent accuracy when using a constant-sized sample. (If it were, then
people wouldn't bother to hold large databases...)

> Overall, our formula is inherently conservative of n_distinct. That is, I
> believe that it is actually computing the *smallest* number of distinct
> values which would reasonably produce the given sample, rather than the
> *median* one. This is contrary to the notes in analyze.c, which seem to
> think that we're *overestimating* n_distinct.

The only information you can determine from a sample is the smallest number
of distinct values that would reasonably produce the given sample. There is
no meaningful concept of a "median" one. (You do have an upper bound, the
number of rows in the table, but I cannot see any meaning in taking
(Nrows + estimated_n_distinct)/2.)

Even if you use Zipf's Law to predict the frequency of occurrence, you'd
still need to estimate the parameters of the distribution.

Most other RDBMSs make optimizer statistics collection an unsampled scan.
Some offer this as one of their options, along with the ability to define
the sample size as either a fixed number of rows or a fixed proportion of
the table.

My suggested hack for PostgreSQL is to have an option to *not* sample: just
scan the whole table and find n_distinct accurately. Then anybody who
doesn't like the estimated statistics has a clear path to take.

The problem of poorly derived base statistics is a difficult one. When
considering join estimates we already go to the trouble of including MFV
comparisons to ensure that an upper bound on join selectivity is known. If
the derived statistics are themselves inaccurate, the error propagates into
every other calculation in the optimizer. GIGO.

What price a single scan of a table, however large, when incorrect
statistics could force scans and sorts to occur when they aren't actually
needed?

Best Regards,
Simon Riggs
Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
People,

> Can someone whose math is more recent than calculus in 1989 take a look at
> that paper, and look at the formula toward the bottom of page 10, and see
> if we are correctly interpreting it? I'm particularly confused as to
> what "q" and "d-sub-n" represent. Thanks!

Actually, I managed to solve for these, and it appears we are using the
formula correctly. It's just a bad formula.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
Greg,

> I looked into this a while back when we were talking about changing the
> sampling method. The conclusions were discouraging. Fundamentally, using
> constant sized samples of data for n_distinct is bogus. Constant sized
> samples only work for things like the histograms that can be analyzed
> through standard statistics population sampling which depends on the law of
> large numbers.

Well, unusual distributions are certainly tough. But I think the problem
exists even for relatively well-distributed populations. Part of it is, I
believe, the formula we are using:

    n*d / (n - f1 + f1*n/N)

This is an estimation formula from Haas and Stokes in IBM Research Report
RJ 10025, and is called the DUJ1 formula.
(http://www.almaden.ibm.com/cs/people/peterh/jasa3rj.pdf)
It appears to suck. For example, take my table:

    rows: 26 million (N)
    distinct values: 3.4 million

I took a random sample of 1000 rows (n) from that table. It contained:

    968 values that occurred only once (f1)
    981 distinct values (d)

Any human being looking at that sample would assume a large number of
distinct values; after all, 96.8% of the values occurred only once. But the
formula gives us:

    1000*981 / ( 1000 - 968 + ( 968 * 1000/26000000 ) ) = 30620

This is obviously dramatically wrong, by a factor of 100. The math gets
worse as the sample size goes down:

Sample 250, 248 distinct values, 246 unique values:

    250*248 / ( 250 - 246 + ( 246 * 250 / 26000000 ) ) = 15490

Even in a case with an overwhelming majority of unique values, the formula
gets it wrong:

999 unique values in sample, 998 appearing only once:

    1000*999 / ( 1000 - 998 + ( 998 * 1000 / 26000000 ) ) = 490093

This means that, with a sample size of 1000, a table of 26 million rows can
never, under this formula, have more than about half a million distinct
values, unless the column is a unique column.

Overall, our formula is inherently conservative of n_distinct. That is, I
believe that it is actually computing the *smallest* number of distinct
values which would reasonably produce the given sample, rather than the
*median* one. This is contrary to the notes in analyze.c, which seem to
think that we're *overestimating* n_distinct.

This formula appears broken everywhere:

    Table: 969000 rows
    Distinct values: 374000
    Sample Size: 1000
    Unique values in sample: 938
    Values appearing only once: 918

    1000*938 / ( 1000 - 918 + ( 918 * 1000 / 969000 ) ) = 11308

Again, too small by a factor of 20x. This is so broken, in fact, that I'm
wondering if we've read the paper right? I've perused the paper on almaden,
and the DUJ1 formula appears considerably more complex than the formula
we're using.

Can someone whose math is more recent than calculus in 1989 take a look at
that paper, and look at the formula toward the bottom of page 10, and see
if we are correctly interpreting it? I'm particularly confused as to what
"q" and "d-sub-n" represent. Thanks!

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
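[Editorial note: the following is a minimal Python sketch, not part of the
original thread, of the estimator exactly as quoted above. It reproduces
Josh's figures, which suggests the arithmetic is being applied as written
and the weakness lies in the formula itself.]

# Estimator as quoted above: n*d / (n - f1 + f1*n/N)
def duj1(n, d, f1, N):
    """n = sample size, d = distinct values in the sample,
    f1 = values occurring exactly once in the sample, N = rows in the table."""
    return n * d / (n - f1 + f1 * n / N)

# Josh's examples against the 26,000,000-row table:
print(round(duj1(1000, 981, 968, 26_000_000)))  # ~30620 (actual distinct: 3.4 million)
print(round(duj1(250, 248, 246, 26_000_000)))   # ~15490
print(round(duj1(1000, 999, 998, 26_000_000)))  # ~490093
# And the 969,000-row table:
print(round(duj1(1000, 938, 918, 969_000)))     # ~11308 (actual distinct: 374,000)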
Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
Josh Berkus writes:

> ... I think the problem is in our heuristic sampling code. I'm not the first
> person to have this kind of a problem. Will be following up with tests ...

I looked into this a while back when we were talking about changing the
sampling method. The conclusions were discouraging. Fundamentally, using
constant sized samples of data for n_distinct is bogus. Constant sized
samples only work for things like the histograms that can be analyzed
through standard statistics population sampling which depends on the law of
large numbers.

n_distinct requires you to estimate how frequently very rare things occur.
You can't apply the law of large numbers because even a single instance of
a value out of a large pool alters the results disproportionately.

To get a valid estimate for n_distinct you would need to sample a fixed
percentage of the table. Not a fixed size sample. That just isn't
practical. Moreover, I think the percentage would have to be quite large.
Even if you sampled half the table I think the confidence interval would be
quite wide.

The situation is worsened because it's unclear how to interpolate values
for subsets of the table. If the histogram says you have a million records
and you're adding a clause that has a selectivity of 50% then half a
million is a good guess. But if what you care about is n_distinct and you
start with a million records with 1,000 distinct values and then apply a
clause that filters 50% of them, how do you estimate the resulting
n_distinct? 500 is clearly wrong, but 1,000 is wrong too. You could end up
with anywhere from 0 to 1,000 and you have no good way to figure out where
the truth lies.

So I fear this is fundamentally a hopeless situation. It's going to be
difficult to consistently get good plans for any queries that depend on
good estimates for n_distinct.

--
greg
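[Editorial note: a small simulation sketch, not part of the original thread,
to make the interpolation problem concrete. Two columns with the same row
count and the same n_distinct can give very different distinct counts after
a 50% filter, depending only on how the values are distributed.]

import random

def distinct_after_filter(values, keep_fraction, seed=1):
    """Randomly keep keep_fraction of the rows and count surviving distinct values."""
    rng = random.Random(seed)
    return len({v for v in values if rng.random() < keep_fraction})

N, D = 1_000_000, 1_000

# Evenly spread: each of the 1,000 values occurs ~1,000 times, so it is
# practically impossible for any value to lose all of its rows.
even = [i % D for i in range(N)]

# Skewed: one dominant value plus 999 values that occur exactly once, so each
# of those 999 survives the filter only about half the time.
skewed = [0] * (N - D + 1) + list(range(1, D))

print(distinct_after_filter(even, 0.5))    # ~1000 distinct values remain
print(distinct_after_filter(skewed, 0.5))  # ~500 distinct values remain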
Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
Hi.

Sometimes, if the random number generator that PostgreSQL uses isn't good
enough, the randomly selected pages for the statistics might not be random
enough.

Solaris is unknown to me. Maybe the random number generator used there
isn't good enough? Good statistics depend on good random numbers. So, for
example, if you have one million pages but the upper bound for the random
numbers is one hundred thousand pages, the statistics might get skewed. Or
some random number generator might have, for example, only 32000 different
values.

Regards,
Marko Ristola

Josh Berkus wrote:

  Tom, Any thoughts? This is really messing up query execution all across
  the database ... --Josh

  Here is the stats = 100 version. Notice that n_distinct has gone down.

  [pg_stats row for web_site_activity_fa.session_id snipped; it is quoted
  in full in Josh's messages below]
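[Editorial note: a tiny sketch, not part of the original thread, of the
failure mode Marko describes: if the generator's range is smaller than the
table, part of the table can never be sampled at all.]

import random

def highest_page_sampled(rng_upper_bound, n_draws=10_000, seed=1):
    """Draw page numbers from a generator whose output is capped at
    rng_upper_bound and report the highest page it ever selects."""
    rng = random.Random(seed)
    return max(rng.randrange(rng_upper_bound) for _ in range(n_draws))

total_pages = 1_000_000
# Marko's first example: random numbers bounded at 100,000 on a million-page table.
print(highest_page_sampled(100_000))  # ~99999: 90% of the table is never visited
# His second example: a generator with only 32000 distinct values.
print(highest_page_sampled(32_000))   # ~31999: about 97% of the table is never visited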
Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
> > Solaris is unknown to me. Maybe the used random number generator there
> > isn't good enough?
>
> Hmmm. Good point. Will have to test on Linux.

Nope: Linux 2.4.20:

test=# select tablename, attname, n_distinct from pg_stats where tablename = 'web_site_activity_fa';
       tablename       |  attname   | n_distinct
-----------------------+------------+------------
 web_site_activity_fa  | session_id |     626127

test=# select count(distinct session_id) from web_site_activity_fa;
  count
---------
 3174813
(1 row)

... I think the problem is in our heuristic sampling code. I'm not the
first person to have this kind of a problem. Will be following up with
tests ...

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
Marko,

> Sometimes, if the random number generator, that PostgreSQL uses,
> isn't good enough, the randomly selected pages for the statistics
> might not be random enough.
>
> Solaris is unknown to me. Maybe the used random number generator there
> isn't good enough?

Hmmm. Good point. Will have to test on Linux.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
Tom, Any thoughts? This is really messing up query execution all across the database ... --Josh > Here is the stats = 100 version. Notice that n_distinct has gone down. > > schemaname | tablename | attname | null_frac | avg_width | > n_distinct | most_common_vals > > |most_common_freqs > | histogram_bounds | > > correlation >---+- public | web_site_activity_fa | > session_id | 0 | 8 | 96107 | > {4393922,6049228,6026260,4394034,60341,4393810,2562999,2573850,3006299,4705 >488,2561499,4705258,3007378,4705490,60327,60352,2560950,2567640,2569852,3006 >604,4394329,2570739,2406633,2407292,3006356,4393603,4394121,6449083,2565815, >4387881,2406770,2407081,2564340,3007328,2406578,2407295,2562813,2567603,4387 >835,71014,2566253,2566900,6103079,2289424,2407597,2567627,2568333,3457448,23 >450,23670,60743,70739,2406818,2406852,2407511,2562816,3007446,6306095,60506, >71902,591543,1169136,1447077,2285047,2406830,2573964,6222758,61393,70955,709 >86,71207,71530,262368,2289213,2406899,2567361,2775952,3006824,4387864,623982 >5,6244853,6422152,1739,58600,179293,278473,488407,1896390,2286976,2407020,25 >46720,2677019,2984333,3006133,3007497,3310286,3631413,3801909,4366116,438802 >5} > > {0.0017,0.00146667,0.0013,0.0011,0.00093,0.0009,0.0008,0.0008,0.000 >73,0.00073,0.0007,0.00063,0.0006,0.0006,0.00057,0.00057, >0.00057,0.00057,0.00057,0.00057,0.00057,0.00053,0.00 >05,0.0005,0.0005,0.0005,0.0005,0.0005,0.00047,0.00047,0.00043,0. >00043,0.00043,0.00043,0.0004,0.0004,0.0004,0.0004,0.0004,0.00036 >6667,0.00037,0.00037,0.00037,0.00033,0.00033,0.00033 >,0.00033,0.00033,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0. >0003,0.0003,0.0003,0.00027,0.00027,0.00027,0.00027,0.0002666 >67,0.00027,0.00027,0.00027,0.00027,0.00023,0.00023,0 >.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.000 >23,0.00023,0.00023,0.00023,0.00023,0.00023,0.0002333 >33,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0 >002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002} > > {230,58907,88648,156764,216759,240405,264601,289047,312630,339947,364452,38 >6486,409427,434075,455140,475759,500086,521530,544703,680376,981066,1313419, >1712592,1860151,1882452,1905328,1927504,1948159,1970054,1990408,2014501,2038 >573,2062786,2087163,2110129,2132196,2155657,2181058,2204976,2228575,2256229, >2283897,2352453,2407153,2457716,2542081,2572119,2624133,2699592,2771254,2832 >224,2908151,2951500,3005088,3032889,3137244,3158685,3179395,3203681,3261587, >3304359,3325577,3566688,3621357,3645094,3718667,3740821,3762386,3783169,3804 >593,3826503,3904589,3931012,3957675,4141934,4265118,4288568,4316898,4365625, >4473965,4535752,4559700,4691802,4749478,5977208,6000272,6021416,6045939,6078 >912,6111900,6145155,6176422,6206627,6238291,6271270,6303067,6334117,6365200, >6395250,6424719,6888329} > > | 0.41744 -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
Tom, > What's the histogram itself look like? (I'd like to see the whole > pg_stats row not just part of it ...) There's probably no point in > showing the target=1000 version, but maybe target=100 would be > informative. Here is the stats = 100 version. Notice that n_distinct has gone down. schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals |most_common_freqs | histogram_bounds | correlation +--++---+---++---+--+-+- public | web_site_activity_fa | session_id | 0 | 8 | 96107 | {4393922,6049228,6026260,4394034,60341,4393810,2562999,2573850,3006299,4705488,2561499,4705258,3007378,4705490,60327,60352,2560950,2567640,2569852,3006604,4394329,2570739,2406633,2407292,3006356,4393603,4394121,6449083,2565815,4387881,2406770,2407081,2564340,3007328,2406578,2407295,2562813,2567603,4387835,71014,2566253,2566900,6103079,2289424,2407597,2567627,2568333,3457448,23450,23670,60743,70739,2406818,2406852,2407511,2562816,3007446,6306095,60506,71902,591543,1169136,1447077,2285047,2406830,2573964,6222758,61393,70955,70986,71207,71530,262368,2289213,2406899,2567361,2775952,3006824,4387864,6239825,6244853,6422152,1739,58600,179293,278473,488407,1896390,2286976,2407020,2546720,2677019,2984333,3006133,3007497,3310286,3631413,3801909,4366116,4388025} | {0.0017,0.00146667,0.0013,0.0011,0.00093,0.0009,0.0008,0.0008,0.00073,0.00073,0.0007,0.00063,0.0006,0.0006,0.00057,0.00057,0.00057,0.00057,0.00057,0.00057,0.00057,0.00053,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.00047,0.00047,0.00043,0.00043,0.00043,0.00043,0.0004,0.0004,0.0004,0.0004,0.0004,0.00037,0.00037,0.00037,0.00037,0.00033,0.00033,0.00033,0.00033,0.00033,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.00027,0.00027,0.00027,0.00027,0.00027,0.00027,0.00027,0.00027,0.00027,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00
Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
Josh Berkus writes:

> As you can see, n_distinct estimation is off by a factor of 10x and it's
> causing query planning problems. Any suggested hacks to improve the
> histogram on this?

What's the histogram itself look like? (I'd like to see the whole pg_stats
row, not just part of it ...) There's probably no point in showing the
target=1000 version, but maybe target=100 would be informative.

regards, tom lane
Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
> -----Original Message-----
> From: Josh Berkus [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, April 19, 2005 2:09 PM
> To: pgsql-perform
> Subject: [PERFORM] Bad n_distinct estimation; hacks suggested?
>
> [...]
> (BTW, increasing the stats to 1000 only doubles n_distinct,
> and doesn't solve the problem)

Speaking of which, is there a reason why the statistics target is limited
to 1000? Performance?

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637  320.253.7800  800.752.8129