Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Markus Schaber
Hi, Josh, Josh Berkus wrote: Yes, actually. We need 3 different estimation methods: 1 for tables where we can sample a large % of pages (say, = 0.1) 1 for tables where we sample a small % of pages but are easily estimated 1 for tables which are not easily estimated by we can't afford to

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Mischa Sandberg
Quoting Markus Schaber [EMAIL PROTECTED]: Hi, Josh, Josh Berkus wrote: Yes, actually. We need 3 different estimation methods: 1 for tables where we can sample a large % of pages (say, = 0.1) 1 for tables where we sample a small % of pages but are easily estimated 1 for tables

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Josh Berkus
Mischa, Okay, although given the track record of page-based sampling for n-distinct, it's a bit like looking for your keys under the streetlight, rather than in the alley where you dropped them :-) Bad analogy, but funny. The issue with page-based vs. pure random sampling is that to do, for

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Josh Berkus
John, But doesn't an index only sample one column at a time, whereas with page-based sampling, you can sample all of the columns at once. Hmmm. Yeah, we're not currently doing that though. Another good idea ... -- --Josh Josh Berkus Aglio Database Solutions San Francisco

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Mischa Sandberg
Quoting Josh Berkus josh@agliodbs.com: Mischa, Okay, although given the track record of page-based sampling for n-distinct, it's a bit like looking for your keys under the streetlight, rather than in the alley where you dropped them :-) Bad analogy, but funny. Bad

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-28 Thread Mischa Sandberg
Quoting Josh Berkus josh@agliodbs.com: Perhaps I can save you some time (yes, I have a degree in Math). If I understand correctly, you're trying extrapolate from the correlation between a tiny sample and a larger sample. Introducing the tiny sample into any decision can only produce a

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-28 Thread Marko Ristola
First I will comment my original idea. Second I will give another improved suggestion (an idea). I hope, that they will be useful for you. (I don't know, wether the first one was useful at all because it showed, that I and some others of us are not very good with statistics :( ) I haven't looked

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-28 Thread Andrew Dunstan
Mischa Sandberg wrote: Perhaps I can save you some time (yes, I have a degree in Math). If I understand correctly, you're trying extrapolate from the correlation between a tiny sample and a larger sample. Introducing the tiny sample into any decision can only produce a less accurate result

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-27 Thread Simon Riggs
On Tue, 2005-04-26 at 15:00 -0700, Gurmeet Manku wrote: 2. In a single scan, it is possible to estimate n_distinct by using a very simple algorithm: Distinct sampling for highly-accurate answers to distinct value queries and event reports by Gibbons, VLDB 2001.

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-27 Thread Josh Berkus
Mischa, Perhaps I can save you some time (yes, I have a degree in Math). If I understand correctly, you're trying extrapolate from the correlation between a tiny sample and a larger sample. Introducing the tiny sample into any decision can only produce a less accurate result than just taking

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-27 Thread Dave Held
-Original Message- From: Josh Berkus [mailto:[EMAIL PROTECTED] Sent: Wednesday, April 27, 2005 10:25 AM To: Andrew Dunstan Cc: Mischa Sandberg; pgsql-perform; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested? [...] Actually

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Simon Riggs
On Sun, 2005-04-24 at 00:48 -0400, Tom Lane wrote: Josh Berkus josh@agliodbs.com writes: Overall, our formula is inherently conservative of n_distinct. That is, I believe that it is actually computing the *smallest* number of distinct values which would reasonably produce the given

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Josh Berkus
Simon, Could it be that we have overlooked this simple explanation and that the Haas and Stokes equation is actually quite good, but just not being applied? That's probably part of it, but I've tried Haas and Stokes on a pure random sample and it's still bad, or more specifically overly

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Gurmeet Manku
Hi everybody! Perhaps the following papers are relevant to the discussion here (their contact authors have been cc'd): 1. The following proposes effective algorithms for using block-level sampling for n_distinct estimation: Effective use of block-level sampling in statistics

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Andrew Dunstan
Josh Berkus wrote: Simon, Tom: While it's not possible to get accurate estimates from a fixed size sample, I think it would be possible from a small but scalable sample: say, 0.1% of all data pages on large tables, up to the limit of maintenance_work_mem. Setting up these samples as a % of

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Andrew Dunstan
Tom Lane wrote: Josh Berkus josh@agliodbs.com writes: Overall, our formula is inherently conservative of n_distinct. That is, I believe that it is actually computing the *smallest* number of distinct values which would reasonably produce the given sample, rather than the *median* one.

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Andrew Dunstan
Simon Riggs wrote: The comment * Every value in the sample appeared more than once. Assume * the column has just these values. doesn't seem to apply when using larger samples, as Josh is using. Looking at Josh's application it does seem likely that when taking a sample, all

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Dave Held
: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested? [...] 2. In a single scan, it is possible to estimate n_distinct by using a very simple algorithm: Distinct sampling for highly-accurate answers to distinct value queries and event reports by Gibbons, VLDB 2001. http

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Mischa Sandberg
Quoting Andrew Dunstan [EMAIL PROTECTED]: After some more experimentation, I'm wondering about some sort of adaptive algorithm, a bit along the lines suggested by Marko Ristola, but limited to 2 rounds. The idea would be that we take a sample (either of fixed size, or some small

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Simon Riggs
On Sat, 2005-04-23 at 16:39 -0700, Josh Berkus wrote: Greg Stark wrote I looked into this a while back when we were talking about changing the sampling method. The conclusions were discouraging. Fundamentally, using constant sized samples of data for n_distinct is bogus. Constant sized

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes: My suggested hack for PostgreSQL is to have an option to *not* sample, just to scan the whole table and find n_distinct accurately. ... What price a single scan of a table, however large, when incorrect statistics could force scans and sorts to occur when

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Simon Riggs
On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: My suggested hack for PostgreSQL is to have an option to *not* sample, just to scan the whole table and find n_distinct accurately. ... What price a single scan of a table, however large, when

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Josh Berkus
Simon, Tom: While it's not possible to get accurate estimates from a fixed size sample, I think it would be possible from a small but scalable sample: say, 0.1% of all data pages on large tables, up to the limit of maintenance_work_mem. Setting up these samples as a % of data pages, rather

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Josh Berkus
Guys, While it's not possible to get accurate estimates from a fixed size sample, I think it would be possible from a small but scalable sample: say, 0.1% of all data pages on large tables, up to the limit of maintenance_work_mem. BTW, when I say accurate estimates here, I'm talking about

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Dave Held
-Original Message- From: Josh Berkus [mailto:[EMAIL PROTECTED] Sent: Sunday, April 24, 2005 2:08 PM To: Andrew Dunstan Cc: Tom Lane; Greg Stark; Marko Ristola; pgsql-perform; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes: On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote: It's not just the scan --- you also have to sort, or something like that, if you want to count distinct values. I doubt anyone is really going to consider this a feasible answer for large tables.

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Dave Held
-Original Message- From: Andrew Dunstan [mailto:[EMAIL PROTECTED] Sent: Monday, April 25, 2005 3:43 PM To: josh@agliodbs.com Cc: pgsql-perform; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested? Josh Berkus wrote: Simon

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-24 Thread Marko Ristola
Here is my opinion. I hope this helps. Maybe there is no one good formula: On boolean type, there are at most 3 distinct values. There is an upper bound for fornames in one country. There is an upper bound for last names in one country. There is a fixed number of states and postal codes in one

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-24 Thread Josh Berkus
Andrew, The math in the paper does not seem to look at very low levels of q (= sample to pop ratio). Yes, I think that's the failing. Mind you, I did more testing and found out that for D/N ratios of 0.1 to 0.3, the formula only works within 5x accuracy (which I would consider acceptable)

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-24 Thread Josh Berkus
Folks, I wonder if this paper has anything that might help: http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I were more of a statistician I might be able to answer :-) Actually, that paper looks *really* promising. Does anyone here have enough math to solve for

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-24 Thread Tom Lane
Josh Berkus josh@agliodbs.com writes: Tom, how does our heuristic sampling work? Is it pure random sampling, or page sampling? Manfred probably remembers better than I do, but I think the idea is to approximate pure random sampling as best we can without actually examining every page of the

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Greg Stark
Josh Berkus josh@agliodbs.com writes: ... I think the problem is in our heuristic sampling code. I'm not the first person to have this kind of a problem. Will be following up with tests ... I looked into this a while back when we were talking about changing the sampling method. The

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Josh Berkus
Greg, I looked into this a while back when we were talking about changing the sampling method. The conclusions were discouraging. Fundamentally, using constant sized samples of data for n_distinct is bogus. Constant sized samples only work for things like the histograms that can be analyzed

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Josh Berkus
People, Can someone whose math is more recent than calculus in 1989 take a look at that paper, and look at the formula toward the bottom of page 10, and see if we are correctly interpreting it?    I'm particularly confused as to what q and d-sub-n represent.  Thanks! Actually, I managed to

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Andrew Dunstan
Josh Berkus said: Well, unusual distributions are certainly tough. But I think the problem exists even for relatively well-distributed populations. Part of it is, I believe, the formula we are using: n*d / (n - f1 + f1*n/N) [snip] This is so broken, in fact, that I'm wondering if

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Tom Lane
Josh Berkus josh@agliodbs.com writes: Overall, our formula is inherently conservative of n_distinct. That is, I believe that it is actually computing the *smallest* number of distinct values which would reasonably produce the given sample, rather than the *median* one. This is contrary

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-22 Thread Josh Berkus
Marko, Sometimes, if the random number generator, that PostgreSQL uses, isn't good enough, the randomly selected pages for the statistics might not be random enough. Solaris is unknown to me. Maybe the used random number generator there isn't good enough? Hmmm. Good point. Will have to

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-22 Thread Josh Berkus
Solaris is unknown to me. Maybe the used random number generator there isn't good enough? Hmmm. Good point. Will have to test on Linux. Nope: Linux 2.4.20: test=# select tablename, attname, n_distinct from pg_stats where tablename = 'web_site_activity_fa'; tablename |

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-22 Thread Marko Ristola
Hi. Sometimes, if the random number generator, that PostgreSQL uses, isn't good enough, the randomly selected pages for the statistics might not be random enough. Solaris is unknown to me. Maybe the used random number generator there isn't good enough? Good statistics depend on good random

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-20 Thread Josh Berkus
Tom, Any thoughts? This is really messing up query execution all across the database ... --Josh Here is the stats = 100 version. Notice that n_distinct has gone down. schemaname | tablename | attname | null_frac | avg_width | n_distinct |

[PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-19 Thread Josh Berkus
Folks, Params: PostgreSQL 8.0.1 on Solaris 10 Statistics = 500 (tablenames have been changed to protect NDA) e1=# select tablename, null_frac, correlation, n_distinct from pg_stats where tablename = 'clickstream1' andattname = 'session_id'; tablename | null_frac | correlation |

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-19 Thread Dave Held
-Original Message- From: Josh Berkus [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 19, 2005 2:09 PM To: pgsql-perform Subject: [PERFORM] Bad n_distinct estimation; hacks suggested? [...] (BTW, increasing the stats to 1000 only doubles n_distinct, and doesn't solve the problem

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-19 Thread Tom Lane
Josh Berkus josh@agliodbs.com writes: As you can see, n_distinct estimation is off by a factor of 10x and it's causing query planning problems. Any suggested hacks to improve the histogram on this? What's the histogram itself look like? (I'd like to see the whole pg_stats row not just