Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Mischa Sandberg
Quoting Josh Berkus : > Mischa, > > > Okay, although given the track record of page-based sampling for > > n-distinct, it's a bit like looking for your keys under the > streetlight, > > rather than in the alley where you dropped them :-) > > Bad analogy, but funny. Bad analogy? Page-

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Josh Berkus
John, > But doesn't an index only sample one column at a time, whereas with > page-based sampling, you can sample all of the columns at once. Hmmm. Yeah, we're not currently doing that though. Another good idea ... -- --Josh Josh Berkus Aglio Database Solutions San Francisco --

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread John A Meinel
Josh Berkus wrote: Mischa, Okay, although given the track record of page-based sampling for n-distinct, it's a bit like looking for your keys under the streetlight, rather than in the alley where you dropped them :-) Bad analogy, but funny. The issue with page-based vs. pure random sampling is th

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Josh Berkus
Mischa, > Okay, although given the track record of page-based sampling for > n-distinct, it's a bit like looking for your keys under the streetlight, > rather than in the alley where you dropped them :-) Bad analogy, but funny. The issue with page-based vs. pure random sampling is that to do, fo

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Mischa Sandberg
Quoting Markus Schaber <[EMAIL PROTECTED]>: > Hi, Josh, > > Josh Berkus wrote: > > > Yes, actually. We need 3 different estimation methods: > > 1 for tables where we can sample a large % of pages (say, >= 0.1) > > 1 for tables where we sample a small % of pages but are "easily > estimated" > >

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Markus Schaber
Hi, Josh, Josh Berkus wrote: > Yes, actually. We need 3 different estimation methods: > 1 for tables where we can sample a large % of pages (say, >= 0.1) > 1 for tables where we sample a small % of pages but are "easily estimated" > 1 for tables which are not easily estimated by we can't afford

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-28 Thread Andrew Dunstan
Mischa Sandberg wrote: Perhaps I can save you some time (yes, I have a degree in Math). If I understand correctly, you're trying extrapolate from the correlation between a tiny sample and a larger sample. Introducing the tiny sample into any decision can only produce a less accurate result than

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-28 Thread Marko Ristola
First I will comment my original idea. Second I will give another improved suggestion (an idea). I hope, that they will be useful for you. (I don't know, wether the first one was useful at all because it showed, that I and some others of us are not very good with statistics :( ) I haven't looked ab

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-28 Thread Mischa Sandberg
Quoting Josh Berkus : > > >Perhaps I can save you some time (yes, I have a degree in Math). If I > > >understand correctly, you're trying extrapolate from the correlation > > >between a tiny sample and a larger sample. Introducing the tiny sample > > >into any decision can only produce a less accu

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-27 Thread Greg Stark
"Dave Held" <[EMAIL PROTECTED]> writes: > > Actually, it's more to characterize how large of a sample > > we need. For example, if we sample 0.005 of disk pages, and > > get an estimate, and then sample another 0.005 of disk pages > > and get an estimate which is not even close to the first > >

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-27 Thread Dave Held
> -Original Message- > From: Josh Berkus [mailto:[EMAIL PROTECTED] > Sent: Wednesday, April 27, 2005 10:25 AM > To: Andrew Dunstan > Cc: Mischa Sandberg; pgsql-perform; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation;

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-27 Thread Josh Berkus
Mischa, > >Perhaps I can save you some time (yes, I have a degree in Math). If I > >understand correctly, you're trying extrapolate from the correlation > >between a tiny sample and a larger sample. Introducing the tiny sample > >into any decision can only produce a less accurate result than just

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-27 Thread Simon Riggs
On Tue, 2005-04-26 at 15:00 -0700, Gurmeet Manku wrote: > 2. In a single scan, it is possible to estimate n_distinct by using > a very simple algorithm: > > "Distinct sampling for highly-accurate answers to distinct value > queries and event reports" by Gibbons, VLDB 2001. > > http://ww

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Mischa Sandberg
Quoting Andrew Dunstan <[EMAIL PROTECTED]>: > After some more experimentation, I'm wondering about some sort of > adaptive algorithm, a bit along the lines suggested by Marko Ristola, but limited to 2 rounds. > > The idea would be that we take a sample (either of fixed size, or > some sm

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Dave Held
TECTED] > Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks > suggested? > > [...] > 2. In a single scan, it is possible to estimate n_distinct by using > a very simple algorithm: > > "Distinct sampling for highly-accurate answers to distinct val

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Andrew Dunstan
Simon Riggs wrote: The comment * Every value in the sample appeared more than once. Assume * the column has just these values. doesn't seem to apply when using larger samples, as Josh is using. Looking at Josh's application it does seem likely that when taking a sample, all site

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Andrew Dunstan
Tom Lane wrote: Josh Berkus writes: Overall, our formula is inherently conservative of n_distinct. That is, I believe that it is actually computing the *smallest* number of distinct values which would reasonably produce the given sample, rather than the *median* one. This is contrary to

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Andrew Dunstan
Josh Berkus wrote: Simon, Tom: While it's not possible to get accurate estimates from a fixed size sample, I think it would be possible from a small but scalable sample: say, 0.1% of all data pages on large tables, up to the limit of maintenance_work_mem. Setting up these samples as a % of da

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Gurmeet Manku
Hi everybody! Perhaps the following papers are relevant to the discussion here (their contact authors have been cc'd): 1. The following proposes effective algorithms for using block-level sampling for n_distinct estimation: "Effective use of block-level sampling in statistics estimat

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Simon Riggs
On Mon, 2005-04-25 at 17:10 -0400, Tom Lane wrote: > Simon Riggs <[EMAIL PROTECTED]> writes: > > On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote: > >> It's not just the scan --- you also have to sort, or something like > >> that, if you want to count distinct values. I doubt anyone is really > >

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Josh Berkus
Simon, > Could it be that we have overlooked this simple explanation and that the > Haas and Stokes equation is actually quite good, but just not being > applied? That's probably part of it, but I've tried Haas and Stokes on a pure random sample and it's still bad, or more specifically overly co

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Simon Riggs
On Sun, 2005-04-24 at 00:48 -0400, Tom Lane wrote: > Josh Berkus writes: > > Overall, our formula is inherently conservative of n_distinct. That is, I > > believe that it is actually computing the *smallest* number of distinct > > values which would reasonably produce the given sample, rather

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Dave Held
> -Original Message- > From: Andrew Dunstan [mailto:[EMAIL PROTECTED] > Sent: Monday, April 25, 2005 3:43 PM > To: josh@agliodbs.com > Cc: pgsql-perform; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks > suggested? &

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Tom Lane
Simon Riggs <[EMAIL PROTECTED]> writes: > On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote: >> It's not just the scan --- you also have to sort, or something like >> that, if you want to count distinct values. I doubt anyone is really >> going to consider this a feasible answer for large tables.

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Dave Held
> -Original Message- > From: Josh Berkus [mailto:[EMAIL PROTECTED] > Sent: Sunday, April 24, 2005 2:08 PM > To: Andrew Dunstan > Cc: Tom Lane; Greg Stark; Marko Ristola; pgsql-perform; > pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] [PERFORM] Bad n_disti

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Josh Berkus
Guys, > While it's not possible to get accurate estimates from a fixed size sample, > I think it would be possible from a small but scalable sample: say, 0.1% of > all data pages on large tables, up to the limit of maintenance_work_mem. BTW, when I say "accurate estimates" here, I'm talking about

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Josh Berkus
Simon, Tom: While it's not possible to get accurate estimates from a fixed size sample, I think it would be possible from a small but scalable sample: say, 0.1% of all data pages on large tables, up to the limit of maintenance_work_mem. Setting up these samples as a % of data pages, rather th

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Simon Riggs
On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote: > Simon Riggs <[EMAIL PROTECTED]> writes: > > My suggested hack for PostgreSQL is to have an option to *not* sample, > > just to scan the whole table and find n_distinct accurately. > > ... > > What price a single scan of a table, however large, wh

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Tom Lane
Simon Riggs <[EMAIL PROTECTED]> writes: > My suggested hack for PostgreSQL is to have an option to *not* sample, > just to scan the whole table and find n_distinct accurately. > ... > What price a single scan of a table, however large, when incorrect > statistics could force scans and sorts to occu

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-24 Thread Tom Lane
Josh Berkus writes: > Tom, how does our heuristic sampling work? Is it pure random sampling, or > page sampling? Manfred probably remembers better than I do, but I think the idea is to approximate pure random sampling as best we can without actually examining every page of the table.

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-24 Thread Josh Berkus
Folks, > I wonder if this paper has anything that might help: > http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I > were more of a statistician I might be able to answer :-) Actually, that paper looks *really* promising. Does anyone here have enough math to solve for D(s

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-24 Thread Josh Berkus
Andrew, > The math in the paper does not seem to look at very low levels of q (= > sample to pop ratio). Yes, I think that's the failing. Mind you, I did more testing and found out that for D/N ratios of 0.1 to 0.3, the formula only works within 5x accuracy (which I would consider acceptable)

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-24 Thread Marko Ristola
Here is my opinion. I hope this helps. Maybe there is no one good formula: On boolean type, there are at most 3 distinct values. There is an upper bound for fornames in one country. There is an upper bound for last names in one country. There is a fixed number of states and postal codes in one coun

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Tom Lane
Josh Berkus writes: > Overall, our formula is inherently conservative of n_distinct. That is, I > believe that it is actually computing the *smallest* number of distinct > values which would reasonably produce the given sample, rather than the > *median* one. This is contrary to the notes in

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Andrew Dunstan
Josh Berkus said: > > > Well, unusual distributions are certainly tough. But I think the > problem exists even for relatively well-distributed populations. > Part of it is, I believe, the formula we are using: > > n*d / (n - f1 + f1*n/N) > [snip] > > This is so broken, in fact, that I'm wonderin