Quoting Josh Berkus:
> Mischa,
>
> > Okay, although given the track record of page-based sampling for
> > n-distinct, it's a bit like looking for your keys under the
> > streetlight, rather than in the alley where you dropped them :-)
>
> Bad analogy, but funny.
Bad analogy? Page-
John,
> But doesn't an index only sample one column at a time, whereas with
> page-based sampling, you can sample all of the columns at once?
Hmmm. Yeah, we're not currently doing that though. Another good idea ...
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
--
Josh Berkus wrote:
> Mischa,
>
> > Okay, although given the track record of page-based sampling for
> > n-distinct, it's a bit like looking for your keys under the streetlight,
> > rather than in the alley where you dropped them :-)
>
> Bad analogy, but funny.
> The issue with page-based vs. pure random sampling is th
Mischa,
> Okay, although given the track record of page-based sampling for
> n-distinct, it's a bit like looking for your keys under the streetlight,
> rather than in the alley where you dropped them :-)
Bad analogy, but funny.
The issue with page-based vs. pure random sampling is that to do, fo
Quoting Markus Schaber <[EMAIL PROTECTED]>:
> Hi, Josh,
>
> Josh Berkus wrote:
>
> > Yes, actually. We need 3 different estimation methods:
> > 1 for tables where we can sample a large % of pages (say, >= 0.1)
> > 1 for tables where we sample a small % of pages but are "easily
> > estimated"
> >
Hi, Josh,
Josh Berkus wrote:
> Yes, actually. We need 3 different estimation methods:
> 1 for tables where we can sample a large % of pages (say, >= 0.1)
> 1 for tables where we sample a small % of pages but are "easily estimated"
> 1 for tables which are not easily estimated but we can't afford
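To make the three-way split concrete, here is a minimal Python sketch of dispatching on the sampled-page fraction. The 0.1 threshold comes from Josh's list above; the singleton-based "easily estimated" test and the conservative fallback are placeholder assumptions of the sketch, not anything settled in this thread.

    from collections import Counter

    def duj1(n, N, counts):
        # Haas-Stokes first-order jackknife: the n*d / (n - f1 + f1*n/N)
        # formula quoted later in this thread.
        d = len(counts)
        f1 = sum(1 for c in counts.values() if c == 1)
        denom = n - f1 + f1 * n / N
        return n * d / denom if denom > 0 else float(d)

    def estimate_ndistinct(sample, N, page_fraction):
        counts = Counter(sample)
        n, d = len(sample), len(counts)
        f1 = sum(1 for c in counts.values() if c == 1)
        if page_fraction >= 0.1:
            # Method 1: large sample, trust the jackknife directly.
            return duj1(n, N, counts)
        if f1 < 0.5 * d:
            # Method 2: small sample but few singletons ("easily estimated").
            return duj1(n, N, counts)
        # Method 3: singleton-heavy small sample; assume distinct values
        # keep appearing at roughly the sampled rate (deliberately rough).
        return min(N, d * N / n)

    print(estimate_ndistinct([1, 2, 2, 3, 3, 3], N=1000, page_fraction=0.05))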
Mischa Sandberg wrote:
> Perhaps I can save you some time (yes, I have a degree in Math). If I
> understand correctly, you're trying to extrapolate from the correlation
> between a tiny sample and a larger sample. Introducing the tiny sample
> into any decision can only produce a less accurate result than
First I will comment on my original idea.
Second, I will give another, improved suggestion (an idea).
I hope that they will be useful for you.
(I don't know whether the first one was useful at all, because it showed
that I and some others of us are not very good with statistics :( )
I haven't looked ab
Quoting Josh Berkus:
> > >Perhaps I can save you some time (yes, I have a degree in Math). If I
> > >understand correctly, you're trying to extrapolate from the correlation
> > >between a tiny sample and a larger sample. Introducing the tiny sample
> > >into any decision can only produce a less accu
"Dave Held" <[EMAIL PROTECTED]> writes:
> > Actually, it's more to characterize how large of a sample
> > we need. For example, if we sample 0.005 of disk pages, and
> > get an estimate, and then sample another 0.005 of disk pages
> > and get an estimate which is not even close to the first
> >
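A sketch of Dave's two-sample check in Python: draw two independent samples of the same size and distrust that size if the estimates disagree. The naive scale-up estimator and the 2x tolerance are assumptions made for illustration.

    import random

    def naive_estimate(rows, frac, rng):
        # Naive scale-up from one random row sample (illustration only).
        sample = rng.sample(rows, max(1, int(len(rows) * frac)))
        return len(set(sample)) * len(rows) / len(sample)

    def sample_size_sufficient(rows, frac=0.005, tolerance=2.0, seed=0):
        # Two independent samples of the same size; if the estimates are
        # far apart, the sample is too small to trust. The tolerance is
        # an arbitrary ratio chosen for the sketch.
        rng = random.Random(seed)
        e1 = naive_estimate(rows, frac, rng)
        e2 = naive_estimate(rows, frac, rng)
        return max(e1, e2) / min(e1, e2) <= tolerance, (e1, e2)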
> -----Original Message-----
> From: Josh Berkus [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 27, 2005 10:25 AM
> To: Andrew Dunstan
> Cc: Mischa Sandberg; pgsql-perform; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> suggested?
Mischa,
> >Perhaps I can save you some time (yes, I have a degree in Math). If I
> >understand correctly, you're trying to extrapolate from the correlation
> >between a tiny sample and a larger sample. Introducing the tiny sample
> >into any decision can only produce a less accurate result than just
On Tue, 2005-04-26 at 15:00 -0700, Gurmeet Manku wrote:
> 2. In a single scan, it is possible to estimate n_distinct by using
> a very simple algorithm:
>
> "Distinct sampling for highly-accurate answers to distinct value
> queries and event reports" by Gibbons, VLDB 2001.
>
> http://ww
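For readers without the paper: a compressed Python sketch of the single-scan idea. This is the Wegman/Flajolet-style adaptive-sampling core (keep only values whose hash survives the current sampling level; halve the sample on overflow). Gibbons' distinct sampling additionally keeps per-value counts, which this sketch omits.

    import hashlib

    def _level(value):
        # Trailing zero bits of a stable 64-bit hash: a value "survives"
        # sampling level L with probability 2**-L, consistently across
        # occurrences (the job the hash function does in the paper).
        h = int.from_bytes(hashlib.sha1(repr(value).encode()).digest()[:8], "big")
        return 64 if h == 0 else (h & -h).bit_length() - 1

    def single_scan_ndistinct(values, max_buffer=256):
        level = 0
        kept = set()
        for v in values:
            if _level(v) >= level:
                kept.add(v)
                while len(kept) > max_buffer and level < 64:
                    level += 1   # halve the sampling rate, evict non-survivors
                    kept = {x for x in kept if _level(x) >= level}
        return len(kept) * 2 ** level

    print(single_scan_ndistinct(range(100_000)))   # roughly 100,000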
Quoting Andrew Dunstan <[EMAIL PROTECTED]>:
> After some more experimentation, I'm wondering about some sort of
> adaptive algorithm, a bit along the lines suggested by Marko
> Ristola, but limited to 2 rounds.
>
> The idea would be that we take a sample (either of fixed size, or
> some sm
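One possible shape for the two-round version, as a hedged Python sketch: re-sample once, at a larger size, only when the first sample is singleton-heavy. The fractions, the boost factor, and the singleton trigger are all placeholders, not Andrew's actual proposal.

    import random
    from collections import Counter

    def two_round_estimate(rows, first_frac=0.005, boost=10, seed=0):
        # Round 1: small sample. If it is singleton-heavy (the classic
        # sign of an unreliable n_distinct estimate), take one larger
        # sample and re-estimate. Two rounds only.
        rng = random.Random(seed)

        def one_round(frac):
            sample = rng.sample(rows, max(1, int(len(rows) * frac)))
            counts = Counter(sample)
            f1 = sum(1 for c in counts.values() if c == 1)
            return len(counts), f1, len(sample)

        d, f1, n = one_round(first_frac)
        if f1 > 0.5 * d:
            d, f1, n = one_round(min(1.0, first_frac * boost))
        return d * len(rows) / n   # naive scale-up, just for the sketch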
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> suggested?
>
> [...]
> 2. In a single scan, it is possible to estimate n_distinct by using
> a very simple algorithm:
>
> "Distinct sampling for highly-accurate answers to distinct val
Simon Riggs wrote:
> The comment
>  * Every value in the sample appeared more than once. Assume
>  * the column has just these values.
> doesn't seem to apply when using larger samples, as Josh is using.
> Looking at Josh's application it does seem likely that when taking a
> sample, all site
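For context, the comment Simon quotes guards one branch of a three-way decision. A simplified Python paraphrase (modeled loosely on PostgreSQL's compute_scalar_stats, not the actual code):

    from collections import Counter

    def estimate_stadistinct(sample, N):
        # n = sample rows, d = distinct values in sample,
        # f1 = values seen exactly once in the sample.
        counts = Counter(sample)
        n, d = len(sample), len(counts)
        f1 = sum(1 for c in counts.values() if c == 1)
        if f1 == d:
            return N          # nothing repeated: assume a unique column
        if f1 == 0:
            # "Every value in the sample appeared more than once. Assume
            # the column has just these values." -- the branch Simon
            # doubts: a large sample can satisfy this while still
            # missing rare values.
            return d
        return n * d / (n - f1 + f1 * n / N)   # Haas-Stokes fallback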
Tom Lane wrote:
> Josh Berkus writes:
> > Overall, our formula is inherently conservative of n_distinct. That is, I
> > believe that it is actually computing the *smallest* number of distinct
> > values which would reasonably produce the given sample, rather than the
> > *median* one. This is contrary to
Josh Berkus wrote:
> Simon, Tom:
> While it's not possible to get accurate estimates from a fixed size sample, I
> think it would be possible from a small but scalable sample: say, 0.1% of all
> data pages on large tables, up to the limit of maintenance_work_mem.
> Setting up these samples as a % of da
Hi everybody!
Perhaps the following papers are relevant to the discussion here
(their contact authors have been cc'd):
1. The following proposes effective algorithms for using block-level
sampling for n_distinct estimation:
"Effective use of block-level sampling in statistics estimat
On Mon, 2005-04-25 at 17:10 -0400, Tom Lane wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
> > On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote:
> >> It's not just the scan --- you also have to sort, or something like
> >> that, if you want to count distinct values. I doubt anyone is really
> >
Simon,
> Could it be that we have overlooked this simple explanation and that the
> Haas and Stokes equation is actually quite good, but just not being
> applied?
That's probably part of it, but I've tried Haas and Stokes on a pure random
sample and it's still bad, or more specifically overly conservative
On Sun, 2005-04-24 at 00:48 -0400, Tom Lane wrote:
> Josh Berkus writes:
> > Overall, our formula is inherently conservative of n_distinct. That is, I
> > believe that it is actually computing the *smallest* number of distinct
> > values which would reasonably produce the given sample, rather
> -----Original Message-----
> From: Andrew Dunstan [mailto:[EMAIL PROTECTED]
> Sent: Monday, April 25, 2005 3:43 PM
> To: josh@agliodbs.com
> Cc: pgsql-perform; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> suggested?
Simon Riggs <[EMAIL PROTECTED]> writes:
> On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote:
>> It's not just the scan --- you also have to sort, or something like
>> that, if you want to count distinct values. I doubt anyone is really
>> going to consider this a feasible answer for large tables.
> -----Original Message-----
> From: Josh Berkus [mailto:[EMAIL PROTECTED]
> Sent: Sunday, April 24, 2005 2:08 PM
> To: Andrew Dunstan
> Cc: Tom Lane; Greg Stark; Marko Ristola; pgsql-perform;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> suggested?
Guys,
> While it's not possible to get accurate estimates from a fixed size sample,
> I think it would be possible from a small but scalable sample: say, 0.1% of
> all data pages on large tables, up to the limit of maintenance_work_mem.
BTW, when I say "accurate estimates" here, I'm talking about
Simon, Tom:
While it's not possible to get accurate estimates from a fixed size sample, I
think it would be possible from a small but scalable sample: say, 0.1% of all
data pages on large tables, up to the limit of maintenance_work_mem.
Setting up these samples as a % of data pages, rather th
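Josh's sizing rule is simple arithmetic; a small Python sketch, with the page size and memory units as assumptions of the sketch:

    def pages_to_sample(relpages, maintenance_work_mem_kb, page_kb=8, frac=0.001):
        # 0.1% of the table's pages, capped by how many pages fit in
        # maintenance_work_mem.
        cap = maintenance_work_mem_kb // page_kb
        return min(max(1, int(relpages * frac)), cap)

    # A 10,000,000-page table (~80 GB at 8 KB/page) with a 256 MB budget:
    print(pages_to_sample(10_000_000, 256 * 1024))   # 10,000 pages, well under the 32,768-page cap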
On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
> > My suggested hack for PostgreSQL is to have an option to *not* sample,
> > just to scan the whole table and find n_distinct accurately.
> > ...
> > What price a single scan of a table, however large, wh
Simon Riggs <[EMAIL PROTECTED]> writes:
> My suggested hack for PostgreSQL is to have an option to *not* sample,
> just to scan the whole table and find n_distinct accurately.
> ...
> What price a single scan of a table, however large, when incorrect
> statistics could force scans and sorts to occur
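The "scan the whole table" option needs no sort if a hash table is acceptable (Tom's "or something like that"); the memory cost of that table is the practical objection. A trivial sketch:

    def exact_ndistinct(table_scan):
        # One sequential pass plus a hash table: exact, no sort, but the
        # table grows with the true number of distinct values, which is
        # what makes this painful on very large columns.
        seen = set()
        for value in table_scan:
            seen.add(value)
        return len(seen)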
Josh Berkus writes:
> Tom, how does our heuristic sampling work? Is it pure random sampling, or
> page sampling?
Manfred probably remembers better than I do, but I think the idea is
to approximate pure random sampling as best we can without actually
examining every page of the table.
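Roughly, the shape of that approximation in Python. This sketch samples with replacement and pretends every block holds the same number of tuples, both of which the real two-stage ANALYZE sampling (Vitter-style reservoir sampling within sampled blocks) avoids, so treat it as the idea only:

    import random

    def approx_random_sample(nblocks, tuples_per_block, ntuples_wanted, seed=0):
        # Pick random pages, then random tuples within those pages, so
        # most pages of the table are never read at all.
        rng = random.Random(seed)
        return [(rng.randrange(nblocks), rng.randrange(tuples_per_block))
                for _ in range(ntuples_wanted)]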
Folks,
> I wonder if this paper has anything that might help:
> http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I
> were more of a statistician I might be able to answer :-)
Actually, that paper looks *really* promising. Does anyone here have enough
math to solve for D(s
Andrew,
> The math in the paper does not seem to look at very low levels of q (=
> sample to pop ratio).
Yes, I think that's the failing. Mind you, I did more testing and found out
that for D/N ratios of 0.1 to 0.3, the formula only works within 5x accuracy
(which I would consider acceptable)
Here is my opinion.
I hope this helps.
Maybe there is no one good formula:
On the boolean type, there are at most 3 distinct values.
There is an upper bound for forenames in one country.
There is an upper bound for last names in one country.
There is a fixed number of states and postal codes in one country.
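Marko's observation translates naturally into a clamp applied after any estimator; a sketch with illustrative caps (the values and type names below are invented for the example, not from any real catalog):

    def clamp_estimate(estimate, column_type, declared_cap=None):
        # Some domains have knowable upper bounds on n_distinct, so no
        # estimator should be allowed to exceed them.
        caps = {"boolean": 3, "us_state": 50}
        cap = declared_cap if declared_cap is not None else caps.get(column_type)
        return min(estimate, cap) if cap is not None else estimate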
Josh Berkus writes:
> Overall, our formula is inherently conservative of n_distinct. That is, I
> believe that it is actually computing the *smallest* number of distinct
> values which would reasonably produce the given sample, rather than the
> *median* one. This is contrary to the notes in
Josh Berkus said:
>
>
> Well, unusual distributions are certainly tough. But I think the
> problem exists even for relatively well-distributed populations.
> Part of it is, I believe, the formula we are using:
>
> n*d / (n - f1 + f1*n/N)
>
[snip]
>
> This is so broken, in fact, that I'm wondering
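Plugging illustrative numbers into that formula shows the conservatism Josh describes. Here n is the sample size, d the distinct values seen, f1 the values seen exactly once, and N the table size; the inputs below are invented for the example:

    def duj1(n, d, f1, N):
        # The thread's formula (Haas-Stokes first-order jackknife).
        return n * d / (n - f1 + f1 * n / N)

    # A 3,000-row sample from a 1,000,000-row table, with 1,500 distinct
    # values of which 1,000 are singletons:
    print(duj1(n=3000, d=1500, f1=1000, N=1_000_000))   # ~2246.6
    # The estimate barely exceeds d itself, even though a singleton-heavy
    # sample is exactly what a column with hundreds of thousands of
    # distinct values would produce.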