Hi, Josh,
Josh Berkus wrote:
Yes, actually. We need 3 different estimation methods:
1 for tables where we can sample a large % of pages (say, >= 0.1)
1 for tables where we sample a small % of pages but are easily estimated
1 for tables which are not easily estimated but we can't afford to
Quoting Markus Schaber [EMAIL PROTECTED]:
Hi, Josh,
Josh Berkus wrote:
Yes, actually. We need 3 different estimation methods:
1 for tables where we can sample a large % of pages (say, >= 0.1)
1 for tables where we sample a small % of pages but are easily estimated
1 for tables which are not easily estimated but we can't afford to
Mischa,
Okay, although given the track record of page-based sampling for
n-distinct, it's a bit like looking for your keys under the streetlight,
rather than in the alley where you dropped them :-)
Bad analogy, but funny.
The issue with page-based vs. pure random sampling is that to do, for
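The preview cuts off here, but the usual argument is about I/O: because rows are scattered across pages, a purely random row sample touches almost as many pages as a much larger sample would. A back-of-the-envelope sketch (the table geometry below is invented for illustration):

# Expected pages touched when r rows are sampled uniformly at random from a
# table of P pages (approximating each sampled row as landing on a random page).
def expected_pages_touched(r, P):
    return P * (1 - (1 - 1.0 / P) ** r)

P = 100_000                               # ~10M rows at 100 rows per page
for r in (30_000, 100_000, 300_000):      # 0.3%, 1%, 3% of the rows
    print(r, round(expected_pages_touched(r, P)))
# Sampling just 3% of the rows already touches roughly 95% of the pages.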
John,
But doesn't an index only sample one column at a time, whereas with
page-based sampling, you can sample all of the columns at once?
Hmmm. Yeah, we're not currently doing that though. Another good idea ...
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Quoting Josh Berkus josh@agliodbs.com:
Mischa,
Okay, although given the track record of page-based sampling for
n-distinct, it's a bit like looking for your keys under the
streetlight,
rather than in the alley where you dropped them :-)
Bad analogy, but funny.
Bad
Quoting Josh Berkus josh@agliodbs.com:
Perhaps I can save you some time (yes, I have a degree in Math). If I
understand correctly, you're trying to extrapolate from the correlation
between a tiny sample and a larger sample. Introducing the tiny sample
into any decision can only produce a less accurate result
First I will comment my original idea.
Second I will give another improved suggestion (an idea).
I hope, that they will be useful for you.
(I don't know whether the first one was useful at all, because it showed
that I and some others of us are not very good with statistics :( )
I haven't looked
Mischa Sandberg wrote:
Perhaps I can save you some time (yes, I have a degree in Math). If I
understand correctly, you're trying to extrapolate from the correlation
between a tiny sample and a larger sample. Introducing the tiny sample
into any decision can only produce a less accurate result
On Tue, 2005-04-26 at 15:00 -0700, Gurmeet Manku wrote:
2. In a single scan, it is possible to estimate n_distinct by using
a very simple algorithm:
"Distinct sampling for highly-accurate answers to distinct value queries and event reports" by Gibbons, VLDB 2001.
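For readers who don't want to chase the paper: the core single-scan idea is hash-threshold sampling. The sketch below is a simplified illustration of that idea, not the algorithm from the Gibbons paper itself; the hash choice and the sample cap are arbitrary.

import hashlib
import random

def _h(v):
    # 64-bit hash of the value's text form (a stand-in for a real column hash)
    return int.from_bytes(hashlib.sha256(repr(v).encode()).digest()[:8], "big")

def estimate_distinct(values, max_sample=1000):
    """One-pass estimate: keep only values whose hash is divisible by
    2**level, raising level whenever the kept set overflows, then scale
    the kept count back up by 2**level at the end."""
    level = 0
    sample = set()
    for v in values:
        if _h(v) % (1 << level) == 0:
            sample.add(v)
            while len(sample) > max_sample:
                level += 1                 # halve the keep probability
                sample = {s for s in sample if _h(s) % (1 << level) == 0}
    return len(sample) * (1 << level)

# Rough check: one million rows drawn from about 50,000 distinct values.
data = [random.randrange(50_000) for _ in range(1_000_000)]
print(estimate_distinct(data))             # should land near 50,000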
Mischa,
Perhaps I can save you some time (yes, I have a degree in Math). If I
understand correctly, you're trying to extrapolate from the correlation
between a tiny sample and a larger sample. Introducing the tiny sample
into any decision can only produce a less accurate result than just
taking
-----Original Message-----
From: Josh Berkus [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 27, 2005 10:25 AM
To: Andrew Dunstan
Cc: Mischa Sandberg; pgsql-perform; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
suggested?
[...]
Actually
On Sun, 2005-04-24 at 00:48 -0400, Tom Lane wrote:
Josh Berkus josh@agliodbs.com writes:
Overall, our formula is inherently conservative of n_distinct. That is, I
believe that it is actually computing the *smallest* number of distinct
values which would reasonably produce the given
Simon,
Could it be that we have overlooked this simple explanation and that the
Haas and Stokes equation is actually quite good, but just not being
applied?
That's probably part of it, but I've tried Haas and Stokes on a pure random
sample and it's still bad, or more specifically overly
Hi everybody!
Perhaps the following papers are relevant to the discussion here
(their contact authors have been cc'd):
1. The following proposes effective algorithms for using block-level
sampling for n_distinct estimation:
Effective use of block-level sampling in statistics
Josh Berkus wrote:
Simon, Tom:
While it's not possible to get accurate estimates from a fixed size sample, I
think it would be possible from a small but scalable sample: say, 0.1% of all
data pages on large tables, up to the limit of maintenance_work_mem.
Setting up these samples as a % of
Tom Lane wrote:
Josh Berkus josh@agliodbs.com writes:
Overall, our formula is inherently conservative of n_distinct. That is, I
believe that it is actually computing the *smallest* number of distinct
values which would reasonably produce the given sample, rather than the
*median* one.
Simon Riggs wrote:
The comment
* Every value in the sample appeared more than once. Assume
* the column has just these values.
doesn't seem to apply when using larger samples, as Josh is using.
Looking at Josh's application it does seem likely that when taking a
sample, all
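For context, here is a rough Python paraphrase of the branch Simon is quoting, combined with the estimator discussed further down the thread. It is a simplification for illustration, not the analyze.c code; the clamp to N is my own safety assumption.

def estimate_n_distinct(sample_counts, N):
    """sample_counts: occurrence count of each distinct value in the sample
    N: estimated total rows in the table"""
    n = sum(sample_counts)                           # rows actually sampled
    d = len(sample_counts)                           # distinct values in the sample
    f1 = sum(1 for c in sample_counts if c == 1)     # values seen exactly once

    if f1 == 0:
        # Every value in the sample appeared more than once:
        # assume the column has just these values.
        return d

    # Otherwise the Haas/Stokes "Duj1" estimator quoted later in the thread:
    #   n*d / (n - f1 + f1*n/N)
    return min(N, n * d / (n - f1 + f1 * n / N))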
Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
[...]
2. In a single scan, it is possible to estimate n_distinct by using
a very simple algorithm:
"Distinct sampling for highly-accurate answers to distinct value queries and event reports" by Gibbons, VLDB 2001.
http
Quoting Andrew Dunstan [EMAIL PROTECTED]:
After some more experimentation, I'm wondering about some sort of
adaptive algorithm, a bit along the lines suggested by Marko
Ristola, but limited to 2 rounds.
The idea would be that we take a sample (either of fixed size, or
some small
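The preview cuts off before the details, but a two-round scheme along these lines might look like the sketch below. The stopping rule (enlarge the sample when the first round is dominated by singleton values) is my assumption, not necessarily what Andrew had in mind.

import random
from collections import Counter

def two_round_sample(rows, first_size=1000, factor=10):
    """Round 1: a small sample. Round 2, only if needed: a larger sample,
    taken when the first round looks too uncertain for n_distinct."""
    sample = random.sample(rows, min(first_size, len(rows)))
    f1 = sum(1 for c in Counter(sample).values() if c == 1)

    # Assumed criterion: if most sampled values appear only once, the small
    # sample says little about n_distinct, so pay for a bigger one.
    if f1 > 0.5 * len(sample):
        sample = random.sample(rows, min(first_size * factor, len(rows)))
    return sample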
On Sat, 2005-04-23 at 16:39 -0700, Josh Berkus wrote:
Greg Stark wrote
I looked into this a while back when we were talking about changing the
sampling method. The conclusions were discouraging. Fundamentally, using
constant sized samples of data for n_distinct is bogus. Constant sized
Simon Riggs [EMAIL PROTECTED] writes:
My suggested hack for PostgreSQL is to have an option to *not* sample,
just to scan the whole table and find n_distinct accurately.
...
What price a single scan of a table, however large, when incorrect
statistics could force scans and sorts to occur when
On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote:
Simon Riggs [EMAIL PROTECTED] writes:
My suggested hack for PostgreSQL is to have an option to *not* sample,
just to scan the whole table and find n_distinct accurately.
...
What price a single scan of a table, however large, when
Simon, Tom:
While it's not possible to get accurate estimates from a fixed size sample, I
think it would be possible from a small but scalable sample: say, 0.1% of all
data pages on large tables, up to the limit of maintenance_work_mem.
Setting up these samples as a % of data pages, rather
Guys,
While it's not possible to get accurate estimates from a fixed size sample,
I think it would be possible from a small but scalable sample: say, 0.1% of
all data pages on large tables, up to the limit of maintenance_work_mem.
BTW, when I say accurate estimates here, I'm talking about
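A minimal sketch of the sizing rule being proposed (0.1% of data pages, capped by maintenance_work_mem). The 8 kB page size and the kB unit for maintenance_work_mem match PostgreSQL defaults, but treating the cap as "however many pages fit in that memory" is my simplification.

def target_sample_pages(total_pages, maintenance_work_mem_kb=65536,
                        page_size=8192, frac=0.001):
    """Sample 0.1% of the table's data pages, but never more pages than
    would fit within maintenance_work_mem."""
    pages_by_frac = int(total_pages * frac)
    pages_by_mem = (maintenance_work_mem_kb * 1024) // page_size
    return max(1, min(pages_by_frac, pages_by_mem))

# A 1,000,000-page (~8 GB) table with a 64 MB maintenance_work_mem:
print(target_sample_pages(1_000_000))      # 1000 pages (the fraction, not the cap, binds)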
-----Original Message-----
From: Josh Berkus [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 24, 2005 2:08 PM
To: Andrew Dunstan
Cc: Tom Lane; Greg Stark; Marko Ristola; pgsql-perform;
pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
suggested?
Simon Riggs [EMAIL PROTECTED] writes:
On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote:
It's not just the scan --- you also have to sort, or something like
that, if you want to count distinct values. I doubt anyone is really
going to consider this a feasible answer for large tables.
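In other words, an exact n_distinct needs every value to pass through either a hash structure or a sort. A toy illustration of the two options (in Python rather than the executor, obviously):

def exact_distinct_hash(values):
    # Hash-based: one scan, but worst-case memory proportional to n_distinct.
    return len(set(values))

def exact_distinct_sort(values):
    # Sort-based: bounded state after the sort, but O(n log n) work and,
    # for a real table, potentially a huge external sort.
    count, prev = 0, object()
    for v in sorted(values):
        if v != prev:
            count, prev = count + 1, v
    return count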
-----Original Message-----
From: Andrew Dunstan [mailto:[EMAIL PROTECTED]
Sent: Monday, April 25, 2005 3:43 PM
To: josh@agliodbs.com
Cc: pgsql-perform; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
suggested?
Josh Berkus wrote:
Simon
Here is my opinion.
I hope this helps.
Maybe there is no one good formula:
For a boolean type, there are at most 3 distinct values.
There is an upper bound for forenames in one country.
There is an upper bound for last names in one country.
There is a fixed number of states and postal codes in one
Andrew,
The math in the paper does not seem to look at very low levels of q (the
sample-to-population ratio).
Yes, I think that's the failing. Mind you, I did more testing and found out
that for D/N ratios of 0.1 to 0.3, the formula only works within 5x accuracy
(which I would consider acceptable)
Folks,
I wonder if this paper has anything that might help:
http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I
were more of a statistician I might be able to answer :-)
Actually, that paper looks *really* promising. Does anyone here have enough
math to solve for
Josh Berkus josh@agliodbs.com writes:
Tom, how does our heuristic sampling work? Is it pure random sampling, or
page sampling?
Manfred probably remembers better than I do, but I think the idea is
to approximate pure random sampling as best we can without actually
examining every page of the
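For anyone following along: ANALYZE approximates a uniform row sample without scanning the whole table by layering block selection under Vitter-style row sampling. The sketch below shows only the simplest ingredient, one-pass reservoir sampling over a row stream; it is not the analyze.c algorithm.

import random

def reservoir_sample(row_iter, k):
    """Algorithm R: a uniform random sample of k rows from a stream of
    unknown length, in one pass and O(k) memory."""
    reservoir = []
    for i, row in enumerate(row_iter):
        if i < k:
            reservoir.append(row)
        else:
            j = random.randint(0, i)       # inclusive of both ends
            if j < k:
                reservoir[j] = row         # keep the new row with probability k/(i+1)
    return reservoir

print(reservoir_sample(range(10_000), 5))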
Josh Berkus josh@agliodbs.com writes:
... I think the problem is in our heuristic sampling code. I'm not the first
person to have this kind of a problem. Will be following up with tests ...
I looked into this a while back when we were talking about changing the
sampling method. The
Greg,
I looked into this a while back when we were talking about changing the
sampling method. The conclusions were discouraging. Fundamentally, using
constant sized samples of data for n_distinct is bogus. Constant sized
samples only work for things like the histograms that can be analyzed
People,
Can someone whose math is more recent than calculus in 1989 take a look at
that paper, and look at the formula toward the bottom of page 10, and see
if we are correctly interpreting it? I'm particularly confused as to
what q and d-sub-n represent. Thanks!
Actually, I managed to
Josh Berkus said:
Well, unusual distributions are certainly tough. But I think the
problem exists even for relatively well-distributed populations.
Part of it is, I believe, the formula we are using:
n*d / (n - f1 + f1*n/N)
[snip]
This is so broken, in fact, that I'm wondering if
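To make the formula concrete (with n = rows in the sample, d = distinct values in the sample, f1 = values seen exactly once, and N = total rows, the meanings used in analyze.c), here is a quick numeric sweep; the numbers are invented, not taken from Josh's tables:

n, N, d = 3000, 10_000_000, 1000      # sampled rows, table rows, distinct in sample
for f1 in (0, 250, 500, 750):         # values appearing exactly once in the sample
    est = n * d / (n - f1 + f1 * n / N)
    print(f1, round(est))
# f1=0   -> 1000
# f1=250 -> ~1091
# f1=500 -> ~1200
# f1=750 -> ~1333
# Even when three-quarters of the distinct values in the sample are singletons,
# the estimate barely rises above d, which is the conservatism complained about
# upthread.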
Josh Berkus josh@agliodbs.com writes:
Overall, our formula is inherently conservative of n_distinct. That is, I
believe that it is actually computing the *smallest* number of distinct
values which would reasonably produce the given sample, rather than the
*median* one. This is contrary
Marko,
Sometimes, if the random number generator that PostgreSQL uses
isn't good enough, the randomly selected pages for the statistics
might not be random enough.
Solaris is unknown to me. Maybe the random number generator used there
isn't good enough?
Hmmm. Good point. Will have to test on Linux.
Solaris is unknown to me. Maybe the random number generator used there
isn't good enough?
Hmmm. Good point. Will have to test on Linux.
Nope:
Linux 2.4.20:
test=# select tablename, attname, n_distinct from pg_stats where tablename =
'web_site_activity_fa';
tablename |
Hi.
Sometimes, if the random number generator that PostgreSQL uses
isn't good enough, the randomly selected pages for the statistics
might not be random enough.
Solaris is unknown to me. Maybe the random number generator used there
isn't good enough?
Good statistics depend on good random
Tom,
Any thoughts? This is really messing up query execution all across the
database ...
--Josh
Here is the stats = 100 version. Notice that n_distinct has gone down.
schemaname | tablename | attname | null_frac | avg_width |
n_distinct |
Folks,
Params: PostgreSQL 8.0.1 on Solaris 10
Statistics = 500
(tablenames have been changed to protect an NDA)
e1=# select tablename, null_frac, correlation, n_distinct from pg_stats where
tablename = 'clickstream1' and attname = 'session_id';
tablename | null_frac | correlation |
-----Original Message-----
From: Josh Berkus [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 19, 2005 2:09 PM
To: pgsql-perform
Subject: [PERFORM] Bad n_distinct estimation; hacks suggested?
[...]
(BTW, increasing the stats to 1000 only doubles n_distinct,
and doesn't solve the problem
Josh Berkus josh@agliodbs.com writes:
As you can see, n_distinct estimation is off by a factor of 10x and it's
causing query planning problems. Any suggested hacks to improve the
histogram on this?
What's the histogram itself look like? (I'd like to see the whole
pg_stats row not just