Re: [HACKERS] ANALYZE sampling is too good

2014-03-08 Thread Bruce Momjian

I assume we never came up with a TODO from this thread:

---

On Tue, Dec  3, 2013 at 11:30:44PM +0000, Greg Stark wrote:
 At multiple conferences I've heard about people trying all sorts of
 gymnastics to avoid ANALYZE, which they expect to take too long and
 consume too much I/O. This is an especially big complaint after upgrades:
 their new database performs poorly until the new statistics are in, and
 having done pg_upgrade precisely to avoid extended downtime, they then
 complain about ANALYZE taking hours.
 
 I always gave the party line that ANALYZE only takes a small
 constant-sized sample, so even very large tables should be very quick.
 But after hearing the same story again at Heroku I looked into it a
 bit further. I was kind of shocked by the numbers.
 
 ANALYZE takes a sample of 300 * statistics_target rows. That sounds
 pretty reasonable, but with default_statistics_target set to 100 that's
 30,000 rows. If I'm reading the code right, it takes this sample by
 sampling 30,000 blocks and then (if the table is large enough) taking
 an average of one row per block. Each block is 8192 bytes, so that
 means it's reading 240MB of each table. That's a lot more than I
 realized.
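
For concreteness, the arithmetic above works out as follows (a quick sketch in
plain Python, not code from the PostgreSQL source):

    target_rows = 300 * 100            # 30,000 rows with the default target of 100
    block_size = 8192                  # bytes per PostgreSQL block
    print(target_rows * block_size)    # 245,760,000 bytes, i.e. roughly 240 MB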
 
 It means if your table is anywhere up to 240MB you're effectively
 doing a full table scan and then throwing out nearly all the data
 read.
 
 Worse, my experience with the posix_fadvise benchmarking is that on
 spinning media reading one out of every 16 blocks takes about the same
 time as reading them all. Presumably this is because the seek time
 between tracks dominates and reading one out of every 16 blocks is
 still reading every track. So in fact if your table is up to about
 3-4G ANALYZE is still effectively going to do a full table scan, at
 least as far as I/O time goes.
 
 The current algorithm seems like it was designed with a 100G+ table in
 mind, but the consequences for the more common 100M-100G tables weren't
 really considered. Consider what this means for partitioned tables: if
 users partition a terabyte table into 10 partitions, ANALYZE will
 suddenly want to use 10x as much I/O, which seems like a perverse
 consequence.
 
 I'm not sure I have a prescription, but my general feeling is that
 we're spending an awful lot of resources going after a statistically
 valid sample when we could spend a lot less and get something
 90% as good. Or, if we're really going to read that much data, we
 might as well use more of the rows we find.
 
 -- 
 greg
 
 

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] ANALYZE sampling is too good

2013-12-17 Thread Heikki Linnakangas

On 12/17/2013 12:06 AM, Jeff Janes wrote:

On Mon, Dec 9, 2013 at 3:14 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:


  I took a stab at using posix_fadvise() in ANALYZE. It turned out to be
very easy, patch attached. Your mileage may vary, but I'm seeing a nice
gain from this on my laptop. Taking a 30,000-page sample of a table with
717717 pages (i.e. slightly larger than RAM), ANALYZE takes about 6 seconds
without the patch, and less than a second with the patch, with
effective_io_concurrency=10. If anyone with a good test data set loaded
would like to test this and post some numbers, that would be great.


Performance is often chaotic near transition points, so I try to avoid data
sets that are slightly bigger or slightly smaller than RAM (or some other
limit).

Do you know how many io channels your SSD has (or whatever the term of art
is for SSD drives)?


No idea. It's an Intel 335.


On a RAID with 12 spindles, analyzing pgbench_accounts at scale 1000 (13GB)
with 4 GB of RAM goes from ~106 seconds to ~19 seconds.

However, I'm not sure what problem we want to solve here.


The case that Greg Stark mentioned in the email starting this thread is 
doing a database-wide ANALYZE after an upgrade. In that use case, you 
certainly want to get it done as quickly as possible, using all the 
available resources.



I certainly would not wish to give a background maintenance process
permission to confiscate my entire RAID throughput for its own
operation.


Then don't set effective_io_concurrency. If you're worried about that, 
you probably wouldn't want any other process to monopolize the RAID 
array either.



Perhaps this could only be active for explicit analyze, and only if
vacuum_cost_delay=0?


That would be a bit weird, because ANALYZE in general doesn't obey 
vacuum_cost_delay. Maybe it should, though...



Perhaps there should be something like "alter background role autovac set
...".  Otherwise we are going to end up with an autovacuum_* shadow
parameter for many of our parameters; see the autovacuum_work_mem discussions.


Yeah, so it seems.

- Heikki




Re: [HACKERS] ANALYZE sampling is too good

2013-12-16 Thread Jeff Janes
On Mon, Dec 9, 2013 at 3:14 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:



  I took a stab at using posix_fadvise() in ANALYZE. It turned out to be
 very easy, patch attached. Your mileage may vary, but I'm seeing a nice
 gain from this on my laptop. Taking a 30,000-page sample of a table with
 717717 pages (i.e. slightly larger than RAM), ANALYZE takes about 6 seconds
 without the patch, and less than a second with the patch, with
 effective_io_concurrency=10. If anyone with a good test data set loaded
 would like to test this and post some numbers, that would be great.


Performance is often chaotic near transition points, so I try to avoid data
sets that are slightly bigger or slightly smaller than RAM (or some other
limit).

Do you know how many io channels your SSD has (or whatever the term of art
is for SSD drives)?

On a RAID with 12 spindles, analyzing pgbench_accounts at scale 1000 (13GB)
with 4 GB of RAM goes from ~106 seconds to ~19 seconds.

However, I'm not sure what problem we want to solve here.  I certainly
would not wish to give a background maintenance process permission to
confiscate my entire RAID throughput for its own operation.  Perhaps this
could only be active for explicit analyze, and only if vacuum_cost_delay=0?

Perhaps there should be something like "alter background role autovac set
...".  Otherwise we are going to end up with an autovacuum_* shadow
parameter for many of our parameters; see the autovacuum_work_mem discussions.

Cheers,

Jeff
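
(For readers unfamiliar with the technique being benchmarked here: the win
comes from advising the kernel about blocks the sampler will soon need before
actually reading them, so the reads overlap. Below is a minimal, hypothetical
sketch of that prefetch pattern on an ordinary file, using Python's
os.posix_fadvise on Linux; the real patch does this in C inside ANALYZE's
block sampling, and the names below are illustrative only.)

    import os

    BLOCK = 8192   # PostgreSQL block size
    WINDOW = 10    # roughly analogous to effective_io_concurrency

    def read_sample_blocks(path, block_numbers):
        # Toy sketch: prefetch a window of upcoming blocks, then read each one.
        fd = os.open(path, os.O_RDONLY)
        try:
            # Prime the window so the first reads also overlap with prefetch I/O.
            for blkno in block_numbers[:WINDOW]:
                os.posix_fadvise(fd, blkno * BLOCK, BLOCK, os.POSIX_FADV_WILLNEED)
            data = []
            for i, blkno in enumerate(block_numbers):
                # Advise the kernel about the block WINDOW positions ahead.
                if i + WINDOW < len(block_numbers):
                    ahead = block_numbers[i + WINDOW]
                    os.posix_fadvise(fd, ahead * BLOCK, BLOCK, os.POSIX_FADV_WILLNEED)
                data.append(os.pread(fd, BLOCK, blkno * BLOCK))
            return data
        finally:
            os.close(fd)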


Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Florian Pflug
Here's an analysis of Jeff Janes' simple example of a table where our
n_distinct estimate is way off.

On Dec11, 2013, at 00:03 , Jeff Janes jeff.ja...@gmail.com wrote:
 create table baz as select floor(random()*10000000), md5(random()::text) from
 generate_series(1,100000000);
 create table baz2 as select * from baz order by floor;
 create table baz3 as select * from baz order by md5(floor::text);
 
 baz unclustered, baz2 is clustered with perfect correlation, baz3 is 
 clustered but without correlation.
 
 After analyzing all of them:
 
 select tablename, n_distinct,correlation  from pg_stats where tablename  like 
 'baz%' and attname='floor'  ;
  tablename | n_distinct  | correlation
 ---+-+-
  baz   | 8.56006e+06 |  0.00497713
  baz2  |  333774 |   1
  baz3  |  361048 |  -0.0118147

I think I understand what's going on here. I worked with a reduced test case
of 1e7 rows containing random values between 0 and 5e5, and instrumented
analyze to print the raw ndistinct and nmultiple values of the sample
population (i.e. the number of distinct values in the sample, and the number
of distinct values which appeared more than once). I've also considered only
baz and baz2, and thus removed the then-unnecessary md5 column. To account for
the reduced table sizes, I adjusted default_statistics_target to 10 instead of
100. The resulting estimates are then

 tablename | n_distinct (est) | n_distinct (act)
-----------+------------------+------------------
 baz       |           391685 |           500000
 baz2      |            36001 |           500000

ANALYZE assumes that both tables contain roughly 1e7 rows and samples 3000 of
those.

The sample of baz contains 2989 distinct values, 11 of which appear more than
once. The sample of baz2 contains 2878 distinct values, 117 (!) of which
appear more than once.

The very different results stem from the Duj1 estimator we use. It estimates
n_distinct by computing n*d/(n - f1 + f1*n/N), where n is the number of
samples, N the number of rows, d the number of distinct samples, and f1 the
number of distinct samples occurring exactly once. If all samples are unique
(i.e. n=d=f1) this yields N. But if f1 is less than d, the result drops very
quickly - sampling baz2 produces 117 non-unique values out of 2878 distinct
ones (roughly 4%), and the estimate is already less than a tenth of what it
would be if the sample contained no duplicates at all.
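
A quick check of that arithmetic, plugging the sample counts above into the
Duj1 formula (a plain-Python sketch; n = 3000 sampled rows, N = 1e7 table rows):

    def duj1(n, N, d, f1):
        # Duj1 estimate: n*d / (n - f1 + f1*n/N)
        return n * d / (n - f1 + f1 * n / N)

    n, N = 3000, 10_000_000
    print(duj1(n, N, d=2989, f1=2989 - 11))    # baz:  ~391,685
    print(duj1(n, N, d=2878, f1=2878 - 117))   # baz2: ~36,001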

Which leaves us with the question of why sampling baz2 produces more duplicate
values than sampling baz does. Now, if we used block sampling, that behaviour
would be unsurprising - baz2 is sorted, so each block very likely contains
each of its values more than once, since the row count exceeds the number of
possible values by more than an order of magnitude. In fact, with block
sampling, we'd probably see f1 values close to 0, and thus our estimate of
n_distinct would be roughly equal to the number of distinct values in the
*sample* population, i.e. about 300 or so.

So why does Vitter's algorithm fail here? Given that we see inflated numbers
of duplicate values in our sample, yet still far fewer than block-based
sampling would yield, my guess is that it's the initial reservoir filling that
bites us here. After that initial filling step, the reservoir contains a lot of
consecutive rows, i.e. a block-based sample taken from just a few blocks. If
the replacement phase that follows somehow uses a slightly smaller replacement
probability than it should, more of these rows will survive replacement,
resulting in exactly the kind of inflated numbers of duplicate values we're
seeing. I've yet to validate this by looking at the reservoir before and after
the replacement stage, though.
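
For reference, the fill-then-replace shape being described here is the classic
reservoir-sampling structure, sketched below as plain Algorithm R in Python.
(PostgreSQL uses Vitter's faster variant, which skips ahead rather than
visiting every row, but the two phases are the same.)

    import random

    def reservoir_sample(rows, k):
        reservoir = []
        for i, row in enumerate(rows, start=1):
            if i <= k:
                reservoir.append(row)      # initial fill: the first k rows, verbatim
            else:
                j = random.randint(1, i)   # row i replaces a slot with probability k/i
                if j <= k:
                    reservoir[j - 1] = row
        return reservoir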

So at least for the purpose of estimating n_distinct, our current sampling
method seems to exhibit the worst rather than the best properties of block-
and row-based sampling. What conclusions to draw from that, I'm not sure yet -
other than that if we move to block-based sampling, we'll certainly have to
change the way we estimate n_distinct.

best regards,
Florian Pflug





Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Jeff Janes
On Wed, Dec 11, 2013 at 2:33 PM, Greg Stark st...@mit.edu wrote:


 I think we're all wet here. I don't see any bias towards larger or smaller
 rows. Larger tuples will be on a larger number of pages but there will be
 fewer of them on any one page. The average effect should be the same.

 Smaller values might have a higher variance with block based sampling than
 larger values. But that actually *is* the kind of thing that Simon's
 approach of just compensating with later samples can deal with.


I think that looking at all rows in randomly-chosen blocks will not bias
size, or histograms.  But it will bias n_distinct and MCV for some
distributions of data, unless we find some way to compensate for it.

But even for avg size and histograms, what does block sampling get us?  We
get larger sample sizes for the same I/O, but those samples are less
independent (assuming data is not randomly scattered over the table), so the
effective sample size is less than the true sample size.  So we can't
just sample 100 times fewer blocks because there are about 100 rows per
block--doing so would not bias our avg size or histogram boundaries, but it
would certainly make them noisier.

Cheers,

Jeff


Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Jeff Janes
On Thu, Dec 12, 2013 at 6:39 AM, Florian Pflug f...@phlo.org wrote:

 Here's an analysis of Jeff Janes' simple example of a table where our
 n_distinct estimate is way off.

 On Dec11, 2013, at 00:03 , Jeff Janes jeff.ja...@gmail.com wrote:
  create table baz as select floor(random()*10000000), md5(random()::text)
 from generate_series(1,100000000);
  create table baz2 as select * from baz order by floor;
  create table baz3 as select * from baz order by md5(floor::text);
 
  baz unclustered, baz2 is clustered with perfect correlation, baz3 is
 clustered but without correlation.
 
  After analyzing all of them:
 
  select tablename, n_distinct,correlation  from pg_stats where tablename
  like 'baz%' and attname='floor'  ;
   tablename | n_distinct  | correlation
  ---+-+-
   baz   | 8.56006e+06 |  0.00497713
   baz2  |  333774 |   1
   baz3  |  361048 |  -0.0118147

 I think I understand what's going on here. I worked with a reduced test
 cases
 of 1e7 rows containing random values between 0 and 5e5 and instrumented
 analyze to print the raw ndistinct and nmultiple values of the sample
 population (i.e. the number of distinct values in the sample, and the
 number
 of distinct values which appeared more than once). I've also considered
 only
 baz and baz2, and thus removed the than unnecessary md5 column. To account
 for
 the reduced table sizes, I adjusted default_statistics_target to 10
 instead of
 100. The resulting estimates are then

  tablename | n_distinct (est) | n_distinct (act)
 -----------+------------------+------------------
  baz       |           391685 |           500000
  baz2      |            36001 |           500000

 ANALYZE assumes that both tables contain roughly 1e7 rows and samples 3000 of
 those.

 The sample of baz contains 2989 distinct values, 11 of which appear more
 than
 once. The sample of baz2 contains 2878 distinct values, 117 (!) of which
 appear more than once.

 The very different results stem from the Duj1 estimator we use. It estimates
 n_distinct by computing n*d/(n - f1 + f1*n/N), where n is the number of
 samples, N the number of rows, d the number of distinct samples, and f1 the
 number of distinct samples occurring exactly once. If all samples are unique
 (i.e. n=d=f1) this yields N. But if f1 is less than d, the result drops very
 quickly - sampling baz2 produces 117 non-unique values out of 2878 distinct
 ones (roughly 4%), and the estimate is already less than a tenth of what it
 would be if the sample contained no duplicates at all.

 Which leaves us with the question of why sampling baz2 produces more duplicate
 values than sampling baz does. Now, if we used block sampling, that behaviour
 would be unsurprising - baz2 is sorted, so each block very likely contains
 each of its values more than once, since the row count exceeds the number of
 possible values by more than an order of magnitude. In fact, with block
 sampling, we'd probably see f1 values close to 0, and thus our estimate of
 n_distinct would be roughly equal to the number of distinct values in the
 *sample* population, i.e. about 300 or so.

 So why does Vitter's algorithm fail here? Given that we see inflated
 numbers
 of duplicate values in our sample, yet still far fewer than block-based
 sampling would yield, my guess is that it's the initial reservoir filling
 that
 bites us here. After that initial filling step, the reservoir contains a
 lot of
 consecutive rows, i.e. a block-based sample taken from just a few blocks.
 If
 the replacement phase that follows somehow uses a slightly smaller
 replacement
 probability than it should, more of these rows will survive replacement,
 resulting in exactly the kind of inflated numbers of duplicate values we're
 seeing. I've yet to validate this by looking at the reservoir before and
 after
 the replacement stage, though.



I think the problem is more subtle than that.  It is easier to visualize it
if you think of every block having the same number of rows, with that
number being fairly large.  If you pick 30,000 rows at random from
1,000,000 blocks, the number of rows chosen from any given block should be
close to following a Poisson distribution with an average of 0.03, which means
about 29113 blocks should have exactly 1 row chosen from them while 441
would have two or more rows chosen from them.

But if you instead select 30,000 rows from 30,000 blocks, which is what we
ask Vitter's algorithm to do, you get roughly a Poisson distribution with an
average of 1.0.  Then about 11036 blocks have exactly one row chosen from
them, and 7927 blocks have two or more rows sampled from them.  Another
11,036 blocks get zero rows selected from them due to Vitter, in addition
to the 970,000 that didn't even get submitted to Vitter in the first place.
That is why you see too many duplicates for clustered data, as too many
blocks are sampled multiple times.
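
The Poisson figures above are easy to reproduce (a small Python sketch, not
anything from ANALYZE itself):

    from math import exp

    def poisson(k, lam):
        p = exp(-lam)                  # P(X = 0)
        for i in range(1, k + 1):
            p *= lam / i
        return p                       # P(X = k)

    # 30,000 rows spread over 1,000,000 blocks: lambda = 0.03 per block
    print(1_000_000 * poisson(1, 0.03))                           # ~29,113 blocks with one row
    print(1_000_000 * (1 - poisson(0, 0.03) - poisson(1, 0.03)))  # ~441 with two or more

    # 30,000 rows drawn from 30,000 candidate blocks: lambda = 1.0 per block
    print(30_000 * poisson(1, 1.0))                               # ~11,036 with exactly one
    print(30_000 * (1 - poisson(0, 1.0) - poisson(1, 1.0)))       # ~7,927 with two or more
    print(30_000 * poisson(0, 1.0))                               # ~11,036 with none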

The Poisson argument doesn't apply cleanly when blocks have a variable number
of rows, but the general principle is the same.

Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 It would be relatively easy to fix this if we trusted the number of visible
 rows in each block to be fairly constant.  But without that assumption, I
 don't see a way to fix the sample selection process without reading the
 entire table.

Yeah, varying tuple density is the weak spot in every algorithm we've
looked at.  The current code is better than what was there before, but as
you say, not perfect.  You might be entertained to look at the threads
referenced by the patch that created the current sampling method:
http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gt...@email.aon.at

particularly
http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at


However ... where this thread started was not about trying to reduce
the remaining statistical imperfections in our existing sampling method.
It was about whether we could reduce the number of pages read for an
acceptable cost in increased statistical imperfection.

regards, tom lane




Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Claudio Freire
On Thu, Dec 12, 2013 at 3:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jeff Janes jeff.ja...@gmail.com writes:
 It would be relatively easy to fix this if we trusted the number of visible
 rows in each block to be fairly constant.  But without that assumption, I
 don't see a way to fix the sample selection process without reading the
 entire table.

 Yeah, varying tuple density is the weak spot in every algorithm we've
 looked at.  The current code is better than what was there before, but as
 you say, not perfect.  You might be entertained to look at the threads
 referenced by the patch that created the current sampling method:
 http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gt...@email.aon.at

 particularly
 http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at


 However ... where this thread started was not about trying to reduce
 the remaining statistical imperfections in our existing sampling method.
 It was about whether we could reduce the number of pages read for an
 acceptable cost in increased statistical imperfection.


Well, why not take a supersample containing all visible tuples from N
selected blocks, and do bootstrapping over it, with subsamples of M
independent rows each?




Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Josh Berkus
On 12/12/2013 10:33 AM, Claudio Freire wrote:
 Well, why not take a supersample containing all visible tuples from N
 selected blocks, and do bootstrapping over it, with subsamples of M
 independent rows each?

Well, we still need to look at each individual block to determine
grouping correlation.  Let's take a worst case example: imagine a table
has *just* been created by:

CREATE TABLE newdata AS SELECT * FROM olddata ORDER BY category, item;

If category is fairly low cardinality, then grouping will be severe;
we can reasonably expect that if we sample 100 blocks, many of them will
have only one category value present.  The answer to this is to make our
block samples fairly widely spaced and compare them.

In this simplified example, if the table had 1000 blocks, we would take
blocks 1, 101, 201, 301, 401, etc.  Then we would compare the number and
content of values found on each block with the number and content found
on each other block.  For example, if we see that block 101 is entirely
the category "cats", and block 701 is entirely the category "shopping",
and block 901 is split 60/40 between the categories "transportation" and
"voting", then we can assume that the level of grouping is very high,
and the number of unknown values we haven't seen is also high.

Whereas if 101 is "cats" and 201 is "cats" and 301 through 501 are
"cats" with 2% other stuff, then we assume that the level of grouping is
moderate and it's just the case that most of the dataset is "cats".
Which means that the number of unknown values we haven't seen is low.

Whereas if 101, 201, 501, and 901 have near-identical distributions of
values, we assume that the level of grouping is very low, and that there
are very few values we haven't seen.
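
One crude way to turn that block-to-block comparison into a number would be
the average pairwise overlap of the value sets seen in the sampled blocks:
near zero means heavy grouping, near one means the blocks all look alike and
we have probably seen most of the values. A hypothetical Python sketch of the
idea (not anything ANALYZE does today):

    from itertools import combinations

    def grouping_score(per_block_values):
        sets = [set(vals) for vals in per_block_values]
        pairs = list(combinations(sets, 2))
        # average Jaccard similarity across all pairs of sampled blocks
        return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

    print(grouping_score([['cats'] * 50, ['shopping'] * 50,
                          ['transportation'] * 30 + ['voting'] * 20]))  # 0.0: highly grouped
    print(grouping_score([['cats', 'dogs', 'fish'] * 20] * 3))          # 1.0: well mixed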

As someone else pointed out, full-block (the proposal) vs. random-row
(our current style) doesn't have a very significant effect on estimates
of Histograms and nullfrac, as long as the sampled blocks are widely
spaced.  Well, nullfrac is affected in the extreme example of a totally
ordered table where the nulls are all in one block, but I'll point out
that we can (and do) also miss that using our current algo.

Estimated grouping should, however, affect MCVs.  In cases where we
estimate that grouping levels are high, the expected % of observed
values should be discounted somehow.  That is, with total random
distribution you have a 1:1 ratio between observed frequency of a value
and assumed frequency.  However, with highly grouped values, you might
have a 2:1 ratio.

Again, more math (backed by statistical analysis) is needed.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Claudio Freire
On Thu, Dec 12, 2013 at 3:56 PM, Josh Berkus j...@agliodbs.com wrote:

 Estimated grouping should, however, affect MCVs.  In cases where we
 estimate that grouping levels are high, the expected % of observed
 values should be discounted somehow.  That is, with total random
 distribution you have a 1:1 ratio between observed frequency of a value
 and assumed frequency.  However, with highly grouped values, you might
 have a 2:1 ratio.

Cross validation can help there. But it's costly.




Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Jeff Janes
On Thu, Dec 12, 2013 at 10:33 AM, Claudio Freire klaussfre...@gmail.com wrote:

 On Thu, Dec 12, 2013 at 3:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Jeff Janes jeff.ja...@gmail.com writes:
  It would be relatively easy to fix this if we trusted the number of
 visible
  rows in each block to be fairly constant.  But without that assumption,
 I
  don't see a way to fix the sample selection process without reading the
  entire table.
 
  Yeah, varying tuple density is the weak spot in every algorithm we've
  looked at.  The current code is better than what was there before, but as
  you say, not perfect.  You might be entertained to look at the threads
  referenced by the patch that created the current sampling method:
 
 http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gt...@email.aon.at
 
  particularly
 
 http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at



Thanks, I will read those.


 
 
  However ... where this thread started was not about trying to reduce
  the remaining statistical imperfections in our existing sampling method.
  It was about whether we could reduce the number of pages read for an
  acceptable cost in increased statistical imperfection.


I think it is pretty clear that n_distinct at least, and probably MCV,
would be a catastrophe under some common data distribution patterns if we
sample all rows in each block without changing our current computation
method.  If we come up with a computation that works for that type of
sampling, it would probably be an improvement under our current sampling as
well.  If we find such a thing, I wouldn't want it to get rejected just
because the larger block-sampling method change did not make it in.

Well, why not take a supersample containing all visible tuples from N
 selected blocks, and do bootstrapping over it, with subsamples of M
 independent rows each?


Bootstrapping methods generally do not work well when ties are significant
events, i.e. when two values being identical is meaningfully different from
them being very close but not identical.

Cheers,

Jeff


Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Claudio Freire
On Thu, Dec 12, 2013 at 4:13 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Well, why not take a supersample containing all visible tuples from N
 selected blocks, and do bootstrapping over it, with subsamples of M
 independent rows each?


 Bootstrapping methods generally do not work well when ties are significant
 events, i.e. when two values being identical is meaningfully different from
 them being very close but not identical.

Yes, that's why I meant to say (but I see now that I didn't) that it
wouldn't do much for n_distinct, just the histogram.




Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Jeff Janes
On Tue, Dec 3, 2013 at 3:30 PM, Greg Stark st...@mit.edu wrote:

 At multiple conferences I've heard about people trying all sorts of
 gymnastics to avoid ANALYZE, which they expect to take too long and
 consume too much I/O. This is an especially big complaint after upgrades:
 their new database performs poorly until the new statistics are in, and
 having done pg_upgrade precisely to avoid extended downtime, they then
 complain about ANALYZE taking hours.


Out of curiosity, are they using the 3 stage script
analyze_new_cluster.sh?  If so, is the complaint that even the first
rounds are too slow, or that the database remains unusable until the last
round is complete?

Cheers,

Jeff


Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Florian Pflug
On Dec12, 2013, at 19:29 , Tom Lane t...@sss.pgh.pa.us wrote:
 However ... where this thread started was not about trying to reduce
 the remaining statistical imperfections in our existing sampling method.
 It was about whether we could reduce the number of pages read for an
 acceptable cost in increased statistical imperfection.

True, but Jeff's case shows that even the imperfections of the current
sampling method are larger than what the n_distinct estimator expects.

Making it even more biased will thus require us to rethink how we
obtain an n_distinct estimate or something equivalent. I don't mean that
as an argument against changing the sampling method, just as something
to watch out for.

best regards,
Florian Pflug





Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Mark Kirkwood

On 11/12/13 19:34, Simon Riggs wrote:


Realistically, I never heard of an Oracle DBA doing advanced
statistical mathematics before setting the sample size on ANALYZE. You
use the default and bump it up if the sample is insufficient for the
data.



I'm not sure that Oracle's stats and optimizer design is an example to
be envied - pretty much all Oracle DBAs I've encountered will apply
hints to all queries to get the plan they want...


Regards

Mark




Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Greg Stark
On Wed, Dec 11, 2013 at 12:58 AM, Simon Riggs si...@2ndquadrant.com wrote:

 Yes, it is not a perfect statistical sample. All sampling is subject
 to an error that is data dependent.

Well, there's random variation due to the limitations of dealing with a
sample. And then there are systemic biases due to incorrect algorithms.
You wouldn't be happy if the samples discarded every row with NULLs or
every row older than some date etc. These things would not be
corrected by larger samples. That's the kind of error we're talking
about here.

But the more I think about things the less convinced I am that there
is a systemic bias introduced by reading the entire block. I had
assumed larger rows would be selected against but that's not really
true, they're just selected against relative to the number of bytes
they occupy which is the correct frequency to sample.

Even blocks that are mostly empty don't really bias things. Picture a
table that consists of 100 blocks with 100 rows each (value A) and
another 100 blocks with only 1 row each (value B). A randomly chosen
block has a 50% chance of being a B block, which grossly over-represents
B - however, each selected A block produces 100 rows while a B block
produces only 1. So if you sample 10 blocks you'll get, on average, about
500 A rows and 5 B rows, which is the correct 100:1 proportion.
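
That thought experiment is easy to check with a quick simulation (a sketch,
not anything from the ANALYZE code): pooled over many repetitions of
"sample 10 whole blocks", the A and B row counts come out in the table's true
100:1 proportion, even though individual samples vary.

    import random

    blocks = [['A'] * 100 for _ in range(100)] + [['B'] for _ in range(100)]
    true_b = 100 / (100 * 100 + 100)            # B is ~0.99% of all rows

    a_rows = b_rows = 0
    for _ in range(10_000):
        for blk in random.sample(blocks, 10):
            a_rows += blk.count('A')
            b_rows += blk.count('B')

    print(true_b, b_rows / (a_rows + b_rows))   # both come out around 0.0099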

I'm not actually sure there is any systemic bias here. The larger
number of rows per block generate less precise results but from my
thought experiments they seem to still be accurate?




Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Greg Stark
On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark st...@mit.edu wrote:
 I'm not actually sure there is any systemic bias here. The larger
 number of rows per block generate less precise results but from my
 thought experiments they seem to still be accurate?

So I've done some empirical tests for a table generated by:
create table sizeskew as (select i,j,repeat('i',i) from
generate_series(1,1000) as i, generate_series(1,1000) as j);

I find that using the whole block doesn't cause any problem with the
avg_width field for the repeat column. That does reinforce my belief
that we might not need any particularly black magic here.

It does however cause a systemic error in the histogram bounds. It
seems the median is systematically overestimated by more and more as the
number of rows used per block grows:

1: 524
4: 549
8: 571
12: 596
16: 602
20: 618 (total sample slightly smaller than normal)
30: 703 (substantially smaller sample)

So there is something clearly wonky in the histogram stats that's
affected by the distribution of the sample. The only thing I can think
of is maybe the most common elements are being selected preferentially
from the early part of the sample which is removing a substantial part
of the lower end of the range. But even removing 100 from the
beginning shouldn't be enough to push the median above 550.

-- 
greg




Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Greg Stark
On Wed, Dec 11, 2013 at 12:08 PM, Greg Stark st...@mit.edu wrote:
 The only thing I can think
 of is maybe the most common elements are being selected preferentially
 from the early part of the sample which is removing a substantial part
 of the lower end of the range. But even removing 100 from the
 beginning shouldn't be enough to push the median above 550.

Just to follow up here. I think what's going on is that not only are the
most_common_vals being preferentially taken from the beginning of the
sample but also their frequency is being massively overestimated. All
values have a frequency of about .001 but the head of the MCV has a
frequency as high as .10 in some of my tests.



-- 
greg




Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Heikki Linnakangas

On 12/11/2013 02:08 PM, Greg Stark wrote:

On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark st...@mit.edu wrote:

I'm not actually sure there is any systemic bias here. The larger
number of rows per block generate less precise results but from my
thought experiments they seem to still be accurate?


So I've done some empirical tests for a table generated by:
create table sizeskew as (select i,j,repeat('i',i) from
generate_series(1,1000) as i, generate_series(1,1000) as j);

I find that using the whole block doesn't cause any problem with the
avg_width field for the repeat column. That does reinforce my belief
that we might not need any particularly black magic here.


How large a sample did you use? Remember that the point of doing 
block-level sampling instead of the current approach would be to allow 
using a significantly smaller sample (in # of blocks), and still achieve 
the same sampling error. If the sample is large enough, it will mask 
any systemic bias caused by block-sampling, but the point is to reduce 
the number of sampled blocks.


The practical question here is this: what happens to the quality of the
statistics if you read only 1/2 the number of blocks you normally
would, but include all the rows in the blocks you do read in the sample?
How about 1/10?


Or to put it another way: could we achieve more accurate statistics by
including all rows from the sampled blocks, while reading the same number
of blocks? In particular, I wonder if it would help with estimating
ndistinct. It generally helps to have a larger sample for ndistinct
estimation, so it might be beneficial.


- Heikki




Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Florian Pflug
On Dec10, 2013, at 15:32 , Claudio Freire klaussfre...@gmail.com wrote:
 On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark st...@mit.edu wrote:
 
 On 10 Dec 2013 08:28, Albe Laurenz laurenz.a...@wien.gv.at wrote:
 
 
 Doesn't all that assume a normally distributed random variable?
 
  I don't think so because of the law of large numbers. If you have a large
  population and sample it, the sample behaves like a normal distribution even
  if the distribution of the population isn't.
 
  No, the law of large numbers says that if you have an AVERAGE of many
  samples of a random variable, the random variable that is the AVERAGE
  behaves like a normal.

Actually, that's the central limit theorem, and it doesn't hold for all
random variables, only for those with finite expected value and variance.

The law of large numbers, in contrast, only tells you that the AVERAGE of
n samples of a random variable will converge to the random variable's
expected value as n goes to infinity (there are different versions of the
law which guarantee different kinds of convergence, weak or strong).

best regards,
Florian Pflug





Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Simon Riggs
On 11 December 2013 12:08, Greg Stark st...@mit.edu wrote:

 So there is something clearly wonky in the histogram stats that's
 affected by the distribution of the sample.

...in the case where the avg width changes in a consistent manner
across the table.

Well spotted.

ISTM we can have a specific cross check for bias in the sample of that
nature. We just calculate the avg width per block and then check for
correlation of the avg width against block number. If we find bias we
can calculate how many extra blocks to sample and from where.

There may be other biases also, so we can check for them and respond
accordingly.
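
A rough sketch of that cross-check (hypothetical names, plain Python;
statistics.correlation requires Python 3.10+): compute the average row width
per sampled block and test whether it drifts with block number.

    from statistics import correlation, mean

    def width_bias(sample_blocks):
        # sample_blocks: list of (block_number, [row widths seen in that block])
        blknos = [blkno for blkno, widths in sample_blocks]
        avg_widths = [mean(widths) for blkno, widths in sample_blocks]
        # near 0: no drift; near +/-1: widths change consistently across the
        # table, suggesting extra blocks from the under-represented end
        return correlation(blknos, avg_widths)

    print(width_bias([(10, [60, 64, 58]), (500, [120, 118]), (990, [200, 190, 210])]))
    # close to 1: widths grow with block number, as in the sizeskew example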

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 So I've done some empirical tests for a table generated by:
 create table sizeskew as (select i,j,repeat('i',i) from
 generate_series(1,1000) as i, generate_series(1,1000) as j);

 I find that using the whole block doesn't cause any problem with the
 avg_width field for the repeat column.That does reinforce my belief
 that we might not need any particularly black magic here.

 It does however cause a systemic error in the histogram bounds. It
 seems the median is systematically overestimated by more and more the
 larger the number of rows per block are used:

Hm.  You can only take N rows from a block if there actually are at least
N rows in the block.  So the sampling rule I suppose you are using is
"select up to N rows from each sampled block" --- and that is going to
favor the contents of blocks containing narrower-than-average rows.

Now in this case, it looks like that ought to favor rows with *smaller* i
values, but you say the median goes up not down.  So I'm not sure what's
going on.  I thought at first that TOAST compression might be part of the
explanation, but TOAST shouldn't kick in on rows with raw representation
narrower than 2KB.

Did you do a run with no upper limit on the number of rows per block?
Because I'm not sure that tests with a limit in place are a good guide
to what happens without it.

regards, tom lane




Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Tom Lane
I wrote:
 Hm.  You can only take N rows from a block if there actually are at least
 N rows in the block.  So the sampling rule I suppose you are using is
 select up to N rows from each sampled block --- and that is going to
 favor the contents of blocks containing narrower-than-average rows.

Oh, no, wait: that's backwards.  (I plead insufficient caffeine.)
Actually, this sampling rule discriminates *against* blocks with
narrower rows.  You previously argued, correctly I think, that
sampling all rows on each page introduces no new bias because row
width cancels out across all sampled pages.  However, if you just
include up to N rows from each page, then rows on pages with more
than N rows have a lower probability of being selected, but there's
no such bias against wider rows.  This explains why you saw smaller
values of i being undersampled.

Had you run the test series all the way up to the max number of
tuples per block, which is probably a couple hundred in this test,
I think you'd have seen the bias go away again.  But the takeaway
point is that we have to sample all tuples per page, not just a
limited number of them, if we want to change it like this.

regards, tom lane
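
Tom's point is easy to see in a toy simulation (hypothetical Python, not
PostgreSQL code): half the blocks hold 200 narrow rows each and half hold 50
wide rows each; capping the per-block take at N leaves the dense, narrow-row
blocks under-represented, while taking every row from each sampled block does
not.

    import random

    blocks = [[0] * 200 for _ in range(500)] + [[1] * 50 for _ in range(500)]

    def block_sample(cap=None, nblocks=100):
        rows = []
        for blk in random.sample(blocks, nblocks):
            rows.extend(blk if cap is None else random.sample(blk, min(cap, len(blk))))
        return rows

    for cap in (None, 10):
        s = block_sample(cap)
        # value 0 marks rows from the narrow-row (dense) blocks; their true share is 0.8
        print(cap, s.count(0) / len(s))   # ~0.8 with no cap, ~0.5 with cap=10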




Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Gavin Flower

On 12/12/13 06:22, Tom Lane wrote:

I wrote:

Hm.  You can only take N rows from a block if there actually are at least
N rows in the block.  So the sampling rule I suppose you are using is
select up to N rows from each sampled block --- and that is going to
favor the contents of blocks containing narrower-than-average rows.

Oh, no, wait: that's backwards.  (I plead insufficient caffeine.)
Actually, this sampling rule discriminates *against* blocks with
narrower rows.  You previously argued, correctly I think, that
sampling all rows on each page introduces no new bias because row
width cancels out across all sampled pages.  However, if you just
include up to N rows from each page, then rows on pages with more
than N rows have a lower probability of being selected, but there's
no such bias against wider rows.  This explains why you saw smaller
values of i being undersampled.

Had you run the test series all the way up to the max number of
tuples per block, which is probably a couple hundred in this test,
I think you'd have seen the bias go away again.  But the takeaway
point is that we have to sample all tuples per page, not just a
limited number of them, if we want to change it like this.

regards, tom lane


Surely we want to sample a 'constant fraction' (obviously, in practice 
you have to sample an integral number of rows in a page!) of rows per 
page? The simplest way, as Tom suggests, is to use all the rows in a page.


However, if you wanted the same number of rows from a greater number of
pages, you could (for example) select a quarter of the rows from each
page.  In which case, when this is a fractional number: take the
integral number of rows, plus one extra row with a probability equal to
the fraction (here 0.25).


Either way, if it is determined that you need N rows, then keep 
selecting pages at random (but never use the same page more than once) 
until you have at least N rows.



Cheers,
Gavin





Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Gavin Flower

On 12/12/13 06:22, Tom Lane wrote:

I wrote:

Hm.  You can only take N rows from a block if there actually are at least
N rows in the block.  So the sampling rule I suppose you are using is
select up to N rows from each sampled block --- and that is going to
favor the contents of blocks containing narrower-than-average rows.

Oh, no, wait: that's backwards.  (I plead insufficient caffeine.)
Actually, this sampling rule discriminates *against* blocks with
narrower rows.  You previously argued, correctly I think, that
sampling all rows on each page introduces no new bias because row
width cancels out across all sampled pages.  However, if you just
include up to N rows from each page, then rows on pages with more
than N rows have a lower probability of being selected, but there's
no such bias against wider rows.  This explains why you saw smaller
values of i being undersampled.

Had you run the test series all the way up to the max number of
tuples per block, which is probably a couple hundred in this test,
I think you'd have seen the bias go away again.  But the takeaway
point is that we have to sample all tuples per page, not just a
limited number of them, if we want to change it like this.

regards, tom lane



Hmm...

In my previous reply (which hasn't shown up yet) I realized I made a mistake!

The fraction/probability could be any of 0.25, 0.50, or 0.75.


Cheers,
Gavin




Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Gavin Flower

On 12/12/13 07:22, Gavin Flower wrote:

On 12/12/13 06:22, Tom Lane wrote:

I wrote:
Hm.  You can only take N rows from a block if there actually are at 
least

N rows in the block.  So the sampling rule I suppose you are using is
select up to N rows from each sampled block --- and that is going to
favor the contents of blocks containing narrower-than-average rows.

Oh, no, wait: that's backwards.  (I plead insufficient caffeine.)
Actually, this sampling rule discriminates *against* blocks with
narrower rows.  You previously argued, correctly I think, that
sampling all rows on each page introduces no new bias because row
width cancels out across all sampled pages.  However, if you just
include up to N rows from each page, then rows on pages with more
than N rows have a lower probability of being selected, but there's
no such bias against wider rows.  This explains why you saw smaller
values of i being undersampled.

Had you run the test series all the way up to the max number of
tuples per block, which is probably a couple hundred in this test,
I think you'd have seen the bias go away again.  But the takeaway
point is that we have to sample all tuples per page, not just a
limited number of them, if we want to change it like this.

regards, tom lane


Surely we want to sample a 'constant fraction' (obviously, in practice 
you have to sample an integral number of rows in a page!) of rows per 
page? The simplest way, as Tom suggests, is to use all the rows in a 
page.


However, if you wanted the same number of rows from a greater number 
of pages, you could (for example) select a quarter of the rows from 
each page.  In which case, when this is a fractional number: take the 
integral number of rows, plus on extra row with a probability equal to 
the fraction (here 0.25).


Either way, if it is determined that you need N rows, then keep 
selecting pages at random (but never use the same page more than once) 
until you have at least N rows.



Cheers,
Gavin




Yes, the fraction/probability could actually be one of 0.25, 0.50, or 0.75.

But there is a bias introduced by the arithmetic average size of the 
rows in a page. This results in block sampling favouring large rows, as 
they are in a larger proportion of pages.


For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, 
using 400 byte pages.  In the pathologically worst case, assuming 
maximum packing density and no page has both types: the large rows would 
occupy  500 pages and the smaller rows 50 pages. So if one selected 11 
pages at random, you get about 10 pages of large rows and about one for 
small rows!  In practice, it would be much less extreme - for a start, 
not all blocks will be fully packed, most blocks would have both types 
of rows, and there is usually greater variation in row size - but still 
a bias towards sampling larger rows.  So somehow, this bias needs to be 
counteracted.



Cheers,
Gavin






Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Kevin Grittner
Gavin Flower gavinflo...@archidevsys.co.nz wrote:

 For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes,
 using 400 byte pages.  In the pathologically worst case, assuming
 maximum packing density and no page has both types: the large rows would
 occupy  500 pages and the smaller rows 50 pages. So if one selected 11
 pages at random, you get about 10 pages of large rows and about one for
 small rows!

With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Gavin Flower

On 12/12/13 08:14, Gavin Flower wrote:

On 12/12/13 07:22, Gavin Flower wrote:

On 12/12/13 06:22, Tom Lane wrote:

I wrote:
Hm.  You can only take N rows from a block if there actually are at 
least

N rows in the block.  So the sampling rule I suppose you are using is
select up to N rows from each sampled block --- and that is going to
favor the contents of blocks containing narrower-than-average rows.

Oh, no, wait: that's backwards.  (I plead insufficient caffeine.)
Actually, this sampling rule discriminates *against* blocks with
narrower rows.  You previously argued, correctly I think, that
sampling all rows on each page introduces no new bias because row
width cancels out across all sampled pages.  However, if you just
include up to N rows from each page, then rows on pages with more
than N rows have a lower probability of being selected, but there's
no such bias against wider rows.  This explains why you saw smaller
values of i being undersampled.

Had you run the test series all the way up to the max number of
tuples per block, which is probably a couple hundred in this test,
I think you'd have seen the bias go away again.  But the takeaway
point is that we have to sample all tuples per page, not just a
limited number of them, if we want to change it like this.

regards, tom lane


Surely we want to sample a 'constant fraction' (obviously, in 
practice you have to sample an integral number of rows in a page!) of 
rows per page? The simplest way, as Tom suggests, is to use all the 
rows in a page.


However, if you wanted the same number of rows from a greater number 
of pages, you could (for example) select a quarter of the rows from 
each page.  In which case, when this is a fractional number: take the 
integral number of rows, plus on extra row with a probability equal 
to the fraction (here 0.25).


Either way, if it is determined that you need N rows, then keep 
selecting pages at random (but never use the same page more than 
once) until you have at least N rows.



Cheers,
Gavin




Yes the fraction/probability, could actually be one of: 0.25, 0.50, 0.75.

But there is a bias introduced by the arithmetic average size of the 
rows in a page. This results in block sampling favouring large rows, 
as they are in a larger proportion of pages.


For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, 
using 400 byte pages.  In the pathologically worst case, assuming 
maximum packing density and no page has both types: the large rows 
would occupy  500 pages and the smaller rows 50 pages. So if one 
selected 11 pages at random, you get about 10 pages of large rows and 
about one for small rows!  In practice, it would be much less extreme 
- for a start, not all blocks will be fully packed, most blocks would 
have both types of rows, and there is usually greater variation in row 
size - but still a bias towards sampling larger rows.  So somehow, 
this bias needs to be counteracted.



Cheers,
Gavin

Actually, I just thought of a possible way to overcome the bias towards 
large rows.


1. Calculate (a rough estimate may be sufficient, if not too 'rough')
   the size of the smallest row.

2. Select a page at random (never selecting the same page twice).

3. Then select rows at random within the page (never selecting the same
   row twice).  For each row selected, accept it with probability equal
   to (size of smallest row)/(size of selected row), as sketched below.
   I think you'll find that will almost completely offset the bias
   towards larger rows!

4. If you do not have sufficient rows, and you still have pages not yet
   selected, go to 2.

Note that it will be normal for some pages not to have any rows
selected, especially for large tables!
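
A minimal sketch of the acceptance rule in step 3 (hypothetical Python, with
the row "size" as an abstract number):

    import random

    def accept_rows(rows_with_sizes, min_size):
        # keep each row with probability min_size / row_size
        kept = []
        for row, size in rows_with_sizes:
            if random.random() < min_size / size:
                kept.append(row)
        return kept

A 200-byte row is then accepted a tenth as often as a 20-byte row, which is
the compensation for block sampling's tilt towards larger rows that Gavin
proposes.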



Cheers,
Gavin

 P.S.
I really need to stop thinking about this problem, and get on with my 
assigned project!!!





Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Peter Geoghegan
On Tue, Dec 10, 2013 at 4:48 PM, Peter Geoghegan p...@heroku.com wrote:
 Why would I even mention that to a statistician? We want guidance. But
 yes, I bet I could give a statistician an explanation of statistics
 target that they'd understand without too much trouble.

Actually, I think that if we told a statistician about the statistics
target, his or her response would be: why would you presume to know
ahead of time what statistics target is going to be effective? I
suspect that the basic problem is that it isn't adaptive. I think that
if we could somehow characterize the quality of our sample as we took
it, and then cease sampling when we reached a certain degree of
confidence in its quality, that would be helpful. It might not even
matter that the sample was clustered from various blocks.


-- 
Peter Geoghegan




Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Gavin Flower

On 12/12/13 08:31, Kevin Grittner wrote:

Gavin Flower gavinflo...@archidevsys.co.nz wrote:


For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes,
using 400 byte pages.  In the pathologically worst case, assuming
maximum packing density and no page has both types: the large rows would
occupy  500 pages and the smaller rows 50 pages. So if one selected 11
pages at random, you get about 10 pages of large rows and about one for
small rows!

With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Sorry, I've simply come up with well argued nonsense!

Kevin, you're dead right.


Cheers,
Gavin




Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Gavin Flower

On 12/12/13 08:39, Gavin Flower wrote:

On 12/12/13 08:31, Kevin Grittner wrote:

Gavin Flower gavinflo...@archidevsys.co.nz wrote:


For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes,
using 400 byte pages.  In the pathologically worst case, assuming
maximum packing density and no page has both types: the large rows 
would

occupy  500 pages and the smaller rows 50 pages. So if one selected 11
pages at random, you get about 10 pages of large rows and about one for
small rows!

With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Sorry, I've simply come up with well argued nonsense!

Kevin, you're dead right.


Cheers,
Gavin



I looked at:
http://www.postgresql.org/docs/current/interactive/storage-page-layout.html
This says that each row has an overhead, which suggests there should be
a bias towards small rows.


There must be a lot of things going on, that I'm simply not aware of, 
that affect sampling bias...



Cheers,
Gavin




Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Gavin Flower

On 12/12/13 09:12, Gavin Flower wrote:

On 12/12/13 08:39, Gavin Flower wrote:

On 12/12/13 08:31, Kevin Grittner wrote:

Gavin Flower gavinflo...@archidevsys.co.nz wrote:


For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes,
using 400 byte pages.  In the pathologically worst case, assuming
maximum packing density and no page has both types: the large rows 
would

occupy  500 pages and the smaller rows 50 pages. So if one selected 11
pages at random, you get about 10 pages of large rows and about one 
for

small rows!

With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Sorry, I've simply come up with well argued nonsense!

Kevin, you're dead right.


Cheers,
Gavin



I looked at:
http://www.postgresql.org/docs/current/interactive/storage-page-layout.html 

this says that each row has an overhead, which suggests there should 
be a bias towards small rows.


There must be a lot of things going on, that I'm simply not aware of, 
that affect sampling bias...



Cheers,
Gavin


Ignoring overheads per row and other things... There will be a biasing
effect when the distribution of sizes is not symmetric.  For example:
when the majority of rows have sizes greater than the arithmetic mean, 
then most samples will be biased towards larger rows. Similarly there 
could be a bias towards smaller rows when most rows are smaller than the 
arithmetic mean.  Yes, I did think about this in depth - but it is way 
too complicated to attempt to quantify the bias, because it depends on 
too many factors (even just limiting it to the distribution of row sizes).


So apart from the nature of volatility of the table, and the pattern of 
insertions/updates/deletes - there there will be a bias depending on the 
distribution of values in the table.


So I despair that a simple, elegant & practical algorithm will ever be 
found.


Therefore, I expect the best answer is probably some kind of empirical 
adaptive approach - which I think has already been suggested.



Cheers,
Gavin




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Greg Stark
I think we're all wet here. I don't see any bias towards larger or smaller
rows. Larger rows will be on a larger number of pages but there will be
fewer of them on any one page. The average effect should be the same.

Smaller values might have a higher variance with block based sampling than
larger values. But that actually *is* the kind of thing that Simon's
approach of just compensating with later samples can deal with.

-- 
greg


Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Martijn van Oosterhout
On Thu, Dec 12, 2013 at 07:22:59AM +1300, Gavin Flower wrote:
 Surely we want to sample a 'constant fraction' (obviously, in
 practice you have to sample an integral number of rows in a page!)
 of rows per page? The simplest way, as Tom suggests, is to use all
 the rows in a page.
 
 However, if you wanted the same number of rows from a greater number
 of pages, you could (for example) select a quarter of the rows from
 each page.  In which case, when this is a fractional number: take
 the integral number of rows, plus one extra row with a probability
 equal to the fraction (here 0.25).

In this discussion we've mostly used block = 1 postgresql block of 8k. 
But when reading from a disk once you've read one block you can
basically read the following ones practically for free.

So I wonder if you could make your sampling read always 16 consecutive
blocks, but then use 25-50% of the tuples.  That way you get many more
tuples for the same amount of disk I/O seeks..
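
A rough sketch of that scheme, in illustrative Python pseudocode rather than
anything from analyze.c (the run length of 16 and the 25% keep fraction are
just the numbers suggested above, and the helper names are made up):

import random

BLOCK_RUN = 16     # consecutive 8k blocks fetched per seek, as suggested above
KEEP_FRAC = 0.25   # fraction of tuples actually used from each block read

def sample_block_runs(nblocks, nruns):
    """Yield block numbers in runs of BLOCK_RUN, so one seek covers many blocks."""
    starts = random.sample(range(nblocks - BLOCK_RUN + 1), nruns)
    for start in sorted(starts):          # runs may overlap; fine for a sketch
        yield from range(start, start + BLOCK_RUN)

def keep_tuple():
    """Thin the rows taken from each block so the row sample stays a similar size."""
    return random.random() < KEEP_FRAC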

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Josh Berkus
On 12/11/2013 02:39 PM, Martijn van Oosterhout wrote:
 In this discussion we've mostly used block = 1 postgresql block of 8k. 
 But when reading from a disk once you've read one block you can
 basically read the following ones practically for free.
 
 So I wonder if you could make your sampling read always 16 consecutive
 blocks, but then use 25-50% of the tuples.  That way you get many more
 tuples for the same amount of disk I/O seeks..

Yeah, that's what I meant by "tune this for the FS".  We'll probably
have to test a lot of different block sizes on different FSes before
we arrive at a reasonable size, and even then I'll bet we have to offer
a GUC.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Albe Laurenz
Greg Stark wrote:
 It's also applicable for the other stats; histogram buckets constructed
 from a 5% sample are more likely to be accurate than those constructed
 from a 0.1% sample.   Same with nullfrac.  The degree of improved
 accuracy, would, of course, require some math to determine.
 
 This "some math" is straightforward basic statistics.  The 95th
 percentile confidence interval for a sample consisting of 300 samples
 from a population of a 1 million would be 5.66%. A sample consisting
 of 1000 samples would have a 95th percentile confidence interval of
 +/- 3.1%.

Doesn't all that assume a normally distributed random variable?

I don't think it can be applied to database table contents
without further analysis.

Yours,
Laurenz Albe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Greg Stark
On 10 Dec 2013 08:28, Albe Laurenz laurenz.a...@wien.gv.at wrote:


 Doesn't all that assume a normally distributed random variable?

I don't think so because of the law of large numbers. If you have a large
population and sample it the sample behaves like a normal distribution even
if the distribution of the population isn't.


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Albe Laurenz
Greg Stark wrote:
 Doesn't all that assume a normally distributed random variable?

 I don't think so because of the law of large numbers. If you have a large 
 population and sample it the
 sample behaves like a normal distribution even if the distribution of the 
 population isn't.

Statistics is the part of mathematics I know least of, but aren't
you saying that in a large enough sample of people there will
always be some with age < 0 (which is what a normal distribution
would imply)?

Yours,
Laurenz Albe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Claudio Freire
On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark st...@mit.edu wrote:

 On 10 Dec 2013 08:28, Albe Laurenz laurenz.a...@wien.gv.at wrote:


 Doesn't all that assume a normally distributed random variable?

 I don't think so because of the law of large numbers. If you have a large
 population and sample it the sample behaves like a normal distribution even
 if the distribution of the population isn't.


No, the large population says that if you have an AVERAGE of many
samples of a random variable, the random variable that is the AVERAGE
behaves like a normal.

The variable itself doesn't.

And for n_distinct, you need to know the variable itself.
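
A tiny simulation of that distinction, purely for illustration (the
exponential population is an arbitrary skewed example, nothing
Postgres-specific):

import random
import statistics

random.seed(1)
# A deliberately skewed population: nothing like a normal distribution.
population = [random.expovariate(1.0) for _ in range(1_000_000)]

# Averages of many 300-row samples cluster tightly and look roughly normal...
means = [statistics.mean(random.sample(population, 300)) for _ in range(1_000)]
print("spread of sample means: %.2f .. %.2f" % (min(means), max(means)))

# ...but the variable itself stays skewed, and that is what matters for
# estimates such as n_distinct.
print("spread of raw values:   %.2f .. %.2f" % (min(population), max(population)))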


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Claudio Freire
On Tue, Dec 10, 2013 at 11:32 AM, Claudio Freire klaussfre...@gmail.com wrote:
 On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark st...@mit.edu wrote:

 On 10 Dec 2013 08:28, Albe Laurenz laurenz.a...@wien.gv.at wrote:


 Doesn't all that assume a normally distributed random variable?

 I don't think so because of the law of large numbers. If you have a large
 population and sample it the sample behaves like a normal distribution even
 if the distribution of the population isn't.


 No, the large population says that if you have an AVERAGE of many
 samples of a random variable, the random variable that is the AVERAGE
 behaves like a normal.


Sorry, that should be No, the *law of large numbers* says...


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Simon Riggs
On 6 December 2013 09:21, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote:
 Has anyone ever thought about opportunistic ANALYZE piggy-backing on
 other full-table scans? That doesn't really help Greg, because his
 complaint is mostly that a fresh ANALYZE is too expensive, but it
 could be an interesting, albeit risky approach.

 What I've been thinking of is

 a) making it piggy back on scans vacuum is doing instead of doing
 separate ones all the time (if possible, analyze needs to be more
 frequent). Currently with quite some likelihood the cache will be gone
 again when revisiting.

 b) make analyze incremental. In lots of bigger tables most of the table
 is static - and we actually *do* know that, thanks to the vm. So keep a
 rawer form of what ends in the catalogs around somewhere, chunked by the
 region of the table the statistic is from. Everytime a part of the table
 changes, re-sample only that part. Then recompute the aggregate.

Piggy-backing sounds like a bad thing. If I run a query, I don't want
to be given some extra task thanks! Especially if we might need to run
data type code I'm not authorised to run, or to sample data I may not
be authorised to see. The only way that could work is to kick off an
autovacuum worker to run the ANALYZE as a separate process and then
use synchronous scan behaviour to derive benefit indirectly.

However, these things presume that we need to continue scanning most
of the blocks of the table, which I don't think needs to be the case.
There is a better way.

Back in 2005/6, I advocated a block sampling method, as described by
Chaudri et al (ref?)
That has two advantages

* We don't need to visit all of the blocks, reducing I/O

* It would also give better analysis of clustered data (its all in
that paper...)

The recent patch on TABLESAMPLE contained a rewrite of the guts of
ANALYZE into a generic sample scan, which would then be used as the
basis for a TABLESAMPLE BERNOULLI. A TABLESAMPLE SYSTEM could use a
block sample, as it does in some other DBMS (DB2, SQLServer).

While I was unimpressed with the TABLESAMPLE patch, all it needs is
some committer-love, so Greg...

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Andres Freund
On 2013-12-10 19:23:37 +, Simon Riggs wrote:
 On 6 December 2013 09:21, Andres Freund and...@2ndquadrant.com wrote:
  On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote:
  Has anyone ever thought about opportunistic ANALYZE piggy-backing on
  other full-table scans? That doesn't really help Greg, because his
  complaint is mostly that a fresh ANALYZE is too expensive, but it
  could be an interesting, albeit risky approach.
 
  What I've been thinking of is
 
  a) making it piggy back on scans vacuum is doing instead of doing
  separate ones all the time (if possible, analyze needs to be more
  frequent). Currently with quite some likelihood the cache will be gone
  again when revisiting.
 
  b) make analyze incremental. In lots of bigger tables most of the table
  is static - and we actually *do* know that, thanks to the vm. So keep a
  rawer form of what ends in the catalogs around somewhere, chunked by the
  region of the table the statistic is from. Everytime a part of the table
  changes, re-sample only that part. Then recompute the aggregate.
 
 Piggy-backing sounds like a bad thing. If I run a query, I don't want
 to be given some extra task thanks! Especially if we might need to run
 data type code I'm not authorised to run, or to sample data I may not
 be authorised to see. The only way that could work is to kick off an
 autovacuum worker to run the ANALYZE as a separate process and then
 use synchronous scan behaviour to derive benefit indirectly.

I was suggesting to piggyback on VACUUM, not user queries. The latter
suggestion was somebody else.
In combination with incremental or chunk-wise building of statistics,
doing more frequent partial vacuums which re-compute the changing part
of the stats would be great. All those blocks have to be read anyway,
and we should be more aggressive about vacuuming anyway.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Peter Geoghegan
On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs si...@2ndquadrant.com wrote:
 However, these things presume that we need to continue scanning most
 of the blocks of the table, which I don't think needs to be the case.
 There is a better way.

Do they? I think it's one opportunistic way of ameliorating the cost.

 Back in 2005/6, I advocated a block sampling method, as described by
 Chaudri et al (ref?)

I don't think that anyone believes that not doing block sampling is
tenable, fwiw. Clearly some type of block sampling would be preferable
for most or all purposes.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Simon Riggs
On 10 December 2013 19:49, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs si...@2ndquadrant.com wrote:
 However, these things presume that we need to continue scanning most
 of the blocks of the table, which I don't think needs to be the case.
 There is a better way.

 Do they? I think it's one opportunistic way of ameliorating the cost.

 Back in 2005/6, I advocated a block sampling method, as described by
 Chaudri et al (ref?)

 I don't think that anyone believes that not doing block sampling is
 tenable, fwiw. Clearly some type of block sampling would be preferable
 for most or all purposes.

If we have one way of reducing cost of ANALYZE, I'd suggest we don't
need 2 ways - especially if the second way involves the interaction of
otherwise not fully related parts of the code.

Or to put it clearly, lets go with block sampling and then see if that
needs even more work.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Josh Berkus
On 12/10/2013 11:49 AM, Peter Geoghegan wrote:
 On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs si...@2ndquadrant.com wrote: 
 I don't think that anyone believes that not doing block sampling is
 tenable, fwiw. Clearly some type of block sampling would be preferable
 for most or all purposes.

As discussed, we need math though.  Does anyone have an ACM subscription
and time to do a search?  Someone must.  We can buy one with community
funds, but no reason to do so if we don't have to.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Greg Stark
On Tue, Dec 10, 2013 at 7:54 PM, Josh Berkus j...@agliodbs.com wrote:
 As discussed, we need math though.  Does anyone have an ACM subscription
 and time to do a search?  Someone must.  We can buy one with community
 funds, but no reason to do so if we don't have to.

Anyone in a university likely has access through their library.

But I don't really think this is the right way to go about this.
Research papers are going to turn up pretty specialized solutions that
are probably patented. We don't even have the basic understanding we
need. I suspect a basic textbook chapter on multistage sampling will
discuss at least the standard techniques.

Once we have a handle on the standard multistage sampling techniques
that would be safe from patents then we might want to go look at
research papers to find how they've been applied to databases in the
past but we would have to do that fairly carefully.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Simon Riggs
On 10 December 2013 19:54, Josh Berkus j...@agliodbs.com wrote:
 On 12/10/2013 11:49 AM, Peter Geoghegan wrote:
 On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs si...@2ndquadrant.com wrote:
 I don't think that anyone believes that not doing block sampling is
 tenable, fwiw. Clearly some type of block sampling would be preferable
 for most or all purposes.

 As discussed, we need math though.  Does anyone have an ACM subscription
 and time to do a search?  Someone must.  We can buy one with community
 funds, but no reason to do so if we don't have to.

We already have that, just use Vitter's algorithm at the block level
rather than the row level.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Peter Geoghegan
On Tue, Dec 10, 2013 at 11:59 AM, Greg Stark st...@mit.edu wrote:
 But I don't really think this is the right way to go about this.
 Research papers are going to turn up pretty specialized solutions that
 are probably patented. We don't even have the basic understanding we
 need. I suspect a basic textbook chapter on multistage sampling will
 discuss at least the standard techniques.

I agree that looking for information on block level sampling
specifically, and its impact on estimation quality is likely to not
turn up very much, and whatever it does turn up will have patent
issues.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Heikki Linnakangas

On 12/10/2013 10:00 PM, Simon Riggs wrote:

On 10 December 2013 19:54, Josh Berkus j...@agliodbs.com wrote:

On 12/10/2013 11:49 AM, Peter Geoghegan wrote:

On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs si...@2ndquadrant.com wrote:
I don't think that anyone believes that not doing block sampling is
tenable, fwiw. Clearly some type of block sampling would be preferable
for most or all purposes.


As discussed, we need math though.  Does anyone have an ACM subscription
and time to do a search?  Someone must.  We can buy one with community
funds, but no reason to do so if we don't have to.


We already have that, just use Vitter's algorithm at the block level
rather than the row level.


And what do you do with the blocks? How many blocks do you choose? 
Details, please.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Mark Kirkwood

On 11/12/13 09:19, Heikki Linnakangas wrote:

On 12/10/2013 10:00 PM, Simon Riggs wrote:

On 10 December 2013 19:54, Josh Berkus j...@agliodbs.com wrote:

On 12/10/2013 11:49 AM, Peter Geoghegan wrote:
On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs 
si...@2ndquadrant.com wrote:

I don't think that anyone believes that not doing block sampling is
tenable, fwiw. Clearly some type of block sampling would be preferable
for most or all purposes.


As discussed, we need math though.  Does anyone have an ACM 
subscription

and time to do a search?  Someone must.  We can buy one with community
funds, but no reason to do so if we don't have to.


We already have that, just use Vitter's algorithm at the block level
rather than the row level.


And what do you do with the blocks? How many blocks do you choose? 
Details, please.





Yeah - and we seem to be back to Josh's point about needing 'some math' 
to cope with the rows within a block not being a purely random selection.


Regards

Mark





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Josh Berkus
On 12/10/2013 01:33 PM, Mark Kirkwood wrote:
 Yeah - and we seem to be back to Josh's point about needing 'some math'
 to cope with the rows within a block not being a purely random selection.

Well, sometimes they are effectively random.  But sometimes they are
not.  The Chaudri et al paper had a formula for estimating randomness
based on the grouping of rows in each block, assuming that the sampled
blocks were widely spaced (if they aren't there's not much you can do).
 This is where you get up to needing a 5% sample; you need to take
enough blocks that you're confident that the blocks you sampled are
representative of the population.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Greg Stark
On Tue, Dec 10, 2013 at 7:49 PM, Peter Geoghegan p...@heroku.com wrote:
 Back in 2005/6, I advocated a block sampling method, as described by
 Chaudri et al (ref?)

 I don't think that anyone believes that not doing block sampling is
 tenable, fwiw. Clearly some type of block sampling would be preferable
 for most or all purposes.

We do block sampling now. But then we select rows from those blocks uniformly.

Incidentally, if you mean Surajit Chaudhuri, he's a Microsoft Research
lead so I would be nervous about anything he's published being
patented by Microsoft.


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Jeff Janes
On Mon, Dec 9, 2013 at 2:37 PM, Robert Haas robertmh...@gmail.com wrote:

 On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes jeff.ja...@gmail.com wrote:
  My reading of the code is that if it is not in the MCV, then it is
 assumed
  to have the average selectivity (about 1/n_distinct, but deflating top
 and
  bottom for the MCV list).  There is also a check that it is less than the
  least common of the MCV, but I don't know why that situation would ever
  prevail--that should always be higher or equal to the average
 selectivity.

 I've never seen an n_distinct value of more than 5 digits, regardless
 of reality.  Typically I've seen 20-50k, even if the real number is
 much higher.



I don't recall seeing an n_distinct that is literally above 100,000 in the
wild, but I've seen negative ones that multiply with reltuples to give
values more than that.  In test cases it is easy enough to get values in
the millions by creating tables using floor(random()*$something).

create table baz as select floor(random()*10000000), md5(random()::text)
from generate_series(1,100000000);
create table baz2 as select * from baz order by floor;
create table baz3 as select * from baz order by md5(floor::text);

baz unclustered, baz2 is clustered with perfect correlation, baz3 is
clustered but without correlation.

After analyzing all of them:

select tablename, n_distinct,correlation  from pg_stats where tablename
 like 'baz%' and attname='floor'  ;
 tablename | n_distinct  | correlation
-----------+-------------+-------------
 baz       | 8.56006e+06 |  0.00497713
 baz2      |      333774 |           1
 baz3      |      361048 |  -0.0118147

So baz is pretty close, while the other two are way off.  But the
n_distinct computation doesn't depend on the order of the rows, as far as I
can tell, but only the composition of the sample.  So this can only mean
that our random sampling method is desperately non-random, right?



  But the n_distinct value is only for non-MCVs, so if we
 estimate the selectivity of column = 'rarevalue' to be
 (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces
 the estimate, and making the MCV list longer naturally makes mcvfrac
 bigger.


Ah, I see.  By including more things into MCV, we crowd out the rest of the
space implicitly.




 I'm not sure how important the
 less-frequent-than-the-least-common-MCV part is, but I'm very sure
 that raising the statistics target helps to solve the problem of
 overestimating the prevalence of uncommon values in a very big table.

  I think that parts of the planner are N^2 in the size of histogram (or
 was
  that the size of the MCV list?).  So we would probably need a way to use
 a
  larger sample size to get more accurate n_distinct and MCV frequencies,
 but
  not save the entire histogram that goes with that sample size.

 I think the saving the histogram part is important.  As you say, the
 MCVs are important for a variety of planning purposes, such as hash
 joins.  More than that, in my experience, people with large tables are
 typically very willing to spend more planning time to get a better
 plan, because mistakes are expensive and the queries are likely to run
 for a while anyway.  People with small tables care about planning

time, because it makes no sense to spend an extra 1ms planning a query
 unless you improve the plan by enough to save at least 1ms when
 executing it, and when the tables are small and access is expected to
 be fast anyway that's often not the case.


I would think that the dichotomy is more about the size of the query-plan
than of the tables.  I think a lot of people with huge tables end up doing
mostly indexed lookups in unique or highly selective indexes, once all the
planning is done.

Does anyone have generators for examples of cases where increasing the
sample size to get better histograms (as opposed more accurate n_distinct
or more accurate MCV) was the key to fixing bad plans?  I wonder if it is
more bins, or more accurate boundaries, that makes the difference.

Cheers,

Jeff


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Jim Nasby

On 12/10/13 2:17 PM, Peter Geoghegan wrote:

On Tue, Dec 10, 2013 at 11:59 AM, Greg Stark st...@mit.edu wrote:

But I don't really think this is the right way to go about this.
Research papers are going to turn up pretty specialized solutions that
are probably patented. We don't even have the basic understanding we
need. I suspect a basic textbook chapter on multistage sampling will
discuss at least the standard techniques.


I agree that looking for information on block level sampling
specifically, and its impact on estimation quality is likely to not
turn up very much, and whatever it does turn up will have patent
issues.


We have an entire analytics dept. at work that specializes in finding patterns 
in our data. I might be able to get some time from them to at least provide 
some guidance here, if the community is interested. They could really only 
serve in a consulting role though.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Peter Geoghegan
On Tue, Dec 10, 2013 at 3:26 PM, Jim Nasby j...@nasby.net wrote:
 I agree that looking for information on block level sampling
 specifically, and its impact on estimation quality is likely to not
 turn up very much, and whatever it does turn up will have patent
 issues.


 We have an entire analytics dept. at work that specializes in finding
 patterns in our data. I might be able to get some time from them to at least
 provide some guidance here, if the community is interested. They could
 really only serve in a consulting role though.

I think that Greg had this right several years ago: it would probably
be very useful to have the input of someone with a strong background
in statistics. It doesn't seem that important that they already know a
lot about databases, provided they can understand what our constraints
are, and what is important to us. It might just be a matter of having
them point us in the right direction.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Simon Riggs
On 10 December 2013 23:43, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Dec 10, 2013 at 3:26 PM, Jim Nasby j...@nasby.net wrote:
 I agree that looking for information on block level sampling
 specifically, and its impact on estimation quality is likely to not
 turn up very much, and whatever it does turn up will have patent
 issues.


 We have an entire analytics dept. at work that specializes in finding
 patterns in our data. I might be able to get some time from them to at least
 provide some guidance here, if the community is interested. They could
 really only serve in a consulting role though.

 I think that Greg had this right several years ago: it would probably
 be very useful to have the input of someone with a strong background
 in statistics. It doesn't seem that important that they already know a
 lot about databases, provided they can understand what our constraints
 are, and what is important to us. It might just be a matter of having
 them point us in the right direction.

err, so what does stats target mean exactly in statistical theory?
Waiting for a statistician, and confirming his credentials before you
believe him above others here, seems like wasted time.

What your statistician will tell you is that YMMV, depending on the data.

So we'll still need a parameter to fine-tune things when the default
is off. We can argue about the default later, at various levels of
rigour.

Block sampling, with parameter to specify sample size. +1

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Greg Stark
   On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Block sampling, with parameter to specify sample size. +1

Simon this is very frustrating. Can you define block sampling?

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Simon Riggs
On 11 December 2013 00:28, Greg Stark st...@mit.edu wrote:
On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs si...@2ndquadrant.com 
 wrote:
 Block sampling, with parameter to specify sample size. +1

 Simon this is very frustrating. Can you define block sampling?

Blocks selected using Vitter's algorithm, using a parameterised
fraction of the total.

When we select a block we should read all rows on that block, to help
identify the extent of clustering within the data.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Greg Stark
On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs si...@2ndquadrant.com wrote:
 When we select a block we should read all rows on that block, to help
 identify the extent of clustering within the data.

So how do you interpret the results of the sample read that way that
doesn't introduce bias?

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Peter Geoghegan
On Tue, Dec 10, 2013 at 4:14 PM, Simon Riggs si...@2ndquadrant.com wrote:
 err, so what does stats target mean exactly in statistical theory?

Why would I even mention that to a statistician? We want guidance. But
yes, I bet I could give a statistician an explanation of statistics
target that they'd understand without too much trouble.

 Waiting for a statistician, and confirming his credentials before you
 believe him above others here, seems like wasted time.

 What your statistician will tell you is it that YMMV, depending on the data.

I'm reasonably confident that they'd give me more than that.

 So we'll still need a parameter to fine tune things when the default
 is off. We can argue about the default later, in various level of
 rigour.

 Block sampling, with parameter to specify sample size. +1

Again, it isn't as if the likely efficacy of *some* block sampling
approach is in question. I'm sure analyze.c is currently naive about
many things. Everyone knows that there are big gains to be had.
Balancing those gains against the possible downsides in terms of
impact on the quality of statistics generated is pretty nuanced. I do
know enough to know that a lot of thought goes into mitigating and/or
detecting the downsides of block-based sampling.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Simon Riggs
On 11 December 2013 00:44, Greg Stark st...@mit.edu wrote:
 On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs si...@2ndquadrant.com wrote:
 When we select a block we should read all rows on that block, to help
 identify the extent of clustering within the data.

 So how do you interpret the results of the sample read that way that
 doesn't introduce bias?

Yes, it is not a perfect statistical sample. All sampling is subject
to an error that is data dependent.

I'm happy that we have an option to select this or not, and a default
that maintains current behaviour, since otherwise we might expect some
plan instability.

I would like to be able to

* allow ANALYZE to run faster in some cases
* increase/decrease sample size when it matters
* have the default sample size vary according to the size of the
table, i.e. a proportional sample

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Sergey E. Koposov

For what it's worth.

I'll quote Chaudhuri et al. first line from the abstract about the block 
sampling.
Block-level sampling is far more efficient than true uniform-random 
sampling over a large database, but prone to  significant errors if used 
to create database statistics.


And after briefly glancing through the paper, my opinion is that it works 
because after making one version of the statistics they cross-validate, see 
how well it goes and then collect more if the cross-validation error is 
large (for example because the data is clustered). Without this bit, as 
far as I can tell, a simple block-based sampler will be bound to make 
catastrophic mistakes depending on the distribution.


Also, just another point about targets (e.g. X%) for estimating stuff from 
the samples (as discussed in the thread). Basically, there is a 
point in talking about sampling a fixed target (5%) of the data 
ONLY if you fix the actual distribution of your data in the table, and 
decide what statistic you are trying to find, e.g. average, std. dev., a 
90th percentile, ndistinct or a histogram and so forth. There won't be a 
general answer, as the percentages will be distribution-dependent and 
statistic-dependent.


Cheers,
Sergey

PS I'm not a statistician, but I use statistics a lot

***
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Tom Lane
Peter Geoghegan p...@heroku.com writes:
 Again, it isn't as if the likely efficacy of *some* block sampling
 approach is in question. I'm sure analyze.c is currently naive about
 many things.

It's not *that* naive; this is already about a third-generation algorithm.
The last major revision (commit 9d6570b8a4) was to address problems with
misestimating the number of live tuples due to nonuniform tuple density
in a table.  IIRC, the previous code could be seriously misled if the
first few pages in the table were significantly non-representative of the
live-tuple density further on.  I'm not sure how we can significantly
reduce the number of blocks examined without re-introducing that hazard in
some form.  In particular, given that you want to see at least N tuples,
how many blocks will you read if you don't have an a-priori estimate of
tuple density?  You have to decide that before you start sampling blocks,
if you want all blocks to have the same probability of being selected
and you want to read them in sequence.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Jeff Janes
On Tuesday, December 10, 2013, Simon Riggs wrote:

 On 11 December 2013 00:28, Greg Stark st...@mit.edu
 wrote:
 On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs 
  si...@2ndquadrant.com
 wrote:
  Block sampling, with parameter to specify sample size. +1
 
  Simon this is very frustrating. Can you define block sampling?

 Blocks selected using Vitter's algorithm, using a parameterised
 fraction of the total.


OK, thanks for defining that.

We only need Vitter's algorithm when we don't know in advance how many
items we are sampling from (such as for tuples--unless we want to rely on
the previous estimate for the current round of analysis).  But for blocks,
we do know how many there are, so there are simpler ways to pick them.
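
For example, a fixed-size uniform pick over a known block count needs no
reservoir at all; a minimal sketch (illustrative only, not the analyze.c
code, and the helper name is invented):

import random

def choose_blocks(nblocks, target_blocks):
    # Uniform sample without replacement from a known population size;
    # no reservoir (Vitter) needed because nblocks is known up front.
    return sorted(random.sample(range(nblocks), min(target_blocks, nblocks)))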



 When we select a block we should read all rows on that block, to help
 identify the extent of clustering within the data.


But we have no mechanism to store such information (or to use it if it were
stored), nor even ways to prevent the resulting skew in the sample from
seriously messing up the estimates which we do have ways of storing and
using.

Cheers,

Jeff


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Simon Riggs
On 11 December 2013 01:27, Sergey E. Koposov m...@sai.msu.ru wrote:
 For what it's worth.

 I'll quote Chaudhuri et al. first line from the abstract about the block
 sampling.
 Block-level sampling is far more efficient than true uniform-random
 sampling over a large database, but prone to  significant errors if used to
 create database statistics.

This glosses over the point that both SQLServer and Oracle use this technique.

 And after briefly glancing through the paper, my opinion is why it works is
 because after making one version of statistics they cross-validate, see how
 well it goes and then collect more if the cross-validation error is large
 (for example because the data is clustered). Without this bit, as far as I
 can a simply block based sampler will be bound to make catastrophic mistakes
 depending on the distribution

I don't think it's true that a block-based sampler will be *bound* to
make catastrophic mistakes. They can clearly happen, just as they
can with random samples, hence the need for a parameter to control
the sample size.

Realistically, I never heard of an Oracle DBA doing advanced
statistical mathematics before setting the sample size on ANALYZE. You
use the default and bump it up if the sample is insufficient for the
data.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Peter Geoghegan
On Tue, Dec 10, 2013 at 10:34 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 11 December 2013 01:27, Sergey E. Koposov m...@sai.msu.ru wrote:
 For what it's worth.

 I'll quote Chaudhuri et al. first line from the abstract about the block
 sampling.
 Block-level sampling is far more efficient than true uniform-random
 sampling over a large database, but prone to  significant errors if used to
 create database statistics.

 This glosses over the point that both SQLServer and Oracle use this technique.

That seems like an unusual omission for Microsoft Research to have made.

I didn't read that paper, because undoubtedly it's all patented. But
before I figured that out, after finding it on Google randomly, I did
read the first couple of paragraphs, which more or less said what
follows - the entire paper - is an explanation as to why it's okay
that we do block sampling.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Hannu Krosing
On 12/11/2013 01:44 AM, Greg Stark wrote:
 On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs si...@2ndquadrant.com wrote:
 When we select a block we should read all rows on that block, to help
 identify the extent of clustering within the data.
 So how do you interpret the results of the sample read that way that
 doesn't introduce bias?

Initially/experimentally we could just compare it to our current approach :)

That is, implement *some* block sampling and then check it against what
we currently have. Then figure out the bad differences. Rinse. Repeat.

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Josh Berkus
Greg,

 I really don't believe the 5% thing. It's not enough for n_distinct
 and it's *far* too high a value for linear properties like histograms
 or nullfrac etc. 

Actually, it is enough for n_distinct, or more properly, 5% is as good
as you can get for n_distinct unless you're going to jump to scanning
50% or more.

It's also applicable for the other stats; histogram buckets constructed
from a 5% sample are more likely to be accurate than those constructed
from a 0.1% sample.   Same with nullfrac.  The degree of improved
accuracy, would, of course, require some math to determine.

 From a computer point of view it's too high to be
 worth bothering. If we have to read 5% of the table we might as well
 do a full scan anyways, it'll be marginally slower but much better
 quality results.

Reading 5% of a 200GB table is going to be considerably faster than
reading the whole thing, if that 5% is being scanned in a way that the
FS understands.

Also, we can optimize this significantly by using the VM, as Robert (I
think) suggested.

In the advanced approaches section, there's also the idea of collecting
analyze data from table pages while they're in memory anyway for other
reasons.

You do seem kind of hostile to the idea of full-page-sampling, going
pretty far beyond the "I'd need to see the math".  Why?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 Reading 5% of a 200GB table is going to be considerably faster than
 reading the whole thing, if that 5% is being scanned in a way that the
 FS understands.

Really?  See the upthread point that reading one sector from each track
has just as much seek overhead as reading the whole thing.  I will grant
that if you think that reading a *contiguous* 5% of the table is good
enough, you can make it faster --- but I don't believe offhand that
you can make this better without seriously compromising the randomness
of your sample.  Too many tables are loaded roughly in time order, or
in other ways that make contiguous subsets nonrandom.

 You do seem kind of hostile to the idea of full-page-sampling, going
 pretty far beyond the "I'd need to see the math".  Why?

I'm detecting a lot of hostility to assertions unsupported by any math.
For good reason.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Robert Haas
On Mon, Dec 9, 2013 at 1:03 PM, Josh Berkus j...@agliodbs.com wrote:
 I really don't believe the 5% thing. It's not enough for n_distinct
 and it's *far* too high a value for linear properties like histograms
 or nullfrac etc.

 Actually, it is enough for n_distinct, or more properly, 5% is as good
 as you can get for n_distinct unless you're going to jump to scanning
 50% or more.

I'd like to see a proof of that result.

Not because I'm hostile to changing the algorithm, but because you've
made numerous mathematical claims on this thread that fly in the face
of what Greg, myself, and others understand to be mathematically true
- including this one.  If our understanding is wrong, then by all
means let's get that fixed.  But you're not going to convince anyone
here that we should rip out the existing algorithm and its
peer-reviewed journal citations by making categorical assertions about
the right way to do things.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Greg Stark
On Mon, Dec 9, 2013 at 6:03 PM, Josh Berkus j...@agliodbs.com wrote:

 It's also applicable for the other stats; histogram buckets constructed
 from a 5% sample are more likely to be accurate than those constructed
 from a 0.1% sample.   Same with nullfrac.  The degree of improved
 accuracy, would, of course, require some math to determine.

This "some math" is straightforward basic statistics.  The 95th
percentile confidence interval for a sample consisting of 300 samples
from a population of a 1 million would be 5.66%. A sample consisting
of 1000 samples would have a 95th percentile confidence interval of
+/- 3.1%.
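
For a quick sanity check, the same figures fall out of the textbook
margin-of-error formula at p = 0.5 (the worst case), with the
finite-population correction for N = 1 million being essentially negligible
(illustrative Python, not anything in the backend):

import math

def margin_of_error(n, p=0.5, z=1.96, N=1_000_000):
    # half-width of the 95% confidence interval, with finite population correction
    fpc = math.sqrt((N - n) / (N - 1))
    return z * math.sqrt(p * (1 - p) / n) * fpc

for n in (300, 1000):
    print(n, "+/- %.2f%%" % (100 * margin_of_error(n)))
# prints roughly +/- 5.66% for n = 300 and +/- 3.10% for n = 1000,
# matching the figures above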

The histogram and nullfrac answer the same kind of question as a
political poll, what fraction of the population falls within this
subset. This is why pollsters don't need to sample 15 million
Americans to have a decent poll result. That's just not how the math
works for these kinds of questions.

n_distinct is an entirely different kettle of fish. It's a different
kind of problem and the error rate there *is* going to be dependent on
the percentage of the total population that you sampled. Moreover from
the papers I read I'm convinced any sample less than 50-80% is nearly
useless so I'm convinced you can't get good results without reading
the whole table.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Greg Stark
On Mon, Dec 9, 2013 at 6:54 PM, Greg Stark st...@mit.edu wrote:

 This "some math" is straightforward basic statistics.  The 95th
 percentile confidence interval for a sample consisting of 300 samples
 from a population of a 1 million would be 5.66%. A sample consisting
 of 1000 samples would have a 95th percentile confidence interval of
 +/- 3.1%.

Incidentally I got this using an online sample size calculator. Google
turns up several but this one seems the easiest to use:
http://www.raosoft.com/samplesize.html


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Jeff Janes
On Sat, Dec 7, 2013 at 11:46 AM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Dec 3, 2013 at 6:30 PM, Greg Stark st...@mit.edu wrote:
  I always gave the party line that ANALYZE only takes a small
  constant-sized sample so even very large tables should be very quick.
  But after hearing the same story again in Heroku I looked into it a
  bit further. I was kind of shocked but the numbers.
 
  ANALYZE takes a sample of 300 * statistics_target rows. That sounds
  pretty reasonable but with default_statistics_target set to 100 that's
  30,000 rows. If I'm reading the code right It takes this sample by
  sampling 30,000 blocks and then (if the table is large enough) taking
  an average of one row per block. Each block is 8192 bytes so that
  means it's reading 240MB of each table.That's a lot more than I
  realized.

 That is a lot.  On the other hand, I question the subject line:
 sometimes, our ANALYZE sampling is not good enough.  Before we raised
 the default statistics target from 10 to 100, complaints about bad
 plan choices due to insufficiently-precise statistics were legion --
 and we still have people periodically proposing to sample a fixed
 percentage of the table instead of a fixed amount of the table, even
 on large tables, which is going the opposite direction.  I think this
 is because they're getting really bad n_distinct estimates, and no
 fixed-size sample can reliably give a good one.



I don't recall ever tracing a bad plan down to a bad n_distinct.  I have
seen several that were due to bad frequency estimates in MCV list, because
hash join planning is extremely sensitive to that.  Do we have some kind of
catalog of generators of problematic data, so that changes can be tested on
known problem sets?  Perhaps a wiki page to accumulate them would be
useful.  For automated testing I guess the generator and query is the easy
part, the hard part is the cost settings/caching/RAM needed to trigger the
problem, and parsing and interpreting the results.



 More generally, I think the basic problem that people are trying to
 solve by raising the statistics target is avoid index scans on
 gigantic tables.  Obviously, there are a lot of other situations where
 inadequate statistics can cause problems, but that's a pretty
 easy-to-understand one that we do not always get right.  We know that
 an equality search looking for some_column = 'some_constant', where
 some_constant is an MCV, must be more selective than a search for the
 least-frequent MCV.  If you store more and more MCVs for a table,
 eventually you'll have enough that the least-frequent one is pretty
 infrequent, and then things work a lot better.


My reading of the code is that if it is not in the MCV, then it is assumed
to have the average selectivity (about 1/n_distinct, but deflating top and
bottom for the MCV list).  There is also a check that it is less than the
least common of the MCV, but I don't know why that situation would ever
prevail--that should always be higher or equal to the average selectivity.
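
A toy version of that calculation, with every number invented purely to show
the shape of the estimate (this follows the description above, not the exact
selfuncs.c code):

nullfrac   = 0.05    # fraction of NULLs (invented for illustration)
mcvfrac    = 0.20    # combined frequency of the MCV list (invented)
n_distinct = 10_000  # estimated number of distinct values (invented)
n_mcv      = 100     # number of entries in the MCV list (invented)

# Everything that is neither NULL nor an MCV, spread evenly over the
# remaining distinct values: the "average selectivity" described above.
avg_selectivity = (1 - nullfrac - mcvfrac) / (n_distinct - n_mcv)
print(avg_selectivity)   # ~7.6e-05 of the table per non-MCV value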




 This is more of a problem for big tables than for small tables.  MCV
 #100 can't have a frequency of greater than 1/100 = 0.01, but that's a
 lot more rows on a big table than small one.  On a table with 10
 million rows we might estimate something close to 100,000 rows when
 the real number is just a handful; when the table has only 10,000
 rows, we just can't be off by as many orders of magnitude.  Things
 don't always work out that badly, but in the worst case they do.

 Maybe there's some highly-principled statistical approach which could
 be taken here, and if so that's fine, but I suspect not.  So what I
 think we should do is auto-tune the statistics target based on the
 table size.  If, say, we think that the generally useful range for the
 statistics target is something like 10 to 400, then let's come up with
 a formula based on table size that outputs 10 for small tables, 400
 for really big tables, and intermediate values for tables in the
 middle.


I think that parts of the planner are N^2 in the size of histogram (or was
that the size of the MCV list?).  So we would probably need a way to use a
larger sample size to get more accurate n_distinct and MCV frequencies, but
not save the entire histogram that goes with that sample size.


Cheers,

Jeff


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Jim Nasby

On 12/6/13 3:21 AM, Andres Freund wrote:

On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote:

Has anyone ever thought about opportunistic ANALYZE piggy-backing on
other full-table scans? That doesn't really help Greg, because his
complaint is mostly that a fresh ANALYZE is too expensive, but it
could be an interesting, albeit risky approach.


What I've been thinking of is

a) making it piggy back on scans vacuum is doing instead of doing
separate ones all the time (if possible, analyze needs to be more
frequent). Currently with quite some likelihood the cache will be gone
again when revisiting.


FWIW, if synchronize_seqscans is on I'd think it'd be pretty easy to fire up a 
2nd backend to do the ANALYZE portion (or perhaps use Robert's fancy new shared 
memory stuff).
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Jim Nasby

On 12/8/13 1:49 PM, Heikki Linnakangas wrote:

On 12/08/2013 08:14 PM, Greg Stark wrote:

The whole accounts table is 1.2GB and contains 10 million rows. As
expected with rows_per_block set to 1 it reads 240MB of that
containing nearly 2 million rows (and takes nearly 20s -- doing a full
table scan for select count(*) only takes about 5s):


One simple thing we could do, without or in addition to changing the algorithm, 
is to issue posix_fadvise() calls for the blocks we're going to read. It should 
at least be possible to match the speed of a plain sequential scan that way.


Hrm... maybe it wouldn't be very hard to use async IO here either? I'm thinking 
it wouldn't be very hard to do the stage 2 work in the callback routine...
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Peter Geoghegan
On Mon, Dec 9, 2013 at 1:18 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 I don't recall ever tracing a bad plan down to a bad n_distinct.

It does happen. I've seen it several times.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Heikki Linnakangas

On 12/09/2013 11:35 PM, Jim Nasby wrote:

On 12/8/13 1:49 PM, Heikki Linnakangas wrote:

On 12/08/2013 08:14 PM, Greg Stark wrote:

The whole accounts table is 1.2GB and contains 10 million rows. As
expected with rows_per_block set to 1 it reads 240MB of that
containing nearly 2 million rows (and takes nearly 20s -- doing a full
table scan for select count(*) only takes about 5s):


One simple thing we could do, without or in addition to changing the
algorithm, is to issue posix_fadvise() calls for the blocks we're
going to read. It should at least be possible to match the speed of a
plain sequential scan that way.


Hrm... maybe it wouldn't be very hard to use async IO here either? I'm
thinking it wouldn't be very hard to do the stage 2 work in the callback
routine...


Yeah, other than the fact we have no infrastructure to do asynchronous 
I/O anywhere in the backend. If we had that, then we could easily use it 
here. I doubt it would be much better than posix_fadvising the blocks, 
though.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Claudio Freire
On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 12/09/2013 11:35 PM, Jim Nasby wrote:

 On 12/8/13 1:49 PM, Heikki Linnakangas wrote:

 On 12/08/2013 08:14 PM, Greg Stark wrote:

 The whole accounts table is 1.2GB and contains 10 million rows. As
 expected with rows_per_block set to 1 it reads 240MB of that
 containing nearly 2 million rows (and takes nearly 20s -- doing a full
 table scan for select count(*) only takes about 5s):


 One simple thing we could do, without or in addition to changing the
 algorithm, is to issue posix_fadvise() calls for the blocks we're
 going to read. It should at least be possible to match the speed of a
 plain sequential scan that way.


 Hrm... maybe it wouldn't be very hard to use async IO here either? I'm
 thinking it wouldn't be very hard to do the stage 2 work in the callback
 routine...


 Yeah, other than the fact we have no infrastructure to do asynchronous I/O
 anywhere in the backend. If we had that, then we could easily use it here. I
 doubt it would be much better than posix_fadvising the blocks, though.


Without patches to the kernel, it is much better.

posix_fadvise interferes with read-ahead, so posix_fadvise on, say,
bitmap heap scans (or similarly sorted analyze block samples) run at 1
IO / block, ie horrible, whereas aio can do read coalescence and
read-ahead when the kernel thinks it'll be profitable, significantly
increasing IOPS. I've seen everything from a 2x to 10x difference.
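
For anyone not familiar with the call being discussed, the hint itself is
tiny.  A minimal sketch, assuming a raw file descriptor and the usual 8 kB
block size (in the backend the call would be reached through the fd/smgr
layer rather than issued directly like this):

#define _XOPEN_SOURCE 600
#include <sys/types.h>
#include <fcntl.h>

/*
 * Ask the kernel to start reading one 8 kB block in the background so
 * that the later read() finds it in cache.  This is the hint that can
 * defeat the kernel's own readahead when it is issued for every block
 * of a sequential run.
 */
static void
hint_block(int fd, unsigned int blkno)
{
    (void) posix_fadvise(fd,
                         (off_t) blkno * 8192,  /* byte offset of the block */
                         8192,                  /* length: one block */
                         POSIX_FADV_WILLNEED);
}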


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Robert Haas
On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 My reading of the code is that if it is not in the MCV, then it is assumed
 to have the average selectivity (about 1/n_distinct, but deflating top and
 bottom for the MCV list).  There is also a check that it is less than the
 least common of the MCV, but I don't know why that situation would ever
 prevail--that should always be higher or equal to the average selectivity.

I've never seen an n_distinct value of more than 5 digits, regardless
of reality.  Typically I've seen 20-50k, even if the real number is
much higher.  But the n_distinct value is only for non-MCVs, so if we
estimate the selectivity of column = 'rarevalue' to be
(1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces
the estimate, and making the MCV list longer naturally makes mcvfrac
bigger.  I'm not sure how important the
less-frequent-than-the-least-common-MCV part is, but I'm very sure
that raising the statistics target helps to solve the problem of
overestimating the prevalence of uncommon values in a very big table.
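
To put rough numbers on that, here is a toy version of the estimate with
invented statistics (the real logic lives in selfuncs.c and, if memory
serves, also subtracts the MCV count from n_distinct and clamps against the
least common MCV, but the shape is the same):

#include <stdio.h>

int
main(void)
{
    double nullfrac   = 0.0;        /* fraction of NULLs in the column */
    double mcvfrac    = 0.20;       /* combined frequency of the MCV entries */
    double n_distinct = 20000.0;    /* planner's distinct-value estimate */
    double ntuples    = 10000000.0; /* rows in the table */

    /* selectivity assumed for any value not in the MCV list */
    double sel = (1.0 - nullfrac - mcvfrac) / n_distinct;

    printf("selectivity = %f, expected rows = %.0f\n",
           sel, sel * ntuples);
    return 0;
}

With a truncated n_distinct of 20,000 on a 10-million-row table, a genuinely
rare value is predicted to match 400 rows.  Raising the statistics target
pushes mcvfrac up and n_distinct closer to reality, which is why it shrinks
the overestimate.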

 I think that parts of the planner are N^2 in the size of histogram (or was
 that the size of the MCV list?).  So we would probably need a way to use a
 larger sample size to get more accurate n_distinct and MCV frequencies, but
 not save the entire histogram that goes with that sample size.

I think the saving the histogram part is important.  As you say, the
MCVs are important for a variety of planning purposes, such as hash
joins.  More than that, in my experience, people with large tables are
typically very willing to spend more planning time to get a better
plan, because mistakes are expensive and the queries are likely to run
for a while anyway.  People with small tables care about planning
time, because it makes no sense to spend an extra 1ms planning a query
unless you improve the plan by enough to save at least 1ms when
executing it, and when the tables are small and access is expected to
be fast anyway that's often not the case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Heikki Linnakangas

On 12/09/2013 11:56 PM, Claudio Freire wrote:

On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

On 12/09/2013 11:35 PM, Jim Nasby wrote:


On 12/8/13 1:49 PM, Heikki Linnakangas wrote:


On 12/08/2013 08:14 PM, Greg Stark wrote:


The whole accounts table is 1.2GB and contains 10 million rows. As
expected with rows_per_block set to 1 it reads 240MB of that
containing nearly 2 million rows (and takes nearly 20s -- doing a full
table scan for select count(*) only takes about 5s):


One simple thing we could do, without or in addition to changing the
algorithm, is to issue posix_fadvise() calls for the blocks we're
going to read. It should at least be possible to match the speed of a
plain sequential scan that way.


Hrm... maybe it wouldn't be very hard to use async IO here either? I'm
thinking it wouldn't be very hard to do the stage 2 work in the callback
routine...


Yeah, other than the fact we have no infrastructure to do asynchronous I/O
anywhere in the backend. If we had that, then we could easily use it here. I
doubt it would be much better than posix_fadvising the blocks, though.


Without patches to the kernel, it is much better.

posix_fadvise interferes with read-ahead, so posix_fadvise on, say,
bitmap heap scans (or similarly sorted analyze block samples) run at 1
IO / block, ie horrible, whereas aio can do read coalescence and
read-ahead when the kernel thinks it'll be profitable, significantly
increasing IOPS. I've seen everything from a 2x to 10x difference.


How did you test that, given that we don't actually have an asynchronous 
I/O implementation? I don't recall any recent patches floating around 
either to do that. When Greg Stark investigated this back in 2007-2008 
and implemented posix_fadvise() for bitmap heap scans, posix_fadvise 
certainly gave a significant speedup on the test data he used. What kind 
of a data distribution gives a slowdown like that?


I took a stab at using posix_fadvise() in ANALYZE. It turned out to be 
very easy, patch attached. Your mileage may vary, but I'm seeing a nice 
gain from this on my laptop. Taking a 30000 page sample of a table with 
717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 
seconds without the patch, and less than a second with the patch, with 
effective_io_concurrency=10. If anyone with a good test data set loaded 
would like to test this and post some numbers, that would be great.


- Heikki
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9845b0b..410d487 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -58,10 +58,18 @@
 /* Data structure for Algorithm S from Knuth 3.4.2 */
 typedef struct
 {
+	Relation	rel;
+	ForkNumber	forknum;
+
 	BlockNumber N;/* number of blocks, known in advance */
 	int			n;/* desired sample size */
 	BlockNumber t;/* current block number */
 	int			m;/* blocks selected so far */
+
+	int			prefetch_target;
+	int			start;
+	int			end;
+	BlockNumber *prefetched;
 } BlockSamplerData;
 
 typedef BlockSamplerData *BlockSampler;
@@ -87,10 +95,11 @@ static BufferAccessStrategy vac_strategy;
 static void do_analyze_rel(Relation onerel, VacuumStmt *vacstmt,
 			   AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
 			   bool inh, int elevel);
-static void BlockSampler_Init(BlockSampler bs, BlockNumber nblocks,
+static void BlockSampler_Init(BlockSampler bs, Relation rel, ForkNumber forknum, BlockNumber nblocks,
   int samplesize);
 static bool BlockSampler_HasMore(BlockSampler bs);
 static BlockNumber BlockSampler_Next(BlockSampler bs);
+static BlockNumber BlockSampler_Next_internal(BlockSampler bs);
 static void compute_index_stats(Relation onerel, double totalrows,
 	AnlIndexData *indexdata, int nindexes,
 	HeapTuple *rows, int numrows,
@@ -954,8 +963,12 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
  * algorithm.
  */
 static void
-BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize)
+BlockSampler_Init(BlockSampler bs, Relation rel, ForkNumber forknum,
+  BlockNumber nblocks, int samplesize)
 {
+	bs->rel = rel;
+	bs->forknum = forknum;
+
 	bs->N = nblocks;			/* measured table size */
 
 	/*
@@ -965,17 +978,61 @@ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize)
 	bs->n = samplesize;
 	bs->t = 0;	/* blocks scanned so far */
 	bs->m = 0;	/* blocks selected so far */
+
+	bs->prefetch_target = target_prefetch_pages;
+	if (target_prefetch_pages > 0)
+		bs->prefetched = palloc((bs->prefetch_target + 1) * sizeof(BlockNumber));
+	else
+		bs->prefetched = NULL;
+	bs->start = bs->end = 0;
 }
 
 static bool
 BlockSampler_HasMore(BlockSampler bs)
 {
+	if (bs->end != bs->start)
+		return true;
 	return (bs->t < bs->N) && (bs->m < bs->n);
 }
 
+static void
+BlockSampler_Prefetch(BlockSampler bs)
+{
+	int		next;
+
+	next = (bs->end + 1) % (bs->prefetch_target + 1);
+	while (next != bs->start && (bs->t < bs->N) && (bs->m < bs->n))
+	{
+		

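The attached loop body is cut off above; the following is only a guess at
its general shape, not the remainder of Heikki's patch.  Presumably it
advances Algorithm S, hints each chosen block with the existing
PrefetchBuffer() call, and remembers the block in the small ring buffer so
BlockSampler_Next() can hand it back in order later (it assumes the struct
fields added above plus storage/bufmgr.h for PrefetchBuffer()):

static void
BlockSampler_Prefetch(BlockSampler bs)
{
	int		next;

	next = (bs->end + 1) % (bs->prefetch_target + 1);
	while (next != bs->start && (bs->t < bs->N) && (bs->m < bs->n))
	{
		/* pick the next block with the unmodified Algorithm S logic */
		BlockNumber blkno = BlockSampler_Next_internal(bs);

		/* issues posix_fadvise(POSIX_FADV_WILLNEED) under the hood */
		PrefetchBuffer(bs->rel, bs->forknum, blkno);

		/* remember it so BlockSampler_Next() can return it in order */
		bs->prefetched[bs->end] = blkno;
		bs->end = next;
		next = (bs->end + 1) % (bs->prefetch_target + 1);
	}
}
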
Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Claudio Freire
On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 12/09/2013 11:56 PM, Claudio Freire wrote:
 Without patches to the kernel, it is much better.

 posix_fadvise interferes with read-ahead, so posix_fadvise on, say,
 bitmap heap scans (or similarly sorted analyze block samples) run at 1
 IO / block, ie horrible, whereas aio can do read coalescence and
 read-ahead when the kernel thinks it'll be profitable, significantly
 increasing IOPS. I've seen everything from a 2x to 10x difference.


 How did you test that, given that we don't actually have an asynchronous I/O
 implementation? I don't recall any recent patches floating around either to
 do that. When Greg Stark investigated this back in 2007-2008 and implemented
 posix_fadvise() for bitmap heap scans, posix_fadvise certainly gave a
 significant speedup on the test data he used. What kind of a data
 distribution gives a slowdown like that?

That's basically my summarized experience from working on this[0]
patch, and the feedback given there about competing AIO work.

Sequential I/O was the biggest issue. I had to actively avoid
fadvising on sequential I/O, or sequential-ish I/O, which was a huge
burden on fadvise logic.


 I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very
 easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from
 this on my laptop. Taking a 30000 page sample of a table with 717717 pages
 (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the
 patch, and less than a second with the patch, with
 effective_io_concurrency=10. If anyone with a good test data set loaded
 would like to test this and post some numbers, that would be great.

Kernel version?

I raised this issue on LKML, and, while I got no news on this front,
they might have silently fixed it. I'd have to check the sources
again.

[0] 
http://www.postgresql.org/message-id/col116-w162aeaa64173e77d4597eea3...@phx.gbl


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Heikki Linnakangas
Claudio Freire klaussfre...@gmail.com wrote:
On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I took a stab at using posix_fadvise() in ANALYZE. It turned out to
be very
 easy, patch attached. Your mileage may vary, but I'm seeing a nice
gain from
 this on my laptop. Taking a 30000 page sample of a table with 717717
pages
 (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without
the
 patch, and less than a second with the patch, with
 effective_io_concurrency=10. If anyone with a good test data set
loaded
 would like to test this and post some numbers, that would be great.

Kernel version?

3.12, from Debian experimental. With an ssd drive and btrfs filesystem.  
Admittedly not your average database server setup, so it would be nice to get 
more reports from others. 

I raised this issue on LKML, and, while I got no news on this front,
they might have silently fixed it. I'd have to check the sources
again.

Maybe. Or maybe the heuristic read ahead isn't significant/helpful, when you're 
prefetching with posix_fadvise anyway.

- Heikki


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Claudio Freire
On Mon, Dec 9, 2013 at 8:45 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Claudio Freire klaussfre...@gmail.com wrote:
On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I took a stab at using posix_fadvise() in ANALYZE. It turned out to
be very
 easy, patch attached. Your mileage may vary, but I'm seeing a nice
gain from
 this on my laptop. Taking a 30000 page sample of a table with 717717
pages
 (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without
the
 patch, and less than a second with the patch, with
 effective_io_concurrency=10. If anyone with a good test data set
loaded
 would like to test this and post some numbers, that would be great.

Kernel version?

 3.12, from Debian experimental. With an ssd drive and btrfs filesystem.  
 Admittedly not your average database server setup, so it would be nice to get 
 more reports from others.


Yeah, read-ahead isn't relevant for SSD.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 Maybe. Or maybe the heuristic read ahead isn't significant/helpful, when 
 you're prefetching with posix_fadvise anyway.

Yeah.  If we're not reading consecutive blocks, readahead is unlikely
to do anything anyhow.

Claudio's comments do suggest that it might be a bad idea to issue a
posix_fadvise when the next block to be examined *is* adjacent to the
current one, though.
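
A sketch of that tweak, under the assumption that the caller tracks the
previously sampled block number (none of these names exist in the tree;
the raw-fd form is only to keep the example self-contained):

#define _XOPEN_SOURCE 600
#include <sys/types.h>
#include <fcntl.h>

/*
 * Only hint blocks that do not immediately follow the previous one;
 * consecutive runs are left to the kernel's own readahead.  "last"
 * should be initialized by the caller to a value that cannot match,
 * e.g. (unsigned int) -2.
 */
static void
maybe_hint_block(int fd, unsigned int blkno, unsigned int *last)
{
    if (blkno != *last + 1)
        (void) posix_fadvise(fd, (off_t) blkno * 8192, 8192,
                             POSIX_FADV_WILLNEED);
    *last = blkno;
}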

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood

On 10/12/13 12:14, Heikki Linnakangas wrote:



I took a stab at using posix_fadvise() in ANALYZE. It turned out to be 
very easy, patch attached. Your mileage may vary, but I'm seeing a 
nice gain from this on my laptop. Taking a 30000 page sample of a 
table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes 
about 6 seconds without the patch, and less than a second with the 
patch, with effective_io_concurrency=10. If anyone with a good test 
data set loaded would like to test this and post some numbers, that 
would be great.







I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  90s
With patch: ANALYZE pgbench_accounts  91s

So I'm essentially seeing no difference :-(

Regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Tom Lane
Mark Kirkwood mark.kirkw...@catalyst.net.nz writes:
 I did a test run:

 pgbench scale 2000 (pgbench_accounts approx 25GB).
 postgres 9.4

 i7 3.5Ghz Cpu
 16GB Ram
 500 GB Velociraptor 10K

 (cold os and pg cache both runs)
 Without patch:  ANALYZE pgbench_accounts  90s
 With patch: ANALYZE pgbench_accounts  91s

 So I'm essentially seeing no difference :-(

What OS and filesystem?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood

On 10/12/13 13:14, Mark Kirkwood wrote:

On 10/12/13 12:14, Heikki Linnakangas wrote:



I took a stab at using posix_fadvise() in ANALYZE. It turned out to 
be very easy, patch attached. Your mileage may vary, but I'm seeing a 
nice gain from this on my laptop. Taking a 30000 page sample of a 
table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes 
about 6 seconds without the patch, and less than a second with the 
patch, with effective_io_concurrency=10. If anyone with a good test 
data set loaded would like to test this and post some numbers, that 
would be great.







I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  90s
With patch: ANALYZE pgbench_accounts  91s

So I'm essentially seeing no difference :-(



Arrg - sorry forgot the important bits:

Ubuntu 13.10 (kernel 3.11.0-14)
filesystem is ext4



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood

On 10/12/13 13:20, Mark Kirkwood wrote:

On 10/12/13 13:14, Mark Kirkwood wrote:

On 10/12/13 12:14, Heikki Linnakangas wrote:



I took a stab at using posix_fadvise() in ANALYZE. It turned out to 
be very easy, patch attached. Your mileage may vary, but I'm seeing 
a nice gain from this on my laptop. Taking a 30000 page sample of a 
table with 717717 pages (ie. slightly larger than RAM), ANALYZE 
takes about 6 seconds without the patch, and less than a second with 
the patch, with effective_io_concurrency=10. If anyone with a good 
test data set loaded would like to test this and post some numbers, 
that would be great.







I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  90s
With patch: ANALYZE pgbench_accounts  91s

So I'm essentially seeing no difference :-(



Arrg - sorry forgot the important bits:

Ubuntu 13.10 (kernel 3.11.0-14)
filesystem is ext4





Doing the same test as above, but on a 80GB Intel 520 (ext4 filesystem 
mounted with discard):


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  5s
With patch: ANALYZE pgbench_accounts  5s





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Craig Ringer
On 12/06/2013 09:52 AM, Peter Geoghegan wrote:
 Has anyone ever thought about opportunistic ANALYZE piggy-backing on
 other full-table scans? That doesn't really help Greg, because his
 complaint is mostly that a fresh ANALYZE is too expensive, but it
 could be an interesting, albeit risky approach.

It'd be particularly interesting, IMO, if autovacuum was able to do it
using the synchronized scans machinery, so the analyze still ran in a
separate proc and didn't impact the user's query. Have an autovac worker
or two waiting for new scans to start on tables they need to analyze,
and if one doesn't start within a reasonable time frame they start their
own scan.

We've seen enough issues with hint-bit setting causing unpredictable
query times for user queries that I wouldn't want to add another "might
write during a read-only query" behaviour.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Craig Ringer
On 12/10/2013 05:20 AM, Jim Nasby wrote:

 
 FWIW, if synchronize_seqscans is on I'd think it'd be pretty easy to
 fire up a 2nd backend to do the ANALYZE portion (or perhaps use Robert's
 fancy new shared memory stuff).

Apologies for posting the same as a new idea before I saw your post. I
need to finish the thread before posting.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood

On 10/12/13 13:53, Mark Kirkwood wrote:

On 10/12/13 13:20, Mark Kirkwood wrote:

On 10/12/13 13:14, Mark Kirkwood wrote:

On 10/12/13 12:14, Heikki Linnakangas wrote:



I took a stab at using posix_fadvise() in ANALYZE. It turned out to 
be very easy, patch attached. Your mileage may vary, but I'm seeing 
a nice gain from this on my laptop. Taking a 30000 page sample of a 
table with 717717 pages (ie. slightly larger than RAM), ANALYZE 
takes about 6 seconds without the patch, and less than a second 
with the patch, with effective_io_concurrency=10. If anyone with a 
good test data set loaded would like to test this and post some 
numbers, that would be great.







I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  90s
With patch: ANALYZE pgbench_accounts  91s

So I'm essentially seeing no difference :-(



Arrg - sorry forgot the important bits:

Ubuntu 13.10 (kernel 3.11.0-14)
filesystem is ext4





Doing the same test as above, but on a 80GB Intel 520 (ext4 filesystem 
mounted with discard):


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  5s
With patch: ANALYZE pgbench_accounts  5s







Redoing the filesystem on the 520 as btrfs didn't seem to make any 
difference either:


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  6.4s
With patch: ANALYZE pgbench_accounts  6.4s




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood

On 10/12/13 15:04, Mark Kirkwood wrote:

On 10/12/13 13:53, Mark Kirkwood wrote:

On 10/12/13 13:20, Mark Kirkwood wrote:

On 10/12/13 13:14, Mark Kirkwood wrote:

On 10/12/13 12:14, Heikki Linnakangas wrote:



I took a stab at using posix_fadvise() in ANALYZE. It turned out 
to be very easy, patch attached. Your mileage may vary, but I'm 
seeing a nice gain from this on my laptop. Taking a 30000 page 
sample of a table with 717717 pages (ie. slightly larger than 
RAM), ANALYZE takes about 6 seconds without the patch, and less 
than a second with the patch, with effective_io_concurrency=10. If 
anyone with a good test data set loaded would like to test this 
and post some numbers, that would be great.







I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  90s
With patch: ANALYZE pgbench_accounts  91s

So I'm essentially seeing no difference :-(



Arrg - sorry forgot the important bits:

Ubuntu 13.10 (kernel 3.11.0-14)
filesystem is ext4





Doing the same test as above, but on a 80GB Intel 520 (ext4 
filesystem mounted with discard):


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  5s
With patch: ANALYZE pgbench_accounts  5s







Redoing the filesystem on the 520 as btrfs didn't seem to make any 
difference either:


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  6.4s
With patch: ANALYZE pgbench_accounts  6.4s





Ah - I have just realized I was not setting effective_io_concurrency - 
so I'll redo the test. - Apologies.


Regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood

On 10/12/13 15:11, Mark Kirkwood wrote:

On 10/12/13 15:04, Mark Kirkwood wrote:

On 10/12/13 13:53, Mark Kirkwood wrote:

On 10/12/13 13:20, Mark Kirkwood wrote:

On 10/12/13 13:14, Mark Kirkwood wrote:

On 10/12/13 12:14, Heikki Linnakangas wrote:



I took a stab at using posix_fadvise() in ANALYZE. It turned out 
to be very easy, patch attached. Your mileage may vary, but I'm 
seeing a nice gain from this on my laptop. Taking a 30000 page 
sample of a table with 717717 pages (ie. slightly larger than 
RAM), ANALYZE takes about 6 seconds without the patch, and less 
than a second with the patch, with effective_io_concurrency=10. 
If anyone with a good test data set loaded would like to test 
this and post some numbers, that would be great.







I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  90s
With patch: ANALYZE pgbench_accounts  91s

So I'm essentially seeing no difference :-(



Arrg - sorry forgot the important bits:

Ubuntu 13.10 (kernel 3.11.0-14)
filesystem is ext4





Doing the same test as above, but on a 80GB Intel 520 (ext4 
filesystem mounted with discard):


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  5s
With patch: ANALYZE pgbench_accounts  5s







Redoing the filesystem on the 520 as btrfs didn't seem to make any 
difference either:


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  6.4s
With patch: ANALYZE pgbench_accounts  6.4s





Ah - I have just realized I was not setting effective_io_concurrency - 
so I'll redo the test. - Apologies.





Redoing the test on the velociraptor gives me exactly the same numbers 
as before (effective_io_concurrency = 10 instead of 1).


Cheers

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Josh Berkus
On 12/09/2013 02:37 PM, Robert Haas wrote:
 I've never seen an n_distinct value of more than 5 digits, regardless
 of reality.  Typically I've seen 20-50k, even if the real number is
 much higher.  But the n_distinct value is only for non-MCVs, so if we
 estimate the selectivity of column = 'rarevalue' to be
 (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces
 the estimate, and making the MCV list longer naturally makes mcvfrac
 bigger.  I'm not sure how important the
 less-frequent-than-the-least-common-MCV part is, but I'm very sure
 that raising the statistics target helps to solve the problem of
 overestimating the prevalence of uncommon values in a very big table.

I did an analysis of our ndistinct algorithm several years ago ( ~~
8.1), and to sum up:

1. we take far too small of a sample to estimate ndistinct well for
tables larger than 100,000 rows.

2. the estimation algo we have chosen is one which tends to be wrong in
the downwards direction, rather strongly so.  That is, if we could
potentially have an ndistinct of 1000 to 100,000 based on the sample,
our algo estimates 1500 to 3000 (the estimator itself is sketched below).

3. Other algos exist.  They tend to be wrong in other directions.

4. Nobody has done an analysis of whether it's worse, on average, to
estimate low vs. high for ndistinct.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood

On 10/12/13 15:17, Mark Kirkwood wrote:

On 10/12/13 15:11, Mark Kirkwood wrote:

On 10/12/13 15:04, Mark Kirkwood wrote:

On 10/12/13 13:53, Mark Kirkwood wrote:

On 10/12/13 13:20, Mark Kirkwood wrote:

On 10/12/13 13:14, Mark Kirkwood wrote:

On 10/12/13 12:14, Heikki Linnakangas wrote:



I took a stab at using posix_fadvise() in ANALYZE. It turned out 
to be very easy, patch attached. Your mileage may vary, but I'm 
seeing a nice gain from this on my laptop. Taking a 30000 page 
sample of a table with 717717 pages (ie. slightly larger than 
RAM), ANALYZE takes about 6 seconds without the patch, and less 
than a second with the patch, with effective_io_concurrency=10. 
If anyone with a good test data set loaded would like to test 
this and post some numbers, that would be great.







I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  90s
With patch: ANALYZE pgbench_accounts  91s

So I'm essentially seeing no difference :-(



Arrg - sorry forgot the important bits:

Ubuntu 13.10 (kernel 3.11.0-14)
filesystem is ext4





Doing the same test as above, but on a 80GB Intel 520 (ext4 
filesystem mounted with discard):


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  5s
With patch: ANALYZE pgbench_accounts  5s







Redoing the filesystem on the 520 as btrfs didn't seem to make any 
difference either:


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  6.4s
With patch: ANALYZE pgbench_accounts  6.4s





Ah - I have just realized I was not setting effective_io_concurrency 
- so I'll redo the test. - Apologies.





Redoing the test on the velociraptor gives me exactly the same numbers 
as before (effective_io_concurrency = 10 instead of 1).





Good grief - repeating the test gave:

Without patch:  ANALYZE pgbench_accounts 90s
With patch: ANALYZE pgbench_accounts  42s

pretty consistent *now*. No idea what was going on in the 1st run (maybe 
I happened to have it running at the same time as a checkpoint)? Anyway 
will stop now before creating more confusion.


Regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood

On 10/12/13 15:32, Mark Kirkwood wrote:

On 10/12/13 15:17, Mark Kirkwood wrote:

On 10/12/13 15:11, Mark Kirkwood wrote:

On 10/12/13 15:04, Mark Kirkwood wrote:

On 10/12/13 13:53, Mark Kirkwood wrote:

On 10/12/13 13:20, Mark Kirkwood wrote:

On 10/12/13 13:14, Mark Kirkwood wrote:

On 10/12/13 12:14, Heikki Linnakangas wrote:



I took a stab at using posix_fadvise() in ANALYZE. It turned 
out to be very easy, patch attached. Your mileage may vary, but 
I'm seeing a nice gain from this on my laptop. Taking a 30000 
page sample of a table with 717717 pages (ie. slightly larger 
than RAM), ANALYZE takes about 6 seconds without the patch, and 
less than a second with the patch, with 
effective_io_concurrency=10. If anyone with a good test data 
set loaded would like to test this and post some numbers, that 
would be great.







I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  90s
With patch: ANALYZE pgbench_accounts  91s

So I'm essentially seeing no difference :-(



Arrg - sorry forgot the important bits:

Ubuntu 13.10 (kernel 3.11.0-14)
filesystem is ext4





Doing the same test as above, but on a 80GB Intel 520 (ext4 
filesystem mounted with discard):


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  5s
With patch: ANALYZE pgbench_accounts  5s







Redoing the filesystem on the 520 as btrfs didn't seem to make any 
difference either:


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  6.4s
With patch: ANALYZE pgbench_accounts  6.4s





Ah - I have just realized I was not setting effective_io_concurrency 
- so I'll redo the test. - Apologies.





Redoing the test on the velociraptor gives me exactly the same 
numbers as before (effective_io_concurrency = 10 instead of 1).





Good grief - repeating the test gave:

Without patch:  ANALYZE pgbench_accounts 90s
With patch: ANALYZE pgbench_accounts  42s

pretty consistent *now*. No idea what was going on in the 1st run 
(maybe I happened to have it running at the same time as a 
checkpoint)? Anyway will stop now before creating more confusion.





Just one more...

The Intel 520 with ext4:

Without patch:  ANALYZE pgbench_accounts 5s
With patch: ANALYZE pgbench_accounts  1s

And double checking -
With patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts  5s

These results look more like Heikki's. Which suggests more benefit on 
SSD than spinning disks. Some more data points (apart from mine) would 
be good to see tho.


Cheers

Mark






--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Claudio Freire
On Tue, Dec 10, 2013 at 12:13 AM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz wrote:
 Just one more...

 The Intel 520 with ext4:


 Without patch:  ANALYZE pgbench_accounts 5s
 With patch: ANALYZE pgbench_accounts  1s

 And double checking -
 With patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts  5s

 These results look more like Heikki's. Which suggests more benefit on SSD
 than spinning disks. Some more data points (apart from mine) would be good
 to see tho.


Assuming ANALYZE is sampling less than 5% of the table, I'd say
fadvising will always be a win.

I'd also suggest higher e_i_c for rotating media. Rotating media has
longer latencies, and e_i_c has to be computed in terms of latency,
rather than spindles when doing prefetch.

For backward index scans, I found the optimum number for a single
spindle to be about 20.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood

On 10/12/13 17:18, Claudio Freire wrote:

On Tue, Dec 10, 2013 at 12:13 AM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz wrote:

Just one more...

The Intel 520 with ext4:


Without patch:  ANALYZE pgbench_accounts 5s
With patch: ANALYZE pgbench_accounts  1s

And double checking -
With patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts  5s

These results look more like Heikki's. Which suggests more benefit on SSD
than spinning disks. Some more data points (apart from mine) would be good
to see tho.


Assuming ANALYZE is sampling less than 5% of the table, I'd say
fadvising will always be a win.

I'd also suggest higher e_i_c for rotating media. Rotating media has
longer latencies, and e_i_c has to be computed in terms of latency,
rather than spindles when doing prefetch.

For backward index scans, I found the optimum number for a single
spindle to be about 20.


Yeah - was just fooling about with this on the velociraptor, looks like 
somewhere in the 20's works well.


                   | pgbench_accounts
eff_io_concurrency | analyze time (s)
-------------------+-----------------
                 8 |  43
                16 |  40
                24 |  36
                32 |  35
                64 |  35

Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ANALYZE sampling is too good

2013-12-08 Thread Greg Stark
On Sun, Dec 8, 2013 at 12:06 AM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz wrote:

 bench=# ANALYZE pgbench_accounts;
 NOTICE:  acquire sample will need 30000 blocks
 NOTICE:  sampled 30000 blocks
 ANALYZE
 Time: 10059.446 ms
 bench=# \q


I did some experimenting here as well.

I hacked up a version of analyze.c that has a guc for rows_per_block
to sample. Setting it to 1 doesn't modify the behaviour at all whereas
setting it to 4 divides the number of blocks to sample by 4 which
causes it to do less I/O and use more rows from each block.
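
A sketch of that kind of change (not the actual diff; blocks_to_sample()
and where it plugs in are assumptions made here for illustration):

/*
 * How many blocks to visit for a given row target when we are willing
 * to keep rows_per_block rows from each visited block.  With
 * rows_per_block = 1 this is the stock behaviour of one block per
 * sampled row on a large table; 4, 16 and 64 visit proportionally
 * fewer blocks.
 */
static int
blocks_to_sample(int targrows, int rows_per_block)
{
    if (rows_per_block < 1)
        rows_per_block = 1;
    /* round up so the row target can still be met */
    return (targrows + rows_per_block - 1) / rows_per_block;
}

For targrows = 30000 that gives 30000, 7500, 1875 and 469 blocks, which is
within a block or so of the page counts in the ANALYZE VERBOSE output below.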

I then initialized pgbench with scale factor 100 but modified the code
to run the actual pgbench with scale factor 1. In other words I ran a
lot of updates on 1% of the database but left the other 99% untouched
from the initial load.

Then I ran ANALYZE VERBOSE accounts with rows_per_block set to 1, 4,
16, and 64. The latter is slightly larger than the average number of
tuples per block so the resulting sample is actually slightly short.

The whole accounts table is 1.2GB and contains 10 million rows. As
expected with rows_per_block set to 1 it reads 240MB of that
containing nearly 2 million rows (and takes nearly 20s -- doing a full
table scan for select count(*) only takes about 5s):


stark=# analyze verbose pgbench_accounts;
INFO:  analyzing public.pgbench_accounts
INFO:  pgbench_accounts: scanned 30000 of 158756 pages, containing
1889701 live rows and 0 dead rows; 30000 rows in sample, 10000036
estimated total rows
ANALYZE
Time: 19468.987 ms


With rows_per_block=4 it reads only a quarter as many blocks but it's
not much faster:

stark=# analyze verbose pgbench_accounts;
INFO:  analyzing public.pgbench_accounts
INFO:  pgbench_accounts: scanned 7501 of 158756 pages, containing
472489 live rows and 0 dead rows; 30000 rows in sample, 10000037
estimated total rows
ANALYZE
Time: 17062.331 ms


But with rows_per_block=16 it's much faster, 6.7s

stark=# set statistics_rows_per_block = 16;
SET
Time: 1.583 ms
stark=# analyze verbose pgbench_accounts;
INFO:  analyzing public.pgbench_accounts
INFO:  pgbench_accounts: scanned 1876 of 158756 pages, containing
118163 live rows and 0 dead rows; 30000 rows in sample, 10000031
estimated total rows
ANALYZE
Time: 6694.331 ms


And with rows_per_block=64 it's under 2s:

stark=# set statistics_rows_per_block = 64;
SET
Time: 0.693 ms
stark=# analyze verbose pgbench_accounts;
INFO:  analyzing public.pgbench_accounts
INFO:  pgbench_accounts: scanned 469 of 158756 pages, containing
29544 live rows and 0 dead rows; 29544 rows in sample, 10000033
estimated total rows
ANALYZE
Time: 1937.055 ms


The estimates for the total rows are just as accurate in every case.
(It seems to be consistently slightly high though, which is a bit
disconcerting)

However looking at the actual pg_stats entries the stats are
noticeably less accurate for the blockier samples. The bid column
actually has 100 distinct values and so with a statistics_target of
100 each value should appear in the MCV list with a frequency of about
.01.

With rows_per_block=1   the MCV frequency list ranges from .0082 to .0123
With rows_per_block=4   the MCV frequency list ranges from .0063 to .0125
With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164
With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213

I'm not really sure if this is due to the blocky sample combined with
the skewed pgbench run or not. It doesn't seem to be consistently
biasing towards or against bid 1 which I believe are the only rows
that would have been touched by pgbench. Still it's suspicious that
they seem to be consistently getting less accurate as the blockiness
increases.

I've attached the results of pg_stats following the analyze with the
various levels of blockiness.

-- 
greg
stark=# set statistics_rows_per_block = 1;
SET
Time: 1.363 ms
stark=# analyze verbose pgbench_accounts;
INFO:  analyzing public.pgbench_accounts
INFO:  pgbench_accounts: scanned 30000 of 158756 pages, containing 1889701 
live rows and 0 dead rows; 30000 rows in sample, 10000036 estimated total rows
ANALYZE
Time: 19468.987 ms
stark=# select * from pg_stats where tablename = 'pgbench_accounts';
-[ RECORD 1 

Re: [HACKERS] ANALYZE sampling is too good

2013-12-08 Thread Josh Berkus
On 12/08/2013 10:14 AM, Greg Stark wrote:
 With rows_per_block=1   the MCV frequency list ranges from .0082 to .0123
 With rows_per_block=4   the MCV frequency list ranges from .0063 to .0125
 With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164
 With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213
 
 I'm not really sure if this is due to the blocky sample combined with
 the skewed pgbench run or not. It doesn't seem to be consistently
 biasing towards or against bid 1 which I believe are the only rows
 that would have been touched by pgbench. Still it's suspicious that
 they seem to be consistently getting less accurate as the blockiness
 increases.

They will certainly do so if you don't apply any statistical adjustments
for selecting more rows from the same pages.

So there's a set of math designed to calculate for the skew introduced
by reading *all* of the rows in each block.  That's what I meant by
block-based sampling; you read, say, 400 pages, you compile statistics
on *all* of the rows on those pages, you apply some algorithms to adjust
for groupings of rows based on how grouped they are.  And you have a
pretty good estimate of how grouped they are, because you just looked at
complete sets of rows on a bunch of nonadjacent pages.

Obviously, you need to look at more rows than you would with a
pure-random sample.  Like I said, the 80%+ accurate point in the papers
seemed to be at a 5% sample.  However, since those rows come from the
same pages, the cost of looking at more rows is quite small, compared to
the cost of looking at 64 times as many disk pages.

My ACM subscription has lapsed, though; someone with a current ACM
subscription could search for this; there are several published papers,
with math and pseudocode.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   >