Re: [HACKERS] ANALYZE sampling is too good
I assume we never came up with a TODO from this thread: --- On Tue, Dec 3, 2013 at 11:30:44PM +, Greg Stark wrote: At multiple conferences I've heard about people trying all sorts of gymnastics to avoid ANALYZE, which they expect to take too long and consume too much I/O. This is especially a big complaint after upgrades, when their new database performs poorly until the new statistics are in; they did pg_upgrade to avoid an extended downtime and then complain about ANALYZE taking hours. I always gave the party line that ANALYZE only takes a small constant-sized sample, so even very large tables should be very quick. But after hearing the same story again at Heroku I looked into it a bit further. I was kind of shocked by the numbers. ANALYZE takes a sample of 300 * statistics_target rows. That sounds pretty reasonable, but with default_statistics_target set to 100 that's 30,000 rows. If I'm reading the code right, it takes this sample by sampling 30,000 blocks and then (if the table is large enough) taking an average of one row per block. Each block is 8192 bytes, so that means it's reading 240MB of each table. That's a lot more than I realized. It means if your table is anywhere up to 240MB you're effectively doing a full table scan and then throwing out nearly all the data read. Worse, my experience with the posix_fadvise benchmarking is that on spinning media reading one out of every 16 blocks takes about the same time as reading them all. Presumably this is because the seek time between tracks dominates, and reading one out of every 16 blocks still reads every track. So in fact if your table is up to about 3-4G, ANALYZE is still effectively going to do a full table scan, at least as far as I/O time goes. The current algorithm seems like it was designed with a 100G+ table in mind, but the consequences for the more common 100M-100G tables weren't really considered. Consider what this means for partitioned tables. 
If they partition their terabyte table into 10 partitions, ANALYZE will suddenly want to use 10x as much I/O, which seems like a perverse consequence. I'm not sure I have a prescription, but my general feeling is that we're spending an awful lot of resources going after a statistically valid sample when we could spend a lot less and get something 90% as good. Or, if we're really going to read that much data, we might as well use more of the rows we find. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
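As a sanity check, the arithmetic in Greg's message can be reproduced in a few lines (a sketch; the 300-rows-per-target multiplier and 8192-byte block size are the figures cited above):

```python
# Reproduce the I/O arithmetic from the message above.
BLOCK_SIZE = 8192                     # default PostgreSQL heap page size, in bytes
default_statistics_target = 100       # the default cited above

sample_rows = 300 * default_statistics_target   # rows ANALYZE samples
# For a sufficiently large table, roughly one sampled row per block,
# so about sample_rows blocks get read in the worst case:
worst_case_bytes = sample_rows * BLOCK_SIZE

print(sample_rows)                          # 30000
print(worst_case_bytes / (1024 ** 2))       # 234.375 -- the "240MB" above
```

So any table up to roughly this size is read almost in full, which is the point of the complaint.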
Re: [HACKERS] ANALYZE sampling is too good
On 12/17/2013 12:06 AM, Jeff Janes wrote: On Mon, Dec 9, 2013 at 3:14 PM, Heikki Linnakangas hlinnakan...@vmware.comwrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. Performance is often chaotic near transition points, so I try to avoid data sets that are slightly bigger or slightly smaller than RAM (or some other limit). Do you know how many io channels your SSD has (or whatever the term of art is for SSD drives)? No idea. It's an Intel 335. On a RAID with 12 spindles, analyzing pgbench_accounts at scale 1000 (13GB) with 4 GB of RAM goes from ~106 seconds to ~19 seconds. However, I'm not sure what problem we want to solve here. The case that Greg Stark mentioned in the email starting this thread is doing a database-wide ANALYZE after an upgrade. In that use case, you certainly want to get it done as quickly as possible, using all the available resources. I certainly would not wish to give a background maintenance process permission to confiscate my entire RAID throughput for its own operation. Then don't set effective_io_concurrency. If you're worried about that, you probably wouldn't want any other process to monopolize the RAID array either. Perhaps this could only be active for explicit analyze, and only if vacuum_cost_delay=0? That would be a bit weird, because ANALYZE in general doesn't obey vacuum_cost_delay. Maybe it should, though... 
Perhaps there should be something like alter background role autovac set Otherwise we are going to end up with an autovacuum_* shadow parameter for many of our parameters, see autovacuum_work_mem discussions. Yeah, so it seems. - Heikki
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 3:14 PM, Heikki Linnakangas hlinnakan...@vmware.comwrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. Performance is often chaotic near transition points, so I try to avoid data sets that are slightly bigger or slightly smaller than RAM (or some other limit). Do you know how many io channels your SSD has (or whatever the term of art is for SSD drives)? On a RAID with 12 spindles, analyzing pgbench_accounts at scale 1000 (13GB) with 4 GB of RAM goes from ~106 seconds to ~19 seconds. However, I'm not sure what problem we want to solve here. I certainly would not wish to give a background maintenance process permission to confiscate my entire RAID throughput for its own operation. Perhaps this could only be active for explicit analyze, and only if vacuum_cost_delay=0? Perhaps there should be something like alter background role autovac set Otherwise we are going to end up with an autovacuum_* shadow parameter for many of our parameters, see autovacuum_work_mem discussions. Cheers, Jeff
Re: [HACKERS] ANALYZE sampling is too good
Here's an analysis of Jeff Janes' simple example of a table where our n_distinct estimate is way off. On Dec11, 2013, at 00:03 , Jeff Janes jeff.ja...@gmail.com wrote: create table baz as select floor(random()*1000), md5(random()::text) from generate_series(1,1); create table baz2 as select * from baz order by floor; create table baz3 as select * from baz order by md5(floor::text); baz is unclustered, baz2 is clustered with perfect correlation, baz3 is clustered but without correlation. After analyzing all of them:

select tablename, n_distinct, correlation from pg_stats where tablename like 'baz%' and attname='floor';

 tablename | n_distinct  | correlation
-----------+-------------+-------------
 baz       | 8.56006e+06 |  0.00497713
 baz2      | 333774      |  1
 baz3      | 361048      | -0.0118147

I think I understand what's going on here. I worked with a reduced test case of 1e7 rows containing random values between 0 and 5e5, and instrumented analyze to print the raw ndistinct and nmultiple values of the sample population (i.e. the number of distinct values in the sample, and the number of distinct values which appeared more than once). I've also considered only baz and baz2, and thus removed the then-unnecessary md5 column. To account for the reduced table sizes, I adjusted default_statistics_target to 10 instead of 100. The resulting estimates are then

 tablename | n_distinct (est) | n_distinct (act)
-----------+------------------+------------------
 baz       | 391685           | 50
 baz2      | 36001            | 50

ANALYZE assumes that both tables contain 1048 rows and samples 3000 of those. The sample of baz contains 2989 distinct values, 11 of which appear more than once. The sample of baz2 contains 2878 distinct values, 117 (!) of which appear more than once. The very different results stem from the Duj1 estimator we use. It estimates n_distinct by computing n*d/(n - f1 + f1*n/N), where n is the number of samples, N the number of rows, d the number of distinct samples, and f1 the number of distinct samples occurring exactly once. If all samples are unique (i.e. n=d=f1) this yields N. 
But if f1 is less than d, the result drops very quickly - sampling baz2 produces 117 non-unique values out of 2878 - roughly 0.03% - and the estimate is already less than 1/10 of what it would be if f1 were 0. Which leaves us with the question of why sampling baz2 produces more duplicate values than sampling baz does. Now, if we used block sampling, that behaviour would be unsurprising - baz2 is sorted, so each block very likely contains each value more than once, since the row count exceeds the number of possible values by more than a magnitude. In fact, with block sampling, we'd probably see f1 values close to 0, and thus our estimate of n_distinct would be roughly equal to the number of distinct values in the *sample* population, i.e. about 300 or so. So why does Vitter's algorithm fail here? Given that we see inflated numbers of duplicate values in our sample, yet still far fewer than block-based sampling would yield, my guess is that it's the initial reservoir filling that bites us here. After that initial filling step, the reservoir contains a lot of consecutive rows, i.e. a block-based sample taken from just a few blocks. If the replacement phase that follows somehow uses a slightly smaller replacement probability than it should, more of these rows will survive replacement, resulting in exactly the kind of inflated numbers of duplicate values we're seeing. I've yet to validate this by looking at the reservoir before and after the replacement stage, though. So at least for the purpose of estimating n_distinct, our current sampling method seems to exhibit the worst rather than the best properties of block- and row-based sampling. What conclusions to draw from that, I'm not sure yet - other than that if we move to block-based sampling, we'll certainly have to change the way we estimate n_distinct. 
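The Duj1 formula above can be checked numerically. This is a sketch: the f1 values are my assumption, derived by supposing each reported non-unique sample value occurs exactly twice (so f1 = d minus the count of non-unique values), with N = 1e7 as in the reduced test case:

```python
def duj1(n, N, d, f1):
    """Duj1 n_distinct estimator: n*d / (n - f1 + f1*n/N)."""
    return n * d / (n - f1 + f1 * n / N)

# baz: n=3000 samples from N=1e7 rows, d=2989 distinct, 11 non-unique -> f1 ~ 2978
print(round(duj1(3000, 10_000_000, 2989, 2989 - 11)))    # 391685
# baz2: d=2878 distinct, 117 non-unique -> f1 ~ 2761
print(round(duj1(3000, 10_000_000, 2878, 2878 - 117)))   # 36001
```

Both match the estimates reported above, which confirms that the sharp drop for baz2 is driven almost entirely by f1.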
best regards, Florian Pflug
Re: [HACKERS] ANALYZE sampling is too good
On Wed, Dec 11, 2013 at 2:33 PM, Greg Stark st...@mit.edu wrote: I think we're all wet here. I don't see any bias towards larger or smaller rows. Larger rows will be on a larger number of pages, but there will be fewer of them on any one page. The average effect should be the same. Smaller values might have a higher variance with block based sampling than larger values. But that actually *is* the kind of thing that Simon's approach of just compensating with later samples can deal with. I think that looking at all rows in randomly-chosen blocks will not bias size or histograms. But it will bias n_distinct and MCV for some data distributions, unless we find some way to compensate for it. But even for avg size and histograms, what does block sampling get us? We get larger sample sizes for the same IO, but those samples are less independent (assuming data is not randomly scattered over the table), so the effective sample size is less than the true sample size. So we can't just sample 100 times fewer blocks because there are about 100 rows per block--doing so would not bias our avg size or histogram boundaries, but it would certainly make them noisier. Cheers, Jeff
Re: [HACKERS] ANALYZE sampling is too good
On Thu, Dec 12, 2013 at 6:39 AM, Florian Pflug f...@phlo.org wrote: Here's an analysis of Jeff Janes' simple example of a table where our n_distinct estimate is way off. On Dec11, 2013, at 00:03 , Jeff Janes jeff.ja...@gmail.com wrote: create table baz as select floor(random()*1000), md5(random()::text) from generate_series(1,1); create table baz2 as select * from baz order by floor; create table baz3 as select * from baz order by md5(floor::text); baz is unclustered, baz2 is clustered with perfect correlation, baz3 is clustered but without correlation. After analyzing all of them:

select tablename, n_distinct, correlation from pg_stats where tablename like 'baz%' and attname='floor';

 tablename | n_distinct  | correlation
-----------+-------------+-------------
 baz       | 8.56006e+06 |  0.00497713
 baz2      | 333774      |  1
 baz3      | 361048      | -0.0118147

I think I understand what's going on here. I worked with a reduced test case of 1e7 rows containing random values between 0 and 5e5, and instrumented analyze to print the raw ndistinct and nmultiple values of the sample population (i.e. the number of distinct values in the sample, and the number of distinct values which appeared more than once). I've also considered only baz and baz2, and thus removed the then-unnecessary md5 column. To account for the reduced table sizes, I adjusted default_statistics_target to 10 instead of 100. The resulting estimates are then

 tablename | n_distinct (est) | n_distinct (act)
-----------+------------------+------------------
 baz       | 391685           | 50
 baz2      | 36001            | 50

ANALYZE assumes that both tables contain 1048 rows and samples 3000 of those. The sample of baz contains 2989 distinct values, 11 of which appear more than once. The sample of baz2 contains 2878 distinct values, 117 (!) of which appear more than once. The very different results stem from the Duj1 estimator we use. 
It estimates n_distinct by computing n*d/(n - f1 + f1*n/N), where n is the number of samples, N the number of rows, d the number of distinct samples, and f1 the number of distinct samples occurring exactly once. If all samples are unique (i.e. n=d=f1) this yields N. But if f1 is less than d, the result drops very quickly - sampling baz2 produces 117 non-unique values out of 2878 - roughly 0.03% - and the estimate is already less than 1/10 of what it would be if f1 were 0. Which leaves us with the question of why sampling baz2 produces more duplicate values than sampling baz does. Now, if we used block sampling, that behaviour would be unsurprising - baz2 is sorted, so each block very likely contains each value more than once, since the row count exceeds the number of possible values by more than a magnitude. In fact, with block sampling, we'd probably see f1 values close to 0, and thus our estimate of n_distinct would be roughly equal to the number of distinct values in the *sample* population, i.e. about 300 or so. So why does Vitter's algorithm fail here? Given that we see inflated numbers of duplicate values in our sample, yet still far fewer than block-based sampling would yield, my guess is that it's the initial reservoir filling that bites us here. After that initial filling step, the reservoir contains a lot of consecutive rows, i.e. a block-based sample taken from just a few blocks. If the replacement phase that follows somehow uses a slightly smaller replacement probability than it should, more of these rows will survive replacement, resulting in exactly the kind of inflated numbers of duplicate values we're seeing. I've yet to validate this by looking at the reservoir before and after the replacement stage, though. I think the problem is more subtle than that. It is easier to visualize it if you think of every block having the same number of rows, with that number being fairly large. 
If you pick 30,000 rows at random from 1,000,000 blocks, the number of rows chosen from any given block should be close to following a Poisson distribution with an average of 0.03, which means about 29113 blocks should have exactly 1 row chosen from them while 441 would have two or more rows chosen from them. But if you instead select 30,000 rows from 30,000 blocks, which is what we ask Vitter's algorithm to do, you get about a Poisson distribution with an average of 1.0. Then about 11036 blocks have exactly one row chosen from them, and 7927 blocks have two or more rows sampled from them. Another 11,036 blocks get zero rows selected from them due to Vitter, in addition to the 970,000 that didn't even get submitted to Vitter in the first place. That is why you see too many duplicates for clustered data, as too many blocks are sampled multiple times. The Poisson argument doesn't apply cleanly when blocks have variable numbers of rows, but the general principle still applies.
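Jeff's block counts can be reproduced directly from the Poisson probability mass function (a sketch; the block and row counts are the ones used in his example):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    # P(X = k) for X ~ Poisson(lam)
    return lam ** k * exp(-lam) / factorial(k)

# Case 1: 30,000 rows spread over 1,000,000 blocks -> lambda = 0.03 per block
blocks = 1_000_000
lam = 30_000 / blocks
print(round(blocks * poisson_pmf(1, lam)))                              # 29113 blocks with exactly 1 row
print(round(blocks * (1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)))) # 441 blocks with 2+ rows

# Case 2: 30,000 rows drawn from only 30,000 candidate blocks -> lambda = 1.0
blocks, lam = 30_000, 1.0
print(round(blocks * poisson_pmf(0, lam)))                              # 11036 blocks with 0 rows
print(round(blocks * poisson_pmf(1, lam)))                              # 11036 blocks with exactly 1 row
print(round(blocks * (1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)))) # 7927 blocks with 2+ rows
```

All five figures match the ones quoted above, so the "too many blocks sampled multiple times" effect falls straight out of the lambda = 1.0 regime.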
Re: [HACKERS] ANALYZE sampling is too good
Jeff Janes jeff.ja...@gmail.com writes: It would be relatively easy to fix this if we trusted the number of visible rows in each block to be fairly constant. But without that assumption, I don't see a way to fix the sample selection process without reading the entire table. Yeah, varying tuple density is the weak spot in every algorithm we've looked at. The current code is better than what was there before, but as you say, not perfect. You might be entertained to look at the threads referenced by the patch that created the current sampling method: http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gt...@email.aon.at particularly http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at However ... where this thread started was not about trying to reduce the remaining statistical imperfections in our existing sampling method. It was about whether we could reduce the number of pages read for an acceptable cost in increased statistical imperfection. regards, tom lane
Re: [HACKERS] ANALYZE sampling is too good
On Thu, Dec 12, 2013 at 3:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jeff Janes jeff.ja...@gmail.com writes: It would be relatively easy to fix this if we trusted the number of visible rows in each block to be fairly constant. But without that assumption, I don't see a way to fix the sample selection process without reading the entire table. Yeah, varying tuple density is the weak spot in every algorithm we've looked at. The current code is better than what was there before, but as you say, not perfect. You might be entertained to look at the threads referenced by the patch that created the current sampling method: http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gt...@email.aon.at particularly http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at However ... where this thread started was not about trying to reduce the remaining statistical imperfections in our existing sampling method. It was about whether we could reduce the number of pages read for an acceptable cost in increased statistical imperfection. Well, why not take a supersample containing all visible tuples from N selected blocks, and do bootstrapping over it, with subsamples of M independent rows each?
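The supersample-plus-bootstrap idea can be sketched as follows. This is a toy illustration, not an implementation of anything in Postgres: the supersample is synthetic stand-in data, and the statistic bootstrapped here (a median, i.e. a histogram bound) is one of the things such resampling could estimate:

```python
import random
import statistics

random.seed(0)

# Toy "supersample": pretend we kept every visible row from N sampled blocks.
# (Synthetic values; real rows would come from the table's sampled blocks.)
supersample = [random.gauss(50.0, 10.0) for _ in range(5000)]

# Bootstrap: repeatedly draw M rows with replacement and recompute the
# statistic of interest. The spread of the replicates indicates how much
# the estimate can be trusted.
M = 500
medians = [statistics.median(random.choices(supersample, k=M))
           for _ in range(200)]

print(statistics.fmean(medians))   # ~50, near the supersample's true median
```

As the later replies note, this kind of resampling helps with order statistics like histogram bounds, but not with n_distinct, where exact ties carry the information.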
Re: [HACKERS] ANALYZE sampling is too good
On 12/12/2013 10:33 AM, Claudio Freire wrote: Well, why not take a supersample containing all visible tuples from N selected blocks, and do bootstrapping over it, with subsamples of M independent rows each? Well, we still need to look at each individual block to determine grouping correlation. Let's take a worst case example: imagine a table has *just* been created by: CREATE TABLE newdata AS SELECT * FROM olddata ORDER BY category, item; If category is fairly low cardinality, then grouping will be severe; we can reasonably expect that if we sample 100 blocks, many of them will have only one category value present. The answer to this is to make our block samples fairly widely spaced and compare them. In this simplified example, if the table had 1000 blocks, we would take blocks 1,101,201,301,401,etc. Then we would compare the number and content of values found on each block with the number and content found on each other block. For example, if we see that block 101 is entirely the category cats, and block 701 is entirely the category shopping and block 901 is split 60/40 between the categories transportation and voting, then we can assume that the level of grouping is very high, and the number of unknown values we haven't seen is also high. Whereas if 101 is cats and 201 is cats and 301 through 501 are cats with 2% other stuff, then we assume that the level of grouping is moderate and it's just the case that most of the dataset is cats. Which means that the number of unknown values we haven't seen is low. Whereas if 101, 201, 501, and 901 have near-identical distributions of values, we assume that the level of grouping is very low, and that there are very few values we haven't seen. As someone else pointed out, full-block (the proposal) vs. random-row (our current style) doesn't have a very significant effect on estimates of Histograms and nullfrac, as long as the sampled blocks are widely spaced. 
Well, nullfrac is affected in the extreme example of a totally ordered table where the nulls are all in one block, but I'll point out that we can (and do) also miss that using our current algo. Estimated grouping should, however, affect MCVs. In cases where we estimate that grouping levels are high, the expected % of observed values should be discounted somehow. That is, with total random distribution you have a 1:1 ratio between observed frequency of a value and assumed frequency. However, with highly grouped values, you might have a 2:1 ratio. Again, more math (backed by statistical analysis) is needed. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] ANALYZE sampling is too good
On Thu, Dec 12, 2013 at 3:56 PM, Josh Berkus j...@agliodbs.com wrote: Estimated grouping should, however, affect MCVs. In cases where we estimate that grouping levels are high, the expected % of observed values should be discounted somehow. That is, with total random distribution you have a 1:1 ratio between observed frequency of a value and assumed frequency. However, with highly grouped values, you might have a 2:1 ratio. Cross validation can help there. But it's costly.
Re: [HACKERS] ANALYZE sampling is too good
On Thu, Dec 12, 2013 at 10:33 AM, Claudio Freire klaussfre...@gmail.comwrote: On Thu, Dec 12, 2013 at 3:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jeff Janes jeff.ja...@gmail.com writes: It would be relatively easy to fix this if we trusted the number of visible rows in each block to be fairly constant. But without that assumption, I don't see a way to fix the sample selection process without reading the entire table. Yeah, varying tuple density is the weak spot in every algorithm we've looked at. The current code is better than what was there before, but as you say, not perfect. You might be entertained to look at the threads referenced by the patch that created the current sampling method: http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gt...@email.aon.at particularly http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at Thanks, I will read those. However ... where this thread started was not about trying to reduce the remaining statistical imperfections in our existing sampling method. It was about whether we could reduce the number of pages read for an acceptable cost in increased statistical imperfection. I think it is pretty clear that n_distinct at least, and probably MCV, would be a catastrophe under some common data distribution patterns if we sample all rows in each block without changing our current computation method. If we come up with a computation that works for that type of sampling, it would probably be an improvement under our current sampling as well. If we find such a thing, I wouldn't want it to get rejected just because the larger block-sampling method change did not make it in. Well, why not take a supersample containing all visible tuples from N selected blocks, and do bootstrapping over it, with subsamples of M independent rows each? Bootstrapping methods generally do not work well when ties are significant events, i.e. 
when two values being identical is meaningfully different from them being very close but not identical. Cheers, Jeff
Re: [HACKERS] ANALYZE sampling is too good
On Thu, Dec 12, 2013 at 4:13 PM, Jeff Janes jeff.ja...@gmail.com wrote: Well, why not take a supersample containing all visible tuples from N selected blocks, and do bootstrapping over it, with subsamples of M independent rows each? Bootstrapping methods generally do not work well when ties are significant events, i.e. when two values being identical is meaningfully different from them being very close but not identical. Yes, that's why I meant to say (but I see now that I didn't) that it wouldn't do much for n_distinct, just the histogram.
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 3, 2013 at 3:30 PM, Greg Stark st...@mit.edu wrote: At multiple conferences I've heard about people trying all sorts of gymnastics to avoid ANALYZE which they expect to take too long and consume too much I/O. This is especially a big complain after upgrades when their new database performs poorly until the new statistics are in and they did pg_upgrade to avoid an extended downtime and complain about ANALYZE taking hours. Out of curiosity, are they using the 3 stage script analyze_new_cluster.sh? If so, is the complaint that even the first rounds are too slow, or that the database remains unusable until the last round is complete? Cheers, Jeff
Re: [HACKERS] ANALYZE sampling is too good
On Dec12, 2013, at 19:29 , Tom Lane t...@sss.pgh.pa.us wrote: However ... where this thread started was not about trying to reduce the remaining statistical imperfections in our existing sampling method. It was about whether we could reduce the number of pages read for an acceptable cost in increased statistical imperfection. True, but Jeff's case shows that even the imperfections of the current sampling method are larger than what the n_distinct estimator expects. Making it even more biased will thus require us to rethink how we obtain a n_distinct estimate or something equivalent. I don't mean that as an argument against changing the sampling method, just as something to watch out for. best regards, Florian Pflug
Re: [HACKERS] ANALYZE sampling is too good
On 11/12/13 19:34, Simon Riggs wrote: Realistically, I never heard of an Oracle DBA doing advanced statistical mathematics before setting the sample size on ANALYZE. You use the default and bump it up if the sample is insufficient for the data. I'm not sure that Oracle's stats and optimizer design is an example to be envied - pretty much all Oracle DBAs I've encountered will apply hints to all queries to get the plan they want... Regards Mark
Re: [HACKERS] ANALYZE sampling is too good
On Wed, Dec 11, 2013 at 12:58 AM, Simon Riggs si...@2ndquadrant.com wrote: Yes, it is not a perfect statistical sample. All sampling is subject to an error that is data dependent. Well there's random variation due to the limitations of dealing with a sample. And then there's systemic biases due to incorrect algorithms. You wouldn't be happy if the samples discarded every row with NULLs or every row older than some date etc. These things would not be corrected by larger samples. That's the kind of error we're talking about here. But the more I think about things the less convinced I am that there is a systemic bias introduced by reading the entire block. I had assumed larger rows would be selected against but that's not really true, they're just selected against relative to the number of bytes they occupy which is the correct frequency to sample. Even blocks that are mostly empty don't really bias things. Picture a table that consists of 100 blocks with 100 rows each (value A) and another 100 blocks with only 1 row each (value B). The rows with value B have a 50% chance of being in any given block which is grossly inflated however each block selected with value A will produce 100 rows. So if you sample 10 blocks you'll get 100x10xA and 1x10xB which will be the correct proportion. I'm not actually sure there is any systemic bias here. The larger number of rows per block generate less precise results but from my thought experiments they seem to still be accurate?
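Greg's 100-block thought experiment is easy to check with a quick simulation (a sketch with hypothetical data matching his description: 100 full blocks of value A at 100 rows each, and 100 blocks holding a single value-B row):

```python
import random

random.seed(42)

# Rows-per-block layout from the thought experiment:
# 100 blocks of 100 "A" rows, 100 blocks each holding a single "B" row.
block_rows = [("A", 100)] * 100 + [("B", 1)] * 100

a_rows = b_rows = 0
for _ in range(20_000):                     # repeat the 10-block sample many times
    for value, nrows in random.sample(block_rows, 10):
        if value == "A":
            a_rows += nrows                 # a sampled A block yields 100 rows
        else:
            b_rows += nrows                 # a sampled B block yields 1 row

# True table proportion is 10,000 A rows to 100 B rows, i.e. 100:1.
print(a_rows / b_rows)                      # ~100, matching the table's proportions
```

The sampled ratio converges on the table's true 100:1 proportion, supporting the claim that whole-block reading by itself doesn't systematically bias value frequencies, only their variance.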
Re: [HACKERS] ANALYZE sampling is too good
On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark st...@mit.edu wrote: I'm not actually sure there is any systemic bias here. The larger number of rows per block generate less precise results but from my thought experiments they seem to still be accurate? So I've done some empirical tests for a table generated by: create table sizeskew as (select i,j,repeat('i',i) from generate_series(1,1000) as i, generate_series(1,1000) as j); I find that using the whole block doesn't cause any problem with the avg_width field for the repeat column. That does reinforce my belief that we might not need any particularly black magic here. It does however cause a systemic error in the histogram bounds. It seems the median is systematically overestimated by more and more the more rows per block are used:

1: 524
4: 549
8: 571
12: 596
16: 602
20: 618 (total sample slightly smaller than normal)
30: 703 (substantially smaller sample)

So there is something clearly wonky in the histogram stats that's affected by the distribution of the sample. The only thing I can think of is maybe the most common elements are being selected preferentially from the early part of the sample, which is removing a substantial part of the lower end of the range. But even removing 100 from the beginning shouldn't be enough to push the median above 550. -- greg
Re: [HACKERS] ANALYZE sampling is too good
On Wed, Dec 11, 2013 at 12:08 PM, Greg Stark st...@mit.edu wrote: The only thing I can think of is maybe the most common elements are being selected preferentially from the early part of the sample which is removing a substantial part of the lower end of the range. But even removing 100 from the beginning shouldn't be enough to push the median above 550. Just to follow up here. I think what's going on is that not only are the most_common_vals being preferentially taken from the beginning of the sample but also their frequency is being massively overestimated. All values have a frequency of about .001 but the head of the MCV list has a frequency as high as .10 in some of my tests. -- greg
Re: [HACKERS] ANALYZE sampling is too good
On 12/11/2013 02:08 PM, Greg Stark wrote: On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark st...@mit.edu wrote: I'm not actually sure there is any systemic bias here. The larger number of rows per block generate less precise results but from my thought experiments they seem to still be accurate? So I've done some empirical tests for a table generated by: create table sizeskew as (select i,j,repeat('i',i) from generate_series(1,1000) as i, generate_series(1,1000) as j); I find that using the whole block doesn't cause any problem with the avg_width field for the repeat column. That does reinforce my belief that we might not need any particularly black magic here. How large a sample did you use? Remember that the point of doing block-level sampling instead of the current approach would be to allow using a significantly smaller sample (in # of blocks), and still achieve the same sampling error. If the sample is large enough, it will mask any systemic bias caused by block-sampling, but the point is to reduce the number of sampled blocks. The practical question here is this: What happens to the quality of the statistics if you only read 1/2 the number of blocks than you normally would, but included all the rows in the blocks we read in the sample? How about 1/10? Or to put it another way: could we achieve more accurate statistics by including all rows from the sampled blocks, while reading the same number of blocks? In particular, I wonder if it would help with estimating ndistinct. It generally helps to have a larger sample for ndistinct estimation, so it might be beneficial. - Heikki
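Heikki's "same blocks, more rows" question can be poked at with a quick simulation. This is only an illustrative sketch with made-up clustering parameters, not anything from the thread: values are correlated within pages, and we compare the average estimation error of taking one row per block against taking every row from the same number of blocks.

```python
import random
import statistics

random.seed(4)

# Toy table: 10,000 pages of 50 rows, where each page has its own offset,
# i.e. values are clustered within pages (the hard case for block sampling).
ROWS_PER_PAGE = 50
pages = []
for _ in range(10_000):
    offset = random.gauss(0, 10)                 # between-page variation
    pages.append([offset + random.gauss(0, 1)    # within-page variation
                  for _ in range(ROWS_PER_PAGE)])

true_mean = statistics.fmean(v for page in pages for v in page)

def estimate(n_blocks, rows_per_block):
    """Estimate the table mean from n_blocks pages, rows_per_block rows each."""
    sample = []
    for page in random.sample(pages, n_blocks):
        sample.extend(random.sample(page, rows_per_block))
    return statistics.fmean(sample)

def avg_error(n_blocks, rows_per_block, trials=200):
    return statistics.fmean(abs(estimate(n_blocks, rows_per_block) - true_mean)
                            for _ in range(trials))

one_row = avg_error(300, 1)               # 300 rows from 300 blocks
all_rows = avg_error(300, ROWS_PER_PAGE)  # 15,000 rows from 300 blocks
# With strong clustering the 50x larger sample barely reduces the error,
# because rows within a block are highly correlated.
print(round(one_row, 2), round(all_rows, 2))
```

In this toy model the extra rows are nearly free in I/O terms but add little precision for the mean; whether they still help ndistinct and MCV estimation is exactly Heikki's open question.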
Re: [HACKERS] ANALYZE sampling is too good
On Dec10, 2013, at 15:32 , Claudio Freire klaussfre...@gmail.com wrote: On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark st...@mit.edu wrote: On 10 Dec 2013 08:28, Albe Laurenz laurenz.a...@wien.gv.at wrote: Doesn't all that assume a normally distributed random variable? I don't think so because of the law of large numbers. If you have a large population and sample it the sample behaves like a normal distribution even if the distribution of the population isn't. No, the large population says that if you have an AVERAGE of many samples of a random variable, the random variable that is the AVERAGE behaves like a normal. Actually, that's the central limit theorem, and it doesn't hold for all random variables, only for those with finite expected value and variance. The law of large numbers, in contrast, only tells you that the AVERAGE of n samples of a random variable will converge to the random variable's expected value as n goes to infinity (there are different versions of the law which guarantee different kinds of convergence, weak or strong). best regards, Florian Pflug
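The distinction Florian draws can be seen numerically. A hedged sketch (illustrative Python with my own numbers, not from the thread): the population is exponential, which is far from normal, yet averages of samples behave like a normal centred on the expected value (central limit theorem) and concentrate there as n grows (law of large numbers).

```python
import random
import statistics

random.seed(42)

# Population: exponential with rate 1 -- heavily skewed, definitely not
# normal.  Expected value 1, standard deviation 1; both finite, so the
# central limit theorem applies to averages of samples.
n = 100        # sample size per average
trials = 2000  # how many averages we draw

means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(trials)]

# The averages cluster around the expected value 1.0 (law of large numbers)
# with spread about 1/sqrt(n) = 0.1 (central limit theorem).  The raw
# variable itself stays skewed, which is why per-value statistics such as
# ndistinct can't lean on normality.
print(round(statistics.fmean(means), 2))   # close to 1.0
print(round(statistics.stdev(means), 2))   # close to 0.1
```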
Re: [HACKERS] ANALYZE sampling is too good
On 11 December 2013 12:08, Greg Stark st...@mit.edu wrote: So there is something clearly wonky in the histogram stats that's affected by the distribution of the sample. ...in the case where the avg width changes in a consistent manner across the table. Well spotted. ISTM we can have a specific cross check for bias in the sample of that nature. We just calculate the avg width per block and then check for correlation of the avg width against block number. If we find bias we can calculate how many extra blocks to sample and from where. There may be other biases also, so we can check for them and respond accordingly. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] ANALYZE sampling is too good
Greg Stark st...@mit.edu writes: So I've done some empirical tests for a table generated by: create table sizeskew as (select i,j,repeat('i',i) from generate_series(1,1000) as i, generate_series(1,1000) as j); I find that using the whole block doesn't cause any problem with the avg_width field for the repeat column. That does reinforce my belief that we might not need any particularly black magic here. It does however cause a systemic error in the histogram bounds. It seems the median is systematically overestimated by more and more the larger the number of rows per block are used: Hm. You can only take N rows from a block if there actually are at least N rows in the block. So the sampling rule I suppose you are using is select up to N rows from each sampled block --- and that is going to favor the contents of blocks containing narrower-than-average rows. Now in this case, it looks like that ought to favor rows with *smaller* i values, but you say the median goes up not down. So I'm not sure what's going on. I thought at first that TOAST compression might be part of the explanation, but TOAST shouldn't kick in on rows with raw representation narrower than 2KB. Did you do a run with no upper limit on the number of rows per block? Because I'm not sure that tests with a limit in place are a good guide to what happens without it. regards, tom lane
Re: [HACKERS] ANALYZE sampling is too good
I wrote: Hm. You can only take N rows from a block if there actually are at least N rows in the block. So the sampling rule I suppose you are using is select up to N rows from each sampled block --- and that is going to favor the contents of blocks containing narrower-than-average rows. Oh, no, wait: that's backwards. (I plead insufficient caffeine.) Actually, this sampling rule discriminates *against* blocks with narrower rows. You previously argued, correctly I think, that sampling all rows on each page introduces no new bias because row width cancels out across all sampled pages. However, if you just include up to N rows from each page, then rows on pages with more than N rows have a lower probability of being selected, but there's no such bias against wider rows. This explains why you saw smaller values of i being undersampled. Had you run the test series all the way up to the max number of tuples per block, which is probably a couple hundred in this test, I think you'd have seen the bias go away again. But the takeaway point is that we have to sample all tuples per page, not just a limited number of them, if we want to change it like this. regards, tom lane
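Tom's "up to N rows per block" bias can be demonstrated with a small simulation. This is an illustrative sketch with invented page layouts, not PostgreSQL's actual sampling code:

```python
import random

random.seed(0)

# Invented table: 500 "dense" pages of 100 narrow rows (value 0) and
# 500 "sparse" pages of 10 wide rows (value 1).
dense = [[0] * 100 for _ in range(500)]
sparse = [[1] * 10 for _ in range(500)]
pages = dense + sparse

N = 10  # cap: take at most N rows from each sampled page

capped, full = [], []
for page in random.sample(pages, 200):
    full.extend(page)                                    # every row on the page
    capped.extend(random.sample(page, min(N, len(page))))

# True fraction of wide (value-1) rows is 5000/55000, about 0.09.
frac_full = sum(full) / len(full)
frac_capped = sum(capped) / len(capped)
# Taking whole pages recovers roughly the true fraction; capping at N rows
# per page discriminates against the dense pages' rows and inflates the
# wide-row fraction toward 0.5 -- exactly the bias Tom describes.
print(round(frac_full, 2), round(frac_capped, 2))
```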
Re: [HACKERS] ANALYZE sampling is too good
On 12/12/13 06:22, Tom Lane wrote: I wrote: Hm. You can only take N rows from a block if there actually are at least N rows in the block. So the sampling rule I suppose you are using is select up to N rows from each sampled block --- and that is going to favor the contents of blocks containing narrower-than-average rows. Oh, no, wait: that's backwards. (I plead insufficient caffeine.) Actually, this sampling rule discriminates *against* blocks with narrower rows. You previously argued, correctly I think, that sampling all rows on each page introduces no new bias because row width cancels out across all sampled pages. However, if you just include up to N rows from each page, then rows on pages with more than N rows have a lower probability of being selected, but there's no such bias against wider rows. This explains why you saw smaller values of i being undersampled. Had you run the test series all the way up to the max number of tuples per block, which is probably a couple hundred in this test, I think you'd have seen the bias go away again. But the takeaway point is that we have to sample all tuples per page, not just a limited number of them, if we want to change it like this. regards, tom lane Surely we want to sample a 'constant fraction' (obviously, in practice you have to sample an integral number of rows in a page!) of rows per page? The simplest way, as Tom suggests, is to use all the rows in a page. However, if you wanted the same number of rows from a greater number of pages, you could (for example) select a quarter of the rows from each page. In which case, when this is a fractional number: take the integral number of rows, plus one extra row with a probability equal to the fraction (here 0.25). Either way, if it is determined that you need N rows, then keep selecting pages at random (but never use the same page more than once) until you have at least N rows. 
Cheers, Gavin
Re: [HACKERS] ANALYZE sampling is too good
On 12/12/13 06:22, Tom Lane wrote: I wrote: Hm. You can only take N rows from a block if there actually are at least N rows in the block. So the sampling rule I suppose you are using is select up to N rows from each sampled block --- and that is going to favor the contents of blocks containing narrower-than-average rows. Oh, no, wait: that's backwards. (I plead insufficient caffeine.) Actually, this sampling rule discriminates *against* blocks with narrower rows. You previously argued, correctly I think, that sampling all rows on each page introduces no new bias because row width cancels out across all sampled pages. However, if you just include up to N rows from each page, then rows on pages with more than N rows have a lower probability of being selected, but there's no such bias against wider rows. This explains why you saw smaller values of i being undersampled. Had you run the test series all the way up to the max number of tuples per block, which is probably a couple hundred in this test, I think you'd have seen the bias go away again. But the takeaway point is that we have to sample all tuples per page, not just a limited number of them, if we want to change it like this. regards, tom lane Hmm... In my previous reply (which hasn't shown up yet!) I realized I made a mistake: the fraction/probability could be any of 0.25, 0.50, or 0.75. Cheers, Gavin
Re: [HACKERS] ANALYZE sampling is too good
On 12/12/13 07:22, Gavin Flower wrote: On 12/12/13 06:22, Tom Lane wrote: I wrote: Hm. You can only take N rows from a block if there actually are at least N rows in the block. So the sampling rule I suppose you are using is select up to N rows from each sampled block --- and that is going to favor the contents of blocks containing narrower-than-average rows. Oh, no, wait: that's backwards. (I plead insufficient caffeine.) Actually, this sampling rule discriminates *against* blocks with narrower rows. You previously argued, correctly I think, that sampling all rows on each page introduces no new bias because row width cancels out across all sampled pages. However, if you just include up to N rows from each page, then rows on pages with more than N rows have a lower probability of being selected, but there's no such bias against wider rows. This explains why you saw smaller values of i being undersampled. Had you run the test series all the way up to the max number of tuples per block, which is probably a couple hundred in this test, I think you'd have seen the bias go away again. But the takeaway point is that we have to sample all tuples per page, not just a limited number of them, if we want to change it like this. regards, tom lane Surely we want to sample a 'constant fraction' (obviously, in practice you have to sample an integral number of rows in a page!) of rows per page? The simplest way, as Tom suggests, is to use all the rows in a page. However, if you wanted the same number of rows from a greater number of pages, you could (for example) select a quarter of the rows from each page. In which case, when this is a fractional number: take the integral number of rows, plus on extra row with a probability equal to the fraction (here 0.25). Either way, if it is determined that you need N rows, then keep selecting pages at random (but never use the same page more than once) until you have at least N rows. 
Cheers, Gavin Yes, the fraction/probability could actually be one of 0.25, 0.50, or 0.75. But there is a bias introduced by the arithmetic average size of the rows in a page. This results in block sampling favouring large rows, as they are in a larger proportion of pages. For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, using 400 byte pages. In the pathologically worst case, assuming maximum packing density and no page has both types: the large rows would occupy 500 pages and the smaller rows 50 pages. So if one selected 11 pages at random, you get about 10 pages of large rows and about one for small rows! In practice, it would be much less extreme - for a start, not all blocks will be fully packed, most blocks would have both types of rows, and there is usually greater variation in row size - but still a bias towards sampling larger rows. So somehow, this bias needs to be counteracted. Cheers, Gavin
Re: [HACKERS] ANALYZE sampling is too good
Gavin Flower gavinflo...@archidevsys.co.nz wrote: For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, using 400 byte pages. In the pathologically worst case, assuming maximum packing density and no page has both types: the large rows would occupy 500 pages and the smaller rows 50 pages. So if one selected 11 pages at random, you get about 10 pages of large rows and about one for small rows! With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
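Kevin's correction checks out numerically: with whole-page sampling the expected number of sampled rows of each kind balances out. A quick simulation of Gavin's 400-byte-page example (illustrative Python, not from the thread):

```python
import random

random.seed(2)

# Gavin's worst case: 500 pages holding 2 large rows each, 50 pages
# holding 20 small rows each (1000 rows of either kind in total).
pages = ([["large"] * 2 for _ in range(500)]
         + [["small"] * 20 for _ in range(50)])

large = small = 0
for _ in range(10_000):                 # repeat the "pick 11 pages" experiment
    for page in random.sample(pages, 11):
        for row in page:
            if row == "large":
                large += 1
            else:
                small += 1

# Expected per draw: 11 * (500/550) * 2 = 20 large rows and
# 11 * (50/550) * 20 = 20 small rows, so the long-run ratio tends to 1:
# whole-page block sampling does not over-count large rows.
print(round(large / small, 2))
```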
Re: [HACKERS] ANALYZE sampling is too good
On 12/12/13 08:14, Gavin Flower wrote: On 12/12/13 07:22, Gavin Flower wrote: On 12/12/13 06:22, Tom Lane wrote: I wrote: Hm. You can only take N rows from a block if there actually are at least N rows in the block. So the sampling rule I suppose you are using is select up to N rows from each sampled block --- and that is going to favor the contents of blocks containing narrower-than-average rows. Oh, no, wait: that's backwards. (I plead insufficient caffeine.) Actually, this sampling rule discriminates *against* blocks with narrower rows. You previously argued, correctly I think, that sampling all rows on each page introduces no new bias because row width cancels out across all sampled pages. However, if you just include up to N rows from each page, then rows on pages with more than N rows have a lower probability of being selected, but there's no such bias against wider rows. This explains why you saw smaller values of i being undersampled. Had you run the test series all the way up to the max number of tuples per block, which is probably a couple hundred in this test, I think you'd have seen the bias go away again. But the takeaway point is that we have to sample all tuples per page, not just a limited number of them, if we want to change it like this. regards, tom lane Surely we want to sample a 'constant fraction' (obviously, in practice you have to sample an integral number of rows in a page!) of rows per page? The simplest way, as Tom suggests, is to use all the rows in a page. However, if you wanted the same number of rows from a greater number of pages, you could (for example) select a quarter of the rows from each page. In which case, when this is a fractional number: take the integral number of rows, plus on extra row with a probability equal to the fraction (here 0.25). Either way, if it is determined that you need N rows, then keep selecting pages at random (but never use the same page more than once) until you have at least N rows. 
Cheers, Gavin Yes, the fraction/probability could actually be one of 0.25, 0.50, or 0.75. But there is a bias introduced by the arithmetic average size of the rows in a page. This results in block sampling favouring large rows, as they are in a larger proportion of pages. For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, using 400 byte pages. In the pathologically worst case, assuming maximum packing density and no page has both types: the large rows would occupy 500 pages and the smaller rows 50 pages. So if one selected 11 pages at random, you get about 10 pages of large rows and about one for small rows! In practice, it would be much less extreme - for a start, not all blocks will be fully packed, most blocks would have both types of rows, and there is usually greater variation in row size - but still a bias towards sampling larger rows. So somehow, this bias needs to be counteracted. Cheers, Gavin Actually, I just thought of a possible way to overcome the bias towards large rows.

1. Calculate (a rough estimate may be sufficient, if not too 'rough') the size of the smallest row.
2. Select a page at random (never selecting the same page twice).
3. Then select rows at random within the page (never selecting the same row twice). For each row selected, accept it with probability equal to (size of smallest row)/(size of selected row). I think you'll find that will almost completely offset the bias towards larger rows!
4. If you do not have sufficient rows, and you still have pages not yet selected, goto 2.

Note that it will be normal for some pages not to have any rows selected, especially for large tables! Cheers, Gavin P.S. I really need to stop thinking about this problem, and get on with my assigned project!!!
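Gavin's step-3 rejection rule can be sketched as follows. This is a hypothetical illustration with made-up row sizes; the acceptance probability (smallest size)/(row size) is the only part taken from his description:

```python
import random

random.seed(1)

# Made-up candidate rows, as byte sizes: 1000 small (20 B) and 1000 large
# (200 B) rows that the page-level sampling happened to visit.
rows = [20] * 1000 + [200] * 1000
min_size = min(rows)

# Accept each row with probability (size of smallest row) / (size of this
# row): small rows are always kept, large rows only 20/200 = 10% of the time.
accepted = [size for size in rows if random.random() < min_size / size]

small = sum(1 for size in accepted if size == 20)
large = sum(1 for size in accepted if size == 200)
# All 1000 small rows survive; only ~100 large rows do, counteracting a
# 10x over-representation of large rows at the page level.
print(small, large)
```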
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 4:48 PM, Peter Geoghegan p...@heroku.com wrote: Why would I even mention that to a statistician? We want guidance. But yes, I bet I could give a statistician an explanation of statistics target that they'd understand without too much trouble. Actually, I think that if we told a statistician about the statistics target, his or her response would be: why would you presume to know ahead of time what statistics target is going to be effective? I suspect that the basic problem is that it isn't adaptive. I think that if we could somehow characterize the quality of our sample as we took it, and then cease sampling when we reached a certain degree of confidence in its quality, that would be helpful. It might not even matter that the sample was clustered from various blocks. -- Peter Geoghegan
Re: [HACKERS] ANALYZE sampling is too good
On 12/12/13 08:31, Kevin Grittner wrote: Gavin Flower gavinflo...@archidevsys.co.nz wrote: For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, using 400 byte pages. In the pathologically worst case, assuming maximum packing density and no page has both types: the large rows would occupy 500 pages and the smaller rows 50 pages. So if one selected 11 pages at random, you get about 10 pages of large rows and about one for small rows! With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company Sorry, I've simply come up with well argued nonsense! Kevin, you're dead right. Cheers, Gavin
Re: [HACKERS] ANALYZE sampling is too good
On 12/12/13 08:39, Gavin Flower wrote: On 12/12/13 08:31, Kevin Grittner wrote: Gavin Flower gavinflo...@archidevsys.co.nz wrote: For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, using 400 byte pages. In the pathologically worst case, assuming maximum packing density and no page has both types: the large rows would occupy 500 pages and the smaller rows 50 pages. So if one selected 11 pages at random, you get about 10 pages of large rows and about one for small rows! With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company Sorry, I've simply come up with well argued nonsense! Kevin, you're dead right. Cheers, Gavin I looked at: http://www.postgresql.org/docs/current/interactive/storage-page-layout.html this says that each row has an overhead, which suggests there should be a bias towards small rows. There must be a lot of things going on, that I'm simply not aware of, that affect sampling bias... Cheers, Gavin
Re: [HACKERS] ANALYZE sampling is too good
On 12/12/13 09:12, Gavin Flower wrote: On 12/12/13 08:39, Gavin Flower wrote: On 12/12/13 08:31, Kevin Grittner wrote: Gavin Flower gavinflo...@archidevsys.co.nz wrote: For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, using 400 byte pages. In the pathologically worst case, assuming maximum packing density and no page has both types: the large rows would occupy 500 pages and the smaller rows 50 pages. So if one selected 11 pages at random, you get about 10 pages of large rows and about one for small rows! With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company Sorry, I've simply come up with well argued nonsense! Kevin, you're dead right. Cheers, Gavin I looked at: http://www.postgresql.org/docs/current/interactive/storage-page-layout.html this says that each row has an overhead, which suggests there should be a bias towards small rows. There must be a lot of things going on, that I'm simply not aware of, that affect sampling bias... Cheers, Gavin Ignoring overheads per row and other things... There will be a biasing effect when the distribution of sizes is not symmetric. For example: when the majority of rows have sizes greater than the arithmetic mean, then most samples will be biased towards larger rows. Similarly there could be a bias towards smaller rows when most rows are smaller than the arithmetic mean. Yes, I did think about this in depth - but it is way too complicated to attempt to quantify the bias, because it depends on too many factors (even just limiting it to the distribution of row sizes). So apart from the nature of volatility of the table, and the pattern of insertions/updates/deletes - there will be a bias depending on the distribution of values in the table. So I despair that a simple, elegant, practical algorithm will ever be found. 
Therefore, I expect the best answer is probably some kind of empirical adaptive approach - which I think has already been suggested. Cheers, Gavin
Re: [HACKERS] ANALYZE sampling is too good
I think we're all wet here. I don't see any bias towards larger or smaller rows. Larger tuples will be on a larger number of pages but there will be fewer of them on any one page. The average effect should be the same. Smaller values might have a higher variance with block based sampling than larger values. But that actually *is* the kind of thing that Simon's approach of just compensating with later samples can deal with. -- greg
Re: [HACKERS] ANALYZE sampling is too good
On Thu, Dec 12, 2013 at 07:22:59AM +1300, Gavin Flower wrote: Surely we want to sample a 'constant fraction' (obviously, in practice you have to sample an integral number of rows in a page!) of rows per page? The simplest way, as Tom suggests, is to use all the rows in a page. However, if you wanted the same number of rows from a greater number of pages, you could (for example) select a quarter of the rows from each page. In which case, when this is a fractional number: take the integral number of rows, plus one extra row with a probability equal to the fraction (here 0.25). In this discussion we've mostly used block = 1 postgresql block of 8k. But when reading from a disk, once you've read one block you can basically read the following ones practically for free. So I wonder if you could make your sampling always read 16 consecutive blocks, but then use 25-50% of the tuples. That way you get many more tuples for the same number of disk I/O seeks. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] ANALYZE sampling is too good
On 12/11/2013 02:39 PM, Martijn van Oosterhout wrote: In this discussion we've mostly used block = 1 postgresql block of 8k. But when reading from a disk once you've read one block you can basically read the following ones practically for free. So I wonder if you could make your sampling read always 16 consecutive blocks, but then use 25-50% of the tuples. That way you get many more tuples for the same amount of disk I/O seeks.. Yeah, that's what I meant by tune this for the FS. We'll probably have to test a lot of different block sizes on different FSes before we arrive at a reasonable size, and even then I'll bet we have to offer a GUC. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] ANALYZE sampling is too good
Greg Stark wrote: It's also applicable for the other stats; histogram buckets constructed from a 5% sample are more likely to be accurate than those constructed from a 0.1% sample. Same with nullfrac. The degree of improved accuracy, would, of course, require some math to determine. This "some math" is straightforward basic statistics. The 95th percentile confidence interval for a sample consisting of 300 samples from a population of 1 million would be 5.66%. A sample consisting of 1000 samples would have a 95th percentile confidence interval of +/- 3.1%. Doesn't all that assume a normally distributed random variable? I don't think it can be applied to database table contents without further analysis. Yours, Laurenz Albe
Re: [HACKERS] ANALYZE sampling is too good
On 10 Dec 2013 08:28, Albe Laurenz laurenz.a...@wien.gv.at wrote: Doesn't all that assume a normally distributed random variable? I don't think so because of the law of large numbers. If you have a large population and sample it, the sample behaves like a normal distribution even if the distribution of the population isn't.
Re: [HACKERS] ANALYZE sampling is too good
Greg Stark wrote: Doesn't all that assume a normally distributed random variable? I don't think so because of the law of large numbers. If you have a large population and sample it the sample behaves like a normal distribution even if the distribution of the population isn't. Statistics is the part of mathematics I know least of, but aren't you saying that in a large enough sample of people there will always be some with age 0 (which is what a normal distribution would imply)? Yours, Laurenz Albe
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark st...@mit.edu wrote: On 10 Dec 2013 08:28, Albe Laurenz laurenz.a...@wien.gv.at wrote: Doesn't all that assume a normally distributed random variable? I don't think so because of the law of large numbers. If you have a large population and sample it the sample behaves like a normal distribution when if the distribution of the population isn't. No, the large population says that if you have an AVERAGE of many samples of a random variable, the random variable that is the AVERAGE behaves like a normal. The variable itself doesn't. And for n_distinct, you need to know the variable itself.
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 11:32 AM, Claudio Freire klaussfre...@gmail.com wrote: On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark st...@mit.edu wrote: On 10 Dec 2013 08:28, Albe Laurenz laurenz.a...@wien.gv.at wrote: Doesn't all that assume a normally distributed random variable? I don't think so because of the law of large numbers. If you have a large population and sample it the sample behaves like a normal distribution when if the distribution of the population isn't. No, the large population says that if you have an AVERAGE of many samples of a random variable, the random variable that is the AVERAGE behaves like a normal. Sorry, that should be No, the *law of large numbers* says...
Re: [HACKERS] ANALYZE sampling is too good
On 6 December 2013 09:21, Andres Freund and...@2ndquadrant.com wrote: On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote: Has anyone ever thought about opportunistic ANALYZE piggy-backing on other full-table scans? That doesn't really help Greg, because his complaint is mostly that a fresh ANALYZE is too expensive, but it could be an interesting, albeit risky approach. What I've been thinking of is a) making it piggy back on scans vacuum is doing instead of doing separate ones all the time (if possible, analyze needs to be more frequent). Currently with quite some likelihood the cache will be gone again when revisiting. b) make analyze incremental. In lots of bigger tables most of the table is static - and we actually *do* know that, thanks to the vm. So keep a rawer form of what ends in the catalogs around somewhere, chunked by the region of the table the statistic is from. Every time a part of the table changes, re-sample only that part. Then recompute the aggregate. Piggy-backing sounds like a bad thing. If I run a query, I don't want to be given some extra task thanks! Especially if we might need to run data type code I'm not authorised to run, or to sample data I may not be authorised to see. The only way that could work is to kick off an autovacuum worker to run the ANALYZE as a separate process and then use synchronous scan behaviour to derive benefit indirectly. However, these things presume that we need to continue scanning most of the blocks of the table, which I don't think needs to be the case. There is a better way. Back in 2005/6, I advocated a block sampling method, as described by Chaudri et al (ref?) That has two advantages:

* We don't need to visit all of the blocks, reducing I/O
* It would also give better analysis of clustered data (it's all in that paper...)

The recent patch on TABLESAMPLE contained a rewrite of the guts of ANALYZE into a generic sample scan, which would then be used as the basis for a TABLESAMPLE BERNOULLI. 
A TABLESAMPLE SYSTEM could use a block sample, as it does in some other DBMS (DB2, SQLServer). While I was unimpressed with the TABLESAMPLE patch, all it needs is some committer-love, so Greg... -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ANALYZE sampling is too good
On 2013-12-10 19:23:37 +, Simon Riggs wrote: On 6 December 2013 09:21, Andres Freund and...@2ndquadrant.com wrote: On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote: Has anyone ever thought about opportunistic ANALYZE piggy-backing on other full-table scans? That doesn't really help Greg, because his complaint is mostly that a fresh ANALYZE is too expensive, but it could be an interesting, albeit risky approach. What I've been thinking of is a) making it piggy back on scans vacuum is doing instead of doing separate ones all the time (if possible, analyze needs to be more frequent). Currently with quite some likelihood the cache will be gone again when revisiting. b) make analyze incremental. In lots of bigger tables most of the table is static - and we actually *do* know that, thanks to the vm. So keep a rawer form of what ends up in the catalogs around somewhere, chunked by the region of the table the statistic is from. Every time a part of the table changes, re-sample only that part. Then recompute the aggregate. Piggy-backing sounds like a bad thing. If I run a query, I don't want to be given some extra task, thanks! Especially if we might need to run data type code I'm not authorised to run, or to sample data I may not be authorised to see. The only way that could work is to kick off an autovacuum worker to run the ANALYZE as a separate process and then use synchronous scan behaviour to derive benefit indirectly. I was suggesting to piggyback on VACUUM, not user queries. The latter suggestion was somebody else. In combination with incremental or chunk-wise building of statistics, doing more frequent partial vacuums which re-compute the changing part of the stats would be great. All those blocks have to be read anyway, and we should be more aggressive about vacuuming anyway.
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs si...@2ndquadrant.com wrote: However, these things presume that we need to continue scanning most of the blocks of the table, which I don't think needs to be the case. There is a better way. Do they? I think it's one opportunistic way of ameliorating the cost. Back in 2005/6, I advocated a block sampling method, as described by Chaudri et al (ref?) I don't think that anyone believes that not doing block sampling is tenable, fwiw. Clearly some type of block sampling would be preferable for most or all purposes. -- Peter Geoghegan
Re: [HACKERS] ANALYZE sampling is too good
On 10 December 2013 19:49, Peter Geoghegan p...@heroku.com wrote: On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs si...@2ndquadrant.com wrote: However, these things presume that we need to continue scanning most of the blocks of the table, which I don't think needs to be the case. There is a better way. Do they? I think it's one opportunistic way of ameliorating the cost. Back in 2005/6, I advocated a block sampling method, as described by Chaudri et al (ref?) I don't think that anyone believes that not doing block sampling is tenable, fwiw. Clearly some type of block sampling would be preferable for most or all purposes. If we have one way of reducing the cost of ANALYZE, I'd suggest we don't need two ways - especially if the second way involves the interaction of otherwise not fully related parts of the code. Or to put it clearly, let's go with block sampling and then see if that needs even more work. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] ANALYZE sampling is too good
On 12/10/2013 11:49 AM, Peter Geoghegan wrote: On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs si...@2ndquadrant.com wrote: I don't think that anyone believes that not doing block sampling is tenable, fwiw. Clearly some type of block sampling would be preferable for most or all purposes. As discussed, we need math though. Does anyone have an ACM subscription and time to do a search? Someone must. We can buy one with community funds, but no reason to do so if we don't have to. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 7:54 PM, Josh Berkus j...@agliodbs.com wrote: As discussed, we need math though. Does anyone have an ACM subscription and time to do a search? Someone must. We can buy one with community funds, but no reason to do so if we don't have to. Anyone in a university likely has access through their library. But I don't really think this is the right way to go about this. Research papers are going to turn up pretty specialized solutions that are probably patented. We don't even have the basic understanding we need. I suspect a basic textbook chapter on multistage sampling will discuss at least the standard techniques. Once we have a handle on the standard multistage sampling techniques that would be safe from patents, then we might want to go look at research papers to find how they've been applied to databases in the past, but we would have to do that fairly carefully. -- greg
Re: [HACKERS] ANALYZE sampling is too good
On 10 December 2013 19:54, Josh Berkus j...@agliodbs.com wrote: On 12/10/2013 11:49 AM, Peter Geoghegan wrote: On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs si...@2ndquadrant.com wrote: I don't think that anyone believes that not doing block sampling is tenable, fwiw. Clearly some type of block sampling would be preferable for most or all purposes. As discussed, we need math though. Does anyone have an ACM subscription and time to do a search? Someone must. We can buy one with community funds, but no reason to do so if we don't have to. We already have that, just use Vitter's algorithm at the block level rather than the row level. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
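For reference, the reservoir technique analyzed in Vitter's sampling paper (Algorithm R, the simplest of the family) maintains a uniform fixed-size sample from a stream whose length need not be known in advance; applying it "at the block level" just means streaming block numbers instead of rows. A minimal sketch, with the relation size and sample size invented for illustration:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Algorithm R: a uniform sample of k items from a stream
    whose total length need not be known in advance."""
    rng = rng or random.Random(0)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i replaces a random reservoir slot with probability
            # k/(i+1), which keeps every item's inclusion probability equal.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Applied at the block level: sample 30 block numbers out of 10,000.
sampled_blocks = reservoir_sample(range(10_000), 30)
```

Each block number ends up in the reservoir with equal probability k/n, which is the property being appealed to here.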
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 11:59 AM, Greg Stark st...@mit.edu wrote: But I don't really think this is the right way to go about this. Research papers are going to turn up pretty specialized solutions that are probably patented. We don't even have the basic understanding we need. I suspect a basic textbook chapter on multistage sampling will discuss at least the standard techniques. I agree that looking for information on block level sampling specifically, and its impact on estimation quality is likely to not turn up very much, and whatever it does turn up will have patent issues. -- Peter Geoghegan
Re: [HACKERS] ANALYZE sampling is too good
On 12/10/2013 10:00 PM, Simon Riggs wrote: On 10 December 2013 19:54, Josh Berkus j...@agliodbs.com wrote: On 12/10/2013 11:49 AM, Peter Geoghegan wrote: On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs si...@2ndquadrant.com wrote: I don't think that anyone believes that not doing block sampling is tenable, fwiw. Clearly some type of block sampling would be preferable for most or all purposes. As discussed, we need math though. Does anyone have an ACM subscription and time to do a search? Someone must. We can buy one with community funds, but no reason to do so if we don't have to. We already have that, just use Vitter's algorithm at the block level rather than the row level. And what do you do with the blocks? How many blocks do you choose? Details, please. - Heikki
Re: [HACKERS] ANALYZE sampling is too good
On 11/12/13 09:19, Heikki Linnakangas wrote: On 12/10/2013 10:00 PM, Simon Riggs wrote: On 10 December 2013 19:54, Josh Berkus j...@agliodbs.com wrote: On 12/10/2013 11:49 AM, Peter Geoghegan wrote: On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs si...@2ndquadrant.com wrote: I don't think that anyone believes that not doing block sampling is tenable, fwiw. Clearly some type of block sampling would be preferable for most or all purposes. As discussed, we need math though. Does anyone have an ACM subscription and time to do a search? Someone must. We can buy one with community funds, but no reason to do so if we don't have to. We already have that, just use Vitter's algorithm at the block level rather than the row level. And what do you do with the blocks? How many blocks do you choose? Details, please. Yeah - and we seem to be back to Josh's point about needing 'some math' to cope with the rows within a block not being a purely random selection. Regards Mark
Re: [HACKERS] ANALYZE sampling is too good
On 12/10/2013 01:33 PM, Mark Kirkwood wrote: Yeah - and we seem to be back to Josh's point about needing 'some math' to cope with the rows within a block not being a purely random selection. Well, sometimes they are effectively random. But sometimes they are not. The Chaudri et al paper had a formula for estimating randomness based on the grouping of rows in each block, assuming that the sampled blocks were widely spaced (if they aren't there's not much you can do). This is where you get up to needing a 5% sample; you need to take enough blocks that you're confident that the blocks you sampled are representative of the population. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 7:49 PM, Peter Geoghegan p...@heroku.com wrote: Back in 2005/6, I advocated a block sampling method, as described by Chaudri et al (ref?) I don't think that anyone believes that not doing block sampling is tenable, fwiw. Clearly some type of block sampling would be preferable for most or all purposes. We do block sampling now. But then we select rows from those blocks uniformly. Incidentally, if you mean Surajit Chaudhuri, he's a Microsoft Research lead so I would be nervous about anything he's published being patented by Microsoft. -- greg
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 2:37 PM, Robert Haas robertmh...@gmail.com wrote: On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes jeff.ja...@gmail.com wrote: My reading of the code is that if it is not in the MCV, then it is assumed to have the average selectivity (about 1/n_distinct, but deflating top and bottom for the MCV list). There is also a check that it is less than the least common of the MCV, but I don't know why that situation would ever prevail--that should always be higher or equal to the average selectivity. I've never seen an n_distinct value of more than 5 digits, regardless of reality. Typically I've seen 20-50k, even if the real number is much higher. I don't recall seeing an n_distinct that is literally above 100,000 in the wild, but I've seen negative ones that multiply with reltuples to give values more than that. In test cases it is easy enough to get values in the millions by creating tables using floor(random()*$something).

create table baz as select floor(random()*1000), md5(random()::text) from generate_series(1,1);
create table baz2 as select * from baz order by floor;
create table baz3 as select * from baz order by md5(floor::text);

baz unclustered, baz2 is clustered with perfect correlation, baz3 is clustered but without correlation. After analyzing all of them:

select tablename, n_distinct, correlation from pg_stats
  where tablename like 'baz%' and attname='floor';

 tablename | n_distinct  | correlation
-----------+-------------+-------------
 baz       | 8.56006e+06 |  0.00497713
 baz2      |      333774 |           1
 baz3      |      361048 |  -0.0118147

So baz is pretty close, while the other two are way off. But the n_distinct computation doesn't depend on the order of the rows, as far as I can tell, but only the composition of the sample. So this can only mean that our random sampling method is desperately non-random, right?
But the n_distinct value is only for non-MCVs, so if we estimate the selectivity of column = 'rarevalue' to be (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces the estimate, and making the MCV list longer naturally makes mcvfrac bigger. Ah, I see. By including more things into MCV, we crowd out the rest of the space implicitly. I'm not sure how important the less-frequent-than-the-least-common-MCV part is, but I'm very sure that raising the statistics target helps to solve the problem of overestimating the prevalence of uncommon values in a very big table. I think that parts of the planner are N^2 in the size of histogram (or was that the size of the MCV list?). So we would probably need a way to use a larger sample size to get more accurate n_distinct and MCV frequencies, but not save the entire histogram that goes with that sample size. I think the saving the histogram part is important. As you say, the MCVs are important for a variety of planning purposes, such as hash joins. More than that, in my experience, people with large tables are typically very willing to spend more planning time to get a better plan, because mistakes are expensive and the queries are likely to run for a while anyway. People with small tables care about planning time, because it makes no sense to spend an extra 1ms planning a query unless you improve the plan by enough to save at least 1ms when executing it, and when the tables are small and access is expected to be fast anyway that's often not the case. I would think that the dichotomy is more about the size of the query-plan than of the tables. I think a lot of people with huge tables end up doing mostly indexed lookups in unique or highly selective indexes, once all the planning is done. Does anyone have generators for examples of cases where increasing the sample size to get better histograms (as opposed more accurate n_distinct or more accurate MCV) was the key to fixing bad plans? 
I wonder if it is more bins, or more accurate boundaries, that makes the difference. Cheers, Jeff
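The baz2/baz3 skew above is reproducible outside the database. As I understand it, analyze.c estimates n_distinct with the Haas-Stokes "Duj1" form, roughly n*d / (n - f1 + f1*n/N), where d is the number of distinct values in the sample and f1 the number seen exactly once. A rough, self-contained simulation (table sizes and layout invented for illustration) shows how a sample built from whole blocks of clustered data starves f1 and collapses the estimate:

```python
import random
from collections import Counter

def duj1(n, N, counts):
    """Haas-Stokes 'Duj1' n_distinct estimator (the form analyze.c cites):
    D = n*d / (n - f1 + f1*n/N), where d = distinct values in the sample
    and f1 = values seen exactly once."""
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    return n * d / (n - f1 + f1 * n / N)

rng = random.Random(1)
N = 200_000
data = [i // 10 for i in range(N)]   # 20,000 distinct values, perfectly clustered
n = 5_000

# Uniform row sample: 5,000 rows drawn independently across the whole table.
uniform = Counter(data[i] for i in rng.sample(range(N), n))

# Block sample: 100 random blocks of 50 consecutive rows, every row kept.
rows_per_block = 50
blocks = rng.sample(range(N // rows_per_block), n // rows_per_block)
block = Counter(data[b * rows_per_block + i] for b in blocks
                for i in range(rows_per_block))

uniform_est = duj1(n, N, uniform)
block_est = duj1(n, N, block)
# Clustered duplicates land in the same block, so the block sample has
# almost no singletons (f1 ~ 0) and the estimate collapses toward d.
```

With this layout the block-sampled estimate lands near the sample's own distinct count (hundreds) while the uniform-row estimate stays within a factor of two of the true 20,000, mirroring the baz2 result.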
Re: [HACKERS] ANALYZE sampling is too good
On 12/10/13 2:17 PM, Peter Geoghegan wrote: On Tue, Dec 10, 2013 at 11:59 AM, Greg Stark st...@mit.edu wrote: But I don't really think this is the right way to go about this. Research papers are going to turn up pretty specialized solutions that are probably patented. We don't even have the basic understanding we need. I suspect a basic textbook chapter on multistage sampling will discuss at least the standard techniques. I agree that looking for information on block level sampling specifically, and its impact on estimation quality is likely to not turn up very much, and whatever it does turn up will have patent issues. We have an entire analytics dept. at work that specializes in finding patterns in our data. I might be able to get some time from them to at least provide some guidance here, if the community is interested. They could really only serve in a consulting role though. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 3:26 PM, Jim Nasby j...@nasby.net wrote: I agree that looking for information on block level sampling specifically, and its impact on estimation quality is likely to not turn up very much, and whatever it does turn up will have patent issues. We have an entire analytics dept. at work that specializes in finding patterns in our data. I might be able to get some time from them to at least provide some guidance here, if the community is interested. They could really only serve in a consulting role though. I think that Greg had this right several years ago: it would probably be very useful to have the input of someone with a strong background in statistics. It doesn't seem that important that they already know a lot about databases, provided they can understand what our constraints are, and what is important to us. It might just be a matter of having them point us in the right direction. -- Peter Geoghegan
Re: [HACKERS] ANALYZE sampling is too good
On 10 December 2013 23:43, Peter Geoghegan p...@heroku.com wrote: On Tue, Dec 10, 2013 at 3:26 PM, Jim Nasby j...@nasby.net wrote: I agree that looking for information on block level sampling specifically, and its impact on estimation quality is likely to not turn up very much, and whatever it does turn up will have patent issues. We have an entire analytics dept. at work that specializes in finding patterns in our data. I might be able to get some time from them to at least provide some guidance here, if the community is interested. They could really only serve in a consulting role though. I think that Greg had this right several years ago: it would probably be very useful to have the input of someone with a strong background in statistics. It doesn't seem that important that they already know a lot about databases, provided they can understand what our constraints are, and what is important to us. It might just be a matter of having them point us in the right direction. err, so what does stats target mean exactly in statistical theory? Waiting for a statistician, and confirming his credentials before you believe him above others here, seems like wasted time. What your statistician will tell you is that YMMV, depending on the data. So we'll still need a parameter to fine tune things when the default is off. We can argue about the default later, at various levels of rigour. Block sampling, with parameter to specify sample size. +1 -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] ANALYZE sampling is too good
On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs si...@2ndquadrant.com wrote: Block sampling, with parameter to specify sample size. +1 Simon this is very frustrating. Can you define block sampling? -- greg
Re: [HACKERS] ANALYZE sampling is too good
On 11 December 2013 00:28, Greg Stark st...@mit.edu wrote: On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs si...@2ndquadrant.com wrote: Block sampling, with parameter to specify sample size. +1 Simon this is very frustrating. Can you define block sampling? Blocks selected using Vitter's algorithm, using a parameterised fraction of the total. When we select a block we should read all rows on that block, to help identify the extent of clustering within the data. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
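The definition above (a parameterised fraction of blocks, every row kept from each chosen block) can be sketched as follows; the list-of-lists table layout is an invented stand-in for a heap of pages:

```python
import random

def block_sample(table, fraction, rng=None):
    """Pick a parameterised fraction of blocks uniformly at random and keep
    EVERY row on each chosen block, preserving within-block clustering."""
    rng = rng or random.Random(0)
    k = max(1, round(fraction * len(table)))
    rows = []
    for b in sorted(rng.sample(range(len(table)), k)):
        rows.extend(table[b])
    return rows

# Toy relation: 100 blocks of 10 rows; a 5% sample returns 5 whole blocks.
table = [[(b, i) for i in range(10)] for b in range(100)]
rows = block_sample(table, 0.05)
```

Keeping whole blocks is what exposes clustering to the sampler, and it is also exactly what makes the rows non-independent, which is the question raised next in the thread.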
Re: [HACKERS] ANALYZE sampling is too good
On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs si...@2ndquadrant.com wrote: When we select a block we should read all rows on that block, to help identify the extent of clustering within the data. So how do you interpret the results of a sample read that way in a manner that doesn't introduce bias? -- greg
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 4:14 PM, Simon Riggs si...@2ndquadrant.com wrote: err, so what does stats target mean exactly in statistical theory? Why would I even mention that to a statistician? We want guidance. But yes, I bet I could give a statistician an explanation of statistics target that they'd understand without too much trouble. Waiting for a statistician, and confirming his credentials before you believe him above others here, seems like wasted time. What your statistician will tell you is that YMMV, depending on the data. I'm reasonably confident that they'd give me more than that. So we'll still need a parameter to fine tune things when the default is off. We can argue about the default later, at various levels of rigour. Block sampling, with parameter to specify sample size. +1 Again, it isn't as if the likely efficacy of *some* block sampling approach is in question. I'm sure analyze.c is currently naive about many things. Everyone knows that there are big gains to be had. Balancing those gains against the possible downsides in terms of impact on the quality of statistics generated is pretty nuanced. I do know enough to know that a lot of thought goes into mitigating and/or detecting the downsides of block-based sampling. -- Peter Geoghegan
Re: [HACKERS] ANALYZE sampling is too good
On 11 December 2013 00:44, Greg Stark st...@mit.edu wrote: On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs si...@2ndquadrant.com wrote: When we select a block we should read all rows on that block, to help identify the extent of clustering within the data. So how do you interpret the results of a sample read that way in a manner that doesn't introduce bias? Yes, it is not a perfect statistical sample. All sampling is subject to an error that is data dependent. I'm happy that we have an option to select this or not and a default that maintains current behaviour, since otherwise we might expect some plan instability. I would like to be able to * allow ANALYZE to run faster in some cases * increase/decrease sample size when it matters * have the default sample size vary according to the size of the table, i.e. a proportional sample -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] ANALYZE sampling is too good
For what it's worth, I'll quote Chaudhuri et al.'s first line from the abstract about block sampling: Block-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics. And after briefly glancing through the paper, my opinion is that it works because, after making one version of the statistics, they cross-validate, see how well it goes, and then collect more if the cross-validation error is large (for example because the data is clustered). Without this bit, as far as I can tell, a simple block-based sampler will be bound to make catastrophic mistakes depending on the distribution. Also, just another point about targets (e.g. X%) for estimating stuff from the samples (as was discussed in the thread). Basically, there is a point in talking about sampling a fixed target (5%) of the data ONLY if you fix the actual distribution of your data in the table, and decide what statistic you are trying to find, e.g. average, std. dev., a 90% percentile, ndistinct or a histogram and so forth. There won't be a general answer, as the percentages will be distribution-dependent and statistic-dependent. Cheers, Sergey PS I'm not a statistician, but I use statistics a lot *** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: [HACKERS] ANALYZE sampling is too good
Peter Geoghegan p...@heroku.com writes: Again, it isn't as if the likely efficacy of *some* block sampling approach is in question. I'm sure analyze.c is currently naive about many things. It's not *that* naive; this is already about a third-generation algorithm. The last major revision (commit 9d6570b8a4) was to address problems with misestimating the number of live tuples due to nonuniform tuple density in a table. IIRC, the previous code could be seriously misled if the first few pages in the table were significantly non-representative of the live-tuple density further on. I'm not sure how we can significantly reduce the number of blocks examined without re-introducing that hazard in some form. In particular, given that you want to see at least N tuples, how many blocks will you read if you don't have an a-priori estimate of tuple density? You have to decide that before you start sampling blocks, if you want all blocks to have the same probability of being selected and you want to read them in sequence. regards, tom lane
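As I read it, ANALYZE already resolves this chicken-and-egg problem in two stages: block selection uses the known relation size, and row selection runs a Vitter-style reservoir precisely because the row count is unknown until the blocks are read. A simplified sketch of that shape (toy data structures, not the real page format or the exact analyze.c logic):

```python
import random

def two_stage_sample(table, target_rows, rng=None):
    """Stage 1: choose blocks at random (the block count IS known up front).
    Stage 2: Vitter-style reservoir over the rows streamed from those blocks,
    because the total row count is NOT known until the rows are read."""
    rng = rng or random.Random(42)
    nblocks = len(table)                     # table: list of blocks (lists of rows)
    chosen = sorted(rng.sample(range(nblocks), min(target_rows, nblocks)))
    reservoir, seen = [], 0
    for b in chosen:
        for row in table[b]:
            if seen < target_rows:
                reservoir.append(row)
            else:
                # keep this row with probability target_rows/(seen+1)
                j = rng.randrange(seen + 1)
                if j < target_rows:
                    reservoir[j] = row
            seen += 1
    return reservoir

# Toy relation: 2,000 blocks holding between 1 and 7 rows each.
table = [[(b, i) for i in range((b % 7) + 1)] for b in range(2000)]
sample = two_stage_sample(table, 300)
```

Tom's point survives the sketch: the number of blocks to visit (here, one per target row) still has to be fixed before any row is seen, without knowing the tuple density.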
Re: [HACKERS] ANALYZE sampling is too good
On Tuesday, December 10, 2013, Simon Riggs wrote: On 11 December 2013 00:28, Greg Stark st...@mit.edu wrote: On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs si...@2ndquadrant.com wrote: Block sampling, with parameter to specify sample size. +1 Simon this is very frustrating. Can you define block sampling? Blocks selected using Vitter's algorithm, using a parameterised fraction of the total. OK, thanks for defining that. We only need Vitter's algorithm when we don't know in advance how many items we are sampling from (such as for tuples--unless we want to rely on the previous estimate for the current round of analysis). But for blocks, we do know how many there are, so there are simpler ways to pick them. When we select a block we should read all rows on that block, to help identify the extent of clustering within the data. But we have no mechanism to store such information (or to use it if it were stored), nor even ways to prevent the resulting skew in the sample from seriously messing up the estimates which we do have ways of storing and using. Cheers, Jeff
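The point about the known block count is easy to see: when the population size is fixed up front, uniform sampling without replacement needs no streaming algorithm at all. A sketch with invented sizes:

```python
import random

# When the population size is known, a uniform k-sample without
# replacement is a one-liner; no reservoir/streaming algorithm is needed.
rng = random.Random(7)
nblocks = 131_072            # known up front from the relation's size in pages
sampled = sorted(rng.sample(range(nblocks), 300))
```

Sorting the chosen block numbers also lets the blocks be read in physical order, which matters for sequential I/O.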
Re: [HACKERS] ANALYZE sampling is too good
On 11 December 2013 01:27, Sergey E. Koposov m...@sai.msu.ru wrote: For what it's worth, I'll quote Chaudhuri et al.'s first line from the abstract about block sampling: Block-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics. This glosses over the point that both SQLServer and Oracle use this technique. And after briefly glancing through the paper, my opinion is that it works because, after making one version of the statistics, they cross-validate, see how well it goes, and then collect more if the cross-validation error is large (for example because the data is clustered). Without this bit, as far as I can tell, a simple block-based sampler will be bound to make catastrophic mistakes depending on the distribution. I don't think it's true that a block based sampler will be *bound* to make catastrophic mistakes. They can clearly happen, just as they can with random samples, hence the need for a parameter to control the sample size. Realistically, I never heard of an Oracle DBA doing advanced statistical mathematics before setting the sample size on ANALYZE. You use the default and bump it up if the sample is insufficient for the data. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 10:34 PM, Simon Riggs si...@2ndquadrant.com wrote: On 11 December 2013 01:27, Sergey E. Koposov m...@sai.msu.ru wrote: For what it's worth. I'll quote Chaudhuri et al. first line from the abstract about the block sampling. Block-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics. This glosses over the point that both SQLServer and Oracle use this technique. That seems like an unusual omission for Microsoft Research to have made. I didn't read that paper, because undoubtedly it's all patented. But before I figured that out, after finding it on Google randomly, I did read the first couple of paragraphs, which more or less said what follows - the entire paper - is an explanation as to why it's okay that we do block sampling. -- Peter Geoghegan
Re: [HACKERS] ANALYZE sampling is too good
On 12/11/2013 01:44 AM, Greg Stark wrote: On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs si...@2ndquadrant.com wrote: When we select a block we should read all rows on that block, to help identify the extent of clustering within the data. So how do you interpret the results of the sample read that way that doesn't introduce bias? Initially/experimentally we could just compare it to our current approach :) That is, implement *some* block sampling and then check it against what we currently have. Then figure out the bad differences. Rinse. Repeat. Cheers -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
Re: [HACKERS] ANALYZE sampling is too good
Greg, I really don't believe the 5% thing. It's not enough for n_distinct and it's *far* too high a value for linear properties like histograms or nullfrac etc. Actually, it is enough for n_distinct, or more properly, 5% is as good as you can get for n_distinct unless you're going to jump to scanning 50% or more. It's also applicable for the other stats; histogram buckets constructed from a 5% sample are more likely to be accurate than those constructed from a 0.1% sample. Same with nullfrac. The degree of improved accuracy would, of course, require some math to determine. From a computer point of view it's too high to be worth bothering. If we have to read 5% of the table we might as well do a full scan anyways, it'll be marginally slower but much better quality results. Reading 5% of a 200GB table is going to be considerably faster than reading the whole thing, if that 5% is being scanned in a way that the FS understands. Also, we can optimize this significantly by using the VM, as Robert (I think) suggested. In the advanced approaches section, there's also the idea of collecting analyze data from table pages while they're in memory anyway for other reasons. You do seem kind of hostile to the idea of full-page-sampling, going pretty far beyond "I'd need to see the math". Why? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] ANALYZE sampling is too good
Josh Berkus j...@agliodbs.com writes: Reading 5% of a 200GB table is going to be considerably faster than reading the whole thing, if that 5% is being scanned in a way that the FS understands. Really? See the upthread point that reading one sector from each track has just as much seek overhead as reading the whole thing. I will grant that if you think that reading a *contiguous* 5% of the table is good enough, you can make it faster --- but I don't believe offhand that you can make this better without seriously compromising the randomness of your sample. Too many tables are loaded roughly in time order, or in other ways that make contiguous subsets nonrandom. You do seem kind of hostile to the idea of full-page sampling, going pretty far beyond "I'd need to see the math." Why? I'm detecting a lot of hostility to assertions unsupported by any math. For good reason. regards, tom lane
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 1:03 PM, Josh Berkus j...@agliodbs.com wrote: I really don't believe the 5% thing. It's not enough for n_distinct and it's *far* too high a value for linear properties like histograms or nullfrac etc. Actually, it is enough for n_distinct, or more properly, 5% is as good as you can get for n_distinct unless you're going to jump to scanning 50% or more. I'd like to see a proof of that result. Not because I'm hostile to changing the algorithm, but because you've made numerous mathematical claims on this thread that fly in the face of what Greg, myself, and others understand to be mathematically true - including this one. If our understanding is wrong, then by all means let's get that fixed. But you're not going to convince anyone here that we should rip out the existing algorithm and its peer-reviewed journal citations by making categorical assertions about the right way to do things. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 6:03 PM, Josh Berkus j...@agliodbs.com wrote: It's also applicable for the other stats; histogram buckets constructed from a 5% sample are more likely to be accurate than those constructed from a 0.1% sample. Same with nullfrac. The degree of improved accuracy would, of course, require some math to determine. This "some math" is straightforward basic statistics. The 95% confidence interval for a sample of 300 drawn from a population of 1 million would be +/- 5.66%. A sample of 1000 would have a 95% confidence interval of +/- 3.1%. The histogram and nullfrac answer the same kind of question as a political poll: what fraction of the population falls within this subset? This is why pollsters don't need to sample 15 million Americans to have a decent poll result. That's just not how the math works for these kinds of questions. n_distinct is an entirely different kettle of fish. It's a different kind of problem, and the error rate there *is* going to be dependent on the percentage of the total population that you sampled. Moreover, from the papers I read I'm convinced any sample less than 50-80% is nearly useless, so I'm convinced you can't get good results without reading the whole table. -- greg
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 6:54 PM, Greg Stark st...@mit.edu wrote: This "some math" is straightforward basic statistics. The 95% confidence interval for a sample of 300 drawn from a population of 1 million would be +/- 5.66%. A sample of 1000 would have a 95% confidence interval of +/- 3.1%. Incidentally, I got this using an online sample size calculator. Google turns up several, but this one seems the easiest to use: http://www.raosoft.com/samplesize.html -- greg
Re: [HACKERS] ANALYZE sampling is too good
On Sat, Dec 7, 2013 at 11:46 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Dec 3, 2013 at 6:30 PM, Greg Stark st...@mit.edu wrote: I always gave the party line that ANALYZE only takes a small constant-sized sample so even very large tables should be very quick. But after hearing the same story again at Heroku I looked into it a bit further. I was kind of shocked by the numbers. ANALYZE takes a sample of 300 * statistics_target rows. That sounds pretty reasonable, but with default_statistics_target set to 100 that's 30,000 rows. If I'm reading the code right, it takes this sample by sampling 30,000 blocks and then (if the table is large enough) taking an average of one row per block. Each block is 8192 bytes, so that means it's reading 240MB of each table. That's a lot more than I realized. That is a lot. On the other hand, I question the subject line: sometimes, our ANALYZE sampling is not good enough. Before we raised the default statistics target from 10 to 100, complaints about bad plan choices due to insufficiently-precise statistics were legion -- and we still have people periodically proposing to sample a fixed percentage of the table instead of a fixed amount of the table, even on large tables, which is going in the opposite direction. I think this is because they're getting really bad n_distinct estimates, and no fixed-size sample can reliably give a good one. I don't recall ever tracing a bad plan down to a bad n_distinct. I have seen several that were due to bad frequency estimates in the MCV list, because hash join planning is extremely sensitive to that. Do we have some kind of catalog of generators of problematic data, so that changes can be tested on known problem sets? Perhaps a wiki page to accumulate them would be useful. For automated testing I guess the generator and query are the easy part; the hard part is the cost settings/caching/RAM needed to trigger the problem, and parsing and interpreting the results.
More generally, I think the basic problem that people are trying to solve by raising the statistics target is to avoid index scans on gigantic tables. Obviously, there are a lot of other situations where inadequate statistics can cause problems, but that's a pretty easy-to-understand one that we do not always get right. We know that an equality search looking for some_column = 'some_constant', where some_constant is not an MCV, must be more selective than a search for the least-frequent MCV. If you store more and more MCVs for a table, eventually you'll have enough that the least-frequent one is pretty infrequent, and then things work a lot better. My reading of the code is that if it is not in the MCV list, then it is assumed to have the average selectivity (about 1/n_distinct, but deflating top and bottom for the MCV list). There is also a check that it is less than the least common of the MCVs, but I don't know why that situation would ever prevail--that should always be higher or equal to the average selectivity. This is more of a problem for big tables than for small tables. MCV #100 can't have a frequency greater than 1/100 = 0.01, but that's a lot more rows on a big table than on a small one. On a table with 10 million rows we might estimate something close to 100,000 rows when the real number is just a handful; when the table has only 10,000 rows, we just can't be off by as many orders of magnitude. Things don't always work out that badly, but in the worst case they do. Maybe there's some highly-principled statistical approach which could be taken here, and if so that's fine, but I suspect not. So what I think we should do is auto-tune the statistics target based on the table size. If, say, we think that the generally useful range for the statistics target is something like 10 to 400, then let's come up with a formula based on table size that outputs 10 for small tables, 400 for really big tables, and intermediate values for tables in the middle.
I think that parts of the planner are O(N^2) in the size of the histogram (or was that the size of the MCV list?). So we would probably need a way to use a larger sample size to get more accurate n_distinct and MCV frequencies, but not save the entire histogram that goes with that sample size. Cheers, Jeff
Re: [HACKERS] ANALYZE sampling is too good
On 12/6/13 3:21 AM, Andres Freund wrote: On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote: Has anyone ever thought about opportunistic ANALYZE piggy-backing on other full-table scans? That doesn't really help Greg, because his complaint is mostly that a fresh ANALYZE is too expensive, but it could be an interesting, albeit risky approach. What I've been thinking of is a) making it piggy back on scans vacuum is doing instead of doing separate ones all the time (if possible, analyze needs to be more frequent). Currently with quite some likelihood the cache will be gone again when revisiting. FWIW, if synchronize_seqscans is on I'd think it'd be pretty easy to fire up a 2nd backend to do the ANALYZE portion (or perhaps use Robert's fancy new shared memory stuff). -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] ANALYZE sampling is too good
On 12/8/13 1:49 PM, Heikki Linnakangas wrote: On 12/08/2013 08:14 PM, Greg Stark wrote: The whole accounts table is 1.2GB and contains 10 million rows. As expected with rows_per_block set to 1 it reads 240MB of that containing nearly 2 million rows (and takes nearly 20s -- doing a full table scan for select count(*) only takes about 5s): One simple thing we could do, without or in addition to changing the algorithm, is to issue posix_fadvise() calls for the blocks we're going to read. It should at least be possible to match the speed of a plain sequential scan that way. Hrm... maybe it wouldn't be very hard to use async IO here either? I'm thinking it wouldn't be very hard to do the stage 2 work in the callback routine... -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 1:18 PM, Jeff Janes jeff.ja...@gmail.com wrote: I don't recall ever tracing a bad plan down to a bad n_distinct. It does happen. I've seen it several times. -- Peter Geoghegan
Re: [HACKERS] ANALYZE sampling is too good
On 12/09/2013 11:35 PM, Jim Nasby wrote: On 12/8/13 1:49 PM, Heikki Linnakangas wrote: On 12/08/2013 08:14 PM, Greg Stark wrote: The whole accounts table is 1.2GB and contains 10 million rows. As expected with rows_per_block set to 1 it reads 240MB of that containing nearly 2 million rows (and takes nearly 20s -- doing a full table scan for select count(*) only takes about 5s): One simple thing we could do, without or in addition to changing the algorithm, is to issue posix_fadvise() calls for the blocks we're going to read. It should at least be possible to match the speed of a plain sequential scan that way. Hrm... maybe it wouldn't be very hard to use async IO here either? I'm thinking it wouldn't be very hard to do the stage 2 work in the callback routine... Yeah, other than the fact we have no infrastructure to do asynchronous I/O anywhere in the backend. If we had that, then we could easily use it here. I doubt it would be much better than posix_fadvising the blocks, though. - Heikki
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/09/2013 11:35 PM, Jim Nasby wrote: On 12/8/13 1:49 PM, Heikki Linnakangas wrote: On 12/08/2013 08:14 PM, Greg Stark wrote: The whole accounts table is 1.2GB and contains 10 million rows. As expected with rows_per_block set to 1 it reads 240MB of that containing nearly 2 million rows (and takes nearly 20s -- doing a full table scan for select count(*) only takes about 5s): One simple thing we could do, without or in addition to changing the algorithm, is to issue posix_fadvise() calls for the blocks we're going to read. It should at least be possible to match the speed of a plain sequential scan that way. Hrm... maybe it wouldn't be very hard to use async IO here either? I'm thinking it wouldn't be very hard to do the stage 2 work in the callback routine... Yeah, other than the fact we have no infrastructure to do asynchronous I/O anywhere in the backend. If we had that, then we could easily use it here. I doubt it would be much better than posix_fadvising the blocks, though. Without patches to the kernel, it is much better. posix_fadvise interferes with read-ahead, so posix_fadvise on, say, bitmap heap scans (or similarly sorted analyze block samples) run at 1 IO / block, ie horrible, whereas aio can do read coalescence and read-ahead when the kernel thinks it'll be profitable, significantly increasing IOPS. I've seen everything from a 2x to 10x difference.
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes jeff.ja...@gmail.com wrote: My reading of the code is that if it is not in the MCV, then it is assumed to have the average selectivity (about 1/n_distinct, but deflating top and bottom for the MCV list). There is also a check that it is less than the least common of the MCV, but I don't know why that situation would ever prevail--that should always be higher or equal to the average selectivity. I've never seen an n_distinct value of more than 5 digits, regardless of reality. Typically I've seen 20-50k, even if the real number is much higher. But the n_distinct value is only for non-MCVs, so if we estimate the selectivity of column = 'rarevalue' to be (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces the estimate, and making the MCV list longer naturally makes mcvfrac bigger. I'm not sure how important the less-frequent-than-the-least-common-MCV part is, but I'm very sure that raising the statistics target helps to solve the problem of overestimating the prevalence of uncommon values in a very big table. I think that parts of the planner are N^2 in the size of histogram (or was that the size of the MCV list?). So we would probably need a way to use a larger sample size to get more accurate n_distinct and MCV frequencies, but not save the entire histogram that goes with that sample size. I think the saving the histogram part is important. As you say, the MCVs are important for a variety of planning purposes, such as hash joins. More than that, in my experience, people with large tables are typically very willing to spend more planning time to get a better plan, because mistakes are expensive and the queries are likely to run for a while anyway. 
People with small tables care about planning time, because it makes no sense to spend an extra 1ms planning a query unless you improve the plan by enough to save at least 1ms when executing it, and when the tables are small and access is expected to be fast anyway that's often not the case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ANALYZE sampling is too good
On 12/09/2013 11:56 PM, Claudio Freire wrote: On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/09/2013 11:35 PM, Jim Nasby wrote: On 12/8/13 1:49 PM, Heikki Linnakangas wrote: On 12/08/2013 08:14 PM, Greg Stark wrote: The whole accounts table is 1.2GB and contains 10 million rows. As expected with rows_per_block set to 1 it reads 240MB of that containing nearly 2 million rows (and takes nearly 20s -- doing a full table scan for select count(*) only takes about 5s): One simple thing we could do, without or in addition to changing the algorithm, is to issue posix_fadvise() calls for the blocks we're going to read. It should at least be possible to match the speed of a plain sequential scan that way. Hrm... maybe it wouldn't be very hard to use async IO here either? I'm thinking it wouldn't be very hard to do the stage 2 work in the callback routine... Yeah, other than the fact we have no infrastructure to do asynchronous I/O anywhere in the backend. If we had that, then we could easily use it here. I doubt it would be much better than posix_fadvising the blocks, though. Without patches to the kernel, it is much better. posix_fadvise interferes with read-ahead, so posix_fadvise on, say, bitmap heap scans (or similarly sorted analyze block samples) run at 1 IO / block, ie horrible, whereas aio can do read coalescence and read-ahead when the kernel thinks it'll be profitable, significantly increasing IOPS. I've seen everything from a 2x to 10x difference. How did you test that, given that we don't actually have an asynchronous I/O implementation? I don't recall any recent patches floating around either to do that. When Greg Stark investigated this back in 2007-2008 and implemented posix_fadvise() for bitmap heap scans, posix_fadvise certainly gave a significant speedup on the test data he used. What kind of a data distribution gives a slowdown like that? I took a stab at using posix_fadvise() in ANALYZE. 
It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. - Heikki

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9845b0b..410d487 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -58,10 +58,18 @@
 /* Data structure for Algorithm S from Knuth 3.4.2 */
 typedef struct
 {
+	Relation	rel;
+	ForkNumber	forknum;
+
 	BlockNumber N;				/* number of blocks, known in advance */
 	int			n;				/* desired sample size */
 	BlockNumber t;				/* current block number */
 	int			m;				/* blocks selected so far */
+
+	int			prefetch_target;
+	int			start;
+	int			end;
+	BlockNumber *prefetched;
 } BlockSamplerData;
 
 typedef BlockSamplerData *BlockSampler;
@@ -87,10 +95,11 @@ static BufferAccessStrategy vac_strategy;
 static void do_analyze_rel(Relation onerel, VacuumStmt *vacstmt,
 			   AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
 			   bool inh, int elevel);
-static void BlockSampler_Init(BlockSampler bs, BlockNumber nblocks,
+static void BlockSampler_Init(BlockSampler bs, Relation rel, ForkNumber forknum, BlockNumber nblocks,
 				  int samplesize);
 static bool BlockSampler_HasMore(BlockSampler bs);
 static BlockNumber BlockSampler_Next(BlockSampler bs);
+static BlockNumber BlockSampler_Next_internal(BlockSampler bs);
 static void compute_index_stats(Relation onerel, double totalrows,
 					AnlIndexData *indexdata, int nindexes,
 					HeapTuple *rows, int numrows,
@@ -954,8 +963,12 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
  * algorithm.
  */
 static void
-BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize)
+BlockSampler_Init(BlockSampler bs, Relation rel, ForkNumber forknum,
+				  BlockNumber nblocks, int samplesize)
 {
+	bs->rel = rel;
+	bs->forknum = forknum;
+
 	bs->N = nblocks;			/* measured table size */
 
 	/*
@@ -965,17 +978,61 @@ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize)
 	bs->n = samplesize;
 	bs->t = 0;					/* blocks scanned so far */
 	bs->m = 0;					/* blocks selected so far */
+
+	bs->prefetch_target = target_prefetch_pages;
+	if (target_prefetch_pages > 0)
+		bs->prefetched = palloc((bs->prefetch_target + 1) * sizeof(BlockNumber));
+	else
+		bs->prefetched = NULL;
+	bs->start = bs->end = 0;
 }
 
 static bool
 BlockSampler_HasMore(BlockSampler bs)
 {
+	if (bs->end != bs->start)
+		return true;
 	return (bs->t < bs->N) && (bs->m < bs->n);
 }
 
+static void
+BlockSampler_Prefetch(BlockSampler bs)
+{
+	int			next;
+
+	next = (bs->end + 1) % (bs->prefetch_target + 1);
+	while (next != bs->start && (bs->t < bs->N) && (bs->m < bs->n))
+	{
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/09/2013 11:56 PM, Claudio Freire wrote: Without patches to the kernel, it is much better. posix_fadvise interferes with read-ahead, so posix_fadvise on, say, bitmap heap scans (or similarly sorted analyze block samples) run at 1 IO / block, ie horrible, whereas aio can do read coalescence and read-ahead when the kernel thinks it'll be profitable, significantly increasing IOPS. I've seen everything from a 2x to 10x difference. How did you test that, given that we don't actually have an asynchronous I/O implementation? I don't recall any recent patches floating around either to do that. When Greg Stark investigated this back in 2007-2008 and implemented posix_fadvise() for bitmap heap scans, posix_fadvise certainly gave a significant speedup on the test data he used. What kind of a data distribution gives a slowdown like that? That's basically my summarized experience from working on this[0] patch, and the feedback given there about competing AIO work. Sequential I/O was the biggest issue. I had to actively avoid fadvising on sequential I/O, or sequential-ish I/O, which was a huge burden on fadvise logic. I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. Kernel version? I raised this issue on LKML, and, while I got no news on this front, they might have silently fixed it. I'd have to check the sources again. 
[0] http://www.postgresql.org/message-id/col116-w162aeaa64173e77d4597eea3...@phx.gbl
Re: [HACKERS] ANALYZE sampling is too good
Claudio Freire klaussfre...@gmail.com wrote: On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. Kernel version? 3.12, from Debian experimental. With an ssd drive and btrfs filesystem. Admittedly not your average database server setup, so it would be nice to get more reports from others. I raised this issue on LKML, and, while I got no news on this front, they might have silently fixed it. I'd have to check the sources again. Maybe. Or maybe the heuristic read-ahead isn't significant/helpful when you're prefetching with posix_fadvise anyway. - Heikki
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 8:45 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Claudio Freire klaussfre...@gmail.com wrote: On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. Kernel version? 3.12, from Debian experimental. With an ssd drive and btrfs filesystem. Admittedly not your average database server setup, so it would be nice to get more reports from others. Yeah, read-ahead isn't relevant for SSD.
Re: [HACKERS] ANALYZE sampling is too good
Heikki Linnakangas hlinnakan...@vmware.com writes: Maybe. Or maybe the heuristic read ahead isn't significant/helpful, when you're prefetching with posix_fadvise anyway. Yeah. If we're not reading consecutive blocks, readahead is unlikely to do anything anyhow. Claudio's comments do suggest that it might be a bad idea to issue a posix_fadvise when the next block to be examined *is* adjacent to the current one, though. regards, tom lane
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. I did a test run: pgbench scale 2000 (pgbench_accounts approx 25GB). postgres 9.4, i7 3.5GHz CPU, 16GB RAM, 500 GB Velociraptor 10K (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 91s So I'm essentially seeing no difference :-( Regards Mark
Re: [HACKERS] ANALYZE sampling is too good
Mark Kirkwood mark.kirkw...@catalyst.net.nz writes: I did a test run: pgbench scale 2000 (pgbench_accounts approx 25GB). postgres 9.4, i7 3.5GHz CPU, 16GB RAM, 500 GB Velociraptor 10K (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 91s So I'm essentially seeing no difference :-( What OS and filesystem? regards, tom lane
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. [...] If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. I did a test run: pgbench scale 2000 (pgbench_accounts approx 25GB). postgres 9.4, i7 3.5GHz CPU, 16GB RAM, 500 GB Velociraptor 10K (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 91s So I'm essentially seeing no difference :-( Arrg - sorry, forgot the important bits: Ubuntu 13.10 (kernel 3.11.0-14), filesystem is ext4
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 13:20, Mark Kirkwood wrote: On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. [...] If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. I did a test run: pgbench scale 2000 (pgbench_accounts approx 25GB). postgres 9.4, i7 3.5GHz CPU, 16GB RAM, 500 GB Velociraptor 10K (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 91s So I'm essentially seeing no difference :-( Arrg - sorry, forgot the important bits: Ubuntu 13.10 (kernel 3.11.0-14), filesystem is ext4 Doing the same test as above, but on an 80GB Intel 520 (ext4 filesystem mounted with discard): (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 5s With patch: ANALYZE pgbench_accounts 5s
Re: [HACKERS] ANALYZE sampling is too good
On 12/06/2013 09:52 AM, Peter Geoghegan wrote: Has anyone ever thought about opportunistic ANALYZE piggy-backing on other full-table scans? That doesn't really help Greg, because his complaint is mostly that a fresh ANALYZE is too expensive, but it could be an interesting, albeit risky approach. It'd be particularly interesting, IMO, if autovacuum was able to do it using the synchronized scans machinery, so the analyze still ran in a separate proc and didn't impact the user's query. Have an autovac worker or two waiting for new scans to start on tables they need to analyze, and if one doesn't start within a reasonable time frame they start their own scan. We've seen enough issues with hint-bit setting causing unpredictable query times for user queries that I wouldn't want to add another "might write during a read-only query" behaviour. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] ANALYZE sampling is too good
On 12/10/2013 05:20 AM, Jim Nasby wrote: FWIW, if synchronize_seqscans is on I'd think it'd be pretty easy to fire up a 2nd backend to do the ANALYZE portion (or perhaps use Robert's fancy new shared memory stuff). Apologies for posting the same as a new idea before I saw your post. I need to finish the thread before posting. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 13:53, Mark Kirkwood wrote:
> Doing the same test as above, but on an 80GB Intel 520 (ext4 filesystem mounted with discard): (cold OS and PG cache for both runs)
>
> Without patch: ANALYZE pgbench_accounts 5s
> With patch: ANALYZE pgbench_accounts 5s

Redoing the filesystem on the 520 as btrfs didn't seem to make any difference either: (cold OS and PG cache for both runs)

Without patch: ANALYZE pgbench_accounts 6.4s
With patch: ANALYZE pgbench_accounts 6.4s
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 15:04, Mark Kirkwood wrote:
> Redoing the filesystem on the 520 as btrfs didn't seem to make any difference either: (cold OS and PG cache for both runs)
>
> Without patch: ANALYZE pgbench_accounts 6.4s
> With patch: ANALYZE pgbench_accounts 6.4s

Ah - I have just realized I was not setting effective_io_concurrency, so I'll redo the test. Apologies.

Regards, Mark
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 15:11, Mark Kirkwood wrote:
> Ah - I have just realized I was not setting effective_io_concurrency, so I'll redo the test. Apologies.

Redoing the test on the VelociRaptor gives me exactly the same numbers as before (effective_io_concurrency = 10 instead of 1).

Cheers, Mark
Re: [HACKERS] ANALYZE sampling is too good
On 12/09/2013 02:37 PM, Robert Haas wrote:
> I've never seen an n_distinct value of more than 5 digits, regardless of reality. Typically I've seen 20-50k, even if the real number is much higher. But the n_distinct value is only for non-MCVs, so if we estimate the selectivity of column = 'rarevalue' to be (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces the estimate, and making the MCV list longer naturally makes mcvfrac bigger. I'm not sure how important the less-frequent-than-the-least-common-MCV part is, but I'm very sure that raising the statistics target helps to solve the problem of overestimating the prevalence of uncommon values in a very big table.

I did an analysis of our ndistinct algorithm several years ago (circa 8.1), and to sum up:

1. We take far too small a sample to estimate ndistinct well for tables larger than 100,000 rows.
2. The estimation algorithm we have chosen tends to be wrong in the downwards direction, rather strongly so. That is, if we could potentially have an ndistinct of 1,000 to 100,000 based on the sample, our algorithm estimates 1,500 to 3,000.
3. Other algorithms exist. They tend to be wrong in other directions.
4. Nobody has done an analysis of whether it's worse, on average, to estimate low vs. high for ndistinct.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
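[Editor's note: for reference, the estimator analyze.c uses is (to my understanding) the Haas–Stokes "Duj1" estimator, D̂ = n·d / (n − f1 + f1·n/N), where n is the sample size, N the table size, d the distinct values seen, and f1 the values seen exactly once. A hedged sketch, omitting the guards and clamping the real code applies:]

```python
from collections import Counter

def estimate_ndistinct(sample, total_rows):
    """Haas-Stokes Duj1 sketch: n*d / (n - f1 + f1*n/N).
    f1 (values seen exactly once) drives the extrapolation: many
    singletons in the sample imply many unseen values in the table."""
    n = len(sample)
    counts = Counter(sample)
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    if f1 == n:   # every sampled value unique: treat column as unique
        return total_rows
    if f1 == 0:   # every value seen at least twice: d is the estimate
        return d
    return n * d / (n - f1 + f1 * n / total_rows)
```

The downward bias Josh describes follows from the shape of the formula: unless f1 is close to n, the denominator stays close to n, so the estimate stays close to d, the count actually observed in the (too small) sample.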
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 15:17, Mark Kirkwood wrote:
> Redoing the test on the velociraptor gives me exactly the same numbers as before (effective_io_concurrency = 10 instead of 1).

Good grief - repeating the test gave:

Without patch: ANALYZE pgbench_accounts 90s
With patch: ANALYZE pgbench_accounts 42s

pretty consistent *now*. No idea what was going on in the 1st run (maybe I happened to have it running at the same time as a checkpoint)? Anyway, will stop now before creating more confusion.

Regards, Mark
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 15:32, Mark Kirkwood wrote:
> Good grief - repeating the test gave:
>
> Without patch: ANALYZE pgbench_accounts 90s
> With patch: ANALYZE pgbench_accounts 42s
>
> pretty consistent *now*.

Just one more... The Intel 520 with ext4:

Without patch: ANALYZE pgbench_accounts 5s
With patch: ANALYZE pgbench_accounts 1s

And double checking - with patch, but effective_io_concurrency = 1:

ANALYZE pgbench_accounts 5s

These results look more like Heikki's. Which suggests more benefit on SSD than spinning disks. Some more data points (apart from mine) would be good to see tho.

Cheers, Mark
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 12:13 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:
> These results look more like Heikki's. Which suggests more benefit on SSD than spinning disks. Some more data points (apart from mine) would be good to see tho.

Assuming ANALYZE is sampling less than 5% of the table, I'd say fadvising will always be a win. I'd also suggest a higher e_i_c for rotating media. Rotating media has longer latencies, and e_i_c has to be computed in terms of latency, rather than spindles, when doing prefetch. For backward index scans, I found the optimum number for a single spindle to be about 20.
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 17:18, Claudio Freire wrote:
> I'd also suggest a higher e_i_c for rotating media. Rotating media has longer latencies, and e_i_c has to be computed in terms of latency, rather than spindles, when doing prefetch. For backward index scans, I found the optimum number for a single spindle to be about 20.

Yeah - was just fooling about with this on the VelociRaptor; looks like somewhere in the 20s works well.

ANALYZE pgbench_accounts:

 eff_io_concurrency | analyze time (s)
--------------------+------------------
                  8 | 43
                 16 | 40
                 24 | 36
                 32 | 35
                 64 | 35

Cheers, Mark
Re: [HACKERS] ANALYZE sampling is too good
On Sun, Dec 8, 2013 at 12:06 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:
> bench=# ANALYZE pgbench_accounts;
> NOTICE: acquire sample will need 30000 blocks
> NOTICE: sampled 30000 blocks
> ANALYZE
> Time: 10059.446 ms
> bench=# \q

I did some experimenting here as well. I hacked up a version of analyze.c that has a GUC for rows_per_block to sample. Setting it to 1 doesn't modify the behaviour at all, whereas setting it to 4 divides the number of blocks to sample by 4, which causes it to do less I/O and use more rows from each block. I then initialized pgbench with scale factor 100 but modified the code to run the actual pgbench with scale factor 1. In other words, I ran a lot of updates on 1% of the database but left the other 99% untouched from the initial load. Then I ran ANALYZE VERBOSE on accounts with rows_per_block set to 1, 4, 16, and 64. The latter is slightly larger than the average number of tuples per block, so the resulting sample is actually slightly short. The whole accounts table is 1.2GB and contains 10 million rows.
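[Editor's note: the rows_per_block knob trades I/O for sample clustering: the same 30,000-row target is met from proportionally fewer blocks. A sketch of the block-count arithmetic; statistics_rows_per_block is Greg's experimental GUC, not a shipped one:]

```python
import random

SAMPLE_ROWS = 30000  # 300 * default_statistics_target (100)

def plan_sample(table_blocks, rows_per_block):
    """Pick the blocks a blockier sample would read: SAMPLE_ROWS /
    rows_per_block distinct blocks instead of one block per sample row."""
    nblocks = max(1, SAMPLE_ROWS // rows_per_block)
    nblocks = min(nblocks, table_blocks)
    return sorted(random.sample(range(table_blocks), nblocks))
```

For the 158,756-page accounts table this asks for 30000, 7500, 1875, and 468 blocks at rows_per_block = 1, 4, 16, and 64, which matches the "scanned N of 158756 pages" counts reported below to within a block.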
As expected, with rows_per_block set to 1 it reads 240MB of that, containing nearly 2 million rows (and takes nearly 20s -- doing a full table scan for select count(*) only takes about 5s):

stark=# analyze verbose pgbench_accounts;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 30000 of 158756 pages, containing 1889701 live rows and 0 dead rows; 30000 rows in sample, 1036 estimated total rows
ANALYZE
Time: 19468.987 ms

With rows_per_block=4 it reads only a quarter as many blocks, but it's not much faster:

stark=# analyze verbose pgbench_accounts;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 7501 of 158756 pages, containing 472489 live rows and 0 dead rows; 30000 rows in sample, 1037 estimated total rows
ANALYZE
Time: 17062.331 ms

But with rows_per_block=16 it's much faster, 6.7s:

stark=# set statistics_rows_per_block = 16;
SET
Time: 1.583 ms
stark=# analyze verbose pgbench_accounts;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 1876 of 158756 pages, containing 118163 live rows and 0 dead rows; 30000 rows in sample, 1031 estimated total rows
ANALYZE
Time: 6694.331 ms

And with rows_per_block=64 it's under 2s:

stark=# set statistics_rows_per_block = 64;
SET
Time: 0.693 ms
stark=# analyze verbose pgbench_accounts;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 469 of 158756 pages, containing 29544 live rows and 0 dead rows; 29544 rows in sample, 1033 estimated total rows
ANALYZE
Time: 1937.055 ms

The estimates for the total rows are just as accurate in every case. (They seem to be consistently slightly high, though, which is a bit disconcerting.) However, looking at the actual pg_stats entries, the stats are noticeably less accurate for the blockier samples. The bid column actually has 100 distinct values, and so with a statistics_target of 100 each value should appear in the MCV list with a frequency of about .01.
With rows_per_block=1 the MCV frequency list ranges from .0082 to .0123
With rows_per_block=4 the MCV frequency list ranges from .0063 to .0125
With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164
With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213

I'm not really sure if this is due to the blocky sample combined with the skewed pgbench run or not. It doesn't seem to be consistently biasing towards or against bid 1, which I believe covers the only rows that would have been touched by pgbench. Still, it's suspicious that they seem to be consistently getting less accurate as the blockiness increases. I've attached the results of pg_stats following the analyze with the various levels of blockiness.

-- greg
Re: [HACKERS] ANALYZE sampling is too good
On 12/08/2013 10:14 AM, Greg Stark wrote:
> With rows_per_block=1 the MCV frequency list ranges from .0082 to .0123
> With rows_per_block=4 the MCV frequency list ranges from .0063 to .0125
> With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164
> With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213
>
> I'm not really sure if this is due to the blocky sample combined with the skewed pgbench run or not. ... Still, it's suspicious that they seem to be consistently getting less accurate as the blockiness increases.

They will certainly do so if you don't apply any statistical adjustments for selecting more rows from the same pages. There's a body of math designed to correct for the skew introduced by reading *all* of the rows in each block. That's what I meant by block-based sampling: you read, say, 400 pages, you compile statistics on *all* of the rows on those pages, and you apply some algorithms to adjust for groupings of rows based on how grouped they are. And you have a pretty good estimate of how grouped they are, because you just looked at complete sets of rows on a bunch of nonadjacent pages. Obviously, you need to look at more rows than you would with a pure-random sample. Like I said, the 80%+ accuracy point in the papers seemed to be at a 5% sample. However, since those rows come from the same pages, the cost of looking at more rows is quite small compared to the cost of looking at 64 times as many disk pages. My ACM subscription has lapsed, though; someone with a current ACM subscription could search for this; there are several published papers, with math and pseudocode.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
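[Editor's note: the widening MCV ranges Greg observed can be reproduced in a toy simulation. When the table is clustered on the sampled column, taking whole blocks means each value's frequency rests on far fewer independent draws, so the spread of the MCV frequencies grows even though their mean stays near the true 0.01. A sketch with made-up parameters, not pgbench's data:]

```python
import random
from collections import Counter

ROWS_PER_BLOCK = 64
NVALUES, ROWS_PER_VALUE = 100, 1000   # true frequency of every value: 0.01

def mcv_spread(sample, k=10):
    """max - min of the top-k value frequencies in a sample."""
    n = len(sample)
    freqs = [c / n for _, c in Counter(sample).most_common(k)]
    return max(freqs) - min(freqs)

# Worst case for block sampling: table perfectly clustered on the column,
# so each 64-row block holds (almost) a single value.
table = [v for v in range(NVALUES) for _ in range(ROWS_PER_VALUE)]
blocks = [table[i:i + ROWS_PER_BLOCK]
          for i in range(0, len(table), ROWS_PER_BLOCK)]

random.seed(42)
row_sample = random.sample(table, 3000)                # one row at a time
block_sample = [r for blk in random.sample(blocks, 3000 // ROWS_PER_BLOCK)
                for r in blk]                          # whole blocks
```

With these (assumed) parameters the block sample's MCV spread comes out several times wider than the row sample's, mirroring the .0082-.0123 vs .0021-.0213 progression in the thread; the corrections Josh mentions exist precisely to undo this clustering effect.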