Re: [HACKERS] ANALYZE sampling is too good

2014-03-08 Thread Bruce Momjian
I assume we never came up with a TODO from this thread: --- On Tue, Dec 3, 2013 at 11:30:44PM +, Greg Stark wrote: > At multiple conferences I've heard about people trying all sorts of > gymnastics to avoid ANALYZE whic

Re: [HACKERS] ANALYZE sampling is too good

2013-12-17 Thread Heikki Linnakangas
On 12/17/2013 12:06 AM, Jeff Janes wrote: On Mon, Dec 9, 2013 at 3:14 PM, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page s

Re: [HACKERS] ANALYZE sampling is too good

2013-12-16 Thread Jeff Janes
On Mon, Dec 9, 2013 at 3:14 PM, Heikki Linnakangas wrote: > I took a stab at using posix_fadvise() in ANALYZE. It turned out to be > very easy, patch attached. Your mileage may vary, but I'm seeing a nice > gain from this on my laptop. Taking a 3 page sample of a table with > 717717 pages (
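
As context for the preview above: Heikki's patch issues posix_fadvise(POSIX_FADV_WILLNEED) for the sampled blocks before reading them. A minimal sketch of that prefetch pattern (a scratch file and invented block numbers, not the actual patch; Unix only):

```python
import os
import tempfile

BLCKSZ = 8192  # PostgreSQL's default block size

# Create a small multi-block scratch file to stand in for a relation.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * BLCKSZ * 16)

# Sorted "sampled" block numbers, as ANALYZE's block sampler produces.
sample = [1, 4, 9, 13]

# Hint the kernel to start fetching all sampled blocks asynchronously...
for blk in sample:
    os.posix_fadvise(fd, blk * BLCKSZ, BLCKSZ, os.POSIX_FADV_WILLNEED)

# ...then read them; on a cold cache the reads overlap the prefetch I/O.
for blk in sample:
    data = os.pread(fd, BLCKSZ, blk * BLCKSZ)
    assert len(data) == BLCKSZ

os.close(fd)
os.unlink(path)
print(f"prefetched and read {len(sample)} blocks")
```

The gain reported in the thread comes from having several reads in flight at once instead of one synchronous random read per sampled block.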

Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Florian Pflug
On Dec12, 2013, at 19:29 , Tom Lane wrote: > However ... where this thread started was not about trying to reduce > the remaining statistical imperfections in our existing sampling method. > It was about whether we could reduce the number of pages read for an > acceptable cost in increased statist

Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Jeff Janes
On Tue, Dec 3, 2013 at 3:30 PM, Greg Stark wrote: > At multiple conferences I've heard about people trying all sorts of > gymnastics to avoid ANALYZE which they expect to take too long and > consume too much I/O. This is especially a big complaint after upgrades > when their new database performs

Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Claudio Freire
On Thu, Dec 12, 2013 at 4:13 PM, Jeff Janes wrote: >> Well, why not take a supersample containing all visible tuples from N >> selected blocks, and do bootstrapping over it, with subsamples of M >> independent rows each? > > > Bootstrapping methods generally do not work well when ties are signific
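
The supersample-plus-bootstrap idea Claudio sketches can be illustrated on a toy clustered table (all sizes and values here are invented; this is a sketch of the proposal, not of anything in analyze.c):

```python
import random

random.seed(0)

# Toy "table": 200 blocks of 50 visible rows each; values are clustered
# by block to mimic physical correlation, the hard case for block sampling.
table = [[blk // 4 for _ in range(50)] for blk in range(200)]

# Supersample: every visible row from N randomly chosen blocks.
N, M = 20, 100
supersample = [row for block in random.sample(table, N) for row in block]

# Bootstrap: repeatedly draw M rows independently from the supersample and
# look at the spread of a statistic (here a naive distinct-value count).
estimates = []
for _ in range(200):
    subsample = [random.choice(supersample) for _ in range(M)]
    estimates.append(len(set(subsample)))

print(min(estimates), max(estimates))
```

The spread of the bootstrap estimates gives a rough error bar; Jeff's objection in the preview is that bootstrapping behaves poorly when ties dominate, as they do here.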

Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Jeff Janes
On Thu, Dec 12, 2013 at 10:33 AM, Claudio Freire wrote: > On Thu, Dec 12, 2013 at 3:29 PM, Tom Lane wrote: > > Jeff Janes writes: > >> It would be relatively easy to fix this if we trusted the number of > visible > >> rows in each block to be fairly constant. But without that assumption, > I >

Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Claudio Freire
On Thu, Dec 12, 2013 at 3:56 PM, Josh Berkus wrote: > > Estimated grouping should, however, affect MCVs. In cases where we > estimate that grouping levels are high, the expected % of observed > values should be "discounted" somehow. That is, with total random > distribution you have a 1:1 ratio

Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Josh Berkus
On 12/12/2013 10:33 AM, Claudio Freire wrote: > Well, why not take a supersample containing all visible tuples from N > selected blocks, and do bootstrapping over it, with subsamples of M > independent rows each? Well, we still need to look at each individual block to determine grouping correlatio

Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Claudio Freire
On Thu, Dec 12, 2013 at 3:29 PM, Tom Lane wrote: > Jeff Janes writes: >> It would be relatively easy to fix this if we trusted the number of visible >> rows in each block to be fairly constant. But without that assumption, I >> don't see a way to fix the sample selection process without reading

Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Tom Lane
Jeff Janes writes: > It would be relatively easy to fix this if we trusted the number of visible > rows in each block to be fairly constant. But without that assumption, I > don't see a way to fix the sample selection process without reading the > entire table. Yeah, varying tuple density is the

Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Jeff Janes
On Thu, Dec 12, 2013 at 6:39 AM, Florian Pflug wrote: > Here's an analysis of Jeff Janes' simple example of a table where our > n_distinct estimate is way off. > > On Dec11, 2013, at 00:03 , Jeff Janes wrote: > > create table baz as select floor(random()*1000), md5(random()::text) > from gen

Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Jeff Janes
On Wed, Dec 11, 2013 at 2:33 PM, Greg Stark wrote: > > I think we're all wet here. I don't see any bias towards larger or smaller > rows. Larger tuples will be on a larger number of pages but there will be > fewer of them on any one page. The average effect should be the same. > > Smaller values mi

Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Florian Pflug
Here's an analysis of Jeff Janes' simple example of a table where our n_distinct estimate is way off. On Dec11, 2013, at 00:03 , Jeff Janes wrote: > create table baz as select floor(random()*1000), md5(random()::text) from > generate_series(1,1); > create table baz2 as select * from

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Josh Berkus
On 12/11/2013 02:39 PM, Martijn van Oosterhout wrote: > In this discussion we've mostly used block = 1 postgresql block of 8k. > But when reading from a disk once you've read one block you can > basically read the following ones practically for free. > > So I wonder if you could make your samplin

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Martijn van Oosterhout
On Thu, Dec 12, 2013 at 07:22:59AM +1300, Gavin Flower wrote: > Surely we want to sample a 'constant fraction' (obviously, in > practice you have to sample an integral number of rows in a page!) > of rows per page? The simplest way, as Tom suggests, is to use all > the rows in a page. > > However,

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Greg Stark
I think we're all wet here. I don't see any bias towards larger or smaller rows. Larger tuples will be on a larger number of pages but there will be fewer of them on any one page. The average effect should be the same. Smaller values might have a higher variance with block based sampling than larger

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Gavin Flower
On 12/12/13 09:12, Gavin Flower wrote: On 12/12/13 08:39, Gavin Flower wrote: On 12/12/13 08:31, Kevin Grittner wrote: Gavin Flower wrote: For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, using 400 byte pages. In the pathologically worst case, assuming maximum packing d

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Gavin Flower
On 12/12/13 08:39, Gavin Flower wrote: On 12/12/13 08:31, Kevin Grittner wrote: Gavin Flower wrote: For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, using 400 byte pages. In the pathologically worst case, assuming maximum packing density and no page has both types: the l

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Gavin Flower
On 12/12/13 08:31, Kevin Grittner wrote: Gavin Flower wrote: For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, using 400 byte pages. In the pathologically worst case, assuming maximum packing density and no page has both types: the large rows would occupy 500 pages and th
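
Gavin's worst-case arithmetic can be checked in a few lines, and it also shows why taking every row of each sampled page keeps the row mix balanced (page and row sizes are the ones from his example):

```python
# 400-byte pages, 1000 rows of 200 bytes (2 per page) and 1000 rows of
# 20 bytes (20 per page), with no mixed pages.
large_pages = 1000 // 2     # 2 wide rows fit per 400-byte page -> 500
small_pages = 1000 // 20    # 20 narrow rows fit per page -> 50
p_large_page = large_pages / (large_pages + small_pages)

# A uniformly sampled page is ~10x more likely to hold wide rows, but
# each narrow-row page holds 10x more rows, so the expected number of
# rows of each kind per sampled page comes out equal.
exp_wide_rows = p_large_page * 2
exp_narrow_rows = (1 - p_large_page) * 20
print(large_pages, small_pages, round(p_large_page, 3))
print(round(exp_wide_rows, 3), round(exp_narrow_rows, 3))
```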

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Peter Geoghegan
On Tue, Dec 10, 2013 at 4:48 PM, Peter Geoghegan wrote: > Why would I even mention that to a statistician? We want guidance. But > yes, I bet I could give a statistician an explanation of statistics > target that they'd understand without too much trouble. Actually, I think that if we told a stat

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Gavin Flower
On 12/12/13 08:14, Gavin Flower wrote: On 12/12/13 07:22, Gavin Flower wrote: On 12/12/13 06:22, Tom Lane wrote: I wrote: Hm. You can only take N rows from a block if there actually are at least N rows in the block. So the sampling rule I suppose you are using is "select up to N rows from e

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Kevin Grittner
Gavin Flower wrote: > For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, > using 400 byte pages.  In the pathologically worst case, assuming > maximum packing density and no page has both types: the large rows would > occupy  500 pages and the smaller rows 50 pages. So if one s

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Gavin Flower
On 12/12/13 07:22, Gavin Flower wrote: On 12/12/13 06:22, Tom Lane wrote: I wrote: Hm. You can only take N rows from a block if there actually are at least N rows in the block. So the sampling rule I suppose you are using is "select up to N rows from each sampled block" --- and that is going

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Gavin Flower
On 12/12/13 06:22, Tom Lane wrote: I wrote: Hm. You can only take N rows from a block if there actually are at least N rows in the block. So the sampling rule I suppose you are using is "select up to N rows from each sampled block" --- and that is going to favor the contents of blocks containi

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Tom Lane
I wrote: > Hm. You can only take N rows from a block if there actually are at least > N rows in the block. So the sampling rule I suppose you are using is > "select up to N rows from each sampled block" --- and that is going to > favor the contents of blocks containing narrower-than-average rows.
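
Tom's point about the "up to N rows from each sampled block" rule can be made concrete: each sampled block contributes min(N, rows-in-block) rows, so densely packed (narrow-row) blocks feed more rows into the sample than sparsely packed (wide-row) blocks. A toy calculation with invented numbers:

```python
def rows_contributed(rows_in_block, n_per_block):
    """Rows a sampled block adds under 'select up to N rows per block'."""
    return min(n_per_block, rows_in_block)

# With N = 10, a sampled block packed with 50 narrow rows contributes
# 5x as many rows as a sampled block holding only 2 wide rows, favoring
# the contents of blocks containing narrower-than-average rows.
print(rows_contributed(50, 10), rows_contributed(2, 10))
```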

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Tom Lane
Greg Stark writes: > So I've done some empirical tests for a table generated by: > create table sizeskew as (select i,j,repeat('i',i) from > generate_series(1,1000) as i, generate_series(1,1000) as j); > I find that using the whole block doesn't cause any problem with the > avg_width field for th

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Simon Riggs
On 11 December 2013 12:08, Greg Stark wrote: > So there is something clearly wonky in the histogram stats that's > affected by the distribution of the sample. ...in the case where the avg width changes in a consistent manner across the table. Well spotted. ISTM we can have a specific cross che

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Florian Pflug
On Dec10, 2013, at 15:32 , Claudio Freire wrote: > On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark wrote: >> >> On 10 Dec 2013 08:28, "Albe Laurenz" wrote: >>> >>> >>> Doesn't all that assume a normally distributed random variable? >> >> I don't think so because of the law of large numbers. If

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Heikki Linnakangas
On 12/11/2013 02:08 PM, Greg Stark wrote: On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark wrote: I'm not actually sure there is any systemic bias here. The larger number of rows per block generate less precise results but from my thought experiments they seem to still be accurate? So I've done s

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Greg Stark
On Wed, Dec 11, 2013 at 12:08 PM, Greg Stark wrote: > The only thing I can think > of is maybe the most common elements are being selected preferentially > from the early part of the sample which is removing a substantial part > of the lower end of the range. But even removing 100 from the > begin

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Greg Stark
On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark wrote: > I'm not actually sure there is any systemic bias here. The larger > number of rows per block generate less precise results but from my > thought experiments they seem to still be accurate? So I've done some empirical tests for a table generate

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Greg Stark
On Wed, Dec 11, 2013 at 12:58 AM, Simon Riggs wrote: > Yes, it is not a perfect statistical sample. All sampling is subject > to an error that is data dependent. Well there's random variation due to the limitations of dealing with a sample. And then there's systemic biases due to incorrect algor

Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Mark Kirkwood
On 11/12/13 19:34, Simon Riggs wrote: Realistically, I never heard of an Oracle DBA doing advanced statistical mathematics before setting the sample size on ANALYZE. You use the default and bump it up if the sample is insufficient for the data. I'm not sure that Oracle's stats and optimizer d

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Hannu Krosing
On 12/11/2013 01:44 AM, Greg Stark wrote: > On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs wrote: >> When we select a block we should read all rows on that block, to help >> identify the extent of clustering within the data. > So how do you interpret the results of the sample read that way that > d

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Peter Geoghegan
On Tue, Dec 10, 2013 at 10:34 PM, Simon Riggs wrote: > On 11 December 2013 01:27, Sergey E. Koposov wrote: >> For what it's worth. >> >> I'll quote Chaudhuri et al. first line from the abstract about the block >> sampling. >> "Block-level sampling is far more efficient than true uniform-random >>

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Simon Riggs
On 11 December 2013 01:27, Sergey E. Koposov wrote: > For what it's worth. > > I'll quote Chaudhuri et al. first line from the abstract about the block > sampling. > "Block-level sampling is far more efficient than true uniform-random > sampling over a large database, but prone to significant err

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Jeff Janes
On Tuesday, December 10, 2013, Simon Riggs wrote: > On 11 December 2013 00:28, Greg Stark wrote: >> On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs wrote: >>> Block sampling, with parameter to specify sample size. +1 >> Simon this is very frustrating. Can you define "block sam

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Tom Lane
Peter Geoghegan writes: > Again, it isn't as if the likely efficacy of *some* block sampling > approach is in question. I'm sure analyze.c is currently naive about > many things. It's not *that* naive; this is already about a third-generation algorithm. The last major revision (commit 9d6570b8a4)

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Sergey E. Koposov
For what it's worth. I'll quote Chaudhuri et al. first line from the abstract about the block sampling. "Block-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics." And after brie

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Simon Riggs
On 11 December 2013 00:44, Greg Stark wrote: > On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs wrote: >> When we select a block we should read all rows on that block, to help >> identify the extent of clustering within the data. > > So how do you interpret the results of the sample read that way th

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Peter Geoghegan
On Tue, Dec 10, 2013 at 4:14 PM, Simon Riggs wrote: > err, so what does stats target mean exactly in statistical theory? Why would I even mention that to a statistician? We want guidance. But yes, I bet I could give a statistician an explanation of statistics target that they'd understand without

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Greg Stark
On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs wrote: > When we select a block we should read all rows on that block, to help > identify the extent of clustering within the data. So how do you interpret the results of the sample read that way that doesn't introduce bias? -- greg -- Sent via p

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Simon Riggs
On 11 December 2013 00:28, Greg Stark wrote: > On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs wrote: >> Block sampling, with parameter to specify sample size. +1 > Simon this is very frustrating. Can you define "block sampling"? Blocks selected using Vitter's algorithm, using a parameteri
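
For reference, Vitter's reservoir sampling (Algorithm R, the family of algorithm Simon refers to and which ANALYZE's selection is based on) fits in a few lines. This is a generic sketch, not the analyze.c implementation:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Vitter's Algorithm R: draw a uniform random sample of k items
    from a stream of unknown length in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(i + 1)  # keep item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1000), 5))
```

Applied to block numbers, it picks blocks without knowing the relation's length up front; the debate in the thread is about what to do with the rows inside each picked block.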

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Greg Stark
On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs wrote: > Block sampling, with parameter to specify sample size. +1 Simon this is very frustrating. Can you define "block sampling"? -- greg

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Simon Riggs
On 10 December 2013 23:43, Peter Geoghegan wrote: > On Tue, Dec 10, 2013 at 3:26 PM, Jim Nasby wrote: >>> I agree that looking for information on block level sampling >>> specifically, and its impact on estimation quality is likely to not >>> turn up very much, and whatever it does turn up will h

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Peter Geoghegan
On Tue, Dec 10, 2013 at 3:26 PM, Jim Nasby wrote: >> I agree that looking for information on block level sampling >> specifically, and its impact on estimation quality is likely to not >> turn up very much, and whatever it does turn up will have patent >> issues. > > > We have an entire analytics

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Jim Nasby
On 12/10/13 2:17 PM, Peter Geoghegan wrote: On Tue, Dec 10, 2013 at 11:59 AM, Greg Stark wrote: But I don't really think this is the right way to go about this. Research papers are going to turn up pretty specialized solutions that are probably patented. We don't even have the basic understandi

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Jeff Janes
On Mon, Dec 9, 2013 at 2:37 PM, Robert Haas wrote: > On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes wrote: > > My reading of the code is that if it is not in the MCV, then it is > assumed > > to have the average selectivity (about 1/n_distinct, but deflating top > and > > bottom for the MCV list).
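
The estimate Jeff describes can be sketched numerically: a value absent from the MCV list is assumed to have roughly the average selectivity of the non-MCV values. The frequencies and n_distinct below are invented, and this is a simplification of what selfuncs.c actually does:

```python
# Hypothetical statistics for one column.
mcv_freqs = [0.20, 0.10, 0.05]   # frequencies of the MCV entries
n_distinct = 1000                # estimated distinct values overall

# Mass left over for the non-MCV values, spread evenly across the
# remaining distinct values -> the assumed per-value selectivity.
rest_freq = 1.0 - sum(mcv_freqs)
avg_selectivity = rest_freq / (n_distinct - len(mcv_freqs))
print(round(avg_selectivity, 6))
```

This is why an n_distinct estimate that is far too low inflates the selectivity assumed for every non-MCV value, which is the failure mode the thread keeps returning to.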

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Greg Stark
On Tue, Dec 10, 2013 at 7:49 PM, Peter Geoghegan wrote: >> Back in 2005/6, I advocated a block sampling method, as described by >> Chaudhuri et al (ref?) > > I don't think that anyone believes that not doing block sampling is > tenable, fwiw. Clearly some type of block sampling would be preferable >

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Josh Berkus
On 12/10/2013 01:33 PM, Mark Kirkwood wrote: > Yeah - and we seem to be back to Josh's point about needing 'some math' > to cope with the rows within a block not being a purely random selection. Well, sometimes they are effectively random. But sometimes they are not. The Chaudhuri et al paper had

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Mark Kirkwood
On 11/12/13 09:19, Heikki Linnakangas wrote: On 12/10/2013 10:00 PM, Simon Riggs wrote: On 10 December 2013 19:54, Josh Berkus wrote: On 12/10/2013 11:49 AM, Peter Geoghegan wrote: On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs wrote: I don't think that anyone believes that not doing block s

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Heikki Linnakangas
On 12/10/2013 10:00 PM, Simon Riggs wrote: On 10 December 2013 19:54, Josh Berkus wrote: On 12/10/2013 11:49 AM, Peter Geoghegan wrote: On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs wrote: I don't think that anyone believes that not doing block sampling is tenable, fwiw. Clearly some type of

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Peter Geoghegan
On Tue, Dec 10, 2013 at 11:59 AM, Greg Stark wrote: > But I don't really think this is the right way to go about this. > Research papers are going to turn up pretty specialized solutions that > are probably patented. We don't even have the basic understanding we > need. I suspect a basic textbook

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Simon Riggs
On 10 December 2013 19:54, Josh Berkus wrote: > On 12/10/2013 11:49 AM, Peter Geoghegan wrote: >> On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs wrote: >> I don't think that anyone believes that not doing block sampling is >> tenable, fwiw. Clearly some type of block sampling would be preferable >

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Greg Stark
On Tue, Dec 10, 2013 at 7:54 PM, Josh Berkus wrote: > As discussed, we need math though. Does anyone have an ACM subscription > and time to do a search? Someone must. We can buy one with community > funds, but no reason to do so if we don't have to. Anyone in a university likely has access thr

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Josh Berkus
On 12/10/2013 11:49 AM, Peter Geoghegan wrote: > On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs wrote: > I don't think that anyone believes that not doing block sampling is > tenable, fwiw. Clearly some type of block sampling would be preferable > for most or all purposes. As discussed, we need m

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Simon Riggs
On 10 December 2013 19:49, Peter Geoghegan wrote: > On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs wrote: >> However, these things presume that we need to continue scanning most >> of the blocks of the table, which I don't think needs to be the case. >> There is a better way. > > Do they? I think

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Peter Geoghegan
On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs wrote: > However, these things presume that we need to continue scanning most > of the blocks of the table, which I don't think needs to be the case. > There is a better way. Do they? I think it's one opportunistic way of ameliorating the cost. > Bac

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Andres Freund
On 2013-12-10 19:23:37 +, Simon Riggs wrote: > On 6 December 2013 09:21, Andres Freund wrote: > > On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote: > >> Has anyone ever thought about opportunistic ANALYZE piggy-backing on > >> other full-table scans? That doesn't really help Greg, because h

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Simon Riggs
On 6 December 2013 09:21, Andres Freund wrote: > On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote: >> Has anyone ever thought about opportunistic ANALYZE piggy-backing on >> other full-table scans? That doesn't really help Greg, because his >> complaint is mostly that a fresh ANALYZE is too exp

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Claudio Freire
On Tue, Dec 10, 2013 at 11:32 AM, Claudio Freire wrote: > On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark wrote: >> >> On 10 Dec 2013 08:28, "Albe Laurenz" wrote: >>> >>> >>> Doesn't all that assume a normally distributed random variable? >> >> I don't think so because of the law of large numbers.

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Claudio Freire
On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark wrote: > > On 10 Dec 2013 08:28, "Albe Laurenz" wrote: >> >> >> Doesn't all that assume a normally distributed random variable? > > I don't think so because of the law of large numbers. If you have a large > population and sample it the sample behaves

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Albe Laurenz
Greg Stark wrote: >> Doesn't all that assume a normally distributed random variable? > I don't think so because of the law of large numbers. If you have a large > population and sample it the > sample behaves like a normal distribution when if the distribution of the > population isn't. Statist

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Greg Stark
On 10 Dec 2013 08:28, "Albe Laurenz" wrote: > > > Doesn't all that assume a normally distributed random variable? I don't think so because of the law of large numbers. If you have a large population and sample it the sample behaves like a normal distribution even if the distribution of the popula
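
What Greg is appealing to here is the central limit theorem: statistics of repeated samples concentrate around the population value even when the population itself is far from normal. A quick illustration on invented, deliberately skewed data:

```python
import random
import statistics

rng = random.Random(0)

# A skewed (exponential, non-normal) population of invented values.
population = [rng.expovariate(1.0) for _ in range(100_000)]
true_mean = statistics.fmean(population)

# Means of repeated 500-row samples cluster tightly around the true
# mean even though the individual values are nothing like normal.
sample_means = [statistics.fmean(rng.sample(population, 500))
                for _ in range(200)]

print(round(true_mean, 3), round(statistics.fmean(sample_means), 3))
```

Albe's counterpoint stands, though: this concentration applies to averages like nullfrac or avg_width, not to statistics such as n_distinct, which do not behave this way.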

Re: [HACKERS] ANALYZE sampling is too good

2013-12-10 Thread Albe Laurenz
Greg Stark wrote: >> It's also applicable for the other stats; histogram buckets constructed >> from a 5% sample are more likely to be accurate than those constructed >> from a 0.1% sample. Same with nullfrac. The degree of improved >> accuracy, would, of course, require some math to determine.

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood
On 10/12/13 17:18, Claudio Freire wrote: On Tue, Dec 10, 2013 at 12:13 AM, Mark Kirkwood wrote: Just one more... The Intel 520 with ext4: Without patch: ANALYZE pgbench_accounts 5s With patch: ANALYZE pgbench_accounts 1s And double checking - With patch, but effective_io_concurrency = 1:

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Claudio Freire
On Tue, Dec 10, 2013 at 12:13 AM, Mark Kirkwood wrote: > Just one more... > > The Intel 520 with ext4: > > > Without patch: ANALYZE pgbench_accounts 5s > With patch: ANALYZE pgbench_accounts 1s > > And double checking - > With patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts 5s

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood
On 10/12/13 15:32, Mark Kirkwood wrote: On 10/12/13 15:17, Mark Kirkwood wrote: On 10/12/13 15:11, Mark Kirkwood wrote: On 10/12/13 15:04, Mark Kirkwood wrote: On 10/12/13 13:53, Mark Kirkwood wrote: On 10/12/13 13:20, Mark Kirkwood wrote: On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood
On 10/12/13 15:17, Mark Kirkwood wrote: On 10/12/13 15:11, Mark Kirkwood wrote: On 10/12/13 15:04, Mark Kirkwood wrote: On 10/12/13 13:53, Mark Kirkwood wrote: On 10/12/13 13:20, Mark Kirkwood wrote: On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13 12:14, Heikki Linnakangas wrote: I to

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Josh Berkus
On 12/09/2013 02:37 PM, Robert Haas wrote: > I've never seen an n_distinct value of more than 5 digits, regardless > of reality. Typically I've seen 20-50k, even if the real number is > much higher. But the n_distinct value is only for non-MCVs, so if we > estimate the selectivity of column = 'ra

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood
On 10/12/13 15:11, Mark Kirkwood wrote: On 10/12/13 15:04, Mark Kirkwood wrote: On 10/12/13 13:53, Mark Kirkwood wrote: On 10/12/13 13:20, Mark Kirkwood wrote: On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANA

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood
On 10/12/13 15:04, Mark Kirkwood wrote: On 10/12/13 13:53, Mark Kirkwood wrote: On 10/12/13 13:20, Mark Kirkwood wrote: On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, pat

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood
On 10/12/13 13:53, Mark Kirkwood wrote: On 10/12/13 13:20, Mark Kirkwood wrote: On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Craig Ringer
On 12/10/2013 05:20 AM, Jim Nasby wrote: > FWIW, if synchronize_seqscans is on I'd think it'd be pretty easy to > fire up a 2nd backend to do the ANALYZE portion (or perhaps use Robert's > fancy new shared memory stuff). Apologies for posting the same as a new idea before I saw your post. I

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Craig Ringer
On 12/06/2013 09:52 AM, Peter Geoghegan wrote: > Has anyone ever thought about opportunistic ANALYZE piggy-backing on > other full-table scans? That doesn't really help Greg, because his > complaint is mostly that a fresh ANALYZE is too expensive, but it > could be an interesting, albeit risky appr

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood
On 10/12/13 13:20, Mark Kirkwood wrote: On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my la

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood
On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a ta

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Tom Lane
Mark Kirkwood writes: > I did a test run: > pgbench scale 2000 (pgbench_accounts approx 25GB). > postgres 9.4 > i7 3.5GHz CPU > 16GB RAM > 500 GB Velociraptor 10K > (cold os and pg cache both runs) > Without patch: ANALYZE pgbench_accounts 90s > With patch: ANALYZE pgbench_accounts 91s >

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Mark Kirkwood
On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly large

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Tom Lane
Heikki Linnakangas writes: > Maybe. Or maybe the heuristic read ahead isn't significant/helpful, when > you're prefetching with posix_fadvise anyway. Yeah. If we're not reading consecutive blocks, readahead is unlikely to do anything anyhow. Claudio's comments do suggest that it might be a bad

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Claudio Freire
On Mon, Dec 9, 2013 at 8:45 PM, Heikki Linnakangas wrote: > Claudio Freire wrote: >> On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas wrote: >>> I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very >>> easy, patch attached. Your mileage may vary, but I'm seeing a nice

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Heikki Linnakangas
Claudio Freire wrote: > On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas wrote: >> I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very >> easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from >> this on my laptop. Taking a 3 page sample of a

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Claudio Freire
On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas wrote: > On 12/09/2013 11:56 PM, Claudio Freire wrote: >> Without patches to the kernel, it is much better. >> >> posix_fadvise interferes with read-ahead, so posix_fadvise on, say, >> bitmap heap scans (or similarly sorted analyze block samples)

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Heikki Linnakangas
On 12/09/2013 11:56 PM, Claudio Freire wrote: On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas wrote: On 12/09/2013 11:35 PM, Jim Nasby wrote: On 12/8/13 1:49 PM, Heikki Linnakangas wrote: On 12/08/2013 08:14 PM, Greg Stark wrote: The whole accounts table is 1.2GB and contains 10 millio

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Robert Haas
On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes wrote: > My reading of the code is that if it is not in the MCV, then it is assumed > to have the average selectivity (about 1/n_distinct, but deflating top and > bottom for the MCV list). There is also a check that it is less than the > least common of

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Claudio Freire
On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas wrote: > On 12/09/2013 11:35 PM, Jim Nasby wrote: >> >> On 12/8/13 1:49 PM, Heikki Linnakangas wrote: >>> >>> On 12/08/2013 08:14 PM, Greg Stark wrote: The whole accounts table is 1.2GB and contains 10 million rows. As expected with

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Heikki Linnakangas
On 12/09/2013 11:35 PM, Jim Nasby wrote: On 12/8/13 1:49 PM, Heikki Linnakangas wrote: On 12/08/2013 08:14 PM, Greg Stark wrote: The whole accounts table is 1.2GB and contains 10 million rows. As expected with rows_per_block set to 1 it reads 240MB of that containing nearly 2 million rows (and

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Peter Geoghegan
On Mon, Dec 9, 2013 at 1:18 PM, Jeff Janes wrote: > I don't recall ever tracing a bad plan down to a bad n_distinct. It does happen. I've seen it several times. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: htt

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Jim Nasby
On 12/8/13 1:49 PM, Heikki Linnakangas wrote: On 12/08/2013 08:14 PM, Greg Stark wrote: The whole accounts table is 1.2GB and contains 10 million rows. As expected with rows_per_block set to 1 it reads 240MB of that containing nearly 2 million rows (and takes nearly 20s -- doing a full table sca

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Jim Nasby
On 12/6/13 3:21 AM, Andres Freund wrote: On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote: Has anyone ever thought about opportunistic ANALYZE piggy-backing on other full-table scans? That doesn't really help Greg, because his complaint is mostly that a fresh ANALYZE is too expensive, but it

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Jeff Janes
On Sat, Dec 7, 2013 at 11:46 AM, Robert Haas wrote: > On Tue, Dec 3, 2013 at 6:30 PM, Greg Stark wrote: > > I always gave the party line that ANALYZE only takes a small > > constant-sized sample so even very large tables should be very quick. > > But after hearing the same story again in Heroku

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Greg Stark
On Mon, Dec 9, 2013 at 6:54 PM, Greg Stark wrote: > > This "some math" is straightforward basic statistics. The 95th > percentile confidence interval for a sample consisting of 300 samples > from a population of a 1 million would be 5.66%. A sample consisting > of 1000 samples would have a 95th p

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Greg Stark
On Mon, Dec 9, 2013 at 6:03 PM, Josh Berkus wrote: > > It's also applicable for the other stats; histogram buckets constructed > from a 5% sample are more likely to be accurate than those constructed > from a 0.1% sample. Same with nullfrac. The degree of improved > accuracy, would, of course,

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Robert Haas
On Mon, Dec 9, 2013 at 1:03 PM, Josh Berkus wrote: >> I really don't believe the 5% thing. It's not enough for n_distinct >> and it's *far* too high a value for linear properties like histograms >> or nullfrac etc. > > Actually, it is enough for n_distinct, or more properly, 5% is as good > as you

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Tom Lane
Josh Berkus writes: > Reading 5% of a 200GB table is going to be considerably faster than > reading the whole thing, if that 5% is being scanned in a way that the > FS understands. Really? See the upthread point that reading one sector from each track has just as much seek overhead as reading th

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Josh Berkus
Greg, > I really don't believe the 5% thing. It's not enough for n_distinct > and it's *far* too high a value for linear properties like histograms > or nullfrac etc. Actually, it is enough for n_distinct, or more properly, 5% is as good as you can get for n_distinct unless you're going to jump

Re: [HACKERS] ANALYZE sampling is too good

2013-12-08 Thread Mark Kirkwood
On 09/12/13 08:03, Josh Berkus wrote: So there's a set of math designed to calculate for the skew introduced by reading *all* of the rows in each block. That's what I meant by "block-based sampling"; you read, say, 400 pages, you compile statistics on *all* of the rows on those pages, you apply

Re: [HACKERS] ANALYZE sampling is too good

2013-12-08 Thread Greg Stark
On Sun, Dec 8, 2013 at 7:03 PM, Josh Berkus wrote: > They will certainly do so if you don't apply any statistical adjustments > for selecting more rows from the same pages. > > So there's a set of math designed to calculate for the skew introduced > by reading *all* of the rows in each block. I j
