Re: [HACKERS] ANALYZE sampling is too good
I assume we never came up with a TODO from this thread: --- On Tue, Dec 3, 2013 at 11:30:44PM +, Greg Stark wrote: At multiple conferences I've heard about people trying all sorts of gymnastics to avoid ANALYZE, which they expect to take too long and consume too much I/O. This is especially a big complaint after upgrades, when their new database performs poorly until the new statistics are in; they did pg_upgrade to avoid an extended downtime and then complain about ANALYZE taking hours. I always gave the party line that ANALYZE only takes a small constant-sized sample, so even very large tables should be very quick. But after hearing the same story again at Heroku I looked into it a bit further. I was kind of shocked by the numbers. ANALYZE takes a sample of 300 * statistics_target rows. That sounds pretty reasonable, but with default_statistics_target set to 100 that's 30,000 rows. If I'm reading the code right, it takes this sample by sampling 30,000 blocks and then (if the table is large enough) taking an average of one row per block. Each block is 8192 bytes, so that means it's reading 240MB of each table. That's a lot more than I realized. It means if your table is anywhere up to 240MB you're effectively doing a full table scan and then throwing out nearly all the data read. Worse, my experience with the posix_fadvise benchmarking is that on spinning media reading one out of every 16 blocks takes about the same time as reading them all. Presumably this is because the seek time between tracks dominates, and reading one out of every 16 blocks still reads every track. So in fact if your table is up to about 3-4G, ANALYZE is still effectively going to do a full table scan, at least as far as I/O time goes. The current algorithm seems like it was designed with a 100G+ table in mind, but the consequences for the more common 100M-100G tables weren't really considered. Consider what this means for partitioned tables. 
If they partition their terabyte table into 10 partitions, ANALYZE will suddenly want to use 10x as much I/O, which seems like a perverse consequence. I'm not sure I have a prescription, but my general feeling is that we're spending an awful lot of resources going after a statistically valid sample when we could spend a lot less and get something 90% as good. Or, if we're really going to read that much data, we might as well use more of the rows we find. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
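As a sanity check, the arithmetic in Greg's message can be reproduced in a few lines (a sketch; the 300-rows-per-target multiplier and 8192-byte block size are the figures cited above):

```python
# Reproduce the I/O arithmetic from the message above.
BLOCK_SIZE = 8192                     # default PostgreSQL heap page size, in bytes
default_statistics_target = 100       # the default cited above

sample_rows = 300 * default_statistics_target   # rows ANALYZE samples
# For a sufficiently large table, roughly one sampled row per block,
# so about sample_rows blocks get read in the worst case:
worst_case_bytes = sample_rows * BLOCK_SIZE

print(sample_rows)                          # 30000
print(worst_case_bytes / (1024 ** 2))       # 234.375 -- the "240MB" above
```

So any table up to roughly this size is read almost in full, which is the point of the complaint.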
Re: [HACKERS] ANALYZE sampling is too good
On 12/17/2013 12:06 AM, Jeff Janes wrote: On Mon, Dec 9, 2013 at 3:14 PM, Heikki Linnakangas hlinnakan...@vmware.comwrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. Performance is often chaotic near transition points, so I try to avoid data sets that are slightly bigger or slightly smaller than RAM (or some other limit). Do you know how many io channels your SSD has (or whatever the term of art is for SSD drives)? No idea. It's an Intel 335. On a RAID with 12 spindles, analyzing pgbench_accounts at scale 1000 (13GB) with 4 GB of RAM goes from ~106 seconds to ~19 seconds. However, I'm not sure what problem we want to solve here. The case that Greg Stark mentioned in the email starting this thread is doing a database-wide ANALYZE after an upgrade. In that use case, you certainly want to get it done as quickly as possible, using all the available resources. I certainly would not wish to give a background maintenance process permission to confiscate my entire RAID throughput for its own operation. Then don't set effective_io_concurrency. If you're worried about that, you probably wouldn't want any other process to monopolize the RAID array either. Perhaps this could only be active for explicit analyze, and only if vacuum_cost_delay=0? That would be a bit weird, because ANALYZE in general doesn't obey vacuum_cost_delay. Maybe it should, though... 
Perhaps there should be something like alter background role autovac set Otherwise we are going to end up with an autovacuum_* shadow parameter for many of our parameters, see autovacuum_work_mem discussions. Yeah, so it seems. - Heikki
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 3:14 PM, Heikki Linnakangas hlinnakan...@vmware.comwrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. Performance is often chaotic near transition points, so I try to avoid data sets that are slightly bigger or slightly smaller than RAM (or some other limit). Do you know how many io channels your SSD has (or whatever the term of art is for SSD drives)? On a RAID with 12 spindles, analyzing pgbench_accounts at scale 1000 (13GB) with 4 GB of RAM goes from ~106 seconds to ~19 seconds. However, I'm not sure what problem we want to solve here. I certainly would not wish to give a background maintenance process permission to confiscate my entire RAID throughput for its own operation. Perhaps this could only be active for explicit analyze, and only if vacuum_cost_delay=0? Perhaps there should be something like alter background role autovac set Otherwise we are going to end up with an autovacuum_* shadow parameter for many of our parameters, see autovacuum_work_mem discussions. Cheers, Jeff
Re: [HACKERS] ANALYZE sampling is too good
Here's an analysis of Jeff Janes' simple example of a table where our n_distinct estimate is way off. On Dec11, 2013, at 00:03 , Jeff Janes jeff.ja...@gmail.com wrote: create table baz as select floor(random()*1000), md5(random()::text) from generate_series(1,1); create table baz2 as select * from baz order by floor; create table baz3 as select * from baz order by md5(floor::text); baz is unclustered, baz2 is clustered with perfect correlation, baz3 is clustered but without correlation. After analyzing all of them:

select tablename, n_distinct, correlation from pg_stats where tablename like 'baz%' and attname='floor';

 tablename | n_distinct  | correlation
-----------+-------------+-------------
 baz       | 8.56006e+06 |  0.00497713
 baz2      | 333774      |  1
 baz3      | 361048      | -0.0118147

I think I understand what's going on here. I worked with a reduced test case of 1e7 rows containing random values between 0 and 5e5, and instrumented analyze to print the raw ndistinct and nmultiple values of the sample population (i.e. the number of distinct values in the sample, and the number of distinct values which appeared more than once). I've also considered only baz and baz2, and thus removed the then-unnecessary md5 column. To account for the reduced table sizes, I adjusted default_statistics_target to 10 instead of 100. The resulting estimates are then

 tablename | n_distinct (est) | n_distinct (act)
-----------+------------------+------------------
 baz       | 391685           | 50
 baz2      | 36001            | 50

ANALYZE assumes that both tables contain 1048 rows and samples 3000 of those. The sample of baz contains 2989 distinct values, 11 of which appear more than once. The sample of baz2 contains 2878 distinct values, 117 (!) of which appear more than once. The very different results stem from the Duj1 estimator we use. It estimates n_distinct by computing n*d/(n - f1 + f1*n/N), where n is the number of samples, N the number of rows, d the number of distinct samples, and f1 the number of distinct samples occurring exactly once. If all samples are unique (i.e. n=d=f1) this yields N. 
But if f1 is less than d, the result drops very quickly - sampling baz2 produces 117 non-unique values out of 2878 - roughly 0.03% - and the estimate is already less than 1/10 of what it would be if f1 were 0. Which leaves us with the question of why sampling baz2 produces more duplicate values than sampling baz does. Now, if we used block sampling, that behaviour would be unsurprising - baz2 is sorted, so each block very likely contains each value more than once, since the row count exceeds the number of possible values by more than a magnitude. In fact, with block sampling, we'd probably see f1 values close to 0, and thus our estimate of n_distinct would be roughly equal to the number of distinct values in the *sample* population, i.e. about 300 or so. So why does Vitter's algorithm fail here? Given that we see inflated numbers of duplicate values in our sample, yet still far fewer than block-based sampling would yield, my guess is that it's the initial reservoir filling that bites us here. After that initial filling step, the reservoir contains a lot of consecutive rows, i.e. a block-based sample taken from just a few blocks. If the replacement phase that follows somehow uses a slightly smaller replacement probability than it should, more of these rows will survive replacement, resulting in exactly the kind of inflated numbers of duplicate values we're seeing. I've yet to validate this by looking at the reservoir before and after the replacement stage, though. So at least for the purpose of estimating n_distinct, our current sampling method seems to exhibit the worst rather than the best properties of block- and row-based sampling. What conclusions to draw from that, I'm not sure yet - other than that if we move to block-based sampling, we'll certainly have to change the way we estimate n_distinct. 
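The Duj1 formula above can be checked numerically. This is a sketch: the f1 values are my assumption, derived by supposing each reported non-unique sample value occurs exactly twice (so f1 = d minus the count of non-unique values), with N = 1e7 as in the reduced test case:

```python
def duj1(n, N, d, f1):
    """Duj1 n_distinct estimator: n*d / (n - f1 + f1*n/N)."""
    return n * d / (n - f1 + f1 * n / N)

# baz: n=3000 samples from N=1e7 rows, d=2989 distinct, 11 non-unique -> f1 ~ 2978
print(round(duj1(3000, 10_000_000, 2989, 2989 - 11)))    # 391685
# baz2: d=2878 distinct, 117 non-unique -> f1 ~ 2761
print(round(duj1(3000, 10_000_000, 2878, 2878 - 117)))   # 36001
```

Both match the estimates reported above, which confirms that the sharp drop for baz2 is driven almost entirely by f1.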
best regards, Florian Pflug
Re: [HACKERS] ANALYZE sampling is too good
On Wed, Dec 11, 2013 at 2:33 PM, Greg Stark st...@mit.edu wrote: I think we're all wet here. I don't see any bias towards larger or smaller rows. Larger rows will be on a larger number of pages, but there will be fewer of them on any one page. The average effect should be the same. Smaller values might have a higher variance with block based sampling than larger values. But that actually *is* the kind of thing that Simon's approach of just compensating with later samples can deal with. I think that looking at all rows in randomly-chosen blocks will not bias size or histograms. But it will bias n_distinct and MCV for some data distributions, unless we find some way to compensate for it. But even for avg size and histograms, what does block sampling get us? We get larger sample sizes for the same IO, but those samples are less independent (assuming data is not randomly scattered over the table), so the effective sample size is less than the true sample size. So we can't just sample 100 times fewer blocks because there are about 100 rows per block--doing so would not bias our avg size or histogram boundaries, but it would certainly make them noisier. Cheers, Jeff
Re: [HACKERS] ANALYZE sampling is too good
On Thu, Dec 12, 2013 at 6:39 AM, Florian Pflug f...@phlo.org wrote: Here's an analysis of Jeff Janes' simple example of a table where our n_distinct estimate is way off. On Dec11, 2013, at 00:03 , Jeff Janes jeff.ja...@gmail.com wrote: create table baz as select floor(random()*1000), md5(random()::text) from generate_series(1,1); create table baz2 as select * from baz order by floor; create table baz3 as select * from baz order by md5(floor::text); baz is unclustered, baz2 is clustered with perfect correlation, baz3 is clustered but without correlation. After analyzing all of them:

select tablename, n_distinct, correlation from pg_stats where tablename like 'baz%' and attname='floor';

 tablename | n_distinct  | correlation
-----------+-------------+-------------
 baz       | 8.56006e+06 |  0.00497713
 baz2      | 333774      |  1
 baz3      | 361048      | -0.0118147

I think I understand what's going on here. I worked with a reduced test case of 1e7 rows containing random values between 0 and 5e5, and instrumented analyze to print the raw ndistinct and nmultiple values of the sample population (i.e. the number of distinct values in the sample, and the number of distinct values which appeared more than once). I've also considered only baz and baz2, and thus removed the then-unnecessary md5 column. To account for the reduced table sizes, I adjusted default_statistics_target to 10 instead of 100. The resulting estimates are then

 tablename | n_distinct (est) | n_distinct (act)
-----------+------------------+------------------
 baz       | 391685           | 50
 baz2      | 36001            | 50

ANALYZE assumes that both tables contain 1048 rows and samples 3000 of those. The sample of baz contains 2989 distinct values, 11 of which appear more than once. The sample of baz2 contains 2878 distinct values, 117 (!) of which appear more than once. The very different results stem from the Duj1 estimator we use. 
It estimates n_distinct by computing n*d/(n - f1 + f1*n/N), where n is the number of samples, N the number of rows, d the number of distinct samples, and f1 the number of distinct samples occurring exactly once. If all samples are unique (i.e. n=d=f1) this yields N. But if f1 is less than d, the result drops very quickly - sampling baz2 produces 117 non-unique values out of 2878 - roughly 0.03% - and the estimate is already less than 1/10 of what it would be if f1 were 0. Which leaves us with the question of why sampling baz2 produces more duplicate values than sampling baz does. Now, if we used block sampling, that behaviour would be unsurprising - baz2 is sorted, so each block very likely contains each value more than once, since the row count exceeds the number of possible values by more than a magnitude. In fact, with block sampling, we'd probably see f1 values close to 0, and thus our estimate of n_distinct would be roughly equal to the number of distinct values in the *sample* population, i.e. about 300 or so. So why does Vitter's algorithm fail here? Given that we see inflated numbers of duplicate values in our sample, yet still far fewer than block-based sampling would yield, my guess is that it's the initial reservoir filling that bites us here. After that initial filling step, the reservoir contains a lot of consecutive rows, i.e. a block-based sample taken from just a few blocks. If the replacement phase that follows somehow uses a slightly smaller replacement probability than it should, more of these rows will survive replacement, resulting in exactly the kind of inflated numbers of duplicate values we're seeing. I've yet to validate this by looking at the reservoir before and after the replacement stage, though. I think the problem is more subtle than that. It is easier to visualize it if you think of every block having the same number of rows, with that number being fairly large. 
If you pick 30,000 rows at random from 1,000,000 blocks, the number of rows chosen from any given block should be close to following a Poisson distribution with an average of 0.03, which means about 29113 blocks should have exactly 1 row chosen from them while 441 would have two or more rows chosen from them. But if you instead select 30,000 rows from 30,000 blocks, which is what we ask Vitter's algorithm to do, you get about a Poisson distribution with an average of 1.0. Then about 11036 blocks have exactly one row chosen from them, and 7927 blocks have two or more rows sampled from them. Another 11,036 blocks get zero rows selected from them due to Vitter, in addition to the 970,000 that didn't even get submitted to Vitter in the first place. That is why you see too many duplicates for clustered data, as too many blocks are sampled multiple times. The Poisson argument doesn't apply cleanly when blocks have variable numbers of rows, but the general principle still applies.
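Jeff's block counts can be reproduced directly from the Poisson probability mass function (a sketch; the block and row counts are the ones used in his example):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    # P(X = k) for X ~ Poisson(lam)
    return lam ** k * exp(-lam) / factorial(k)

# Case 1: 30,000 rows spread over 1,000,000 blocks -> lambda = 0.03 per block
blocks = 1_000_000
lam = 30_000 / blocks
print(round(blocks * poisson_pmf(1, lam)))                              # 29113 blocks with exactly 1 row
print(round(blocks * (1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)))) # 441 blocks with 2+ rows

# Case 2: 30,000 rows drawn from only 30,000 candidate blocks -> lambda = 1.0
blocks, lam = 30_000, 1.0
print(round(blocks * poisson_pmf(0, lam)))                              # 11036 blocks with 0 rows
print(round(blocks * poisson_pmf(1, lam)))                              # 11036 blocks with exactly 1 row
print(round(blocks * (1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)))) # 7927 blocks with 2+ rows
```

All five figures match the ones quoted above, so the "too many blocks sampled multiple times" effect falls straight out of the lambda = 1.0 regime.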
Re: [HACKERS] ANALYZE sampling is too good
Jeff Janes jeff.ja...@gmail.com writes: It would be relatively easy to fix this if we trusted the number of visible rows in each block to be fairly constant. But without that assumption, I don't see a way to fix the sample selection process without reading the entire table. Yeah, varying tuple density is the weak spot in every algorithm we've looked at. The current code is better than what was there before, but as you say, not perfect. You might be entertained to look at the threads referenced by the patch that created the current sampling method: http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gt...@email.aon.at particularly http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at However ... where this thread started was not about trying to reduce the remaining statistical imperfections in our existing sampling method. It was about whether we could reduce the number of pages read for an acceptable cost in increased statistical imperfection. regards, tom lane
Re: [HACKERS] ANALYZE sampling is too good
On Thu, Dec 12, 2013 at 3:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jeff Janes jeff.ja...@gmail.com writes: It would be relatively easy to fix this if we trusted the number of visible rows in each block to be fairly constant. But without that assumption, I don't see a way to fix the sample selection process without reading the entire table. Yeah, varying tuple density is the weak spot in every algorithm we've looked at. The current code is better than what was there before, but as you say, not perfect. You might be entertained to look at the threads referenced by the patch that created the current sampling method: http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gt...@email.aon.at particularly http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at However ... where this thread started was not about trying to reduce the remaining statistical imperfections in our existing sampling method. It was about whether we could reduce the number of pages read for an acceptable cost in increased statistical imperfection. Well, why not take a supersample containing all visible tuples from N selected blocks, and do bootstrapping over it, with subsamples of M independent rows each?
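The supersample-plus-bootstrap idea can be sketched as follows. This is a toy illustration, not an implementation of anything in Postgres: the supersample is synthetic stand-in data, and the statistic bootstrapped here (a median, i.e. a histogram bound) is one of the things such resampling could estimate:

```python
import random
import statistics

random.seed(0)

# Toy "supersample": pretend we kept every visible row from N sampled blocks.
# (Synthetic values; real rows would come from the table's sampled blocks.)
supersample = [random.gauss(50.0, 10.0) for _ in range(5000)]

# Bootstrap: repeatedly draw M rows with replacement and recompute the
# statistic of interest. The spread of the replicates indicates how much
# the estimate can be trusted.
M = 500
medians = [statistics.median(random.choices(supersample, k=M))
           for _ in range(200)]

print(statistics.fmean(medians))   # ~50, near the supersample's true median
```

As the later replies note, this kind of resampling helps with order statistics like histogram bounds, but not with n_distinct, where exact ties carry the information.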
Re: [HACKERS] ANALYZE sampling is too good
On 12/12/2013 10:33 AM, Claudio Freire wrote: Well, why not take a supersample containing all visible tuples from N selected blocks, and do bootstrapping over it, with subsamples of M independent rows each? Well, we still need to look at each individual block to determine grouping correlation. Let's take a worst case example: imagine a table has *just* been created by: CREATE TABLE newdata AS SELECT * FROM olddata ORDER BY category, item; If category is fairly low cardinality, then grouping will be severe; we can reasonably expect that if we sample 100 blocks, many of them will have only one category value present. The answer to this is to make our block samples fairly widely spaced and compare them. In this simplified example, if the table had 1000 blocks, we would take blocks 1,101,201,301,401,etc. Then we would compare the number and content of values found on each block with the number and content found on each other block. For example, if we see that block 101 is entirely the category cats, and block 701 is entirely the category shopping and block 901 is split 60/40 between the categories transportation and voting, then we can assume that the level of grouping is very high, and the number of unknown values we haven't seen is also high. Whereas if 101 is cats and 201 is cats and 301 through 501 are cats with 2% other stuff, then we assume that the level of grouping is moderate and it's just the case that most of the dataset is cats. Which means that the number of unknown values we haven't seen is low. Whereas if 101, 201, 501, and 901 have near-identical distributions of values, we assume that the level of grouping is very low, and that there are very few values we haven't seen. As someone else pointed out, full-block (the proposal) vs. random-row (our current style) doesn't have a very significant effect on estimates of Histograms and nullfrac, as long as the sampled blocks are widely spaced. 
Well, nullfrac is affected in the extreme example of a totally ordered table where the nulls are all in one block, but I'll point out that we can (and do) also miss that using our current algo. Estimated grouping should, however, affect MCVs. In cases where we estimate that grouping levels are high, the expected % of observed values should be discounted somehow. That is, with total random distribution you have a 1:1 ratio between observed frequency of a value and assumed frequency. However, with highly grouped values, you might have a 2:1 ratio. Again, more math (backed by statistical analysis) is needed. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] ANALYZE sampling is too good
On Thu, Dec 12, 2013 at 3:56 PM, Josh Berkus j...@agliodbs.com wrote: Estimated grouping should, however, affect MCVs. In cases where we estimate that grouping levels are high, the expected % of observed values should be discounted somehow. That is, with total random distribution you have a 1:1 ratio between observed frequency of a value and assumed frequency. However, with highly grouped values, you might have a 2:1 ratio. Cross validation can help there. But it's costly.
Re: [HACKERS] ANALYZE sampling is too good
On Thu, Dec 12, 2013 at 10:33 AM, Claudio Freire klaussfre...@gmail.comwrote: On Thu, Dec 12, 2013 at 3:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jeff Janes jeff.ja...@gmail.com writes: It would be relatively easy to fix this if we trusted the number of visible rows in each block to be fairly constant. But without that assumption, I don't see a way to fix the sample selection process without reading the entire table. Yeah, varying tuple density is the weak spot in every algorithm we've looked at. The current code is better than what was there before, but as you say, not perfect. You might be entertained to look at the threads referenced by the patch that created the current sampling method: http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gt...@email.aon.at particularly http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at Thanks, I will read those. However ... where this thread started was not about trying to reduce the remaining statistical imperfections in our existing sampling method. It was about whether we could reduce the number of pages read for an acceptable cost in increased statistical imperfection. I think it is pretty clear that n_distinct at least, and probably MCV, would be a catastrophe under some common data distribution patterns if we sample all rows in each block without changing our current computation method. If we come up with a computation that works for that type of sampling, it would probably be an improvement under our current sampling as well. If we find such a thing, I wouldn't want it to get rejected just because the larger block-sampling method change did not make it in. Well, why not take a supersample containing all visible tuples from N selected blocks, and do bootstrapping over it, with subsamples of M independent rows each? Bootstrapping methods generally do not work well when ties are significant events, i.e. 
when two values being identical is meaningfully different from them being very close but not identical. Cheers, Jeff
Re: [HACKERS] ANALYZE sampling is too good
On Thu, Dec 12, 2013 at 4:13 PM, Jeff Janes jeff.ja...@gmail.com wrote: Well, why not take a supersample containing all visible tuples from N selected blocks, and do bootstrapping over it, with subsamples of M independent rows each? Bootstrapping methods generally do not work well when ties are significant events, i.e. when two values being identical is meaningfully different from them being very close but not identical. Yes, that's why I meant to say (but I see now that I didn't) that it wouldn't do much for n_distinct, just the histogram.
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 3, 2013 at 3:30 PM, Greg Stark st...@mit.edu wrote: At multiple conferences I've heard about people trying all sorts of gymnastics to avoid ANALYZE which they expect to take too long and consume too much I/O. This is especially a big complain after upgrades when their new database performs poorly until the new statistics are in and they did pg_upgrade to avoid an extended downtime and complain about ANALYZE taking hours. Out of curiosity, are they using the 3 stage script analyze_new_cluster.sh? If so, is the complaint that even the first rounds are too slow, or that the database remains unusable until the last round is complete? Cheers, Jeff
Re: [HACKERS] ANALYZE sampling is too good
On Dec12, 2013, at 19:29 , Tom Lane t...@sss.pgh.pa.us wrote: However ... where this thread started was not about trying to reduce the remaining statistical imperfections in our existing sampling method. It was about whether we could reduce the number of pages read for an acceptable cost in increased statistical imperfection. True, but Jeff's case shows that even the imperfections of the current sampling method are larger than what the n_distinct estimator expects. Making it even more biased will thus require us to rethink how we obtain a n_distinct estimate or something equivalent. I don't mean that as an argument against changing the sampling method, just as something to watch out for. best regards, Florian Pflug
Re: [HACKERS] ANALYZE sampling is too good
On 11/12/13 19:34, Simon Riggs wrote: Realistically, I never heard of an Oracle DBA doing advanced statistical mathematics before setting the sample size on ANALYZE. You use the default and bump it up if the sample is insufficient for the data. I'm not sure that Oracle's stats and optimizer design is an example to be envied - pretty much all Oracle DBAs I've encountered will apply hints to all queries to get the plan they want... Regards Mark
Re: [HACKERS] ANALYZE sampling is too good
On Wed, Dec 11, 2013 at 12:58 AM, Simon Riggs si...@2ndquadrant.com wrote: Yes, it is not a perfect statistical sample. All sampling is subject to an error that is data dependent. Well there's random variation due to the limitations of dealing with a sample. And then there's systemic biases due to incorrect algorithms. You wouldn't be happy if the samples discarded every row with NULLs or every row older than some date etc. These things would not be corrected by larger samples. That's the kind of error we're talking about here. But the more I think about things the less convinced I am that there is a systemic bias introduced by reading the entire block. I had assumed larger rows would be selected against but that's not really true, they're just selected against relative to the number of bytes they occupy which is the correct frequency to sample. Even blocks that are mostly empty don't really bias things. Picture a table that consists of 100 blocks with 100 rows each (value A) and another 100 blocks with only 1 row each (value B). The rows with value B have a 50% chance of being in any given block which is grossly inflated however each block selected with value A will produce 100 rows. So if you sample 10 blocks you'll get 100x10xA and 1x10xB which will be the correct proportion. I'm not actually sure there is any systemic bias here. The larger number of rows per block generate less precise results but from my thought experiments they seem to still be accurate?
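Greg's 100-block thought experiment is easy to check with a quick simulation (a sketch with hypothetical data matching his description: 100 full blocks of value A at 100 rows each, and 100 blocks holding a single value-B row):

```python
import random

random.seed(42)

# Rows-per-block layout from the thought experiment:
# 100 blocks of 100 "A" rows, 100 blocks each holding a single "B" row.
block_rows = [("A", 100)] * 100 + [("B", 1)] * 100

a_rows = b_rows = 0
for _ in range(20_000):                     # repeat the 10-block sample many times
    for value, nrows in random.sample(block_rows, 10):
        if value == "A":
            a_rows += nrows                 # a sampled A block yields 100 rows
        else:
            b_rows += nrows                 # a sampled B block yields 1 row

# True table proportion is 10,000 A rows to 100 B rows, i.e. 100:1.
print(a_rows / b_rows)                      # ~100, matching the table's proportions
```

The sampled ratio converges on the table's true 100:1 proportion, supporting the claim that whole-block reading by itself doesn't systematically bias value frequencies, only their variance.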
Re: [HACKERS] ANALYZE sampling is too good
On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark st...@mit.edu wrote: I'm not actually sure there is any systemic bias here. The larger number of rows per block generate less precise results but from my thought experiments they seem to still be accurate? So I've done some empirical tests for a table generated by: create table sizeskew as (select i,j,repeat('i',i) from generate_series(1,1000) as i, generate_series(1,1000) as j); I find that using the whole block doesn't cause any problem with the avg_width field for the repeat column. That does reinforce my belief that we might not need any particularly black magic here. It does however cause a systemic error in the histogram bounds. It seems the median is systematically overestimated by more and more the more rows per block are used:

1: 524
4: 549
8: 571
12: 596
16: 602
20: 618 (total sample slightly smaller than normal)
30: 703 (substantially smaller sample)

So there is something clearly wonky in the histogram stats that's affected by the distribution of the sample. The only thing I can think of is maybe the most common elements are being selected preferentially from the early part of the sample, which is removing a substantial part of the lower end of the range. But even removing 100 from the beginning shouldn't be enough to push the median above 550. -- greg
Re: [HACKERS] ANALYZE sampling is too good
On Wed, Dec 11, 2013 at 12:08 PM, Greg Stark st...@mit.edu wrote: The only thing I can think of is maybe the most common elements are being selected preferentially from the early part of the sample which is removing a substantial part of the lower end of the range. But even removing 100 from the beginning shouldn't be enough to push the median above 550. Just to follow up here. I think what's going on is that not only are the most_common_vals being preferentially taken from the beginning of the sample but also their frequency is being massively overestimated. All values have a frequency of about .001 but the head of the MCV list has a frequency as high as .10 in some of my tests. -- greg
Re: [HACKERS] ANALYZE sampling is too good
On 12/11/2013 02:08 PM, Greg Stark wrote: On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark st...@mit.edu wrote: I'm not actually sure there is any systemic bias here. The larger number of rows per block generate less precise results but from my thought experiments they seem to still be accurate? So I've done some empirical tests for a table generated by: create table sizeskew as (select i,j,repeat('i',i) from generate_series(1,1000) as i, generate_series(1,1000) as j); I find that using the whole block doesn't cause any problem with the avg_width field for the repeat column. That does reinforce my belief that we might not need any particularly black magic here. How large a sample did you use? Remember that the point of doing block-level sampling instead of the current approach would be to allow using a significantly smaller sample (in # of blocks), and still achieve the same sampling error. If the sample is large enough, it will mask any systemic bias caused by block-sampling, but the point is to reduce the number of sampled blocks. The practical question here is this: What happens to the quality of the statistics if you only read 1/2 the number of blocks than you normally would, but included all the rows in the blocks we read in the sample? How about 1/10? Or to put it another way: could we achieve more accurate statistics by including all rows from the sampled blocks, while reading the same number of blocks? In particular, I wonder if it would help with estimating ndistinct. It generally helps to have a larger sample for ndistinct estimation, so it might be beneficial. - Heikki
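Heikki's "same blocks, more rows" question can be poked at with a quick simulation. This is only an illustrative sketch with made-up clustering parameters, not anything from the thread: values are correlated within pages, and we compare the average estimation error of taking one row per block against taking every row from the same number of blocks.

```python
import random
import statistics

random.seed(4)

# Toy table: 10,000 pages of 50 rows, where each page has its own offset,
# i.e. values are clustered within pages (the hard case for block sampling).
ROWS_PER_PAGE = 50
pages = []
for _ in range(10_000):
    offset = random.gauss(0, 10)                 # between-page variation
    pages.append([offset + random.gauss(0, 1)    # within-page variation
                  for _ in range(ROWS_PER_PAGE)])

true_mean = statistics.fmean(v for page in pages for v in page)

def estimate(n_blocks, rows_per_block):
    """Estimate the table mean from n_blocks pages, rows_per_block rows each."""
    sample = []
    for page in random.sample(pages, n_blocks):
        sample.extend(random.sample(page, rows_per_block))
    return statistics.fmean(sample)

def avg_error(n_blocks, rows_per_block, trials=200):
    return statistics.fmean(abs(estimate(n_blocks, rows_per_block) - true_mean)
                            for _ in range(trials))

one_row = avg_error(300, 1)               # 300 rows from 300 blocks
all_rows = avg_error(300, ROWS_PER_PAGE)  # 15,000 rows from 300 blocks
# With strong clustering the 50x larger sample barely reduces the error,
# because rows within a block are highly correlated.
print(round(one_row, 2), round(all_rows, 2))
```

In this toy model the extra rows are nearly free in I/O terms but add little precision for the mean; whether they still help ndistinct and MCV estimation is exactly Heikki's open question.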
Re: [HACKERS] ANALYZE sampling is too good
On Dec10, 2013, at 15:32 , Claudio Freire klaussfre...@gmail.com wrote: On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark st...@mit.edu wrote: On 10 Dec 2013 08:28, Albe Laurenz laurenz.a...@wien.gv.at wrote: Doesn't all that assume a normally distributed random variable? I don't think so because of the law of large numbers. If you have a large population and sample it the sample behaves like a normal distribution even if the distribution of the population isn't. No, the large population says that if you have an AVERAGE of many samples of a random variable, the random variable that is the AVERAGE behaves like a normal. Actually, that's the central limit theorem, and it doesn't hold for all random variables, only for those with finite expected value and variance. The law of large numbers, in contrast, only tells you that the AVERAGE of n samples of a random variable will converge to the random variable's expected value as n goes to infinity (there are different versions of the law which guarantee different kinds of convergence, weak or strong). best regards, Florian Pflug
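The distinction Florian draws can be seen numerically. A hedged sketch (illustrative Python with my own numbers, not from the thread): the population is exponential, which is far from normal, yet averages of samples behave like a normal centred on the expected value (central limit theorem) and concentrate there as n grows (law of large numbers).

```python
import random
import statistics

random.seed(42)

# Population: exponential with rate 1 -- heavily skewed, definitely not
# normal.  Expected value 1, standard deviation 1; both finite, so the
# central limit theorem applies to averages of samples.
n = 100        # sample size per average
trials = 2000  # how many averages we draw

means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(trials)]

# The averages cluster around the expected value 1.0 (law of large numbers)
# with spread about 1/sqrt(n) = 0.1 (central limit theorem).  The raw
# variable itself stays skewed, which is why per-value statistics such as
# ndistinct can't lean on normality.
print(round(statistics.fmean(means), 2))   # close to 1.0
print(round(statistics.stdev(means), 2))   # close to 0.1
```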
Re: [HACKERS] ANALYZE sampling is too good
On 11 December 2013 12:08, Greg Stark st...@mit.edu wrote: So there is something clearly wonky in the histogram stats that's affected by the distribution of the sample. ...in the case where the avg width changes in a consistent manner across the table. Well spotted. ISTM we can have a specific cross check for bias in the sample of that nature. We just calculate the avg width per block and then check for correlation of the avg width against block number. If we find bias we can calculate how many extra blocks to sample and from where. There may be other biases also, so we can check for them and respond accordingly. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] ANALYZE sampling is too good
Greg Stark st...@mit.edu writes: So I've done some empirical tests for a table generated by: create table sizeskew as (select i,j,repeat('i',i) from generate_series(1,1000) as i, generate_series(1,1000) as j); I find that using the whole block doesn't cause any problem with the avg_width field for the repeat column. That does reinforce my belief that we might not need any particularly black magic here. It does however cause a systemic error in the histogram bounds. It seems the median is systematically overestimated by more and more the larger the number of rows per block are used: Hm. You can only take N rows from a block if there actually are at least N rows in the block. So the sampling rule I suppose you are using is select up to N rows from each sampled block --- and that is going to favor the contents of blocks containing narrower-than-average rows. Now in this case, it looks like that ought to favor rows with *smaller* i values, but you say the median goes up not down. So I'm not sure what's going on. I thought at first that TOAST compression might be part of the explanation, but TOAST shouldn't kick in on rows with raw representation narrower than 2KB. Did you do a run with no upper limit on the number of rows per block? Because I'm not sure that tests with a limit in place are a good guide to what happens without it. regards, tom lane
Re: [HACKERS] ANALYZE sampling is too good
I wrote: Hm. You can only take N rows from a block if there actually are at least N rows in the block. So the sampling rule I suppose you are using is select up to N rows from each sampled block --- and that is going to favor the contents of blocks containing narrower-than-average rows. Oh, no, wait: that's backwards. (I plead insufficient caffeine.) Actually, this sampling rule discriminates *against* blocks with narrower rows. You previously argued, correctly I think, that sampling all rows on each page introduces no new bias because row width cancels out across all sampled pages. However, if you just include up to N rows from each page, then rows on pages with more than N rows have a lower probability of being selected, but there's no such bias against wider rows. This explains why you saw smaller values of i being undersampled. Had you run the test series all the way up to the max number of tuples per block, which is probably a couple hundred in this test, I think you'd have seen the bias go away again. But the takeaway point is that we have to sample all tuples per page, not just a limited number of them, if we want to change it like this. regards, tom lane
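Tom's "up to N rows per block" bias can be demonstrated with a small simulation. This is an illustrative sketch with invented page layouts, not PostgreSQL's actual sampling code:

```python
import random

random.seed(0)

# Invented table: 500 "dense" pages of 100 narrow rows (value 0) and
# 500 "sparse" pages of 10 wide rows (value 1).
dense = [[0] * 100 for _ in range(500)]
sparse = [[1] * 10 for _ in range(500)]
pages = dense + sparse

N = 10  # cap: take at most N rows from each sampled page

capped, full = [], []
for page in random.sample(pages, 200):
    full.extend(page)                                    # every row on the page
    capped.extend(random.sample(page, min(N, len(page))))

# True fraction of wide (value-1) rows is 5000/55000, about 0.09.
frac_full = sum(full) / len(full)
frac_capped = sum(capped) / len(capped)
# Taking whole pages recovers roughly the true fraction; capping at N rows
# per page discriminates against the dense pages' rows and inflates the
# wide-row fraction toward 0.5 -- exactly the bias Tom describes.
print(round(frac_full, 2), round(frac_capped, 2))
```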
Re: [HACKERS] ANALYZE sampling is too good
On 12/12/13 06:22, Tom Lane wrote: I wrote: Hm. You can only take N rows from a block if there actually are at least N rows in the block. So the sampling rule I suppose you are using is select up to N rows from each sampled block --- and that is going to favor the contents of blocks containing narrower-than-average rows. Oh, no, wait: that's backwards. (I plead insufficient caffeine.) Actually, this sampling rule discriminates *against* blocks with narrower rows. You previously argued, correctly I think, that sampling all rows on each page introduces no new bias because row width cancels out across all sampled pages. However, if you just include up to N rows from each page, then rows on pages with more than N rows have a lower probability of being selected, but there's no such bias against wider rows. This explains why you saw smaller values of i being undersampled. Had you run the test series all the way up to the max number of tuples per block, which is probably a couple hundred in this test, I think you'd have seen the bias go away again. But the takeaway point is that we have to sample all tuples per page, not just a limited number of them, if we want to change it like this. regards, tom lane Surely we want to sample a 'constant fraction' (obviously, in practice you have to sample an integral number of rows in a page!) of rows per page? The simplest way, as Tom suggests, is to use all the rows in a page. However, if you wanted the same number of rows from a greater number of pages, you could (for example) select a quarter of the rows from each page. In which case, when this is a fractional number: take the integral number of rows, plus one extra row with a probability equal to the fraction (here 0.25). Either way, if it is determined that you need N rows, then keep selecting pages at random (but never use the same page more than once) until you have at least N rows. 
Cheers, Gavin
Re: [HACKERS] ANALYZE sampling is too good
On 12/12/13 06:22, Tom Lane wrote: I wrote: Hm. You can only take N rows from a block if there actually are at least N rows in the block. So the sampling rule I suppose you are using is select up to N rows from each sampled block --- and that is going to favor the contents of blocks containing narrower-than-average rows. Oh, no, wait: that's backwards. (I plead insufficient caffeine.) Actually, this sampling rule discriminates *against* blocks with narrower rows. You previously argued, correctly I think, that sampling all rows on each page introduces no new bias because row width cancels out across all sampled pages. However, if you just include up to N rows from each page, then rows on pages with more than N rows have a lower probability of being selected, but there's no such bias against wider rows. This explains why you saw smaller values of i being undersampled. Had you run the test series all the way up to the max number of tuples per block, which is probably a couple hundred in this test, I think you'd have seen the bias go away again. But the takeaway point is that we have to sample all tuples per page, not just a limited number of them, if we want to change it like this. regards, tom lane Hmm... In my previous reply (which hasn't shown up yet!) I realized I made a mistake: the fraction/probability could be any of 0.25, 0.50, or 0.75. Cheers, Gavin
Re: [HACKERS] ANALYZE sampling is too good
On 12/12/13 07:22, Gavin Flower wrote: On 12/12/13 06:22, Tom Lane wrote: I wrote: Hm. You can only take N rows from a block if there actually are at least N rows in the block. So the sampling rule I suppose you are using is select up to N rows from each sampled block --- and that is going to favor the contents of blocks containing narrower-than-average rows. Oh, no, wait: that's backwards. (I plead insufficient caffeine.) Actually, this sampling rule discriminates *against* blocks with narrower rows. You previously argued, correctly I think, that sampling all rows on each page introduces no new bias because row width cancels out across all sampled pages. However, if you just include up to N rows from each page, then rows on pages with more than N rows have a lower probability of being selected, but there's no such bias against wider rows. This explains why you saw smaller values of i being undersampled. Had you run the test series all the way up to the max number of tuples per block, which is probably a couple hundred in this test, I think you'd have seen the bias go away again. But the takeaway point is that we have to sample all tuples per page, not just a limited number of them, if we want to change it like this. regards, tom lane Surely we want to sample a 'constant fraction' (obviously, in practice you have to sample an integral number of rows in a page!) of rows per page? The simplest way, as Tom suggests, is to use all the rows in a page. However, if you wanted the same number of rows from a greater number of pages, you could (for example) select a quarter of the rows from each page. In which case, when this is a fractional number: take the integral number of rows, plus on extra row with a probability equal to the fraction (here 0.25). Either way, if it is determined that you need N rows, then keep selecting pages at random (but never use the same page more than once) until you have at least N rows. 
Cheers, Gavin Yes, the fraction/probability could actually be one of 0.25, 0.50, or 0.75. But there is a bias introduced by the arithmetic average size of the rows in a page. This results in block sampling favouring large rows, as they are in a larger proportion of pages. For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, using 400 byte pages. In the pathologically worst case, assuming maximum packing density and no page has both types: the large rows would occupy 500 pages and the smaller rows 50 pages. So if one selected 11 pages at random, you get about 10 pages of large rows and about one for small rows! In practice, it would be much less extreme - for a start, not all blocks will be fully packed, most blocks would have both types of rows, and there is usually greater variation in row size - but still a bias towards sampling larger rows. So somehow, this bias needs to be counteracted. Cheers, Gavin
Re: [HACKERS] ANALYZE sampling is too good
Gavin Flower gavinflo...@archidevsys.co.nz wrote: For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, using 400 byte pages. In the pathologically worst case, assuming maximum packing density and no page has both types: the large rows would occupy 500 pages and the smaller rows 50 pages. So if one selected 11 pages at random, you get about 10 pages of large rows and about one for small rows! With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
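Kevin's correction checks out numerically: with whole-page sampling the expected number of sampled rows of each kind balances out. A quick simulation of Gavin's 400-byte-page example (illustrative Python, not from the thread):

```python
import random

random.seed(2)

# Gavin's worst case: 500 pages holding 2 large rows each, 50 pages
# holding 20 small rows each (1000 rows of either kind in total).
pages = ([["large"] * 2 for _ in range(500)]
         + [["small"] * 20 for _ in range(50)])

large = small = 0
for _ in range(10_000):                 # repeat the "pick 11 pages" experiment
    for page in random.sample(pages, 11):
        for row in page:
            if row == "large":
                large += 1
            else:
                small += 1

# Expected per draw: 11 * (500/550) * 2 = 20 large rows and
# 11 * (50/550) * 20 = 20 small rows, so the long-run ratio tends to 1:
# whole-page block sampling does not over-count large rows.
print(round(large / small, 2))
```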
Re: [HACKERS] ANALYZE sampling is too good
On 12/12/13 08:14, Gavin Flower wrote: On 12/12/13 07:22, Gavin Flower wrote: On 12/12/13 06:22, Tom Lane wrote: I wrote: Hm. You can only take N rows from a block if there actually are at least N rows in the block. So the sampling rule I suppose you are using is select up to N rows from each sampled block --- and that is going to favor the contents of blocks containing narrower-than-average rows. Oh, no, wait: that's backwards. (I plead insufficient caffeine.) Actually, this sampling rule discriminates *against* blocks with narrower rows. You previously argued, correctly I think, that sampling all rows on each page introduces no new bias because row width cancels out across all sampled pages. However, if you just include up to N rows from each page, then rows on pages with more than N rows have a lower probability of being selected, but there's no such bias against wider rows. This explains why you saw smaller values of i being undersampled. Had you run the test series all the way up to the max number of tuples per block, which is probably a couple hundred in this test, I think you'd have seen the bias go away again. But the takeaway point is that we have to sample all tuples per page, not just a limited number of them, if we want to change it like this. regards, tom lane Surely we want to sample a 'constant fraction' (obviously, in practice you have to sample an integral number of rows in a page!) of rows per page? The simplest way, as Tom suggests, is to use all the rows in a page. However, if you wanted the same number of rows from a greater number of pages, you could (for example) select a quarter of the rows from each page. In which case, when this is a fractional number: take the integral number of rows, plus on extra row with a probability equal to the fraction (here 0.25). Either way, if it is determined that you need N rows, then keep selecting pages at random (but never use the same page more than once) until you have at least N rows. 
Cheers, Gavin Yes, the fraction/probability could actually be one of 0.25, 0.50, or 0.75. But there is a bias introduced by the arithmetic average size of the rows in a page. This results in block sampling favouring large rows, as they are in a larger proportion of pages. For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, using 400 byte pages. In the pathologically worst case, assuming maximum packing density and no page has both types: the large rows would occupy 500 pages and the smaller rows 50 pages. So if one selected 11 pages at random, you get about 10 pages of large rows and about one for small rows! In practice, it would be much less extreme - for a start, not all blocks will be fully packed, most blocks would have both types of rows, and there is usually greater variation in row size - but still a bias towards sampling larger rows. So somehow, this bias needs to be counteracted. Cheers, Gavin Actually, I just thought of a possible way to overcome the bias towards large rows.

1. Calculate (a rough estimate may be sufficient, if not too 'rough') the size of the smallest row.
2. Select a page at random (never selecting the same page twice).
3. Then select rows at random within the page (never selecting the same row twice). For each row selected, accept it with probability equal to (size of smallest row)/(size of selected row). I think you'll find that will almost completely offset the bias towards larger rows!
4. If you do not have sufficient rows, and you still have pages not yet selected, goto 2.

Note that it will be normal for some pages not to have any rows selected, especially for large tables! Cheers, Gavin P.S. I really need to stop thinking about this problem, and get on with my assigned project!!!
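Gavin's step-3 rejection rule can be sketched as follows. This is a hypothetical illustration with made-up row sizes; the acceptance probability (smallest size)/(row size) is the only part taken from his description:

```python
import random

random.seed(1)

# Made-up candidate rows, as byte sizes: 1000 small (20 B) and 1000 large
# (200 B) rows that the page-level sampling happened to visit.
rows = [20] * 1000 + [200] * 1000
min_size = min(rows)

# Accept each row with probability (size of smallest row) / (size of this
# row): small rows are always kept, large rows only 20/200 = 10% of the time.
accepted = [size for size in rows if random.random() < min_size / size]

small = sum(1 for size in accepted if size == 20)
large = sum(1 for size in accepted if size == 200)
# All 1000 small rows survive; only ~100 large rows do, counteracting a
# 10x over-representation of large rows at the page level.
print(small, large)
```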
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 4:48 PM, Peter Geoghegan p...@heroku.com wrote: Why would I even mention that to a statistician? We want guidance. But yes, I bet I could give a statistician an explanation of statistics target that they'd understand without too much trouble. Actually, I think that if we told a statistician about the statistics target, his or her response would be: why would you presume to know ahead of time what statistics target is going to be effective? I suspect that the basic problem is that it isn't adaptive. I think that if we could somehow characterize the quality of our sample as we took it, and then cease sampling when we reached a certain degree of confidence in its quality, that would be helpful. It might not even matter that the sample was clustered from various blocks. -- Peter Geoghegan
Re: [HACKERS] ANALYZE sampling is too good
On 12/12/13 08:31, Kevin Grittner wrote: Gavin Flower gavinflo...@archidevsys.co.nz wrote: For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, using 400 byte pages. In the pathologically worst case, assuming maximum packing density and no page has both types: the large rows would occupy 500 pages and the smaller rows 50 pages. So if one selected 11 pages at random, you get about 10 pages of large rows and about one for small rows! With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company Sorry, I've simply come up with well argued nonsense! Kevin, you're dead right. Cheers, Gavin
Re: [HACKERS] ANALYZE sampling is too good
On 12/12/13 08:39, Gavin Flower wrote: On 12/12/13 08:31, Kevin Grittner wrote: Gavin Flower gavinflo...@archidevsys.co.nz wrote: For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, using 400 byte pages. In the pathologically worst case, assuming maximum packing density and no page has both types: the large rows would occupy 500 pages and the smaller rows 50 pages. So if one selected 11 pages at random, you get about 10 pages of large rows and about one for small rows! With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company Sorry, I've simply come up with well argued nonsense! Kevin, you're dead right. Cheers, Gavin I looked at: http://www.postgresql.org/docs/current/interactive/storage-page-layout.html this says that each row has an overhead, which suggests there should be a bias towards small rows. There must be a lot of things going on, that I'm simply not aware of, that affect sampling bias... Cheers, Gavin
Re: [HACKERS] ANALYZE sampling is too good
On 12/12/13 09:12, Gavin Flower wrote: On 12/12/13 08:39, Gavin Flower wrote: On 12/12/13 08:31, Kevin Grittner wrote: Gavin Flower gavinflo...@archidevsys.co.nz wrote: For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, using 400 byte pages. In the pathologically worst case, assuming maximum packing density and no page has both types: the large rows would occupy 500 pages and the smaller rows 50 pages. So if one selected 11 pages at random, you get about 10 pages of large rows and about one for small rows! With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company Sorry, I've simply come up with well argued nonsense! Kevin, you're dead right. Cheers, Gavin I looked at: http://www.postgresql.org/docs/current/interactive/storage-page-layout.html this says that each row has an overhead, which suggests there should be a bias towards small rows. There must be a lot of things going on, that I'm simply not aware of, that affect sampling bias... Cheers, Gavin Ignoring overheads per row and other things... There will be a biasing effect when the distribution of sizes is not symmetric. For example: when the majority of rows have sizes greater than the arithmetic mean, then most samples will be biased towards larger rows. Similarly there could be a bias towards smaller rows when most rows are smaller than the arithmetic mean. Yes, I did think about this in depth - but it is way too complicated to attempt to quantify the bias, because it depends on too many factors (even just limiting it to the distribution of row sizes). So apart from the nature of volatility of the table, and the pattern of insertions/updates/deletes - there will be a bias depending on the distribution of values in the table. So I despair that a simple, elegant, practical algorithm will ever be found. 
Therefore, I expect the best answer is probably some kind of empirical adaptive approach - which I think has already been suggested. Cheers, Gavin
Re: [HACKERS] ANALYZE sampling is too good
I think we're all wet here. I don't see any bias towards larger or smaller rows. Larger tuples will be on a larger number of pages but there will be fewer of them on any one page. The average effect should be the same. Smaller values might have a higher variance with block based sampling than larger values. But that actually *is* the kind of thing that Simon's approach of just compensating with later samples can deal with. -- greg
Re: [HACKERS] ANALYZE sampling is too good
On Thu, Dec 12, 2013 at 07:22:59AM +1300, Gavin Flower wrote: Surely we want to sample a 'constant fraction' (obviously, in practice you have to sample an integral number of rows in a page!) of rows per page? The simplest way, as Tom suggests, is to use all the rows in a page. However, if you wanted the same number of rows from a greater number of pages, you could (for example) select a quarter of the rows from each page. In which case, when this is a fractional number: take the integral number of rows, plus one extra row with a probability equal to the fraction (here 0.25). In this discussion we've mostly used block = 1 postgresql block of 8k. But when reading from a disk, once you've read one block you can basically read the following ones practically for free. So I wonder if you could make your sampling always read 16 consecutive blocks, but then use 25-50% of the tuples. That way you get many more tuples for the same number of disk I/O seeks. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] ANALYZE sampling is too good
On 12/11/2013 02:39 PM, Martijn van Oosterhout wrote: In this discussion we've mostly used block = 1 postgresql block of 8k. But when reading from a disk once you've read one block you can basically read the following ones practically for free. So I wonder if you could make your sampling read always 16 consecutive blocks, but then use 25-50% of the tuples. That way you get many more tuples for the same amount of disk I/O seeks.. Yeah, that's what I meant by tune this for the FS. We'll probably have to test a lot of different block sizes on different FSes before we arrive at a reasonable size, and even then I'll bet we have to offer a GUC. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] ANALYZE sampling is too good
Greg Stark wrote: It's also applicable for the other stats; histogram buckets constructed from a 5% sample are more likely to be accurate than those constructed from a 0.1% sample. Same with nullfrac. The degree of improved accuracy, would, of course, require some math to determine. This "some math" is straightforward basic statistics. The 95th percentile confidence interval for a sample consisting of 300 samples from a population of 1 million would be 5.66%. A sample consisting of 1000 samples would have a 95th percentile confidence interval of +/- 3.1%. Doesn't all that assume a normally distributed random variable? I don't think it can be applied to database table contents without further analysis. Yours, Laurenz Albe
Re: [HACKERS] ANALYZE sampling is too good
On 10 Dec 2013 08:28, Albe Laurenz laurenz.a...@wien.gv.at wrote: Doesn't all that assume a normally distributed random variable? I don't think so because of the law of large numbers. If you have a large population and sample it, the sample behaves like a normal distribution even if the distribution of the population isn't.
Re: [HACKERS] ANALYZE sampling is too good
Greg Stark wrote: Doesn't all that assume a normally distributed random variable? I don't think so because of the law of large numbers. If you have a large population and sample it the sample behaves like a normal distribution even if the distribution of the population isn't. Statistics is the part of mathematics I know least of, but aren't you saying that in a large enough sample of people there will always be some with age 0 (which is what a normal distribution would imply)? Yours, Laurenz Albe
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark st...@mit.edu wrote: On 10 Dec 2013 08:28, Albe Laurenz laurenz.a...@wien.gv.at wrote: Doesn't all that assume a normally distributed random variable? I don't think so because of the law of large numbers. If you have a large population and sample it the sample behaves like a normal distribution when if the distribution of the population isn't. No, the large population says that if you have an AVERAGE of many samples of a random variable, the random variable that is the AVERAGE behaves like a normal. The variable itself doesn't. And for n_distinct, you need to know the variable itself.
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 11:32 AM, Claudio Freire klaussfre...@gmail.com wrote: On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark st...@mit.edu wrote: On 10 Dec 2013 08:28, Albe Laurenz laurenz.a...@wien.gv.at wrote: Doesn't all that assume a normally distributed random variable? I don't think so because of the law of large numbers. If you have a large population and sample it the sample behaves like a normal distribution when if the distribution of the population isn't. No, the large population says that if you have an AVERAGE of many samples of a random variable, the random variable that is the AVERAGE behaves like a normal. Sorry, that should be No, the *law of large numbers* says...
Re: [HACKERS] ANALYZE sampling is too good
On 6 December 2013 09:21, Andres Freund and...@2ndquadrant.com wrote: On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote: Has anyone ever thought about opportunistic ANALYZE piggy-backing on other full-table scans? That doesn't really help Greg, because his complaint is mostly that a fresh ANALYZE is too expensive, but it could be an interesting, albeit risky approach. What I've been thinking of is a) making it piggy back on scans vacuum is doing instead of doing separate ones all the time (if possible, analyze needs to be more frequent). Currently with quite some likelihood the cache will be gone again when revisiting. b) make analyze incremental. In lots of bigger tables most of the table is static - and we actually *do* know that, thanks to the vm. So keep a rawer form of what ends in the catalogs around somewhere, chunked by the region of the table the statistic is from. Every time a part of the table changes, re-sample only that part. Then recompute the aggregate. Piggy-backing sounds like a bad thing. If I run a query, I don't want to be given some extra task thanks! Especially if we might need to run data type code I'm not authorised to run, or to sample data I may not be authorised to see. The only way that could work is to kick off an autovacuum worker to run the ANALYZE as a separate process and then use synchronous scan behaviour to derive benefit indirectly. However, these things presume that we need to continue scanning most of the blocks of the table, which I don't think needs to be the case. There is a better way. Back in 2005/6, I advocated a block sampling method, as described by Chaudri et al (ref?) That has two advantages:

* We don't need to visit all of the blocks, reducing I/O
* It would also give better analysis of clustered data (it's all in that paper...)

The recent patch on TABLESAMPLE contained a rewrite of the guts of ANALYZE into a generic sample scan, which would then be used as the basis for a TABLESAMPLE BERNOULLI. 
A TABLESAMPLE SYSTEM could use a block sample, as it does in some other DBMS (DB2, SQLServer). While I was unimpressed with the TABLESAMPLE patch, all it needs is some committer-love, so Greg... -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ANALYZE sampling is too good
On 2013-12-10 19:23:37 +, Simon Riggs wrote: On 6 December 2013 09:21, Andres Freund and...@2ndquadrant.com wrote: On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote: Has anyone ever thought about opportunistic ANALYZE piggy-backing on other full-table scans? That doesn't really help Greg, because his complaint is mostly that a fresh ANALYZE is too expensive, but it could be an interesting, albeit risky approach. What I've been thinking of is a) making it piggy back on scans vacuum is doing instead of doing separate ones all the time (if possible, analyze needs to be more frequent). Currently with quite some likelihood the cache will be gone again when revisiting. b) make analyze incremental. In lots of bigger tables most of the table is static - and we actually *do* know that, thanks to the vm. So keep a rawer form of what ends up in the catalogs around somewhere, chunked by the region of the table the statistic is from. Every time a part of the table changes, re-sample only that part. Then recompute the aggregate. Piggy-backing sounds like a bad thing. If I run a query, I don't want to be given some extra task, thanks! Especially if we might need to run data type code I'm not authorised to run, or to sample data I may not be authorised to see. The only way that could work is to kick off an autovacuum worker to run the ANALYZE as a separate process and then use synchronous scan behaviour to derive benefit indirectly. I was suggesting to piggyback on VACUUM, not user queries. The latter suggestion was somebody else. In combination with incremental or chunk-wise building of statistics, doing more frequent partial vacuums which re-compute the changing part of the stats would be great. All those blocks have to be read anyway, and we should be more aggressive about vacuuming anyway.
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs si...@2ndquadrant.com wrote: However, these things presume that we need to continue scanning most of the blocks of the table, which I don't think needs to be the case. There is a better way. Do they? I think it's one opportunistic way of ameliorating the cost. Back in 2005/6, I advocated a block sampling method, as described by Chaudri et al (ref?) I don't think that anyone believes that not doing block sampling is tenable, fwiw. Clearly some type of block sampling would be preferable for most or all purposes. -- Peter Geoghegan
Re: [HACKERS] ANALYZE sampling is too good
On 10 December 2013 19:49, Peter Geoghegan p...@heroku.com wrote: On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs si...@2ndquadrant.com wrote: However, these things presume that we need to continue scanning most of the blocks of the table, which I don't think needs to be the case. There is a better way. Do they? I think it's one opportunistic way of ameliorating the cost. Back in 2005/6, I advocated a block sampling method, as described by Chaudri et al (ref?) I don't think that anyone believes that not doing block sampling is tenable, fwiw. Clearly some type of block sampling would be preferable for most or all purposes. If we have one way of reducing the cost of ANALYZE, I'd suggest we don't need two ways - especially if the second way involves the interaction of otherwise not fully related parts of the code. Or to put it clearly, let's go with block sampling and then see if that needs even more work. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] ANALYZE sampling is too good
On 12/10/2013 11:49 AM, Peter Geoghegan wrote: On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs si...@2ndquadrant.com wrote: I don't think that anyone believes that not doing block sampling is tenable, fwiw. Clearly some type of block sampling would be preferable for most or all purposes. As discussed, we need math though. Does anyone have an ACM subscription and time to do a search? Someone must. We can buy one with community funds, but no reason to do so if we don't have to. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 7:54 PM, Josh Berkus j...@agliodbs.com wrote: As discussed, we need math though. Does anyone have an ACM subscription and time to do a search? Someone must. We can buy one with community funds, but no reason to do so if we don't have to. Anyone in a university likely has access through their library. But I don't really think this is the right way to go about this. Research papers are going to turn up pretty specialized solutions that are probably patented. We don't even have the basic understanding we need. I suspect a basic textbook chapter on multistage sampling will discuss at least the standard techniques. Once we have a handle on the standard multistage sampling techniques that would be safe from patents, then we might want to go look at research papers to find how they've been applied to databases in the past, but we would have to do that fairly carefully. -- greg
Re: [HACKERS] ANALYZE sampling is too good
On 10 December 2013 19:54, Josh Berkus j...@agliodbs.com wrote: On 12/10/2013 11:49 AM, Peter Geoghegan wrote: On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs si...@2ndquadrant.com wrote: I don't think that anyone believes that not doing block sampling is tenable, fwiw. Clearly some type of block sampling would be preferable for most or all purposes. As discussed, we need math though. Does anyone have an ACM subscription and time to do a search? Someone must. We can buy one with community funds, but no reason to do so if we don't have to. We already have that, just use Vitter's algorithm at the block level rather than the row level. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
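For reference, the reservoir technique analyzed in Vitter's sampling paper (Algorithm R, the simplest of the family) maintains a uniform fixed-size sample from a stream whose length need not be known in advance; applying it "at the block level" just means streaming block numbers instead of rows. A minimal sketch, with the relation size and sample size invented for illustration:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Algorithm R: a uniform sample of k items from a stream
    whose total length need not be known in advance."""
    rng = rng or random.Random(0)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i replaces a random reservoir slot with probability
            # k/(i+1), which keeps every item's inclusion probability equal.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Applied at the block level: sample 30 block numbers out of 10,000.
sampled_blocks = reservoir_sample(range(10_000), 30)
```

Each block number ends up in the reservoir with equal probability k/n, which is the property being appealed to here.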
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 11:59 AM, Greg Stark st...@mit.edu wrote: But I don't really think this is the right way to go about this. Research papers are going to turn up pretty specialized solutions that are probably patented. We don't even have the basic understanding we need. I suspect a basic textbook chapter on multistage sampling will discuss at least the standard techniques. I agree that looking for information on block level sampling specifically, and its impact on estimation quality is likely to not turn up very much, and whatever it does turn up will have patent issues. -- Peter Geoghegan
Re: [HACKERS] ANALYZE sampling is too good
On 12/10/2013 10:00 PM, Simon Riggs wrote: On 10 December 2013 19:54, Josh Berkus j...@agliodbs.com wrote: On 12/10/2013 11:49 AM, Peter Geoghegan wrote: On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs si...@2ndquadrant.com wrote: I don't think that anyone believes that not doing block sampling is tenable, fwiw. Clearly some type of block sampling would be preferable for most or all purposes. As discussed, we need math though. Does anyone have an ACM subscription and time to do a search? Someone must. We can buy one with community funds, but no reason to do so if we don't have to. We already have that, just use Vitter's algorithm at the block level rather than the row level. And what do you do with the blocks? How many blocks do you choose? Details, please. - Heikki
Re: [HACKERS] ANALYZE sampling is too good
On 11/12/13 09:19, Heikki Linnakangas wrote: On 12/10/2013 10:00 PM, Simon Riggs wrote: On 10 December 2013 19:54, Josh Berkus j...@agliodbs.com wrote: On 12/10/2013 11:49 AM, Peter Geoghegan wrote: On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs si...@2ndquadrant.com wrote: I don't think that anyone believes that not doing block sampling is tenable, fwiw. Clearly some type of block sampling would be preferable for most or all purposes. As discussed, we need math though. Does anyone have an ACM subscription and time to do a search? Someone must. We can buy one with community funds, but no reason to do so if we don't have to. We already have that, just use Vitter's algorithm at the block level rather than the row level. And what do you do with the blocks? How many blocks do you choose? Details, please. Yeah - and we seem to be back to Josh's point about needing 'some math' to cope with the rows within a block not being a purely random selection. Regards Mark
Re: [HACKERS] ANALYZE sampling is too good
On 12/10/2013 01:33 PM, Mark Kirkwood wrote: Yeah - and we seem to be back to Josh's point about needing 'some math' to cope with the rows within a block not being a purely random selection. Well, sometimes they are effectively random. But sometimes they are not. The Chaudri et al paper had a formula for estimating randomness based on the grouping of rows in each block, assuming that the sampled blocks were widely spaced (if they aren't there's not much you can do). This is where you get up to needing a 5% sample; you need to take enough blocks that you're confident that the blocks you sampled are representative of the population. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 7:49 PM, Peter Geoghegan p...@heroku.com wrote: Back in 2005/6, I advocated a block sampling method, as described by Chaudri et al (ref?) I don't think that anyone believes that not doing block sampling is tenable, fwiw. Clearly some type of block sampling would be preferable for most or all purposes. We do block sampling now. But then we select rows from those blocks uniformly. Incidentally, if you mean Surajit Chaudhuri, he's a Microsoft Research lead so I would be nervous about anything he's published being patented by Microsoft. -- greg
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 2:37 PM, Robert Haas robertmh...@gmail.com wrote: On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes jeff.ja...@gmail.com wrote: My reading of the code is that if it is not in the MCV, then it is assumed to have the average selectivity (about 1/n_distinct, but deflating top and bottom for the MCV list). There is also a check that it is less than the least common of the MCV, but I don't know why that situation would ever prevail--that should always be higher or equal to the average selectivity. I've never seen an n_distinct value of more than 5 digits, regardless of reality. Typically I've seen 20-50k, even if the real number is much higher. I don't recall seeing an n_distinct that is literally above 100,000 in the wild, but I've seen negative ones that multiply with reltuples to give values more than that. In test cases it is easy enough to get values in the millions by creating tables using floor(random()*$something).

create table baz as select floor(random()*1000), md5(random()::text) from generate_series(1,1);
create table baz2 as select * from baz order by floor;
create table baz3 as select * from baz order by md5(floor::text);

baz unclustered, baz2 is clustered with perfect correlation, baz3 is clustered but without correlation. After analyzing all of them:

select tablename, n_distinct, correlation from pg_stats
  where tablename like 'baz%' and attname='floor';

 tablename | n_distinct  | correlation
-----------+-------------+-------------
 baz       | 8.56006e+06 |  0.00497713
 baz2      |      333774 |           1
 baz3      |      361048 |  -0.0118147

So baz is pretty close, while the other two are way off. But the n_distinct computation doesn't depend on the order of the rows, as far as I can tell, but only the composition of the sample. So this can only mean that our random sampling method is desperately non-random, right?
But the n_distinct value is only for non-MCVs, so if we estimate the selectivity of column = 'rarevalue' to be (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces the estimate, and making the MCV list longer naturally makes mcvfrac bigger. Ah, I see. By including more things into MCV, we crowd out the rest of the space implicitly. I'm not sure how important the less-frequent-than-the-least-common-MCV part is, but I'm very sure that raising the statistics target helps to solve the problem of overestimating the prevalence of uncommon values in a very big table. I think that parts of the planner are N^2 in the size of histogram (or was that the size of the MCV list?). So we would probably need a way to use a larger sample size to get more accurate n_distinct and MCV frequencies, but not save the entire histogram that goes with that sample size. I think the saving the histogram part is important. As you say, the MCVs are important for a variety of planning purposes, such as hash joins. More than that, in my experience, people with large tables are typically very willing to spend more planning time to get a better plan, because mistakes are expensive and the queries are likely to run for a while anyway. People with small tables care about planning time, because it makes no sense to spend an extra 1ms planning a query unless you improve the plan by enough to save at least 1ms when executing it, and when the tables are small and access is expected to be fast anyway that's often not the case. I would think that the dichotomy is more about the size of the query-plan than of the tables. I think a lot of people with huge tables end up doing mostly indexed lookups in unique or highly selective indexes, once all the planning is done. Does anyone have generators for examples of cases where increasing the sample size to get better histograms (as opposed more accurate n_distinct or more accurate MCV) was the key to fixing bad plans? 
I wonder if it is more bins, or more accurate boundaries, that makes the difference. Cheers, Jeff
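The baz2/baz3 skew above is reproducible outside the database. As I understand it, analyze.c estimates n_distinct with the Haas-Stokes "Duj1" form, roughly n*d / (n - f1 + f1*n/N), where d is the number of distinct values in the sample and f1 the number seen exactly once. A rough, self-contained simulation (table sizes and layout invented for illustration) shows how a sample built from whole blocks of clustered data starves f1 and collapses the estimate:

```python
import random
from collections import Counter

def duj1(n, N, counts):
    """Haas-Stokes 'Duj1' n_distinct estimator (the form analyze.c cites):
    D = n*d / (n - f1 + f1*n/N), where d = distinct values in the sample
    and f1 = values seen exactly once."""
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    return n * d / (n - f1 + f1 * n / N)

rng = random.Random(1)
N = 200_000
data = [i // 10 for i in range(N)]   # 20,000 distinct values, perfectly clustered
n = 5_000

# Uniform row sample: 5,000 rows drawn independently across the whole table.
uniform = Counter(data[i] for i in rng.sample(range(N), n))

# Block sample: 100 random blocks of 50 consecutive rows, every row kept.
rows_per_block = 50
blocks = rng.sample(range(N // rows_per_block), n // rows_per_block)
block = Counter(data[b * rows_per_block + i] for b in blocks
                for i in range(rows_per_block))

uniform_est = duj1(n, N, uniform)
block_est = duj1(n, N, block)
# Clustered duplicates land in the same block, so the block sample has
# almost no singletons (f1 ~ 0) and the estimate collapses toward d.
```

With this layout the block-sampled estimate lands near the sample's own distinct count (hundreds) while the uniform-row estimate stays within a factor of two of the true 20,000, mirroring the baz2 result.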
Re: [HACKERS] ANALYZE sampling is too good
On 12/10/13 2:17 PM, Peter Geoghegan wrote: On Tue, Dec 10, 2013 at 11:59 AM, Greg Stark st...@mit.edu wrote: But I don't really think this is the right way to go about this. Research papers are going to turn up pretty specialized solutions that are probably patented. We don't even have the basic understanding we need. I suspect a basic textbook chapter on multistage sampling will discuss at least the standard techniques. I agree that looking for information on block level sampling specifically, and its impact on estimation quality is likely to not turn up very much, and whatever it does turn up will have patent issues. We have an entire analytics dept. at work that specializes in finding patterns in our data. I might be able to get some time from them to at least provide some guidance here, if the community is interested. They could really only serve in a consulting role though. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 3:26 PM, Jim Nasby j...@nasby.net wrote: I agree that looking for information on block level sampling specifically, and its impact on estimation quality is likely to not turn up very much, and whatever it does turn up will have patent issues. We have an entire analytics dept. at work that specializes in finding patterns in our data. I might be able to get some time from them to at least provide some guidance here, if the community is interested. They could really only serve in a consulting role though. I think that Greg had this right several years ago: it would probably be very useful to have the input of someone with a strong background in statistics. It doesn't seem that important that they already know a lot about databases, provided they can understand what our constraints are, and what is important to us. It might just be a matter of having them point us in the right direction. -- Peter Geoghegan
Re: [HACKERS] ANALYZE sampling is too good
On 10 December 2013 23:43, Peter Geoghegan p...@heroku.com wrote: On Tue, Dec 10, 2013 at 3:26 PM, Jim Nasby j...@nasby.net wrote: I agree that looking for information on block level sampling specifically, and its impact on estimation quality is likely to not turn up very much, and whatever it does turn up will have patent issues. We have an entire analytics dept. at work that specializes in finding patterns in our data. I might be able to get some time from them to at least provide some guidance here, if the community is interested. They could really only serve in a consulting role though. I think that Greg had this right several years ago: it would probably be very useful to have the input of someone with a strong background in statistics. It doesn't seem that important that they already know a lot about databases, provided they can understand what our constraints are, and what is important to us. It might just be a matter of having them point us in the right direction. err, so what does stats target mean exactly in statistical theory? Waiting for a statistician, and confirming his credentials before you believe him above others here, seems like wasted time. What your statistician will tell you is that YMMV, depending on the data. So we'll still need a parameter to fine tune things when the default is off. We can argue about the default later, at various levels of rigour. Block sampling, with parameter to specify sample size. +1 -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] ANALYZE sampling is too good
On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs si...@2ndquadrant.com wrote: Block sampling, with parameter to specify sample size. +1 Simon this is very frustrating. Can you define block sampling? -- greg
Re: [HACKERS] ANALYZE sampling is too good
On 11 December 2013 00:28, Greg Stark st...@mit.edu wrote: On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs si...@2ndquadrant.com wrote: Block sampling, with parameter to specify sample size. +1 Simon this is very frustrating. Can you define block sampling? Blocks selected using Vitter's algorithm, using a parameterised fraction of the total. When we select a block we should read all rows on that block, to help identify the extent of clustering within the data. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
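The definition above (a parameterised fraction of blocks, every row kept from each chosen block) can be sketched as follows; the list-of-lists table layout is an invented stand-in for a heap of pages:

```python
import random

def block_sample(table, fraction, rng=None):
    """Pick a parameterised fraction of blocks uniformly at random and keep
    EVERY row on each chosen block, preserving within-block clustering."""
    rng = rng or random.Random(0)
    k = max(1, round(fraction * len(table)))
    rows = []
    for b in sorted(rng.sample(range(len(table)), k)):
        rows.extend(table[b])
    return rows

# Toy relation: 100 blocks of 10 rows; a 5% sample returns 5 whole blocks.
table = [[(b, i) for i in range(10)] for b in range(100)]
rows = block_sample(table, 0.05)
```

Keeping whole blocks is what exposes clustering to the sampler, and it is also exactly what makes the rows non-independent, which is the question raised next in the thread.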
Re: [HACKERS] ANALYZE sampling is too good
On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs si...@2ndquadrant.com wrote: When we select a block we should read all rows on that block, to help identify the extent of clustering within the data. So how do you interpret the results of a sample read that way in a manner that doesn't introduce bias? -- greg
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 4:14 PM, Simon Riggs si...@2ndquadrant.com wrote: err, so what does stats target mean exactly in statistical theory? Why would I even mention that to a statistician? We want guidance. But yes, I bet I could give a statistician an explanation of statistics target that they'd understand without too much trouble. Waiting for a statistician, and confirming his credentials before you believe him above others here, seems like wasted time. What your statistician will tell you is that YMMV, depending on the data. I'm reasonably confident that they'd give me more than that. So we'll still need a parameter to fine tune things when the default is off. We can argue about the default later, at various levels of rigour. Block sampling, with parameter to specify sample size. +1 Again, it isn't as if the likely efficacy of *some* block sampling approach is in question. I'm sure analyze.c is currently naive about many things. Everyone knows that there are big gains to be had. Balancing those gains against the possible downsides in terms of impact on the quality of statistics generated is pretty nuanced. I do know enough to know that a lot of thought goes into mitigating and/or detecting the downsides of block-based sampling. -- Peter Geoghegan
Re: [HACKERS] ANALYZE sampling is too good
On 11 December 2013 00:44, Greg Stark st...@mit.edu wrote: On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs si...@2ndquadrant.com wrote: When we select a block we should read all rows on that block, to help identify the extent of clustering within the data. So how do you interpret the results of a sample read that way in a manner that doesn't introduce bias? Yes, it is not a perfect statistical sample. All sampling is subject to an error that is data dependent. I'm happy that we have an option to select this or not and a default that maintains current behaviour, since otherwise we might expect some plan instability. I would like to be able to * allow ANALYZE to run faster in some cases * increase/decrease sample size when it matters * have the default sample size vary according to the size of the table, i.e. a proportional sample -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] ANALYZE sampling is too good
For what it's worth, I'll quote Chaudhuri et al.'s first line from the abstract about block sampling: Block-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics. And after briefly glancing through the paper, my opinion is that it works because, after making one version of the statistics, they cross-validate, see how well it goes, and then collect more if the cross-validation error is large (for example because the data is clustered). Without this bit, as far as I can tell, a simple block-based sampler will be bound to make catastrophic mistakes depending on the distribution. Also, just another point about targets (e.g. X%) for estimating stuff from the samples (as was discussed in the thread). Basically, there is a point in talking about sampling a fixed target (5%) of the data ONLY if you fix the actual distribution of your data in the table, and decide what statistic you are trying to find, e.g. average, std. dev., a 90% percentile, ndistinct or a histogram and so forth. There won't be a general answer, as the percentages will be distribution-dependent and statistic-dependent. Cheers, Sergey PS I'm not a statistician, but I use statistics a lot *** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: [HACKERS] ANALYZE sampling is too good
Peter Geoghegan p...@heroku.com writes: Again, it isn't as if the likely efficacy of *some* block sampling approach is in question. I'm sure analyze.c is currently naive about many things. It's not *that* naive; this is already about a third-generation algorithm. The last major revision (commit 9d6570b8a4) was to address problems with misestimating the number of live tuples due to nonuniform tuple density in a table. IIRC, the previous code could be seriously misled if the first few pages in the table were significantly non-representative of the live-tuple density further on. I'm not sure how we can significantly reduce the number of blocks examined without re-introducing that hazard in some form. In particular, given that you want to see at least N tuples, how many blocks will you read if you don't have an a-priori estimate of tuple density? You have to decide that before you start sampling blocks, if you want all blocks to have the same probability of being selected and you want to read them in sequence. regards, tom lane
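As I read it, ANALYZE already resolves this chicken-and-egg problem in two stages: block selection uses the known relation size, and row selection runs a Vitter-style reservoir precisely because the row count is unknown until the blocks are read. A simplified sketch of that shape (toy data structures, not the real page format or the exact analyze.c logic):

```python
import random

def two_stage_sample(table, target_rows, rng=None):
    """Stage 1: choose blocks at random (the block count IS known up front).
    Stage 2: Vitter-style reservoir over the rows streamed from those blocks,
    because the total row count is NOT known until the rows are read."""
    rng = rng or random.Random(42)
    nblocks = len(table)                     # table: list of blocks (lists of rows)
    chosen = sorted(rng.sample(range(nblocks), min(target_rows, nblocks)))
    reservoir, seen = [], 0
    for b in chosen:
        for row in table[b]:
            if seen < target_rows:
                reservoir.append(row)
            else:
                # keep this row with probability target_rows/(seen+1)
                j = rng.randrange(seen + 1)
                if j < target_rows:
                    reservoir[j] = row
            seen += 1
    return reservoir

# Toy relation: 2,000 blocks holding between 1 and 7 rows each.
table = [[(b, i) for i in range((b % 7) + 1)] for b in range(2000)]
sample = two_stage_sample(table, 300)
```

Tom's point survives the sketch: the number of blocks to visit (here, one per target row) still has to be fixed before any row is seen, without knowing the tuple density.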
Re: [HACKERS] ANALYZE sampling is too good
On Tuesday, December 10, 2013, Simon Riggs wrote: On 11 December 2013 00:28, Greg Stark st...@mit.edu wrote: On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs si...@2ndquadrant.com wrote: Block sampling, with parameter to specify sample size. +1 Simon this is very frustrating. Can you define block sampling? Blocks selected using Vitter's algorithm, using a parameterised fraction of the total. OK, thanks for defining that. We only need Vitter's algorithm when we don't know in advance how many items we are sampling from (such as for tuples--unless we want to rely on the previous estimate for the current round of analysis). But for blocks, we do know how many there are, so there are simpler ways to pick them. When we select a block we should read all rows on that block, to help identify the extent of clustering within the data. But we have no mechanism to store such information (or to use it if it were stored), nor even ways to prevent the resulting skew in the sample from seriously messing up the estimates which we do have ways of storing and using. Cheers, Jeff
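The point about the known block count is easy to see: when the population size is fixed up front, uniform sampling without replacement needs no streaming algorithm at all. A sketch with invented sizes:

```python
import random

# When the population size is known, a uniform k-sample without
# replacement is a one-liner; no reservoir/streaming algorithm is needed.
rng = random.Random(7)
nblocks = 131_072            # known up front from the relation's size in pages
sampled = sorted(rng.sample(range(nblocks), 300))
```

Sorting the chosen block numbers also lets the blocks be read in physical order, which matters for sequential I/O.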
Re: [HACKERS] ANALYZE sampling is too good
On 11 December 2013 01:27, Sergey E. Koposov m...@sai.msu.ru wrote: For what it's worth, I'll quote Chaudhuri et al.'s first line from the abstract about block sampling: Block-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics. This glosses over the point that both SQLServer and Oracle use this technique. And after briefly glancing through the paper, my opinion is that it works because, after making one version of the statistics, they cross-validate, see how well it goes, and then collect more if the cross-validation error is large (for example because the data is clustered). Without this bit, as far as I can tell, a simple block-based sampler will be bound to make catastrophic mistakes depending on the distribution. I don't think it's true that a block based sampler will be *bound* to make catastrophic mistakes. They can clearly happen, just as they can with random samples, hence the need for a parameter to control the sample size. Realistically, I never heard of an Oracle DBA doing advanced statistical mathematics before setting the sample size on ANALYZE. You use the default and bump it up if the sample is insufficient for the data. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 10:34 PM, Simon Riggs si...@2ndquadrant.com wrote: On 11 December 2013 01:27, Sergey E. Koposov m...@sai.msu.ru wrote: For what it's worth. I'll quote Chaudhuri et al. first line from the abstract about the block sampling. Block-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics. This glosses over the point that both SQLServer and Oracle use this technique. That seems like an unusual omission for Microsoft Research to have made. I didn't read that paper, because undoubtedly it's all patented. But before I figured that out, after finding it on Google randomly, I did read the first couple of paragraphs, which more or less said what follows - the entire paper - is an explanation as to why it's okay that we do block sampling. -- Peter Geoghegan
Re: [HACKERS] ANALYZE sampling is too good
On 12/11/2013 01:44 AM, Greg Stark wrote: On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs si...@2ndquadrant.com wrote: When we select a block we should read all rows on that block, to help identify the extent of clustering within the data. So how do you interpret the results of the sample read that way that doesn't introduce bias? Initially/experimentally we could just compare it to our current approach :) That is, implement *some* block sampling and then check it against what we currently have. Then figure out the bad differences. Rinse. Repeat. Cheers -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
Re: [HACKERS] ANALYZE sampling is too good
Greg, I really don't believe the 5% thing. It's not enough for n_distinct and it's *far* too high a value for linear properties like histograms or nullfrac etc. Actually, it is enough for n_distinct, or more properly, 5% is as good as you can get for n_distinct unless you're going to jump to scanning 50% or more. It's also applicable for the other stats; histogram buckets constructed from a 5% sample are more likely to be accurate than those constructed from a 0.1% sample. Same with nullfrac. The degree of improved accuracy would, of course, require some math to determine. From a computer point of view it's too high to be worth bothering. If we have to read 5% of the table we might as well do a full scan anyways, it'll be marginally slower but much better quality results. Reading 5% of a 200GB table is going to be considerably faster than reading the whole thing, if that 5% is being scanned in a way that the FS understands. Also, we can optimize this significantly by using the VM, as Robert (I think) suggested. In the advanced approaches section, there's also the idea of collecting analyze data from table pages while they're in memory anyway for other reasons. You do seem kind of hostile to the idea of full-page-sampling, going pretty far beyond "I'd need to see the math". Why? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] ANALYZE sampling is too good
Josh Berkus j...@agliodbs.com writes: Reading 5% of a 200GB table is going to be considerably faster than reading the whole thing, if that 5% is being scanned in a way that the FS understands. Really? See the upthread point that reading one sector from each track has just as much seek overhead as reading the whole thing. I will grant that if you think that reading a *contiguous* 5% of the table is good enough, you can make it faster --- but I don't believe offhand that you can make this better without seriously compromising the randomness of your sample. Too many tables are loaded roughly in time order, or in other ways that make contiguous subsets nonrandom. You do seem kind of hostile to the idea of full-page sampling, going pretty far beyond "I'd need to see the math." Why? I'm detecting a lot of hostility to assertions unsupported by any math. For good reason. regards, tom lane
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 1:03 PM, Josh Berkus j...@agliodbs.com wrote: I really don't believe the 5% thing. It's not enough for n_distinct and it's *far* too high a value for linear properties like histograms or nullfrac etc. Actually, it is enough for n_distinct, or more properly, 5% is as good as you can get for n_distinct unless you're going to jump to scanning 50% or more. I'd like to see a proof of that result. Not because I'm hostile to changing the algorithm, but because you've made numerous mathematical claims on this thread that fly in the face of what Greg, myself, and others understand to be mathematically true - including this one. If our understanding is wrong, then by all means let's get that fixed. But you're not going to convince anyone here that we should rip out the existing algorithm and its peer-reviewed journal citations by making categorical assertions about the right way to do things. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 6:03 PM, Josh Berkus j...@agliodbs.com wrote: It's also applicable for the other stats; histogram buckets constructed from a 5% sample are more likely to be accurate than those constructed from a 0.1% sample. Same with nullfrac. The degree of improved accuracy would, of course, require some math to determine. This "some math" is straightforward basic statistics. The 95% confidence interval for a sample of 300 drawn from a population of 1 million would be +/- 5.66%. A sample of 1000 would have a 95% confidence interval of +/- 3.1%. The histogram and nullfrac answer the same kind of question as a political poll: what fraction of the population falls within this subset? This is why pollsters don't need to sample 15 million Americans to have a decent poll result. That's just not how the math works for these kinds of questions. n_distinct is an entirely different kettle of fish. It's a different kind of problem, and the error rate there *is* going to be dependent on the percentage of the total population that you sampled. Moreover, from the papers I read I'm convinced any sample less than 50-80% is nearly useless, so I'm convinced you can't get good results without reading the whole table. -- greg
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 6:54 PM, Greg Stark st...@mit.edu wrote: This "some math" is straightforward basic statistics. The 95% confidence interval for a sample of 300 drawn from a population of 1 million would be +/- 5.66%. A sample of 1000 would have a 95% confidence interval of +/- 3.1%. Incidentally, I got this using an online sample size calculator. Google turns up several, but this one seems the easiest to use: http://www.raosoft.com/samplesize.html -- greg
Re: [HACKERS] ANALYZE sampling is too good
On Sat, Dec 7, 2013 at 11:46 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Dec 3, 2013 at 6:30 PM, Greg Stark st...@mit.edu wrote: I always gave the party line that ANALYZE only takes a small constant-sized sample so even very large tables should be very quick. But after hearing the same story again at Heroku I looked into it a bit further. I was kind of shocked by the numbers. ANALYZE takes a sample of 300 * statistics_target rows. That sounds pretty reasonable, but with default_statistics_target set to 100 that's 30,000 rows. If I'm reading the code right, it takes this sample by sampling 30,000 blocks and then (if the table is large enough) taking an average of one row per block. Each block is 8192 bytes, so that means it's reading 240MB of each table. That's a lot more than I realized. That is a lot. On the other hand, I question the subject line: sometimes, our ANALYZE sampling is not good enough. Before we raised the default statistics target from 10 to 100, complaints about bad plan choices due to insufficiently-precise statistics were legion -- and we still have people periodically proposing to sample a fixed percentage of the table instead of a fixed amount of the table, even on large tables, which is going in the opposite direction. I think this is because they're getting really bad n_distinct estimates, and no fixed-size sample can reliably give a good one. I don't recall ever tracing a bad plan down to a bad n_distinct. I have seen several that were due to bad frequency estimates in the MCV list, because hash join planning is extremely sensitive to that. Do we have some kind of catalog of generators of problematic data, so that changes can be tested on known problem sets? Perhaps a wiki page to accumulate them would be useful. For automated testing I guess the generator and query are the easy part; the hard part is the cost settings/caching/RAM needed to trigger the problem, and parsing and interpreting the results.
More generally, I think the basic problem that people are trying to solve by raising the statistics target is to avoid index scans on gigantic tables. Obviously, there are a lot of other situations where inadequate statistics can cause problems, but that's a pretty easy-to-understand one that we do not always get right. We know that an equality search looking for some_column = 'some_constant', where some_constant is not an MCV, must be more selective than a search for the least-frequent MCV. If you store more and more MCVs for a table, eventually you'll have enough that the least-frequent one is pretty infrequent, and then things work a lot better. My reading of the code is that if it is not in the MCV list, then it is assumed to have the average selectivity (about 1/n_distinct, but deflating top and bottom for the MCV list). There is also a check that it is less than the least common of the MCVs, but I don't know why that situation would ever prevail--that should always be higher or equal to the average selectivity. This is more of a problem for big tables than for small tables. MCV #100 can't have a frequency greater than 1/100 = 0.01, but that's a lot more rows on a big table than on a small one. On a table with 10 million rows we might estimate something close to 100,000 rows when the real number is just a handful; when the table has only 10,000 rows, we just can't be off by as many orders of magnitude. Things don't always work out that badly, but in the worst case they do. Maybe there's some highly-principled statistical approach which could be taken here, and if so that's fine, but I suspect not. So what I think we should do is auto-tune the statistics target based on the table size. If, say, we think that the generally useful range for the statistics target is something like 10 to 400, then let's come up with a formula based on table size that outputs 10 for small tables, 400 for really big tables, and intermediate values for tables in the middle.
I think that parts of the planner are O(N^2) in the size of the histogram (or was that the size of the MCV list?). So we would probably need a way to use a larger sample size to get more accurate n_distinct and MCV frequencies, but not save the entire histogram that goes with that sample size. Cheers, Jeff
Re: [HACKERS] ANALYZE sampling is too good
On 12/6/13 3:21 AM, Andres Freund wrote: On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote: Has anyone ever thought about opportunistic ANALYZE piggy-backing on other full-table scans? That doesn't really help Greg, because his complaint is mostly that a fresh ANALYZE is too expensive, but it could be an interesting, albeit risky approach. What I've been thinking of is a) making it piggy back on scans vacuum is doing instead of doing separate ones all the time (if possible, analyze needs to be more frequent). Currently with quite some likelihood the cache will be gone again when revisiting. FWIW, if synchronize_seqscans is on I'd think it'd be pretty easy to fire up a 2nd backend to do the ANALYZE portion (or perhaps use Robert's fancy new shared memory stuff). -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] ANALYZE sampling is too good
On 12/8/13 1:49 PM, Heikki Linnakangas wrote: On 12/08/2013 08:14 PM, Greg Stark wrote: The whole accounts table is 1.2GB and contains 10 million rows. As expected with rows_per_block set to 1 it reads 240MB of that containing nearly 2 million rows (and takes nearly 20s -- doing a full table scan for select count(*) only takes about 5s): One simple thing we could do, without or in addition to changing the algorithm, is to issue posix_fadvise() calls for the blocks we're going to read. It should at least be possible to match the speed of a plain sequential scan that way. Hrm... maybe it wouldn't be very hard to use async IO here either? I'm thinking it wouldn't be very hard to do the stage 2 work in the callback routine... -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 1:18 PM, Jeff Janes jeff.ja...@gmail.com wrote: I don't recall ever tracing a bad plan down to a bad n_distinct. It does happen. I've seen it several times. -- Peter Geoghegan
Re: [HACKERS] ANALYZE sampling is too good
On 12/09/2013 11:35 PM, Jim Nasby wrote: On 12/8/13 1:49 PM, Heikki Linnakangas wrote: On 12/08/2013 08:14 PM, Greg Stark wrote: The whole accounts table is 1.2GB and contains 10 million rows. As expected with rows_per_block set to 1 it reads 240MB of that containing nearly 2 million rows (and takes nearly 20s -- doing a full table scan for select count(*) only takes about 5s): One simple thing we could do, without or in addition to changing the algorithm, is to issue posix_fadvise() calls for the blocks we're going to read. It should at least be possible to match the speed of a plain sequential scan that way. Hrm... maybe it wouldn't be very hard to use async IO here either? I'm thinking it wouldn't be very hard to do the stage 2 work in the callback routine... Yeah, other than the fact we have no infrastructure to do asynchronous I/O anywhere in the backend. If we had that, then we could easily use it here. I doubt it would be much better than posix_fadvising the blocks, though. - Heikki
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/09/2013 11:35 PM, Jim Nasby wrote: On 12/8/13 1:49 PM, Heikki Linnakangas wrote: On 12/08/2013 08:14 PM, Greg Stark wrote: The whole accounts table is 1.2GB and contains 10 million rows. As expected with rows_per_block set to 1 it reads 240MB of that containing nearly 2 million rows (and takes nearly 20s -- doing a full table scan for select count(*) only takes about 5s): One simple thing we could do, without or in addition to changing the algorithm, is to issue posix_fadvise() calls for the blocks we're going to read. It should at least be possible to match the speed of a plain sequential scan that way. Hrm... maybe it wouldn't be very hard to use async IO here either? I'm thinking it wouldn't be very hard to do the stage 2 work in the callback routine... Yeah, other than the fact we have no infrastructure to do asynchronous I/O anywhere in the backend. If we had that, then we could easily use it here. I doubt it would be much better than posix_fadvising the blocks, though. Without patches to the kernel, it is much better. posix_fadvise interferes with read-ahead, so posix_fadvise on, say, bitmap heap scans (or similarly sorted analyze block samples) run at 1 IO / block, ie horrible, whereas aio can do read coalescence and read-ahead when the kernel thinks it'll be profitable, significantly increasing IOPS. I've seen everything from a 2x to 10x difference.
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes jeff.ja...@gmail.com wrote: My reading of the code is that if it is not in the MCV, then it is assumed to have the average selectivity (about 1/n_distinct, but deflating top and bottom for the MCV list). There is also a check that it is less than the least common of the MCV, but I don't know why that situation would ever prevail--that should always be higher or equal to the average selectivity. I've never seen an n_distinct value of more than 5 digits, regardless of reality. Typically I've seen 20-50k, even if the real number is much higher. But the n_distinct value is only for non-MCVs, so if we estimate the selectivity of column = 'rarevalue' to be (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces the estimate, and making the MCV list longer naturally makes mcvfrac bigger. I'm not sure how important the less-frequent-than-the-least-common-MCV part is, but I'm very sure that raising the statistics target helps to solve the problem of overestimating the prevalence of uncommon values in a very big table. I think that parts of the planner are N^2 in the size of histogram (or was that the size of the MCV list?). So we would probably need a way to use a larger sample size to get more accurate n_distinct and MCV frequencies, but not save the entire histogram that goes with that sample size. I think the saving the histogram part is important. As you say, the MCVs are important for a variety of planning purposes, such as hash joins. More than that, in my experience, people with large tables are typically very willing to spend more planning time to get a better plan, because mistakes are expensive and the queries are likely to run for a while anyway. 
People with small tables care about planning time, because it makes no sense to spend an extra 1ms planning a query unless you improve the plan by enough to save at least 1ms when executing it, and when the tables are small and access is expected to be fast anyway that's often not the case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ANALYZE sampling is too good
On 12/09/2013 11:56 PM, Claudio Freire wrote: On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/09/2013 11:35 PM, Jim Nasby wrote: On 12/8/13 1:49 PM, Heikki Linnakangas wrote: On 12/08/2013 08:14 PM, Greg Stark wrote: The whole accounts table is 1.2GB and contains 10 million rows. As expected with rows_per_block set to 1 it reads 240MB of that containing nearly 2 million rows (and takes nearly 20s -- doing a full table scan for select count(*) only takes about 5s): One simple thing we could do, without or in addition to changing the algorithm, is to issue posix_fadvise() calls for the blocks we're going to read. It should at least be possible to match the speed of a plain sequential scan that way. Hrm... maybe it wouldn't be very hard to use async IO here either? I'm thinking it wouldn't be very hard to do the stage 2 work in the callback routine... Yeah, other than the fact we have no infrastructure to do asynchronous I/O anywhere in the backend. If we had that, then we could easily use it here. I doubt it would be much better than posix_fadvising the blocks, though. Without patches to the kernel, it is much better. posix_fadvise interferes with read-ahead, so posix_fadvise on, say, bitmap heap scans (or similarly sorted analyze block samples) run at 1 IO / block, ie horrible, whereas aio can do read coalescence and read-ahead when the kernel thinks it'll be profitable, significantly increasing IOPS. I've seen everything from a 2x to 10x difference. How did you test that, given that we don't actually have an asynchronous I/O implementation? I don't recall any recent patches floating around either to do that. When Greg Stark investigated this back in 2007-2008 and implemented posix_fadvise() for bitmap heap scans, posix_fadvise certainly gave a significant speedup on the test data he used. What kind of a data distribution gives a slowdown like that? I took a stab at using posix_fadvise() in ANALYZE. 
It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. - Heikki

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9845b0b..410d487 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -58,10 +58,18 @@
 /* Data structure for Algorithm S from Knuth 3.4.2 */
 typedef struct
 {
+	Relation	rel;
+	ForkNumber	forknum;
+
 	BlockNumber N;				/* number of blocks, known in advance */
 	int			n;				/* desired sample size */
 	BlockNumber t;				/* current block number */
 	int			m;				/* blocks selected so far */
+
+	int			prefetch_target;
+	int			start;
+	int			end;
+	BlockNumber *prefetched;
 } BlockSamplerData;
 
 typedef BlockSamplerData *BlockSampler;
@@ -87,10 +95,11 @@ static BufferAccessStrategy vac_strategy;
 static void do_analyze_rel(Relation onerel, VacuumStmt *vacstmt,
 			   AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
 			   bool inh, int elevel);
-static void BlockSampler_Init(BlockSampler bs, BlockNumber nblocks,
+static void BlockSampler_Init(BlockSampler bs, Relation rel, ForkNumber forknum, BlockNumber nblocks,
 				  int samplesize);
 static bool BlockSampler_HasMore(BlockSampler bs);
 static BlockNumber BlockSampler_Next(BlockSampler bs);
+static BlockNumber BlockSampler_Next_internal(BlockSampler bs);
 static void compute_index_stats(Relation onerel, double totalrows,
 					AnlIndexData *indexdata, int nindexes,
 					HeapTuple *rows, int numrows,
@@ -954,8 +963,12 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
  * algorithm.
  */
 static void
-BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize)
+BlockSampler_Init(BlockSampler bs, Relation rel, ForkNumber forknum,
+				  BlockNumber nblocks, int samplesize)
 {
+	bs->rel = rel;
+	bs->forknum = forknum;
+
 	bs->N = nblocks;			/* measured table size */
 
 	/*
@@ -965,17 +978,61 @@ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize)
 	bs->n = samplesize;
 	bs->t = 0;					/* blocks scanned so far */
 	bs->m = 0;					/* blocks selected so far */
+
+	bs->prefetch_target = target_prefetch_pages;
+	if (target_prefetch_pages > 0)
+		bs->prefetched = palloc((bs->prefetch_target + 1) * sizeof(BlockNumber));
+	else
+		bs->prefetched = NULL;
+	bs->start = bs->end = 0;
 }
 
 static bool
 BlockSampler_HasMore(BlockSampler bs)
 {
+	if (bs->end != bs->start)
+		return true;
 	return (bs->t < bs->N) && (bs->m < bs->n);
 }
 
+static void
+BlockSampler_Prefetch(BlockSampler bs)
+{
+	int			next;
+
+	next = (bs->end + 1) % (bs->prefetch_target + 1);
+	while (next != bs->start && (bs->t < bs->N) && (bs->m < bs->n))
+	{
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/09/2013 11:56 PM, Claudio Freire wrote: Without patches to the kernel, it is much better. posix_fadvise interferes with read-ahead, so posix_fadvise on, say, bitmap heap scans (or similarly sorted analyze block samples) run at 1 IO / block, ie horrible, whereas aio can do read coalescence and read-ahead when the kernel thinks it'll be profitable, significantly increasing IOPS. I've seen everything from a 2x to 10x difference. How did you test that, given that we don't actually have an asynchronous I/O implementation? I don't recall any recent patches floating around either to do that. When Greg Stark investigated this back in 2007-2008 and implemented posix_fadvise() for bitmap heap scans, posix_fadvise certainly gave a significant speedup on the test data he used. What kind of a data distribution gives a slowdown like that? That's basically my summarized experience from working on this[0] patch, and the feedback given there about competing AIO work. Sequential I/O was the biggest issue. I had to actively avoid fadvising on sequential I/O, or sequential-ish I/O, which was a huge burden on fadvise logic. I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. Kernel version? I raised this issue on LKML, and, while I got no news on this front, they might have silently fixed it. I'd have to check the sources again. 
[0] http://www.postgresql.org/message-id/col116-w162aeaa64173e77d4597eea3...@phx.gbl
Re: [HACKERS] ANALYZE sampling is too good
Claudio Freire klaussfre...@gmail.com wrote: On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. Kernel version? 3.12, from Debian experimental. With an ssd drive and btrfs filesystem. Admittedly not your average database server setup, so it would be nice to get more reports from others. I raised this issue on LKML, and, while I got no news on this front, they might have silently fixed it. I'd have to check the sources again. Maybe. Or maybe the heuristic read-ahead isn't significant/helpful when you're prefetching with posix_fadvise anyway. - Heikki
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 8:45 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Claudio Freire klaussfre...@gmail.com wrote: On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. Kernel version? 3.12, from Debian experimental. With an ssd drive and btrfs filesystem. Admittedly not your average database server setup, so it would be nice to get more reports from others. Yeah, read-ahead isn't relevant for SSD.
Re: [HACKERS] ANALYZE sampling is too good
Heikki Linnakangas hlinnakan...@vmware.com writes: Maybe. Or maybe the heuristic read ahead isn't significant/helpful, when you're prefetching with posix_fadvise anyway. Yeah. If we're not reading consecutive blocks, readahead is unlikely to do anything anyhow. Claudio's comments do suggest that it might be a bad idea to issue a posix_fadvise when the next block to be examined *is* adjacent to the current one, though. regards, tom lane
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. I did a test run: pgbench scale 2000 (pgbench_accounts approx 25GB). postgres 9.4, i7 3.5GHz CPU, 16GB RAM, 500 GB Velociraptor 10K (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 91s So I'm essentially seeing no difference :-( Regards Mark
Re: [HACKERS] ANALYZE sampling is too good
Mark Kirkwood mark.kirkw...@catalyst.net.nz writes: I did a test run: pgbench scale 2000 (pgbench_accounts approx 25GB). postgres 9.4, i7 3.5GHz CPU, 16GB RAM, 500 GB Velociraptor 10K (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 91s So I'm essentially seeing no difference :-( What OS and filesystem? regards, tom lane
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. [...] If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. I did a test run: pgbench scale 2000 (pgbench_accounts approx 25GB). postgres 9.4, i7 3.5GHz CPU, 16GB RAM, 500 GB Velociraptor 10K (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 91s So I'm essentially seeing no difference :-( Arrg - sorry, forgot the important bits: Ubuntu 13.10 (kernel 3.11.0-14), filesystem is ext4
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 13:20, Mark Kirkwood wrote: On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. [...] If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. I did a test run: pgbench scale 2000 (pgbench_accounts approx 25GB). postgres 9.4, i7 3.5GHz CPU, 16GB RAM, 500 GB Velociraptor 10K (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 91s So I'm essentially seeing no difference :-( Arrg - sorry, forgot the important bits: Ubuntu 13.10 (kernel 3.11.0-14), filesystem is ext4 Doing the same test as above, but on an 80GB Intel 520 (ext4 filesystem mounted with discard): (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 5s With patch: ANALYZE pgbench_accounts 5s
Re: [HACKERS] ANALYZE sampling is too good
On 12/06/2013 09:52 AM, Peter Geoghegan wrote: Has anyone ever thought about opportunistic ANALYZE piggy-backing on other full-table scans? That doesn't really help Greg, because his complaint is mostly that a fresh ANALYZE is too expensive, but it could be an interesting, albeit risky approach. It'd be particularly interesting, IMO, if autovacuum was able to do it using the synchronized scans machinery, so the analyze still ran in a separate proc and didn't impact the user's query. Have an autovac worker or two waiting for new scans to start on tables they need to analyze, and if one doesn't start within a reasonable time frame they start their own scan. We've seen enough issues with hint-bit setting causing unpredictable query times for user queries that I wouldn't want to add another "might write during a read-only query" behaviour. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] ANALYZE sampling is too good
On 12/10/2013 05:20 AM, Jim Nasby wrote: FWIW, if synchronize_seqscans is on I'd think it'd be pretty easy to fire up a 2nd backend to do the ANALYZE portion (or perhaps use Robert's fancy new shared memory stuff). Apologies for posting the same as a new idea before I saw your post. I need to finish the thread before posting. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 13:53, Mark Kirkwood wrote:
> Doing the same test as above, but on an 80GB Intel 520 (ext4 filesystem mounted with discard): (cold OS and PG cache for both runs)
>
> Without patch: ANALYZE pgbench_accounts 5s
> With patch: ANALYZE pgbench_accounts 5s

Redoing the filesystem on the 520 as btrfs didn't seem to make any difference either: (cold OS and PG cache for both runs)

Without patch: ANALYZE pgbench_accounts 6.4s
With patch: ANALYZE pgbench_accounts 6.4s
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 15:04, Mark Kirkwood wrote:
> Redoing the filesystem on the 520 as btrfs didn't seem to make any difference either: (cold OS and PG cache for both runs)
>
> Without patch: ANALYZE pgbench_accounts 6.4s
> With patch: ANALYZE pgbench_accounts 6.4s

Ah - I have just realized I was not setting effective_io_concurrency, so I'll redo the test. Apologies.

Regards, Mark
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 15:11, Mark Kirkwood wrote:
> Ah - I have just realized I was not setting effective_io_concurrency, so I'll redo the test. Apologies.

Redoing the test on the VelociRaptor gives me exactly the same numbers as before (effective_io_concurrency = 10 instead of 1).

Cheers, Mark
Re: [HACKERS] ANALYZE sampling is too good
On 12/09/2013 02:37 PM, Robert Haas wrote:
> I've never seen an n_distinct value of more than 5 digits, regardless of reality. Typically I've seen 20-50k, even if the real number is much higher. But the n_distinct value is only for non-MCVs, so if we estimate the selectivity of column = 'rarevalue' to be (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces the estimate, and making the MCV list longer naturally makes mcvfrac bigger. I'm not sure how important the less-frequent-than-the-least-common-MCV part is, but I'm very sure that raising the statistics target helps to solve the problem of overestimating the prevalence of uncommon values in a very big table.

I did an analysis of our ndistinct algorithm several years ago (circa 8.1), and to sum up:

1. We take far too small a sample to estimate ndistinct well for tables larger than 100,000 rows.
2. The estimation algorithm we have chosen tends to be wrong in the downwards direction, rather strongly so. That is, if we could potentially have an ndistinct of 1,000 to 100,000 based on the sample, our algorithm estimates 1,500 to 3,000.
3. Other algorithms exist. They tend to be wrong in other directions.
4. Nobody has done an analysis of whether it's worse, on average, to estimate low vs. high for ndistinct.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
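[Editor's note: for reference, the estimator analyze.c uses is (to my understanding) the Haas–Stokes "Duj1" estimator, D̂ = n·d / (n − f1 + f1·n/N), where n is the sample size, N the table size, d the distinct values seen, and f1 the values seen exactly once. A hedged sketch, omitting the guards and clamping the real code applies:]

```python
from collections import Counter

def estimate_ndistinct(sample, total_rows):
    """Haas-Stokes Duj1 sketch: n*d / (n - f1 + f1*n/N).
    f1 (values seen exactly once) drives the extrapolation: many
    singletons in the sample imply many unseen values in the table."""
    n = len(sample)
    counts = Counter(sample)
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    if f1 == n:   # every sampled value unique: treat column as unique
        return total_rows
    if f1 == 0:   # every value seen at least twice: d is the estimate
        return d
    return n * d / (n - f1 + f1 * n / total_rows)
```

The downward bias Josh describes follows from the shape of the formula: unless f1 is close to n, the denominator stays close to n, so the estimate stays close to d, the count actually observed in the (too small) sample.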
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 15:17, Mark Kirkwood wrote:
> Redoing the test on the velociraptor gives me exactly the same numbers as before (effective_io_concurrency = 10 instead of 1).

Good grief - repeating the test gave:

Without patch: ANALYZE pgbench_accounts 90s
With patch: ANALYZE pgbench_accounts 42s

pretty consistent *now*. No idea what was going on in the 1st run (maybe I happened to have it running at the same time as a checkpoint)? Anyway, will stop now before creating more confusion.

Regards, Mark
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 15:32, Mark Kirkwood wrote:
> Good grief - repeating the test gave:
>
> Without patch: ANALYZE pgbench_accounts 90s
> With patch: ANALYZE pgbench_accounts 42s
>
> pretty consistent *now*.

Just one more... The Intel 520 with ext4:

Without patch: ANALYZE pgbench_accounts 5s
With patch: ANALYZE pgbench_accounts 1s

And double checking - with patch, but effective_io_concurrency = 1:

ANALYZE pgbench_accounts 5s

These results look more like Heikki's. Which suggests more benefit on SSD than spinning disks. Some more data points (apart from mine) would be good to see tho.

Cheers, Mark
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 12:13 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:
> These results look more like Heikki's. Which suggests more benefit on SSD than spinning disks. Some more data points (apart from mine) would be good to see tho.

Assuming ANALYZE is sampling less than 5% of the table, I'd say fadvising will always be a win. I'd also suggest a higher e_i_c for rotating media. Rotating media has longer latencies, and e_i_c has to be computed in terms of latency, rather than spindles, when doing prefetch. For backward index scans, I found the optimum number for a single spindle to be about 20.
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 17:18, Claudio Freire wrote:
> I'd also suggest a higher e_i_c for rotating media. Rotating media has longer latencies, and e_i_c has to be computed in terms of latency, rather than spindles, when doing prefetch. For backward index scans, I found the optimum number for a single spindle to be about 20.

Yeah - was just fooling about with this on the VelociRaptor; looks like somewhere in the 20s works well.

ANALYZE pgbench_accounts:

 eff_io_concurrency | analyze time (s)
--------------------+------------------
                  8 | 43
                 16 | 40
                 24 | 36
                 32 | 35
                 64 | 35

Cheers, Mark
Re: [HACKERS] ANALYZE sampling is too good
On Sun, Dec 8, 2013 at 12:06 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:
> bench=# ANALYZE pgbench_accounts;
> NOTICE: acquire sample will need 30000 blocks
> NOTICE: sampled 30000 blocks
> ANALYZE
> Time: 10059.446 ms
> bench=# \q

I did some experimenting here as well. I hacked up a version of analyze.c that has a GUC for rows_per_block to sample. Setting it to 1 doesn't modify the behaviour at all, whereas setting it to 4 divides the number of blocks to sample by 4, which causes it to do less I/O and use more rows from each block. I then initialized pgbench with scale factor 100 but modified the code to run the actual pgbench with scale factor 1. In other words, I ran a lot of updates on 1% of the database but left the other 99% untouched from the initial load. Then I ran ANALYZE VERBOSE on accounts with rows_per_block set to 1, 4, 16, and 64. The latter is slightly larger than the average number of tuples per block, so the resulting sample is actually slightly short. The whole accounts table is 1.2GB and contains 10 million rows.
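[Editor's note: the rows_per_block knob trades I/O for sample clustering: the same 30,000-row target is met from proportionally fewer blocks. A sketch of the block-count arithmetic; statistics_rows_per_block is Greg's experimental GUC, not a shipped one:]

```python
import random

SAMPLE_ROWS = 30000  # 300 * default_statistics_target (100)

def plan_sample(table_blocks, rows_per_block):
    """Pick the blocks a blockier sample would read: SAMPLE_ROWS /
    rows_per_block distinct blocks instead of one block per sample row."""
    nblocks = max(1, SAMPLE_ROWS // rows_per_block)
    nblocks = min(nblocks, table_blocks)
    return sorted(random.sample(range(table_blocks), nblocks))
```

For the 158,756-page accounts table this asks for 30000, 7500, 1875, and 468 blocks at rows_per_block = 1, 4, 16, and 64, which matches the "scanned N of 158756 pages" counts reported below to within a block.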
As expected, with rows_per_block set to 1 it reads 240MB of that, containing nearly 2 million rows (and takes nearly 20s -- doing a full table scan for select count(*) only takes about 5s):

stark=# analyze verbose pgbench_accounts;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 30000 of 158756 pages, containing 1889701 live rows and 0 dead rows; 30000 rows in sample, 1036 estimated total rows
ANALYZE
Time: 19468.987 ms

With rows_per_block=4 it reads only a quarter as many blocks, but it's not much faster:

stark=# analyze verbose pgbench_accounts;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 7501 of 158756 pages, containing 472489 live rows and 0 dead rows; 30000 rows in sample, 1037 estimated total rows
ANALYZE
Time: 17062.331 ms

But with rows_per_block=16 it's much faster, 6.7s:

stark=# set statistics_rows_per_block = 16;
SET
Time: 1.583 ms
stark=# analyze verbose pgbench_accounts;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 1876 of 158756 pages, containing 118163 live rows and 0 dead rows; 30000 rows in sample, 1031 estimated total rows
ANALYZE
Time: 6694.331 ms

And with rows_per_block=64 it's under 2s:

stark=# set statistics_rows_per_block = 64;
SET
Time: 0.693 ms
stark=# analyze verbose pgbench_accounts;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 469 of 158756 pages, containing 29544 live rows and 0 dead rows; 29544 rows in sample, 1033 estimated total rows
ANALYZE
Time: 1937.055 ms

The estimates for the total rows are just as accurate in every case. (They seem to be consistently slightly high, though, which is a bit disconcerting.) However, looking at the actual pg_stats entries, the stats are noticeably less accurate for the blockier samples. The bid column actually has 100 distinct values, and so with a statistics_target of 100 each value should appear in the MCV list with a frequency of about .01.
With rows_per_block=1 the MCV frequency list ranges from .0082 to .0123
With rows_per_block=4 the MCV frequency list ranges from .0063 to .0125
With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164
With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213

I'm not really sure if this is due to the blocky sample combined with the skewed pgbench run or not. It doesn't seem to be consistently biasing towards or against bid 1, which I believe covers the only rows that would have been touched by pgbench. Still, it's suspicious that they seem to be consistently getting less accurate as the blockiness increases. I've attached the results of pg_stats following the analyze with the various levels of blockiness.

-- greg
Re: [HACKERS] ANALYZE sampling is too good
On 12/08/2013 10:14 AM, Greg Stark wrote:
> With rows_per_block=1 the MCV frequency list ranges from .0082 to .0123
> With rows_per_block=4 the MCV frequency list ranges from .0063 to .0125
> With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164
> With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213
>
> I'm not really sure if this is due to the blocky sample combined with the skewed pgbench run or not. ... Still, it's suspicious that they seem to be consistently getting less accurate as the blockiness increases.

They will certainly do so if you don't apply any statistical adjustments for selecting more rows from the same pages. There's a body of math designed to correct for the skew introduced by reading *all* of the rows in each block. That's what I meant by block-based sampling: you read, say, 400 pages, you compile statistics on *all* of the rows on those pages, and you apply some algorithms to adjust for groupings of rows based on how grouped they are. And you have a pretty good estimate of how grouped they are, because you just looked at complete sets of rows on a bunch of nonadjacent pages. Obviously, you need to look at more rows than you would with a pure-random sample. Like I said, the 80%+ accuracy point in the papers seemed to be at a 5% sample. However, since those rows come from the same pages, the cost of looking at more rows is quite small compared to the cost of looking at 64 times as many disk pages. My ACM subscription has lapsed, though; someone with a current ACM subscription could search for this; there are several published papers, with math and pseudocode.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
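[Editor's note: the widening MCV ranges Greg observed can be reproduced in a toy simulation. When the table is clustered on the sampled column, taking whole blocks means each value's frequency rests on far fewer independent draws, so the spread of the MCV frequencies grows even though their mean stays near the true 0.01. A sketch with made-up parameters, not pgbench's data:]

```python
import random
from collections import Counter

ROWS_PER_BLOCK = 64
NVALUES, ROWS_PER_VALUE = 100, 1000   # true frequency of every value: 0.01

def mcv_spread(sample, k=10):
    """max - min of the top-k value frequencies in a sample."""
    n = len(sample)
    freqs = [c / n for _, c in Counter(sample).most_common(k)]
    return max(freqs) - min(freqs)

# Worst case for block sampling: table perfectly clustered on the column,
# so each 64-row block holds (almost) a single value.
table = [v for v in range(NVALUES) for _ in range(ROWS_PER_VALUE)]
blocks = [table[i:i + ROWS_PER_BLOCK]
          for i in range(0, len(table), ROWS_PER_BLOCK)]

random.seed(42)
row_sample = random.sample(table, 3000)                # one row at a time
block_sample = [r for blk in random.sample(blocks, 3000 // ROWS_PER_BLOCK)
                for r in blk]                          # whole blocks
```

With these (assumed) parameters the block sample's MCV spread comes out several times wider than the row sample's, mirroring the .0082-.0123 vs .0021-.0213 progression in the thread; the corrections Josh mentions exist precisely to undo this clustering effect.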