Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Fri, Jan 13, 2006 at 11:37:38PM -0500, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > > It's also worth mentioning that for datatypes that only have an =
> > > operator the performance of compute_minimal_stats is O(N^2) when
> > > values are unique, so increasing sample size is a very bad idea in
> > > that case.
> > Hmmm ... does ANALYZE check for UNIQUE constraints?
> Our only implementation of UNIQUE constraints is btree indexes, which
> require more than an = operator, so this seems irrelevant.

IIRC, the point was that if we know a field has to be unique, there's
no sense in doing that part of the analysis on it; you'd only care
about correlation.
--
Jim C. Nasby, Sr. Engineering Consultant      [EMAIL PROTECTED]
Pervasive Software    http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf     cell: 512-569-9461

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your message can
get through to the mailing list cleanly
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Mon, 2006-01-16 at 12:26 -0600, Jim C. Nasby wrote:
> On Fri, Jan 13, 2006 at 11:37:38PM -0500, Tom Lane wrote:
> > Josh Berkus <josh@agliodbs.com> writes:
> > > > It's also worth mentioning that for datatypes that only have an
> > > > = operator the performance of compute_minimal_stats is O(N^2)
> > > > when values are unique, so increasing sample size is a very bad
> > > > idea in that case.
> > > Hmmm ... does ANALYZE check for UNIQUE constraints?
> > Our only implementation of UNIQUE constraints is btree indexes,
> > which require more than an = operator, so this seems irrelevant.
> IIRC, the point was that if we know a field has to be unique, there's
> no sense in doing that part of the analysis on it; you'd only care
> about correlation.

An interesting point: if we know the cardinality by definition there is
no need to ANALYZE for that column. An argument in favour of the
definitional rather than the discovery approach. As a designer I very
often already know the cardinality by definition, so I would appreciate
the opportunity to specify that to the database.

Tom has not spoken against checking for UNIQUE constraints: he is just
pointing out that there never could be a constraint in the case I was
identifying. For other datatypes we could check for them rather than
analyzing the sample, but the effect would be the same either way.

My original point was that behaviour is O(N^2) when unique; it is still
O(N^2) when nearly unique, albeit O(N^2) - O(N). Reducing stats_target
does not help there - the effect varies according to number of rows in
the sample, not num slots in the MCV list.

Best Regards, Simon Riggs
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Fri, 13 Jan 2006 19:18:29 +0000, Simon Riggs [EMAIL PROTECTED] wrote:
> I enclose a patch for checking out block sampling.

Can't comment on the merits of block sampling and your implementation
thereof. Just some nitpicking:

|! * Row Sampling: As of May 2004, we use the Vitter algorithm to create

Linking the use of the Vitter algorithm to May 2004 is not quite
appropriate. We introduced two stage sampling at that time.

| * a random sample of targrows rows (or less, if there are less in the
|! * sample of blocks). In this case, targblocks is always the same as
|! * targrows, so we always read one row per block.

This is just wrong, unless you add "on average". Even then it is a bit
misleading, because in most cases we *read* more tuples than we use.

| * Although every row has an equal chance of ending up in the final
| * sample, this sampling method is not perfect: not every possible
| * sample has an equal chance of being selected. For large relations
| * the number of different blocks represented by the sample tends to be
|! * too small. In that case, block sampling should be used.

Is the last sentence a fact or personal opinion?

| * block. The previous sampling method put too much credence in the row
| * density near the start of the table.

FYI, "previous" refers to the date mentioned above: previous == before
May 2004 == before two stage sampling.

|+ /* Assume that we will have rows at least 64 bytes wide
|+  * Currently we're very unlikely to overflow availMem here
|+  */
|+ if ((allocrows * sizeof(HeapTuple)) > (allowedMem >> 4))

This is a funny way of saying

	if (allocrows * (sizeof(HeapTuple) + 60) > allowedMem)

It doesn't match the comment above; and it is something different if
the size of a pointer is not four bytes.

|- if (bs.m > 0)
|+ if (bs.m > 0 )

Oops.

|+ ereport(DEBUG2,
|+     (errmsg("ANALYZE attr %u sample: n=%u nmultiple=%u f1=%u d=%u",
|+             stats->tupattnum,samplerows, nmultiple, f1, d)));
                              ^ missing space here and in some more places

I haven't been following the discussion too closely but didn't you say
that a block sampling algorithm somehow compensates for the fact that
the sample is not random?

Servus
 Manfred

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Simon Riggs [EMAIL PROTECTED] writes:
> Tom has not spoken against checking for UNIQUE constraints: he is just
> pointing out that there never could be a constraint in the case I was
> identifying.

More generally, arguing for or against any system-wide change on the
basis of the performance of compute_minimal_stats() is folly, because
that function is not used for any common datatypes. (Ideally it'd never
be used at all.)

regards, tom lane
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Mon, 2006-01-16 at 21:24 +0100, Manfred Koizar wrote:
> On Fri, 13 Jan 2006 19:18:29 +0000, Simon Riggs [EMAIL PROTECTED] wrote:
> > I enclose a patch for checking out block sampling.
> Can't comment on the merits of block sampling and your implementation
> thereof.

But your thoughts are welcome...

> |! * Row Sampling: As of May 2004, we use the Vitter algorithm to create
> Linking the use of the Vitter algorithm to May 2004 is not quite
> appropriate. We introduced two stage sampling at that time.

I'll redo some of the comments as you suggest and review coding points.

> | * a random sample of targrows rows (or less, if there are less in the
> |! * sample of blocks). In this case, targblocks is always the same as
> |! * targrows, so we always read one row per block.
> This is just wrong, unless you add "on average". Even then it is a bit
> misleading, because in most cases we *read* more tuples than we use.

The current code reads a number of blocks == targrows, hence "one per
block", but I'll change the comment.

> | * Although every row has an equal chance of ending up in the final
> | * sample, this sampling method is not perfect: not every possible
> | * sample has an equal chance of being selected. For large relations
> | * the number of different blocks represented by the sample tends to be
> |! * too small. In that case, block sampling should be used.
> Is the last sentence a fact or personal opinion?

That is my contention, based upon research. The existing code clearly
identifies that case as requiring a solution. We use Chaudhuri et al's
1998 result for the sufficiency of sample size, yet overlook his
estimators and his ideas for random block selection.

My first point is that we need a proportional sample size, not a fixed
size. [I show that for large tables we currently use a sample that is
2-5 times too small, even based on Chaudhuri's work.] ANALYZE is very
I/O bound: random row sampling would need to scan the whole table to
take a 2-3% sample in most cases.
Block sampling is a realistic way of getting a larger sample size
without loss of performance, and also a way of getting a reasonable
sample when accessing very few blocks. We would keep random row
sampling for smaller relations, where the performance effects of sample
collection are not noticeable.

> I haven't been following the discussion too closely but didn't you say
> that a block sampling algorithm somehow compensates for the fact that
> the sample is not random?

I haven't said that block sampling compensates for the sample being
non-random. There is a danger of bias in the sample and I made that
clear. I've tried to mitigate that by the use of random blocks, reusing
your code and taking note of the earlier problems you solved. However,
block sampling allows us to spot cases where rows are naturally grouped
together, which row sampling doesn't spot until very high sample sizes.
I explored in detail a common case where this causes the current method
to fall down very badly.

Brutlag & Richardson [2000] give a good run-down on block sampling and
I'm looking to implement some of their block estimators to complement
the current patch. (Credit to Andrew Dunstan for locating the paper...)
http://www.stat.washington.edu/www/research/reports/1999/tr355.ps

[I did briefly explore the possibility of hybrid row/block sampling,
using row sampling as well as some block sampling to identify grouping,
with an attempt to avoid swamping the sample with biased rows; but I
don't have a strong basis for that, so I've ignored it for now.]

This patch allows us to explore the possible bias that block sampling
gives, allowing us to make a later decision to include it, or not.

Best Regards, Simon Riggs
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Fri, 2006-01-06 at 15:26 -0800, Josh Berkus wrote:
> Anyway, since the proof is in the pudding, Simon and I will be working
> on some demo code for different sampling methods so that we can debate
> results rather than theory.

I enclose a patch for checking out block sampling. This is not
production ready yet, but works bug-free on cvstip. Code comments have
been fully updated to explain what's going on inside.

All you need to do is set analyze_blocks=b and ANALYZE will switch over
to using the block sampling method and will read all the rows in b
blocks. The sample size will also be limited by maintenance_work_mem.
(Memory limitations could be smarter). This de-couples the
specification of the sample size from the specification of the
MCV/histogram size (stats_target).

[Right now, I'm not suggesting that we have a GUC named this - it just
exists for testing. If/when we agree to allow block sampling, then we
can discuss how to invoke/specify it]

The stats calculations aren't touched - it still uses Haas-Stokes. If
you set log_min_messages=DEBUG2 you'll also get more useful
explanations of what the variables are and what decisions it makes
about D for each column being analyzed.

This patch has two main effects:
- ANALYZE runs more than x10 faster to retrieve the same size sample
- you can specify much larger samples for bigger tables, without
  increasing the stats targets

Generally, the larger samples will give better results for the
estimation. However, what is immediately apparent is that the
Haas-Stokes estimator actually gets even worse with block sampling in
the particular case I raised on-list. (Which is understandable, but not
desirable). ISTM this is a strike against Haas-Stokes, rather than a
strike against block sampling. So I'm looking at implementing the
Brutlag & Richardson estimator(s) that cope with
number-of-values-appearing in only one block.
Not surprisingly that means some non-trivial additional code to
retrieve blockids for each tuple and make decisions accordingly. I plan
to use a similar technique to the existing TupnoLink array to match
blockids.

The B&R estimators should allow a fairly fast, small sample to be
taken, making this more usable for dynamic sampling during query
planning (when desirable, see recent -perform thread).

It's also worth mentioning that for datatypes that only have an =
operator the performance of compute_minimal_stats is O(N^2) when values
are unique, so increasing sample size is a very bad idea in that case.
It may be possible to re-sample the sample, so that we get only one row
per block as with the current row sampling method. Another idea might
be just to abort the analysis when it looks fairly unique, rather than
churn through the whole sample.

Best Regards, Simon Riggs

Index: src/backend/commands/analyze.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/analyze.c,v
retrieving revision 1.90
diff -c -r1.90 analyze.c
*** src/backend/commands/analyze.c	22 Nov 2005 18:17:08 -	1.90
--- src/backend/commands/analyze.c	13 Jan 2006 18:54:26 -
***************
*** 62,67 ****
--- 62,68 ----
  /* Default statistics target (GUC parameter) */
  int		default_statistics_target = 10;
+ int		analyze_blocks = 0;

  static int	elevel = -1;
***************
*** 77,84 ****
  			   HeapTuple *rows, int numrows,
  			   MemoryContext col_context);
  static VacAttrStats *examine_attribute(Relation onerel, int attnum);
! static int acquire_sample_rows(Relation onerel, HeapTuple *rows,
! 			   int targrows, double *totalrows, double *totaldeadrows);
  static double random_fract(void);
  static double init_selection_state(int n);
  static double get_next_S(double t, int n, double *stateptr);
--- 78,86 ----
  			   HeapTuple *rows, int numrows,
  			   MemoryContext col_context);
  static VacAttrStats *examine_attribute(Relation onerel, int attnum);
! static HeapTuple *acquire_sample_rows(Relation onerel,
! 			   int targblocks, int targrows, long allowedMem,
! 			   int *samplerows, double *totalrows, double *totaldeadrows);
  static double random_fract(void);
  static double init_selection_state(int n);
  static double get_next_S(double t, int n, double *stateptr);
***************
*** 108,117 ****
--- 110,122 ----
  	VacAttrStats **vacattrstats;
  	AnlIndexData *indexdata;
  	int			targrows,
+ 				targblocks,
  				numrows;
  	double		totalrows,
  				totaldeadrows;
  	HeapTuple  *rows;
+ 	long		allowedMem = 0;	/* total memory allowed, in bytes */
+
  	if (vacstmt->verbose)
  		elevel = INFO;
***************
*** 348,357 ****
  	/*
  	 * Acquire the sample rows
  	 */
- 	rows = (HeapTuple *) palloc(targrows * sizeof(HeapTuple));
- 	numrows = acquire_sample_rows(onerel, rows, targrows,
- 								  &totalrows, &totaldeadrows);

  	/*
  	 * Compute the statistics. Temporary results during the calculations for
  	 * each column are stored in a
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Simon,

> It's also worth mentioning that for datatypes that only have an =
> operator the performance of compute_minimal_stats is O(N^2) when
> values are unique, so increasing sample size is a very bad idea in
> that case. It may be possible to re-sample the sample, so that we get
> only one row per block as with the current row sampling method.
> Another idea might be just to abort the analysis when it looks fairly
> unique, rather than churn through the whole sample.

I'd tend to do the latter. If we haven't had a value repeat in 25
blocks, how likely is one to appear later?

Hmmm ... does ANALYZE check for UNIQUE constraints? Most unique values
are going to have a constraint, in which case we don't need to sample
them at all for N-distinct.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Josh Berkus <josh@agliodbs.com> writes:
> > It's also worth mentioning that for datatypes that only have an =
> > operator the performance of compute_minimal_stats is O(N^2) when
> > values are unique, so increasing sample size is a very bad idea in
> > that case.
> Hmmm ... does ANALYZE check for UNIQUE constraints?

Our only implementation of UNIQUE constraints is btree indexes, which
require more than an = operator, so this seems irrelevant.

regards, tom lane
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Mon, 2006-01-09 at 22:08 -0500, Greg Stark wrote:
> So it's not the 8k block reading that's fooling Linux into reading
> ahead 32k. It seems 32k readahead is the default for Linux, or perhaps
> it's the sequential access pattern that's triggering it.

Nah, Linux 2.6 uses flexible readahead logic. It increases slowly when
you read sequentially, but halves the readahead if you do another
access type. Can't see that would give an average readahead size of
32k.

Anyway, this is just one reason for change...

Best Regards, Simon Riggs
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Simon Riggs [EMAIL PROTECTED] writes:
> On Mon, 2006-01-09 at 22:08 -0500, Greg Stark wrote:
> > So it's not the 8k block reading that's fooling Linux into reading
> > ahead 32k. It seems 32k readahead is the default for Linux, or
> > perhaps it's the sequential access pattern that's triggering it.
> Nah, Linux 2.6 uses flexible readahead logic. It increases slowly when
> you read sequentially, but halves the readahead if you do another
> access type. Can't see that would give an average readahead size of
> 32k.

I've actually read this code at one point in the past. IIRC the
readahead is capped at 32k, which I find interesting given the results.
Since this is testing sequential access patterns, perhaps the test for
readahead is too liberal.

All the numbers I'm getting are consistent with a 32k readahead. Even
if I run my program with a 4k block size I get performance equivalent
to a full table scan very quickly. If I use a 32k block size then the
breakeven point is just over 20%.

I suppose what I really ought to do is make some pretty graphs.

-- greg
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
FWIW, results from a FBSD 6.0 box. Between every run I ran code to zero
800MB (out of 1G) of memory, which should have effectively flushed the
OS cache (it would just put the OS into swapping).

Reading 1% (1280/128000 blocks 1048576000 bytes) total time 6870595us MB/s 1.53 effective MB/s 152.62
Reading 2% (2560/128000 blocks 1048576000 bytes) total time 13397833us MB/s 1.57 effective MB/s 78.26
Reading 3% (3840/128000 blocks 1048576000 bytes) total time 16288218us MB/s 1.93 effective MB/s 64.38
Reading 4% (5120/128000 blocks 1048576000 bytes) total time 17998334us MB/s 2.33 effective MB/s 58.26
Reading 5% (6400/128000 blocks 1048576000 bytes) total time 22958791us MB/s 2.28 effective MB/s 45.67
Reading 6% (7680/128000 blocks 1048576000 bytes) total time 21272511us MB/s 2.96 effective MB/s 49.29
Reading 7% (8960/128000 blocks 1048576000 bytes) total time 21708130us MB/s 3.38 effective MB/s 48.30
Reading 8% (10240/128000 blocks 1048576000 bytes) total time 23213377us MB/s 3.61 effective MB/s 45.17
Reading 9% (11520/128000 blocks 1048576000 bytes) total time 33641027us MB/s 2.81 effective MB/s 31.17
Reading 10% (12800/128000 blocks 1048576000 bytes) total time 22484234us MB/s 4.66 effective MB/s 46.64
Reading 15% (19200/128000 blocks 1048576000 bytes) total time 23985072us MB/s 6.56 effective MB/s 43.72
Reading 20% (25600/128000 blocks 1048576000 bytes) total time 26332888us MB/s 7.96 effective MB/s 39.82

--
Jim C. Nasby, Sr. Engineering Consultant      [EMAIL PROTECTED]
Pervasive Software    http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf     cell: 512-569-9461
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Fri, 2006-01-06 at 16:13 -0500, Greg Stark wrote:
> Jim C. Nasby [EMAIL PROTECTED] writes:
> > Before we start debating merits of proposals based on random reads,
> > can someone confirm that the sampling code actually does read
> > randomly? I looked at it yesterday; there is a comment that states
> > that blocks to be scanned are passed to the analyze function in
> > physical order, and AFAICT the function that chooses blocks does so
> > based strictly on applying a probability function to block numbers
> > as it increments a counter. It seems that any reading is actually
> > sequential and not random, which makes all the random_page_cost
> > hand-waving null and void.
> Hm. I'm curious just how much that behaves like a sequential scan
> actually. I think I'll do some experiments.
>
> Reading 1% (1267 read, 126733 skipped):    7748264us
> Reading 2% (2609 read, 125391 skipped):   12672025us
> Reading 5% (6502 read, 121498 skipped):   19005678us
> Reading 5% (6246 read, 121754 skipped):   18509770us
> Reading 10% (12975 read, 115025 skipped): 19305446us
> Reading 20% (25716 read, 102284 skipped): 18147151us
> Reading 50% (63656 read, 64344 skipped):  18089229us
> Reading 100% (128000 read, 0 skipped):    18173003us
>
> These numbers don't make much sense to me. It seems like 5% is about
> as slow as reading the whole file, which is even worse than I
> expected. I thought I was being a bit pessimistic to think reading 5%
> would be as slow as reading 20% of the table.

Just to put a few rumours to bed:

- the current code does *not* use block sampling, it uses random row
  sampling. (There is a part of the code that selects the next block,
  but that should not confuse you into thinking that the whole block is
  sampled).
- yes, the random sampling is random - please read the code and
  comments
- yes, I would expect the results you get. If you sample 5% of rows and
  each block has on average at least 20 rows, then we should expect the
  majority of blocks to be hit.

Best Regards, Simon Riggs
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Simon Riggs wrote:
> - yes, the random sampling is random - please read the code and
>   comments
> - yes, I would expect the results you get. If you sample 5% of rows
>   and each block has on average at least 20 rows, then we should
>   expect the majority of blocks to be hit.

and it seems from the benchmark posted to this list that random is
_very_ expensive (probably because the random reads are spread out so
well that we do a lot of physical I/O instead of just logical I/O from
some cache)

regards, Lukas
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Simon Riggs [EMAIL PROTECTED] writes:
> - yes, I would expect the results you get. If you sample 5% of rows
>   and each block has on average at least 20 rows, then we should
>   expect the majority of blocks to be hit.

These results are from my test program. "5%" means 5% of 8k blocks from
the test file. In other words, reading a random 5% of the blocks from
the test file in sequential order, but seeking over the skipped blocks,
is just as slow as reading the entire file. I feel like that can't be
right but I can't find anything wrong with the methodology.

An updated program is attached with which I got these results:

bash-3.00# for i in `seq 1 100` ; do umount /u6; mount /dev/sda1 /u6; ~stark/src/pg/a.out /u6/temp/small $i ; done
Reading 1% (1280/128000 blocks 1048576000 bytes) total time 7662706us MB/s 1.37 effective MB/s 136.84
Reading 2% (2560/128000 blocks 1048576000 bytes) total time 12495106us MB/s 1.68 effective MB/s 83.92
Reading 3% (3840/128000 blocks 1048576000 bytes) total time 15847342us MB/s 1.99 effective MB/s 66.17
Reading 4% (5120/128000 blocks 1048576000 bytes) total time 18281244us MB/s 2.29 effective MB/s 57.36
Reading 5% (6400/128000 blocks 1048576000 bytes) total time 18988843us MB/s 2.76 effective MB/s 55.22
Reading 6% (7680/128000 blocks 1048576000 bytes) total time 19225394us MB/s 3.27 effective MB/s 54.54
Reading 7% (8960/128000 blocks 1048576000 bytes) total time 19462241us MB/s 3.77 effective MB/s 53.88
Reading 8% (10240/128000 blocks 1048576000 bytes) total time 19747881us MB/s 4.25 effective MB/s 53.10
Reading 9% (11520/128000 blocks 1048576000 bytes) total time 19451411us MB/s 4.85 effective MB/s 53.91
Reading 10% (12800/128000 blocks 1048576000 bytes) total time 19546511us MB/s 5.36 effective MB/s 53.65
Reading 11% (14080/128000 blocks 1048576000 bytes) total time 18989375us MB/s 6.07 effective MB/s 55.22
Reading 12% (15360/128000 blocks 1048576000 bytes) total time 18722848us MB/s 6.72 effective MB/s 56.01
Reading 13% (16640/128000 blocks 1048576000 bytes) total time 18621588us MB/s 7.32 effective MB/s 56.31
Reading 14% (17920/128000 blocks 1048576000 bytes) total time 18581751us MB/s 7.90 effective MB/s 56.43
Reading 15% (19200/128000 blocks 1048576000 bytes) total time 18422160us MB/s 8.54 effective MB/s 56.92
Reading 16% (20480/128000 blocks 1048576000 bytes) total time 18148012us MB/s 9.24 effective MB/s 57.78
Reading 17% (21760/128000 blocks 1048576000 bytes) total time 18147779us MB/s 9.82 effective MB/s 57.78
Reading 18% (23040/128000 blocks 1048576000 bytes) total time 18023256us MB/s 10.47 effective MB/s 58.18
Reading 19% (24320/128000 blocks 1048576000 bytes) total time 18039846us MB/s 11.04 effective MB/s 58.13
Reading 20% (25600/128000 blocks 1048576000 bytes) total time 18081214us MB/s 11.60 effective MB/s 57.99
...

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <time.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCKSIZE 8192

int main(int argc, char *argv[], char *arge[])
{
	char *fn;
	int fd;
	int perc;
	struct stat statbuf;
	struct timeval tv1, tv2;
	off_t size, offset;
	char buf[BLOCKSIZE];
	int b_toread, b_toskip, b_read = 0, b_skipped = 0;
	long us;

	fn = argv[1];
	perc = atoi(argv[2]);

	fd = open(fn, O_RDONLY);
	fstat(fd, &statbuf);
	size = statbuf.st_size;
	size = size / BLOCKSIZE * BLOCKSIZE;

	gettimeofday(&tv1, NULL);
	srandom(getpid() ^ tv1.tv_sec ^ tv1.tv_usec);

	b_toread = size / BLOCKSIZE * perc / 100;
	b_toskip = size / BLOCKSIZE - b_toread;

	for (offset = 0; offset < size; offset += BLOCKSIZE)
	{
		if (random() % (b_toread + b_toskip) < b_toread)
		{
			lseek(fd, offset, SEEK_SET);
			read(fd, buf, BLOCKSIZE);
			b_toread--;
			b_read++;
		}
		else
		{
			b_toskip--;
			b_skipped++;
		}
	}

	gettimeofday(&tv2, NULL);
	us = (tv2.tv_sec - tv1.tv_sec) * 1000000 + (tv2.tv_usec - tv1.tv_usec);

	fprintf(stderr, "Reading %d%% (%d/%d blocks %ld bytes) total time %ldus MB/s %.2f effective MB/s %.2f\n",
			perc, b_read, b_read + b_skipped, (long) size, us,
			(double) b_read * BLOCKSIZE / us, (double) size / us);

	exit(0);
}

-- greg
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
> These numbers don't make much sense to me. It seems like 5% is about
> as slow as reading the whole file, which is even worse than I
> expected. I thought I was being a bit pessimistic to think reading 5%
> would be as slow as reading 20% of the table.

I have a theory. My test program, like Postgres, is reading in 8k
chunks. Perhaps that's fooling Linux into thinking it's a sequential
read and reading in 32k chunks internally. That would effectively make
a 25% scan a full table scan. And a 5% scan would be a 20% scan, which
is about where I would have expected the breakeven point to be.

-- greg
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Greg Stark [EMAIL PROTECTED] writes:
> > These numbers don't make much sense to me. It seems like 5% is about
> > as slow as reading the whole file, which is even worse than I
> > expected. I thought I was being a bit pessimistic to think reading
> > 5% would be as slow as reading 20% of the table.
> I have a theory. My test program, like Postgres, is reading in 8k
> chunks. Perhaps that's fooling Linux into thinking it's a sequential
> read and reading in 32k chunks internally. That would effectively make
> a 25% scan a full table scan. And a 5% scan would be a 20% scan, which
> is about where I would have expected the breakeven point to be.

Well, my theory was sort of half right. It has nothing to do with
fooling Linux into thinking it's a sequential read. Apparently this
filesystem was created with 32k blocks. I don't remember if that was
intentional or if ext2/3 did it automatically based on the size of the
filesystem.

So it doesn't have wide-ranging implications for Postgres's default 8k
block size. But it is a good lesson about the importance of not using a
larger filesystem block than Postgres's block size. The net effect is
that if the filesystem block is N*8k then your random_page_cost goes up
by a factor of N. That could be devastating for OLTP performance.

-- greg
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Greg Stark [EMAIL PROTECTED] writes:
> Well my theory was sort of half right. It has nothing to do with
> fooling Linux into thinking it's a sequential read. Apparently this
> filesystem was created with 32k blocks. I don't remember if that was
> intentional or if ext2/3 did it automatically based on the size of the
> filesystem.
> So it doesn't have wide-ranging implications for Postgres's default 8k
> block size. But it is a good lesson about the importance of not using
> a larger filesystem block than Postgres's block size. The net effect
> is that if the filesystem block is N*8k then your random_page_cost
> goes up by a factor of N. That could be devastating for OLTP
> performance.

Hm, apparently I spoke too soon. tune2fs says the block size is in fact
4k. Yet the performance of the block-reading test program with 4k or 8k
blocks behaves as if Linux is reading 32k blocks. And in fact when I
run it with 32k blocks I get the kind of behaviour we were expecting,
where the breakeven point is around 20%.

So it's not the 8k block reading that's fooling Linux into reading
ahead 32k. It seems 32k readahead is the default for Linux, or perhaps
it's the sequential access pattern that's triggering it.

I'm trying to test it with posix_fadvise() set to random access but I'm
having trouble compiling. Do I need a special #define to get
posix_fadvise from glibc?

-- greg
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Fri, Jan 06, 2006 at 06:36:52PM -0500, Greg Stark wrote: Josh Berkus josh@agliodbs.com writes: These numbers don't make much sense to me. It seems like 5% is about as slow as reading the whole file which is even worse than I expected. I thought I was being a bit pessimistic to think reading 5% would be as slow as reading 20% of the table. It's about what *I* expected. Disk seeking is the bane of many access methods. Sure, but that bad? That means realistic random_page_cost values should be something more like 20 rather than 4. And that's with seeks only going to subsequent blocks in a single file, which one would expect to average less than the half rotation that a random seek would average. That seems worse than anyone expects. Anyway, since the proof is in the pudding, Simon and I will be working on some demo code for different sampling methods so that we can debate results rather than theory. Note that if these numbers are realistic then there's no i/o benefit to any sampling method that requires anything like 5% of the entire table and is still unreliable. Instead it makes more sense to implement an algorithm that requires a full table scan and can produce good results more reliably. -- greg I have not looked closely at the sampling process in PostgreSQL, but given that getting an accurate answer is expensive in terms of read/seeks, would it be possible to iteratively refine the estimate overtime? Ideally we could use the same disk blocks that are already being read to satisfy a query. The initial estimate could be done with a minimal impact and then we piggy- back the I/O needed to refine the estimate on the normal query I/O. The obvious limit would be if the entire table were scanned we would get a 100% exact estimate of the distict value at that time, and otherwise a progressively more accurate approximation as queries are run. This would allow us to hide the I/O in other normal block accesses. 
Ken
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Fri, Jan 06, 2006 at 01:24:41AM -0500, Greg Stark wrote: 5% based on block-based sampling is reasonable; that means a straight 5% of the on-disk size of the table, so 5gb for a 100gb table. With random-row sampling, that would require as much as 25% of the table, making it easier to just scan the whole thing. Postgres's current sample sizes are clearly geared towards the histograms where they are entirely realistic. All of the distinct estimates are clearly just ad hoc attempts based on the existing sampling. Is a mechanism that is only 5x faster than reading the whole table (assuming random_page_cost of 4) and is off by more than a factor of three 10% of the time really worth it? Before we start debating merits of proposals based on random reads, can someone confirm that the sampling code actually does read randomly? I looked at it yesterday; there is a comment that states that blocks to be scanned are passed to the analyze function in physical order, and AFAICT the function that chooses blocks does so based strictly on applying a probability function to block numbers as it increments a counter. It seems that any reading is actually sequential and not random, which makes all the random_page_cost hand-waving null and void. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 6: explain analyze is your friend
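[The counter-driven, physical-order block selection Jim describes can be illustrated with Knuth's Algorithm S (selection sampling), which emits a fixed-size, unbiased sample of block numbers strictly in ascending order. This is a sketch of the access pattern only, not the actual analyze.c code, which uses a Vitter-style reservoir for rows:]

```c
#include <stdlib.h>

/* Knuth's Algorithm S: choose exactly "want" block numbers out of
 * "nblocks", each subset equally likely, emitted in physical order.
 * Returns the number of blocks selected (always "want" if
 * want <= nblocks). */
int sample_blocks(int nblocks, int want, int *out)
{
    int chosen = 0;

    for (int b = 0; b < nblocks && chosen < want; b++)
    {
        /* select block b with probability (want - chosen) / (nblocks - b) */
        double u = rand() / ((double) RAND_MAX + 1.0);

        if ((double) (nblocks - b) * u < (double) (want - chosen))
            out[chosen++] = b;
    }
    return chosen;
}
```

Because block numbers only ever come out ascending, the resulting reads are a skip-sequential pass over the heap, exactly the pattern being debated here: not sequential, but far from uniformly random either.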
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Jim C. Nasby [EMAIL PROTECTED] writes: Before we start debating merits of proposals based on random reads, can someone confirm that the sampling code actually does read randomly? Well, it's not so much that it's not random, as that it's not sequential --- it skips blocks, and therefore you'd expect that kernel-level read-ahead would not kick in, or at least not be very effective. If there weren't much else going on, you could still assume that you'd be paying less seek cost than in a genuinely random-order fetching of the same number of blocks. Not sure how these effects would add up. I agree that some investigation would be wise before making any claims about how expensive the current method actually is. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Jim C. Nasby [EMAIL PROTECTED] writes: Before we start debating merits of proposals based on random reads, can someone confirm that the sampling code actually does read randomly? I looked at it yesterday; there is a comment that states that blocks to be scanned are passed to the analyze function in physical order, and AFAICT the function that chooses blocks does so based strictly on applying a probability function to block numbers as it increments a counter. It seems that any reading is actually sequential and not random, which makes all the random_page_cost hand-waving null and void. Hm. I'm curious just how much that behaves like a sequential scan actually. I think I'll do some experiments.

Reading 1%   (1267 read, 126733 skipped):    7748264us
Reading 2%   (2609 read, 125391 skipped):   12672025us
Reading 5%   (6502 read, 121498 skipped):   19005678us
Reading 5%   (6246 read, 121754 skipped):   18509770us
Reading 10%  (12975 read, 115025 skipped):  19305446us
Reading 20%  (25716 read, 102284 skipped):  18147151us
Reading 50%  (63656 read, 64344 skipped):   18089229us
Reading 100% (128000 read, 0 skipped):      18173003us

These numbers don't make much sense to me. It seems like 5% is about as slow as reading the whole file which is even worse than I expected. I thought I was being a bit pessimistic to think reading 5% would be as slow as reading 20% of the table. Anyone see anything wrong with my methodology?
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <time.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCKSIZE 8192

int main(int argc, char *argv[])
{
    char *fn;
    int fd;
    int perc;
    struct stat statbuf;
    struct timeval tv1, tv2;
    off_t size, offset;
    char buf[BLOCKSIZE];
    int b_read = 0, b_skipped = 0;

    fn = argv[1];
    perc = atoi(argv[2]);

    fd = open(fn, O_RDONLY);
    fstat(fd, &statbuf);
    size = statbuf.st_size;
    size = size / BLOCKSIZE * BLOCKSIZE;

    gettimeofday(&tv1, NULL);
    srandom(getpid() ^ tv1.tv_sec ^ tv1.tv_usec);

    /* walk the file in physical order, reading each block with
     * probability perc/100 and seeking past the rest */
    for (offset = 0; offset < size; offset += BLOCKSIZE)
    {
        if (random() % 100 < perc)
        {
            lseek(fd, offset, SEEK_SET);
            read(fd, buf, BLOCKSIZE);
            b_read++;
        }
        else
            b_skipped++;
    }

    gettimeofday(&tv2, NULL);
    fprintf(stderr, "Reading %d%% (%d read, %d skipped): %ldus\n",
            (int) perc, b_read, b_skipped,
            (tv2.tv_sec - tv1.tv_sec) * 1000000 +
            (tv2.tv_usec - tv1.tv_usec));
    exit(0);
}

-- greg
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Greg, These numbers don't make much sense to me. It seems like 5% is about as slow as reading the whole file which is even worse than I expected. I thought I was being a bit pessimistic to think reading 5% would be as slow as reading 20% of the table. It's about what *I* expected. Disk seeking is the bane of many access methods. Anyway, since the proof is in the pudding, Simon and I will be working on some demo code for different sampling methods so that we can debate results rather than theory. -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Josh Berkus josh@agliodbs.com writes: These numbers don't make much sense to me. It seems like 5% is about as slow as reading the whole file which is even worse than I expected. I thought I was being a bit pessimistic to think reading 5% would be as slow as reading 20% of the table. It's about what *I* expected. Disk seeking is the bane of many access methods. Sure, but that bad? That means realistic random_page_cost values should be something more like 20 rather than 4. And that's with seeks only going to subsequent blocks in a single file, which one would expect to average less than the half rotation that a random seek would average. That seems worse than anyone expects. Anyway, since the proof is in the pudding, Simon and I will be working on some demo code for different sampling methods so that we can debate results rather than theory. Note that if these numbers are realistic then there's no i/o benefit to any sampling method that requires anything like 5% of the entire table and is still unreliable. Instead it makes more sense to implement an algorithm that requires a full table scan and can produce good results more reliably. -- greg ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Thu, 2006-01-05 at 00:33 -0500, Greg Stark wrote: Simon Riggs [EMAIL PROTECTED] writes: The approach I suggested uses the existing technique for selecting random blocks, then either an exhaustive check on all of the rows in a block or the existing random row approach, depending upon available memory. We need to check all of the rows in a reasonable sample of blocks otherwise we might miss clusters of rows in large tables - which is the source of the problems identified. The other reason was to increase the sample size, which is a win in any form of statistics. Only if your sample is random and independent. The existing mechanism tries fairly hard to ensure that every record has an equal chance of being selected. If you read the entire block and not appropriate samples then you'll introduce systematic sampling errors. For example, if you read an entire block you'll be biasing towards smaller records. Yes, I discussed that, following Brutlag Richardson [2000]. The bottom line is if there is no clustering, block sampling is random, which is good; if there is clustering, then you spot it, which is good. I think it would be useful to have a knob to increase the sample size separately from the knob for the amount of data retained in the statistics tables. Though I think you'll be disappointed and find you have to read an unreasonably large sample out of the table before you get more useful distinct estimates. OK, I'll look at doing that. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Do you *really* want the median estimate in these cases? Are you certain you do not want something with the opposite behavior of Chaudhuri's estimate, so that for small sample sizes the bias is toward a high estimate of D? (Converges on D from the right instead of the left.)

    Chaudhuri's  ----------- D ----------->  needed
     estimate                               estimate

Hmmm. Yeah, I see what you mean. True, the ideal approach would be to determine for each query operation whether a too-low D or a too-high D was more risky, and then use the more conservative number. However, that would complicate the query planner enough that I think Tom would leave us. :-p You could have some specific functions vote themselves out if their cost is shaky. We know that the cost of a miscalculated nestloop is huge, so after calculating the common case it might apply a multiplier for the risk involved. There have been lots of requests for a way to achieve more consistent plans that have a determined worst case performance, even if they never perform as well in the best case as another algorithm might. Perhaps this could be a GUC:

    PlanCost + PlanCost * Risk * RiskGUC

Risk is a number that indicates how badly things can go wrong. RiskGUC is an integer multiplier. Someone who is risk averse (wants a predictable execution time rather than the best possible time) would set this value high. Others who want the best possible plan in most cases, even if it performs poorly once in a while, will set the value very low, possibly 0.
--
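[The costing formula proposed above is simple enough to pin down in code. `risk` and `risk_guc` are hypothetical knobs from this message, not existing planner fields:]

```c
/* Sketch of the risk-weighted costing idea: inflate a plan's estimated
 * cost by how badly it can blow up if the estimates are wrong.
 * A risk-averse user sets risk_guc high; one who wants the cheapest
 * typical plan sets it to 0. */
double risk_adjusted_cost(double plan_cost, double risk, double risk_guc)
{
    return plan_cost + plan_cost * risk * risk_guc;
}
```

With risk_guc = 0 the formula degenerates to the plain cost, so the knob would be backwards-compatible by default; a fragile nestloop with high `risk` only loses to a hash join once the multiplier is turned up.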
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Josh Berkus josh@agliodbs.com writes: Only if your sample is random and independent. The existing mechanism tries fairly hard to ensure that every record has an equal chance of being selected. If you read the entire block and not appropriate samples then you'll introduce systematic sampling errors. For example, if you read an entire block you'll be biasing towards smaller records. Did you read any of the papers on block-based sampling? These sorts of issues are specifically addressed in the algorithms. We *currently* use a block based sampling algorithm that addresses this issue by taking care to select rows within the selected blocks in an unbiased way. You were proposing reading *all* the records from the selected blocks, which throws away that feature. -- greg ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Josh Berkus josh@agliodbs.com writes: These statements are at odds with my admittedly basic understanding of statistics. Isn't the power of a sample more related to the absolute size of the sample than the sample as a fraction of the population? Why not just pick a smallish sample size, say about 3000, and apply it to all the tables, even the ones with just a single row (modified appropriately for block sampling). Nope, it's definitely proportional. As a simple example, a sample of 500 rows in a table of 1000 rows should yield stats estimates with 90%+ accuracy. But a sample of 500 rows in a 600,000,000 row table is so small as to be nearly useless; it's quite possible to get all the same value in a random sample of 0.1% even on a column with a D/N of 0.001. If you look at the papers cited, almost all researchers more recent than Chaudhuri use a proportional sample size. To be clear, Josh is talking specifically about the estimate of how many distinct values a query will see, not the more usual estimate of how many records the query will see. For estimating how many records a query like SELECT * FROM tab WHERE x BETWEEN ? AND ? will see, the fixed-size sample is on fairly solid ground. A sample of 600 gives (iirc) +/- 2% 19 times out of 20. That's the same sample size most major opinion polls use... However this same logic doesn't work for estimating distinct values. Since a single occurrence of a distinct value is just as important as hundreds of occurrences, and your chance of finding that single occurrence is proportional to what percentage of the overall table you sample, to maintain a given accuracy you're going to have to maintain a sample that is a fixed percentage of the overall table. Worse, my recollection from the paper I mentioned earlier was that sampling small percentages like 3-5% didn't get you acceptable accuracy. Before you got anything reliable you found you were sampling very large percentages of the table.
And note that if you have to sample anything over 10-20% you may as well just read the whole table. Random access reads are that much slower. -- greg
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Greg, We *currently* use a block based sampling algorithm that addresses this issue by taking care to select rows within the selected blocks in an unbiased way. You were proposing reading *all* the records from the selected blocks, which throws away that feature. The block-based algorithms have specific math to cope with this. Stuff which is better grounded in statistical analysis than our code. Please read the papers before you judge the solution. Worse, my recollection from the paper I mentioned earlier was that sampling small percentages like 3-5% didn't get you an acceptable accuracy. Before you got anything reliable you found you were sampling very large percentages of the table. And note that if you have to sample anything over 10-20% you may as well just read the whole table. Random access reads are that much slower. Right, which is why researchers are currently looking for something better. The Brutlag & Richardson paper claims to be able to produce estimates which are within +/- 3x, 90% of the time, using a 5% sample, which is far better than our current accuracy. Nobody claims to be able to estimate based on 0.1% of the table, which is what Postgres tries to do for large tables. 5% based on block-based sampling is reasonable; that means a straight 5% of the on-disk size of the table, so 5gb for a 100gb table. With random-row sampling, that would require as much as 25% of the table, making it easier to just scan the whole thing. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Thu, Jan 05, 2006 at 10:12:29AM -0500, Greg Stark wrote: Worse, my recollection from the paper I mentioned earlier was that sampling small percentages like 3-5% didn't get you an acceptable accuracy. Before you got anything reliable you found you were sampling very large percentages of the table. And note that if you have to sample anything over 10-20% you may as well just read the whole table. Random access reads are that much slower. If I'm reading backend/commands/analyze.c right, the heap is accessed linearly, only reading blocks that get selected but reading them in heap order, which shouldn't be anywhere near as bad as random access. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Josh Berkus josh@agliodbs.com writes: Greg, We *currently* use a block based sampling algorithm that addresses this issue by taking care to select rows within the selected blocks in an unbiased way. You were proposing reading *all* the records from the selected blocks, which throws away that feature. The block-based algorithms have specific math to cope with this. Stuff which is better grounded in statistical analysis than our code. Please read the papers before you judge the solution. I'm confused since Postgres's current setup *was* based on papers on just this topic. I haven't read any of the papers myself, I'm just repeating what was previously discussed when the current block sampling code went in. There was extensive discussion of the pros and cons of different algorithms taken from various papers and how they related to Postgres. I don't recall any of them suggesting that there was any way to take a sample which included every row from a sampled block and then somehow unbias the sample after the fact. Right, which is why researchers are currently looking for something better. The Brutlag Richardson claims to be able to produce estimates which are within +/- 3x 90% of the time using a 5% sample, which is far better than our current accuracy. Nobody claims to be able to estimate based on 0.1% of the table, which is what Postgres tries to do for large tables. 5% based on block-based sampling is reasonable; that means a straight 5% of the on-disk size of the table, so 5gb for a 100gb table. With random-row sampling, that would require as much as 25% of the table, making it easier to just scan the whole thing. Postgres's current sample sizes are clearly geared towards the histograms where they are entirely realistic. All of the distinct estimates are clearly just ad hoc attempts based on the existing sampling. 
Is a mechanism that is only 5x faster than reading the whole table (assuming random_page_cost of 4) and is off by more than a factor of three 10% of the time really worth it? -- greg
[HACKERS] Improving N-Distinct estimation by ANALYZE
Improving N-Distinct estimation
===============================
v1.1

OBJECTIVES

Answer these points...
- Are there any performance issues that can be directly attributed to mis-estimation of N-Distinct (D) by the ANALYZE command?
- If so, can we do better than we currently achieve? How?
- Would altering the estimate of D cause problems in other places?

Comments are sought on the problem and the possible solutions. Please get involved if you can help with detailed analyses on this topic.

SUMMARY

The estimation of D is difficult and imprecise. The current method works well in many cases, yet breaks down *badly* in one particular very common use case of database design: a large dependent table with a multi-column Primary Key. In some cases the estimate of D *decreases* as the size of the table increases, though the estimate of D is an underestimate in almost all cases, whatever the table design. PostgreSQL cvstip currently seriously underestimates D for very large dependent tables with rows that are clustered around one of the columns of a multi-column Primary Key. The mis-estimation identified can lead to poor system performance should the mis-estimation lead to the use of HashAggregate or general mis-selection of optimal join plans. An example of this is the orders and lineitem tables from DBT-3/TPC-H, which have a 1:M relationship. There are M lineitems associated with each order and all are inserted in one transaction into lineitem. If M is relatively large, then problems may ensue. A problem SQL statement may be something like:

SELECT l_orderkey, sum(l_extendedprice) FROM lineitem GROUP BY l_orderkey;

This issue could have an impact on any large table in a 1:M relationship where the actual number of distinct values, D, is much larger than the sample size, n (D >> n). This can also affect associative tables where more than one 1:M relationship exists to separate tables, such as Fact tables in a star schema.
The issue is alleviated by setting the column statistics target higher, though this merely increases the size range of the table over which problems may occur. There are a number of ways we can improve on the current estimates, using techniques suggested in later statistical research. Some proposals are made and comments are sought on the problem and the possible solutions. It is possible that we need more than one estimate of D for various purposes. We might potentially need a low estimate of D for use in join planning, whereas a higher estimate may be needed to reduce the risk of hash table operations. This approach might be taken initially to allow us to implement improved estimators without throwing out many previously good plans.

WHAT WE CURRENTLY DO WITH ANALYZE

Notation
D = estimate of the number of distinct values (aka n_distinct)
N = number of rows in table
n = number of rows in sample

Sampling method
* Fixed sample size, no matter how big the table, following Chaudhuri et al's 1998 paper on sample size sufficiency for statistics histograms.
* Sample blocks = sample rows = 300 * col stats target

Results
* Count rows/value for all values observed in sample:
  f1 = number of unique values in sample
  d = number of distinct values in sample
* If f1 == n => assume unique: D = N, and scale with N
  else if f1 == 0 => assume D = d
  else => apply the Haas-Stokes [1998] estimator
* If D > 10% of N => scale with N

[There are a variety of techniques selected from Haas-Stokes [1998], Chaudhuri et al [1998], Vitter and Knuth. Sometimes these authors have discussed the same subject and come up with different answers, so you need to be careful to say which reference you mean when discussing these.]

ISSUES

1. Estimation of D; mentioned above and covered in more detail below. (see ESTIMATES OF D FOR DEPENDENT TABLES)

2. The sample size calculation correctly follows Chaudhuri et al [1998] when the number of rows in the table is < 1 million. However, smaller tables are overestimated and larger tables are underestimated.
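[The Results branching above can be sketched as follows. This is a simplified model of the logic in commands/analyze.c, not the code itself; the final branch is the Duj1 form of the Haas-Stokes estimator:]

```c
/* Sketch of the D-estimation branch described above.
 *   N  = rows in table        n  = rows in sample
 *   d  = distinct values seen in sample
 *   f1 = values seen exactly once in sample */
double estimate_ndistinct(double N, double n, double d, double f1)
{
    if (f1 >= n)
        return N;   /* every sampled row distinct: assume column unique */
    if (f1 <= 0)
        return d;   /* no singletons: assume the sample saw every value */
    /* otherwise the Duj1 form of the Haas-Stokes estimator */
    return (n * d) / (n - f1 + f1 * n / N);
}
```

Note how the singleton count f1 dominates the denominator: a sample full of once-seen values pushes the estimate up toward N, while a sample with no singletons collapses it to d.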
The sample size should be multiplied by 2.3 (i.e. ln(10)) for every x10 larger table size. e.g. a 100 million row table requires a sample size 4.6 times larger to have the same accuracy for histogram selection.

OBSERVATIONS

1. *All* methods of statistical analysis are improved by larger sample fractions. The D estimator method currently in use shows an optimum of accuracy and sample fraction at around 5% of a table, as shown in the author's original paper [Haas Stokes (1998)]. The current implementation's error rates climb higher as table size increases.

2. In terms of I/O, ANALYZE is relatively inefficient, since it uses a row sampling technique rather than a block sampling technique. This would translate directly into a performance drop at large sample ratios, but since we currently use a fixed sample size this problem is not yet visible for larger tables. With a 2GB table, we would typically sample 1% of the blocks, yet around 0.025 - 0.05% of the rows.

3. Large values of statistics target (up to 1000) could cause a number of problems with statistics catalog table growth (mentioned on -perform). Setting these values appropriately can take significant effort. Automatic scaling of such parameters is desirable.
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Simon Riggs [EMAIL PROTECTED] writes: [ ... a large amount of analysis based on exactly one test case ... ] I think you are putting too much emphasis on fixing one case and not enough on considering what may happen in other cases ... In general, estimating n-distinct from a sample is just plain a hard problem, and it's probably foolish to suppose we'll ever be able to do it robustly. What we need is to minimize the impact when we get it wrong. So I agree with the comment that we need to finish the unfinished project of making HashAggregate tables expansible, but I'm dubious about the rest of this. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Tom, In general, estimating n-distinct from a sample is just plain a hard problem, and it's probably foolish to suppose we'll ever be able to do it robustly. What we need is to minimize the impact when we get it wrong. Well, I think it's pretty well proven that to be accurate at all you need to be able to sample at least 5%, even if some users choose to sample less. Also I don't think anyone on this list disputes that the current algorithm is very inaccurate for large tables. Or do they? While I don't think that we can estimate N-distinct completely accurately, I do think that we can get within +/- 5x for 80-90% of all cases, instead of 40-50% of cases like now. We can't be perfectly accurate, but we can be *more* accurate. -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Wed, Jan 04, 2006 at 07:10:29PM +0000, Simon Riggs wrote: 3. We should also apply multi-column heuristics to the estimation of D, once we have estimated all columns. For column groups (pairs, triples etc) that form part of a PK, we know that it must be true that D1 * D2 * ... * Dk >= N. In many cases we will be very confident of our estimate of D when we decide D = d. i.e. when we have two columns, we can use this to infer that D1 >= N/d when D2 = d. So we can do this in any case where we have confident estimates of all but one column; the required information is available at that time. e.g. if line_item has primary key ( l_orderkey, l_linenumber ) and we know that there are at most 10 l_linenumber values in the table, then there should be at least N/10 values for l_orderkey, so set it to that if it is lower (only). Sorry if I'm pointing out the obvious, but I would do this for any unique index, not just a PK. (It should still hold for any unique index, right?) Also, was an approach of sampling random rows within random blocks considered? Something like:

until row sample size reached
    read random block
    sample x% of rows in that block randomly
done

-- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
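[The unique-key inference above amounts to a one-line clamp; a sketch with illustrative names, for the two-column case:]

```c
/* For a multi-column unique key, D1 * D2 * ... * Dk >= N must hold, so
 * a confident estimate d2 for one column of a two-column key puts a
 * floor of N/d2 under the other column's distinct estimate.  Raises
 * d1_estimate to that floor only if it is lower (as proposed above). */
double clamp_by_unique_key(double d1_estimate, double d2, double N)
{
    double floor_d1 = N / d2;

    return (d1_estimate < floor_d1) ? floor_d1 : d1_estimate;
}
```

In the lineitem example: with N rows and at most 10 l_linenumber values, an l_orderkey estimate below N/10 would be raised to N/10, which directly counters the underestimation described in the proposal.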
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Simon, - Are there any performance issues that can be directly attributed to mis-estimation of N-Distinct (D) by the ANALYZE command? Yes. There's at least one query (maybe two) from TPC-H which bombs because of bad N-distinct estimation, even with stats_target =1000. Based on my experience with data warehouses, this also occurs in the field. - If so, can we do better than we currently achieve? How? Replace the current algorithm and broaden the sample. - Would altering the estimate of D cause problems in other places? Unlike Index Cost Estimation, I wouldn't expect it to. We make pretty proper use of D right now, it's just that for some common cases our estimates of D are bad. The estimation of D is difficult and imprecise. The current method works well in many cases, yet breaks down *badly* in one particular very common use case of database design: a large dependent table with a multi-column Primary Key. In some cases the estimate of D *decreases* as the size of the table increases, though the estimate of D is an underestimate in almost all cases, whatever the table design. Actually, the current estimator underestimates D for *any* large table's high-cardinality columns, primary key, multi-column, or not. Chaudhuri's calculation seems to be designed to yield the smallest number of cardinal values that could reasonably be expected to yield the provided sample. That is, if the estimate range within a stdev of 2.0 gives from 50,000 to 500,000 distinct values, Chaudhuri picks 50,000. This conservative approach makes sense when you're only considering join strategies. That is, given an unreliable estimate you want to estimate D low so that you don't wrongly choose a nested loop, the cost for which mistake being much higher than the cost of performing an unnecessary hash join. It's conservative in that sense. However, PostgreSQL now has a whole set of hash operations and other query types for which a too-low estimate of D causes query lockup. 
So for these operations, Chaudhuri ceases to be conservative and becomes high-risk. FWIW, my testing with TPCH showed that estimate error is usually OK within +/- 5x. Beyond that any you start to get bad query plans. (yes, I know all of the above begs examples. I'm swamped. I believe I posted examples when I first started talking about n-distinct estimation a year ago) So I think it's vital that we look at algorithms designed to deliver us the median estimated D, not the lowest reasonable, in addition to increasing sample size. The block-based estimator functions which Andrew and I looked at seem designed to do that provided a sample of between 1% and 10%. 1. *All* methods of statistical analysis are improved by larger sample fractions. The D estimator method currently in use shows an optimum of accuracy and sample fraction at around 5% of a table, as shown in the author's original paper [Haas Stokes (1998)]. The current implementation's error rates climb higher as table size increases. I read 5 different papers on ACM about sampling D. All of them were united in saying that you couldn't get even slightly accurate estimates with less than 3% sampling. 2. In terms of I/O, ANALYZE is relatively inefficient, since it uses a row sampling technique rather than a block sampling technique. This would translate directly into a performance drop from large sample ratios, but since we currently use a fixed sample size this problem is not yet visible for larger tables. With a 2GB table, we would typically sample 1% of the blocks, yet around 0.025 - 0.05% of the rows. This woudl be a reason to use block-sampling ONLY, rather than hybrid sampling. 3. Large values of statistics target (up to 1000) could cause a number of problems with statistics catalog table growth (mentioned on -perform). Setting these values appropriately can take significant effort. Automatic scaling of such parameters is desirable. 
Well, decoupling the n-distinct sample size from the # of heuristics rows would fix that. 4. ANALYZE doesn't use more memory if maintenance_work_mem is set high, nor does it use less if it is set low. Actual memory usage isn't measured at all. With very long rows this is max 24 MB with default stats target settings and BLCKSZ, or 2.4 GB with highest stats target (1000). This probably is of lower importance since stats targets are only usually set higher on larger databases, which typically have larger memory configurations anyway - and very long rows are uncommon because of TOAST. Typical memory usage by ANALYZE would be 1 MB with default settings i.e. maybe as low as a 0.01% sample for a very large table. Yeah, I think I pointed out that ANALYZE doesn't seem to be using maintenance_mem a year ago. It should, although there should be options to use less than maintenance_mem if the user wants low-impact analyze. There is a further effect of concern here. We currently apply a lift
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Wed, 2006-01-04 at 14:49 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: [ ... a large amount of analysis based on exactly one test case ... ] [Hmmm, those are your opinions, not my words. Funny guy ;-) ] The one test case just happens to be a very common 1:M relationship, an example of which occurs within the TPC-H/DBT-3 test database. I picked that so it was an obvious publicly accessible test case, rather than an isolated customer issue that couldn't be aired fully. I was hoping to allow the problem to be explored and improved. I think you are putting too much emphasis on fixing one case and not enough on considering what may happen in other cases ... It's not just one problem, but knowing that you would accept only a detailed analysis, I researched that one case so that it was indisputable. I'm dubious about the rest of this. Excellent. Much better than It's Wrong. I'll write some code and run some tests. Thanks for reading it all; sorry it had to be so long. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Josh Berkus josh@agliodbs.com writes: Tom, In general, estimating n-distinct from a sample is just plain a hard problem, and it's probably foolish to suppose we'll ever be able to do it robustly. What we need is to minimize the impact when we get it wrong. Well, I think it's pretty well proven that to be accurate at all you need to be able to sample at least 5%, even if some users choose to sample less. Also I don't think anyone on this list disputes that the current algorithm is very inaccurate for large tables. Or do they? I think it's worse than that. It's proven that to get any kind of accuracy in this kind of estimate you basically have to look at the entire table. Someone posted a URL to a paper that had a very clever way of estimating distinct values. It required scanning the entire table, but it only kept a sample of the values found, using a method that guaranteed the sample was representative not of the entire table but of the distinct values. I think you're right that a reasonable sample size for this kind of estimate is going to be proportional to the table size, not the constant-sized sample that regular statistics need. However, that same paper has some preliminary verbiage about how previous papers had found that the sample sizes needed were unacceptably large - something like 90%. Unfortunately I can't find the reference to the paper at the moment; I'll look again later. I think it would be interesting to get the sampling algorithm from that paper in. It would only be usable during a VACUUM ANALYZE, but currently VACUUMs are necessary anyway. I do wonder what will happen when we get incremental VACUUMs and the bitmaps to avoid vacuuming pages unnecessarily, though. Then it would be less useful. -- greg
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Wed, 2006-01-04 at 17:57 -0600, Jim C. Nasby wrote: On Wed, Jan 04, 2006 at 07:10:29PM +, Simon Riggs wrote: 3. We should also apply multi-column heuristics to the estimation of D, once we have estimated all columns. For column groups (pairs, triples etc) that form part of a PK, we know that it must be true that D1 * D2 * ... * Dk >= N. In many cases we will be very confident of our estimate of D when we decide D = d. i.e. when we have two columns, we can use this to infer that D1 >= N/d when D2 = d. So we can do this in any case where we have confident estimates of all but one column; the required information is available at that time. e.g. if line_item has primary key ( l_orderkey, l_linenumber ) and we know that there are at most 10 l_linenumber values in the table, then there should be at least N/10 values for l_orderkey, so set the estimate to that if it is lower (only). Sorry if I'm pointing out the obvious, but I would do this for any unique index, not just a PK. (It should still hold for any unique index, right?) Yes. It's just a less common case to have more than one unique index. Also, was an approach of sampling random rows within random blocks considered? Something like: until row sample size reached: read a random block; sample x% of rows in that block randomly; done. Yes. I was trying to maintain the existing approach as much as possible, rather than ripping everything out and starting again, which could cause just as many problems as it solves. So evolution, rather than revolution. The approach I suggested uses the existing technique for selecting random blocks, then either an exhaustive check on all of the rows in a block or the existing random row approach, depending upon available memory. We need to check all of the rows in a reasonable sample of blocks, otherwise we might miss clusters of rows in large tables - which is the source of the problems identified. The other reason was to increase the sample size, which is a win in any form of statistics.
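The inference Simon describes fits in a few lines. A hypothetical helper (not PostgreSQL code), assuming we hold the table's row count N and confident distinct estimates for all but one column of a unique key:

```python
import math

def clamp_ndistinct(n_rows, known_d, estimated_d):
    """Lower-bound the remaining column's n-distinct estimate for a unique key.

    For a unique (multi-column) key, D1 * D2 * ... * Dk >= N must hold, so the
    one remaining column must have at least N / (product of the other columns'
    D) distinct values. Raise the estimate to that floor if it is lower (only).
    """
    floor = math.ceil(n_rows / math.prod(known_d))
    return max(estimated_d, floor)

# line_item example: PK (l_orderkey, l_linenumber), N = 6,000,000 rows,
# at most 10 l_linenumber values -> l_orderkey must have >= 600,000 values,
# so an underestimate of 150,000 gets raised to 600,000.
print(clamp_ndistinct(6_000_000, [10], estimated_d=150_000))
```

Because the rule only ever raises an estimate to a provable floor, it can never make an estimate worse, which is what makes it safe to apply for any unique index.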
Best Regards, Simon Riggs
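Jim's pseudocode loop above can be fleshed out. A sketch only - `read_block`, the block count, and the x% fraction are stand-ins for the real I/O layer and tuning knob, not PostgreSQL's implementation:

```python
import random

def two_stage_sample(num_blocks, read_block, target_rows,
                     rows_per_block_pct=0.25, seed=0):
    """Sample random rows within random blocks until the target is reached.

    read_block(i) is assumed to return the list of rows in block i.
    Each iteration picks a random block, then keeps a random x% of its rows.
    """
    rng = random.Random(seed)
    sample = []
    while len(sample) < target_rows:
        rows = read_block(rng.randrange(num_blocks))    # random block
        k = max(1, int(len(rows) * rows_per_block_pct))
        sample.extend(rng.sample(rows, k))              # x% of its rows, randomly
    return sample[:target_rows]

# Toy usage: 100 "blocks" of 80 rows each, sample 500 rows.
blocks = [[(b, r) for r in range(80)] for b in range(100)]
s = two_stage_sample(100, lambda i: blocks[i], 500)
print(len(s))
```

The appeal of this shape is I/O efficiency: the number of blocks read is proportional to the sample size divided by x%, rather than (as with pure row sampling) roughly one block per sampled row.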
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Wed, 2006-01-04 at 19:22 -0500, Greg Stark wrote: I think you're right that a reasonable sample size for this kind of estimate is going to be proportional to the table size, not the constant sized sample that regular statistics need. Agreed [I said exactly that in April]; the counter-argument at that time was that proportional samples on large tables lead to external sorts in many cases, which would lead to unacceptably long run times - since we need to sort the values for each attribute in turn. I've proposed limiting ourselves to maintenance_work_mem (I credit Josh with that idea from April). If you can allocate 1 GB of memory to an ANALYZE then this would be a very large proportion of a medium sized table (or partition). Considering how few rows we sample at the moment, increasing the actual sample size by many thousands of rows would have a very beneficial effect. On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote: This one looks *really* good. http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf It does require a single full table scan. The Distinct Sampling approach you mention would scan the whole table and also use large in-memory hash tables. So it uses more I/O, the same memory and probably less CPU - no sorting required. The technique is the same implementation as a HashAgg, just one that loses rows in a predictable manner when it spills out of memory. It doesn't identify columns that scale with N, nor does it calculate correlation. That's the same as re-writing count(DISTINCT) to use hashing, which is a TODO item. So perhaps you could plan the code to do the Distinct Sampling approach at the same time. Hmmm. I'll think about that. Best Regards, Simon Riggs
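The core of the Distinct Sampling idea can be sketched compactly: hash every value during the full scan, keep only the distinct values whose hash falls below a threshold, and halve the threshold (pruning predictably) whenever the in-memory set overflows - exactly the "HashAgg that loses rows when it spills" behaviour Simon describes. A simplified sketch, not the paper's full algorithm:

```python
import hashlib

def _h(value):
    """Hash a value to a uniform fraction in [0, 1)."""
    digest = hashlib.sha256(repr(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64

def distinct_sample_estimate(values, max_kept=1024):
    """One-pass n-distinct estimate with bounded memory.

    Keep only the distinct values whose hash falls below a threshold; when
    the kept set overflows max_kept, halve the threshold and prune.
    Final estimate = kept count / threshold.
    """
    threshold = 1.0
    kept = set()
    for v in values:
        if _h(v) < threshold:
            kept.add(v)
            while len(kept) > max_kept:   # "spill": shrink predictably
                threshold /= 2.0
                kept = {x for x in kept if _h(x) < threshold}
    return int(len(kept) / threshold)

# Toy usage: 200,000 rows drawn from 20,000 distinct keys.
rows = [i % 20_000 for i in range(200_000)]
print(distinct_sample_estimate(rows))  # close to 20,000
```

Because the kept set is a hash-uniform subsample of the *distinct* values rather than of the rows, heavy duplication costs nothing extra, which is why this works where row sampling fails.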
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Sorry to interrupt. The discussion is interesting, but I need some help to follow along. On Wednesday 2006-01-04 17:07, Josh Berkus wrote: Simon, - Are there any performance issues that can be directly attributed to mis-estimation of N-Distinct (D) by the ANALYZE command? Yes. There's at least one query (maybe two) from TPC-H which bombs because of bad N-distinct estimation, even with stats_target = 1000. Based on my experience with data warehouses, this also occurs in the field. - If so, can we do better than we currently achieve? How? Replace the current algorithm and broaden the sample. Is "replace the algorithm" the same as saying "contextually use some estimate of D that is not Chaudhuri"? - Would altering the estimate of D cause problems in other places? Unlike Index Cost Estimation, I wouldn't expect it to. We make pretty proper use of D right now; it's just that for some common cases our estimates of D are bad. The estimation of D is difficult and imprecise. The current method works well in many cases, yet breaks down *badly* in one particular very common use case of database design: a large dependent table with a multi-column Primary Key. In some cases the estimate of D *decreases* as the size of the table increases, though the estimate of D is an underestimate in almost all cases, whatever the table design. Actually, the current estimator underestimates D for *any* large table's high-cardinality columns, primary key, multi-column, or not. Chaudhuri's calculation seems to be designed to yield the smallest number of cardinal values that could reasonably be expected to yield the provided sample. That is, if the estimate range within a stdev of 2.0 gives from 50,000 to 500,000 distinct values, Chaudhuri picks 50,000. This conservative approach makes sense when you're only considering join strategies.
That is, given an unreliable estimate you want to estimate D low so that you don't wrongly choose a nested loop, the cost of that mistake being much higher than the cost of performing an unnecessary hash join. It's conservative in that sense. So Chaudhuri's estimate of D is appropriate (and is working) when making decisions about joins. However, PostgreSQL now has a whole set of hash operations and other query types for which a too-low estimate of D causes query lockup. Why? So for these operations, Chaudhuri ceases to be conservative and becomes high-risk. FWIW, my testing with TPC-H showed that estimate error is usually OK within +/- 5x; beyond that you start to get bad query plans. (Yes, I know all of the above begs for examples. I'm swamped. I believe I posted examples when I first started talking about n-distinct estimation a year ago.) So I think it's vital that we look at algorithms designed to deliver us the median estimated D, not the lowest reasonable one, in addition to increasing sample size. The block-based estimator functions which Andrew and I looked at seem designed to do that, provided a sample of between 1% and 10%. Do you *really* want the median estimate in these cases? Are you certain you do not want something with the opposite behavior of Chaudhuri's estimate, so that for small sample sizes the bias is toward a high estimate of D? (Converges on D from the right instead of the left: Chaudhuri's estimate --> D <-- needed estimate.) 1. *All* methods of statistical analysis are improved by larger sample fractions. The D estimator method currently in use shows an optimum of accuracy and sample fraction at around 5% of a table, as shown in the authors' original paper [Haas & Stokes (1998)]. The current implementation's error rates climb higher as table size increases. I read 5 different papers on ACM about sampling D. All of them were united in saying that you couldn't get even slightly accurate estimates with less than 3% sampling.
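For reference, the estimator family under discussion fits on a few lines. A sketch of the Haas & Stokes (1998) Duj1 smoothed-jackknife estimator (the formula PostgreSQL's analyze.c uses for n-distinct); the sample compositions below are illustrative only:

```python
def duj1_estimate(n_sample, n_total, value_counts):
    """Haas & Stokes (1998) Duj1 estimator of the number of distinct values.

    n_sample: rows in the sample; n_total: rows in the whole table;
    value_counts: multiplicity of each distinct value within the sample.
    D_hat = n*d / (n - f1 + f1*(n/N)), with d = distinct values in the
    sample and f1 = values seen exactly once.
    """
    d = len(value_counts)
    f1 = sum(1 for c in value_counts if c == 1)
    return (n_sample * d) / (n_sample - f1 + f1 * n_sample / n_total)

# An all-unique sample is the degenerate case: f1 == d == n_sample, and the
# estimate collapses to N itself - a tiny sample carries almost no information
# about whether the column is truly unique or merely high-cardinality.
print(duj1_estimate(3000, 1_000_000, [1] * 3000))              # 1000000.0

# A mildly skewed sample (2000 singletons, 500 values seen twice) drops the
# estimate dramatically - the underestimation behaviour complained about here.
print(duj1_estimate(3000, 1_000_000, [1] * 2000 + [2] * 500))  # ~7455
```

The sensitivity of the denominator to f1 is why clusters of duplicates missed (or found) by the sampler swing the estimate so violently on large tables.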
These statements are at odds with my admittedly basic understanding of statistics. Isn't the power of a sample more related to the absolute size of the sample than to the sample as a fraction of the population? Why not just pick a smallish sample size, say about 3000, and apply it to all the tables, even the ones with just a single row (modified appropriately for block sampling)? 2. In terms of I/O, ANALYZE is relatively inefficient, since it uses a row sampling technique rather than a block sampling technique. This would translate directly into a performance drop from large sample ratios, but since we currently use a fixed sample size this problem is not yet visible for larger tables. With a 2 GB table, we would typically sample 1% of the blocks, yet around 0.025 - 0.05% of the rows. This would be a reason to use block sampling ONLY, rather than hybrid sampling. 3. Large values of statistics target (up to 1000) could cause a number of problems with statistics catalog table growth (mentioned on
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Simon Riggs [EMAIL PROTECTED] writes: The approach I suggested uses the existing technique for selecting random blocks, then either an exhaustive check on all of the rows in a block or the existing random row approach, depending upon available memory. We need to check all of the rows in a reasonable sample of blocks, otherwise we might miss clusters of rows in large tables - which is the source of the problems identified. The other reason was to increase the sample size, which is a win in any form of statistics. Only if your sample is random and independent. The existing mechanism tries fairly hard to ensure that every record has an equal chance of being selected. If you read the entire block and not appropriate samples then you'll introduce systematic sampling errors. For example, if you read an entire block you'll be biasing towards smaller records. I think it would be useful to have a knob to increase the sample size separately from the knob for the amount of data retained in the statistics tables. Though I think you'll be disappointed and find you have to read an unreasonably large sample out of the table before you get more useful distinct estimates. Certainly it's worth testing this in a low-impact way, like just keeping the existing sample method and dialing up the sample sizes, before you try anything that would sacrifice the statistical validity of the more solid estimates. -- greg
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Trent, Sorry to interrupt. The discussion is interesting, but I need some help to follow along. Thought-out commentary is welcome. Is "replace the algorithm" the same as saying "contextually use some estimate of D that is not Chaudhuri"? Yes. I favor a block-based approach like Brutlag's, largely because it allows us to increase the sample size without dramatically increasing I/O. So Chaudhuri's estimate of D is appropriate (and is working) when making decisions about joins. Some kinds of joins. It avoids, for example, risky use of nested loops. However, PostgreSQL now has a whole set of hash operations and other query types for which a too-low estimate of D causes query lockup. Why? Two specific examples, both of which I've encountered in the field: 1) too-low D will cause an aggregate query to use a hashagg which is larger than memory, resulting in swapping (or disk spill when it's fixed), which makes the query very much slower than if the hashagg was not used. 2) much too-low D will cause the planner to pick a seq scan when it's not necessary, resulting in increased query times. Do you *really* want the median estimate in these cases? Are you certain you do not want something with the opposite behavior of Chaudhuri's estimate, so that for small sample sizes the bias is toward a high estimate of D? (Converges on D from the right instead of the left: Chaudhuri's estimate --> D <-- needed estimate.) Hmmm. Yeah, I see what you mean. True, the ideal approach would be to determine for each query operation whether a too-low D or a too-high D was more risky, and then use the more conservative number. However, that would complicate the query planner enough that I think Tom would leave us. :-p These statements are at odds with my admittedly basic understanding of statistics. Isn't the power of a sample more related to the absolute size of the sample than the sample as a fraction of the population?
Why not just pick a smallish sample size, say about 3000, and apply it to all the tables, even the ones with just a single row (modified appropriately for block sampling)? Nope, it's definitely proportional. As a simple example, a sample of 500 rows in a table of 1000 rows should yield stats estimates with 90%+ accuracy. But a sample of 500 rows in a 600,000,000 row table is so small as to be nearly useless; it's quite possible to get all the same value in a random sample of 0.1% even on a column with a D/N of 0.001. If you look at the papers cited, almost all researchers more recent than Chaudhuri use a proportional sample size. And we're taking the fixed-sample-size approach now, and it's not working very well for large tables. --Josh Berkus
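Josh's proportionality point can be illustrated with a toy simulation: the same 500-row sample that characterizes a 1,000-row table well sees essentially nothing but singletons in a 600M-row table, so it cannot separate high-cardinality columns from truly unique ones. Purely illustrative code, not ANALYZE's actual sampling:

```python
import random

def sample_distinct(n_total, n_distinct, n_sample, seed=0):
    """Distinct values seen in a uniform sample of n_sample of n_total rows,
    where the column cycles through n_distinct equally common values."""
    rng = random.Random(seed)
    rows = (rng.randrange(n_total) for _ in range(n_sample))
    return len({r % n_distinct for r in rows})

# 500 of 1,000 rows, D = 100: the sample sees nearly every distinct value,
# so almost any estimator recovers D.
print(sample_distinct(1_000, 100, 500))               # typically 99-100

# 500 of 600,000,000 rows, D/N = 0.1 (D = 60M): nearly every sampled value
# is a singleton - indistinguishable from a unique column.
print(sample_distinct(600_000_000, 60_000_000, 500))  # typically 500
```

In the second case the sample carries almost no signal about D, which is why a fixed absolute sample size fails on large tables even though it works for histogram statistics.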
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Greg, Only if your sample is random and independent. The existing mechanism tries fairly hard to ensure that every record has an equal chance of being selected. If you read the entire block and not appropriate samples then you'll introduce systematic sampling errors. For example, if you read an entire block you'll be biasing towards smaller records. Did you read any of the papers on block-based sampling? These sorts of issues are specifically addressed in the algorithms. --Josh
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Folks, Nope, it's definitely proportional. As a simple example, a sample of 500 rows in a table of 1000 rows should yield stats estimates with 90%+ accuracy. But a sample of 500 rows in a 600,000,000 row table is so small as to be nearly useless; it's quite possible to get all the same value in a random sample of 0.1% even on a column with a D/N of 0.001. I meant a D/N of 0.1. Sorry. --Josh