Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Josh Berkus wrote: Mischa, Okay, although given the track record of page-based sampling for n-distinct, it's a bit like looking for your keys under the streetlight, rather than in the alley where you dropped them :-) Bad analogy, but funny. The issue with page-based vs. pure random sampling is that to do, for example, 10% of rows purely randomly would actually mean loading 50% of pages. With 20% of rows, you might as well scan the whole table. Unless, of course, we use indexes for sampling, which seems like a *really good* idea to me

But doesn't an index only sample one column at a time, whereas with page-based sampling you can sample all of the columns at once? And not all columns would have indexes, though it could be assumed that if a column doesn't have an index, then it doesn't matter as much for calculations such as n_distinct. But if you had 5 indexed columns in your table, then doing it index-wise means you would have to make 5 passes instead of just one. Though I agree that page-based sampling is important for performance reasons. John =:->
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Hi, Josh, Josh Berkus wrote: Yes, actually. We need 3 different estimation methods: 1 for tables where we can sample a large % of pages (say, >= 0.1) 1 for tables where we sample a small % of pages but are easily estimated 1 for tables which are not easily estimated but we can't afford to sample a large % of pages. If we're doing sampling-based estimation, I really don't want people to lose sight of the fact that page-based random sampling is much less expensive than row-based random sampling. We should really be focusing on methods which are page-based.

Would it make sense to have a sample method that scans indices? I think that, at least for tree-based indices (btree, gist), rather good estimates could be derived. And the presence of a unique index should lead to a 100% distinct values estimation without any scan at all. Markus
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Quoting Markus Schaber [EMAIL PROTECTED]: Hi, Josh, Josh Berkus wrote: Yes, actually. We need 3 different estimation methods: 1 for tables where we can sample a large % of pages (say, >= 0.1) 1 for tables where we sample a small % of pages but are easily estimated 1 for tables which are not easily estimated but we can't afford to sample a large % of pages. If we're doing sampling-based estimation, I really don't want people to lose sight of the fact that page-based random sampling is much less expensive than row-based random sampling. We should really be focusing on methods which are page-based.

Okay, although given the track record of page-based sampling for n-distinct, it's a bit like looking for your keys under the streetlight, rather than in the alley where you dropped them :-) How about applying the distinct-sampling filter on a small extra data stream to the stats collector? -- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Mischa, Okay, although given the track record of page-based sampling for n-distinct, it's a bit like looking for your keys under the streetlight, rather than in the alley where you dropped them :-) Bad analogy, but funny. The issue with page-based vs. pure random sampling is that to do, for example, 10% of rows purely randomly would actually mean loading 50% of pages. With 20% of rows, you might as well scan the whole table. Unless, of course, we use indexes for sampling, which seems like a *really good* idea to me -- --Josh Josh Berkus Aglio Database Solutions San Francisco
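[A rough sanity check of the row-sampling vs. page-I/O point above. Assuming rows are spread uniformly across pages, the expected fraction of pages touched by a random row sample is 1 - (1 - r)^rows_per_page; the rows_per_page values below are my own illustrative assumptions, not figures from the thread. With roughly 7 rows per page a 10% row sample touches about half the pages, matching the numbers Josh quotes, and with denser pages it approaches a full scan even sooner.]

def pages_touched_fraction(row_fraction, rows_per_page):
    # Chance a given page holds none of the sampled rows is (1 - r)^rows_per_page,
    # so the expected fraction of pages that must be read is the complement.
    return 1.0 - (1.0 - row_fraction) ** rows_per_page

for rows_per_page in (7, 20, 100):
    for row_fraction in (0.01, 0.10, 0.20):
        print(rows_per_page, row_fraction,
              round(pages_touched_fraction(row_fraction, rows_per_page), 3))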
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
John, But doesn't an index only sample one column at a time, whereas with page-based sampling, you can sample all of the columns at once. Hmmm. Yeah, we're not currently doing that though. Another good idea ... -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Quoting Josh Berkus josh@agliodbs.com: Mischa, Okay, although given the track record of page-based sampling for n-distinct, it's a bit like looking for your keys under the streetlight, rather than in the alley where you dropped them :-) Bad analogy, but funny.

Bad analogy? Page-sampling effort versus row-sampling effort, c'est moot. It's not good enough for stats to produce good behaviour on the average. Straight random sampling, page or row, is going to cause enough untrustworthy engine behaviour, for any %ages small enough to allow sampling from scratch at any time. I'm curious what the problem is with relying on a start-up plus incremental method, when the method in the distinct-sampling paper doesn't degenerate: you can start when the table is still empty. Constructing an index requires an initial full scan plus incremental update; what's the diff?

Unless, of course, we use indexes for sampling, which seems like a *really good* idea to me

distinct-sampling applies for indexes, too. I started tracking the discussion of this a bit late. A smart method for this is in VLDB'92: Gennady Antoshenkov, Random Sampling from Pseudo-ranked B+-trees. I don't think this is online anywhere, except if you have a DBLP membership. Does anybody else know better? Antoshenkov was the brains behind some of the really cool stuff in DEC Rdb (what eventually became Oracle). Compressed bitmap indices, parallel competing query plans, and smart handling of keys with hyperbolic distributions. -- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Quoting Josh Berkus josh@agliodbs.com: Perhaps I can save you some time (yes, I have a degree in Math). If I understand correctly, you're trying to extrapolate from the correlation between a tiny sample and a larger sample. Introducing the tiny sample into any decision can only produce a less accurate result than just taking the larger sample on its own; GIGO. Whether they are consistent with one another has no relationship to whether the larger sample correlates with the whole population. You can think of the tiny sample like anecdotal evidence for wonderdrugs. Actually, it's more to characterize how large of a sample we need. For example, if we sample 0.005 of disk pages, and get an estimate, and then sample another 0.005 of disk pages and get an estimate which is not even close to the first estimate, then we have an idea that this is a table which defies analysis based on small samples. Whereas if the two estimates are within 1.0 stdev of each other, we can have good confidence that the table is easily estimated. Note that this doesn't require progressively larger samples; any two samples would work.

We're sort of wandering away from the area where words are a good way to describe the problem. Lacking a common scratchpad to work with, could I suggest you talk to someone you consider has a background in stats, and have them draw for you why this doesn't work? About all you can get out of it is, if the two samples are disjunct by a stddev, yes, you've demonstrated that the union of the two populations has a larger stddev than either of them; but your two stddevs are less info than the stddev of the whole. Breaking your sample into two (or three, or four, ...) arbitrary pieces and looking at their stddevs just doesn't tell you any more than what you start with. -- Dreams come true, not free. -- S.Sondheim, ITW
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
First I will comment on my original idea. Second I will give another, improved suggestion (an idea). I hope that they will be useful for you. (I don't know whether the first one was useful at all, because it showed that I and some others of us are not very good with statistics :( ) I haven't looked at the PostgreSQL code, so I don't know what is possible now and what is not. I do know, without looking at the code, that a full table scan followed by incremental statistics changes would be a very big change.

I meant the following idea: compare two equal sized samples. Then redo the same thing with double sized samples. So do lots of unnecessary work. Check out the correlation of the two samples to try to guess the distribution. So I tried to give you an idea, not a full answer to the whole problem.

I did read some parts of the attached PDFs. They did convince me that the heuristics for the hard cases would actually read almost the whole table in many cases. I did cover the too-small-sample problem by stating that the user should be able to give the minimum size of samples. This way you would avoid the too-small-sampling problem. My purpose was not to achieve at most 5% wrong estimates, but to decrease the 2000% wrong estimates that are sometimes seen now.

Conclusions:
- No heuristics or similar thing based on small samples will grant excellent results.
- If you need excellent estimates, you need to process the whole table!
- Some special cases, like primary keys, unique indexes and special-case column types, do give easy ways to make estimates: for example, whether a boolean column has zero, one or two distinct values does not matter so much. Hashing seems the right choice for all of them.

If I have understood correctly, full table scans are out of the question for large tables at this time. The percentage idea of taking 10% samples seems good. So here is another suggestion:

1. Do a full percentage scan, starting at an arbitrary position. If the user's data is not homogeneous, this hurts it, but this way it is faster. During that scan, try to figure out all those columns that have at most 100 distinct values. Of course, with it you can't get to 100% accuracy, but if the full table scan is out of the question now, it is better if the accuracy is, for example, at most ten times wrong. You could also, instead of doing a single 10% partial table scan, do 20 pieces of 0.5 percent partial table scans: this would improve accuracy a bit, but keep the speed almost the same as the single partial table scan.

Here are questions for the statisticians for the distinct values calculation: If we want at most 1000% tolerance, how big a percentage of a table's column must be processed? If we want at most 500% tolerance, how big a percentage of a table's column must be processed? If we want at most 250% tolerance, how big a percentage of a table's column must be processed? Better to assume that there are at most 100 distinct values in a table, if it helps the calculations.

If we try to get as much as possible with one discontinuous partial table scan (0.1-10% sample), here is the information we can gather:
1. We could gather a histogram of at most 100 distinct values for each column.
2. We could measure variance and average, and the number of rows for these 100 distinct values.
3. We could count the number of rows that didn't match these 100 distinct values: they were left out of the histogram.
4. We could get a minimum and a maximum value for each column.
=> We could get exact information about the sample with one 0.1-10% pass for many columns. What can you statisticians gather from these values?

My idea is programmatic, combined with statistics:
+ Performance: scan for example 100 blocks, each of size 100MB, because disc I/O is much faster this way.
+ Enables a larger table percentage. I hope it helps with the statistics formula. Required for more robust statistics: take those blocks at random (non-overlapping) places to decrease the effect of hitting statistically bad parts of the table.
+ Fewer table scan passes: scan all columns with limited hashing in the first pass.
+ All easy columns are found here with one pass.
+- Harder columns need a pass of their own each, but we have some preliminary knowledge of them from the given sample after all (minimum and maximum values and the histogram of the 100 distinct values).

Marko Ristola

Greg Stark wrote: Dave Held [EMAIL PROTECTED] writes: Actually, it's more to characterize how large of a sample we need. For example, if we sample 0.005 of disk pages, and get an estimate, and then sample another 0.005 of disk pages and get an estimate which is not even close to the first estimate, then we have an idea that this is a table which defies analysis based on small samples. I buy that. Better yet is to use the entire sample you've gathered of .01 and
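[A minimal sketch of the "limited hashing in the first pass" idea Marko describes above; the 100-value limit is his, everything else here is my own illustration rather than proposed PostgreSQL code. Each column gets a small set of seen values; a column that overflows the limit is flagged as needing a more expensive second pass.]

LIMIT = 100

def scan_sample(rows, n_columns, limit=LIMIT):
    seen = [set() for _ in range(n_columns)]   # per-column distinct values so far
    overflowed = [False] * n_columns           # columns with more than `limit` distincts
    for row in rows:
        for col in range(n_columns):
            if overflowed[col]:
                continue                       # give up on "hard" columns in this pass
            seen[col].add(row[col])
            if len(seen[col]) > limit:
                seen[col].clear()
                overflowed[col] = True
    return seen, overflowed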
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Tom Lane [EMAIL PROTECTED] writes: Rod Taylor [EMAIL PROTECTED] writes: If when we have partitions, that'll be good enough. If partitions aren't available this would be quite painful to anyone with large tables -- much as the days of old used to be painful for ANALYZE. Yeah ... I am very un-enthused about these suggestions to make ANALYZE go back to doing a full scan ... Well one option would be to sample only a small number of records, but add the data found from those records to the existing statistics. This would make sense for a steady-state situation, but make it hard to recover from a drastic change in data distribution. I think in the case of n_distinct it would also bias the results towards underestimating n_distinct but perhaps that could be corrected for. But I'm unclear for what situation this is a concern. For most use cases users have to run vacuum occasionally. In those cases vacuum analyze would be no worse than a straight normal vacuum. Note that this algorithm doesn't require storing more data because of the large scan or performing large sorts per column. It's purely O(n) time and O(1) space. On the other hand, if you have tables you aren't vacuuming that means you perform zero updates or deletes. In which case some sort of incremental statistics updating would be a good solution. A better solution even than sampling. -- greg
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
On Tue, 2005-04-26 at 15:00 -0700, Gurmeet Manku wrote: 2. In a single scan, it is possible to estimate n_distinct by using a very simple algorithm: Distinct sampling for highly-accurate answers to distinct value queries and event reports by Gibbons, VLDB 2001. http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf That looks like the one... ...though it looks like it would take some more complex changes to the current algorithm to use it, and we want the other stats as well... 3. In fact, Gibbons' basic idea has been extended to sliding windows (this extension is useful in streaming systems like Aurora / Stream): Distributed streams algorithms for sliding windows by Gibbons and Tirthapura, SPAA 2002. http://home.eng.iastate.edu/~snt/research/tocs.pdf ...and this offers the possibility of calculating statistics at load time, as part of the COPY command. Best Regards, Simon Riggs
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote: This one looks *really* good. http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf It does require a single full table scan Ack.. Not by default please. I have a few large append-only tables (vacuum isn't necessary) which do need stats rebuilt periodically. Let's just say that we've been working hard to upgrade to 8.0 primarily because pg_dump was taking over 18 hours to make a backup. -- Rod Taylor [EMAIL PROTECTED]
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Hi everybody! Perhaps the following papers are relevant to the discussion here (their contact authors have been cc'd): 1. The following proposes effective algorithms for using block-level sampling for n_distinct estimation: Effective use of block-level sampling in statistics estimation by Chaudhuri, Das and Srivastava, SIGMOD 2004. http://www-db.stanford.edu/~usriv/papers/block-sampling.pdf 2. In a single scan, it is possible to estimate n_distinct by using a very simple algorithm: Distinct sampling for highly-accurate answers to distinct value queries and event reports by Gibbons, VLDB 2001. http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf 3. In fact, Gibbons' basic idea has been extended to sliding windows (this extension is useful in streaming systems like Aurora / Stream): Distributed streams algorithms for sliding windows by Gibbons and Tirthapura, SPAA 2002. http://home.eng.iastate.edu/~snt/research/tocs.pdf Thanks, Gurmeet Gurmeet Singh Manku Google Inc. http://www.cs.stanford.edu/~manku (650) 967 1890
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Mischa Sandberg wrote: Perhaps I can save you some time (yes, I have a degree in Math). If I understand correctly, you're trying to extrapolate from the correlation between a tiny sample and a larger sample. Introducing the tiny sample into any decision can only produce a less accurate result than just taking the larger sample on its own; GIGO. Whether they are consistent with one another has no relationship to whether the larger sample correlates with the whole population. You can think of the tiny sample like anecdotal evidence for wonderdrugs.

Ok, good point. I'm with Tom though in being very wary of solutions that require even one-off whole table scans. Maybe we need an additional per-table statistics setting which could specify the sample size, either as an absolute number or as a percentage of the table. It certainly seems that where D/N ~ 0.3, the estimates on very large tables at least are way way out. Or maybe we need to support more than one estimation method. Or both ;-) cheers andrew
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
-----Original Message----- From: Gurmeet Manku [...] 2. In a single scan, it is possible to estimate n_distinct by using a very simple algorithm: Distinct sampling for highly-accurate answers to distinct value queries and event reports by Gibbons, VLDB 2001. http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf [...] This paper looks the most promising, and isn't too different from what I suggested about collecting stats over the whole table continuously. What Gibbons does is give a hard upper bound on the sample size by using a logarithmic technique for storing sample information. His technique appears to offer very good error bounds and confidence intervals as shown by tests on synthetic and real data. I think it deserves a hard look from people hacking the estimator. __ David B. Held Software Engineer/Array Services Group
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
-----Original Message----- From: Greg Stark [...] Tom Lane [EMAIL PROTECTED] writes: Rod Taylor [EMAIL PROTECTED] writes: If when we have partitions, that'll be good enough. If partitions aren't available this would be quite painful to anyone with large tables -- much as the days of old used to be painful for ANALYZE. Yeah ... I am very un-enthused about these suggestions to make ANALYZE go back to doing a full scan ... I don't see why ANALYZE would always have to do a full scan. Clearly, this statistic would only be useful to people who need a very accurate n_distinct on tables where the current metric does not work very well. Applying a specialized solution to every table doesn't seem like an efficient way to go about things. Instead, the distinct sampling mechanism should be purely optional, and probably purely separate from the vanilla ANALYZE mechanism, because it works differently. If it were designed that way, the full table scan would be a one-time cost that would not even need to be paid if the user turned on this mechanism at table creation. Thereafter, the statistic would need to be updated incrementally, but that just distributes the cost of ANALYZE over the INSERT/UPDATE/DELETEs. Obviously, it's a higher cost because you touch every record that hits the table, but that's the price you pay for a good n_distinct. The block estimator should probably become the default, since it works within the current ANALYZE paradigm of sampling the data. [...] For most use cases users have to run vacuum occasionally. In those cases vacuum analyze would be no worse than a straight normal vacuum. And that's only if you do a full table scan every time. In the incremental implementation, there are no lump sum costs involved except when the statistic is first initialized. Note that this algorithm doesn't require storing more data because of the large scan or performing large sorts per column. It's purely O(n) time and O(1) space. And I think it should be emphasized that distinct sampling not only gives you a good n_distinct for query planning, it also gives you a very fast approximate answer for related aggregate queries. So you're getting more than just query tuning for that cost. On the other hand, if you have tables you aren't vacuuming that means you perform zero updates or deletes. In which case some sort of incremental statistics updating would be a good solution. A better solution even than sampling. Which, for the large data warehousing situations where it seems this mechanism would be most useful, would probably be the most common case. __ David B. Held Software Engineer/Array Services Group
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Mischa, Perhaps I can save you some time (yes, I have a degree in Math). If I understand correctly, you're trying to extrapolate from the correlation between a tiny sample and a larger sample. Introducing the tiny sample into any decision can only produce a less accurate result than just taking the larger sample on its own; GIGO. Whether they are consistent with one another has no relationship to whether the larger sample correlates with the whole population. You can think of the tiny sample like anecdotal evidence for wonderdrugs.

Actually, it's more to characterize how large of a sample we need. For example, if we sample 0.005 of disk pages, and get an estimate, and then sample another 0.005 of disk pages and get an estimate which is not even close to the first estimate, then we have an idea that this is a table which defies analysis based on small samples. Whereas if the two estimates are within 1.0 stdev of each other, we can have good confidence that the table is easily estimated. Note that this doesn't require progressively larger samples; any two samples would work.

I'm with Tom though in being very wary of solutions that require even one-off whole table scans. Maybe we need an additional per-table statistics setting which could specify the sample size, either as an absolute number or as a percentage of the table. It certainly seems that where D/N ~ 0.3, the estimates on very large tables at least are way way out.

Oh, I think there are several other cases where estimates are way out. Basically the estimation method we have doesn't work for samples smaller than 0.10.

Or maybe we need to support more than one estimation method.

Yes, actually. We need 3 different estimation methods: 1 for tables where we can sample a large % of pages (say, >= 0.1) 1 for tables where we sample a small % of pages but are easily estimated 1 for tables which are not easily estimated but we can't afford to sample a large % of pages. If we're doing sampling-based estimation, I really don't want people to lose sight of the fact that page-based random sampling is much less expensive than row-based random sampling. We should really be focusing on methods which are page-based. -- Josh Berkus Aglio Database Solutions San Francisco
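[A sketch of the two-sample check Josh describes above, in Python; the helper names and the naive per-sample estimator are mine for illustration, not PostgreSQL code, and the replies below question how much agreement between two small samples really proves.]

import random

def estimate_ndistinct(pages, column):
    # Deliberately naive: distinct values seen in the sampled pages.
    # A real estimator would scale this up (e.g. via Haas-Stokes).
    return len({row[column] for page in pages for row in page})

def defies_small_samples(all_pages, column, fraction=0.005, tolerance=2.0):
    k = max(1, int(len(all_pages) * fraction))
    est1 = estimate_ndistinct(random.sample(all_pages, k), column)
    est2 = estimate_ndistinct(random.sample(all_pages, k), column)
    ratio = max(est1, est2) / max(1, min(est1, est2))
    return ratio > tolerance   # wildly different answers: fall back to a bigger sample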
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
-----Original Message----- From: Josh Berkus [...] Actually, it's more to characterize how large of a sample we need. For example, if we sample 0.005 of disk pages, and get an estimate, and then sample another 0.005 of disk pages and get an estimate which is not even close to the first estimate, then we have an idea that this is a table which defies analysis based on small samples. I buy that. Whereas if the two estimates are within 1.0 stdev of each other, we can have good confidence that the table is easily estimated. I don't buy that. A negative indication is nothing more than proof by contradiction. A positive indication is mathematical induction over the set, which in this type of context is logically unsound. There is no reason to believe that two small samples with a small difference imply that a table is easily estimated rather than that you got unlucky in your samples. [...] Yes, actually. We need 3 different estimation methods: 1 for tables where we can sample a large % of pages (say, >= 0.1) 1 for tables where we sample a small % of pages but are easily estimated 1 for tables which are not easily estimated but we can't afford to sample a large % of pages. I don't buy that the first and second need to be different estimation methods. I think you can use the same block sample estimator for both, and simply stop sampling at different points. If you set the default to be a fixed number of blocks, you could get a large % of pages on small tables and a small % of pages on large tables, which is exactly how you define the first two cases. However, I think such a default should also be overridable to a % of the table or a desired accuracy. Of course, I would recommend the distinct sample technique for the third case. If we're doing sampling-based estimation, I really don't want people to lose sight of the fact that page-based random sampling is much less expensive than row-based random sampling. We should really be focusing on methods which are page-based. Of course, that savings comes at the expense of having to account for factors like clustering within blocks. So block sampling is more efficient, but can also be less accurate. Nonetheless, I agree that of the sampling estimators, block sampling is the better technique. __ David B. Held Software Engineer/Array Services Group
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Dave Held [EMAIL PROTECTED] writes: Actually, it's more to characterize how large of a sample we need. For example, if we sample 0.005 of disk pages, and get an estimate, and then sample another 0.005 of disk pages and get an estimate which is not even close to the first estimate, then we have an idea that this is a table which defies analysis based on small samples. I buy that. Better yet is to use the entire sample you've gathered of .01 and then perform analysis on that sample to see what the confidence interval is. Which is effectively the same as what you're proposing except looking at every possible partition. Unfortunately the reality according to the papers that were sent earlier is that you will always find the results disappointing. Until your sample is nearly the entire table your estimates for n_distinct will be extremely unreliable. -- greg
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Simon, Could it be that we have overlooked this simple explanation and that the Haas and Stokes equation is actually quite good, but just not being applied? That's probably part of it, but I've tried Haas and Stokes on a pure random sample and it's still bad, or more specifically overly conservative. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
On Mon, 2005-04-25 at 17:10 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote: It's not just the scan --- you also have to sort, or something like that, if you want to count distinct values. I doubt anyone is really going to consider this a feasible answer for large tables. Assuming you don't use the HashAgg plan, which seems very appropriate for the task? (...but I understand the plan otherwise). The context here is a case with a very large number of distinct values... Yes, but is there another way of doing this other than sampling a larger proportion of the table? I don't like that answer either, for the reasons you give. The manual doesn't actually say this, but you can already alter the sample size by setting one of the statistics targets higher, but all of those samples are fixed sample sizes, not a proportion of the table itself. It seems reasonable to allow an option to scan a higher proportion of the table. (It would be even better if you could say keep going until you run out of memory, then stop, to avoid needing to have an external sort mode added to ANALYZE). Oracle and DB2 allow a proportion of the table to be specified as a sample size during statistics collection. IBM seem to be ignoring their own research note on estimating ndistinct... keep in mind also that we have to do this for *all* the columns of the table. You can collect stats for individual columns. You need only use an option to increase sample size when required. Also, if you have a large table and the performance of ANALYZE worries you, set some fields to 0. Perhaps that should be the default setting for very long text columns, since analyzing those doesn't help much (usually) and takes ages. (I'm aware we already don't analyze var-length column values over 1024 bytes). A full-table scan for each column seems right out to me. Some systems analyze multiple columns simultaneously. Best Regards, Simon Riggs
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Simon Riggs wrote: The comment * Every value in the sample appeared more than once. Assume * the column has just these values. doesn't seem to apply when using larger samples, as Josh is using. Looking at Josh's application it does seem likely that when taking a sample, all site visitors clicked more than once during their session, especially if they include home page, adverts, images etc for each page. Could it be that we have overlooked this simple explanation and that the Haas and Stokes equation is actually quite good, but just not being applied? No, it is being applied. If every value in the sample appears more than once, then f1 in the formula is 0, and the result is then just d, the number of distinct values in the sample. cheers andrew
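[For readers following along: if I'm reading analyze.c right, the estimator in question is the Haas and Stokes formula n*d / (n - f1 + f1*n/N), where n is the number of sampled rows, N the total rows, d the distinct values in the sample, and f1 the values seen exactly once. A quick sketch showing the f1 = 0 case Andrew describes collapsing to d; the numbers are made up for illustration:]

def haas_stokes(n, N, d, f1):
    # n_distinct estimate per Haas & Stokes; with f1 = 0 this is n*d/n = d.
    return (n * d) / (n - f1 + f1 * n / N)

print(haas_stokes(n=3000, N=10_000_000, d=150, f1=0))    # -> 150.0
print(haas_stokes(n=3000, N=10_000_000, d=150, f1=100))  # scales up once singletons appear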
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
This one looks *really* good. http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf It does require a single full table scan but it works in O(n) time and constant space and it guarantees the confidence intervals for the estimates it provides like the histograms do for regular range scans. It can even keep enough data to provide estimates for n_distinct when unrelated predicates are applied. I'm not sure Postgres would want to do this though; this seems like it's part of the cross-column correlation story more than the n_distinct story. It seems to require keeping an entire copy of the sampled record in the stats tables which would be prohibitive quickly in wide tables (it would be O(n^2) storage in the number of columns). It also seems like a lot of work to implement. Nothing particular that would be impossible, but it does require storing a moderately complex data structure. Perhaps Postgres's new support for data structures will make this easier. -- greg
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Rod Taylor [EMAIL PROTECTED] writes: On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote: This one looks *really* good. http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf It does require a single full table scan Ack.. Not by default please. I have a few large append-only tables (vacuum isn't necessary) which do need stats rebuilt periodically. The algorithm can also naturally be implemented incrementally. Which would be nice for your append-only tables. But that's not Postgres's current philosophy with statistics. Perhaps some trigger function that you could install yourself to update statistics for a newly inserted record would be useful.

The paper is pretty straightforward and easy to read, but here's an executive summary: The goal is to gather a uniform sample of *distinct values* in the table as opposed to a sample of records. Instead of using a fixed percentage sampling rate for each record, use a hash of the value to determine whether to include it. At first include everything, but if the sample space overflows throw out half the values based on their hash value. Repeat until finished. In the end you'll have a sample of 1/2^n of your distinct values from your entire data set where n is large enough for your sample to fit in your predetermined constant sample space. -- greg
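[For concreteness, a toy sketch of the mechanism as Greg summarizes it; this is my own simplification, not the full algorithm from the paper (which also keeps per-value counts), and the hash choice and sample bound are arbitrary. A value is kept only while the low `level` bits of its hash are zero; when the reservoir overflows, the level is raised and values evicted; the final estimate is (values kept) * 2^level.]

import hashlib

def h(value):
    # Any well-mixed hash works; sha1 is used here just for a stable example.
    return int.from_bytes(hashlib.sha1(repr(value).encode()).digest()[:8], "big")

def estimate_ndistinct(values, max_sample=1024):
    level = 0
    sample = set()
    for v in values:
        if h(v) & ((1 << level) - 1) == 0:        # low `level` bits all zero?
            sample.add(v)
            while len(sample) > max_sample:       # overflow: halve the sample space
                level += 1
                sample = {s for s in sample if h(s) & ((1 << level) - 1) == 0}
    return len(sample) * (1 << level)

# The stream can also be fed incrementally, which is what makes the
# append-only case Rod mentions attractive.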
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
On Tue, 2005-04-26 at 19:28 -0400, Greg Stark wrote: Rod Taylor [EMAIL PROTECTED] writes: On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote: This one looks *really* good. http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf It does require a single full table scan Ack.. Not by default please. I have a few large append-only tables (vacuum isn't necessary) which do need stats rebuilt periodically. The algorithm can also naturally be implemented incrementally. Which would be nice for your append-only tables. But that's not Postgres's current philosophy with statistics. Perhaps some trigger function that you could install yourself to update statistics for a newly inserted record would be useful. If/when we have partitions, that'll be good enough. If partitions aren't available this would be quite painful to anyone with large tables -- much as the days of old used to be painful for ANALYZE. --
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Quoting Andrew Dunstan [EMAIL PROTECTED]: After some more experimentation, I'm wondering about some sort of adaptive algorithm, a bit along the lines suggested by Marko Ristola, but limited to 2 rounds. The idea would be that we take a sample (either of fixed size, or some small proportion of the table), see how well it fits a larger sample (say a few times the size of the first sample), and then adjust the formula accordingly to project from the larger sample the estimate for the full population. Math not worked out yet - I think we want to ensure that the result remains bounded by [d,N].

Perhaps I can save you some time (yes, I have a degree in Math). If I understand correctly, you're trying to extrapolate from the correlation between a tiny sample and a larger sample. Introducing the tiny sample into any decision can only produce a less accurate result than just taking the larger sample on its own; GIGO. Whether they are consistent with one another has no relationship to whether the larger sample correlates with the whole population. You can think of the tiny sample like anecdotal evidence for wonderdrugs. -- Dreams come true, not free. -- S.Sondheim, ITW
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Guys, While it's not possible to get accurate estimates from a fixed size sample, I think it would be possible from a small but scalable sample: say, 0.1% of all data pages on large tables, up to the limit of maintenance_work_mem. BTW, when I say accurate estimates here, I'm talking about accurate enough for planner purposes, which in my experience is a range between 0.2x and 5x. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Here is my opinion. I hope this helps. Maybe there is no one good formula: On a boolean type, there are at most 3 distinct values. There is an upper bound for forenames in one country. There is an upper bound for last names in one country. There is a fixed number of states and postal codes in one country. On the other hand, with a timestamp, every value could be distinct. A primary key with only one column has only distinct values. If an integer column refers with a foreign key to another table's primary key, we could take advantage of that knowledge. A column with a unique index has only distinct values. The first ones are for classifying, and the second ones measure continuous or discrete time or something like it. The upper bound for classifying might be 3 (boolean), or it might be one million. The properties of the distribution might be hard to guess.

Here is one way: 1. Find out the number of distinct values for 500 rows. 2. Try to guess how many distinct values there are for 1000 rows. Find out the real number of distinct values for 1000 rows. 3. If the guess and the reality are more than 50% apart, do the iteration for 2x1000 rows. Iterate using a power of two to increase the samples, until you trust the estimate enough.

So, in phase two, you could try to guess with two distinct formulas: one for the classifying targets (boolean columns hit there), another for the timestamp and numerical values. If there are one million classifications on one column, how can you find that out by any means other than checking at least two million rows? This means that the user should have the possibility to set a lower bound for the number of rows for sampling. Regards, Marko Ristola

Tom Lane wrote: Josh Berkus josh@agliodbs.com writes: Overall, our formula is inherently conservative of n_distinct. That is, I believe that it is actually computing the *smallest* number of distinct values which would reasonably produce the given sample, rather than the *median* one. This is contrary to the notes in analyze.c, which seem to think that we're *overestimating* n_distinct. Well, the notes are there because the early tests I ran on that formula did show it overestimating n_distinct more often than not. Greg is correct that this is inherently a hard problem :-( I have nothing against adopting a different formula, if you can find something with a comparable amount of math behind it ... but I fear it'd only shift the failure cases around. regards, tom lane
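[A sketch of the adaptive doubling Marko outlines above; the function names and the crude "double the count" guess are placeholders of mine (any of the extrapolation formulas under discussion could be slotted in), not a worked-out proposal.]

def distincts(rows, column):
    return len({row[column] for row in rows})

def adaptive_sample_size(rows, column, start=500, tolerance=0.5):
    k = start
    while 2 * k <= len(rows):
        guess = 2 * distincts(rows[:k], column)   # placeholder extrapolation to 2k rows
        actual = distincts(rows[:2 * k], column)
        if abs(guess - actual) <= tolerance * actual:
            return 2 * k                          # prediction held up: stop growing the sample
        k *= 2
    return len(rows)                              # never stabilized: fall back to the whole table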