Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Heikki Linnakangas
On 20.01.2011 04:36, Robert Haas wrote: ... Even better, the code changes would be confined to ANALYZE rather than spread out all over the system, which has positive implications for robustness and likelihood of commit. Keep in mind that the administrator can already override the ndistinct

Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Csaba Nagy
Hi Tomas, On Wed, 2011-01-19 at 23:13 +0100, Tomas Vondra wrote: No, the multi-column statistics do not require constant updating. There are cases where a sampling is perfectly fine, although you may need a bit larger sample. Generally if you can use a multi-dimensional histogram, you don't

Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Tomas Vondra
Dne 20.1.2011 03:06, Nathan Boley napsal(a): And actually it does not depend on ndistinct for the columns only, it depends on ndistinct estimates for the combination of columns. So improving the ndistinct estimates for columns is just a necessary first step (and only if it works reasonably

Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Tomas Vondra
Dne 20.1.2011 03:36, Robert Haas napsal(a): On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra t...@fuzzy.cz wrote: Regarding the crash scenario - if the commit fails, just throw away the local estimator copy, it's not needed. I'm not sure how to take care of the case when commit succeeds and the

Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Tomas Vondra
Dne 20.1.2011 09:10, Heikki Linnakangas napsal(a): It seems that the suggested multi-column selectivity estimator would be more sensitive to ndistinct of the individual columns. Is that correct? How is it biased? If we routinely under-estimate ndistinct of individual columns, for example, does

Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Tomas Vondra
Dne 20.1.2011 11:05, Csaba Nagy napsal(a): Hi Tomas, On Wed, 2011-01-19 at 23:13 +0100, Tomas Vondra wrote: No, the multi-column statistics do not require constant updating. There are cases where a sampling is perfectly fine, although you may need a bit larger sample. Generally if you can

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Robert Haas
On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote: Yes, I was aware of this problem (amount of memory consumed with lots of updates), and I kind of hoped someone will come up with a reasonable solution. As far as I can see, periodically sampling some or all of the table is

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes: ... I guess I'm just saying I'd think really, really hard before abandoning the idea of periodic sampling. You have to get an awful lot of benefit out of those cross-column stats to make it worth paying a run-time cost for them. I've been trying to

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Tomas Vondra
Dne 19.1.2011 20:25, Robert Haas napsal(a): On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote: Yes, I was aware of this problem (amount of memory consumed with lots of updates), and I kind of hoped someone will come up with a reasonable solution. As far as I can see,

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Tomas Vondra
Dne 19.1.2011 20:46, Tom Lane napsal(a): Robert Haas robertmh...@gmail.com writes: ... I guess I'm just saying I'd think really, really hard before abandoning the idea of periodic sampling. You have to get an awful lot of benefit out of those cross-column stats to make it worth paying a

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
On Wed, Jan 19, 2011 at 2:13 PM, Tomas Vondra t...@fuzzy.cz wrote: Dne 19.1.2011 20:25, Robert Haas napsal(a): On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote: Yes, I was aware of this problem (amount of memory consumed with lots of updates), and I kind of hoped someone will

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Tomas Vondra
Dne 19.1.2011 23:44, Nathan Boley napsal(a): 1) The distribution of values in a table is rarely pathological, and usually follows one of several common patterns. ( IOW, we have good heuristics ) Not true. You're missing the goal of this effort - to get ndistinct estimate for combination of

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Florian Pflug
On Jan19, 2011, at 23:44 , Nathan Boley wrote: If you think about it, it's a bit ridiculous to look at the whole table *just* to estimate ndistinct - if we go that far why dont we just store the full distribution and be done with it? The crucial point that you're missing here is that ndistinct

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
If you think about it, it's a bit ridiculous to look at the whole table *just* to estimate ndistinct - if we go that far why dont we just store the full distribution and be done with it? - the best you could do is to average the individual probabilities which gives ... well, 1/ndistinct.

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
Not true. You're missing the goal of this effort - to get ndistinct estimate for combination of multiple columns. Which is usually pathological in case of dependent columns. snip Again, don't think about a single column (although even in that case there are known fail cases). Think about

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Florian Pflug
On Jan20, 2011, at 02:41 , Nathan Boley wrote: If you think about it, it's a bit ridiculous to look at the whole table *just* to estimate ndistinct - if we go that far why dont we just store the full distribution and be done with it? - the best you could do is to average the individual

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Robert Haas
On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra t...@fuzzy.cz wrote: Regarding the crash scenario - if the commit fails, just throw away the local estimator copy, it's not needed. I'm not sure how to take care of the case when commit succeeds and the write of the merged estimator fails, but I

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Robert Haas
On Wed, Jan 19, 2011 at 9:35 PM, Florian Pflug f...@phlo.org wrote: Also, in my personal experience this isn't really the area we've got problems now. The problem cases for me always were queries with a rather large number of joins (7 to 10 or so) plus rather complex search conditions. The

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
On Wed, Jan 19, 2011 at 6:35 PM, Florian Pflug f...@phlo.org wrote: On Jan20, 2011, at 02:41 , Nathan Boley wrote: If you think about it, it's a bit ridiculous to look at the whole table *just* to estimate ndistinct - if we go that far why dont we just store the full distribution and be done

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread tv
On Jan 17, 2011, at 6:36 PM, Tomas Vondra wrote: 1) Forks are 'per relation' but the distinct estimators are 'per column' (or 'per group of columns') so I'm not sure whether the file should contain all the estimators for the table, or if there should be one fork for each estimator. The

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 4:53 AM, t...@fuzzy.cz wrote: So the most important question is how to intercept the new/updated rows, and where to store them. I think each backend should maintain it's own private list of new records and forward them only in case of commit. Does that sound

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Jim Nasby
On Jan 17, 2011, at 8:11 PM, Robert Haas wrote: On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote: - Forks are very possibly a more efficient way to deal with TOAST than having separate tables. There's a fair amount of overhead we pay for the current setup. That seems like

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby j...@nasby.net wrote: On Jan 17, 2011, at 8:11 PM, Robert Haas wrote: On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote: - Forks are very possibly a more efficient way to deal with TOAST than having separate tables. There's a fair

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes: On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby j...@nasby.net wrote: On Jan 17, 2011, at 8:11 PM, Robert Haas wrote: On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote: - Forks are very possibly a more efficient way to deal with TOAST than

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Jim Nasby
On Jan 18, 2011, at 11:24 AM, Robert Haas wrote: On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby j...@nasby.net wrote: On Jan 17, 2011, at 8:11 PM, Robert Haas wrote: On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote: - Forks are very possibly a more efficient way to deal with TOAST

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Jim Nasby
On Jan 18, 2011, at 11:32 AM, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby j...@nasby.net wrote: On Jan 17, 2011, at 8:11 PM, Robert Haas wrote: On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote: - Forks are very possibly

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 12:32 PM, Jim Nasby j...@nasby.net wrote: How's that different from what vacuum does on a TOAST table now? TOAST vacuum is currently an entirely separate vacuum. It might run at the same time as the main table vacuum, but it still has all the work that would be

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Tomas Vondra
Dne 18.1.2011 18:12, Robert Haas napsal(a): On Tue, Jan 18, 2011 at 4:53 AM, t...@fuzzy.cz wrote: So the most important question is how to intercept the new/updated rows, and where to store them. I think each backend should maintain it's own private list of new records and forward them only

Re: [HACKERS] estimating # of distinct values

2011-01-17 Thread Tomas Vondra
Dne 9.1.2011 13:58, Jim Nasby napsal(a): A resource fork? Not sure what you mean, could you describe it in more detail? Ooops, resource forks are a filesystem thing; we call them relation forks. From src/backend/storage/smgr/README: OK, I think using relation forks seems like a good

Re: [HACKERS] estimating # of distinct values

2011-01-17 Thread Jim Nasby
On Jan 17, 2011, at 6:36 PM, Tomas Vondra wrote: 1) Forks are 'per relation' but the distinct estimators are 'per column' (or 'per group of columns') so I'm not sure whether the file should contain all the estimators for the table, or if there should be one fork for each estimator. The

Re: [HACKERS] estimating # of distinct values

2011-01-17 Thread Robert Haas
On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote: - Forks are very possibly a more efficient way to deal with TOAST than having separate tables. There's a fair amount of overhead we pay for the current setup. That seems like an interesting idea, but I actually don't see why it

Re: [HACKERS] estimating # of distinct values

2011-01-15 Thread Tomas Vondra
Hi, a short update regarding the ndistinct stream estimators - I've implemented the estimators described in the papers I've mentioned in my previous posts. If you want try it, the sources are available at github, at http://tvondra.github.com/pg_estimator (ignore the page, I have to update it,

Re: [HACKERS] estimating # of distinct values

2011-01-10 Thread tv
On Fri, 2011-01-07 at 12:32 +0100, t...@fuzzy.cz wrote: the problem is you will eventually need to drop the results and rebuild it, as the algorithms do not handle deletes (ok, Florian mentioned an algorithm L_0 described in one of the papers, but I'm not sure we can use it). Yes, but even

Re: [HACKERS] estimating # of distinct values

2011-01-10 Thread Csaba Nagy
On Fri, 2011-01-07 at 12:32 +0100, t...@fuzzy.cz wrote: the problem is you will eventually need to drop the results and rebuild it, as the algorithms do not handle deletes (ok, Florian mentioned an algorithm L_0 described in one of the papers, but I'm not sure we can use it). Yes, but even

Re: [HACKERS] estimating # of distinct values

2011-01-09 Thread Jim Nasby
A resource fork? Not sure what you mean, could you describe it in more detail? Ooops, resource forks are a filesystem thing; we call them relation forks. From src/backend/storage/smgr/README: Relation Forks == Since 8.4, a single smgr relation can be comprised of multiple

Re: [HACKERS] estimating # of distinct values

2011-01-07 Thread Csaba Nagy
On Thu, 2010-12-30 at 21:02 -0500, Tom Lane wrote: How is an incremental ANALYZE going to work at all? How about a kind of continuous analyze ? Instead of analyzing just once and then drop the intermediate results, keep them on disk for all tables and then piggyback the background writer (or

Re: [HACKERS] estimating # of distinct values

2011-01-07 Thread tv
On Thu, 2010-12-30 at 21:02 -0500, Tom Lane wrote: How is an incremental ANALYZE going to work at all? How about a kind of continuous analyze ? Instead of analyzing just once and then drop the intermediate results, keep them on disk for all tables and then piggyback the background writer

Re: [HACKERS] estimating # of distinct values

2011-01-07 Thread Jim Nasby
On Jan 7, 2011, at 5:32 AM, t...@fuzzy.cz wrote: Another thing I'm not sure about is where to store those intermediate stats (used to get the current estimate, updated incrementally). I was thinking about pg_stats but I'm not sure it's the right place - depending on the algorithm, this may be

Re: [HACKERS] estimating # of distinct values

2011-01-07 Thread Tomas Vondra
Dne 7.1.2011 20:56, Jim Nasby napsal(a): On Jan 7, 2011, at 5:32 AM, t...@fuzzy.cz wrote: Another thing I'm not sure about is where to store those intermediate stats (used to get the current estimate, updated incrementally). I was thinking about pg_stats but I'm not sure it's the right place -

Re: [HACKERS] estimating # of distinct values

2010-12-31 Thread Alvaro Herrera
Excerpts from Tom Lane's message of jue dic 30 23:02:04 -0300 2010: Alvaro Herrera alvhe...@commandprompt.com writes: I was thinking that we could have two different ANALYZE modes, one full and one incremental; autovacuum could be modified to use one or the other depending on how many

Re: [HACKERS] estimating # of distinct values

2010-12-31 Thread Jim Nasby
On Dec 31, 2010, at 7:34 AM, Alvaro Herrera wrote: Excerpts from Tom Lane's message of jue dic 30 23:02:04 -0300 2010: Alvaro Herrera alvhe...@commandprompt.com writes: I was thinking that we could have two different ANALYZE modes, one full and one incremental; autovacuum could be modified to

Re: [HACKERS] estimating # of distinct values

2010-12-30 Thread Florian Pflug
On Dec27, 2010, at 23:49 , Kevin Grittner wrote: Robert Haas robertmh...@gmail.com wrote: With respect to (b), I think I'd need to see a much more detailed design for how you intend to make this work. Off the top of my head there seems to be some pretty serious feasibility problems. I

Re: [HACKERS] estimating # of distinct values

2010-12-30 Thread Tomas Vondra
Dne 30.12.2010 15:43, Florian Pflug napsal(a): On Dec27, 2010, at 23:49 , Kevin Grittner wrote: Robert Haas robertmh...@gmail.com wrote: With respect to (b), I think I'd need to see a much more detailed design for how you intend to make this work. Off the top of my head there seems to be

Re: [HACKERS] estimating # of distinct values

2010-12-30 Thread Alvaro Herrera
Excerpts from Tomas Vondra's message of jue dic 30 16:38:03 -0300 2010: Since the need to regularly VACUUM tables hit by updated or deleted won't go away any time soon, we could piggy-back the bit field rebuilding onto VACUUM to avoid a second scan. Well, I guess it's a bit more

Re: [HACKERS] estimating # of distinct values

2010-12-30 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes: I was thinking that we could have two different ANALYZE modes, one full and one incremental; autovacuum could be modified to use one or the other depending on how many changes there are (of course, the user could request one or the other, too;

Re: [HACKERS] estimating # of distinct values

2010-12-29 Thread Josh Berkus
Well, but that's not 7%, thats 7x! And the theorem says 'greater or equal' so this is actually the minimum - you can get a much bigger difference with lower probability. So you can easily get an estimate that is a few orders off. FWIW, based on query performance, estimates which are up to 5X

Re: [HACKERS] estimating # of distinct values

2010-12-28 Thread Robert Haas
On Tue, Dec 28, 2010 at 1:39 AM, Josh Berkus j...@agliodbs.com wrote: While I don't want to discourage you from working on steam-based estimators ... I'd love to see you implement a proof-of-concept for PostgreSQL, and test it ... the above is a non-argument.  It requires us to accept that

Re: [HACKERS] estimating # of distinct values

2010-12-28 Thread tv
The simple truth is 1) sampling-based estimators are a dead-end The Charikar and Chaudhuri paper does not, in fact, say that it is impossible to improve sampling-based estimators as you claim it does. In fact, the authors offer several ways to improve sampling-based estimators. Further,

Re: [HACKERS] estimating # of distinct values

2010-12-28 Thread Kevin Grittner
t...@fuzzy.cz wrote: So even with 10% of the table, there's a 10% probability to get an estimate that's 7x overestimated or underestimated. With lower probability the interval is much wider. Hmmm... Currently I generally feel I'm doing OK when the estimated rows for a step are in the right

Re: [HACKERS] estimating # of distinct values

2010-12-28 Thread tv
t...@fuzzy.cz wrote: So even with 10% of the table, there's a 10% probability to get an estimate that's 7x overestimated or underestimated. With lower probability the interval is much wider. Hmmm... Currently I generally feel I'm doing OK when the estimated rows for a step are in the

[HACKERS] estimating # of distinct values

2010-12-27 Thread Tomas Vondra
Hi everyone, about two weeks ago I've started a thread about cross-column stats. One of the proposed solutions is based on number of distinct values, and the truth is the current distinct estimator has some problems. I've done some small research - I have read about 20 papers on this, and I'd

Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Robert Haas
2010/12/27 Tomas Vondra t...@fuzzy.cz:   But even though these disadvantages, there really is no other   way to enhance the estimates. I don't think this should be a   default behavior - just as in case of cross-column stats this should   be optional when the current estimator does not work

Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote: With respect to (b), I think I'd need to see a much more detailed design for how you intend to make this work. Off the top of my head there seems to be some pretty serious feasibility problems. I had one random thought on that -- it seemed like a

Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes: Robert Haas robertmh...@gmail.com wrote: With respect to (b), I think I'd need to see a much more detailed design for how you intend to make this work. Off the top of my head there seems to be some pretty serious feasibility problems. I

Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Tomas Vondra
Dne 27.12.2010 22:46, Robert Haas napsal(a): 2010/12/27 Tomas Vondra t...@fuzzy.cz: But even though these disadvantages, there really is no other way to enhance the estimates. I don't think this should be a default behavior - just as in case of cross-column stats this should be

Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Tomas Vondra
Dne 28.12.2010 00:04, Kevin Grittner napsal(a): Tom Lane t...@sss.pgh.pa.us wrote: Well, first, those scans occur only once every few hundred million transactions, which is not likely a suitable timescale for maintaining statistics. I was assuming that the pass of the entire table was

Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Josh Berkus
The simple truth is 1) sampling-based estimators are a dead-end While I don't want to discourage you from working on steam-based estimators ... I'd love to see you implement a proof-of-concept for PostgreSQL, and test it ... the above is a non-argument. It requires us to accept that