Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Heikki Linnakangas

On 20.01.2011 04:36, Robert Haas wrote:

... Even better, the
code changes would be confined to ANALYZE rather than spread out all
over the system, which has positive implications for robustness and
likelihood of commit.


Keep in mind that the administrator can already override the ndistinct 
estimate with ALTER TABLE. If he needs to manually run a special ANALYZE 
command to make it scan the whole table, he might as well just use ALTER 
TABLE to tell the system what the real (or good enough) value is. A DBA 
should have a pretty good feeling of what the distribution of his data 
is like.


And how good does the estimate need to be? For a single column it's 
usually not that critical: if the column has only a few distinct 
values then we'll already estimate that pretty well, and OTOH if 
ndistinct is large, it usually doesn't affect the plans much whether it's 
10% of the number of rows or 90%.


It seems that the suggested multi-column selectivity estimator would be 
more sensitive to ndistinct of the individual columns. Is that correct? 
How is it biased? If we routinely under-estimate ndistinct of individual 
columns, for example, does the bias accumulate or cancel itself in the 
multi-column estimate?


I'd like to see some testing of the suggested selectivity estimator with 
the ndistinct estimates we have. Who knows, maybe it works fine in practice.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Csaba Nagy
Hi Tomas,

On Wed, 2011-01-19 at 23:13 +0100, Tomas Vondra wrote:
 No, the multi-column statistics do not require constant updating. There
 are cases where a sampling is perfectly fine, although you may need a
 bit larger sample. Generally if you can use a multi-dimensional
 histogram, you don't need to scan the whole table.

In the cases where sampling is enough, you can do the same with the updates
too: sample the changes, i.e. only let every Nth change make it to the
estimator. If you can also dynamically tune N - growing it as the
statistics stabilize and lowering it if you detect high variance - even
better.
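
To give a rough idea of what I mean, here is a small Python sketch (the
interface, constants and tuning rule are all made up, just to illustrate
the idea - not a concrete proposal):

import random

class ExactCounter:
    """Trivial stand-in estimator for the demo - counts distinct values exactly."""
    def __init__(self):
        self.vals = set()
    def add(self, v):
        self.vals.add(v)
    def ndistinct(self):
        return len(self.vals)

class ChangeSampler:
    """Feed only every Nth change to the estimator, tuning N on the fly."""
    def __init__(self, estimator, n=1):
        self.estimator = estimator
        self.n = n                    # current sampling interval
        self.seen = 0
        self.last_estimate = None

    def on_change(self, value):
        self.seen += 1
        if self.seen % self.n:
            return                    # skip this change entirely
        self.estimator.add(value)
        self._retune()

    def _retune(self):
        est = self.estimator.ndistinct()
        if self.last_estimate is not None:
            drift = abs(est - self.last_estimate) / max(est, 1)
            if drift < 0.01:
                self.n = min(self.n * 2, 1024)   # stable -> sample less
            elif drift > 0.10:
                self.n = max(self.n // 2, 1)     # volatile -> sample more
        self.last_estimate = est

sampler = ChangeSampler(ExactCounter())
for _ in range(100000):
    sampler.on_change(random.randint(0, 999))
print(sampler.n, sampler.estimator.ndistinct())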

If the analyze process could be decoupled from the backends, and maybe
just get the data passed over to be processed asynchronously, then that
could be a feasible way to have always up to date statistics when the
bottleneck is IO and CPU power is in excess. If that then leads to
better plans, it could really be a win exceeding the overhead.

If this analyze process (or more of them) could also just get the data
from the modified buffers in a cyclic way, so that backends need nothing
extra to do, then I don't see any performance disadvantage other than
possible extra locking contention on the buffers and non-determinism of
the actual time when a change makes it to the statistics. Then you just
need to get more CPU power and higher memory bandwidth to pay for the
accurate statistics.

Cheers,
Csaba.





Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Tomas Vondra
On 20.1.2011 03:06, Nathan Boley wrote:
 And actually it does not depend on ndistinct for the columns only, it
 depends on ndistinct estimates for the combination of columns. So
 improving the ndistinct estimates for columns is just a necessary first
 step (and only if it works reasonably well, we can do the next step).
 
 I think that any approach which depends on precise estimates of
 ndistinct is not practical.

I'm not aware of any other approach to the 'discrete fail case' (where
the multi-dimensional histograms are not applicable). If someone finds a
better solution, I'll be the first one to throw away this stuff.

 I am very happy that you've spent so much time on this, and I'm sorry
 if my previous email came off as combative. My point was only that
 simple heuristics have served us well in the past and, before we go to
 the effort of new, complicated schemes, we should see how well similar
 heuristics work in the multiple column case. I am worried that if the
 initial plan is too complicated then nothing will happen and, even if
 something does happen, it will be tough to get it committed ( check
 the archives for cross column stat threads - there are a lot ).

If I've learned one thing over the years in IT, it's not to take criticism
personally. All the problems mentioned in this thread are valid
concerns, pointing out weak points of the approach. And I'm quite happy
to receive this feedback - that's why I started it.

On the other hand - Jara Cimrman (a famous Czech fictional character,
depicted as the best scientist/poet/teacher/traveller/... - see [1])
once said that you can't be really sure you don't get gold by blowing
cigarette smoke into a basin drain, until you actually try it. So I'm
blowing cigarette smoke into the drain ...

It may very well happen that this will be a dead end, but I'll do my best
to fix all the issues or to prove that the pros outweigh the cons. And
even if it is eventually rejected, I hope to get a -1 from TL to be
eligible for that t-shirt ...

[1] http://en.wikipedia.org/wiki/Jara_Cimrman

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Tomas Vondra
On 20.1.2011 03:36, Robert Haas wrote:
 On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Regarding the crash scenario - if the commit fails, just throw away the
 local estimator copy, it's not needed. I'm not sure how to take care of
 the case when commit succeeds and the write of the merged estimator
 fails, but I think it might be possible to write the estimator to xlog
 or something like that. So it would be replayed during recovery etc. Or
 is it a stupid idea?

 It's not stupid, in the sense that that is what you'd need to do if
 you want to avoid ever having to rescan the table.  But it is another
 thing that I think is going to be way too expensive.

 Way too expensive? All you need to put into the logfile is a copy of the
 estimator, which is a few kBs. How is that 'way too expensive'?
 
 At this point, this is all a matter of religion, right?  Neither of us
 has a working implementation we've benchmarked.  But yes, I believe
 you're going to find that implementing some kind of streaming
 estimator is going to impose a... *pulls number out of rear end* 6%
 performance penalty, even after you've optimized the living daylights
 out of it.  Now you might say... big deal, it improves my problem
 queries by 100x.  OK, but if you could get the same benefit by doing
 an occasional full table scan during off hours, you could have the
 same performance with a *0%* performance penalty.  Even better, the
 code changes would be confined to ANALYZE rather than spread out all
 over the system, which has positive implications for robustness and
 likelihood of commit.

Good point. What I was trying to do was to continuously update the
estimator with new data - that was the whole idea behind the collecting
of new values (which might lead to problems with memory etc. as you've
pointed out) and updating a local copy of the estimator (which is a good
idea I think).

But this might be another option - let the user decide if he wants to
continuously update the estimates (and pay the price) or do that off
hours (and pay almost nothing). That sounds like a very good solution to me.

 I'm not trying to argue you out of working on this.  It's obviously
 your time to spend, and if works better than I think it will, great!
 I'm merely offering you an opinion on what will probably happen if you
 go this route - namely, it'll carry an unpalatable run-time penalty.
 That opinion may be worth no more than what you paid for it, but there
 you have it.

Yes, and I appreciate all feedback. But I still believe this can be done
so that users that don't need the feature don't pay for it.

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Tomas Vondra
On 20.1.2011 09:10, Heikki Linnakangas wrote:
 It seems that the suggested multi-column selectivity estimator would be
 more sensitive to ndistinct of the individual columns. Is that correct?
 How is it biased? If we routinely under-estimate ndistinct of individual
 columns, for example, does the bias accumulate or cancel itself in the
 multi-column estimate?
 
 I'd like to see some testing of the suggested selectivity estimator with
 the ndistinct estimates we have. Who knows, maybe it works fine in
 practice.

The estimator for two columns and query 'A=a AND B=b' is about

 0.5 * (dist(A)/dist(A,B) * Prob(A=a) + dist(B)/dist(A,B) * Prob(B=b))

so it's quite simple. It's not that sensitive to errors in the ndistinct
estimates for the individual columns; the problem is in the multi-column
ndistinct estimates. It's very likely that with dependent columns (e.g.
with the ZIP codes / cities) the distribution is so pathological that
the sampling-based estimate will be very off.
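
To make that a bit more tangible, here is a tiny numeric sanity check in
Python - only the formula itself comes from above, the city/ZIP numbers
are completely made up:

def multi_column_selectivity(dist_a, dist_b, dist_ab, p_a, p_b):
    # P(A=a AND B=b) estimated per the formula above
    return 0.5 * (dist_a / dist_ab * p_a + dist_b / dist_ab * p_b)

# Hypothetical table: 100 cities (A), 500 ZIP codes (B); every ZIP belongs
# to exactly one city, so there are only 500 distinct (city, zip) pairs.
p_city = 1.0 / 100   # selectivity of city = 'X', assuming uniform cities
p_zip = 1.0 / 500    # selectivity of zip = 'Y'
print(multi_column_selectivity(100, 500, 500, p_city, p_zip))   # 0.002

# The attribute value independence assumption would give p_city * p_zip =
# 0.00002, a 100x underestimate; the true selectivity of a matching
# (city, zip) pair here is 1/500 = 0.002.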

I guess this was a way too short analysis, but if you can provide more
details of the expected tests etc. I'll be happy to provide that.

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Tomas Vondra
On 20.1.2011 11:05, Csaba Nagy wrote:
 Hi Tomas,
 
 On Wed, 2011-01-19 at 23:13 +0100, Tomas Vondra wrote:
 No, the multi-column statistics do not require constant updating. There
 are cases where a sampling is perfectly fine, although you may need a
 bit larger sample. Generally if you can use a multi-dimensional
 histogram, you don't need to scan the whole table.
 
 In the cases where sampling is enough, you can do that to the updates
 too: do a sampling on the changes, in that you only process every Nth
 change to make it to the estimator. If you can also dynamically tune the
 N to grow it as the statistics stabilize, and lower it if you detect
 high variance, even better.
 
 If the analyze process could be decoupled from the backends, and maybe
 just get the data passed over to be processed asynchronously, then that
 could be a feasible way to have always up to date statistics when the
 bottleneck is IO and CPU power is in excess. If that then leads to
 better plans, it could really be a win exceeding the overhead.

OK, this sounds interesting. I'm not sure how to do that but it might be
a good solution. What about transactions? If the client inserts data
(and it will be sent asynchronously to update the estimator) and then
rolls back, is the estimator 'rolled back' or what happens?

This was exactly the reason why I initially wanted to collect all the
data at the backend (and send them to the estimator at commit time).
Which was then replaced by the idea to keep a local estimator copy and
merge it back to the original estimator at commit time.

 If this analyze process (or more of them) could also just get the data
 from the modified buffers in a cyclic way, so that backends need nothing
 extra to do, then I don't see any performance disadvantage other than
 possible extra locking contention on the buffers and non-determinism of
 the actual time when a change makes it to the statistics. Then you just
 need to get more CPU power and higher memory bandwidth to pay for the
 accurate statistics.

Well, the possible locking contention sounds like a quite significant
problem to me :-(

The lag between an update and a change to the stats is not that big a deal,
I guess - we have the same behaviour with the rest of the stats (updated
by autovacuum every once in a while).

Tomas



Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Robert Haas
On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Yes, I was aware of this problem (amount of memory consumed with lots of
 updates), and I kind of hoped someone will come up with a reasonable
 solution.

As far as I can see, periodically sampling some or all of the table is
really the only thing that has a chance of working OK.  You could try
to track changes only when they're not too big, but I think that's
still going to have nasty performance consequences.

 I was thinking about it and I believe at least one of the algorithms
 (the adaptive sampling) might handle merging i.e. the backends would
 not need to store the list of values, just a private copy of the
 estimator and update it. And at the end (after commit), the estimator
 would be merged back into the actual one.

 So the algorithm would be something like this:

 1. create copy for all distinct estimators influenced by the INSERT
 2. update the local copy
 3. commit and merge the local copies back into the original estimator

Well, maybe.  But DELETEs seem like a problem.  And it's still adding
foreground work to every query, which I just have a hard time
believing is ever going to work out.

 Regarding the crash scenario - if the commit fails, just throw away the
 local estimator copy, it's not needed. I'm not sure how to take care of
 the case when commit succeeds and the write of the merged estimator
 fails, but I think it might be possible to write the estimator to xlog
 or something like that. So it would be replayed during recovery etc. Or
 is it a stupid idea?

It's not stupid, in the sense that that is what you'd need to do if
you want to avoid ever having to rescan the table.  But it is another
thing that I think is going to be way too expensive.

 Yes, as I've mentioned above, the multi-column stats are the reason why
 I started working on this. And yes, it really should work like this:

 1. user realizes there's something wrong as the plans are bad
 2. after analysis, the user realizes the reason is an imprecise
   estimate due to dependence between columns
 3. the user enables cross-column stats on the columns and checks
   if it actually solved the issue
 4. if the cost outweighs the gains, the user drops the stats

 Does that sound reasonable?

Yes.  The only caveat I'm trying to insert is that it's hard for me to
see how the proposed approach could ever be cheap enough to be a win.
If we need some kind of more expensive kind of ANALYZE that scans the
whole table, and it's optional, sure, why not?  But that's a one-time
hit.  You run it, it sucks, it's over, and now your queries work.
Odds are good you may never need to touch it again.  Now, if we can
convince ourselves that multi-column stats are likely to require
constant updating, then maybe there would be more to recommend the
stream-based approach, although even then it seems dreadfully complex
and expensive to me.  But I bet these things are pretty static.  If
the city and state tend to imply the zip code when there are 10K rows
in the table, they probably still will when there are 100K rows in the
table.  If users with org_id = 1 have a higher karma score on average
than users with org_id != 1, that'll probably still be true when there
are more users in both classes.  Whether the database can understand
that without rescanning the table depends on the data representation,
of course, but I guess I'm just saying I'd think really, really hard
before abandoning the idea of periodic sampling.  You have to get an
awful lot of benefit out of those cross-column stats to make it worth
paying a run-time cost for them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 ... I guess I'm just saying I'd think really, really hard
 before abandoning the idea of periodic sampling.  You have to get an
 awful lot of benefit out of those cross-column stats to make it worth
 paying a run-time cost for them.

I've been trying to not discourage Tomas from blue-sky speculation,
but I have to say I agree with Robert that the chance of any usable
result from this approach is very very small.  Feel free to do some
experimentation to prove us wrong --- but don't put a lot of effort
into it before that.

regards, tom lane



Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Tomas Vondra
On 19.1.2011 20:25, Robert Haas wrote:
 On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Yes, I was aware of this problem (amount of memory consumed with lots of
 updates), and I kind of hoped someone will come up with a reasonable
 solution.
 
 As far as I can see, periodically sampling some or all of the table is
 really the only thing that has a chance of working OK.  You could try
 to track changes only when they're not too big, but I think that's
 still going to have nasty performance consequences.

Yes, sampling the whole table is the only way to get reliable ndistinct
estimates. I'm not sure what you mean by 'tracking changes' - as I've
mentioned in the previous post, this might be solved by updating a local
copy, which requires constant space (a few kB, see the previous post).
Is that acceptable? I don't know - that's up to the user if he wants to
pay this price.

 So the algorithm would be something like this:

 1. create copy for all distinct estimators influenced by the INSERT
 2. update the local copy
 3. commit and merge the local copies back into the original estimator
 
 Well, maybe.  But DELETEs seem like a problem.  And it's still adding
 foreground work to every query, which I just have a hard time
 believing is ever going to work out

Yes, deletes are difficult to handle. My idea was to compute something
like ((tuples changed + tuples deleted) / tuples total), and indicate
that a rebuild of the estimator is needed if this coefficient changes
too much - e.g. log a message or something like that.
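
Just to illustrate what I have in mind, here is a minimal Python sketch of
the copy / update / merge-at-commit scheme and of the rebuild trigger. A
KMV ("k minimum values") synopsis stands in for the adaptive-sampling
estimator, simply because it merges easily - it is not the actual
estimator from the paper, and all names and thresholds are invented:

import hashlib

class KMV:
    """Mergeable distinct-count synopsis: keep the k smallest hashes seen."""
    def __init__(self, k=1024):
        self.k = k
        self.mins = set()

    def _hash(self, value):
        h = hashlib.sha1(str(value).encode()).digest()
        return int.from_bytes(h[:8], 'big') / 2.0**64   # uniform in [0, 1)

    def add(self, value):
        self.mins.add(self._hash(value))
        if len(self.mins) > self.k:
            self.mins.remove(max(self.mins))

    def merge(self, other):
        self.mins |= other.mins
        while len(self.mins) > self.k:
            self.mins.remove(max(self.mins))

    def ndistinct(self):
        if len(self.mins) < self.k:
            return len(self.mins)           # saw fewer than k distinct values
        return (self.k - 1) / max(self.mins)

    def copy(self):
        c = KMV(self.k)
        c.mins = set(self.mins)
        return c

def needs_rebuild(changed, deleted, total, threshold=0.2):
    # deletes cannot be subtracted from the synopsis, so flag a rescan once
    # the fraction of modified tuples grows too large (threshold is made up)
    return (changed + deleted) / total > threshold

# 1. at the first INSERT the backend copies the shared estimator
shared = KMV()
local = shared.copy()
# 2. the transaction updates only its private copy
for city in ('Praha', 'Brno', 'Ostrava'):
    local.add(city)
# 3. on COMMIT the private copy is merged back; on ROLLBACK it is dropped
shared.merge(local)
print(int(shared.ndistinct()))              # 3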

 Regarding the crash scenario - if the commit fails, just throw away the
 local estimator copy, it's not needed. I'm not sure how to take care of
 the case when commit succeeds and the write of the merged estimator
 fails, but I think it might be possible to write the estimator to xlog
 or something like that. So it would be replayed during recovery etc. Or
 is it a stupid idea?
 
 It's not stupid, in the sense that that is what you'd need to do if
 you want to avoid ever having to rescan the table.  But it is another
 thing that I think is going to be way too expensive.

Way too expensive? All you need to put into the logfile is a copy of the
estimator, which is a few kBs. How is that 'way too expensive'? Sure, it
might be expensive when the user does a lot of small changes in separate
transactions, that's true, but I guess we could minimize the amount of
data written to the xlog by doing a diff of the estimators or something
like that.

 Yes, as I've mentioned above, the multi-column stats are the reason why
 I started working on this. And yes, it really should work like this:

 1. user realizes there's something wrong as the plans are bad
 2. after analysis, the user realizes the reason is an imprecise
   estimate due to dependence between columns
 3. the user enables cross-column stats on the columns and checks
   if it actually solved the issue
 4. if the cost outweighs the gains, the user drops the stats

 Does that sound reasonable?
 
 Yes.  The only caveat I'm trying to insert is that it's hard for me to
 see how the proposed approach could ever be cheap enough to be a win.

IMHO the question is not 'How expensive is that?' but rather 'How
expensive is it compared to the gains?' If the user gets much better
estimates and then a much better plan, this may be a completely
acceptable price.

 If we need some kind of more expensive kind of ANALYZE that scans the
 whole table, and it's optional, sure, why not?  But that's a one-time
 hit.  You run it, it sucks, it's over, and now your queries work.
 Odds are good you may never need to touch it again.  Now, if we can
 convince ourselves that multi-column stats are likely to require
 constant updating, then maybe there would be more to recommend the
 stream-based approach, although even then it seems dreadfully complex
 and expensive to me.  But I bet these things are pretty static.  If

No, the multi-column statistics do not require constant updating. There
are cases where a sampling is perfectly fine, although you may need a
bit larger sample. Generally if you can use a multi-dimensional
histogram, you don't need to scan the whole table.

But the multi-dimensional histograms are not applicable to some cases.
Especially not to the ZIP code fail case that was repeatedly discussed. So
in the case of discrete data we need a different approach - and the
solution I've proposed is based on using ndistinct estimates to derive the
selectivity estimates (it's actually based on one of the papers listed on
the wiki).

 and expensive to me.  But I bet these things are pretty static.  If
 the city and state tend to imply the zip code when there are 10K rows
 in the table, they probably still will when there are 100K rows in the
 table.  If users with org_id = 1 have a higher karma score on average

OK, how will you measure the implicativeness?

We have discussed this in another thread and there are a lot of gotchas,
although it seems like a really 

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Tomas Vondra
On 19.1.2011 20:46, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
 ... I guess I'm just saying I'd think really, really hard
 before abandoning the idea of periodic sampling.  You have to get an
 awful lot of benefit out of those cross-column stats to make it worth
 paying a run-time cost for them.
 
 I've been trying to not discourage Tomas from blue-sky speculation,
 but I have to say I agree with Robert that the chance of any usable
 result from this approach is very very small.  Feel free to do some
 experimentation to prove us wrong --- but don't put a lot of effort
 into it before that.
 
   regards, tom lane

Discourage? Not really - I have not expected this to be a simple thing
to implement. And yes, it might turn out to be a dead end eventually.

OTOH the "it seems really expensive, so let's not try to build it"
approach is a certain dead end, so I'm not going to surrender to such
speculation (although those are valid concerns and I'll have to address
them in the future).

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
On Wed, Jan 19, 2011 at 2:13 PM, Tomas Vondra t...@fuzzy.cz wrote:
 On 19.1.2011 20:25, Robert Haas wrote:
 On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Yes, I was aware of this problem (amount of memory consumed with lots of
 updates), and I kind of hoped someone will come up with a reasonable
 solution.

 As far as I can see, periodically sampling some or all of the table is
 really the only thing that has a chance of working OK.  You could try
 to track changes only when they're not too big, but I think that's
 still going to have nasty performance consequences.

 Yes, sampling all the table is the only way to get reliable ndistinct
 estimates.

IMHO continuing to focus on worst case results is a dead end. Yes, for
some distributions, ndistinct is very hard to estimate well. However,
let us not forget that the current ndistinct estimator is very bad but
the postgres planner is, on the whole, very good.  As far as I can see
this is due to two facts:

1) The distribution of values in a table is rarely pathological, and
usually follows one of several common patterns. ( IOW, we have good
heuristics )

2) The choice of plan is fairly robust to mis-estimation of ndistinct,
because there are only a few things the executor can choose. It
doesn't usually matter if a value composes 5% or 20% ( or 99% ) of
the table, we still usually need to scan the entire table.

Again, in my *very* humble opinion, if the ultimate goal is to improve
the planner, we should try to cut down on the number of cases in which
a poor estimate of ndistinct gives a really bad plan instead of
chasing after marginal gains from a better ndistinct estimator. Maybe
( as I've advocated for before ) this means parameterizing the
distribution of values ( ie, find that the values are roughly uniform
), maybe this means estimating the error of our statistics and using
the most robust rather than the best plan, or maybe it means something
totally different. But: ( and the literature seems to support me )
pounding away at the ndistinct estimator seems like a dead end. If you
think about it, it's a bit ridiculous to look at the whole table
*just* to estimate ndistinct - if we go that far, why don't we just
store the full distribution and be done with it?

Best,
Nathan



Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Tomas Vondra
On 19.1.2011 23:44, Nathan Boley wrote:
 1) The distribution of values in a table is rarely pathological, and
 usually follows one of several common patterns. ( IOW, we have good
 heuristics )

Not true. You're missing the goal of this effort - to get an ndistinct
estimate for a combination of multiple columns, which is usually
pathological in the case of dependent columns. And that is exactly the case
when the user will explicitly enable this feature to get better
estimates (and thus better plans).

 2) The choice of plan is fairly robust to mis-estimation of ndistinct,
 because there are only a few things the executer can choose. It
 doesn't usually matter if a value composes 5% or 20%  ( or 99% ) of
 the table, we still usually need to scan the entire table.

Again, don't think about a single column (although even in that case
there are known fail cases). Think about multiple columns and the number
of distinct combinations. With attribute value independence assumption,
this is usually significantly underestimated. And that's a problem as it
often leads to an index scan instead of sequential scan (or something
like that).

 Again, in my *very* humble opinion, If the ultimate goal is to improve
 the planner, we should try to cut down on the number of cases in which
 a poor estimate of ndistinct gives a really bad plan instead of
 chasing after marginal gains from a better ndistinct estimator. Maybe

What is a marginal gain? The ultimate goal is to build cross-column
stats, which in the case of dependent columns usually results in poor plans.
How is fixing that a marginal gain?

Yes, it may turn out it's not worth it, but it's a bit early to say that
without an implementation and some hard data.

 ( as I've advocated for before ) this means parameterizing the
 distribution of values ( ie, find that the values are roughly uniform
 ), maybe this means estimating the error of our statistics and using
 the most robust rather than the best plan, or maybe it means something
 totally different. But: ( and the literature seems to support me )

Which literature? I've read an awful lot of papers on this topic lately,
and I don't remember anything like that. If there's an interesting
paper, I'd like to read it. Yes, all the papers state that estimating
ndistinct is a particularly tricky task, but that's to be expected, isn't it?

I've even checked how other databases do this - e.g. Oracle does it very
similarly to what I proposed (the user has to enable the feature, it has
to be rebuilt from time to time, etc.). I'm not saying we should do
everything like Oracle, but they are not clueless idiots.

 pounding away at the ndistinct estimator seems like a dead end. If you
 think about it, it's a bit ridiculous to look at the whole table
 *just* to estimate ndistinct - if we go that far why dont we just
 store the full distribution and be done with it?

No, I'm not trying to do this just to improve the ndistinct estimate.
The goal is to improve the cardinality estimate in case of dependent
columns, and one of the papers depends on good ndistinct estimates.

And actually it does not depend on ndistinct for the columns only, it
depends on ndistinct estimates for the combination of columns. So
improving the ndistinct estimates for columns is just a necessary first
step (and only if it works reasonably well, we can do the next step).

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Florian Pflug
On Jan19, 2011, at 23:44 , Nathan Boley wrote:
 If you think about it, it's a bit ridiculous to look at the whole table
 *just* to estimate ndistinct - if we go that far why dont we just
 store the full distribution and be done with it?

The crucial point that you're missing here is that ndistinct provides an
estimate even if you *don't* have a specific value to search for at hand.
This is way more common than you may think - it happens, for example, every
time you PREPARE a statement with parameters. Even knowing the full distribution
has no advantage in this case - the best you could do is to average the
individual probabilities which gives ... well, 1/ndistinct.
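
A quick numeric illustration of that point - the per-value frequencies
always sum to 1, so their average is 1/ndistinct no matter how skewed the
distribution is (the frequencies below are made up):

skewed = [0.9991] + [0.0001] * 9   # one dominant value, 10 distinct values
uniform = [0.1] * 10               # the same 10 values, evenly spread
for freqs in (skewed, uniform):
    print(sum(freqs), sum(freqs) / len(freqs))   # ~1.0 and ~0.1 both times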

best regards,
Florian Pflug




Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
 If you think about it, it's a bit ridiculous to look at the whole table
 *just* to estimate ndistinct - if we go that far why dont we just
 store the full distribution and be done with it?

 - the best you could do is to average the
 individual probabilities which gives ... well, 1/ndistinct.


That is certainly *not* the best you could do in every case. The mean
is only the best estimate under L2 (squared-error) loss, which is
definitely not the loss we face here.

Consider a table with 10K values, 9,990 of which are 1 and the rest of
which are 2, 3, ..., 10, versus a table that has the same 10 distinct
values evenly distributed. For a simple equality query, in the first
case, a bitmap scan might be best. In the second case, a sequential
scan would always be best.

This is precisely the point I was trying to make in my email: the loss
function is very complicated. Improving the ndistinct estimator could
significantly improve the estimates of ndistinct ( in the quadratic
loss sense ) while only marginally improving the plans.

-Nathan



Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley

 Not true. You're missing the goal of this effort - to get ndistinct
 estimate for combination of multiple columns. Which is usually
 pathological in case of dependent columns.

snip

 Again, don't think about a single column (although even in that case
 there are known fail cases). Think about multiple columns and the number
 of distinct combinations. With attribute value independence assumption,
 this is usually significantly underestimated. And that's a problem as it
 often leads to an index scan instead of sequential scan (or something
 like that).

My point was that, in the case of single columns, we've done pretty
well despite the poor ndistinct estimates. I suspect the same will be
true in the case of multiple columns; good heuristics will trump
minimax estimators.

 ( as I've advocated for before ) this means parameterizing the
 distribution of values ( ie, find that the values are roughly uniform
 ), maybe this means estimating the error of our statistics and using
 the most robust rather than the best plan, or maybe it means something
 totally different. But: ( and the literature seems to support me )

 Which literature? I've read an awful lot of papers on this topic lately,
 and I don't remember anything like that. If there's an interesting
 paper, I'd like to read it. Yes, all the papers state estimating a
 ndistinct is a particularly tricky task, but that's somehow expected?

It is definitely expected - non-parametric minimax results are always
*very* weak. The papers that you've cited seem to confirm this;
bounding ndistinct estimation error is tricky in the general case (
even with very simple loss functions, which we do not have ). The same
is true for other non-parametric methods. Think about kernel density
estimation: how many data points do you need to estimate the density
of a normal distribution well? What about if you *know* that the data
is normal ( or even that it comes from a big family? ).

 No, I'm not trying to do this just to improve the ndistinct estimate.
 The goal is to improve the cardinality estimate in case of dependent
 columns, and one of the papers depends on good ndistinct estimates.

 And actually it does not depend on ndistinct for the columns only, it
 depends on ndistinct estimates for the combination of columns. So
 improving the ndistinct estimates for columns is just a necessary first
 step (and only if it works reasonably well, we can do the next step).

I think that any approach which depends on precise estimates of
ndistinct is not practical.

I am very happy that you've spent so much time on this, and I'm sorry
if my previous email came off as combative. My point was only that
simple heuristics have served us well in the past and, before we go to
the effort of new, complicated schemes, we should see how well similar
heuristics work in the multiple column case. I am worried that if the
initial plan is too complicated then nothing will happen and, even if
something does happen, it will be tough to get it committed ( check
the archives for cross column stat threads - there are a lot ).

Best,
Nathan Boley



Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Florian Pflug
On Jan20, 2011, at 02:41 , Nathan Boley wrote:
 If you think about it, it's a bit ridiculous to look at the whole table
 *just* to estimate ndistinct - if we go that far why dont we just
 store the full distribution and be done with it?
 
 - the best you could do is to average the
 individual probabilities which gives ... well, 1/ndistinct.
 
 That is certainly *not* the best you could do in every case. The mean
 is only the best estimate in L2, which is definitely not the case
 here.

No, not in every case. But on average it comes out best, no?

 Consider a table with 10K values, 9,990 of which are 1 and the rest of
 which are 2, 3, ..., 10, versus a table that has the same 10 distinct
 values evenly distributed. For a simple equality query, in the first
 case, a bitmap scan might be best. In the second case, a sequential
 scan would always be best.

True. But selectivity estimates alone don't show the difference there.

Also, in my personal experience this isn't really the area where we've got
problems now. The problem cases for me always were queries with a rather
large number of joins (7 to 10 or so) plus rather complex search
conditions. The join order, not the access strategy, then becomes the
deciding factor in plan performance. And the join order *is* driven purely
off the selectivity estimates, unlike the access strategy, which even today
takes other factors into account (like clusteredness, I believe).

 This is precisely the point I was trying to make in my email: the loss
 function is very complicated. Improving the ndistinct estimator could
 significantly improve the estimates of ndistinct ( in the quadratic
 loss sense ) while only marginally improving the plans.


The point of this exercise isn't really to improve the ndistinct estimates
for single columns. Rather, we'd like to get ndistinct estimates for
groups of columns because that allows us to use the uniform bayesian
approach to multi-column selectivity estimation that Tomas found literature
on. Which on the whole seems like it *does* have a chance of greatly
improving the plans in some cases. We could, of course, estimate multi-column
ndistinct the same way we estimate single-column ndistincts, but quite a few
people raised concerns that this wouldn't work due to the large error our
current ndistinct estimates have. 

best regards,
Florian Pflug




Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Robert Haas
On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Regarding the crash scenario - if the commit fails, just throw away the
 local estimator copy, it's not needed. I'm not sure how to take care of
 the case when commit succeeds and the write of the merged estimator
 fails, but I think it might be possible to write the estimator to xlog
 or something like that. So it would be replayed during recovery etc. Or
 is it a stupid idea?

 It's not stupid, in the sense that that is what you'd need to do if
 you want to avoid ever having to rescan the table.  But it is another
 thing that I think is going to be way too expensive.

 Way too expensive? All you need to put into the logfile is a copy of the
 estimator, which is a few kBs. How is that 'way too expensive'?

At this point, this is all a matter of religion, right?  Neither of us
has a working implementation we've benchmarked.  But yes, I believe
you're going to find that implementing some kind of streaming
estimator is going to impose a... *pulls number out of rear end* 6%
performance penalty, even after you've optimized the living daylights
out of it.  Now you might say... big deal, it improves my problem
queries by 100x.  OK, but if you could get the same benefit by doing
an occasional full table scan during off hours, you could have the
same performance with a *0%* performance penalty.  Even better, the
code changes would be confined to ANALYZE rather than spread out all
over the system, which has positive implications for robustness and
likelihood of commit.

I'm not trying to argue you out of working on this.  It's obviously
your time to spend, and if works better than I think it will, great!
I'm merely offering you an opinion on what will probably happen if you
go this route - namely, it'll carry an unpalatable run-time penalty.
That opinion may be worth no more than what you paid for it, but there
you have it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Robert Haas
On Wed, Jan 19, 2011 at 9:35 PM, Florian Pflug f...@phlo.org wrote:
 Also, in my personal experience this isn't really the area we've got
 problems now. The problem cases for me always were queries with a rather
 large number of joins (7 to 10 or so) plus rather complex search
 conditions. The join order, not the access strategy, then becomes the
 deciding factor in plan performance.

This certainly matches my experience (except sometimes with more joins).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
On Wed, Jan 19, 2011 at 6:35 PM, Florian Pflug f...@phlo.org wrote:
 On Jan20, 2011, at 02:41 , Nathan Boley wrote:
 If you think about it, it's a bit ridiculous to look at the whole table
 *just* to estimate ndistinct - if we go that far why dont we just
 store the full distribution and be done with it?

 - the best you could do is to average the
 individual probabilities which gives ... well, 1/ndistinct.

 That is certainly *not* the best you could do in every case. The mean
 is only the best estimate in L2, which is definitely not the case
 here.

 No, not in every case. But on average it comes out best, no?

In the sense of minimizing the average plan cost over all values?
Definitely not. Consider the trivial case where the cost of a bitmap scan
is the cost of a sequential scan + epsilon.


 Consider a table with 10K values, 9,990 of which are 1 and the rest of
 which are 2, 3, ..., 10, versus a table that has the same 10 distinct
 values evenly distributed. For a simple equality query, in the first
 case, a bitmap scan might be best. In the second case, a sequential
 scan would always be best.

 True. But selectivity estimates alone don't show the difference there.

Yet the full distribution would - supporting my argument that even in
cases where we don't know a specific value, the full distribution is
more informative.


 Also, in my personal experience this isn't really the area we've got
 problems now. The problem cases for me always were queries with a rather
 large number of joins (7 to 10 or so) plus rather complex search
 conditions. The join order, not the access strategy, then becomes the
 deciding factor in plan performance. And the join order *is* driven purely
 off the selectivity estimates, unlike the access strategy which even today
 takes other factors into account (like clusteredness, I believe)


I think I'm losing you. Why would ndistinct be completely sufficient for
estimating the optimal join order?

 This is precisely the point I was trying to make in my email: the loss
 function is very complicated. Improving the ndistinct estimator could
 significantly improve the estimates of ndistinct ( in the quadratic
 loss sense ) while only marginally improving the plans.


 The point of this exercise isn't really to improve the ndistinct estimates
 for single columns. Rather, we'd like to get ndistinct estimates for
 groups of columns because that allows us to use the uniform bayesian
 approach to multi-column selectivity estimation that Tomas found literature
 on. Which on the whole seems like it *does* have a chance of greatly
 improving the plans in some cases. We could, of course, estimate multi-column
 ndistinct the same way we estimate single-column ndistincts, but quite a few
 people raised concerns that this wouldn't work due to the large error our
 current ndistinct estimates have.

Sure. But estimating multi-column ndistinct is surely not the only way
to approach this problem.

The comment I made, which you objected to, was that it's silly to look
at the whole table and then throw away all information *except*
ndistinct. If we really think that looking at the whole table is the
best approach, then shouldn't we be able to get better statistics? Is
this really an open question?

-Nathan



Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread tv
 On Jan 17, 2011, at 6:36 PM, Tomas Vondra wrote:
 1) Forks are 'per relation' but the distinct estimators are 'per
   column' (or 'per group of columns') so I'm not sure whether the file
   should contain all the estimators for the table, or if there should
   be one fork for each estimator. The former is a bit difficult to
   manage, the latter somehow breaks the current fork naming convention.

 Yeah, when I looked at the fork stuff I was disappointed to find out
 there's essentially no support for dynamically adding forks. There's two
 other possible uses for that I can think of:

 - Forks are very possibly a more efficient way to deal with TOAST than
 having separate tables. There's a fair amount of overhead we pay for the
 current setup.
 - Dynamic forks would make it possible to do a column-store database, or
 at least something approximating one.

 Without some research, there's no way to know if either of the above makes
 sense; but without dynamic forks we're pretty much dead in the water.

 So I wonder what it would take to support dynamically adding forks...

Interesting ideas, but a bit out of scope. I think I'll go with one fork
containing all the estimators for now, although it might be inconvenient
in some cases. I was thinking about rebuilding a single estimator with
increased precision - in that case the size changes so that all the other
data has to be shifted. But this won't be very common (usually all the
estimators will be rebuilt at the same time), and it's actually doable.
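
To illustrate the layout I have in mind, here is a minimal Python sketch of
one fork holding several variable-size estimator blobs behind a small
directory of offsets - which is exactly why growing one estimator means
shifting everything stored after it. The format and names are invented for
illustration only:

import struct

def pack_estimators(blobs):
    """blobs: list of bytes, one serialized estimator per (group of) column(s)."""
    header = struct.pack('I', len(blobs))
    directory, body = [], b''
    pos = 4 + 8 * len(blobs)               # header + (offset, length) pairs
    for b in blobs:
        directory.append(struct.pack('II', pos, len(b)))
        body += b
        pos += len(b)
    return header + b''.join(directory) + body

def unpack_estimator(buf, i):
    off, length = struct.unpack_from('II', buf, 4 + 8 * i)
    return buf[off:off + length]

fork = pack_estimators([b'estimator-for-col-a', b'estimator-for-(a,b)'])
print(unpack_estimator(fork, 1))           # b'estimator-for-(a,b)'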

So the most important question is how to intercept the new/updated rows,
and where to store them. I think each backend should maintain its own
private list of new records and forward them only in case of commit. Does
that sound reasonable?

regards
Tomas




Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 4:53 AM,  t...@fuzzy.cz wrote:
 So the most important question is how to intercept the new/updated rows,
 and where to store them. I think each backend should maintain it's own
 private list of new records and forward them only in case of commit. Does
 that sound reasonable?

At the risk of sounding demoralizing, nothing about this proposal
sounds very promising to me, and that sounds like a particularly bad
idea.  What happens if the transaction updates a billion records?  Or
even a million records?  Are you going to store all of those in
backend-local memory until commit time?  Or spool them in a
potentially-gigantic disk file somewhere?  That memory allocation - or
file - could grow to be larger than the size of the entire database in
the worst case.  And COMMIT could take an awfully long time if it has
to spool megabytes or gigabytes of data off to some other process.
And what happens if there's a crash after the COMMIT but before all
that data is sent?  The estimates become permanently wrong?

And are we doing all of this just to get a more accurate estimate of
ndistinct?  For the amount of effort that it will probably take to get
this working at all, you could probably implement index-only scans and
have enough bandwidth left over to tackle global temporary tables.
And unless I'm missing something, the chances that the performance
consequences will be tolerable are pretty much zero.  And it would
only benefit the tiny fraction of users for whom bad n_distinct
estimates cause bad plans, and then the even tinier percentage of
those who can't conveniently fix it by using the manual override that
we already have in place - which presumably means people who have
gigantic tables that are regularly rewritten with massive changes in
the data distribution that affect plan choice.  Is that more than the
empty set?

Maybe the idea here is that this wouldn't fix just ndistinct estimates
but would also help with multi-column statistics.  Even if that's the
case, I think it's almost certainly a dead end from a performance
standpoint.  Some kind of manual ANALYZE process that can be invoked
when needed to scan the entire table would be painful but maybe
acceptable for certain people with a problem they can't fix any other
way, but doing this automatically for everyone seems like a really bad
idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Jim Nasby
On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
 On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote:
 - Forks are very possibly a more efficient way to deal with TOAST than 
 having separate tables. There's a fair amount of overhead we pay for the 
 current setup.
 
 That seems like an interesting idea, but I actually don't see why it
 would be any more efficient, and it seems like you'd end up
 reinventing things like vacuum and free space map management.

The FSM would take some effort, but I don't think vacuum would be that hard to 
deal with; you'd just have to free up the space in any referenced toast forks 
at the same time that you vacuumed the heap.

 - Dynamic forks would make it possible to do a column-store database, or at 
 least something approximating one.
 
 I've been wondering whether we could do something like this by
 treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two
 tables t1 and t2, one with columns pk, a1, a2, a3 and the other with
 columns pk, b1, b2, b3.  SELECT * FROM t would be translated into
 SELECT * FROM t1, t2 WHERE t1.pk = t2.pk.

Possibly, but you'd be paying tuple overhead twice, which is what I was looking 
to avoid with forks.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby j...@nasby.net wrote:
 On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
 On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote:
 - Forks are very possibly a more efficient way to deal with TOAST than 
 having separate tables. There's a fair amount of overhead we pay for the 
 current setup.

 That seems like an interesting idea, but I actually don't see why it
 would be any more efficient, and it seems like you'd end up
 reinventing things like vacuum and free space map management.

 The FSM would take some effort, but I don't think vacuum would be that hard 
 to deal with; you'd just have to free up the space in any referenced toast 
 forks at the same time that you vacuumed the heap.

How's that different from what vacuum does on a TOAST table now?

 - Dynamic forks would make it possible to do a column-store database, or at 
 least something approximating one.

 I've been wondering whether we could do something like this by
 treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two
 tables t1 and t2, one with columns pk, a1, a2, a3 and the other with
 columns pk, b1, b2, b3.  SELECT * FROM t would be translated into
 SELECT * FROM t1, t2 WHERE t1.pk = t2.pk.

 Possibly, but you'd be paying tuple overhead twice, which is what I was 
 looking to avoid with forks.

What exactly do you mean by tuple overhead?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby j...@nasby.net wrote:
 On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
 On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote:
 - Forks are very possibly a more efficient way to deal with TOAST than 
 having separate tables. There's a fair amount of overhead we pay for the 
 current setup.
 
 That seems like an interesting idea, but I actually don't see why it
 would be any more efficient, and it seems like you'd end up
 reinventing things like vacuum and free space map management.
 
 The FSM would take some effort, but I don't think vacuum would be that hard 
 to deal with; you'd just have to free up the space in any referenced toast 
 forks at the same time that you vacuumed the heap.

 How's that different from what vacuum does on a TOAST table now?

Even more to the point: Jim hasn't provided one single reason to suppose
that this would be better-performing than the existing approach.  It
looks to me like a large amount of work, and loss of on-disk
compatibility, for nothing at all except the sake of change.

regards, tom lane



Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Jim Nasby
On Jan 18, 2011, at 11:24 AM, Robert Haas wrote:
 On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby j...@nasby.net wrote:
 On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
 On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote:
 - Forks are very possibly a more efficient way to deal with TOAST than 
 having separate tables. There's a fair amount of overhead we pay for the 
 current setup.
 
 That seems like an interesting idea, but I actually don't see why it
 would be any more efficient, and it seems like you'd end up
 reinventing things like vacuum and free space map management.
 
 The FSM would take some effort, but I don't think vacuum would be that hard 
 to deal with; you'd just have to free up the space in any referenced toast 
 forks at the same time that you vacuumed the heap.
 
 How's that different from what vacuum does on a TOAST table now?

TOAST vacuum is currently an entirely separate vacuum. It might run at the same 
time as the main table vacuum, but it still has all the work that would be 
associated with vacuuming a table with the definition of a toast table. In 
fact, at one point vacuuming toast took two passes: the first deleted the toast 
rows that were no longer needed, then you had to go back and vacuum those 
deleted rows.

 - Dynamic forks would make it possible to do a column-store database, or 
 at least something approximating one.
 
 I've been wondering whether we could do something like this by
 treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two
 tables t1 and t2, one with columns pk, a1, a2, a3 and the other with
 columns pk, b1, b2, b3.  SELECT * FROM t would be translated into
 SELECT * FROM t1, t2 WHERE t1.pk = t2.pk.
 
 Possibly, but you'd be paying tuple overhead twice, which is what I was 
 looking to avoid with forks.
 
 What exactly do you mean by tuple overhead?

typedef struct HeapTupleHeaderData. With only two tables it might not be that 
bad, depending on the fields. Beyond two tables it's almost certainly a loser.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Jim Nasby
On Jan 18, 2011, at 11:32 AM, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby j...@nasby.net wrote:
 On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
 On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote:
 - Forks are very possibly a more efficient way to deal with TOAST than 
 having separate tables. There's a fair amount of overhead we pay for the 
 current setup.
 
 That seems like an interesting idea, but I actually don't see why it
 would be any more efficient, and it seems like you'd end up
 reinventing things like vacuum and free space map management.
 
 The FSM would take some effort, but I don't think vacuum would be that hard 
 to deal with; you'd just have to free up the space in any referenced toast 
 forks at the same time that you vacuumed the heap.
 
 How's that different from what vacuum does on a TOAST table now?
 
 Even more to the point: Jim hasn't provided one single reason to suppose
 that this would be better-performing than the existing approach.  It
 looks to me like a large amount of work, and loss of on-disk
 compatibility, for nothing at all except the sake of change.

Yes, I was only pointing out that there were possible uses for allowing a 
variable number of forks per relation if Tomas felt the need to go that 
direction. Changing toast would certainly require some very convincing 
arguments as to the benefits.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 12:32 PM, Jim Nasby j...@nasby.net wrote:
 How's that different from what vacuum does on a TOAST table now?

 TOAST vacuum is currently an entirely separate vacuum. It might run at the 
 same time as the main table vacuum, but it still has all the work that would 
 be associated with vacuuming a table with the definition of a toast table. In 
 fact, at one point vacuuming toast took two passes: the first deleted the 
 toast rows that were no longer needed, then you had to go back and vacuum 
 those deleted rows.

I don't know whether it still works that way or not, but if it does,
fixing it could perhaps be done with far less drastic surgery than
what you're proposing here.

 Possibly, but you'd be paying tuple overhead twice, which is what I was 
 looking to avoid with forks.

 What exactly do you mean by tuple overhead?

 typedef struct HeapTupleHeaderData. With only two tables it might not be that 
 bad, depending on the fields. Beyond two tables it's almost certainly a loser.

A struct definition by itself doesn't cause overhead.  Are you talking
about storage on disk?  CPU usage for MVCC visibility testing?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Tomas Vondra
Dne 18.1.2011 18:12, Robert Haas napsal(a):
 On Tue, Jan 18, 2011 at 4:53 AM,  t...@fuzzy.cz wrote:
 So the most important question is how to intercept the new/updated rows,
 and where to store them. I think each backend should maintain it's own
 private list of new records and forward them only in case of commit. Does
 that sound reasonable?
 
 At the risk of sounding demoralizing, nothing about this proposal
 sounds very promising to me, and that sounds like a particularly bad
 idea.  What happens if the transaction updates a billion records?  Or
 even a million records?  Are you going to store all of those in
 backend-local memory until commit time?  Or spool them in a
 potentially-gigantic disk file somewhere?  That memory allocation - or
 file - could grow to be larger than the size of the entire database in
 the worst case.  And COMMIT could take an awfully long time if it has
 to spool megabytes or gigabytes of data off to some other process.
 And what happens if there's a crash after the COMMIT but before all
 that data is sent?  The estimates become permanently wrong?

Yes, I was aware of this problem (the amount of memory consumed with lots of
updates), and I kind of hoped someone would come up with a reasonable
solution.

I was thinking about it and I believe at least one of the algorithms
(the adaptive sampling) might handle merging, i.e. the backends would
not need to store the list of values, just keep a private copy of the
estimator and update it. And at the end (after commit), the local copy
would be merged back into the actual estimator.

So the algorithm would be something like this:

1. create a local copy of each distinct estimator influenced by the INSERT
2. update the local copies
3. on commit, merge the local copies back into the original estimators

The amount of copied data is very small - it depends on the number of
distinct items the estimator can handle and on the precision (see adaptive.c
for details). Some examples:

2^48 items + 1% error = 86 kB
2^48 items + 5% error = 3.4 kB
2^64 items + 1% error = 115 kB
2^64 items + 5% error = 4.6 kB

so it's much better than storing all the data.
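
Just to illustrate the intended flow, here is a tiny self-contained sketch.
It uses a trivial bitmap in place of the real adaptive-sampling structure
(whether the real structure can be merged like this is exactly what I still
need to verify), and all the names below are made up:

    /* Toy illustration of steps 1-3 above. The bitmap estimator here is
     * just a stand-in; the real adaptive sampling state (adaptive.c) is
     * different and merging it still needs to be verified. */
    #include <stdint.h>
    #include <string.h>

    #define EST_BYTES 4096                 /* ~4 kB, comparable to the sizes above */

    typedef struct { uint8_t bits[EST_BYTES]; } ToyEstimator;

    static ToyEstimator local_copy;        /* backend-private copy */
    static int have_local_copy = 0;

    /* step 1: on the first INSERT of the transaction, copy the shared state */
    static void xact_start_tracking(const ToyEstimator *shared)
    {
        memcpy(&local_copy, shared, sizeof(ToyEstimator));
        have_local_copy = 1;
    }

    /* step 2: update only the private copy for each inserted value */
    static void xact_track_value(uint64_t value_hash)
    {
        uint64_t bit = value_hash % (EST_BYTES * 8);
        local_copy.bits[bit / 8] |= (uint8_t) (1 << (bit % 8));
    }

    /* step 3: at COMMIT merge the copy back; on ABORT simply forget it */
    static void xact_merge_on_commit(ToyEstimator *shared)
    {
        if (!have_local_copy)
            return;
        for (int i = 0; i < EST_BYTES; i++)
            shared->bits[i] |= local_copy.bits[i];   /* bitwise OR merge */
        have_local_copy = 0;
    }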

But I really need to check this is possible (right now I'm quite busy
with organizing a local conference, so maybe next week).

Regarding the crash scenario - if the commit fails, we just throw away the
local estimator copy; it's not needed. I'm not sure how to take care of
the case when the commit succeeds but the write of the merged estimator
fails. I think it might be possible to write the estimator to the xlog
or something like that, so it would be replayed during recovery etc. Or
is that a stupid idea?

 And are we doing all of this just to get a more accurate estimate of
 ndistinct?  For the amount of effort that it will probably take to get
 this working at all, you could probably implement index-only scans and
 have enough bandwidth left over to tackle global temporary tables.
 And unless I'm missing something, the chances that the performance
 consequences will be tolerable are pretty much zero.  And it would
 only benefit the tiny fraction of users for whom bad n_distinct
 estimates cause bad plans, and then the even tinier percentage of
 those who can't conveniently fix it by using the manual override that
 we already have in place - which presumably means people who have
 gigantic tables that are regularly rewritten with massive changes in
 the data distribution that affect plan choice.  Is that more than the
 empty set?

First of all, I've never proposed to use this as a default behaviour.
Actually I've strongly argued against that idea on several occasions. So
the user would have to enable that explicitly for some columns - I really
am not an idiot who would force everyone to pay for this.

So the goal is to help the small fraction of users and keep the
current solution for the rest of them.

And yes, the main reason why I am working on this are the multi-column
stats (in case of discrete columns). You can fix the ndistinct estimate
by manually overriding the estimate, but for the multi-column stats you
need an estimate of ndistinct for the combination of columns, and that's
a bit difficult to do manually.

The other reason is I've studied statistics (and I enjoy it, which seems
weird to most people). And it's a good way to learn the internals.

 Maybe the idea here is that this wouldn't fix just ndistinct estimates
 but would also help with multi-column statistics.  Even if that's the
 case, I think it's almost certainly a dead end from a performance
 standpoint.  Some kind of manual ANALYZE process that can be invoked
 when needed to scan the entire table would be painful but maybe
 acceptable for certain people with a problem they can't fix any other
 way, but doing this automatically for everyone seems like a really bad
 idea.

Yes, as I've mentioned above, the multi-column stats are the reason why
I started working on this. And yes, it really should work like this:

1. user realizes there's something wrong as the plans are bad
2. after 

Re: [HACKERS] estimating # of distinct values

2011-01-17 Thread Tomas Vondra
Dne 9.1.2011 13:58, Jim Nasby napsal(a):
 A resource fork? Not sure what you mean, could you describe it in more
 detail?
 
 Ooops, resource forks are a filesystem thing; we call them relation forks. 
 From src/backend/storage/smgr/README:

OK, I think using relation forks seems like a good solution. I've done
some basic research and I think these are the basic steps when adding a
new fork:

1) define a new item in the ForkNum enum (relfilenode.h) - this should
   be something like DISTINCT_FORK, I guess

2) modify the ForkNames (catalog.c) and the relpathbackend so that the
   proper filename is assigned to the fork

And then it will be accessed through smgr (smgrcreate etc.). Am I right
or is there something else I need to do?
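
Just to be concrete, I imagine (1) and (2) would look roughly like this
(a sketch only, not a tested patch - the DISTINCT_FORKNUM name and the
"ndist" suffix are placeholders, and the list of existing forks may differ):

    /* src/include/storage/relfilenode.h - add the new fork number
     * (sketch only; existing members abbreviated) */
    typedef enum ForkNumber
    {
        InvalidForkNumber = -1,
        MAIN_FORKNUM = 0,
        FSM_FORKNUM,
        VISIBILITYMAP_FORKNUM,
        /* ... any other existing forks ... */
        DISTINCT_FORKNUM           /* new: ndistinct estimators */
    } ForkNumber;

    #define MAX_FORKNUM        DISTINCT_FORKNUM

    /* src/backend/catalog/catalog.c - file name suffix for the new fork */
    const char *forkNames[] = {
        "main",                    /* MAIN_FORKNUM */
        "fsm",                     /* FSM_FORKNUM */
        "vm",                      /* VISIBILITYMAP_FORKNUM */
        "ndist"                    /* DISTINCT_FORKNUM (placeholder suffix) */
    };

    /* the fork itself would then be created through the smgr layer, e.g.
     *    smgrcreate(reln, DISTINCT_FORKNUM, false);
     */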

There are a few open questions though:

1) Forks are 'per relation' but the distinct estimators are 'per
   column' (or 'per group of columns') so I'm not sure whether the file
   should contain all the estimators for the table, or if there should
   be one fork for each estimator. The former is a bit difficult to
   manage, the latter somehow breaks the current fork naming convention.

2) Where to keep the info that there is an estimator for a column? I
   guess we could put this into pg_attribute (it's one boolean). But
   what about the estimators for groups of columns? Because that's why
   I'm building this - to get distinct estimates for groups of columns.

   I guess we'll need a new system catalog to track this? (The same
   will be true for multi-dimensional histograms anyway).

3) I still am not sure how to manage the updates, i.e. how to track the
   new values.

   One possibility might be to do that synchronously - whenever a new
   item is inserted into the table, check if there's an estimator and
   update it. Updating the estimators is quite efficient (e.g. the
   bitmap manages to do 10.000.000 inserts in 9 seconds on my ancient
   workstation) although there might be issues with locking etc.

   The other possibility is to update the estimator asynchronously, i.e.
   store the new values somewhere (or just ctid of the row), and then
   process it periodically. I'm not sure how to intercept the new rows
   and where to store them. In another fork? Somewhere else?

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2011-01-17 Thread Jim Nasby
On Jan 17, 2011, at 6:36 PM, Tomas Vondra wrote:
 1) Forks are 'per relation' but the distinct estimators are 'per
   column' (or 'per group of columns') so I'm not sure whether the file
   should contain all the estimators for the table, or if there should
   be one fork for each estimator. The former is a bit difficult to
   manage, the latter somehow breaks the current fork naming convention.

Yeah, when I looked at the fork stuff I was disappointed to find out there's 
essentially no support for dynamically adding forks. There are two other possible
uses for that I can think of:

- Forks are very possibly a more efficient way to deal with TOAST than having 
separate tables. There's a fair amount of overhead we pay for the current setup.
- Dynamic forks would make it possible to do a column-store database, or at 
least something approximating one.

Without some research, there's no way to know if either of the above makes 
sense; but without dynamic forks we're pretty much dead in the water.

So I wonder what it would take to support dynamically adding forks...
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] estimating # of distinct values

2011-01-17 Thread Robert Haas
On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote:
 - Forks are very possibly a more efficient way to deal with TOAST than having 
 separate tables. There's a fair amount of overhead we pay for the current 
 setup.

That seems like an interesting idea, but I actually don't see why it
would be any more efficient, and it seems like you'd end up
reinventing things like vacuum and free space map management.

 - Dynamic forks would make it possible to do a column-store database, or at 
 least something approximating one.

I've been wondering whether we could do something like this by
treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two
tables t1 and t2, one with columns pk, a1, a2, a3 and the other with
columns pk, b1, b2, b3.  SELECT * FROM t would be translated into
SELECT * FROM t1, t2 WHERE t1.pk = t2.pk.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] estimating # of distinct values

2011-01-15 Thread Tomas Vondra
Hi,

a short update regarding the ndistinct stream estimators - I've
implemented the estimators described in the papers I've mentioned in my
previous posts. If you want to try it, the sources are available at github,
at http://tvondra.github.com/pg_estimator (ignore the page, I have to
update it, just download the sources).

You can build the estimators as standalone benchmarks (build.sh) or as
aggregate functions (build-agg.sh) - I guess the latter is much easier
to play with. The installation is described on my blog, along with
two simple benchmarks:

http://www.fuzzy.cz/en/articles/aggregate-functions-for-distinct-estimation/

Feel free to play with it with different data, and if you notice
something strange just let me know.

Several comments regarding the implemented estimators:

1) PCSA, Probabilistic Counting (both described in the 1985 paper)

   - quite good performance, especially with respect to the memory
   requirements (95 and 495 B)

2) Adaptive Sampling (1990), Self-Learning Bitmap (2009)

   - very different algorithms, almost the same performance (especially
   with large number of distinct values)

   - my personal favourites, as the memory requirements are still very
   reasonable (compared to the Bloom Counter)

3) Rough Estimator (2010)

   - this algorithm is described as an optimal algorithm (not
   exactly, it's a building block of the optimal algorithm), but the
   results are not as good as with (1) and (2)

   - one of the problems is it needs k-wise independent hash functions,
   and the MD5 hashing scheme used with the other algorithms probably
   does not conform to this condition :-(

    - I've tried to implement such hash functions on my own (using a prime
    field and polynomials in modular arithmetic), but the results were
    quite bad - if you know how to get such hash functions, let me know.
    A sketch of the construction I had in mind follows after this list.

   - this algorithm is much more complex than the other algorithms, so
   if someone could do a review, that would be nice (maybe there is a
   bug resulting in the big estimate error)

4) Bloom Counter

    - basically a Bloom filter + a counter (incremented when a new
    element is added to the filter)

    - due to the design (no false negatives, possible false positives)
    it's significantly biased (gives underestimates), but I think that
    can be easily fixed with a coefficient (depending on the number of
    items / false positive error rate)

    - needs significantly more memory to get similar performance as the
    other algorithms (a Bloom counter that can track 1 million distinct
    values needs about 1.2MB of memory, while an S-Bitmap that tracks 10
    million items needs just 65kB)

   - so unless we can use the special feature of Bloom Filter (i.e. the
   ability to detect items that are in the data), this is a waste of
   memory

   I know Google uses Bloom filters in BigTable to limit I/O, i.e.
   before doing the I/O they ask the Bloom filter if there are such
   data - if the answer is no, they don't have to do the I/O (false
   negatives not allowed), if the answer is yes they do the I/O.

   Not sure if we can / want to do something like this in PostgreSQL,
   as it significantly depends on the workload (and in most cases we
   know the data are there, so we have to do the I/O anyway). But I
   guess it might be a good idea for a contrib module (to maintain the
   Bloom filter on your own and do the decision on application). The
   tricky part here is to maintain the bloom filter up to date.
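
Regarding the k-wise independent hash functions mentioned in (3), this is
the construction I've been playing with - a random polynomial of degree k-1
over a prime field. A rough sketch only; the names are made up, and it
assumes the input values are already reduced into the field:

    /* Sketch of a k-wise independent hash family: evaluate a random
     * polynomial of degree k-1 over the prime field Z_p, p = 2^31 - 1.
     * With p < 2^31 the intermediate products fit into uint64_t. */
    #include <stdint.h>
    #include <stdlib.h>

    #define KWISE_P 2147483647UL           /* 2^31 - 1, a Mersenne prime */

    typedef struct
    {
        int       k;                       /* k coefficients => k-wise independence */
        uint64_t *coef;                    /* random coefficients in [0, p) */
    } kwise_hash;

    static void
    kwise_init(kwise_hash *h, int k)
    {
        h->k = k;
        h->coef = malloc(k * sizeof(uint64_t));
        for (int i = 0; i < k; i++)
            h->coef[i] = (uint64_t) random() % KWISE_P;
    }

    /* Horner evaluation of the polynomial at x, mapped to nbuckets buckets;
     * x is assumed to be already reduced modulo p. */
    static uint32_t
    kwise_hash_value(const kwise_hash *h, uint32_t x, uint32_t nbuckets)
    {
        uint64_t acc = 0;
        uint64_t xr = x % KWISE_P;

        for (int i = h->k - 1; i >= 0; i--)
            acc = (acc * xr + h->coef[i]) % KWISE_P;

        return (uint32_t) (acc % nbuckets);
    }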

I'll try to fix the optimal estimator (not sure how to do that right
now), and I'll investigate the 'relation forks' proposed as a
way to store the estimator.

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2011-01-10 Thread tv
 On Fri, 2011-01-07 at 12:32 +0100, t...@fuzzy.cz wrote:
 the problem is you will eventually need to drop the results and rebuild
 it, as the algorithms do not handle deletes (ok, Florian mentioned an
 algorithm L_0 described in one of the papers, but I'm not sure we can
 use
 it).

 Yes, but even then you can start with much better cards if you already
 have an estimate of what it looks like, based on the fact that you did
 continuous updating of it. For example you'll have a pretty good
 estimate of the bounds of the number of distinct values, while if you
 really start from scratch you have nothing to start with but assume that
 you must cope with the complete range between 'all values are distinct'
 and 'there's only a few of them'.

Sure, using the previous estimate is a good idea. I just wanted to point
out there is no reasonable way to handle deletes, so you have to drop
the stats and rebuild them from scratch.

The biggest problem is not choosing reasonable parameters (some of the
estimators can handle a few billion distinct values with something like
128kB of memory and less than 5% error). The really serious concern is the I/O
generated by rebuilding the stats.

 Another thing I'm not sure about is where to store those intermediate
 stats (used to get the current estimate, updated incrementally).

 The forks implementation proposed in other responses is probably the
 best idea if usable. It will also solve you the problem of memory
 limitations, at the expense of more resources used if the structure
 grows too big, but it will still be probably fast enough if it stays
 small and in cache.

Hm, the forks seem to be an interesting option. It's probably much better
than storing that directly in the memory (not a good idea if there is a
lot of data). And you don't really need the data to get the estimate, it's
needed just when updating the stats.

So the last thing I'm not sure about is how to store the changed rows, so that
the update process can get a list of new values. Someone already offered
LISTEN/NOTIFY, but I guess we could just write the ctids into a file
(maybe another fork)?

regards
Tomas




Re: [HACKERS] estimating # of distinct values

2011-01-10 Thread Csaba Nagy
On Fri, 2011-01-07 at 12:32 +0100, t...@fuzzy.cz wrote:
 the problem is you will eventually need to drop the results and rebuild
 it, as the algorithms do not handle deletes (ok, Florian mentioned an
 algorithm L_0 described in one of the papers, but I'm not sure we can use
 it).

Yes, but even then you can start with much better cards if you already
have an estimate of what it looks like, based on the fact that you did
continuous updating of it. For example you'll have a pretty good
estimate of the bounds of the number of distinct values, while if you
really start from scratch you have nothing to start with but assume that
you must cope with the complete range between 'all values are distinct'
and 'there's only a few of them'.

 I'm not sure a constantly running background process is a good idea. I'd
 prefer storing an info about the modified tuples somewhere, and starting
 analyze only when a given threshold is reached. I'm not sure how to do
 that, though.
 
 Another thing I'm not sure about is where to store those intermediate
 stats (used to get the current estimate, updated incrementally).

The forks implementation proposed in other responses is probably the
best idea if usable. It will also solve you the problem of memory
limitations, at the expense of more resources used if the structure
grows too big, but it will still be probably fast enough if it stays
small and in cache.

Cheers,
Csaba.





Re: [HACKERS] estimating # of distinct values

2011-01-09 Thread Jim Nasby
 A resource fork? Not sure what you mean, could you describe it in more
 detail?

Ooops, resource forks are a filesystem thing; we call them relation forks. 
From src/backend/storage/smgr/README:

Relation Forks
==============

Since 8.4, a single smgr relation can be comprised of multiple physical
files, called relation forks. This allows storing additional metadata like
Free Space information in additional forks, which can be grown and truncated
independently of the main data file, while still treating it all as a single
physical relation in system catalogs.

It is assumed that the main fork, fork number 0 or MAIN_FORKNUM, always
exists. Fork numbers are assigned in src/include/storage/relfilenode.h.
Functions in smgr.c and md.c take an extra fork number argument, in addition
to relfilenode and block number, to identify which relation fork you want to
access. Since most code wants to access the main fork, a shortcut version of
ReadBuffer that accesses MAIN_FORKNUM is provided in the buffer manager for
convenience.
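
In other words, fork-aware access is just the normal buffer-manager call
with an extra argument; for example (illustration only, not from the README):

    /* Illustration: reading block 0 of a relation's free-space-map fork
     * through the regular buffer manager API. */
    #include "postgres.h"
    #include "storage/bufmgr.h"

    static void
    read_first_fsm_block(Relation rel)
    {
        Buffer  buf;

        buf = ReadBufferExtended(rel, FSM_FORKNUM, 0, RBM_NORMAL, NULL);
        /* ... inspect the page via BufferGetPage(buf) ... */
        ReleaseBuffer(buf);
    }
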
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] estimating # of distinct values

2011-01-07 Thread Csaba Nagy
On Thu, 2010-12-30 at 21:02 -0500, Tom Lane wrote:
 How is an incremental ANALYZE going to work at all?

How about a kind of continuous analyze ?

Instead of analyzing just once and then drop the intermediate results,
keep them on disk for all tables and then piggyback the background
writer (or have a dedicated process if that's not algorithmically
feasible) and before writing out stuff update the statistics based on
the values found in modified buffers. Probably it could take a random
sample of buffers to minimize overhead, but if it is done by a
background thread the overhead could be minimal anyway on multi-core
systems.

Not sure this makes sense at all, but if yes it would deliver the most
up to date statistics you can think of.

Cheers,
Csaba.





Re: [HACKERS] estimating # of distinct values

2011-01-07 Thread tv
 On Thu, 2010-12-30 at 21:02 -0500, Tom Lane wrote:
 How is an incremental ANALYZE going to work at all?

 How about a kind of continuous analyze ?

 Instead of analyzing just once and then drop the intermediate results,
 keep them on disk for all tables and then piggyback the background
 writer (or have a dedicated process if that's not algorithmically
 feasible) and before writing out stuff update the statistics based on
 the values found in modified buffers. Probably it could take a random
 sample of buffers to minimize overhead, but if it is done by a
 background thread the overhead could be minimal anyway on multi-core
 systems.

Hi,

the problem is you will eventually need to drop the results and rebuild
it, as the algorithms do not handle deletes (ok, Florian mentioned an
algorithm L_0 described in one of the papers, but I'm not sure we can use
it).

I'm not sure a constantly running background process is a good idea. I'd
prefer storing an info about the modified tuples somewhere, and starting
analyze only when a given threshold is reached. I'm not sure how to do
that, though.

Another thing I'm not sure about is where to store those intermediate
stats (used to get the current estimate, updated incrementally). I was
thinking about pg_stats but I'm not sure it's the right place - depending
on the algorithm, this may be a few kilobytes up to several megabytes. And
it's not needed except when updating it. Any ideas?

regards
Tomas




Re: [HACKERS] estimating # of distinct values

2011-01-07 Thread Jim Nasby
On Jan 7, 2011, at 5:32 AM, t...@fuzzy.cz wrote:
 Another thing I'm not sure about is where to store those intermediate
 stats (used to get the current estimate, updated incrementally). I was
 thinking about pg_stats but I'm not sure it's the right place - depending
 on the algorithm, this may be a few kilobytes up to several megabytes. And
 it's not needed except when updating it. Any ideas?

If you're using it essentially as a queue, maybe a resource fork makes sense?
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] estimating # of distinct values

2011-01-07 Thread Tomas Vondra
Dne 7.1.2011 20:56, Jim Nasby napsal(a):
 On Jan 7, 2011, at 5:32 AM, t...@fuzzy.cz wrote:
 Another thing I'm not sure about is where to store those intermediate
 stats (used to get the current estimate, updated incrementally). I was
 thinking about pg_stats but I'm not sure it's the right place - depending
 on the algorithm, this may be a few kilobytes up to several megabytes. And
 it's not needed except when updating it. Any ideas?
 
 If you're using it essentially as a queue, maybe a resource fork makes sense?

A resource fork? Not sure what you mean, could you describe it in more
detail?

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2010-12-31 Thread Alvaro Herrera
Excerpts from Tom Lane's message of jue dic 30 23:02:04 -0300 2010:
 Alvaro Herrera alvhe...@commandprompt.com writes:
  I was thinking that we could have two different ANALYZE modes, one
  full and one incremental; autovacuum could be modified to use one or
  the other depending on how many changes there are (of course, the user
  could request one or the other, too; not sure what should be the default
  behavior).
 
 How is an incremental ANALYZE going to work at all?  It has no way to
 find out the recent changes in the table, for *either* inserts or
 deletes.  Unless you want to seqscan the whole table looking for tuples
 with xmin later than something-or-other ... which more or less defeats
 the purpose.

Yeah, I was thinking that this incremental ANALYZE would be the stream
in the stream-based estimator but evidently it doesn't work that way.
The stream that needs to be passed to the estimator consists of new
tuples as they are being inserted into the table, so this would need to
be done by the inserter process ... or it'd need to transmit the CTIDs
for someone else to stream them ... not an easy thing, in itself.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] estimating # of distinct values

2010-12-31 Thread Jim Nasby
On Dec 31, 2010, at 7:34 AM, Alvaro Herrera wrote:
 Excerpts from Tom Lane's message of jue dic 30 23:02:04 -0300 2010:
 Alvaro Herrera alvhe...@commandprompt.com writes:
 I was thinking that we could have two different ANALYZE modes, one
 full and one incremental; autovacuum could be modified to use one or
 the other depending on how many changes there are (of course, the user
 could request one or the other, too; not sure what should be the default
 behavior).
 
 How is an incremental ANALYZE going to work at all?  It has no way to
 find out the recent changes in the table, for *either* inserts or
 deletes.  Unless you want to seqscan the whole table looking for tuples
 with xmin later than something-or-other ... which more or less defeats
 the purpose.
 
 Yeah, I was thinking that this incremental ANALYZE would be the stream
 in the stream-based estimator but evidently it doesn't work that way.
 The stream that needs to be passed to the estimator consists of new
 tuples as they are being inserted into the table, so this would need to
 be done by the inserter process ... or it'd need to transmit the CTIDs
 for someone else to stream them ... not an easy thing, in itself.

Perhaps listen/notify could be used for this, now that it allows passing a 
payload.

BTW, if we reduce the frequency at which full scans of large tables are needed 
then presumably the cost of the scans could be largely ignored. If we don't 
need to scan frequently then we shouldn't care very much about how long a scan 
takes, which means we could throttle it heavily. Presumably even a heavily used 
system can spare 500kB/s of IO to perform background scanning...
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] estimating # of distinct values

2010-12-30 Thread Florian Pflug
On Dec27, 2010, at 23:49 , Kevin Grittner wrote:
 Robert Haas robertmh...@gmail.com wrote:
 
 With respect to (b), I think I'd need to see a much more detailed
 design for how you intend to make this work.  Off the top of my
 head there seems to be some pretty serious feasibility problems.
 
 I had one random thought on that -- it seemed like a large concern
 was that there would need to be at least an occasional scan of the
 entire table to rebuild the distinct value information.

I believe we could actually avoid that.

First, the paper An Optimal Algorithm for the Distinct Elements Problem
actually contains an algorithm with *does* handle deletes - it's called
L_0 estimate there.

Second, as Tomas pointed out, the stream-based estimator is essentially a
simplified version of a bloom filter. It starts out with a field of
N zero bits, and sets K of them to 1 for each value v in the stream.
Which bits are set to 1 depends on some hash function(s) H_i(v). It's
then easy to compute how many 1-bits you'd expect to find in the bit
field after seeing D distinct values, and by reversing that you can
estimate D from the number of 1-bits in the bit field.
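
For illustration, the reverse step is essentially linear counting generalized
to K hash functions: after D distinct values one expects about N*exp(-K*D/N)
zero bits, so D can be estimated from the observed number of zero bits Z as
D ~ -(N/K)*ln(Z/N). A minimal sketch (not meant as an actual patch):

    /* Estimate the number of distinct values from the bit field described
     * above: N bits, K bits set per value, Z zero bits observed. */
    #include <math.h>
    #include <stdint.h>

    static double
    estimate_distinct(const uint8_t *field, long nbits, int k)
    {
        long zero_bits = 0;

        for (long i = 0; i < nbits; i++)
            if (!(field[i / 8] & (1 << (i % 8))))
                zero_bits++;

        if (zero_bits == 0)
            return -1;                  /* field saturated, estimate unusable */

        /* E[zero bits] ~ N * exp(-K*D/N)  =>  D ~ -(N/K) * ln(Z/N) */
        return -((double) nbits / k) * log((double) zero_bits / nbits);
    }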

To avoid having to rescan large tables, instead of storing one such
bit field, we'd store one per B pages of data. We'd then only need
to scan a range of B pages around every updated or deleted tuple,
and could afterwards compute a new global estimate of D by combining
the individual bit fields with bitwise OR.

Since the need to regularly VACUUM tables hit by updates or deletes
won't go away any time soon, we could piggy-back the bit field
rebuilding onto VACUUM to avoid a second scan.

A good value for B would probably be around
1000*size of bitfield/page size. If the bitfield needs ~100k, that'd
make B ~= 12000 pages ~= 100MB.

best regards,
Florian Pflug




Re: [HACKERS] estimating # of distinct values

2010-12-30 Thread Tomas Vondra
Dne 30.12.2010 15:43, Florian Pflug napsal(a):
 On Dec27, 2010, at 23:49 , Kevin Grittner wrote:
 Robert Haas robertmh...@gmail.com wrote:

 With respect to (b), I think I'd need to see a much more detailed
 design for how you intend to make this work.  Off the top of my
 head there seems to be some pretty serious feasibility problems.

 I had one random thought on that -- it seemed like a large concern
 was that there would need to be at least an occasional scan of the
 entire table to rebuild the distinct value information.
 
 I believe we could actually avoid that.
 
 First, the paper An Optimal Algorithm for the Distinct Elements Problem
 actually contains an algorithm with *does* handle deletes - it's called
 L_0 estimate there.

Hmmm, that's interesting. I know there's a part about L_0 estimation,
but that's about estimating Hamming norm of a vector - so I've ignored
it as I thought we can't use it to estimate number of distinct values.
But if it really handles deletions and if we can use it, then it's
really interesting.

 Second, as Tomas pointed out, the stream-based estimator is essentially a
 simplified version of a bloom filter. It starts out with a field of
 N zero bits, and sets K of them to 1 for each value v in the stream.
 Which bits are set to 1 depends on some hash function(s) H_i(v). It's
 then easy to compute how many 1-bits you'd expect to find in the bit
 field after seeing D distinct values, and by reversing that you can
 estimate D from the number of 1-bits in the bit field.

No, I haven't said the stream-based estimators are simplified versions
of a Bloom filter. I said the approach is very similar - all the
algorithms use bitmaps and hash functions, but the algorithms (Bloom
filter vs. probabilistic counting and adaptive sampling) are very different.

The Bloom filter is much more straightforward. The other algorithms are
much more sophisticated which allows to use less space.

 To avoid having to rescan large tables, instead of storing one such
 bit field, we'd store one per B pages of data. We'd then only need
 to scan a range of B pages around every updated or deleted tuple,
 and could afterwards compute a new global estimate of D by combining
 the individual bit fields with bitwise OR.

I don't think this could help.

1) This works just with the Bloom filters, not with the other
   algorithms (you can't combine the segments using bitmap OR).

2) With heavily modified tables the updates are usually 'spread'
   through the whole table, so you'll have to rebuild all the
   segments anyway.

 Since the need to regularly VACUUM tables hit by updates or deletes
 won't go away any time soon, we could piggy-back the bit field
 rebuilding onto VACUUM to avoid a second scan.

Well, I guess it's a bit more complicated. First of all, there's a local
VACUUM when doing HOT updates. Second, you need to handle inserts too
(what if the table just grows?).

But I'm not a VACUUM expert, so maybe I'm wrong and this is the right
place to handle rebuilds of distinct stats.

I was thinking about something else - we could 'attach' the rebuild to
an actual seq scan if the amount of changes reaches some threshold
(since the last rebuild). Only if the amount of changes reaches a
higher threshold would we rebuild the stats on our own.

Something like

IF (# of updates * deletes > 5%)  THEN wait for seq scan
IF (# of updates * deletes > 10%) THEN rebuild the stats

I've found a nice ppt describing how Oracle does this:

   http://www.oraclegeek.net/downloads/One_Pass_Distinct_Sampling.ppt

and there's PDF version

   http://www.oraclegeek.net/downloads/OnePassDistinctSampling.pdf

According to this, Oracle is using the probabilistic counting approach
(see slide 26). And once they find out there were too many changes in the
table, they rebuild the whole thing.

I'm not saying we should do exactly the same, just that this might be a
good direction.

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2010-12-30 Thread Alvaro Herrera
Excerpts from Tomas Vondra's message of jue dic 30 16:38:03 -0300 2010:

  Since the need to regularly VACUUM tables hit by updates or deletes
  won't go away any time soon, we could piggy-back the bit field
  rebuilding onto VACUUM to avoid a second scan.
 
 Well, I guess it's a bit more complicated. First of all, there's a local
 VACUUM when doing HOT updates. Second, you need to handle inserts too
 (what if the table just grows?).
 
 But I'm not a VACUUM expert, so maybe I'm wrong and this is the right
 place to handle rebuilds of distinct stats.

I was thinking that we could have two different ANALYZE modes, one
full and one incremental; autovacuum could be modified to use one or
the other depending on how many changes there are (of course, the user
could request one or the other, too; not sure what should be the default
behavior).  So the incremental one wouldn't worry about deletes, only
inserts, and could be called very frequently.  The other one would
trigger a full table scan (or nearly so) to produce a better estimate in
the face of many deletions.

I haven't followed this discussion closely so I'm not sure that this
would be workable.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] estimating # of distinct values

2010-12-30 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 I was thinking that we could have two different ANALYZE modes, one
 full and one incremental; autovacuum could be modified to use one or
 the other depending on how many changes there are (of course, the user
 could request one or the other, too; not sure what should be the default
 behavior).

How is an incremental ANALYZE going to work at all?  It has no way to
find out the recent changes in the table, for *either* inserts or
deletes.  Unless you want to seqscan the whole table looking for tuples
with xmin later than something-or-other ... which more or less defeats
the purpose.

regards, tom lane



Re: [HACKERS] estimating # of distinct values

2010-12-29 Thread Josh Berkus

 Well, but that's not 7%, that's 7x! And the theorem says 'greater or equal'
 so this is actually the minimum - you can get a much bigger difference
 with lower probability. So you can easily get an estimate that is a few
 orders off.

FWIW, based on query performance, estimates which are up to 5X off are
tolerable, and anything within 3X is considered accurate.  Above 5X
the probability of bad query plans becomes problematically high.

Of course, if you're doing cross-column stats, the accuracy of each
individual column becomes critical since estimation error could be
combinational in the worst case (i.e. if colA is 3X and colB is 0.3X
then colA-colB will be 9X off).

Anyway, I look forward to your experiments with stream-based estimators.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] estimating # of distinct values

2010-12-28 Thread Robert Haas
On Tue, Dec 28, 2010 at 1:39 AM, Josh Berkus j...@agliodbs.com wrote:
 While I don't want to discourage you from working on stream-based
 estimators ... I'd love to see you implement a proof-of-concept for
 PostgreSQL, and test it ... the above is a non-argument.  It requires us
 to accept that sample-based estimates cannot ever be made to work,
 simply because you say so.

This argument has been made on this mailing list and on
pgsql-performance many times, so I have been assuming that this is
settled mathematics.  Admittedly, I don't have a citation for this...

 I would agree that it's impossible to get a decent estimate of
 n-distinct from a 1% sample.  But there's a huge difference between 5%
 or 10% and a majority of the table.

Considering the way that disks work, it's not that huge.  If the table
is small enough to fit in memory conveniently, it probably wouldn't be
unacceptably costly even to read the whole thing.  But if it's a
terabyte on disk, reading every tenth block is probably going to carry
most of the I/O cost of reading every block.  Even reading every
twentieth or hundredth block is going to be significantly more
expensive than what we do now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] estimating # of distinct values

2010-12-28 Thread tv

 The simple truth is

 1) sampling-based estimators are a dead-end

 The Charikar and Chaudhuri paper does not, in fact, say that it is
 impossible to improve sampling-based estimators as you claim it does. In
 fact, the authors offer several ways to improve sampling-based
 estimators.  Further, 2000 was hardly the end of sampling-estimation
 paper publication; there are later papers with newer ideas.

Well, the paper states that there is a lower bound on the possible error
of a sampling-based estimator, depending on the sample size. The actual
inequality is

   error(d) >= sqrt( (n-r)/(2r) * 1/q )

where error(d) is the ratio error (d - estimated number of distinct values,
D - actual number of distinct values)

   error(d) = max{ D/d, d/D }

And all this is with probability q >= e^{-1000} (you can choose this).

Say you have a table with 1.000.000 rows and you use a sample of 1.000
rows to do an estimate. In that case you get

  error(d) = 99  with   q = 0.05
  error(d) = 70  with   q = 0.1
  error(d) = 31  with   q = 0.5

if you scan 10% of the table, you get this

  error(d) = 9.5 with   q = 0.05
  error(d) = 6.7 with   q = 0.1
  error(d) = 3   with   q = 0.5

So even with 10% of the table, there's a 10% probability to get an
estimate that's 7x overestimated or underestimated. With lower probability
the interval is much wider.
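
(If you want to check the arithmetic, the numbers above come straight from
the formula; a trivial snippet that reproduces them:)

    /* Reproduces the lower-bound values quoted above. */
    #include <math.h>
    #include <stdio.h>

    static double
    error_bound(double n, double r, double q)
    {
        return sqrt((n - r) / (2.0 * r) * (1.0 / q));
    }

    int
    main(void)
    {
        double n = 1000000.0;
        double q[] = {0.05, 0.1, 0.5};

        for (int i = 0; i < 3; i++)
            printf("r = 1000:    q = %.2f  error >= %.1f\n", q[i],
                   error_bound(n, 1000.0, q[i]));
        for (int i = 0; i < 3; i++)
            printf("r = 100000:  q = %.2f  error >= %.1f\n", q[i],
                   error_bound(n, 100000.0, q[i]));
        return 0;
    }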

 For example, I still think we could tremendously improve our current
 sampling-based estimator without increasing I/O by moving to block-based
 estimation*.  The accuracy statistics for block-based samples of 5% of
 the table look quite good.

Well, that's certainly possible. But you can only achieve the error(d)
lower boundary consistently (for all datasets), you can't do better. And
they've already presented an estimator that does exactly this (called AE -
Adaptive Estimator in the paper).

 I would agree that it's impossible to get a decent estimate of
 n-distinct from a 1% sample.  But there's a huge difference between 5%
 or 10% and a majority of the table.

Sure we can do better. But there's a limit we can't cross no matter what
estimator we choose and how large the sample will be.

I've seen several post-2000 papers on sample-based estimators, but those
are mostly hybrid estimators, i.e. estimators composed of several simple
estimators. So in the end it's pretty complicated, you need to gather a
lot of different stats, and still you can't get better error than the
lower bound :-(

So yes, we can improve the current estimator (making it more complex), but
it does not solve the problem actually.

 Again, don't let this discourage you from attempting to write a
 stream-based estimator.  But do realize that you'll need to *prove* its
 superiority, head-to-head, against sampling-based estimators.

 [* http://www.jstor.org/pss/1391058 (unfortunately, no longer
 public-access)]

It's available here http://www.stat.washington.edu/research/reports/1990s/

But it does not present an estimator contradicting the Charikar and
Chaudhuri paper - it's still 'just' a sample-based estimator, although they
draw the sample at the block level. But yes, that's a good idea and I've
already mentioned it in the cross-column stats thread I think.

The question is if a sample obtained in this way will be as good as the
current samples. This way you could get quite separate 'clusters' of
values, one cluster for each block, although in reality the values are
uniformly distributed. And the resulting histogram would be crippled by
this I guess too.

But if you know about interesting papers on sample-based estimators
(especially post-2000), let me know. I've searched for them, but those I
found were not very interesting IMHO.

Tomas




Re: [HACKERS] estimating # of distinct values

2010-12-28 Thread Kevin Grittner
t...@fuzzy.cz wrote:
 
 So even with 10% of the table, there's a 10% probability to get an
 estimate that's 7x overestimated or underestimated. With lower
 probability the interval is much wider.
 
Hmmm...  Currently I generally feel I'm doing OK when the estimated
rows for a step are in the right order of magnitude -- a 7% error
would be a big improvement in most cases.  Let's not lose track of
the fact that these estimates are useful even when they are not
dead-on accurate.
 
-Kevin



Re: [HACKERS] estimating # of distinct values

2010-12-28 Thread tv
 t...@fuzzy.cz wrote:

 So even with 10% of the table, there's a 10% probability to get an
 estimate that's 7x overestimated or underestimated. With lower
 probability the interval is much wider.

 Hmmm...  Currently I generally feel I'm doing OK when the estimated
 rows for a step are in the right order of magnitude -- a 7% error
 would be a big improvement in most cases.  Let's not lose track of
 the fact that these estimates are useful even when they are not
 dead-on accurate.

Well, but that's not 7%, that's 7x! And the theorem says 'greater or equal'
so this is actually the minimum - you can get a much bigger difference
with lower probability. So you can easily get an estimate that is a few
orders off.

Anyway I really don't want precise values, just a reasonable estimate. As
I said, we could use the AE estimate they proposed in the paper. It has
the nice feature that it actually reaches the low boundary (thus the
inequality changes to equality). The downside is that there are estimators
with better behavior on some datasets.

Tomas




Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Robert Haas
2010/12/27 Tomas Vondra t...@fuzzy.cz:
   But even though these disadvantages, there really is no other
   way to enhance the estimates. I don't think this should be a
   default behavior - just as in case of cross-column stats this should
   be optional when the current estimator does not work well.

This is going to be a lot of work to implement, so before you do it,
we should try to reach a consensus that (a) it's part of an overall
strategy that the community generally supports and (b) we have
consensus on the design for this part.

With respect to (a), I have to admit I've found the discussion on
cross-column stats to be quite difficult to follow.  I'd like to see a
very simple description of exactly what information we're going to
store, under what circumstances we'll store it, and how we'll use it
to compute selectivity estimates.

With respect to (b), I think I'd need to see a much more detailed
design for how you intend to make this work.  Off the top of my head
there seems to be some pretty serious feasibility problems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 With respect to (b), I think I'd need to see a much more detailed
 design for how you intend to make this work.  Off the top of my
 head there seems to be some pretty serious feasibility problems.
 
I had one random thought on that -- it seemed like a large concern
was that there would need to be at least an occasional scan of the
entire table to rebuild the distinct value information.  Don't we
already require an occasional scan of the entire table for freezing
transaction IDs?  Could this be part of any vacuum of the entire
table?
 
-Kevin



Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Robert Haas robertmh...@gmail.com wrote:
 With respect to (b), I think I'd need to see a much more detailed
 design for how you intend to make this work.  Off the top of my
 head there seems to be some pretty serious feasibility problems.
 
 I had one random thought on that -- it seemed like a large concern
 was that there would need to be at least an occasional scan of the
 entire table to rebuild the distinct value information.  Don't we
 already require an occasional scan of the entire table for freezing
 transaction IDs?  Could this be part of any vacuum of the entire
 table?

Well, first, those scans occur only once every few hundred million
transactions, which is not likely a suitable timescale for maintaining
statistics.  And second, we keep on having discussions about rejiggering
the whole tuple-freezing strategy.  Even if piggybacking on those scans
looked useful, it'd be unwise to assume it'll continue to work the same
way it does now.

regards, tom lane



Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Tomas Vondra
Dne 27.12.2010 22:46, Robert Haas napsal(a):
 2010/12/27 Tomas Vondra t...@fuzzy.cz:
   But even though these disadvantages, there really is no other
   way to enhance the estimates. I don't think this should be a
   default behavior - just as in case of cross-column stats this should
   be optional when the current estimator does not work well.
 
 This is going to be a lot of work to implement, so before you do it,
 we should try to reach a consensus that (a) it's part of an overall
 strategy that the community generally supports and (b) we have
 consensus on the design for this part.

Yes, it is going to be a lot of work. But I don't think we need a
consensus of the whole community prior to building something that works.
I plan to build a very simple prototype soon, so let's talk about this
later.

I've started these two threads mainly as a 'brainstorming' - to present
what I've learned from the various papers, propose a possible solution,
highlight issues I see, and get ideas on how to solve them.

 With respect to (a), I have to admit I've found the discussion on
 cross-column stats to be quite difficult to follow.  I'd like to see a
 very simple description of exactly what information we're going to
 store, under what circumstances we'll store it, and how we'll use it
 to compute selectivity estimates.

Yes, I know it was difficult to follow that discussion. That's why I
created a 'summary page' on wiki

   http://wiki.postgresql.org/wiki/Cross_Columns_Stats

Generally we need to gather two types of stats:

  (a) multi-dimensional histogram - this can be done about the same way
  we create single-dimensional histograms, that's not a big problem
  (although more data might be necessary to get accurate histograms)

  (b) # of distinct values - this is the tricky part, as described in
  this problem

 With respect to (b), I think I'd need to see a much more detailed
 design for how you intend to make this work.  Off the top of my head
 there seems to be some pretty serious feasibility problems.

Well, it's not going to be easy, that's for sure. My big plan is
something like this:

(a) Build a simple 'contrib-like' module that allows manual collection
of stats and computing of estimates (not integrated with the
optimizer in any way).

This should help us better understand what stats we need, etc.

(b) Minimal integration with the core - still manual collection of
stats, the optimizer uses the stats if available.

(c) Enhancements - automatic updates of the stats, etc.

And all this should be implemented so that you don't have to pay unless
you want to use the new stats.

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Tomas Vondra
Dne 28.12.2010 00:04, Kevin Grittner napsal(a):
 Tom Lane t...@sss.pgh.pa.us wrote:
  
 Well, first, those scans occur only once every few hundred million
 transactions, which is not likely a suitable timescale for
 maintaining statistics.
  
 I was assuming that the pass of the entire table was priming for the
 incremental updates described at the start of this thread.  I'm not
 clear on how often the base needs to be updated for the incremental
 updates to keep the numbers close enough.

Well, that really depends on the workload. If you never remove all
occurrences of a given value (e.g. a ZIP code), you don't need to rescan
the table at all.

All you need is to build the stats once and then update them
incrementally - and I believe this could be handled by autovacuum.

 And second, we keep on having discussions about rejiggering
 the whole tuple-freezing strategy.  Even if piggybacking on those
 scans looked useful, it'd be unwise to assume it'll continue to
 work the same way it does now.
  
 Sure, it might need to trigger its own scan in the face of heavy
 deletes anyway, since the original post points out that the
 algorithm handles inserts better than deletes, but as long as we
 currently have some sequential pass of the data, it seemed sane to
 piggyback on it when possible.  And maybe we should be considering
 things like this when we weigh the pros and cons of rejiggering. 
 This issue of correlated values comes up pretty often

Again, there are two types of stats - one of them needs to scan the
whole table (estimate of distinct values), the other one does not
(multi-dimensional histograms).

These two cases are independent - you don't necessarily need both.

Better ndistinct estimates would actually solve some of the current
issues, it's not just a matter of cross-column stats.

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Josh Berkus

 The simple truth is
 
 1) sampling-based estimators are a dead-end

While I don't want to discourage you from working on stream-based
estimators ... I'd love to see you implement a proof-of-concept for
PostgreSQL, and test it ... the above is a non-argument.  It requires us
to accept that sample-based estimates cannot ever be made to work,
simply because you say so.

The Charikar and Chaudhuri paper does not, in fact, say that it is
impossible to improve sampling-based estimators as you claim it does. In
fact, the authors offer several ways to improve sampling-based
estimators.  Further, 2000 was hardly the end of sampling-estimation
paper publication; there are later papers with newer ideas.

For example, I still think we could tremendously improve our current
sampling-based estimator without increasing I/O by moving to block-based
estimation*.  The accuracy statistics for block-based samples of 5% of
the table look quite good.

I would agree that it's impossible to get a decent estimate of
n-distinct from a 1% sample.  But there's a huge difference between 5%
or 10% and a majority of the table.

Again, don't let this discourage you from attempting to write a
stream-based estimator.  But do realize that you'll need to *prove* its
superiority, head-to-head, against sampling-based estimators.

[* http://www.jstor.org/pss/1391058 (unfortunately, no longer
public-access)]

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com
