On 20.01.2011 04:36, Robert Haas wrote:
... Even better, the
code changes would be confined to ANALYZE rather than spread out all
over the system, which has positive implications for robustness and
likelihood of commit.
Keep in mind that the administrator can already override the ndistinct
Hi Tomas,
On Wed, 2011-01-19 at 23:13 +0100, Tomas Vondra wrote:
No, the multi-column statistics do not require constant updating. There
are cases where sampling is perfectly fine, although you may need a
bit larger sample. Generally if you can use a multi-dimensional
histogram, you don't
On 20.1.2011 03:06, Nathan Boley wrote:
And actually it does not depend on ndistinct for the columns only; it
depends on ndistinct estimates for the combination of columns. So
improving the ndistinct estimates for individual columns is just a
necessary first step (and only if it works reasonably
On 20.1.2011 03:36, Robert Haas wrote:
On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra t...@fuzzy.cz wrote:
Regarding the crash scenario - if the commit fails, just throw away the
local estimator copy; it's not needed. I'm not sure how to take care of
the case when commit succeeds and the
On 20.1.2011 09:10, Heikki Linnakangas wrote:
It seems that the suggested multi-column selectivity estimator would be
more sensitive to ndistinct of the individual columns. Is that correct?
How is it biased? If we routinely under-estimate ndistinct of individual
columns, for example, does
On 20.1.2011 11:05, Csaba Nagy wrote:
Hi Tomas,
On Wed, 2011-01-19 at 23:13 +0100, Tomas Vondra wrote:
No, the multi-column statistics do not require constant updating. There
are cases where sampling is perfectly fine, although you may need a
bit larger sample. Generally if you can
On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
Yes, I was aware of this problem (amount of memory consumed with lots of
updates), and I kind of hoped someone would come up with a reasonable
solution.
As far as I can see, periodically sampling some or all of the table is
Robert Haas robertmh...@gmail.com writes:
... I guess I'm just saying I'd think really, really hard
before abandoning the idea of periodic sampling. You have to get an
awful lot of benefit out of those cross-column stats to make it worth
paying a run-time cost for them.
I've been trying to
On 19.1.2011 20:25, Robert Haas wrote:
On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
Yes, I was aware of this problem (amount of memory consumed with lots of
updates), and I kind of hoped someone would come up with a reasonable
solution.
As far as I can see,
On 19.1.2011 20:46, Tom Lane wrote:
Robert Haas robertmh...@gmail.com writes:
... I guess I'm just saying I'd think really, really hard
before abandoning the idea of periodic sampling. You have to get an
awful lot of benefit out of those cross-column stats to make it worth
paying a
On Wed, Jan 19, 2011 at 2:13 PM, Tomas Vondra t...@fuzzy.cz wrote:
On 19.1.2011 20:25, Robert Haas wrote:
On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
Yes, I was aware of this problem (amount of memory consumed with lots of
updates), and I kind of hoped someone would
On 19.1.2011 23:44, Nathan Boley wrote:
1) The distribution of values in a table is rarely pathological, and
usually follows one of several common patterns. (IOW, we have good
heuristics)
Not true. You're missing the goal of this effort - to get an ndistinct
estimate for a combination of
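A worked illustration of the dependent-columns point (the numbers are made
up, purely illustrative): take columns city and zip, where each ZIP code
belongs to exactly one city. With

$$ d_{\text{city}} = 100, \qquad d_{\text{zip}} = 10\,000, $$

an independence assumption would predict
$d_{\text{city,zip}} \approx d_{\text{city}} \cdot d_{\text{zip}} = 10^{6}$,
but since each ZIP lies in exactly one city, the true value is
$d_{\text{city,zip}} = d_{\text{zip}} = 10\,000$ - off by two orders of
magnitude.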
On Jan 19, 2011, at 23:44, Nathan Boley wrote:
If you think about it, it's a bit ridiculous to look at the whole table
*just* to estimate ndistinct - if we go that far why don't we just
store the full distribution and be done with it?
The crucial point that you're missing here is that ndistinct
If you think about it, it's a bit ridiculous to look at the whole table
*just* to estimate ndistinct - if we go that far why don't we just
store the full distribution and be done with it?
- the best you could do is to average the
individual probabilities which gives ... well, 1/ndistinct.
Not true. You're missing the goal of this effort - to get an ndistinct
estimate for a combination of multiple columns, which is usually
pathological in the case of dependent columns.
snip
Again, don't think about a single column (although even in that case
there are known fail cases). Think about
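The arithmetic behind the "1/ndistinct" remark quoted above: if the $d$
distinct values of a column occur with relative frequencies
$p_1, \dots, p_d$, then averaging the individual probabilities gives

$$ \frac{1}{d} \sum_{i=1}^{d} p_i \;=\; \frac{1}{d},
   \qquad \text{since } \sum_{i=1}^{d} p_i = 1, $$

regardless of how skewed the actual distribution is.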
On Jan 20, 2011, at 02:41, Nathan Boley wrote:
If you think about it, it's a bit ridiculous to look at the whole table
*just* to estimate ndistinct - if we go that far why don't we just
store the full distribution and be done with it?
- the best you could do is to average the
individual
On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra t...@fuzzy.cz wrote:
Regarding the crash scenario - if the commit fails, just throw away the
local estimator copy; it's not needed. I'm not sure how to take care of
the case when commit succeeds and the write of the merged estimator
fails, but I
On Wed, Jan 19, 2011 at 9:35 PM, Florian Pflug f...@phlo.org wrote:
Also, in my personal experience this isn't really the area where we have
problems now. The problem cases for me were always queries with a rather
large number of joins (7 to 10 or so) plus rather complex search
conditions. The
On Wed, Jan 19, 2011 at 6:35 PM, Florian Pflug f...@phlo.org wrote:
On Jan 20, 2011, at 02:41, Nathan Boley wrote:
If you think about it, it's a bit ridiculous to look at the whole table
*just* to estimate ndistinct - if we go that far why don't we just
store the full distribution and be done
On Jan 17, 2011, at 6:36 PM, Tomas Vondra wrote:
1) Forks are 'per relation' but the distinct estimators are 'per
column' (or 'per group of columns') so I'm not sure whether the file
should contain all the estimators for the table, or if there should
be one fork for each estimator. The
On Tue, Jan 18, 2011 at 4:53 AM, t...@fuzzy.cz wrote:
So the most important question is how to intercept the new/updated rows,
and where to store them. I think each backend should maintain its own
private list of new records and forward them only in case of commit. Does
that sound
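A minimal sketch of the scheme proposed here, with every name hypothetical
(this is not PostgreSQL code, and error handling is omitted): each backend
appends hashes of new values to a private buffer, merges the batch into
the shared estimator only on commit, and simply throws it away on abort.

    /* Hypothetical per-backend buffering sketch; no PostgreSQL APIs. */
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct PendingList {
        uint32_t *hashes;    /* hashed values from rows this backend added */
        size_t    count;
        size_t    capacity;
    } PendingList;

    /* Record one new value, growing the private buffer as needed. */
    static void pending_add(PendingList *pl, uint32_t hash)
    {
        if (pl->count == pl->capacity) {
            pl->capacity = pl->capacity ? pl->capacity * 2 : 64;
            pl->hashes = realloc(pl->hashes,
                                 pl->capacity * sizeof(uint32_t));
        }
        pl->hashes[pl->count++] = hash;
    }

    /* On commit, forward the whole batch to the shared estimator
     * (imagined here as a callback) and reset the buffer. */
    static void pending_commit(PendingList *pl,
                               void (*merge)(const uint32_t *, size_t))
    {
        merge(pl->hashes, pl->count);
        pl->count = 0;
    }

    /* On abort the local copy is simply discarded, which also covers
     * the crash scenario discussed earlier in the thread. */
    static void pending_abort(PendingList *pl)
    {
        pl->count = 0;
    }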
On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote:
- Forks are very possibly a more efficient way to deal with TOAST than
having separate tables. There's a fair amount of overhead we pay for the
current setup.
That seems like
On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby j...@nasby.net wrote:
On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote:
- Forks are very possibly a more efficient way to deal with TOAST than
having separate tables. There's a fair
Robert Haas robertmh...@gmail.com writes:
On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby j...@nasby.net wrote:
On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote:
- Forks are very possibly a more efficient way to deal with TOAST than
On Jan 18, 2011, at 11:24 AM, Robert Haas wrote:
On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby j...@nasby.net wrote:
On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote:
- Forks are very possibly a more efficient way to deal with TOAST
On Jan 18, 2011, at 11:32 AM, Tom Lane wrote:
Robert Haas robertmh...@gmail.com writes:
On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby j...@nasby.net wrote:
On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote:
- Forks are very possibly
On Tue, Jan 18, 2011 at 12:32 PM, Jim Nasby j...@nasby.net wrote:
How's that different from what vacuum does on a TOAST table now?
TOAST vacuum is currently an entirely separate vacuum. It might run at the
same time as the main table vacuum, but it still has all the work that would
be
On 18.1.2011 18:12, Robert Haas wrote:
On Tue, Jan 18, 2011 at 4:53 AM, t...@fuzzy.cz wrote:
So the most important question is how to intercept the new/updated rows,
and where to store them. I think each backend should maintain its own
private list of new records and forward them only
On 9.1.2011 13:58, Jim Nasby wrote:
A resource fork? Not sure what you mean, could you describe it in more
detail?
Ooops, resource forks are a filesystem thing; we call them relation forks.
From src/backend/storage/smgr/README:
OK, I think using relation forks seems like a good
On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote:
- Forks are very possibly a more efficient way to deal with TOAST than having
separate tables. There's a fair amount of overhead we pay for the current
setup.
That seems like an interesting idea, but I actually don't see why it
Hi,
a short update regarding the ndistinct stream estimators - I've
implemented the estimators described in the papers I've mentioned in my
previous posts. If you want to try it, the sources are available on GitHub
at http://tvondra.github.com/pg_estimator (ignore the page, I have to
update it,
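For anyone who hasn't read the papers: these stream estimators are of the
Flajolet-Martin / probabilistic-counting family. Below is a self-contained
toy version in that style - it is NOT the code in the repository above,
and a real implementation averages many independent sketches to reduce
the (large) variance of a single one.

    /* Toy single-pass ndistinct estimator, Flajolet-Martin style. */
    #include <stdint.h>
    #include <stdio.h>

    /* Knuth multiplicative hash; any reasonable 32-bit hash will do. */
    static uint32_t hash32(uint32_t x) { return x * 2654435761u; }

    /* rho(h) = number of trailing zero bits in h. */
    static int rho(uint32_t h)
    {
        int r = 0;
        if (h == 0)
            return 32;
        while ((h & 1) == 0) {
            h >>= 1;
            r++;
        }
        return r;
    }

    int main(void)
    {
        int R = 0;
        uint32_t i;

        /* A stream of 100000 values containing only 1000 distinct ones. */
        for (i = 0; i < 100000; i++) {
            int r = rho(hash32(i % 1000));
            if (r > R)
                R = r;
        }
        /* E[2^R] is roughly 0.77351 * ndistinct, so invert that. */
        printf("estimated ndistinct: %.0f\n",
               (double) ((uint64_t) 1 << R) / 0.77351);
        return 0;
    }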
On Fri, 2011-01-07 at 12:32 +0100, t...@fuzzy.cz wrote:
the problem is you will eventually need to drop the results and rebuild
it, as the algorithms do not handle deletes (ok, Florian mentioned an
algorithm L_0 described in one of the papers, but I'm not sure we can
use it).
Yes, but even
A resource fork? Not sure what you mean, could you describe it in more
detail?
Ooops, resource forks are a filesystem thing; we call them relation forks.
From src/backend/storage/smgr/README:
Relation Forks
==============
Since 8.4, a single smgr relation can be comprised of multiple
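For reference, the fork numbers defined at the time in
src/include/storage/relfilenode.h look roughly like this (quoted from
memory, so check the header for the authoritative version); a persistent
ndistinct estimator would presumably claim a new member of this enum:

    typedef enum ForkNumber
    {
        InvalidForkNumber = -1,
        MAIN_FORKNUM = 0,        /* main data fork */
        FSM_FORKNUM,             /* free space map */
        VISIBILITYMAP_FORKNUM    /* visibility map */
        /* a hypothetical DISTINCT_FORKNUM would go here */
    } ForkNumber;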
On Thu, 2010-12-30 at 21:02 -0500, Tom Lane wrote:
How is an incremental ANALYZE going to work at all?
How about a kind of continuous analyze ?
Instead of analyzing just once and then drop the intermediate results,
keep them on disk for all tables and then piggyback the background
writer (or
On Jan 7, 2011, at 5:32 AM, t...@fuzzy.cz wrote:
Another thing I'm not sure about is where to store those intermediate
stats (used to get the current estimate, updated incrementally). I was
thinking about pg_stats but I'm not sure it's the right place - depending
on the algorithm, this may be
On 7.1.2011 20:56, Jim Nasby wrote:
On Jan 7, 2011, at 5:32 AM, t...@fuzzy.cz wrote:
Another thing I'm not sure about is where to store those intermediate
stats (used to get the current estimate, updated incrementally). I was
thinking about pg_stats but I'm not sure it's the right place -
Excerpts from Tom Lane's message of Thu Dec 30 23:02:04 -0300 2010:
Alvaro Herrera alvhe...@commandprompt.com writes:
I was thinking that we could have two different ANALYZE modes, one
full and one incremental; autovacuum could be modified to use one or
the other depending on how many
On Dec 31, 2010, at 7:34 AM, Alvaro Herrera wrote:
Excerpts from Tom Lane's message of Thu Dec 30 23:02:04 -0300 2010:
Alvaro Herrera alvhe...@commandprompt.com writes:
I was thinking that we could have two different ANALYZE modes, one
full and one incremental; autovacuum could be modified to
On Dec 27, 2010, at 23:49, Kevin Grittner wrote:
Robert Haas robertmh...@gmail.com wrote:
With respect to (b), I think I'd need to see a much more detailed
design for how you intend to make this work. Off the top of my
head there seem to be some pretty serious feasibility problems.
I
On 30.12.2010 15:43, Florian Pflug wrote:
On Dec 27, 2010, at 23:49, Kevin Grittner wrote:
Robert Haas robertmh...@gmail.com wrote:
With respect to (b), I think I'd need to see a much more detailed
design for how you intend to make this work. Off the top of my
head there seem to be
Excerpts from Tomas Vondra's message of Thu Dec 30 16:38:03 -0300 2010:
Since the need to regularly VACUUM tables hit by updates or deletes
won't go away any time soon, we could piggy-back the bit field
rebuilding onto VACUUM to avoid a second scan.
Well, I guess it's a bit more
Alvaro Herrera alvhe...@commandprompt.com writes:
I was thinking that we could have two different ANALYZE modes, one
full and one incremental; autovacuum could be modified to use one or
the other depending on how many changes there are (of course, the user
could request one or the other, too;
Well, but that's not 7%, that's 7x! And the theorem says 'greater or equal',
so this is actually the minimum - you can get a much bigger difference
with lower probability. So you can easily get an estimate that is a few
orders of magnitude off.
FWIW, based on query performance, estimates which are up to 5X
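The theorem being invoked here is the lower bound of Charikar et al.,
"Towards Estimation Error Guarantees for Distinct Values" (PODS 2000).
Restated from memory, so verify the constant against the paper: any
estimator $\hat{D}$ of the true distinct count $D$ that examines at most
$r$ of the $n$ rows must, on some input, with probability at least
$\delta$, incur ratio error

$$ \operatorname{error}(\hat{D})
   \;=\; \max\!\left(\frac{\hat{D}}{D}, \frac{D}{\hat{D}}\right)
   \;\ge\; \sqrt{\frac{n-r}{2r}\,\ln\frac{1}{\delta}}. $$

Since the bound grows as the sample fraction $r/n$ shrinks, no
sampling-based estimator can guarantee small error on all inputs.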
On Tue, Dec 28, 2010 at 1:39 AM, Josh Berkus j...@agliodbs.com wrote:
While I don't want to discourage you from working on stream-based
estimators ... I'd love to see you implement a proof-of-concept for
PostgreSQL, and test it ... the above is a non-argument. It requires us
to accept that
The simple truth is
1) sampling-based estimators are a dead-end
The Charikar and Chaudhuri paper does not, in fact, say that it is
impossible to improve sampling-based estimators as you claim it does. In
fact, the authors offer several ways to improve sampling-based
estimators. Further,
t...@fuzzy.cz wrote:
So even with 10% of the table, there's a 10% probability of getting an
estimate that's 7x overestimated or underestimated. With lower
probability the interval is much wider.
Hmmm... Currently I generally feel I'm doing OK when the estimated
rows for a step are in the right
Hi everyone,
about two weeks ago I started a thread about cross-column stats. One
of the proposed solutions is based on the number of distinct values, and
the truth is the current distinct estimator has some problems.
I've done some small research - I have read about 20 papers on this, and
I'd
2010/12/27 Tomas Vondra t...@fuzzy.cz:
But even with these disadvantages, there really is no other
way to enhance the estimates. I don't think this should be a
default behavior - just as in the case of cross-column stats, this should
be optional for when the current estimator does not work
Robert Haas robertmh...@gmail.com wrote:
With respect to (b), I think I'd need to see a much more detailed
design for how you intend to make this work. Off the top of my
head there seem to be some pretty serious feasibility problems.
I had one random thought on that -- it seemed like a
Kevin Grittner kevin.gritt...@wicourts.gov writes:
Robert Haas robertmh...@gmail.com wrote:
With respect to (b), I think I'd need to see a much more detailed
design for how you intend to make this work. Off the top of my
head there seem to be some pretty serious feasibility problems.
I
On 27.12.2010 22:46, Robert Haas wrote:
2010/12/27 Tomas Vondra t...@fuzzy.cz:
But even with these disadvantages, there really is no other
way to enhance the estimates. I don't think this should be a
default behavior - just as in the case of cross-column stats, this should
be
On 28.12.2010 00:04, Kevin Grittner wrote:
Tom Lane t...@sss.pgh.pa.us wrote:
Well, first, those scans occur only once every few hundred million
transactions, which is not likely a suitable timescale for
maintaining statistics.
I was assuming that the pass of the entire table was
The simple truth is
1) sampling-based estimators are a dead-end
While I don't want to discourage you from working on stream-based
estimators ... I'd love to see you implement a proof-of-concept for
PostgreSQL, and test it ... the above is a non-argument. It requires us
to accept that