On 24.12.2010 13:37, Florian Pflug wrote:
On Dec24, 2010, at 11:23 , Nicolas Barbier wrote:
2010/12/24 Florian Pflug f...@phlo.org:
On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
I guess we could use the highest possible value (equal to the number
of tuples) - according to wiki you need about 10 bits per element
with 1% error, i.e. about 10MB of memory for each million of
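The "10 bits per element at 1% error" figure quoted above follows from the standard Bloom-filter sizing formulas; a minimal sketch (illustrative only, not thread or PostgreSQL code) that derives it:

```python
import math

def bloom_params(n_elements, fp_rate):
    """Optimal Bloom filter size and hash count for a target
    false-positive rate, using the standard formulas:
      m = -n * ln(p) / (ln 2)^2   (total bits)
      k = (m / n) * ln 2          (hash functions)
    """
    m_bits = -n_elements * math.log(fp_rate) / (math.log(2) ** 2)
    k = math.ceil((m_bits / n_elements) * math.log(2))
    return m_bits, k

m_bits, k = bloom_params(1_000_000, 0.01)
print(m_bits / 1_000_000)   # ~9.6 bits per element, i.e. "about 10"
print(k)                    # 7 hash functions
```

At 1% false positives this works out to roughly 10 bits per element, matching the wiki figure cited in the quote.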
On 24.12.2010 04:41, Florian Pflug wrote:
The filter size could be derived from the table's statistics target, or
be otherwise user-definable. We could also auto-resize once it gets too
full. But still, that all seems awfully complex :-(
Using a statistics target is a good idea I think. I
On 21.12.2010 16:54, Florian Pflug wrote:
I think that maybe it'd be acceptable to scan a large portion of the
table to estimate dist(A,B), *if* we must only do so once in a while.
But even with
a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all
values in
On 21.12.2010 19:03, Tomas Vondra wrote:
My plan for the near future (a few weeks) is to build a simple 'module'
with the ability to estimate the number of rows for a given condition.
This could actually be quite useful as a stand-alone contrib module, as
the users often ask how to get a
On Dec18, 2010, at 17:59 , Tomas Vondra wrote:
It seems to me you're missing one very important thing - this was not
meant as a new default way to do estimates. It was meant as an option
when the user (DBA, developer, ...) realizes the current solution gives
really bad estimates (due to
On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug f...@phlo.org wrote:
I tried to pick up Robert's idea of quantifying implicativeness -
i.e., finding a number between 0 and 1 that describes how close
(A,B) is to representing a function A -> B.
Actually Heikki's idea...
Observe that
On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug f...@phlo.org wrote:
You might use that to decide if either A->B or B->A looks function-like
enough to use the uniform bayesian approach. Or you might even go
further, and decide *which* bayesian formula to use - the paper you
cited always averages
On Dec21, 2010, at 11:37 , t...@fuzzy.cz wrote:
I doubt there is a way to make this decision with just dist(A), dist(B) and
dist(A,B) values. Well, we could go with a rule
if [dist(A) == dist(A,B)] then [A => B]
but that's very fragile. Think about estimates (we're not going to work
with exact
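The rule above (dist(A) == dist(A,B) implying that A determines B) can be sketched with a tolerance, since both numbers are noisy ndistinct estimates rather than exact counts; the slack factor here is an assumption for illustration, not something from the thread:

```python
def looks_functional(dist_a, dist_ab, slack=0.05):
    """Heuristic test of whether column A functionally determines B.

    If every value of A co-occurs with exactly one value of B, then
    dist(A,B) == dist(A).  Because both inputs are estimates, we
    accept dist(A,B) within a small slack of dist(A) instead of
    requiring exact equality (the fragility noted in the thread).
    """
    return dist_ab <= dist_a * (1.0 + slack)

print(looks_functional(40000, 40100))  # postcode -> city: True
print(looks_functional(100, 40000))    # city -> postcode: False
```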
On Dec21, 2010, at 11:37 , t...@fuzzy.cz wrote:
I doubt there is a way to this decision with just dist(A), dist(B) and
dist(A,B) values. Well, we could go with a rule
if [dist(A) == dist(A,B)] the [A = B]
but that's very fragile. Think about estimates (we're not going to work
with exact
On Dec21, 2010, at 15:51 , t...@fuzzy.cz wrote:
This is the reason why they choose to always combine the values (with
varying weights).
There are no varying weights involved there. What they do is to express
P(A=x,B=y) once as
...
P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
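A minimal sketch of that averaged estimator (variable names assumed, not thread code); when the two conditionals are exact, each term equals the true joint probability, so the average is exact, and with noisy estimates it hedges between the two factorizations:

```python
def joint_estimate(p_b_given_a, p_a, p_a_given_b, p_b):
    """Average of two equivalent factorizations of P(A=x, B=y):
    P(A=x,B=y) = P(B=y|A=x)*P(A=x) = P(A=x|B=y)*P(B=y).
    Each factorization gets a fixed weight of 1/2 - there are no
    varying weights, as the message above points out."""
    return 0.5 * p_b_given_a * p_a + 0.5 * p_a_given_b * p_b

# Consistent inputs: P(A)=0.2, P(B)=0.1, true joint 0.05, so
# P(B|A)=0.25 and P(A|B)=0.5; the average recovers 0.05 exactly.
print(joint_estimate(0.25, 0.2, 0.5, 0.1))
```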
On Dec21, 2010, at 13:25 , t...@fuzzy.cz wrote:
And there's one additional - IMHO very important - requirement. The whole
thing should easily extend to more than two columns. This IF (F(A,B) >
F(B,A)) THEN ... probably is not a good solution regarding this.
For example given 3 columns A,B,C,
On 21.12.2010 16:54, Florian Pflug wrote:
Hmmm, maybe we could give this possibility (to identify two separate
groups of columns) to the user. So instead of 'build stats for (A,B,C)' the
user would say 'build stats for (A,B) and (C)' - this actually represents
a priori knowledge of
On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote:
Robert Haas robertmh...@gmail.com writes:
On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
The proposed solution is based on contingency tables, built for selected
groups of columns (not for each possible group). And the
On Dec18, 2010, at 08:10 , t...@fuzzy.cz wrote:
On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
Well, not really - I haven't done any experiments with it. For two
columns the selectivity equation is
(dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
where A and B are columns, dist(X) is number of distinct values in
column X and
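That two-column estimator can be sketched directly (a hedged illustration, not planner code); note that for perfectly dependent columns, where dist(A) = dist(B) = dist(A,B), it reduces to the single-column selectivity, which is exactly the desired postcode/city behaviour:

```python
def cross_column_sel(sel_a, sel_b, dist_a, dist_b, dist_ab):
    """Two-column selectivity estimate from the cited paper:
    (dist(A)*sel(A) + dist(B)*sel(B)) / (2*dist(A,B))."""
    return (dist_a * sel_a + dist_b * sel_b) / (2.0 * dist_ab)

# Perfectly dependent columns: dist(A)=dist(B)=dist(A,B)=1000, each
# equality clause selects 0.001 of the table.  The combined estimate
# stays 0.001 instead of the independence assumption's 0.000001.
print(cross_column_sel(0.001, 0.001, 1000, 1000, 1000))
```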
On 12.12.2010 15:43, Heikki Linnakangas wrote:
On 12.12.2010 15:17, Martijn van Oosterhout wrote:
On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
Very cool that you're working on this.
+1
Lets talk about one special case - I'll explain how the proposed
solution works, and then I'll explain how to make it more general, what
improvements are possible, what issues are there. Anyway this is by no
means a perfect or
On Fri, Dec 17, 2010 at 12:58 PM, Tomas Vondra t...@fuzzy.cz wrote:
In the end, all they need to compute an estimate is the number of distinct
values for each of the columns (we already have that in pg_stats) and the
number of distinct values for the group of columns in a query. They
really don't
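A toy sketch of the quantities involved (illustrative only; the row representation is assumed): the per-column ndistinct values plus the group ndistinct are all the estimator needs.

```python
def ndistinct(rows, cols):
    """Count distinct value combinations over the given columns,
    where rows is a list of dicts standing in for sampled tuples."""
    return len({tuple(r[c] for c in cols) for r in rows})

rows = [
    {"city": "Prague", "zip": "11000"},
    {"city": "Prague", "zip": "12000"},
    {"city": "Brno",   "zip": "60200"},
]
print(ndistinct(rows, ["city"]))         # 2
print(ndistinct(rows, ["zip"]))          # 3
print(ndistinct(rows, ["city", "zip"]))  # 3 - zip determines city here
```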
Hi,
I've read about 10 papers about estimation this week, and I have gained
some insight. I'm not going to describe each of the papers here, I plan
to set up a wiki page about cross column statistics
http://wiki.postgresql.org/wiki/Cross_Columns_Stats
and I'll put the list of papers and some
On 17.12.2010 19:58, Robert Haas wrote:
I haven't read the paper yet (sorry) but just off the top of my head,
one possible problem here is that our n_distinct estimates aren't
always very accurate, especially for large tables. As we've discussed
before, making them accurate requires
Tomas Vondra t...@fuzzy.cz writes:
AFAIK it will work with reasonably precise estimates, but the point is
you need an estimate of distinct values of the whole group of columns.
So when you want to get an estimate for queries on columns (a,b), you
need the number of distinct value combinations
On 17.12.2010 22:24, Tom Lane wrote:
That seems likely to be even more unreliable than our existing
single-column estimates :-(
regards, tom lane
Well, yes :-(
I guess this is a place where we could use a multi-column index, if it
contains all the interesting
On 2010-12-13 03:28, Robert Haas wrote:
Well, I'm not real familiar with contingency tables, but it seems like
you could end up needing to store a huge amount of data to get any
benefit out of it, in some cases. For example, in the United States,
there are over 40,000 postal codes, and some
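To make the storage concern concrete: a full contingency table needs one cell per value combination, so with the 40,000 US postal codes quoted above and, say, 20,000 city names (an assumed figure for illustration, not from the thread), the table explodes even though few combinations actually occur:

```python
def contingency_cells(dist_a, dist_b):
    """A full contingency table stores one cell per combination of
    distinct values, i.e. dist(A) * dist(B) cells in the worst case."""
    return dist_a * dist_b

# 40,000 postal codes x 20,000 cities (assumed) -> 800 million cells,
# even though only roughly one city per postcode actually occurs.
print(contingency_cells(40_000, 20_000))
```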
Tomas Vondra t...@fuzzy.cz writes:
Well, until this point we've discussed failure cases involving 'AND'
conditions. What about 'OR' conditions? I think the current optimizer
computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
found in backend/optimizer/path/clausesel.c:630).
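The OR formula referenced there (from clausesel.c) is inclusion-exclusion under an independence assumption; a sketch of the calculation (not the actual planner code):

```python
def or_selectivity(s1, s2):
    """P(c1 OR c2) = s1 + s2 - s1*s2, assuming the two clause
    selectivities are independent (inclusion-exclusion)."""
    return s1 + s2 - s1 * s2

# Two independent 10% clauses: 0.1 + 0.1 - 0.01 = 0.19.  For
# correlated clauses the true value can lie anywhere in [0.1, 0.2],
# which is exactly the failure mode being discussed.
print(or_selectivity(0.1, 0.1))
```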
Robert Haas robertmh...@gmail.com writes:
On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
The proposed solution is based on contingency tables, built for selected
groups of columns (not for each possible group). And the contingency
table gives you the ability to estimate the probabilities needed to
compute the selectivity. Or am I missing something?
On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote:
Another quick note: I think that storing the full contingency table is
wasteful since the marginals are already stored in the single column
statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was
looking at this a couple
On 13.12.2010 16:38, Tom Lane wrote:
The reason that this wasn't done years ago is precisely that nobody's
figured out how to do it with a tolerable amount of stats data and a
tolerable amount of processing time (both at ANALYZE time and during
query planning). It's not hard to see what
Tomas,
(a) find out what statistics do we need to collect and how to use it
(b) implement a really stupid inefficient solution
(c) optimize in iterations, i.e. making it faster, consuming less
space etc.
I'll suggest again how to decide *which* columns to cross: whichever
columns
On Mon, Dec 13, 2010 at 5:45 PM, Tomas Vondra t...@fuzzy.cz wrote:
I'll suggest again how to decide *which* columns to cross: whichever
columns are combined in composite indexes. In version 2, allow the DBA
to specify combinations.
I think this is a bit early to discuss this, given the fact
On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
Hi everyone,
one of the sessions I've attended on PgDay last week was Heikki's session
about statistics in PostgreSQL. One of the issues he mentioned (and one
I regularly run into) is the absence of cross-column stats. When the
columns are not independent, this usually results in poor estimates
On 12.12.2010 15:43, Heikki Linnakangas wrote:
The classic failure case has always been: postcodes and city names.
Strongly correlated, but in a way that the computer can't easily see.
Yeah, and that's actually analogous to the example I used in my
presentation.
The way I think of
On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote:
The way I think of that problem is that once you know the postcode, knowing
the city name doesn't add any information. The postcode implies the city
name. So the selectivity for postcode = ? AND city = ? should be the
selectivity of
On 13.12.2010 01:05, Robert Haas wrote:
This is a good idea, but I guess the question is what you do next. If
you know that the applicability is 100%, you can disregard the
restriction clause on the implied column. And if it has no
implicatory power, then you just do what we do now.
P(A|B) = P(A and B) / P(B).
On 13.12.2010 03:00, Robert Haas wrote:
Well, the question is what data you are actually storing. It's
appealing to store a measure of the extent to which a constraint on
column X constrains column Y, because you'd only need to store
O(ncolumns^2) values, which would be reasonably