### Re: [HACKERS] multivariate statistics (v19)

```On Sun, Feb 12, 2017 at 10:35:04AM +, Dean Rasheed wrote:
> On 11 February 2017 at 01:17, Tomas Vondra
> wrote:
> > Thanks for the feedback, I'll fix this. I've allowed myself to be a bit
> > sloppy because the number of attributes in the statistics is currently
> > limited to 8, so the overflows are currently not an issue. But it doesn't
> > hurt to make it future-proof, in case we change that mostly artificial limit
> > sometime in the future.
> >
>
> Ah right, so it can't overflow at present, but it's neater to have an
> overflow-proof algorithm.
>
> Thinking about the exactness of the division steps is quite
> interesting. Actually, the order of the multiplying factors doesn't
> matter as long as the divisors are in increasing order. So in both my
> proposal:
>
> result = 1
> for (i = 1; i <= k; i++)
> result = (result * (n-k+i)) / i;
>
> and David's proposal, which is equivalent but has the multiplying
> factors in the opposite order, equivalent to:
>
> result = 1
> for (i = 1; i <= k; i++)
> result = (result * (n-i+1)) / i;
>
> the divisions are exact at each step. The first time through the loop
> it divides by 1 which is trivially exact. The second time it divides
> by 2, having multiplied by 2 consecutive factors, one of which is
> therefore guaranteed to be divisible by 2. The third time it divides
> by 3, having multiplied by 3 consecutive factors, one of which is
> therefore guaranteed to be divisible by 3, and so on.

Right.  You know you can use integer division, which make sense as
permutations of discrete sets are always integers.

> My approach originally seemed more logical to me because of the way it
> derives from the recurrence relation binomial(n, k) = binomial(n-1,
> k-1) * n / k, but they both work fine as long as they have suitable
> overflow checks.

Right.  We could even cache those checks (sorry) based on data type
limits by architecture and OS if performance on those operations ever
matters that much.

> It's also interesting that descriptions of this algorithm tend to
> talk about setting k to min(k, n-k) at the start as an optimisation
> step, as I did in fact, whereas it's actually more than that -- it
> helps prevent unnecessary intermediate overflows when k > n/2. Of
> course, that's not a worry for the current use of this function, but
> it's good to have a robust algorithm.

Indeed. :)

Best,
David.
--
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On 11 February 2017 at 01:17, Tomas Vondra  wrote:
> Thanks for the feedback, I'll fix this. I've allowed myself to be a bit
> sloppy because the number of attributes in the statistics is currently
> limited to 8, so the overflows are currently not an issue. But it doesn't
> hurt to make it future-proof, in case we change that mostly artificial limit
> sometime in the future.
>

Ah right, so it can't overflow at present, but it's neater to have an
overflow-proof algorithm.

Thinking about the exactness of the division steps is quite
interesting. Actually, the order of the multiplying factors doesn't
matter as long as the divisors are in increasing order. So in both my
proposal:

result = 1
for (i = 1; i <= k; i++)
result = (result * (n-k+i)) / i;

and David's proposal, which is equivalent but has the multiplying
factors in the opposite order, equivalent to:

result = 1
for (i = 1; i <= k; i++)
result = (result * (n-i+1)) / i;

the divisions are exact at each step. The first time through the loop
it divides by 1 which is trivially exact. The second time it divides
by 2, having multiplied by 2 consecutive factors, one of which is
therefore guaranteed to be divisible by 2. The third time it divides
by 3, having multiplied by 3 consecutive factors, one of which is
therefore guaranteed to be divisible by 3, and so on.

My approach originally seemed more logical to me because of the way it
derives from the recurrence relation binomial(n, k) = binomial(n-1,
k-1) * n / k, but they both work fine as long as they have suitable
overflow checks.

It's also interesting that descriptions of this algorithm tend to talk
about setting k to min(k, n-k) at the start as an optimisation step,
as I did in fact, whereas it's actually more than that -- it helps
prevent unnecessary intermediate overflows when k > n/2. Of course,
that's not a worry for the current use of this function, but it's good
to have a robust algorithm.

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
On 02/08/2017 07:40 PM, Dean Rasheed wrote:

On 8 February 2017 at 16:09, David Fetter  wrote:

Combinations are n!/(k! * (n-k)!), so computing those is more
along the lines of:

unsigned long long
choose(unsigned long long n, unsigned long long k) {
if (k > n) {
return 0;
}
unsigned long long r = 1;
for (unsigned long long d = 1; d <= k; ++d) {
r *= n--;
r /= d;
}
return r;
}

which greatly reduces the chance of overflow.

Hmm, but that doesn't actually prevent overflows, since it can
overflow in the multiplication step, and there is no protection
against that.

In the algorithm I presented, the inputs and the intermediate result
are kept below INT_MAX, so the multiplication step cannot overflow the
64-bit integer, and it will only raise an overflow error if the actual
result won't fit in a 32-bit int. Actually a crucial part of that,
which I failed to mention previously, is the first step replacing k
with min(k, n-k). This is necessary for inputs like (100,99), which
should return 100, and which must be computed as 100 choose 1, not 100
choose 99, otherwise it will overflow internally before getting to the
final result.

Thanks for the feedback, I'll fix this. I've allowed myself to be a bit
sloppy because the number of attributes in the statistics is currently
limited to 8, so the overflows are currently not an issue. But it
doesn't hurt to make it future-proof, in case we change that mostly
artificial limit sometime in the future.

regards

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On 8 February 2017 at 16:09, David Fetter  wrote:
> Combinations are n!/(k! * (n-k)!), so computing those is more
> along the lines of:
>
> unsigned long long
> choose(unsigned long long n, unsigned long long k) {
> if (k > n) {
> return 0;
> }
> unsigned long long r = 1;
> for (unsigned long long d = 1; d <= k; ++d) {
> r *= n--;
> r /= d;
> }
> return r;
> }
>
> which greatly reduces the chance of overflow.
>

Hmm, but that doesn't actually prevent overflows, since it can
overflow in the multiplication step, and there is no protection
against that.

In the algorithm I presented, the inputs and the intermediate result
are kept below INT_MAX, so the multiplication step cannot overflow the
64-bit integer, and it will only raise an overflow error if the actual
result won't fit in a 32-bit int. Actually a crucial part of that,
which I failed to mention previously, is the first step replacing k
with min(k, n-k). This is necessary for inputs like (100,99), which
should return 100, and which must be computed as 100 choose 1, not 100
choose 99, otherwise it will overflow internally before getting to the
final result.

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On Wed, Feb 08, 2017 at 03:23:25PM +, Dean Rasheed wrote:
> On 6 February 2017 at 21:26, Alvaro Herrera  wrote:
> > Tomas Vondra wrote:
> >> On 02/01/2017 11:52 PM, Alvaro Herrera wrote:
> >
> >> > Nearby, some auxiliary functions such as n_choose_k and
> >> > num_combinations are not documented. What it is that they do? I'd
> >> > move these at the end of the file, keeping the important entry points
> >> > at the top of the file.
> >>
> >> I'd say n-choose-k is pretty widely known term from combinatorics. The
> >> comment would essentially say just 'this is n-choose-k' which seems rather
> >> pointless. So as much as I dislike the self-documenting code, this actually
> >> seems like a good case of that.
> >
> > Actually, we do have such comments all over the place.  I knew this as
> > "n sobre k", so the english name doesn't immediately ring a bell with me
> > until I look it up; I think the function comment could just say
> > "n_choose_k -- this function returns the binomial coefficient".
>
> One of the things you have to watch out for when writing code to
> compute binomial coefficients is integer overflow, since the numerator
> and denominator get large very quickly. For example, the current code
> will overflow for n=13, k=12, which really isn't that large.
>
> This can be avoided by computing the product in reverse and using a
> larger datatype like a 64-bit integer to store a single intermediate
> result. The point about multiplying the terms in reverse is that it
> guarantees that each intermediate result is an exact integer (a
> smaller binomial coefficient), so there is no need to track separate
> numerators and denominators, and you avoid huge intermediate
> factorials. Here's what that looks like in psuedo-code:
>
> binomial(int n, int k):
> # Save computational effort by using the symmetry of the binomial
> # coefficients
> k = min(k, n-k);
>
> # Compute the result using binomial(n, k) = binomial(n-1, k-1) * n / k,
> # starting from binomial(n-k, 0) = 1, and computing the sequence
> # binomial(n-k+1, 1), binomial(n-k+2, 2), ...
> #
> # Note that each intermediate result is an exact integer.
> int64 result = 1;
> for (int i = 1; i <= k; i++)
> {
> result = (result * (n-k+i)) / i;
> if (result > INT_MAX) Raise overflow error
> }
> return (int) result;
>
>
> Note also that I think num_combinations(n) is just an expensive way of
> calculating 2^n - n - 1.

Combinations are n!/(k! * (n-k)!), so computing those is more
along the lines of:

unsigned long long
choose(unsigned long long n, unsigned long long k) {
if (k > n) {
return 0;
}
unsigned long long r = 1;
for (unsigned long long d = 1; d <= k; ++d) {
r *= n--;
r /= d;
}
return r;
}

which greatly reduces the chance of overflow.

Best,
David.
--
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On 6 February 2017 at 21:26, Alvaro Herrera  wrote:
> Tomas Vondra wrote:
>> On 02/01/2017 11:52 PM, Alvaro Herrera wrote:
>
>> > Nearby, some auxiliary functions such as n_choose_k and
>> > num_combinations are not documented. What it is that they do? I'd
>> > move these at the end of the file, keeping the important entry points
>> > at the top of the file.
>>
>> I'd say n-choose-k is pretty widely known term from combinatorics. The
>> comment would essentially say just 'this is n-choose-k' which seems rather
>> pointless. So as much as I dislike the self-documenting code, this actually
>> seems like a good case of that.
>
> Actually, we do have such comments all over the place.  I knew this as
> "n sobre k", so the english name doesn't immediately ring a bell with me
> until I look it up; I think the function comment could just say
> "n_choose_k -- this function returns the binomial coefficient".
>

One of the things you have to watch out for when writing code to
compute binomial coefficients is integer overflow, since the numerator
and denominator get large very quickly. For example, the current code
will overflow for n=13, k=12, which really isn't that large.

This can be avoided by computing the product in reverse and using a
larger datatype like a 64-bit integer to store a single intermediate
result. The point about multiplying the terms in reverse is that it
guarantees that each intermediate result is an exact integer (a
smaller binomial coefficient), so there is no need to track separate
numerators and denominators, and you avoid huge intermediate
factorials. Here's what that looks like in psuedo-code:

binomial(int n, int k):
# Save computational effort by using the symmetry of the binomial
# coefficients
k = min(k, n-k);

# Compute the result using binomial(n, k) = binomial(n-1, k-1) * n / k,
# starting from binomial(n-k, 0) = 1, and computing the sequence
# binomial(n-k+1, 1), binomial(n-k+2, 2), ...
#
# Note that each intermediate result is an exact integer.
int64 result = 1;
for (int i = 1; i <= k; i++)
{
result = (result * (n-k+i)) / i;
if (result > INT_MAX) Raise overflow error
}
return (int) result;

Note also that I think num_combinations(n) is just an expensive way of
calculating 2^n - n - 1.

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```Still about 0003.  dependencies.c comment at the top of the file should
contain some details about what is it implementing and a general
description of the algorithm and data structures.  As before, it's best
to have the main entry point build_mv_dependencies at the top, the other
public functions, keeping the internal routines at the bottom of the
file.  That eases code study for future readers.  (Minimizing number of
function prototypes is not a goal.)

What is MVSTAT_DEPS_TYPE_BASIC?  Is "functional dependencies" really
BASIC?  I wonder if it should be TYPE_FUNCTIONAL_DEPS or something.

As with pg_ndistinct_out, there's no need to pstrdup(str.data), as it's
already palloc'ed in the right context.

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```Looking at 0003, I notice that gram.y is changed to add a WITH ( .. )
clause.  If it's not specified, an error is raised.  If you create
stats with (ndistinct) then you can't alter it later to add
"dependencies" or whatever; unless I misunderstand, you have to drop the
statistics and create another one.  Probably in a forthcoming patch we
should have ALTER support to add a stats type.

Also, why isn't the default to build everything, rather than nothing?

BTW, almost everything in the backend could be inside "utils/", so let's
not do that -- let's just create src/backend/statistics/ for all your
code.

questions.

index 908f094..7f3ed3d 100644
@@ -36,7 +36,7 @@ design choice to model the dataset in denormalized way,
either because of
performance or to make querying easier.

-soft dependencies
+Soft dependencies
-

Real-world data sets often contain data errors, either because of data entry
@@ -48,7 +48,7 @@ rendering the approach mostly useless even for slightly noisy
data sets, or
result in sudden changes in behavior depending on minor differences between
samples provided to ANALYZE.

-For this reason the statistics implementes "soft" functional dependencies,
+For this reason the statistics implements "soft" functional dependencies,
associating each functional dependency with a degree of validity (a number
number between 0 and 1). This degree is then used to combine selectivities
in a smooth manner.
@@ -75,6 +75,7 @@ The algorithm also requires a minimum size of the group to
consider it
consistent (currently 3 rows in the sample). Small groups make it less likely
to break the consistency.

+## What is it that we store in the catalog?

Clause reduction (planner/optimizer)

@@ -95,12 +96,12 @@ example for (a,b,c) we first use (a,b=>c) to break the
computation into
and then apply (a=>b) the same way on P(a=?,b=?).

-Consistecy of clauses
+Consistency of clauses
-

Functional dependencies only express general dependencies between columns,
without referencing particular values. This assumes that the equality clauses
-are in fact consistent with the functinal dependency, i.e. that given a
+are in fact consistent with the functional dependency, i.e. that given a
dependency (a=>b), the value in (b=?) clause is the value determined by (a=?).
If that's not the case, the clauses are "inconsistent" with the functional
dependency and the result will be over-estimation.
@@ -111,6 +112,7 @@ set will be empty, but we'll estimate the selectivity using
the ZIP condition.

In this case the default estimation based on AVIA principle happens to work
better, but mostly by chance.
+## what is AVIA principle?

This issue is the price for the simplicity of functional dependencies. If the
application frequently constructs queries with clauses inconsistent with

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```Tomas Vondra wrote:
> On 02/01/2017 11:52 PM, Alvaro Herrera wrote:

> > Nearby, some auxiliary functions such as n_choose_k and
> > num_combinations are not documented. What it is that they do? I'd
> > move these at the end of the file, keeping the important entry points
> > at the top of the file.
>
> I'd say n-choose-k is pretty widely known term from combinatorics. The
> comment would essentially say just 'this is n-choose-k' which seems rather
> pointless. So as much as I dislike the self-documenting code, this actually
> seems like a good case of that.

Actually, we do have such comments all over the place.  I knew this as
"n sobre k", so the english name doesn't immediately ring a bell with me
until I look it up; I think the function comment could just say
"n_choose_k -- this function returns the binomial coefficient".

> > I see this patch has a estimate_ndistinct() which claims to be a re-
> > implementation of code already in analyze.c, but it is actually a lot
> > simpler than what analyze.c does.  I've been wondering if it'd be a good
> > idea to use some of this code so that some routines are moved out of
> > analyze.c; good implementations of statistics-related functions would
> > live in src/backend/statistics/ where they can be used both by analyze.c
> > and your new mvstats stuff.  (More generally I am beginning to wonder if
> > the new directory should be just src/backend/statistics.)
>
> I'll look into that. I have to check if I ignored some assumptions or corner
> cases the analyze.c deals with.

Maybe it's not terribly important to refactor analyze.c from the get go,
but let's give the subdir a more general name.  Hence my vote for having
the subdir be "statistics" instead of "mvstats".

> > In another subthread you seem to have surrendered to the opinion that
> > the new catalog should be called pg_statistics_ext, just in case in the
> > future we come up with additional things to put on it.  However, given
> > its schema, with a "starelid / stakeys", is it sensible to think that
> > we're going to get anything other than something that involves multiple
> > variables?  Maybe it should just be "pg_statistics_multivar" and if
> > something else comes along we create another catalog with an appropriate
> > schema.  Heck, how does this catalog serve the purpose of cross-table
> > statistics in the first place, given that it has room to record a single
> > relid only?  Are you thinking that in the future you'd change starelid
> > into an oidvector column?
>
> Yes, I think the starelid will turn into OID vector. The reason why I
> haven't done that in the current version of the catalog is to keep it
> simple.

OK -- as long as we know what the way forward is, I'm good.  Still, my
main point was that even if we have multiple rels, this catalog will be
about having multivariate statistics, and not different kinds of
statistical data.  I would keep pg_mv_statistics, really.

> > The comment in gram.y about the CREATE STATISTICS is at odds with what
> > is actually allowed by the grammar.
>
> Which comment?

This one:
*  CREATE STATISTICS stats_name ON relname (columns) WITH (options)
the production actually says:
CREATE STATISTICS any_name ON '(' columnList ')' FROM qualified_name

> > I think the name of a statistics is only useful to DROP/ALTER it, right?
> > I wonder why it's useful that statistics belongs in a schema.  Perhaps
> > it should be a global object?  I suppose the name collisions would
> > become bothersome if you have many mvstats.
>
> I think it shouldn't be a global object. I consider them to be a part of a
> schema (just like indexes, for example). Imagine you have a multi-tenant
> database, with using exactly the same (tables/indexes) schema, but keept in
> different schemas. Why shouldn't it be possible to also use the same set of
> statistics for each tenant?

True.  Suggestion withdrawn.

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On Thu, Feb 2, 2017 at 3:59 AM, Tomas Vondra
wrote:
> There's a subtle difference between pg_node_tree and the data types for
> statistics - pg_node_tree stores the value as a string (matching the
> nodeToString output), so the _in function is fairly simple. Of course,
> stringToNode() assumes safe input, which is why the input is disabled.
>
> OTOH the statistics are stored in an optimized binary format, allowing to
> use the value directly (without having to do expensive parsing etc).
>
> I was thinking that the easiest way to add support for _in would be to add a
> bunch of Nodes for the statistics, along with in/out functions, but keeping
> the internal binary representation. But that'll be tricky to do in a safe
> way - even if those nodes are coded in a very defensive ways, I'd bet
> there'll be ways to inject unsafe nodes.
>
> So I'm OK with not having the _in for now. If needed, it's possible to
> construct the statistics as a bytea using a bit of C code. That's at least
> obviously unsafe, as anything written in C, touching the memory.

Since these data types are already special-purpose, I don't really see
why it would be desirable to entangle them with the existing code for
serializing and deserializing Nodes.  Whether or not it's absolutely
necessary for these types to have input functions, it seems at least
possible that it would be useful, and it becomes much less likely that
we can make that work if it's piggybacking on stringToNode().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```

On 02/01/2017 11:52 PM, Alvaro Herrera wrote:

Still looking at 0002.

pg_ndistinct_in disallows input, claiming that pg_node_tree does the
same thing. But pg_node_tree does it for security reasons: you could
crash the backend if you supplied a malicious value. I don't think

that applies to pg_ndistinct_in. Perhaps it will be useful to inject
fake stats at some point, so why not allow it? It shouldn't be
complicated (though it does require writing some additional code, so
perhaps that's one reason we don't want to allow input of these
values).

>

Yes, I haven't written the code, and I'm not sure it's a very practical
way to inject custom statistics. But if we decide to allow that in the
future, we can probably add the code.

There's a subtle difference between pg_node_tree and the data types for
statistics - pg_node_tree stores the value as a string (matching the
nodeToString output), so the _in function is fairly simple. Of course,
stringToNode() assumes safe input, which is why the input is disabled.

OTOH the statistics are stored in an optimized binary format, allowing
to use the value directly (without having to do expensive parsing etc).

I was thinking that the easiest way to add support for _in would be to
add a bunch of Nodes for the statistics, along with in/out functions,
but keeping the internal binary representation. But that'll be tricky to
do in a safe way - even if those nodes are coded in a very defensive
ways, I'd bet there'll be ways to inject unsafe nodes.

So I'm OK with not having the _in for now. If needed, it's possible to
construct the statistics as a bytea using a bit of C code. That's at
least obviously unsafe, as anything written in C, touching the memory.

The comment on top of pg_ndistinct_out is missing the "_out"; also it

OK, will fix.

In the same function, a trivial point you don't need to pstrdup() the
.data out of a stringinfo; it's already palloc'ed in the right context
-- just PG_RETURN_CSTRING(str.data) and forget about "ret".  Saves you
one line.

Will fix too.

Nearby, some auxiliary functions such as n_choose_k and
num_combinations are not documented. What it is that they do? I'd
move these at the end of the file, keeping the important entry points
at the top of the file.

I'd say n-choose-k is pretty widely known term from combinatorics. The
comment would essentially say just 'this is n-choose-k' which seems
rather pointless. So as much as I dislike the self-documenting code,
this actually seems like a good case of that.

I see this patch has a estimate_ndistinct() which claims to be a re-
implementation of code already in analyze.c, but it is actually a lot
simpler than what analyze.c does.  I've been wondering if it'd be a good
idea to use some of this code so that some routines are moved out of
analyze.c; good implementations of statistics-related functions would
live in src/backend/statistics/ where they can be used both by analyze.c
and your new mvstats stuff.  (More generally I am beginning to wonder if
the new directory should be just src/backend/statistics.)

I'll look into that. I have to check if I ignored some assumptions or
corner cases the analyze.c deals with.

common.h does not belong in src/backend/utils/mvstats; IMO it should be
called src/include/utils/mvstat.h.  Also, it must not include
postgres.h, and it probably doesn't need most of the #includes it has;
those are better put into whatever include it.  It definitely needs a
guarding #ifdef MVSTATS_H around its whole content too.  An include file
is not just a way to avoid #includes in other files; it is supposed to
be a minimally invasive way of exporting the structs and functions
implemented in some file into other files.  So it must be kept minimal.

Will do.

psql/tab-complete.c compares the wrong version number (9.6 instead of
10).

Is it important to have a cast from pg_ndistinct to bytea?  I think
it's odd that outputting it as bytea yields something completely
different than as text.  (The bytea is not human readable and cannot be
used for future input, so what is the point?)

Because it internally is a bytea, and it seems useful to have the
ability to inspect the bytea value directly (e.g. to see the length of
the bytea and not the string output).

In another subthread you seem to have surrendered to the opinion that
the new catalog should be called pg_statistics_ext, just in case in the
future we come up with additional things to put on it.  However, given
its schema, with a "starelid / stakeys", is it sensible to think that
we're going to get anything other than something that involves multiple
variables?  Maybe it should just be "pg_statistics_multivar" and if
something else comes along we create another catalog with an appropriate
schema.  Heck, how does this catalog serve the purpose of cross-table
statistics in the first place, given that it has room to record a single
```

### Re: [HACKERS] multivariate statistics (v19)

```Still looking at 0002.

pg_ndistinct_in disallows input, claiming that pg_node_tree does the
same thing.  But pg_node_tree does it for security reasons: you could
crash the backend if you supplied a malicious value.  I don't think that
applies to pg_ndistinct_in.  Perhaps it will be useful to inject fake
stats at some point, so why not allow it?  It shouldn't be complicated
(though it does require writing some additional code, so perhaps that's
one reason we don't want to allow input of these values).

The comment on top of pg_ndistinct_out is missing the "_out"; also it

In the same function, a trivial point you don't need to pstrdup() the
.data out of a stringinfo; it's already palloc'ed in the right context
-- just PG_RETURN_CSTRING(str.data) and forget about "ret".  Saves you
one line.

Nearby, some auxiliary functions such as n_choose_k and num_combinations
are not documented.  What it is that they do?  I'd move these at the end
of the file, keeping the important entry points at the top of the file.

I see this patch has a estimate_ndistinct() which claims to be a re-
implementation of code already in analyze.c, but it is actually a lot
simpler than what analyze.c does.  I've been wondering if it'd be a good
idea to use some of this code so that some routines are moved out of
analyze.c; good implementations of statistics-related functions would
live in src/backend/statistics/ where they can be used both by analyze.c
and your new mvstats stuff.  (More generally I am beginning to wonder if
the new directory should be just src/backend/statistics.)

common.h does not belong in src/backend/utils/mvstats; IMO it should be
called src/include/utils/mvstat.h.  Also, it must not include
postgres.h, and it probably doesn't need most of the #includes it has;
those are better put into whatever include it.  It definitely needs a
guarding #ifdef MVSTATS_H around its whole content too.  An include file
is not just a way to avoid #includes in other files; it is supposed to
be a minimally invasive way of exporting the structs and functions
implemented in some file into other files.  So it must be kept minimal.

psql/tab-complete.c compares the wrong version number (9.6 instead of
10).

Is it important to have a cast from pg_ndistinct to bytea?  I think
it's odd that outputting it as bytea yields something completely
different than as text.  (The bytea is not human readable and cannot be
used for future input, so what is the point?)

In another subthread you seem to have surrendered to the opinion that
the new catalog should be called pg_statistics_ext, just in case in the
future we come up with additional things to put on it.  However, given
its schema, with a "starelid / stakeys", is it sensible to think that
we're going to get anything other than something that involves multiple
variables?  Maybe it should just be "pg_statistics_multivar" and if
something else comes along we create another catalog with an appropriate
schema.  Heck, how does this catalog serve the purpose of cross-table
statistics in the first place, given that it has room to record a single
relid only?  Are you thinking that in the future you'd change starelid
into an oidvector column?

The comment in gram.y about the CREATE STATISTICS is at odds with what
is actually allowed by the grammar.

I think the name of a statistics is only useful to DROP/ALTER it, right?
I wonder why it's useful that statistics belongs in a schema.  Perhaps
it should be a global object?  I suppose the name collisions would
become bothersome if you have many mvstats.

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
On 01/31/2017 07:52 AM, Amit Langote wrote:

On 2017/01/31 6:57, Tomas Vondra wrote:

On 01/30/2017 09:37 PM, Alvaro Herrera wrote:

Looks good to me.  I don't think we need to keep the names very short --
I would propose "standistinct", "stahistogram", "stadependencies".

Yeah, I got annoyed by the short names too.

This however reminds me that perhaps pg_mv_statistic is not the best name.
I know others proposed pg_statistic_ext (and pg_stats_ext), and while I
wasn't a big fan initially, I think it's a better name. People generally
don't know what 'multivariate' means, while 'extended' is better known
(e.g. because Oracle uses it for similar stuff).

So I think I'll switch to that name too.

+1 to pg_statistics_ext. Maybe, even pg_statistics_extended, however
being that verbose may not be warranted.

Yeah, I think pg_statistic_extended / pg_stats_extended seems fine.

regards

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On 2017/01/31 6:57, Tomas Vondra wrote:
> On 01/30/2017 09:37 PM, Alvaro Herrera wrote:
>> Looks good to me.  I don't think we need to keep the names very short --
>> I would propose "standistinct", "stahistogram", "stadependencies".
>>
>
> Yeah, I got annoyed by the short names too.
>
> This however reminds me that perhaps pg_mv_statistic is not the best name.
> I know others proposed pg_statistic_ext (and pg_stats_ext), and while I
> wasn't a big fan initially, I think it's a better name. People generally
> don't know what 'multivariate' means, while 'extended' is better known
> (e.g. because Oracle uses it for similar stuff).
>
> So I think I'll switch to that name too.

+1 to pg_statistics_ext.  Maybe, even pg_statistics_extended, however
being that verbose may not be warranted.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On Tue, Jan 31, 2017 at 6:57 AM, Tomas Vondra
wrote:
> This however reminds me that perhaps pg_mv_statistic is not the best name. I
> know others proposed pg_statistic_ext (and pg_stats_ext), and while I wasn't
> a big fan initially, I think it's a better name. People generally don't know
> what 'multivariate' means, while 'extended' is better known (e.g. because
> Oracle uses it for similar stuff).
>
> So I think I'll switch to that name too.

I have moved this patch to the next CF, with Álvaro as reviewer.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
On 01/30/2017 09:37 PM, Alvaro Herrera wrote:

Tomas Vondra wrote:

The 'built' flags may be easily replaced with a check if the bytea-like
columns are NULL, and the 'enabled' columns may be replaced by the array of
char, just like you proposed.

That'd give us a single catalog looking like this:

pg_mv_statistics
starelid
staname
stanamespace
staowner  -- all the above as currently
staenabledarray of "char" {d,f,s}
stakeys
standist (ndistinct coefficients)
stamcv   (MCV list)
stahist  (histogram)

Which is probably a better / simpler structure than the current one.

Looks good to me.  I don't think we need to keep the names very short --
I would propose "standistinct", "stahistogram", "stadependencies".

Yeah, I got annoyed by the short names too.

This however reminds me that perhaps pg_mv_statistic is not the best
name. I know others proposed pg_statistic_ext (and pg_stats_ext), and
while I wasn't a big fan initially, I think it's a better name. People
generally don't know what 'multivariate' means, while 'extended' is
better known (e.g. because Oracle uses it for similar stuff).

So I think I'll switch to that name too.

regards

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```Tomas Vondra wrote:

> The 'built' flags may be easily replaced with a check if the bytea-like
> columns are NULL, and the 'enabled' columns may be replaced by the array of
> char, just like you proposed.
>
> That'd give us a single catalog looking like this:
>
> pg_mv_statistics
>   starelid
>   staname
>   stanamespace
>   staowner  -- all the above as currently
>   staenabled  array of "char" {d,f,s}
>   stakeys
>   standist (ndistinct coefficients)
>   stamcv   (MCV list)
>   stahist  (histogram)
>
> Which is probably a better / simpler structure than the current one.

Looks good to me.  I don't think we need to keep the names very short --
I would propose "standistinct", "stahistogram", "stadependencies".

Thanks,

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
On 01/30/2017 05:12 PM, Alvaro Herrera wrote:
>

Hmm.  So we have a catalog pg_mv_statistics which stores two things:
1. the configuration regarding mvstats that have been requested by user
via CREATE/ALTER STATISTICS
2. the actual values captured from the above, via ANALYZE

I think this conflates two things that really are separate, given their
different timings and usage patterns.  This decision is causing the
catalog to have columns enabled/built flags for each set of stats
requested, which looks a bit odd.  In particular, the fact that you have
looks inconvenient.

Have you thought about having the "requested" bits be separate from the
actual computed values?  Something like

pg_mv_statistics
starelid
staname
stanamespace
staowner   -- all the above as currently
staenabledarray of "char" {d,f,s}
stakeys
// no CATALOG_VARLEN here

where each char in the staenabled array has a #define and indicates one
type, "ndistinct", "functional dep", "selectivity" etc.

The actual values computed by ANALYZE would live in a catalog like:

pg_mv_statistics_values
stvstaid  -- OID of the corresponding pg_mv_statistics row.  Needed?

Definitely needed. How else would you know which MCV list and histogram
belong together? This works just like in pg_statistic - when both MCV
and histograms are enabled for the statistic, we first build MCV list,
then histogram on remaining rows. So we need to pair them.

stvrelid  -- same as starelid
stvkeys   -- same as stakeys
#ifdef CATALOG_VARLEN
stvkind   'd' or 'f' or 's', etc
stvvalue  the bytea blob
#endif

I think that would be simpler, both conceptually and in terms of code.

I think the main issue here is that it throws away the special data
types (pg_histogram, pg_mcv, pg_ndistinct, pg_dependencies), which I
think is a neat idea and would like to keep it. This would throw that
away, making everything bytea again. I don't like that.

The other angle to consider is planner-side: how does the planner gets
to the values?  I think as far as the planner goes, the first catalog
doesn't matter at all, because a statistics type that has been enabled
but not computed is not interesting at all; planner only cares about the
values in the second catalog (this is why I added stvkeys).  Currently
you're just caching a single pg_mv_statistics row in get_relation_info
(and only if any of the "built" flags is set), which is simple.  With my
proposed change, you'd need to keep multiple pg_mv_statistics_values
rows.

But maybe you already tried something like what I propose and there's a
reason not to do it?

Honestly, I don't see how this improves the situation. We still need to
cache data for exactly one catalog, so how is that simpler?

The way I see it, it actually makes things more complicated, because now
we have two catalogs to manage instead of one (e.g. when doing DROP
STATISTICS, or after ALTER TABLE ... DROP COLUMN).

The 'built' flags may be easily replaced with a check if the bytea-like
columns are NULL, and the 'enabled' columns may be replaced by the array
of char, just like you proposed.

That'd give us a single catalog looking like this:

pg_mv_statistics
starelid
staname
stanamespace
staowner  -- all the above as currently
staenabledarray of "char" {d,f,s}
stakeys
standist (ndistinct coefficients)
stamcv   (MCV list)
stahist  (histogram)

Which is probably a better / simpler structure than the current one.

regards

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
On 01/30/2017 05:55 PM, Alvaro Herrera wrote:

Minor nitpicks:

Let me suggest to use get_attnum() in CreateStatistics instead of
SearchSysCacheAttName for each column.  Also, we use type AttrNumber for
attribute numbers rather than int16.  Finally in the same function you
have an erroneous ERRCODE_UNDEFINED_COLUMN which should be
ERRCODE_DUPLICATE_COLUMN in the loop that searches for duplicates.

May I suggest that compare_int16 be named attnum_cmp (just to be
consistent with other qsort comparators) and look like
return *((const AttrNumber *) a) - *((const AttrNumber *) b);

Yes, I think this is pretty much what Kyotaro-san pointed out in his
review. I'll go through the patch and make sure the correct data types
are used.

regards

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
Hello,

On 01/26/2017 10:03 AM, Ideriha, Takeshi wrote:

Though I pointed out these typoes and so on,
I believe these feedback are less priority compared to the source code itself.

So please work on my feedback if you have time.

I think getting the comments (and docs in general) right is just as
important as the code. So thank you for your review!

regards

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
On 01/26/2017 10:43 AM, Dilip Kumar wrote:

histograms
--
+ if (matches[i] == MVSTATS_MATCH_FULL)
+ s += mvhist->buckets[i]->ntuples;
+ else if (matches[i] == MVSTATS_MATCH_PARTIAL)
+ s += 0.5 * mvhist->buckets[i]->ntuples;

Isn't it will be better that take some percentage of the bucket based
on the number of distinct element for partial matching buckets.

I don't think so, for the same reason why ineq_histogram_selectivity()
in selfuncs.c uses

binfrac = 0.5;

for partial bucket matches - it provides minimum average error. Even if
we knew the number of distinct items in the bucket, we have no idea what
the distribution within the bucket looks like. Maybe 99% of the bucket
are covered by a single distinct value, maybe all the items are squashed
on one side of the bucket, etc.

Moreover we don't really know the number of distinct values in the
bucket - we only know the number of distinct items in the sample, and
only while building the histogram. I don't think it makes much sense to
estimate the number of distinct items in a bucket, because the buckets
contain only very few rows so the estimates would be wildly inaccurate.

+static int
+update_match_bitmap_histogram(PlannerInfo *root, List *clauses,
+  int2vector *stakeys,
+  MVSerializedHistogram mvhist,
+  int nmatches, char *matches,
+  bool is_or)
+{
+ int i;

For each clause we are processing all the buckets, can't we use some
data structure which can make multi-dimensions information searching
faster.

>

No, we're not processing all buckets for each clause. We're' only
processing buckets that were not "ruled out" by preceding clauses.
That's the whole point of the bitmap.

For example for condition (a=1) AND (b=2), the code will first evaluate
(a=1) on all buckets, and then (b=2) but only on buckets where (a=1) was
evaluated as true. Similarly for OR clauses.

>

Something like HTree, RTree, Maybe storing histogram in these formats
will be difficult?

Maybe, but I don't want to do that in the first version. I'm not opposed
to doing that in the future, if we find out the v1 histograms are not
efficient (I don't think we will, based on tests I did while working on
the patch). Support for other histogram implementations is pretty much
why there is 'type' field in the struct.

For now I think we should stick with the simple implementation.

regards

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```Minor nitpicks:

Let me suggest to use get_attnum() in CreateStatistics instead of
SearchSysCacheAttName for each column.  Also, we use type AttrNumber for
attribute numbers rather than int16.  Finally in the same function you
have an erroneous ERRCODE_UNDEFINED_COLUMN which should be
ERRCODE_DUPLICATE_COLUMN in the loop that searches for duplicates.

May I suggest that compare_int16 be named attnum_cmp (just to be
consistent with other qsort comparators) and look like
return *((const AttrNumber *) a) - *((const AttrNumber *) b);

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```Tomas Vondra wrote:
> On 01/03/2017 05:22 PM, Tomas Vondra wrote:
> > On 01/03/2017 02:42 PM, Dilip Kumar wrote:
> ...
> > > I think it should be easily reproducible, in case it's not I can send
> > > call stack or core dump.
> > >
> >
> > Thanks for the report. It was trivial to reproduce and it turned out to
> > be a fairly simple bug. Will send a new version of the patch soon.
> >
>
> Attached is v22 of the patch series, rebased to current master and fixing
> the reported bug. I haven't made any other changes - the issues reported by
> Petr are mostly minor, so I've decided to wait a bit more for (hopefully)
> other reviews.

Hmm.  So we have a catalog pg_mv_statistics which stores two things:
1. the configuration regarding mvstats that have been requested by user
via CREATE/ALTER STATISTICS
2. the actual values captured from the above, via ANALYZE

I think this conflates two things that really are separate, given their
different timings and usage patterns.  This decision is causing the
catalog to have columns enabled/built flags for each set of stats
requested, which looks a bit odd.  In particular, the fact that you have
looks inconvenient.

Have you thought about having the "requested" bits be separate from the
actual computed values?  Something like

pg_mv_statistics
starelid
staname
stanamespace
staowner   -- all the above as currently
staenabledarray of "char" {d,f,s}
stakeys
// no CATALOG_VARLEN here

where each char in the staenabled array has a #define and indicates one
type, "ndistinct", "functional dep", "selectivity" etc.

The actual values computed by ANALYZE would live in a catalog like:

pg_mv_statistics_values
stvstaid  -- OID of the corresponding pg_mv_statistics row.  Needed?
stvrelid  -- same as starelid
stvkeys   -- same as stakeys
#ifdef CATALOG_VARLEN
stvkind   'd' or 'f' or 's', etc
stvvalue  the bytea blob
#endif

I think that would be simpler, both conceptually and in terms of code.

The other angle to consider is planner-side: how does the planner gets
to the values?  I think as far as the planner goes, the first catalog
doesn't matter at all, because a statistics type that has been enabled
but not computed is not interesting at all; planner only cares about the
values in the second catalog (this is why I added stvkeys).  Currently
you're just caching a single pg_mv_statistics row in get_relation_info
(and only if any of the "built" flags is set), which is simple.  With my
proposed change, you'd need to keep multiple pg_mv_statistics_values
rows.

But maybe you already tried something like what I propose and there's a
reason not to do it?

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```Hello, I'll return on this since this should welcome more eyeballs.

At Thu, 26 Jan 2017 09:03:10 +, "Ideriha, Takeshi"
wrote in
<4E72940DA2BF16479384A86D54D0988A565822A9@G01JPEXMBKW04>
> Hi
>
> When you have time, could you rebase the pathes?
> Some patches cannot be applied to the current HEAD.

For those who are willing to look this,
352a24a1f9d6f7d4abb1175bfd22acc358f43140 breaks this. So just
before it can accept this patches cleanly.

> 0001 patch can be applied but the following 0002 patch cannot be.
>
> code.)
>
> Though these are minor things, I've found some typos or mistakes in the
>
> >+   statistics on the table. The statistics will be created in the in the
> >+   current database. The statistics will be owned by the user issuing
>
> Regarding line 629 at
> 0002-PATCH-shared-infrastructure-and-ndistinct-coeffi-v22.patch,
> there is a double "in the".
>
> >+   knowledge of a value in the first column is sufficient for detemining the
> >+   value in the other column. Then functional dependencies are built on
> >those
>
> Regarding line 701 at 0002-PATCH,
> "determining" is mistakenly spelled "detemining".
>
>
> >@@ -0,0 +1,98 @@
> >+Multivariate statististics
> >+==
>
> Regarding line 2415 at 0002-PATCH, "statististics" should be statistics
>
>
> >+
> >+  CREATE STATISTICS
> >+  define a new statistics
> >+
>
> >+
> >+  DROP STATISTICS
> >+  remove a statistics
> >+
>
> Regarding line 612 and 771 at 0002-PATCH,
> I assume saying "multiple statistics" explicitly is easier to understand to
> users
> since these commands don't for the statistics we already have in the
> pg_statistics in my understanding.
>
> >+   [1] http://en.wikipedia.org/wiki/Database_normalization
>
> Regarding line 386 at 0003-PATCH, is it better to change this link to this
> one:
> https://en.wikipedia.org/wiki/Functional_dependency ?
>
> Though I pointed out these typoes and so on,
> I believe these feedback are less priority compared to the source code itself.
>
> So please work on my feedback if you have time.

> dependencies, and for each one count the number of rows rows consistent it.
"of rows rows consistent it" => "or rows consistent with it"?

> are in fact consistent with the functinal dependency, i.e. that given the a

"that given the a" => "that given a" ?

dependencies.c:

dependency_dgree():

- The k is assumed larger than 1. I think assertion is required.

- "/* end of the preceding group */" seems to be better if it
is just after the "if (multi_sort.." currently just after it.

- The following comment seems mis-edited.
> * If there is a single are no contradicting rows, count the group
> * as supporting, otherwise contradicting.

maybe this would be like the following? The varialbe counting
the first "contradiction" is named "n_violations". This seems
somewhat confusing.

> * If there are no violating rows up to here, count the group
> * as supporting, otherwise contradicting.

- "/* first columns match, but the last one does not"
else if (multi_sort_compare_dims((k - 1), (k - 1), ...

The above comparison should use multi_sort_compare_dim, not
dims

- This function counts "n_contradicting_rows" but it is not
referenced. Anyway n_contradicting_rows = numrows -
n_supporing_rows so it and n_contradicting seem
unncecessary.

build_mv_dependencies():

- In the commnet,
"* covering jut 2 columns, to the largest ones, covering all columns"
"* included int the statistics. We start from the smallest ones because we"

l1: "jut" => "just", l2: "int" => "in"

mvstats.h:

- struct MVDependencyData/ MVDependenciesData

The varialbe length member at the last of the structs should
be defined using FLEXIBLE_ARRAY_MEMBER, from the convention.

- I'm not sure how much it impacts performance, but some
struct members seems to have a bit too wide types. For
example, MVDepedenciesData.type is of int32 but it can have
only '1' for now and it won't be two-digits. Also ndeps
cannot be so large.

common.c:

multi_sort_compare_dims needs comment.

general:
This patch uses int16 as the type of attrubute number but it
might be better to use AttrNumber for the purpose.
(Specifically it seems defined as the type for an attribute
index but also used as the varialbe for number of attributes)

Sorry for the random comment in advance. I'll learn this further.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On Thu, Jan 5, 2017 at 3:27 AM, Tomas Vondra
wrote:
> Thanks. Those plans match my experiments with the TPC-H data set, although
> I've been playing with the smallest scale (1GB).
>
> It's not very difficult to make the estimation error arbitrary large, e.g.
> by using perfectly correlated (identical) columns.

I have done an initial review for ndistint and histogram patches,

ndistinct
-
1. Duplicate statistics:
postgres=# create statistics s with (ndistinct) on (a,c) from t;
2017-01-07 16:21:54.575 IST [63817] ERROR:  duplicate key value
violates unique constraint "pg_mv_statistic_name_index"
2017-01-07 16:21:54.575 IST [63817] DETAIL:  Key (staname,
2017-01-07 16:21:54.575 IST [63817] STATEMENT:  create statistics s
with (ndistinct) on (a,c) from t;
ERROR:  duplicate key value violates unique constraint
"pg_mv_statistic_name_index"
DETAIL:  Key (staname, stanamespace)=(s, 2200) already exists.

For duplicate statistics, I think we can check the existence of the
statistics and give more meaningful error code something statistics

2. Typo
+ /*
+ * Sort the attnums, which makes detecting duplicies somewhat
+ * easier, and it does not hurt (it does not affect the efficiency,
+ * onlike for indexes, for example).
+ */
/onlike/unlike

3. Typo
/*
* Find attnims of MV stats using the mvoid.
*/
int2vector *
find_mv_attnums(Oid mvoid, Oid *relid)

/attnims/attnums

histograms
--
+ if (matches[i] == MVSTATS_MATCH_FULL)
+ s += mvhist->buckets[i]->ntuples;
+ else if (matches[i] == MVSTATS_MATCH_PARTIAL)
+ s += 0.5 * mvhist->buckets[i]->ntuples;

Isn't it will be better that take some percentage of the bucket based
on the number of distinct element for partial matching buckets.

+static int
+update_match_bitmap_histogram(PlannerInfo *root, List *clauses,
+  int2vector *stakeys,
+  MVSerializedHistogram mvhist,
+  int nmatches, char *matches,
+  bool is_or)
+{
+ int i;

For each clause we are processing all the buckets, can't we use some
data structure which can make multi-dimensions information searching
faster.
Something like HTree, RTree, Maybe storing histogram in these formats
will be difficult?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```Hi

When you have time, could you rebase the pathes?
Some patches cannot be applied to the current HEAD.
0001 patch can be applied but the following 0002 patch cannot be.

code.)

Though these are minor things, I've found some typos or mistakes in the

>+   statistics on the table. The statistics will be created in the in the
>+   current database. The statistics will be owned by the user issuing

Regarding line 629 at
0002-PATCH-shared-infrastructure-and-ndistinct-coeffi-v22.patch,
there is a double "in the".

>+   knowledge of a value in the first column is sufficient for detemining the
>+   value in the other column. Then functional dependencies are built on those

Regarding line 701 at 0002-PATCH,
"determining" is mistakenly spelled "detemining".

>@@ -0,0 +1,98 @@
>+Multivariate statististics
>+==

Regarding line 2415 at 0002-PATCH, "statististics" should be statistics

>+
>+  CREATE STATISTICS
>+  define a new statistics
>+

>+
>+  DROP STATISTICS
>+  remove a statistics
>+

Regarding line 612 and 771 at 0002-PATCH,
I assume saying "multiple statistics" explicitly is easier to understand to
users
since these commands don't for the statistics we already have in the
pg_statistics in my understanding.

>+   [1] http://en.wikipedia.org/wiki/Database_normalization

Regarding line 386 at 0003-PATCH, is it better to change this link to this one:
https://en.wikipedia.org/wiki/Functional_dependency ?

Though I pointed out these typoes and so on,
I believe these feedback are less priority compared to the source code itself.

So please work on my feedback if you have time.

regards,
Ideriha Takeshi

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On Wed, Jan 25, 2017 at 9:56 PM, Alvaro Herrera
wrote:
> Michael Paquier wrote:
>> And nothing has happened since. Are there people willing to review
>> this patch and help it proceed?
>
> I am going to grab this patch as committer.

Thanks, that's good to know.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```Michael Paquier wrote:
> On Wed, Jan 4, 2017 at 11:35 AM, Tomas Vondra
>  wrote:

> > Attached is v22 of the patch series, rebased to current master and fixing
> > the reported bug. I haven't made any other changes - the issues reported by
> > Petr are mostly minor, so I've decided to wait a bit more for (hopefully)
> > other reviews.
>
> And nothing has happened since. Are there people willing to review
> this patch and help it proceed?

I am going to grab this patch as committer.

> As this patch is quite large, I am not sure if it is fit to join the
> last CF. Thoughts?

All patches, regardless of size, are welcome to join any commitfest.
The last commitfest is not different in that regard.  The rule I
remember is that patches may not arrive *for the first time* in the last
commitfest.  This patch has already seen a lot of work in previous
commitfests, so it's fine.

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On Wed, Jan 4, 2017 at 11:35 AM, Tomas Vondra
wrote:
> On 01/03/2017 05:22 PM, Tomas Vondra wrote:
>>
>> On 01/03/2017 02:42 PM, Dilip Kumar wrote:
>
> ...
>>>
>>> I think it should be easily reproducible, in case it's not I can send
>>> call stack or core dump.
>>>
>>
>> Thanks for the report. It was trivial to reproduce and it turned out to
>> be a fairly simple bug. Will send a new version of the patch soon.
>>
>
> Attached is v22 of the patch series, rebased to current master and fixing
> the reported bug. I haven't made any other changes - the issues reported by
> Petr are mostly minor, so I've decided to wait a bit more for (hopefully)
> other reviews.

And nothing has happened since. Are there people willing to review
this patch and help it proceed? As this patch is quite large, I am not
sure if it is fit to join the last CF. Thoughts?
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
On 01/04/2017 03:21 PM, Dilip Kumar wrote:

On Wed, Jan 4, 2017 at 8:05 AM, Tomas Vondra
wrote:

Attached is v22 of the patch series, rebased to current master and fixing
the reported bug. I haven't made any other changes - the issues reported by
Petr are mostly minor, so I've decided to wait a bit more for (hopefully)
other reviews.

v22 fixes the problem, I reported.  In my test, I observed that group
by estimation is much better with ndistinct stat.

Here is one example:

postgres=# explain analyze select p_brand, p_type, p_size from part
group by p_brand, p_type, p_size;
QUERY PLAN
---
HashAggregate  (cost=37992.00..38992.00 rows=10 width=36) (actual
time=953.359..1011.302 rows=186607 loops=1)
Group Key: p_brand, p_type, p_size
->  Seq Scan on part  (cost=0.00..30492.00 rows=100 width=36)
(actual time=0.013..163.672 rows=100 loops=1)
Planning time: 0.194 ms
Execution time: 1020.776 ms
(5 rows)

postgres=# CREATE STATISTICS s2  WITH (ndistinct) on (p_brand, p_type,
p_size) from part;
CREATE STATISTICS
postgres=# analyze part;
ANALYZE
postgres=# explain analyze select p_brand, p_type, p_size from part
group by p_brand, p_type, p_size;
QUERY PLAN
---
HashAggregate  (cost=37992.00..39622.46 rows=163046 width=36) (actual
time=935.162..992.944 rows=186607 loops=1)
Group Key: p_brand, p_type, p_size
->  Seq Scan on part  (cost=0.00..30492.00 rows=100 width=36)
(actual time=0.013..156.746 rows=100 loops=1)
Planning time: 0.308 ms
Execution time: 1001.889 ms

In above example,
Without MVStat-> estimated: 10 Actual: 186607
With MVStat-> estimated: 163046 Actual: 186607

Thanks. Those plans match my experiments with the TPC-H data set,
although I've been playing with the smallest scale (1GB).

It's not very difficult to make the estimation error arbitrary large,
e.g. by using perfectly correlated (identical) columns.

regard

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On Wed, Jan 4, 2017 at 8:05 AM, Tomas Vondra
wrote:
> Attached is v22 of the patch series, rebased to current master and fixing
> the reported bug. I haven't made any other changes - the issues reported by
> Petr are mostly minor, so I've decided to wait a bit more for (hopefully)
> other reviews.

v22 fixes the problem, I reported.  In my test, I observed that group
by estimation is much better with ndistinct stat.

Here is one example:

postgres=# explain analyze select p_brand, p_type, p_size from part
group by p_brand, p_type, p_size;
QUERY PLAN
---
HashAggregate  (cost=37992.00..38992.00 rows=10 width=36) (actual
time=953.359..1011.302 rows=186607 loops=1)
Group Key: p_brand, p_type, p_size
->  Seq Scan on part  (cost=0.00..30492.00 rows=100 width=36)
(actual time=0.013..163.672 rows=100 loops=1)
Planning time: 0.194 ms
Execution time: 1020.776 ms
(5 rows)

postgres=# CREATE STATISTICS s2  WITH (ndistinct) on (p_brand, p_type,
p_size) from part;
CREATE STATISTICS
postgres=# analyze part;
ANALYZE
postgres=# explain analyze select p_brand, p_type, p_size from part
group by p_brand, p_type, p_size;
QUERY PLAN
---
HashAggregate  (cost=37992.00..39622.46 rows=163046 width=36) (actual
time=935.162..992.944 rows=186607 loops=1)
Group Key: p_brand, p_type, p_size
->  Seq Scan on part  (cost=0.00..30492.00 rows=100 width=36)
(actual time=0.013..156.746 rows=100 loops=1)
Planning time: 0.308 ms
Execution time: 1001.889 ms

In above example,
Without MVStat-> estimated: 10 Actual: 186607
With MVStat-> estimated: 163046 Actual: 186607

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
On 12/30/2016 02:12 PM, Petr Jelinek wrote:

On 12/12/16 22:50, Tomas Vondra wrote:

On 12/12/2016 12:26 PM, Amit Langote wrote:

Hi Tomas,

On 2016/10/30 4:23, Tomas Vondra wrote:

Hi,

Attached is v20 of the multivariate statistics patch series, doing
mostly
the changes outlined in the preceding e-mail from October 11.

The patch series currently has these parts:

* 0001 : (FIX) teach pull_varno about RestrictInfo
* 0002 : (PATCH) shared infrastructure and ndistinct coefficients

Hi,

I went over these two (IMHO those could easily be considered as minimal
committable set even if the user visible functionality they provide is
rather limited).

Yes, although I still have my doubts 0001 is the right way to make
pull_varnos work. It's probably related to the bigger design question,
because moving the statistics selection to an earlier phase could make
it unnecessary I guess.

dropping statistics
---

The statistics may be dropped automatically using DROP STATISTICS.

After ALTER TABLE ... DROP COLUMN, statistics referencing are:

(a) dropped, if the statistics would reference only one column

(b) retained, but modified on the next ANALYZE

This should be documented in user visible form if you plan to keep it
(it does make sense to me).

Yes, I plan to keep it. I agree it should be documented, probably on the
ALTER TABLE page (and linked from CREATE/DROP statistics pages).

+   correlation between columns is the purpose of multivariate statistics,
+   and the rest of this section thoroughly explains how the planner
+   leverages them to improve estimates.
+
+
+
+   READMEs for each type of statistics, mentioned in the following
+   sections.
+
+
+

I don't think this qualifies as "thoroughly explains" ;)

OK, I'll drop the "thoroughly" ;-)

+
+Oid
+get_statistics_oid(List *names, bool missing_ok)

No comment?

+   case OBJECT_STATISTICS:
+   msg = gettext_noop("statistics \"%s\" does not exist,
skipping");
+   name = NameListToString(objname);
+   break;

This sounds somewhat weird (plural vs singular).

Ah, right - it should be either "statistic ... does not" or "statistics
... do not". I think "statistics" is the right choice here, because (a)
we have CREATE STATISTICS and (b) it may be a combination of statistics,
e.g. histogram + MCV.

+ * XXX Maybe this should check for duplicate stats. Although it's not clear
+ * what "duplicate" would mean here (wheter to compare only keys or also
+ * options). Moreover, we don't do such checks for indexes, although those
+ * store tuples and recreating a new index may be a way to fix bloat (which
+ * is a problem statistics don't have).
+ */
+CreateStatistics(CreateStatsStmt *stmt)

I don't think we should check duplicates TBH so I would remove the XXX
(also "wheter" is typo but if you remove that paragraph it does not matter).

Yes, I came to the same conclusion - we can only really check for exact
matches (same set of columns, same choice of statistic types), but
that's fairly useless. I'll remove the XXX.

+   if (true)
+   {

Huh?

Yeah, that's a bit weird pattern. It's a remainder of copy-pasting the
preceding block, which looks like this

if (hasindex)
{
...
}

But we've decided to not add similar flag for the statistics. I'll move
the block to a separate function (instead of merging it directly into
the function, which is already a bit largeish).

+
+List *
+RelationGetMVStatList(Relation relation)
+{

...

+
+void
+update_mv_stats(Oid mvoid, MVNDistinct ndistinct,
+   int2vector *attrs, VacAttrStats **stats)

...

+static double
+ndistinct_for_combination(double totalrows, int numrows, HeapTuple *rows,
+  int2vector *attrs, VacAttrStats **stats,
+  int k, int *combination)
+{

Again, these deserve comment.

I'll try to look at other patches in the series as time permits.

thanks

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
On 01/03/2017 02:42 PM, Dilip Kumar wrote:

On Tue, Dec 13, 2016 at 3:20 AM, Tomas Vondra
wrote:

attached is v21 of the patch series, rebased to current master (resolving
the duplicate OID and a few trivial merge conflicts), and also fixing some
of the issues you reported.

I wanted to test the grouping estimation behaviour with TPCH, While
testing I found some crash so I thought of reporting it.

My setup detail:
TPCH scale factor : 5
Applied all the patch for 21 series, and ran below queries.

postgres=# analyze part;
ANALYZE
postgres=# CREATE STATISTICS s2  WITH (ndistinct) on (p_brand, p_type,
p_size) from part;
CREATE STATISTICS
postgres=# analyze part;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

I think it should be easily reproducible, in case it's not I can send
call stack or core dump.

Thanks for the report. It was trivial to reproduce and it turned out to
be a fairly simple bug. Will send a new version of the patch soon.

regards

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On Tue, Dec 13, 2016 at 3:20 AM, Tomas Vondra
wrote:
> attached is v21 of the patch series, rebased to current master (resolving
> the duplicate OID and a few trivial merge conflicts), and also fixing some
> of the issues you reported.

I wanted to test the grouping estimation behaviour with TPCH, While
testing I found some crash so I thought of reporting it.

My setup detail:
TPCH scale factor : 5
Applied all the patch for 21 series, and ran below queries.

postgres=# analyze part;
ANALYZE
postgres=# CREATE STATISTICS s2  WITH (ndistinct) on (p_brand, p_type,
p_size) from part;
CREATE STATISTICS
postgres=# analyze part;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

I think it should be easily reproducible, in case it's not I can send
call stack or core dump.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On 12/12/16 22:50, Tomas Vondra wrote:
> On 12/12/2016 12:26 PM, Amit Langote wrote:
>>
>> Hi Tomas,
>>
>> On 2016/10/30 4:23, Tomas Vondra wrote:
>>> Hi,
>>>
>>> Attached is v20 of the multivariate statistics patch series, doing
>>> mostly
>>> the changes outlined in the preceding e-mail from October 11.
>>>
>>> The patch series currently has these parts:
>>>
>>> * 0001 : (FIX) teach pull_varno about RestrictInfo
>>> * 0002 : (PATCH) shared infrastructure and ndistinct coefficients

Hi,

I went over these two (IMHO those could easily be considered as minimal
committable set even if the user visible functionality they provide is
rather limited).

> dropping statistics
> ---
>
> The statistics may be dropped automatically using DROP STATISTICS.
>
> After ALTER TABLE ... DROP COLUMN, statistics referencing are:
>
>   (a) dropped, if the statistics would reference only one column
>
>   (b) retained, but modified on the next ANALYZE

This should be documented in user visible form if you plan to keep it
(it does make sense to me).

> +   correlation between columns is the purpose of multivariate statistics,
> +   and the rest of this section thoroughly explains how the planner
> +   leverages them to improve estimates.
> +
> +
> +
> +   READMEs for each type of statistics, mentioned in the
> following
> +   sections.
> +
> +
> +

I don't think this qualifies as "thoroughly explains" ;)

> +
> +Oid
> +get_statistics_oid(List *names, bool missing_ok)

No comment?

> + case OBJECT_STATISTICS:
> + msg = gettext_noop("statistics \"%s\" does not exist,
> skipping");
> + name = NameListToString(objname);
> + break;

This sounds somewhat weird (plural vs singular).

> + * XXX Maybe this should check for duplicate stats. Although it's not clear
> + * what "duplicate" would mean here (wheter to compare only keys or also
> + * options). Moreover, we don't do such checks for indexes, although those
> + * store tuples and recreating a new index may be a way to fix bloat (which
> + * is a problem statistics don't have).
> + */
> +CreateStatistics(CreateStatsStmt *stmt)

I don't think we should check duplicates TBH so I would remove the XXX
(also "wheter" is typo but if you remove that paragraph it does not matter).

> + if (true)
> + {

Huh?

> +
> +List *
> +RelationGetMVStatList(Relation relation)
> +{
...
> +
> +void
> +update_mv_stats(Oid mvoid, MVNDistinct ndistinct,
> + int2vector *attrs, VacAttrStats **stats)
...
> +static double
> +ndistinct_for_combination(double totalrows, int numrows, HeapTuple *rows,
> +int2vector *attrs, VacAttrStats **stats,
> +int k, int *combination)
> +{

Again, these deserve comment.

I'll try to look at other patches in the series as time permits.

--
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On 12/12/16 22:50, Tomas Vondra wrote:
>> +
>> +  FROM pg_mv_statistic WHERE staname = 's1';
>> +
>> + pg_mv_stats_dependencies_show
>> +---
>> + (1) => 2, (2) => 1
>> +(1 row)
>> +
>>
>> Couldn't this somehow show actual column names, instead of attribute
>> numbers?
>>
>
> Yeah, I was thinking about that too. The trouble is that's table-level
> metadata, so we don't have that kind of info serialized within the data
> type (e.g. because it would not handle column renames etc.).
>
> It might be possible to explicitly pass the table OID as a parameter of
> the function, but it seemed a bit ugly to me.

I think it makes sense to have such function, this is not out function
so I think it's ok for it to have the oid as input, especially since in
the use-case shown above you can use starelid easily.

>
> FWIW, as I wrote in this thread, the place where this patch series needs
> feedback most desperately is integration into the optimizer. Currently
> all the magic happens in clausesel.c and does not leave it.I think it
> would be good to move some of that (particularly the choice of
> statistics to apply) to an earlier stage, and store the information
> within the plan tree itself, so that it's available outside clausesel.c
> (e.g. for EXPLAIN - showing which stats were picked seems useful).
>
> I was thinking it might work similarly to the foreign key estimation
> patch (100340e2). It might even be more efficient, as the current code
> may end repeating the selection of statistics multiple times. But
> enriching the plan tree turned out to be way more invasive than I'm
> comfortable with (but maybe that'd be OK).
>

In theory it seems like possibly reasonable approach to me, mainly
because mv statistics are user defined objects. I guess we'd have to see
at least some PoC to see how invasive it is. But I ultimately think that
feedback from a committer who is more familiar with planner is needed here.

--
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
Hi Tomas,

On 2016/10/30 4:23, Tomas Vondra wrote:
> Hi,
>
> Attached is v20 of the multivariate statistics patch series, doing mostly
> the changes outlined in the preceding e-mail from October 11.
>
> The patch series currently has these parts:
>
> * 0001 : (FIX) teach pull_varno about RestrictInfo
> * 0002 : (PATCH) shared infrastructure and ndistinct coefficients
> * 0003 : (PATCH) functional dependencies (only the ANALYZE part)
> * 0004 : (PATCH) selectivity estimation using functional dependencies
> * 0005 : (PATCH) multivariate MCV lists
> * 0006 : (PATCH) multivariate histograms
> * 0007 : (WIP) selectivity estimation using ndistinct coefficients
> * 0008 : (WIP) use multiple statistics for estimation
> * 0009 : (WIP) psql tab completion basics

Unfortunately, this failed to compile because of the duplicate_oids error.
Partitioning patch consumed same OIDs as used in this patch.

I will try to read the patches in some more detail, but in the meantime,
here are some comments/nitpicks on the documentation:

+
+   The examples presented in  used
+   statistics about individual columns to compute selectivity estimates.
+   When estimating conditions on multiple columns, the planner assumes
+   independence and multiplies the selectivities. When the columns are
+   correlated, the independence assumption is violated, and the estimates
+   may be seriously off, resulting in poor plan choices.
+

The term independence is used in isolation - independence of what?
Independence of the distributions of values in separate columns?  Also,
the phrase "seriously off" could perhaps be replaced by more rigorous
terminology; it might be unclear to some readers.  Perhaps: wildly
inaccurate, :)

+
+EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1;
+   QUERY PLAN
+-
+ Seq Scan on t  (cost=0.00..170.00 rows=100 width=8) (actual
time=0.031..2.870 rows=100 loops=1)
+   Filter: (a = 1)
+   Rows Removed by Filter: 9900
+ Planning time: 0.092 ms
+ Execution time: 3.103 ms

Is there a reason why examples in "67.2. Multivariate Statistics" (like
the one above) use EXPLAIN ANALYZE, whereas those in "67.1. Row Estimation
Examples" (also, other relevant chapters) uses just EXPLAIN.

+   the final 0.01% estimate. The plan however shows that this results in
+   a significant under-estimate, as the actual number of rows matching the

s/under-estimate/underestimate/g

+
+   README for each type of statistics, mentioned in the following
+   sections.
+

Referring to source tree READMEs seems novel around this portion of the
documentation, but I think not too far away, there are some references.
This is under the VII. Internals chapter anyway, so that might be OK.

+used in definitions of database normal forms. When simplified, saying
that
+b is functionally dependent on a means that

Maybe, s/When simplified/In simple terms/g

+In normalized databases, only functional dependencies on primary keys
+and super keys are allowed. In practice however many data sets are not
+fully normalized, for example thanks to intentional denormalization for
+performance reasons. The table t is an example of a data
+with functional dependencies. As a=b for all rows in the
+table, a is functionally dependent on b and
+b is functionally dependent on a.

"super keys" sounds like a new term.

s/for example thanks to/for example, thanks to/g  (or due to instead of
thanks to)

How about: s/an example of a data with/an example of a schema with/g

Perhaps, s/a=b/a = b/g  (additional white space)

+Similarly to per-column statistics, multivariate statistics are stored in

I notice that "similar to" is used more often than "similarly to".  But
that might be OK.

+ This shows that the statistics is defined on table t,

Perhaps: the statistics is -> the statistics are or the statistic is

+ lists attnums of the columns (references
+ pg_attribute).

While this text may be OK on the catalog description page, it might be
better to expand attnums here as "attribute numbers" dropping the
parenthesized phrase altogether.

+
+  FROM pg_mv_statistic WHERE staname = 's1';
+
+ pg_mv_stats_dependencies_show
+---
+ (1) => 2, (2) => 1
+(1 row)
+

Couldn't this somehow show actual column names, instead of attribute numbers?

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
Hi everyone,

thanks for the reviews. Let me sum the feedback so far, and outline my
plans for the next patch version that I'd like to submit for CF 2016-11.

1) syntax changes

I agree with the changes proposed by Dean, although only a subset of the
syntax is going to be supported until we add support for either join or
partial statistics. So something like this:

CREATE STATISTICS name
[ WITH (options) ]
ON (column1, column2 [, ...])
FROM table

That should be a difficult change.

2) catalog names

I'm not sure what are the best names, so I'm fine with using whatever is
the consensus.

That being said, I'm not sure I like extending the catalog to also
support non-multivariate statistics (like for example statistics on
expressions). While that would be a clearly useful feature, it seems
like a slightly different use case and perhaps a separate catalog would
be better. So maybe pg_statistic_ext is not the best name.

3) special data type(s) to store statistics

I agree using an opaque bytea value is not very nice. I see Heikki
proposed using something like pg_node_tree, and maybe storing all the
statistics in a single value.

I assume the pg_node_tree was meant only as an inspiration how to build
pseudo-type on top of a varlena value. I agree that's a good idea, and I
plan to do something like that - say adding pg_mcv, pg_histogram,
pg_ndistinct and pg_dependencies data types.

Heikki also mentioned that maybe JSONB would be a good way to store the
statistics. I don't think so - firstly, it only supports a subset of
data types, so we'd be unable to store statistics for some data types
(or we'd have to store them as text, which sucks). Also, there's a fair
amount of smartness in how the statistics are stored (e.g. how the
histogram bucket boundaries are deduplicated, or how the estimation uses
the serialized representation directly). We'd lose all of that when
using JSONB.

Similarly for storing all the statistics in a single value - I see no
reason why keeping the statistics in separate columns would be a bad
idea (after all, that's kinda the point of relational databases). Also,
there are perfectly valid cases when the caller only needs a particular
type of statistic - e.g. when estimating GROUP BY we'll only need the
ndistinct coefficients. Why should we force the caller to fetch and
detoast everything, and throw away probably 99% of that?

So my plan here is to define pseudo types similar to how pg_node_tree is
defined. That does not seem like a tremendous amount of work.

4) functional dependencies

Several people mentioned they don't like how functional dependencies are
detected at ANALYZE time, particularly that there's a sudden jump
between 0 and 1. Instead, a continuous "dependency degree" between 0 and
1 was proposed.

I'm fine with that, although that makes "clause reduction" (deciding
that we don't need to estimate one of the clauses at all, as it's
implied by some other clause) impossible. But that's fine, the
functional dependencies will still be much less expensive than the other
statistics.

I'm wondering how will this interact with transitivity, though. IIRC the
current implementation is able to detect transitive dependencies and use
that to reduce storage space etc.

In any case, this significantly complicates the functional dependencies,
which were meant as a trivial type of statistics, mostly to establish
the shared infrastructure. Which brings me to ndistinct.

5) ndistinct

So far, the ndistinct coefficients were lumped at the very end of the
patch, and the statistic was only built but not used for any sort of
estimation. I agree with Dean that perhaps it'd be better to move this
to the very beginning, and use it as the simplest statistic to build the
infrastructure instead of functional dependencies (which only gets truer
due to the changes in functional dependencies, discussed in the
preceding section).

I think it's probably a good idea and I plan to do that, so the patch
series will probably look like this:

* 001 - CREATE STATISTICS infrastucture with ndistinct coefficients
* 002 - use ndistinct coefficients to improve GROUP BY estimates
* 003 - use ndistinct coefficients in clausesel.c (not sure)
* 004 - add functional dependencies (build + clausesel.c)
* 005 - add multivariate MCV (build + clausesel.c)
* 006 - add multivariate histograms (build + clausesel.c)

I'm not sure about using the ndistinct coefficients in clausesel.c to
estimate regular conditions - it's the place for which ndistinct
coefficients were originally proposed by Kyotaro-san, but I seem to
remember it was non-trivial to choose the best statistics when there
were other types of stats available. But I'll look into that.

6) combining statistics

I've decided not to re-submit this part of the patch until the basic
functionality gets in. I do think it's a very useful feature (despite
having my doubts about the ```

### Re: [HACKERS] multivariate statistics (v19)

```On 4 October 2016 at 09:15, Heikki Linnakangas  wrote:
> However, for tables and views, each object you store in those views is a
> "table" or "view", but with this thing, the object you store is
> "statistics". Would you have a catalog table called "pg_scissor"?
>

No, probably not (unless it was storing individual scissor blades).

However, in this case, we have related pre-existing catalog tables, so...

> We call the current system table "pg_statistic", though. I agree we should
> call it pg_mv_statistic, in singular, to follow the example of pg_statistic.
>
> Of course, the user-friendly system view on top of that is called
> "pg_stats", just to confuse things more :-).
>

I agree. Given where we are, with a pg_statistic table and a pg_stats
view, I think the least worst solution is to have a pg_statistic_ext
table, and then maybe a pg_stats_ext view.

>> It doesn't seem like the end of the world that it doesn't
>> match the user-facing syntax. A bigger concern is the use of "mv" in
>> the name, because as has already been pointed out, this table may also
>> in the future be used to store univariate expression and partial
>> statistics, so I think we should drop the "mv" and go with something
>> like pg_statistic_ext, or some other more general name.
>
>
> Also, "mv" makes me think of materialized views, which is completely
> unrelated to this.
>

Yeah, I hadn't thought of that.

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
On 04/10/16 20:37, Dean Rasheed wrote:

On 4 October 2016 at 04:25, Michael Paquier  wrote:

OK. A second thing was related to the use of schemas in the new system
catalogs. As mentioned in [1], those could be removed.
[1]:
https://www.postgresql.org/message-id/cab7npqtu40q5_nsghvomjfbyh1hdtqmbfdj+kwfjspam35b...@mail.gmail.com.

That doesn't work, because if the intention is to be able to one day
support statistics across multiple tables, you can't assume that the
statistics are in the same schema as the table.

In fact, if multi-table statistics are to be allowed in the future, I
think you want to move away from thinking of statistics as depending
on and referring to a single table, and handle them more like views --
i.e, store a pg_node_tree representing the from_clause and add
multiple dependencies at statistics creation time. That was what I was
getting at upthread when I suggested the alternate syntax, and also

Of course, if we don't think that we will ever support multi-table
statistics, that all goes away, and you may as well make the
statistics name local to the table, but I think that's a bit limiting.
One way or the other, I think this is a question that needs to be
answered now. My vote is to leave expansion room to support
multi-table statistics in the future.

Regards,
Dean

I can see multi-table statistics being useful if one is trying to
optimise indexes for multiple joins.

Am assuming that the statistics can be accessed by the user as well as
the planner? (I've only lightly followed this thread, so I might have
missed, significant relevant details!)

Cheers,
Gavin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
On 10/04/2016 10:49 AM, Dean Rasheed wrote:

On 30 September 2016 at 12:10, Heikki Linnakangas  wrote:

I fear that using "statistics" as the name of the new object might get a bit
awkward. "statistics" is a plural, but we use it as the name of a single
object, like "pants" or "scissors". Not sure I have any better ideas though.
"estimator"? "statistics collection"? Or perhaps it should be singular,
"statistic". I note that you actually called the system table
"pg_mv_statistic", in singular.

I think it's OK. The functional dependency is a single statistic, but
MCV lists and histograms are multiple statistics (multiple facts about
the data sampled), so in general when you create one of these new
objects, you are creating multiple statistics about the data.

Ok. I don't really have any better ideas, was just hoping that someone
else would.

Also I find "CREATE STATISTIC" just sounds a bit clumsy compared to
"CREATE STATISTICS".

Agreed.

The convention for naming system catalogs seems to be to use the
singular for tables and plural for views, so I guess we should stick
with that.

However, for tables and views, each object you store in those views is a
"table" or "view", but with this thing, the object you store is
"statistics". Would you have a catalog table called "pg_scissor"?

We call the current system table "pg_statistic", though. I agree we
should call it pg_mv_statistic, in singular, to follow the example of
pg_statistic.

Of course, the user-friendly system view on top of that is called
"pg_stats", just to confuse things more :-).

It doesn't seem like the end of the world that it doesn't
match the user-facing syntax. A bigger concern is the use of "mv" in
the name, because as has already been pointed out, this table may also
in the future be used to store univariate expression and partial
statistics, so I think we should drop the "mv" and go with something
like pg_statistic_ext, or some other more general name.

Also, "mv" makes me think of materialized views, which is completely
unrelated to this.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On 30 September 2016 at 12:10, Heikki Linnakangas  wrote:
> I fear that using "statistics" as the name of the new object might get a bit
> awkward. "statistics" is a plural, but we use it as the name of a single
> object, like "pants" or "scissors". Not sure I have any better ideas though.
> "estimator"? "statistics collection"? Or perhaps it should be singular,
> "statistic". I note that you actually called the system table
> "pg_mv_statistic", in singular.
>

I think it's OK. The functional dependency is a single statistic, but
MCV lists and histograms are multiple statistics (multiple facts about
the data sampled), so in general when you create one of these new
objects, you are creating multiple statistics about the data. Also I
find "CREATE STATISTIC" just sounds a bit clumsy compared to "CREATE
STATISTICS".

The convention for naming system catalogs seems to be to use the
singular for tables and plural for views, so I guess we should stick
with that. It doesn't seem like the end of the world that it doesn't
match the user-facing syntax. A bigger concern is the use of "mv" in
the name, because as has already been pointed out, this table may also
in the future be used to store univariate expression and partial
statistics, so I think we should drop the "mv" and go with something
like pg_statistic_ext, or some other more general name.

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On 4 October 2016 at 04:25, Michael Paquier  wrote:
> OK. A second thing was related to the use of schemas in the new system
> catalogs. As mentioned in [1], those could be removed.
> [1]:
> https://www.postgresql.org/message-id/cab7npqtu40q5_nsghvomjfbyh1hdtqmbfdj+kwfjspam35b...@mail.gmail.com.
>

That doesn't work, because if the intention is to be able to one day
support statistics across multiple tables, you can't assume that the
statistics are in the same schema as the table.

In fact, if multi-table statistics are to be allowed in the future, I
think you want to move away from thinking of statistics as depending
on and referring to a single table, and handle them more like views --
i.e, store a pg_node_tree representing the from_clause and add
multiple dependencies at statistics creation time. That was what I was
getting at upthread when I suggested the alternate syntax, and also

Of course, if we don't think that we will ever support multi-table
statistics, that all goes away, and you may as well make the
statistics name local to the table, but I think that's a bit limiting.
One way or the other, I think this is a question that needs to be
answered now. My vote is to leave expansion room to support
multi-table statistics in the future.

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On Mon, Oct 3, 2016 at 8:25 PM, Heikki Linnakangas  wrote:
> Yeah. The idea was to use something like pg_node_tree to store all the
> different kinds of statistics, the histogram, the MCV, and the functional
> dependencies, in one datum. Or JSON, maybe. It sounds better than an opaque
> bytea blob, although I'd prefer something more relational. For the
> functional dependencies, I think we could get away with a simple float
> array, so let's do that in the first cut, and revisit this for the MCV and
> histogram later.

OK. A second thing was related to the use of schemas in the new system
catalogs. As mentioned in [1], those could be removed.
[1]:
https://www.postgresql.org/message-id/cab7npqtu40q5_nsghvomjfbyh1hdtqmbfdj+kwfjspam35b...@mail.gmail.com.

> Separate columns for the functional dependencies, the MCVs,
> and the histogram, probably makes sense anyway.

Probably..
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
On 10/03/2016 04:46 AM, Michael Paquier wrote:

On Fri, Sep 30, 2016 at 8:10 PM, Heikki Linnakangas  wrote:

This patch set is in pretty good shape, the only problem is that it's so big
that no-one seems to have the time or courage to do the final touches and
commit it.

Did you see my suggestions about simplifying its SQL structure? You
could shave some code without impacting the base set of features.

Yeah. The idea was to use something like pg_node_tree to store all the
different kinds of statistics, the histogram, the MCV, and the
functional dependencies, in one datum. Or JSON, maybe. It sounds better
than an opaque bytea blob, although I'd prefer something more
relational. For the functional dependencies, I think we could get away
with a simple float array, so let's do that in the first cut, and
revisit this for the MCV and histogram later. Separate columns for the
functional dependencies, the MCVs, and the histogram, probably makes
sense anyway.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On Fri, Sep 30, 2016 at 8:10 PM, Heikki Linnakangas  wrote:
> This patch set is in pretty good shape, the only problem is that it's so big
> that no-one seems to have the time or courage to do the final touches and
> commit it.

Did you see my suggestions about simplifying its SQL structure? You
could shave some code without impacting the base set of features.

> I fear that using "statistics" as the name of the new object might get a bit
> awkward. "statistics" is a plural, but we use it as the name of a single
> object, like "pants" or "scissors". Not sure I have any better ideas though.
> "estimator"? "statistics collection"? Or perhaps it should be singular,
> "statistic". I note that you actually called the system table
> "pg_mv_statistic", in singular.
>
> I'm not a big fan of storing the stats as just a bytea blob, and having to
> use special functions to interpret it. By looking at the patch, it's not
> clear to me what we actually store for functional dependencies. A list of
> attribute numbers? Could we store them simply as an int[]? (I'm not a big
> fan of the hack in pg_statistic, that allows storing arrays of any data type
> in the same column, though. But for functional dependencies, I don't think
> we need that.)

I am marking this patch as returned with feedback for now.

> Overall, this is going to be a great feature!

+1.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```This patch set is in pretty good shape, the only problem is that it's so
big that no-one seems to have the time or courage to do the final
touches and commit it. If we just focus on the functional dependencies
part for now, I think we might get somewhere. I peeked at the MCV and
histogram patches too, and I think they make total sense as well, and
are a natural extension of the functional dependencies patch. So if we
just focus on that for now, I don't think we will paint ourselves in the
corner.

(more below)

On 09/14/2016 01:01 AM, Tomas Vondra wrote:

On 09/12/2016 04:08 PM, Dean Rasheed wrote:

Regarding the statistics themselves, I read the description of soft
functional dependencies, and I'm somewhat skeptical about that
algorithm. I don't like the arbitrary thresholds or the sudden jump
from independence to dependence and clause reduction. As others have
said, I think this should account for a continuous spectrum of
dependence from fully independent to fully dependent, and combine
clause selectivities in a way based on the degree of dependence. For
example, if you computed an estimate for the fraction 'f' of the
table's rows for which a -> b, then it might be reasonable to combine
the selectivities using

P(a,b) = P(a) * (f + (1-f) * P(b))

Yeah, I agree that the thresholds resulting in sudden changes between
"dependent" and "not dependent" are annoying. The question is whether it
makes sense to fix that, though - the functional dependencies were meant
as the simplest form of statistics, allowing us to get the rest of the
infrastructure in.

I'm OK with replacing the true/false dependencies with a degree of
dependency between 0 and 1, but I'm a bit afraid it'll result in
complaints that the first patch got too large / complicated.

+1 for using a floating degree between 0 and 1, rather than a boolean.

It also contradicts the idea of using functional dependencies as a
low-overhead type of statistics, filtering the list of clauses that need
to be estimated using more expensive types of statistics (MCV lists,
histograms, ...). Switching to a degree of dependency would prevent
removal of "unnecessary" clauses.

That sounds OK to me, although I'm not deeply familiar with this patch yet.

Of course, having just a single number that tells you the columns are
correlated, tells you nothing about whether the clauses on those
columns are consistent with that correlation. For example, in the
following table

CREATE TABLE t(a int, b int);
INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);

'b' is functionally dependent on 'a' (and vice versa), but if you
query the rows with a<50 and with b<50, those clauses behave
essentially independently, because they're not consistent with the
functional dependence between 'a' and 'b', so the best way to combine
their selectivities is just to multiply them, as we currently do.

So whilst it may be interesting to determine that 'b' is functionally
dependent on 'a', it's not obvious whether that fact by itself should
be used in the selectivity estimates. Perhaps it should, on the
grounds that it's best to attempt to use all the available
information, but only if there are no more detailed statistics
available. In any case, knowing that there is a correlation can be
used as an indicator that it may be worthwhile to build more detailed
multivariate statistics like a MCV list or a histogram on those
columns.

Right. IIRC this is actually described in the README as "incompatible
conditions". While implementing it, I concluded that this is OK and it's
up to the developer to decide whether the queries are compatible with
the "assumption of compatibility". But maybe this is reasoning is bogus
and makes (the current implementation of) functional dependencies
unusable in practice.

I think that's OK. It seems like a good assumption that the conditions
are "compatible" with the functional dependency. For two reasons:

1) A query with compatible clauses is much more likely to occur in real
life. Why would you run a query with an incompatible ZIP and city clauses?

2) If the conditions were in fact incompatible, the query is likely to
return 0 rows, and will bail out very quickly, even if the estimates are
way off and you choose a non-optimal plan. There are exceptions, of
course: an index scan might be able to conclude that there are no rows
much quicker than a seqscan, but as a general rule of thumb, a query
that returns 0 rows isn't very sensitive to the chosen plan.

And of course, as long as we're not collecting these statistics
automatically, if it doesn't work for your application, just don't
collect them.

I fear that using "statistics" as the name of the new object might get a
bit awkward. "statistics" is a plural, but we use it as the name of a
single object, like "pants" or "scissors". Not sure I have any better
ideas though. "estimator"? "statistics collection"? Or perhaps it should
be singular, "statistic". I ```

### Re: [HACKERS] multivariate statistics (v19)

```
Hi,

Thanks for looking into this!

On 09/12/2016 04:08 PM, Dean Rasheed wrote:

On 3 August 2016 at 02:58, Tomas Vondra  wrote:

Attached is v19 of the "multivariate stats" patch series

Hi,

I started looking at this - just at a very high level - I've not read
much of the detail yet, but here are some initial review comments.

I think the overall infrastructure approach for CREATE STATISTICS
makes sense, and I agree with other suggestions upthread that it would
be useful to be able to build statistics on arbitrary expressions,
although that doesn't need to be part of this patch, it's useful to
keep that in mind as a possible future extension of this initial
design.

I can imagine it being useful to be able to create user-defined
statistics on an arbitrary list of expressions, and I think that would
include univariate as well as multivariate statistics. Perhaps that's
something to take into account in the naming of things, e.g., as David
Rowley suggested, something like pg_statistic_ext, rather than
pg_mv_statistic.

I also like the idea that this might one day be extended to support
statistics across multiple tables, although I think that might be
challenging to achieve -- you'd need a method of taking a random
sample of rows from a join between 2 or more tables. However, if the
intention is to be able to support that one day, I think that needs to
be accounted for in the syntax now -- specifically, I think it will be
too limiting to only support things extending the current syntax of
the form table1(col1, col2, ...), table2(col1, col2, ...), because
that precludes building statistics on an expression referring to
columns from more than one table. So I think we should plan further
ahead and use a syntax giving greater flexibility in the future, for
example something structured more like a query (like CREATE VIEW):

CREATE STATISTICS name
[ WITH (options) ]
ON expression [, ...]
FROM table [, ...]
WHERE condition

where the first version of the patch would only support expressions
that are simple column references, and would require at least 2 such
columns from a single table with no WHERE clause, i.e.:

CREATE STATISTICS name
[ WITH (options) ]
ON column1, column2 [, ...]
FROM table

For multi-table statistics, a WHERE clause would typically be needed
to specify how the tables are expected to be joined, but potentially
such a clause might also be useful in single-table statistics, to
build partial statistics on a commonly queried subset of the table,
just like a partial index.

Hmm, the "partial statistics" idea seems interesting, It would allow us
to provide additional / more detailed statistics only for a subset of a
table.

I'm however not sure about the join case - how would the syntax work
with outer joins? But as you said, we only need

CREATE STATISTICS name
[ WITH (options) ]
ON (column1, column2 [, ...])
FROM table
WHERE condition

until we add support for join statistics.

Regarding the statistics themselves, I read the description of soft
functional dependencies, and I'm somewhat skeptical about that
algorithm. I don't like the arbitrary thresholds or the sudden jump
from independence to dependence and clause reduction. As others have
said, I think this should account for a continuous spectrum of
dependence from fully independent to fully dependent, and combine
clause selectivities in a way based on the degree of dependence. For
example, if you computed an estimate for the fraction 'f' of the
table's rows for which a -> b, then it might be reasonable to combine
the selectivities using

P(a,b) = P(a) * (f + (1-f) * P(b))

Yeah, I agree that the thresholds resulting in sudden changes between
"dependent" and "not dependent" are annoying. The question is whether it
makes sense to fix that, though - the functional dependencies were meant
as the simplest form of statistics, allowing us to get the rest of the
infrastructure in.

I'm OK with replacing the true/false dependencies with a degree of
dependency between 0 and 1, but I'm a bit afraid it'll result in
complaints that the first patch got too large / complicated.

It also contradicts the idea of using functional dependencies as a
low-overhead type of statistics, filtering the list of clauses that need
to be estimated using more expensive types of statistics (MCV lists,
histograms, ...). Switching to a degree of dependency would prevent
removal of "unnecessary" clauses.

Of course, having just a single number that tells you the columns are
correlated, tells you nothing about whether the clauses on those
columns are consistent with that correlation. For example, in the
following table

CREATE TABLE t(a int, b int);
INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);

'b' is functionally dependent on 'a' (and vice versa), but if you
query the rows with a<50 and with b<50, those clauses behave
essentially independently, because they're not ```

### Re: [HACKERS] multivariate statistics (v19)

```On 3 August 2016 at 02:58, Tomas Vondra  wrote:
> Attached is v19 of the "multivariate stats" patch series

Hi,

I started looking at this - just at a very high level - I've not read
much of the detail yet, but here are some initial review comments.

I think the overall infrastructure approach for CREATE STATISTICS
makes sense, and I agree with other suggestions upthread that it would
be useful to be able to build statistics on arbitrary expressions,
although that doesn't need to be part of this patch, it's useful to
keep that in mind as a possible future extension of this initial
design.

I can imagine it being useful to be able to create user-defined
statistics on an arbitrary list of expressions, and I think that would
include univariate as well as multivariate statistics. Perhaps that's
something to take into account in the naming of things, e.g., as David
Rowley suggested, something like pg_statistic_ext, rather than
pg_mv_statistic.

I also like the idea that this might one day be extended to support
statistics across multiple tables, although I think that might be
challenging to achieve -- you'd need a method of taking a random
sample of rows from a join between 2 or more tables. However, if the
intention is to be able to support that one day, I think that needs to
be accounted for in the syntax now -- specifically, I think it will be
too limiting to only support things extending the current syntax of
the form table1(col1, col2, ...), table2(col1, col2, ...), because
that precludes building statistics on an expression referring to
columns from more than one table. So I think we should plan further
ahead and use a syntax giving greater flexibility in the future, for
example something structured more like a query (like CREATE VIEW):

CREATE STATISTICS name
[ WITH (options) ]
ON expression [, ...]
FROM table [, ...]
WHERE condition

where the first version of the patch would only support expressions
that are simple column references, and would require at least 2 such
columns from a single table with no WHERE clause, i.e.:

CREATE STATISTICS name
[ WITH (options) ]
ON column1, column2 [, ...]
FROM table

For multi-table statistics, a WHERE clause would typically be needed
to specify how the tables are expected to be joined, but potentially
such a clause might also be useful in single-table statistics, to
build partial statistics on a commonly queried subset of the table,
just like a partial index.

Of course, I'm not suggesting that the current patch do any of that --
it's big enough as it is. I'm just throwing out possible future
directions this might go in, so that we don't get painted into a
corner when designing the syntax for the current patch.

Regarding the statistics themselves, I read the description of soft
functional dependencies, and I'm somewhat skeptical about that
algorithm. I don't like the arbitrary thresholds or the sudden jump
from independence to dependence and clause reduction. As others have
said, I think this should account for a continuous spectrum of
dependence from fully independent to fully dependent, and combine
clause selectivities in a way based on the degree of dependence. For
example, if you computed an estimate for the fraction 'f' of the
table's rows for which a -> b, then it might be reasonable to combine
the selectivities using

P(a,b) = P(a) * (f + (1-f) * P(b))

Of course, having just a single number that tells you the columns are
correlated, tells you nothing about whether the clauses on those
columns are consistent with that correlation. For example, in the
following table

CREATE TABLE t(a int, b int);
INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);

'b' is functionally dependent on 'a' (and vice versa), but if you
query the rows with a<50 and with b<50, those clauses behave
essentially independently, because they're not consistent with the
functional dependence between 'a' and 'b', so the best way to combine
their selectivities is just to multiply them, as we currently do.

So whilst it may be interesting to determine that 'b' is functionally
dependent on 'a', it's not obvious whether that fact by itself should
be used in the selectivity estimates. Perhaps it should, on the
grounds that it's best to attempt to use all the available
information, but only if there are no more detailed statistics
available. In any case, knowing that there is a correlation can be
used as an indicator that it may be worthwhile to build more detailed
multivariate statistics like a MCV list or a histogram on those
columns.

Looking at the ndistinct coefficient 'q', I think it would be better
if the recorded statistic were just the estimate for
ndistinct(a,b,...) rather than a ratio of ndistinct values. That's a
more fundamental statistic, and it's easier to document and easier to
interpret. Also, I don't believe that the coefficient 'q' is the right
number to use for clause estimation:

Looking at ```

### Re: [HACKERS] multivariate statistics (v19)

```On Wed, Aug 24, 2016 at 2:03 AM, Robert Haas  wrote:
> ISTR that you were going to try to look at this patch set.  It seems
> from the discussion that it's not really ready for serious
> consideration for commit yet, but also that some high-level design
> comments from you at this stage could go a long way toward making sure
> that the final form of the patch is something that will be acceptable.
>
> I'd really like to see us get some kind of capability along these
> lines, but I'm sure it will go a lot better if you or Dean handle it
> than if I try to do it ... not to mention that there are only so many
> hours in the day.

Agreed. What I have been able to look until now was the high-level
structure of the patch, and I think that we should really shave 0002
and simplify it to get a core infrastructure in place, but the core
patch is at another level, and it would be good to get some feedback
regarding the structure of the patch and if it is moving in the good
direction is good or not.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On Tue, Aug 2, 2016 at 9:58 PM, Tomas Vondra
wrote:
> Attached is v19 of the "multivariate stats" patch series - essentially v18
> rebased on top of current master.

Tom:

ISTR that you were going to try to look at this patch set.  It seems
from the discussion that it's not really ready for serious
consideration for commit yet, but also that some high-level design
comments from you at this stage could go a long way toward making sure
that the final form of the patch is something that will be acceptable.

I'd really like to see us get some kind of capability along these
lines, but I'm sure it will go a lot better if you or Dean handle it
than if I try to do it ... not to mention that there are only so many
hours in the day.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
On 08/10/2016 06:41 AM, Michael Paquier wrote:

On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
wrote:

1) enriching the query tree with multivariate statistics info

Right now all the stuff related to multivariate statistics estimation
happens in clausesel.c - matching condition to statistics, selection of
statistics to use (if there are multiple usable stats), etc. So pretty much
all this info is internal to clausesel.c and does not get outside.

This does not seem bad to me as first sight but...

I'm starting to think that some of the steps (matching quals to stats,
selection of stats) should happen in a "preprocess" step before the actual
estimation, storing the information (which stats to use, etc.) in a new type
of node in the query tree - something like RestrictInfo.

I believe this needs to happen sometime after deconstruct_jointree() as that
builds RestrictInfos nodes, and looking at planmain.c, right after
extract_restriction_or_clauses seems about right. Haven't tried, though.

This would move all the "statistics selection" logic from clausesel.c,
separating it from the "actual estimation" and simplifying the code.

But more importantly, I think we'll need to show some of the data in EXPLAIN
output. With per-column statistics it's fairly straightforward to determine
which statistics are used and how. But with multivariate stats things are
often more complicated - there may be multiple candidate statistics (e.g.
histograms covering different subsets of the conditions), it's possible to
apply them in different orders, etc.

But EXPLAIN can't show the info if it's ephemeral and available only within
clausesel.c (and thrown away after the estimation).

This gives a good reason to not do that in clauserel.c, it would be
really cool to be able to get some information regarding the stats
used with a simple EXPLAIN.

in practice. It essentially means doing something like

rel->baserestrictinfo = enrichWithStatistics(rel->baserestrictinfo);

for each table (and in the future maybe also for joins etc.) But as the
name suggests the list should only include RestrictInfo nodes, which

For example with conditions

WHERE (a=1) AND (b=2) AND (c=3)

the list will contain 3 RestrictInfos. But if there's a statistics on
(a,b,c), we need to note that somehow - my plan was to inject a node
storing this information, something like (a bit simplified):

StatisticsInfo {
Oid statisticsoid; /* OID of the statistics */
List *mvconditions; /* estimate using the statistics */
List *otherconditions; /* estimate the old way */
}

But that'd clearly violate the assumption that baserestrictinfo only
contains RestrictInfo. I don't think it's feasible (or desirable) to
rework all the places to expect both RestrictInfo and the new node.

I can think of two alternatives:

1) keep the transformed list as separate list, next to baserestrictinfo

This obviously fixes the issue, as each caller can decide which node it
wants. But it also means we need to maintain two lists instead of one,
and keep them synchronized.

2) embed the information into the existing tree

It might be possible to store the information in existing nodes, i.e.
each node would track whether it's estimated the "old way" or using
multivariate statistics (and which one). But it would require changing
many of the existing nodes (at least those compatible with multivariate
statistics: currently OpExpr, NullTest, ...).

And it also seems fairly difficult to reconstruct the information during
the estimation, as it'd be necessary to look for other nodes to be
estimated by the same statistics. Which seems to defeat the idea of
preprocessing to some degree.

So I'm not sure what's the best solution. I'm leaning to (1), i.e.
keeping a separate list, but I'd welcome other ideas.

regards

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On Thu, Aug 11, 2016 at 3:34 AM, Tomas Vondra
wrote:
> On 08/10/2016 02:23 PM, Michael Paquier wrote:
>>
>> On Wed, Aug 10, 2016 at 8:33 PM, Tomas Vondra
>>  wrote:
>>> The idea is that the syntax should work even for statistics built on
>>> multiple tables, e.g. to provide better statistics for joins. That's why
>>> the
>>> schema may be specified (as each table might be in different schema), and
>>> so
>>> on.
>>
>>
>> So you mean that the same statistics could be shared between tables?
>> But as this is visibly not a concept introduced yet in this set of
>> patches, why not just cut it off for now to simplify the whole? If
>> there is no schema-related field in pg_mv_statistics we could still
>> add it later if it proves to be useful.
>>
>
> Yes, I think creating statistics on multiple tables is one of the possible
> future directions. One of the previous patch versions introduced ALTER TABLE
> ... ADD STATISTICS syntax, but that ran into issues in gram.y, and given the
> multi-table possibilities the CREATE STATISTICS seems like a much better
> idea anyway.
>
> But I guess you're right we may make this a bit more strict now, and relax
> it in the future if needed. For example as we only support single-table
> statistics at this point, we may remove the schema and always create the
> statistics in the schema of the table.

This would simplify the code the code a bit so I'd suggest removing
that from the first shot. If there is demand for it, keeping the
infrastructure open for this extension is what we had better do.

> But I don't think we should make the statistics names unique only within a
> table (instead of within the schema).

They could be made unique using (name, table_oid, column_list).

There is a lot of mumbo-jumbo regarding the way dependencies are
stored with mainly serialize_mv_dependencies and
deserialize_mv_dependencies that operates them from bytea/dep trees.
That's not cool and not portable because pg_mv_statistic represents
that as pure bytea. I would suggest creating a generic data type that
does those operations, named like pg_dependency_tree and then use that
in those new catalogs. pg_node_tree is a precedent of such a thing.
New features could as well make use of this new data type of we are
able to design that in a way generic enough, so that would be a base
patch that the current 0002 applies on top of.
>>>
>>>
>>>
>>> Interesting idea, haven't thought about that. So are you suggesting to
>>> data type for each statistics type (dependencies, MCV, histogram, ...)?
>>
>>
>> Yes that would be something like that, it would be actually perhaps
>> better to have one single data type, and be able to switch between
>> each model easily instead of putting byteas in the catalog.
>
> Hmmm, not sure about that. For example what about combinations of statistics
> - e.g. when we have MCV list on the most common values and a histogram on
> the rest? Should we store both as a single value, or would that be in two
> separate values, or what?

The same statistics can combine two different things, using different
columns may depend on how readable things get...
Btw, for the format we could get inspired from pg_node_tree, with pg_stat_tree:
{HISTOGRAM :arg {BUCKET :index 0 :minvals ... }}
{DEPENDENCY :arg {:elt "a => c" ...} ... }
{MVC :arg {:index 0 :values {0,0} ... } ... }
Please consider that as a tentative idea to make things more friendly.
Others may have a different opinion on the matter.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
On 08/10/2016 02:23 PM, Michael Paquier wrote:

On Wed, Aug 10, 2016 at 8:33 PM, Tomas Vondra
wrote:

On 08/10/2016 06:41 AM, Michael Paquier wrote:

Patch 0001: there have been comments about that before, and you have
put the checks on RestrictInfo in a couple of variables of
pull_varnos_walker, so nothing to say from here.

I don't follow. Are you suggesting 0001 is a reasonable fix, or that there's
a proposed solution?

I think that's reasonable.

Well, to me the 0001 feels more like a temporary workaround rather than
a proper solution. I just don't know how to deal with it so I've kept it
for now. Pretty sure there will be complaints that adding RestrictInfo
to the expression walkers is not a nice idea.

>> ...

The idea is that the syntax should work even for statistics built on
multiple tables, e.g. to provide better statistics for joins. That's why the
schema may be specified (as each table might be in different schema), and so
on.

So you mean that the same statistics could be shared between tables?
But as this is visibly not a concept introduced yet in this set of
patches, why not just cut it off for now to simplify the whole? If
there is no schema-related field in pg_mv_statistics we could still
add it later if it proves to be useful.

Yes, I think creating statistics on multiple tables is one of the
possible future directions. One of the previous patch versions
introduced ALTER TABLE ... ADD STATISTICS syntax, but that ran into
issues in gram.y, and given the multi-table possibilities the CREATE
STATISTICS seems like a much better idea anyway.

But I guess you're right we may make this a bit more strict now, and
relax it in the future if needed. For example as we only support
single-table statistics at this point, we may remove the schema and
always create the statistics in the schema of the table.

But I don't think we should make the statistics names unique only within
a table (instead of within the schema).

The difference between those two cases is that if we allow multi-table
statistics in the future, we can simply allow specifying the schema and
everything will work just fine. But it'd break the second case, as it
might result in conflicts in existing schemas.

I do realize this might be seen as a case of "future proofing" based on
dubious predictions of how something might work, but OTOH this (schema
inherited from table, unique within a schema) is pretty consistent with
how this work for indexes.

+
/*
Spurious noise in the patch.

+   /* check that at least some statistics were requested */
+   if (!build_dependencies)
+   ereport(ERROR,
+   (errcode(ERRCODE_SYNTAX_ERROR),
+errmsg("no statistics type (dependencies) was
requested")));
So, WITH (dependencies) is mandatory in any case. Why not just
dropping it from the first cut then?

Because the follow-up patches extend this to require at least one statistics
type. So in 0004 it becomes

if (!(build_dependencies || build_mcv))

and in 0005 it's

if (!(build_dependencies || build_mcv || build_histogram))

We might drop it from 0002 (and assume build_dependencies=true), and then
add the check in 0004. But it seems a bit pointless.

This is a complicated set of patches. I'd think that we should try to
simplify things as much as possible first, and the WITH clause is not
mandatory to have as of 0002.

OK, I can remove the WITH from the 0002 part. Not a big deal.

Statistics definition reorder the columns by itself depending on their
order. For example:
create table aa (a int, b int);
create statistics aas on aa(b, a) with (dependencies);
\d aa
"public.aas" (dependencies) ON (a, b)
As this defines a correlation between multiple columns, isn't it wrong
to assume that (b, a) and (a, b) are always the same correlation? I
don't recall such properties as being always commutative (old
is caused by the implementation limitations that only limit the
analysis between interactions of two columns. Still it seems incorrect
to reorder the user-visible portion.

I don't follow. If you talk about Pearson's correlation, that clearly does
not depend on the order of columns - it's perfectly independent of that. If
dependence between columns), that might depend - but I don't remember a
single piece of the patch where this might be a problem.

Yes, based on what is done in the patch that may not be a problem, but
I am wondering if this is not restricting things too much.

Let's keep the code as it is. If we run into this issue in the future,
we can easily relax this - there's nothing depending on the ordering of
attnums, IIRC.

Also, which README states that we can only analyze interactions between two
columns? That's pretty clearly not the case - the patch should handle
dependencies between more columns ```

### Re: [HACKERS] multivariate statistics (v19)

```
On 08/10/2016 02:24 PM, Michael Paquier wrote:

On Wed, Aug 10, 2016 at 8:50 PM, Petr Jelinek  wrote:

On 10/08/16 13:33, Tomas Vondra wrote:

On 08/10/2016 06:41 AM, Michael Paquier wrote:

On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra

2) combining multiple statistics

I think the ability to combine multivariate statistics (covering
different
subsets of conditions) is important and useful, but I'm starting to
think
that the current implementation may not be the correct one (which is
why I

Assume there's a table "t" with 3 columns (a, b, c), and that we're
estimating query:

SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3

but that we only have two statistics (a,b) and (b,c). The current
patch does

P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)

i.e. it estimates the first two conditions using (a,b), and then
estimates
(c=3) using (b,c) with "b=2" as a condition. Now, this is very
efficient,
but it only works as long as the query contains conditions
"connecting" the
two statistics. So if we remove the "b=2" condition from the query, this
stops working.

This is trying to make the algorithm smarter than the user, which is
something I'd think we could live without. In this case statistics on
(a,c) or (a,b,c) are missing. And what if the user does not want to
make use of stats for (a,c) because he only defined (a,b) and (b,c)?

I don't think so. Obviously, if you have statistics covering all the
conditions - great, we can't really do better than that.

But there's a crucial relation between the number of dimensions of the
statistics and accuracy of the statistics. Let's say you have statistics
on 8 columns, and you split each dimension twice to build a histogram -
that's 256 buckets right there, and we only get ~50% selectivity in each
dimension (the actual histogram building algorithm is more complex, but
you get the idea).

I think it makes sense to pursue this, but I also think we can easily live
with not having it in the first version that gets committed and doing it as
follow-up patch.

This patch is large and complicated enough. As this is not a mandatory
piece to get a basic support, I'd suggest just to drop that for later.

Which is why combining multiple statistics is in part 0006 and all the
previous parts simply choose the single "best" statistics ;-)

I'm perfectly fine with committing just the first few parts, and leaving
0006 for the next major version.

regards

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
On 08/10/2016 03:29 PM, Ants Aasma wrote:

On Wed, Aug 3, 2016 at 4:58 AM, Tomas Vondra
wrote:

2) combining multiple statistics

I think the ability to combine multivariate statistics (covering different
subsets of conditions) is important and useful, but I'm starting to think
that the current implementation may not be the correct one (which is why I

While researching this topic a few years ago I came across a paper on
this exact topic called "Consistently Estimating the Selectivity of
Conjuncts of Predicates" [1]. While effective it seems to be quite
heavy-weight, so would probably need support for tiered optimization.

[1]
https://courses.cs.washington.edu/courses/cse544/11wi/papers/markl-vldb-2005.pdf

I think I've read that paper some time ago, and IIRC it's solving the
same problem but in a very different way - instead of combining the
statistics directly, it relies on the "partial" selectivities and then
estimates the total selectivity using the maximum-entropy principle.

I think it's a nice idea and it probably works fine in many cases, but
it kinda throws away part of the information (that we could get by
matching the statistics against each other directly). But I'll keep that
paper in mind, and we can revisit this solution later.

regards

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On Wed, Aug 3, 2016 at 4:58 AM, Tomas Vondra
wrote:
> 2) combining multiple statistics
>
> I think the ability to combine multivariate statistics (covering different
> subsets of conditions) is important and useful, but I'm starting to think
> that the current implementation may not be the correct one (which is why I

While researching this topic a few years ago I came across a paper on
this exact topic called "Consistently Estimating the Selectivity of
Conjuncts of Predicates" [1]. While effective it seems to be quite
heavy-weight, so would probably need support for tiered optimization.

[1]
https://courses.cs.washington.edu/courses/cse544/11wi/papers/markl-vldb-2005.pdf

Regards,
Ants Aasma

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On Wed, Aug 10, 2016 at 8:50 PM, Petr Jelinek  wrote:
> On 10/08/16 13:33, Tomas Vondra wrote:
>>
>> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>>>
>>> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra

2) combining multiple statistics

I think the ability to combine multivariate statistics (covering
different
subsets of conditions) is important and useful, but I'm starting to
think
that the current implementation may not be the correct one (which is
why I

Assume there's a table "t" with 3 columns (a, b, c), and that we're
estimating query:

SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3

but that we only have two statistics (a,b) and (b,c). The current
patch does

P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)

i.e. it estimates the first two conditions using (a,b), and then
estimates
(c=3) using (b,c) with "b=2" as a condition. Now, this is very
efficient,
but it only works as long as the query contains conditions
"connecting" the
two statistics. So if we remove the "b=2" condition from the query, this
stops working.
>>>
>>>
>>> This is trying to make the algorithm smarter than the user, which is
>>> something I'd think we could live without. In this case statistics on
>>> (a,c) or (a,b,c) are missing. And what if the user does not want to
>>> make use of stats for (a,c) because he only defined (a,b) and (b,c)?
>>>
>>
>> I don't think so. Obviously, if you have statistics covering all the
>> conditions - great, we can't really do better than that.
>>
>> But there's a crucial relation between the number of dimensions of the
>> statistics and accuracy of the statistics. Let's say you have statistics
>> on 8 columns, and you split each dimension twice to build a histogram -
>> that's 256 buckets right there, and we only get ~50% selectivity in each
>> dimension (the actual histogram building algorithm is more complex, but
>> you get the idea).
>
> I think it makes sense to pursue this, but I also think we can easily live
> with not having it in the first version that gets committed and doing it as
> follow-up patch.

This patch is large and complicated enough. As this is not a mandatory
piece to get a basic support, I'd suggest just to drop that for later.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On Wed, Aug 10, 2016 at 8:33 PM, Tomas Vondra
wrote:
> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>> Patch 0001: there have been comments about that before, and you have
>> put the checks on RestrictInfo in a couple of variables of
>> pull_varnos_walker, so nothing to say from here.
>>
>
> I don't follow. Are you suggesting 0001 is a reasonable fix, or that there's
> a proposed solution?

I think that's reasonable.

>> Patch 0002:
>> +
>> +   CREATE STATISTICS will create a new multivariate
>> +   statistics on the table. The statistics will be created in the in the
>> +   current database. The statistics will be owned by the user issuing
>> +   the command.
>> +
>> s/in the/in the/.
>>
>> +
>> +   Create table t1 with two functionally dependent
>> columns, i.e.
>> +   knowledge of a value in the first column is sufficient for detemining
>> the
>> +   value in the other column. Then functional dependencies are built on
>> those
>> +   columns:
>> s/detemining/determining/
>>
>> +
>> +   If a schema name is given (for example, CREATE STATISTICS
>> +   myschema.mystat ...) then the statistics is created in the
>> specified
>> +   schema.  Otherwise it is created in the current schema.  The name of
>> +   the table must be distinct from the name of any other statistics in
>> the
>> +   same schema.
>> +
>> I would just assume that a statistics is located on the schema of the
>> relation it depends on. So the thing that may be better to do is just:
>> - Register the OID of the table a statistics depends on but not the
>> schema.
>> - Give up on those query extensions related to the schema.
>> - Allow the same statistics name to be used for multiple tables.
>> - Just fail if a statistics name is being reused on the table again.
>> It may be better to complain about that even if the column list is
>> different.
>> - Register the dependency between the statistics and the table.
>
> The idea is that the syntax should work even for statistics built on
> multiple tables, e.g. to provide better statistics for joins. That's why the
> schema may be specified (as each table might be in different schema), and so
> on.

So you mean that the same statistics could be shared between tables?
But as this is visibly not a concept introduced yet in this set of
patches, why not just cut it off for now to simplify the whole? If
there is no schema-related field in pg_mv_statistics we could still
add it later if it proves to be useful.

>> +
>>  /*
>> Spurious noise in the patch.
>>
>> +   /* check that at least some statistics were requested */
>> +   if (!build_dependencies)
>> +   ereport(ERROR,
>> +   (errcode(ERRCODE_SYNTAX_ERROR),
>> +errmsg("no statistics type (dependencies) was
>> requested")));
>> So, WITH (dependencies) is mandatory in any case. Why not just
>> dropping it from the first cut then?
>
>
> Because the follow-up patches extend this to require at least one statistics
> type. So in 0004 it becomes
>
> if (!(build_dependencies || build_mcv))
>
> and in 0005 it's
>
> if (!(build_dependencies || build_mcv || build_histogram))
>
> We might drop it from 0002 (and assume build_dependencies=true), and then
> add the check in 0004. But it seems a bit pointless.

This is a complicated set of patches. I'd think that we should try to
simplify things as much as possible first, and the WITH clause is not
mandatory to have as of 0002.

>> Statistics definition reorder the columns by itself depending on their
>> order. For example:
>> create table aa (a int, b int);
>> create statistics aas on aa(b, a) with (dependencies);
>> \d aa
>> "public.aas" (dependencies) ON (a, b)
>> As this defines a correlation between multiple columns, isn't it wrong
>> to assume that (b, a) and (a, b) are always the same correlation? I
>> don't recall such properties as being always commutative (old
>> memories, I suck at stats in general). [...reading README...] So this
>> is caused by the implementation limitations that only limit the
>> analysis between interactions of two columns. Still it seems incorrect
>> to reorder the user-visible portion.
>
> I don't follow. If you talk about Pearson's correlation, that clearly does
> not depend on the order of columns - it's perfectly independent of that. If
> you talk about about correlation in the wider sense (i.e. arbitrary
> dependence between columns), that might depend - but I don't remember a
> single piece of the patch where this might be a problem.

Yes, based on what is done in the patch that may not be a problem, but
I am wondering if this is not restricting things too much.

> Also, which README states that we can only analyze interactions between two
> columns? That's pretty clearly not the case - the patch should handle
> dependencies between more columns without any problems.

I have noticed that the patch evaluates all the set of permutations
possible using a column list, it seems to me though that ```

### Re: [HACKERS] multivariate statistics (v19)

```
On 10/08/16 13:33, Tomas Vondra wrote:

On 08/10/2016 06:41 AM, Michael Paquier wrote:

On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra

2) combining multiple statistics

I think the ability to combine multivariate statistics (covering
different
subsets of conditions) is important and useful, but I'm starting to
think
that the current implementation may not be the correct one (which is
why I

Assume there's a table "t" with 3 columns (a, b, c), and that we're
estimating query:

SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3

but that we only have two statistics (a,b) and (b,c). The current
patch does

P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)

i.e. it estimates the first two conditions using (a,b), and then
estimates
(c=3) using (b,c) with "b=2" as a condition. Now, this is very
efficient,
but it only works as long as the query contains conditions
"connecting" the
two statistics. So if we remove the "b=2" condition from the query, this
stops working.

This is trying to make the algorithm smarter than the user, which is
something I'd think we could live without. In this case statistics on
(a,c) or (a,b,c) are missing. And what if the user does not want to
make use of stats for (a,c) because he only defined (a,b) and (b,c)?

I don't think so. Obviously, if you have statistics covering all the
conditions - great, we can't really do better than that.

But there's a crucial relation between the number of dimensions of the
statistics and accuracy of the statistics. Let's say you have statistics
on 8 columns, and you split each dimension twice to build a histogram -
that's 256 buckets right there, and we only get ~50% selectivity in each
dimension (the actual histogram building algorithm is more complex, but
you get the idea).

I think it makes sense to pursue this, but I also think we can easily
live with not having it in the first version that gets committed and
doing it as follow-up patch.

--
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
On 08/10/2016 06:41 AM, Michael Paquier wrote:

On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
wrote:

...

But more importantly, I think we'll need to show some of the data in EXPLAIN
output. With per-column statistics it's fairly straightforward to determine
which statistics are used and how. But with multivariate stats things are
often more complicated - there may be multiple candidate statistics (e.g.
histograms covering different subsets of the conditions), it's possible to
apply them in different orders, etc.

But EXPLAIN can't show the info if it's ephemeral and available only within
clausesel.c (and thrown away after the estimation).

This gives a good reason to not do that in clauserel.c, it would be
really cool to be able to get some information regarding the stats
used with a simple EXPLAIN.

I think there are two separate questions:

(a) Whether the query plan is "enriched" with information about
statistics, or whether this information is ephemeral and available only
in clausesel.c.

(b) Where exactly this enrichment happens.

Theoretically we might enrich the query plan (add nodes with info about
the statistics), so that EXPLAIN gets the info, and it might still
happen in clausesel.c.

2) combining multiple statistics

I think the ability to combine multivariate statistics (covering different
subsets of conditions) is important and useful, but I'm starting to think
that the current implementation may not be the correct one (which is why I

Assume there's a table "t" with 3 columns (a, b, c), and that we're
estimating query:

SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3

but that we only have two statistics (a,b) and (b,c). The current patch does

P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)

i.e. it estimates the first two conditions using (a,b), and then estimates
(c=3) using (b,c) with "b=2" as a condition. Now, this is very efficient,
but it only works as long as the query contains conditions "connecting" the
two statistics. So if we remove the "b=2" condition from the query, this
stops working.

This is trying to make the algorithm smarter than the user, which is
something I'd think we could live without. In this case statistics on
(a,c) or (a,b,c) are missing. And what if the user does not want to
make use of stats for (a,c) because he only defined (a,b) and (b,c)?

I don't think so. Obviously, if you have statistics covering all the
conditions - great, we can't really do better than that.

But there's a crucial relation between the number of dimensions of the
statistics and accuracy of the statistics. Let's say you have statistics
on 8 columns, and you split each dimension twice to build a histogram -
that's 256 buckets right there, and we only get ~50% selectivity in each
dimension (the actual histogram building algorithm is more complex, but
you get the idea).

I see this as probably the most interesting part of the patch, and quite
useful. But we'll definitely get the single-statistics estimate first,

Patch 0001: there have been comments about that before, and you have
put the checks on RestrictInfo in a couple of variables of
pull_varnos_walker, so nothing to say from here.

I don't follow. Are you suggesting 0001 is a reasonable fix, or that
there's a proposed solution?

Patch 0002:
+
+   CREATE STATISTICS will create a new multivariate
+   statistics on the table. The statistics will be created in the in the
+   current database. The statistics will be owned by the user issuing
+   the command.
+
s/in the/in the/.

+
+   Create table t1 with two functionally dependent columns, i.e.
+   knowledge of a value in the first column is sufficient for detemining the
+   value in the other column. Then functional dependencies are built on those
+   columns:
s/detemining/determining/

+
+   If a schema name is given (for example, CREATE STATISTICS
+   myschema.mystat ...) then the statistics is created in the specified
+   schema.  Otherwise it is created in the current schema.  The name of
+   the table must be distinct from the name of any other statistics in the
+   same schema.
+
I would just assume that a statistics is located on the schema of the
relation it depends on. So the thing that may be better to do is just:
- Register the OID of the table a statistics depends on but not the schema.
- Give up on those query extensions related to the schema.
- Allow the same statistics name to be used for multiple tables.
- Just fail if a statistics name is being reused on the table again.
It may be better to complain about that even if the column list is
different.
- Register the dependency between the statistics and the table.

The idea is that the syntax should work even for statistics built on
multiple tables, e.g. to provide better statistics for joins. That's why
the schema may be specified (as each table ```

### Re: [HACKERS] multivariate statistics (v19)

```On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
wrote:
> 1) enriching the query tree with multivariate statistics info
>
> Right now all the stuff related to multivariate statistics estimation
> happens in clausesel.c - matching condition to statistics, selection of
> statistics to use (if there are multiple usable stats), etc. So pretty much
> all this info is internal to clausesel.c and does not get outside.

This does not seem bad to me as first sight but...

> I'm starting to think that some of the steps (matching quals to stats,
> selection of stats) should happen in a "preprocess" step before the actual
> estimation, storing the information (which stats to use, etc.) in a new type
> of node in the query tree - something like RestrictInfo.
>
> I believe this needs to happen sometime after deconstruct_jointree() as that
> builds RestrictInfos nodes, and looking at planmain.c, right after
> extract_restriction_or_clauses seems about right. Haven't tried, though.
>
> This would move all the "statistics selection" logic from clausesel.c,
> separating it from the "actual estimation" and simplifying the code.
>
> But more importantly, I think we'll need to show some of the data in EXPLAIN
> output. With per-column statistics it's fairly straightforward to determine
> which statistics are used and how. But with multivariate stats things are
> often more complicated - there may be multiple candidate statistics (e.g.
> histograms covering different subsets of the conditions), it's possible to
> apply them in different orders, etc.
>
> But EXPLAIN can't show the info if it's ephemeral and available only within
> clausesel.c (and thrown away after the estimation).

This gives a good reason to not do that in clauserel.c, it would be
really cool to be able to get some information regarding the stats
used with a simple EXPLAIN.

> 2) combining multiple statistics
>
> I think the ability to combine multivariate statistics (covering different
> subsets of conditions) is important and useful, but I'm starting to think
> that the current implementation may not be the correct one (which is why I
>
> Assume there's a table "t" with 3 columns (a, b, c), and that we're
> estimating query:
>
>SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>
> but that we only have two statistics (a,b) and (b,c). The current patch does
>
>P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>
> i.e. it estimates the first two conditions using (a,b), and then estimates
> (c=3) using (b,c) with "b=2" as a condition. Now, this is very efficient,
> but it only works as long as the query contains conditions "connecting" the
> two statistics. So if we remove the "b=2" condition from the query, this
> stops working.

This is trying to make the algorithm smarter than the user, which is
something I'd think we could live without. In this case statistics on
(a,c) or (a,b,c) are missing. And what if the user does not want to
make use of stats for (a,c) because he only defined (a,b) and (b,c)?

Patch 0001: there have been comments about that before, and you have
put the checks on RestrictInfo in a couple of variables of
pull_varnos_walker, so nothing to say from here.

Patch 0002:
+
+   CREATE STATISTICS will create a new multivariate
+   statistics on the table. The statistics will be created in the in the
+   current database. The statistics will be owned by the user issuing
+   the command.
+
s/in the/in the/.

+
+   Create table t1 with two functionally dependent columns, i.e.
+   knowledge of a value in the first column is sufficient for detemining the
+   value in the other column. Then functional dependencies are built on those
+   columns:
s/detemining/determining/

+
+   If a schema name is given (for example, CREATE STATISTICS
+   myschema.mystat ...) then the statistics is created in the specified
+   schema.  Otherwise it is created in the current schema.  The name of
+   the table must be distinct from the name of any other statistics in the
+   same schema.
+
I would just assume that a statistics is located on the schema of the
relation it depends on. So the thing that may be better to do is just:
- Register the OID of the table a statistics depends on but not the schema.
- Give up on those query extensions related to the schema.
- Allow the same statistics name to be used for multiple tables.
- Just fail if a statistics name is being reused on the table again.
It may be better to complain about that even if the column list is
different.
- Register the dependency between the statistics and the table.

+ALTER STATISTICS name
OWNER TO { new_owner |
CURRENT_USER | SESSION_USER }
On the same line, is OWNER TO really necessary? I could have assumed
that if a user is able to query the set of columns related to a

=# create statistics aa_a_b3 on aam (a, b) with (dependencies);
ERROR:  ```

### Re: [HACKERS] multivariate statistics (v19)

```On Sat, Aug 6, 2016 at 2:38 AM, Tomas Vondra
wrote:
> On 08/05/2016 06:24 AM, Michael Paquier wrote:
>>
>> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
>>  wrote:
>>>
>>> Attached is v19 of the "multivariate stats" patch series - essentially
>>> v18
>>> rebased on top of current master. Aside from a few bug fixes, the main
>>> improvement is addition of SGML docs demonstrating the statistics in a
>>> way
>>> similar to the current "Row Estimation Examples" (and the docs are
>>> actually
>>> in the same section). I've tried to keep the right amount of technical
>>> detail (and pointing to the right README for additional details), but
>>> this
>>> may need improvements. I have not written docs explaining how statistics
>>> may
>>
>>
>> What we have here is quite something:
>> \$ git diff master --stat | tail -n1
>>  77 files changed, 12809 insertions(+), 65 deletions(-)
>> I will try to get familiar on the topic and added myself as a reviewer
>> of this patch. Hopefully I'll get feedback soon.
>
>
> Yes, it's a large patch. Although 25% of the insertions are SGML docs,
> regression tests and READMEs, and large part of the remaining ~9k insertions
> are comments. But it may still be overwhelming, no doubt about that.
>
> FWIW, if someone is interested in the patch but is unsure where to start,
> I'm ready to help with that as much as possible. For example if you happen
> to go to PostgresOpen, feel free to drag me to a corner and ask me as many
> questions as you want ...

Sure. Only PGconf SV is on my track this year.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```
On 08/05/2016 06:24 AM, Michael Paquier wrote:

On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
wrote:

Attached is v19 of the "multivariate stats" patch series - essentially v18
rebased on top of current master. Aside from a few bug fixes, the main
improvement is addition of SGML docs demonstrating the statistics in a way
similar to the current "Row Estimation Examples" (and the docs are actually
in the same section). I've tried to keep the right amount of technical
detail (and pointing to the right README for additional details), but this
may need improvements. I have not written docs explaining how statistics may

What we have here is quite something:
\$ git diff master --stat | tail -n1
77 files changed, 12809 insertions(+), 65 deletions(-)
I will try to get familiar on the topic and added myself as a reviewer
of this patch. Hopefully I'll get feedback soon.

Yes, it's a large patch. Although 25% of the insertions are SGML docs,
regression tests and READMEs, and large part of the remaining ~9k
insertions are comments. But it may still be overwhelming, no doubt

FWIW, if someone is interested in the patch but is unsure where to
start, I'm ready to help with that as much as possible. For example if
you happen to go to PostgresOpen, feel free to drag me to a corner and
ask me as many questions as you want ...

regards

--
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```

### Re: [HACKERS] multivariate statistics (v19)

```On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
wrote:
> Attached is v19 of the "multivariate stats" patch series - essentially v18
> rebased on top of current master. Aside from a few bug fixes, the main
> improvement is addition of SGML docs demonstrating the statistics in a way
> similar to the current "Row Estimation Examples" (and the docs are actually
> in the same section). I've tried to keep the right amount of technical
> detail (and pointing to the right README for additional details), but this
> may need improvements. I have not written docs explaining how statistics may

What we have here is quite something:
\$ git diff master --stat | tail -n1
77 files changed, 12809 insertions(+), 65 deletions(-)
I will try to get familiar on the topic and added myself as a reviewer
of this patch. Hopefully I'll get feedback soon.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

```