Re: PoC/WIP: Extended statistics on expressions

2021-09-19 Thread Tomas Vondra

On 9/3/21 5:56 AM, Justin Pryzby wrote:

On Wed, Sep 01, 2021 at 06:45:29PM +0200, Tomas Vondra wrote:

However while polishing the second patch, I realized we're allowing
statistics on expressions referencing system attributes. So this fails;

CREATE STATISTICS s ON ctid, x FROM t;

but this passes:

CREATE STATISTICS s ON (ctid::text), x FROM t;

IMO we should reject such expressions, just like we reject direct references
to system attributes - patch attached.


Right, same as indexes.  +1



I've pushed this check, disallowing extended stats on expressions 
referencing system attributes. This means we'll reject both ctid and 
(ctid::text), just like for indexes.



Furthermore, I wonder if we should reject expressions without any Vars? This
works now:

CREATE STATISTICS s ON (11:text) FROM t;

but it seems rather silly / useless, so maybe we should reject it.


To my surprise, this is also allowed for indexes...

But (maybe this is what I was remembering) it's prohibited to have a constant
expression as a partition key.

Expressions without a var seem like a case where the user did something
deliberately silly, and dis-similar from the case of making a stats expression
on a simple column - that seemed like it could be a legitimate
mistake/confusion (it's not unreasonable to write an extra parenthesis, but
it's strange if that causes it to behave differently).

I think it's not worth too much effort to prohibit this: if they're determined,
they can still write an expresion with a var which is constant.  I'm not going
to say it's worth zero effort, though.



I've decided not to push this. The statistics objects on expressions not 
referencing any variables seem useless, but maybe not entirely - we 
allow volatile expressions, like


  CREATE STATISTICS s ON (random()) FROM t;

which I suppose might be useful. And we reject similar cases (except for 
the volatility, of course) for indexes too.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-09-02 Thread Justin Pryzby
On Wed, Sep 01, 2021 at 06:45:29PM +0200, Tomas Vondra wrote:
> However while polishing the second patch, I realized we're allowing
> statistics on expressions referencing system attributes. So this fails;
> 
> CREATE STATISTICS s ON ctid, x FROM t;
> 
> but this passes:
> 
> CREATE STATISTICS s ON (ctid::text), x FROM t;
> 
> IMO we should reject such expressions, just like we reject direct references
> to system attributes - patch attached.

Right, same as indexes.  +1

> Furthermore, I wonder if we should reject expressions without any Vars? This
> works now:
> 
> CREATE STATISTICS s ON (11:text) FROM t;
> 
> but it seems rather silly / useless, so maybe we should reject it.

To my surprise, this is also allowed for indexes...

But (maybe this is what I was remembering) it's prohibited to have a constant
expression as a partition key.

Expressions without a var seem like a case where the user did something
deliberately silly, and dis-similar from the case of making a stats expression
on a simple column - that seemed like it could be a legitimate
mistake/confusion (it's not unreasonable to write an extra parenthesis, but
it's strange if that causes it to behave differently).

I think it's not worth too much effort to prohibit this: if they're determined,
they can still write an expresion with a var which is constant.  I'm not going
to say it's worth zero effort, though.

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-09-01 Thread Tomas Vondra




On 9/1/21 9:38 PM, Justin Pryzby wrote:

On Wed, Sep 01, 2021 at 06:45:29PM +0200, Tomas Vondra wrote:

Patch 0001 fixes the "double parens" issue discussed elsewhere in this
thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
column reference.


0002 refuses to create expressional stats on a simple column reference like
(a), which I think is helps to avoid a user accidentally creating useless ext
stats objects (which are redundant with the table's column stats).

0002 does not attempt to refuse cases like (a+0), which I think is fine:
we don't try to reject useless cases if someone insists on it.
See 240971675, 701fd0bbc.

So I am +1 to apply both patches.

I added this as an Opened Item for increased visibility.


I've pushed both fixes, so the open item should be resolved.


Thank you - I marked it as such.

There are some typos in 537ca68db (refenrece)
I'll add them to my typos branch if you don't want to patch them right now or
wait to see if someone notices anything else.



Yeah, probably better to wait a bit. Any opinions on rejecting 
expressions referencing system attributes or no attributes at all?


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-09-01 Thread Justin Pryzby
On Wed, Sep 01, 2021 at 06:45:29PM +0200, Tomas Vondra wrote:
> > > Patch 0001 fixes the "double parens" issue discussed elsewhere in this
> > > thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
> > > column reference.
> > 
> > 0002 refuses to create expressional stats on a simple column reference like
> > (a), which I think is helps to avoid a user accidentally creating useless 
> > ext
> > stats objects (which are redundant with the table's column stats).
> > 
> > 0002 does not attempt to refuse cases like (a+0), which I think is fine:
> > we don't try to reject useless cases if someone insists on it.
> > See 240971675, 701fd0bbc.
> > 
> > So I am +1 to apply both patches.
> > 
> > I added this as an Opened Item for increased visibility.
> 
> I've pushed both fixes, so the open item should be resolved.

Thank you - I marked it as such.

There are some typos in 537ca68db (refenrece)
I'll add them to my typos branch if you don't want to patch them right now or
wait to see if someone notices anything else.

diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c
index 59369f8736..17cbd97808 100644
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@@ -205,27 +205,27 @@ CreateStatistics(CreateStatsStmt *stmt)
numcols = list_length(stmt->exprs);
if (numcols > STATS_MAX_DIMENSIONS)
ereport(ERROR,
(errcode(ERRCODE_TOO_MANY_COLUMNS),
 errmsg("cannot have more than %d columns in 
statistics",
STATS_MAX_DIMENSIONS)));
 
/*
 * Convert the expression list to a simple array of attnums, but also 
keep
 * a list of more complex expressions.  While at it, enforce some
 * constraints - we don't allow extended statistics on system 
attributes,
-* and we require the data type to have less-than operator.
+* and we require the data type to have a less-than operator.
 *
-* There are many ways how to "mask" a simple attribute refenrece as an
+* There are many ways to "mask" a simple attribute reference as an
 * expression, for example "(a+0)" etc. We can't possibly detect all of
-* them, but we handle at least the simple case with attribute in 
parens.
+* them, but we handle at least the simple case with the attribute in 
parens.
 * There'll always be a way around this, if the user is determined (like
 * the "(a+0)" example), but this makes it somewhat consistent with how
 * indexes treat attributes/expressions.
 */
foreach(cell, stmt->exprs)
{
StatsElem  *selem = lfirst_node(StatsElem, cell);
 
if (selem->name)/* column reference */
{
char   *attname;




Re: PoC/WIP: Extended statistics on expressions

2021-09-01 Thread Tomas Vondra


On 8/24/21 3:13 PM, Justin Pryzby wrote:

On Mon, Aug 16, 2021 at 05:41:57PM +0200, Tomas Vondra wrote:

This still seems odd:

postgres=# CREATE STATISTICS asf ON i FROM t;
ERROR:  extended statistics require at least 2 columns
postgres=# CREATE STATISTICS asf ON (i) FROM t;
CREATE STATISTICS

It seems wrong that the command works with added parens, but builds expression
stats on a simple column (which is redundant with what analyze does without
extended stats).


Well, yeah. But I think this is a behavior that was discussed somewhere in
this thread, and the agreement was that it's not worth the complexity, as
this comment explains

   * XXX We do only the bare minimum to separate simple attribute and
   * complex expressions - for example "(a)" will be treated as a complex
   * expression. No matter how elaborate the check is, there'll always be
   * a way around it, if the user is determined (consider e.g. "(a+0)"),
   * so it's not worth protecting against it.

Patch 0001 fixes the "double parens" issue discussed elsewhere in this
thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
column reference.


0002 refuses to create expressional stats on a simple column reference like
(a), which I think is helps to avoid a user accidentally creating useless ext
stats objects (which are redundant with the table's column stats).

0002 does not attempt to refuse cases like (a+0), which I think is fine:
we don't try to reject useless cases if someone insists on it.
See 240971675, 701fd0bbc.

So I am +1 to apply both patches.

I added this as an Opened Item for increased visibility.



I've pushed both fixes, so the open item should be resolved.

However while polishing the second patch, I realized we're allowing 
statistics on expressions referencing system attributes. So this fails;


CREATE STATISTICS s ON ctid, x FROM t;

but this passes:

CREATE STATISTICS s ON (ctid::text), x FROM t;

IMO we should reject such expressions, just like we reject direct 
references to system attributes - patch attached.


Furthermore, I wonder if we should reject expressions without any Vars? 
This works now:


CREATE STATISTICS s ON (11:text) FROM t;

but it seems rather silly / useless, so maybe we should reject it.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>From a3ade67b8b12cdbfa585bf351ca5569599977b41 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Wed, 1 Sep 2021 17:28:21 +0200
Subject: [PATCH] Reject extended statistics referencing system attributes

Extended statistics are not allowed to reference system attributes, but
we only enforced that for simple columns. For expressions, it was
possible to define the statistics on an expression defining the system
attribute, e.g.

CREATE STATISTICS s ON (ctid::text) FROM t;

which seems strange. This adds a check rejection expressions referencing
system attributes, just like we do for simple columns.

Backpatch to 14, where extended statistics on expressions were added.

Backpath-through: 14
---
 src/backend/commands/statscmds.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c
index 59369f8736..bf01840d8a 100644
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@@ -288,9 +288,24 @@ CreateStatistics(CreateStatsStmt *stmt)
 			Node	   *expr = selem->expr;
 			Oid			atttype;
 			TypeCacheEntry *type;
+			Bitmapset  *attnums = NULL;
+			int			k;
 
 			Assert(expr != NULL);
 
+			/* Disallow expressions referencing system attributes. */
+			pull_varattnos(expr, 1, );
+
+			k = -1;
+			while ((k = bms_next_member(attnums, k)) >= 0)
+			{
+AttrNumber	attnum = k + FirstLowInvalidHeapAttributeNumber;
+if (attnum <= 0)
+	ereport(ERROR,
+		(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+		 errmsg("statistics creation on system columns is not supported")));
+			}
+
 			/*
 			 * Disallow data types without a less-than operator.
 			 *
-- 
2.31.1



Re: PoC/WIP: Extended statistics on expressions

2021-08-24 Thread Justin Pryzby
On Mon, Aug 16, 2021 at 05:41:57PM +0200, Tomas Vondra wrote:
> > This still seems odd:
> > 
> > postgres=# CREATE STATISTICS asf ON i FROM t;
> > ERROR:  extended statistics require at least 2 columns
> > postgres=# CREATE STATISTICS asf ON (i) FROM t;
> > CREATE STATISTICS
> > 
> > It seems wrong that the command works with added parens, but builds 
> > expression
> > stats on a simple column (which is redundant with what analyze does without
> > extended stats).
> 
> Well, yeah. But I think this is a behavior that was discussed somewhere in
> this thread, and the agreement was that it's not worth the complexity, as
> this comment explains
> 
>   * XXX We do only the bare minimum to separate simple attribute and
>   * complex expressions - for example "(a)" will be treated as a complex
>   * expression. No matter how elaborate the check is, there'll always be
>   * a way around it, if the user is determined (consider e.g. "(a+0)"),
>   * so it's not worth protecting against it.
> 
> Patch 0001 fixes the "double parens" issue discussed elsewhere in this
> thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
> column reference.

0002 refuses to create expressional stats on a simple column reference like
(a), which I think is helps to avoid a user accidentally creating useless ext
stats objects (which are redundant with the table's column stats).

0002 does not attempt to refuse cases like (a+0), which I think is fine:
we don't try to reject useless cases if someone insists on it.
See 240971675, 701fd0bbc.

So I am +1 to apply both patches.

I added this as an Opened Item for increased visibility.

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-08-18 Thread Tomas Vondra




On 8/18/21 5:07 AM, Justin Pryzby wrote:

Patch 0001 fixes the "double parens" issue discussed elsewhere in this
thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
column reference.



From: Tomas Vondra 
Date: Mon, 16 Aug 2021 17:19:33 +0200
Subject: [PATCH 2/2] fix: identify single-attribute references



diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index a4ee54d516..be1f3a5175 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -2811,7 +2811,7 @@ my %tests = (
create_sql   => 'CREATE STATISTICS dump_test.test_ext_stats_expr
ON (2 * col1) FROM 
dump_test.test_fifth_table',
regexp => qr/^
-   \QCREATE STATISTICS dump_test.test_ext_stats_expr ON 
((2 * col1)) FROM dump_test.test_fifth_table;\E
+   \QCREATE STATISTICS dump_test.test_ext_stats_expr ON (2 
* col1) FROM dump_test.test_fifth_table;\E



This hunk should be in 0001, no ?


Yeah, I mixed that up a bit.


But I'm not sure 0002 is something we can do without catversion bump. What
if someone created such "bogus" statistics? It's mostly harmless, because
the statistics is useless anyway (AFAICS we'll just use the regular one we
have for the column), but if they do pg_dump, that'll fail because of this
new restriction.


I think it's okay if it pg_dump throws an error, since the fix is as easy as
dropping the stx object.  (It wouldn't be okay if it silently misbehaved.)

Andres concluded similarly with the reverted autovacuum patch:
https://www.postgresql.org/message-id/20210817105022.e2t4rozkhqy2m...@alap3.anarazel.de


> +RMT in case someone wants to argue otherwise.
>

I feel a bit uneasy about it, but if there's a precedent ...


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-08-17 Thread Justin Pryzby
> Patch 0001 fixes the "double parens" issue discussed elsewhere in this
> thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
> column reference.

> From: Tomas Vondra 
> Date: Mon, 16 Aug 2021 17:19:33 +0200
> Subject: [PATCH 2/2] fix: identify single-attribute references

> diff --git a/src/bin/pg_dump/t/002_pg_dump.pl 
> b/src/bin/pg_dump/t/002_pg_dump.pl
> index a4ee54d516..be1f3a5175 100644
> --- a/src/bin/pg_dump/t/002_pg_dump.pl
> +++ b/src/bin/pg_dump/t/002_pg_dump.pl
> @@ -2811,7 +2811,7 @@ my %tests = (
>   create_sql   => 'CREATE STATISTICS dump_test.test_ext_stats_expr
>   ON (2 * col1) FROM 
> dump_test.test_fifth_table',
>   regexp => qr/^
> - \QCREATE STATISTICS dump_test.test_ext_stats_expr ON 
> ((2 * col1)) FROM dump_test.test_fifth_table;\E
> + \QCREATE STATISTICS dump_test.test_ext_stats_expr ON (2 
> * col1) FROM dump_test.test_fifth_table;\E


This hunk should be in 0001, no ?

> But I'm not sure 0002 is something we can do without catversion bump. What
> if someone created such "bogus" statistics? It's mostly harmless, because
> the statistics is useless anyway (AFAICS we'll just use the regular one we
> have for the column), but if they do pg_dump, that'll fail because of this
> new restriction.

I think it's okay if it pg_dump throws an error, since the fix is as easy as
dropping the stx object.  (It wouldn't be okay if it silently misbehaved.)

Andres concluded similarly with the reverted autovacuum patch:
https://www.postgresql.org/message-id/20210817105022.e2t4rozkhqy2m...@alap3.anarazel.de

+RMT in case someone wants to argue otherwise.

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-08-16 Thread Tomas Vondra



On 8/16/21 3:32 AM, Justin Pryzby wrote:

On Mon, Dec 07, 2020 at 03:15:17PM +0100, Tomas Vondra wrote:

Looking at the current behaviour, there are a couple of things that
seem a little odd, even though they are understandable. For example,
the fact that

   CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;

fails, but

   CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;

succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax

   CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;

tends to suggest that it's going to create statistics on the pair of
expressions, describing their correlation, when actually it builds 2
independent statistics. Also, this error text isn't entirely accurate:

   CREATE STATISTICS s ON col FROM tbl;
   ERROR:  extended statistics require at least 2 columns

because extended statistics don't always require 2 columns, they can
also just have an expression, or multiple expressions and 0 or 1
columns.

I think a lot of this stems from treating "expressions" in the same
way as the other (multi-column) stats kinds, and it might actually be
neater to have separate documented syntaxes for single- and
multi-column statistics:

   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
 ON (expression)
 FROM table_name

   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
 [ ( statistics_kind [, ... ] ) ]
 ON { column_name | (expression) } , { column_name | (expression) } [, ...]
 FROM table_name

The first syntax would create single-column stats, and wouldn't accept
a statistics_kind argument, because there is only one kind of
single-column statistic. Maybe that might change in the future, but if
so, it's likely that the kinds of single-column stats will be
different from the kinds of multi-column stats.

In the second syntax, the only accepted kinds would be the current
multi-column stats kinds (ndistinct, dependencies, and mcv), and it
would always build stats describing the correlations between the
columns listed. It would continue to build standard/expression stats
on any expressions in the list, but that's more of an implementation
detail.

It would no longer be possible to do "CREATE STATISTICS s
(expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
issue 2 separate "CREATE STATISTICS" commands, but that seems more
logical, because they're independent stats.

The parsing code might not change much, but some of the errors would
be different. For example, the errors "building only extended
expression statistics on simple columns not allowed" and "extended
expression statistics require at least one expression" would go away,
and the error "extended statistics require at least 2 columns" might
become more specific, depending on the stats kind.


This still seems odd:

postgres=# CREATE STATISTICS asf ON i FROM t;
ERROR:  extended statistics require at least 2 columns
postgres=# CREATE STATISTICS asf ON (i) FROM t;
CREATE STATISTICS

It seems wrong that the command works with added parens, but builds expression
stats on a simple column (which is redundant with what analyze does without
extended stats).



Well, yeah. But I think this is a behavior that was discussed somewhere 
in this thread, and the agreement was that it's not worth the 
complexity, as this comment explains


  * XXX We do only the bare minimum to separate simple attribute and
  * complex expressions - for example "(a)" will be treated as a complex
  * expression. No matter how elaborate the check is, there'll always be
  * a way around it, if the user is determined (consider e.g. "(a+0)"),
  * so it's not worth protecting against it.

Of course, maybe that wasn't the right decision - it's a bit weird that

  CREATE INDEX on t ((a), (b))

actually "extracts" the column references and stores that in indkeys, 
instead of treating that as expressions.


Patch 0001 fixes the "double parens" issue discussed elsewhere in this 
thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a 
simple column reference.


But I'm not sure 0002 is something we can do without catversion bump. 
What if someone created such "bogus" statistics? It's mostly harmless, 
because the statistics is useless anyway (AFAICS we'll just use the 
regular one we have for the column), but if they do pg_dump, that'll 
fail because of this new restriction.


OTOH we're still "only" in beta, and IIRC the rule is not to bump 
catversion after rc1.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>From 4428ee5b46fe1ce45331c355e1646520ca769991 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Mon, 16 Aug 2021 16:40:43 +0200
Subject: [PATCH 1/2] fix: don't print double parens

---
 src/backend/utils/adt/ruleutils.c |   2 +-
 .../regress/expected/create_table_like.out|   6 +-
 src/test/regress/expected/stats_ext.out   | 110 +-
 3 files changed, 59 insertions(+), 59 

Re: PoC/WIP: Extended statistics on expressions

2021-08-16 Thread Tomas Vondra




On 8/16/21 3:31 AM, Justin Pryzby wrote:

On 1/22/21 5:01 AM, Justin Pryzby wrote:

In any case, why are there so many parentheses ?


On Fri, Jan 22, 2021 at 02:09:04PM +0100, Tomas Vondra wrote:

That's a bug in pg_get_statisticsobj_worker, probably. It shouldn't be
adding extra parentheses, on top of what deparse_expression_pretty does.
Will fix.


The extra parens are still here - is it intended ?



Ah, thanks for reminding me! I was looking at this, and the problem is 
that pg_get_statisticsobj_worker only does this:


prettyFlags = PRETTYFLAG_INDENT;

Changing that to

prettyFlags = PRETTYFLAG_INDENT | PRETTYFLAG_PAREN;

fixes this (not sure we need the INDENT flag - probably not).

I'm a bit confused, though. My assumption was "PRETTYFLAG_PAREN = true" 
would force the deparsing itself to add the parens, if needed, but in 
reality it works the other way around.


I guess it's more complicated due to deparsing multi-level expressions, 
but unfortunately, there's no comment explaining what it does.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-08-15 Thread Justin Pryzby
On Mon, Dec 07, 2020 at 03:15:17PM +0100, Tomas Vondra wrote:
> > Looking at the current behaviour, there are a couple of things that
> > seem a little odd, even though they are understandable. For example,
> > the fact that
> > 
> >   CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;
> > 
> > fails, but
> > 
> >   CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;
> > 
> > succeeds and creates both "expressions" and "mcv" statistics. Also, the 
> > syntax
> > 
> >   CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;
> > 
> > tends to suggest that it's going to create statistics on the pair of
> > expressions, describing their correlation, when actually it builds 2
> > independent statistics. Also, this error text isn't entirely accurate:
> > 
> >   CREATE STATISTICS s ON col FROM tbl;
> >   ERROR:  extended statistics require at least 2 columns
> > 
> > because extended statistics don't always require 2 columns, they can
> > also just have an expression, or multiple expressions and 0 or 1
> > columns.
> > 
> > I think a lot of this stems from treating "expressions" in the same
> > way as the other (multi-column) stats kinds, and it might actually be
> > neater to have separate documented syntaxes for single- and
> > multi-column statistics:
> > 
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> > ON (expression)
> > FROM table_name
> > 
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> > [ ( statistics_kind [, ... ] ) ]
> > ON { column_name | (expression) } , { column_name | (expression) } [, 
> > ...]
> > FROM table_name
> > 
> > The first syntax would create single-column stats, and wouldn't accept
> > a statistics_kind argument, because there is only one kind of
> > single-column statistic. Maybe that might change in the future, but if
> > so, it's likely that the kinds of single-column stats will be
> > different from the kinds of multi-column stats.
> > 
> > In the second syntax, the only accepted kinds would be the current
> > multi-column stats kinds (ndistinct, dependencies, and mcv), and it
> > would always build stats describing the correlations between the
> > columns listed. It would continue to build standard/expression stats
> > on any expressions in the list, but that's more of an implementation
> > detail.
> > 
> > It would no longer be possible to do "CREATE STATISTICS s
> > (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
> > issue 2 separate "CREATE STATISTICS" commands, but that seems more
> > logical, because they're independent stats.
> > 
> > The parsing code might not change much, but some of the errors would
> > be different. For example, the errors "building only extended
> > expression statistics on simple columns not allowed" and "extended
> > expression statistics require at least one expression" would go away,
> > and the error "extended statistics require at least 2 columns" might
> > become more specific, depending on the stats kind.

This still seems odd:

postgres=# CREATE STATISTICS asf ON i FROM t;
ERROR:  extended statistics require at least 2 columns
postgres=# CREATE STATISTICS asf ON (i) FROM t;
CREATE STATISTICS

It seems wrong that the command works with added parens, but builds expression
stats on a simple column (which is redundant with what analyze does without
extended stats).

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-08-15 Thread Justin Pryzby
On 1/22/21 5:01 AM, Justin Pryzby wrote:
> > In any case, why are there so many parentheses ?

On Fri, Jan 22, 2021 at 02:09:04PM +0100, Tomas Vondra wrote:
> That's a bug in pg_get_statisticsobj_worker, probably. It shouldn't be
> adding extra parentheses, on top of what deparse_expression_pretty does.
> Will fix.

The extra parens are still here - is it intended ?

postgres=# CREATE STATISTICS s ON i, (1+i), (2+i) FROM t;
CREATE STATISTICS
postgres=# \d t
 Table "public.t"
 Column |  Type   | Collation | Nullable | Default 
+-+---+--+-
 i  | integer |   |  | 
Statistics objects:
"public"."s" ON i, ((1 + i)), ((2 + i)) FROM t

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-06-11 Thread Tomas Vondra




On 6/11/21 6:55 AM, Noah Misch wrote:

On Sun, Jun 06, 2021 at 09:13:17PM +0200, Tomas Vondra wrote:


On 6/6/21 7:37 AM, Noah Misch wrote:

On Sat, Mar 27, 2021 at 01:17:14AM +0100, Tomas Vondra wrote:

OK, pushed after a bit more polishing and testing.


This added a "transformed" field to CreateStatsStmt, but it didn't mention
that field in src/backend/nodes.  Should those functions handle the field?



Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not
sure if it can result in error/failure or just inefficiency (due to
transforming the expressions repeatedly), but it should do whatever
CREATE INDEX is doing.

Thanks for noticing! Fixed by d57ecebd12.


Great.  For future reference, this didn't need a catversion bump.  readfuncs.c
changes need a catversion bump, since the catalogs might contain input for
each read function.  Other src/backend/nodes functions don't face that.  Also,
src/backend/nodes generally process fields in the order that they appear in
the struct.  The order you used in d57ecebd12 is nicer, being more like
IndexStmt, so I'm pushing an order change to the struct.



OK, makes sense. Thanks!

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-06-10 Thread Noah Misch
On Sun, Jun 06, 2021 at 09:13:17PM +0200, Tomas Vondra wrote:
> 
> On 6/6/21 7:37 AM, Noah Misch wrote:
> > On Sat, Mar 27, 2021 at 01:17:14AM +0100, Tomas Vondra wrote:
> >> OK, pushed after a bit more polishing and testing.
> > 
> > This added a "transformed" field to CreateStatsStmt, but it didn't mention
> > that field in src/backend/nodes.  Should those functions handle the field?
> > 
> 
> Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not
> sure if it can result in error/failure or just inefficiency (due to
> transforming the expressions repeatedly), but it should do whatever
> CREATE INDEX is doing.
> 
> Thanks for noticing! Fixed by d57ecebd12.

Great.  For future reference, this didn't need a catversion bump.  readfuncs.c
changes need a catversion bump, since the catalogs might contain input for
each read function.  Other src/backend/nodes functions don't face that.  Also,
src/backend/nodes generally process fields in the order that they appear in
the struct.  The order you used in d57ecebd12 is nicer, being more like
IndexStmt, so I'm pushing an order change to the struct.




Re: PoC/WIP: Extended statistics on expressions

2021-06-06 Thread Tom Lane
Tomas Vondra  writes:
> On 6/6/21 9:17 PM, Tom Lane wrote:
>> I'm curious about how come the buildfarm didn't notice this.  The
>> animals using COPY_PARSE_PLAN_TREES should have failed.  The fact
>> that they didn't implies that there's no test case that makes use
>> of a nonzero value for this field, which seems like a testing gap.

> AFAICS the reason is pretty simple - the COPY_PARSE_PLAN_TREES checks
> look like this:
> ...
> But if the field is missing from all the functions, equal() can't detect
> that copyObject() did not actually copy it.

Right, that code would only detect a missing copyfuncs.c line if
equalfuncs.c did have the line, which isn't all that likely.  However,
we then pass the copied node on to further processing, which in principle
should result in visible failures when copyfuncs.c is missing a line.

I think the reason it didn't is that the transformed field would always
be zero (false) in grammar output.  We could only detect a problem if
we copied already-transformed nodes and then used them further.  Even
then it *might* not fail, because the consequence would likely be an
extra round of parse analysis on the expressions, which is likely to
be a no-op.

Not sure if there's a good way to improve that.  I hope sometime soon
we'll be able to auto-generate these functions, and then the risk of
this sort of mistake will go away (he says optimistically).

regards, tom lane




Re: PoC/WIP: Extended statistics on expressions

2021-06-06 Thread Tomas Vondra



On 6/6/21 9:17 PM, Tom Lane wrote:
> Tomas Vondra  writes:
>> On 6/6/21 7:37 AM, Noah Misch wrote:
>>> This added a "transformed" field to CreateStatsStmt, but it didn't mention
>>> that field in src/backend/nodes.  Should those functions handle the field?
> 
>> Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not
>> sure if it can result in error/failure or just inefficiency (due to
>> transforming the expressions repeatedly), but it should do whatever
>> CREATE INDEX is doing.
> 
> I'm curious about how come the buildfarm didn't notice this.  The
> animals using COPY_PARSE_PLAN_TREES should have failed.  The fact
> that they didn't implies that there's no test case that makes use
> of a nonzero value for this field, which seems like a testing gap.
> 

AFAICS the reason is pretty simple - the COPY_PARSE_PLAN_TREES checks
look like this:

List   *new_list = copyObject(raw_parsetree_list);

/* This checks both copyObject() and the equal() routines... */
if (!equal(new_list, raw_parsetree_list))
elog(WARNING, "copyObject() failed to produce an equal raw
   parse tree");
else
raw_parsetree_list = new_list;
}

But if the field is missing from all the functions, equal() can't detect
that copyObject() did not actually copy it. It'd detect a case when the
field was added just to one place, but not this. The CREATE INDEX (which
served as an example for CREATE STATISTICS) has exactly the same issue.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-06-06 Thread Tom Lane
Tomas Vondra  writes:
> On 6/6/21 7:37 AM, Noah Misch wrote:
>> This added a "transformed" field to CreateStatsStmt, but it didn't mention
>> that field in src/backend/nodes.  Should those functions handle the field?

> Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not
> sure if it can result in error/failure or just inefficiency (due to
> transforming the expressions repeatedly), but it should do whatever
> CREATE INDEX is doing.

I'm curious about how come the buildfarm didn't notice this.  The
animals using COPY_PARSE_PLAN_TREES should have failed.  The fact
that they didn't implies that there's no test case that makes use
of a nonzero value for this field, which seems like a testing gap.

regards, tom lane




Re: PoC/WIP: Extended statistics on expressions

2021-06-06 Thread Tomas Vondra


On 6/6/21 7:37 AM, Noah Misch wrote:
> On Sat, Mar 27, 2021 at 01:17:14AM +0100, Tomas Vondra wrote:
>> OK, pushed after a bit more polishing and testing.
> 
> This added a "transformed" field to CreateStatsStmt, but it didn't mention
> that field in src/backend/nodes.  Should those functions handle the field?
> 

Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not
sure if it can result in error/failure or just inefficiency (due to
transforming the expressions repeatedly), but it should do whatever
CREATE INDEX is doing.

Thanks for noticing! Fixed by d57ecebd12.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-06-05 Thread Noah Misch
On Sat, Mar 27, 2021 at 01:17:14AM +0100, Tomas Vondra wrote:
> OK, pushed after a bit more polishing and testing.

This added a "transformed" field to CreateStatsStmt, but it didn't mention
that field in src/backend/nodes.  Should those functions handle the field?




Re: PoC/WIP: Extended statistics on expressions (\d in old client)

2021-06-02 Thread Justin Pryzby
On Fri, Jan 22, 2021 at 02:09:04PM +0100, Tomas Vondra wrote:
> On 1/22/21 5:01 AM, Justin Pryzby wrote:
> > On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote:
> > > > > | Statistics objects:
> > > > > | "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
> > > 
> > > Umm, for me that prints:
> > 
> > >  "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t
> > > 
> > > which I think is OK. But maybe there's something else to trigger the
> > > problem?
> > 
> > Oh.  It's because I was using /usr/bin/psql and not ./src/bin/psql.
> > I think it's considered ok if old client's \d commands don't work on new
> > server, but it's not clear to me if it's ok if they misbehave.  It's almost
> > better it made an ERROR.
> > 
> 
> Well, how would the server know to throw an error? We can't quite patch the
> old psql (if we could, we could just tweak the query).

To refresh: stats objects on a v14 server which include expressions are shown
by pre-v14 psql client with the expressions elided (because the attnums don't
correspond to anything in pg_attribute).

I'm mentioning it again since, even though I knew about this earlier in the
year, it caused some confusion for me again just now while testing our
application.  I had the v14 server installed but the psql symlink still pointed
to the v13 client.

There may not be anything we can do about it.
And it may not be a significant issue outside the beta period: more typically,
the client version would match the server version.

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-04-22 Thread Justin Pryzby
I suggest to add some kind of reference to stats expressions here.

--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml

  
   Updating Planner Statistics

   
statistics
of the planner
   

[...]

@@ -330,10 +330,12 @@
 
 
  Also, by default there is limited information available about
- the selectivity of functions.  However, if you create an expression
+ the selectivity of functions.  However, if you create a statistics
+ expression or an expression
  index that uses a function call, useful statistics will be
  gathered about the function, which can greatly improve query
  plans that use the expression index.


-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-03-26 Thread Tomas Vondra



On 3/27/21 1:17 AM, Tomas Vondra wrote:
> On 3/26/21 1:54 PM, Tomas Vondra wrote:
>>
>>
>> On 3/26/21 12:37 PM, Dean Rasheed wrote:
>>> On Thu, 25 Mar 2021 at 19:59, Tomas Vondra
>>>  wrote:

 Attached is an updated patch series, with all the changes discussed
 here. I've cleaned up the ndistinct stuff a bit more (essentially
 reverting back from GroupExprInfo to GroupVarInfo name), and got rid of
 the UpdateStatisticsForTypeChange.

>>>
>>> I've looked over all that and I think it's in pretty good shape. I
>>> particularly like how much simpler the ndistinct code has now become.
>>>
>>> Some (hopefully final) review comments:
>>>
>>> ...
>>>
>>
>> Thanks! I'll fix these, and then will consider getting it committed
>> sometime later today, once the buildfarm does some testing on the other
>> stuff I already committed.
>>
> 
> OK, pushed after a bit more polishing and testing. I've noticed one more
> missing piece in describe (expressions missing in \dX), so I fixed that.
> 
> May the buildfarm be merciful ...
> 

LOL! It failed on *my* buildfarm machine, because apparently some of the
expressions used in stats_ext depend on locale and the machine is using
cs_CZ.UTF-8. Will fix later ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-03-26 Thread Tomas Vondra
On 3/26/21 1:54 PM, Tomas Vondra wrote:
> 
> 
> On 3/26/21 12:37 PM, Dean Rasheed wrote:
>> On Thu, 25 Mar 2021 at 19:59, Tomas Vondra
>>  wrote:
>>>
>>> Attached is an updated patch series, with all the changes discussed
>>> here. I've cleaned up the ndistinct stuff a bit more (essentially
>>> reverting back from GroupExprInfo to GroupVarInfo name), and got rid of
>>> the UpdateStatisticsForTypeChange.
>>>
>>
>> I've looked over all that and I think it's in pretty good shape. I
>> particularly like how much simpler the ndistinct code has now become.
>>
>> Some (hopefully final) review comments:
>>
>> ...
>>
> 
> Thanks! I'll fix these, and then will consider getting it committed
> sometime later today, once the buildfarm does some testing on the other
> stuff I already committed.
> 

OK, pushed after a bit more polishing and testing. I've noticed one more
missing piece in describe (expressions missing in \dX), so I fixed that.

May the buildfarm be merciful ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-03-26 Thread Tomas Vondra



On 3/26/21 12:37 PM, Dean Rasheed wrote:
> On Thu, 25 Mar 2021 at 19:59, Tomas Vondra
>  wrote:
>>
>> Attached is an updated patch series, with all the changes discussed
>> here. I've cleaned up the ndistinct stuff a bit more (essentially
>> reverting back from GroupExprInfo to GroupVarInfo name), and got rid of
>> the UpdateStatisticsForTypeChange.
>>
> 
> I've looked over all that and I think it's in pretty good shape. I
> particularly like how much simpler the ndistinct code has now become.
> 
> Some (hopefully final) review comments:
> 
> 1). I don't think index.c is the right place for
> StatisticsGetRelation(). I appreciate that it is very similar to the
> adjacent IndexGetRelation() function, but index.c is really only for
> index-related code, so I think StatisticsGetRelation() should go in
> statscmds.c
> 

Ah, right, I forgot about this. I wonder if we should add
catalog/statistics.c, similar to catalog/index.c (instead of adding it
locally to statscmds.c).

> 2). Perhaps the error message at statscmds.c:293 should read
> 
>"expression cannot be used in multivariate statistics because its
> type %s has no default btree operator class"
> 
> (i.e., add the word "multivariate", since the same expression *can* be
> used in univariate statistics even though it has no less-than
> operator).
> 
> 3). The comment for ATExecAddStatistics() should probably mention that
> "ALTER TABLE ADD STATISTICS" isn't a command in the grammar, in a
> similar way to other similar functions, e.g.:
> 
> /*
>  * ALTER TABLE ADD STATISTICS
>  *
>  * This is no such command in the grammar, but we use this internally to add
>  * AT_ReAddStatistics subcommands to rebuild extended statistics after a table
>  * column type change.
>  */
> 
> 4). The comment at the start of ATPostAlterTypeParse() needs updating
> to mention CREATE STATISTICS statements.
> 
> 5). I think ATPostAlterTypeParse() should also attempt to preserve any
> COMMENTs attached to statistics objects, i.e., something like:
> 
> --- src/backend/commands/tablecmds.c.orig2021-03-26 10:39:38.328631864 
> +
> +++ src/backend/commands/tablecmds.c2021-03-26 10:47:21.042279580 +
> @@ -12619,6 +12619,9 @@
>  CreateStatsStmt  *stmt = (CreateStatsStmt *) stm;
>  AlterTableCmd *newcmd;
> 
> +/* keep the statistics object's comment */
> +stmt->stxcomment = GetComment(oldId, StatisticExtRelationId, 0);
> +
>  newcmd = makeNode(AlterTableCmd);
>  newcmd->subtype = AT_ReAddStatistics;
>  newcmd->def = (Node *) stmt;
> 
> 6). Comment typo at extended_stats.c:2532 - s/statitics/statistics/
> 
> 7). I don't think that the big XXX comment near the start of
> estimate_multivariate_ndistinct() is really relevant anymore, now that
> the code has been simplified and we no longer extract Vars from
> expressions, so perhaps it can just be deleted.
> 

Thanks! I'll fix these, and then will consider getting it committed
sometime later today, once the buildfarm does some testing on the other
stuff I already committed.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-03-26 Thread Dean Rasheed
On Thu, 25 Mar 2021 at 19:59, Tomas Vondra
 wrote:
>
> Attached is an updated patch series, with all the changes discussed
> here. I've cleaned up the ndistinct stuff a bit more (essentially
> reverting back from GroupExprInfo to GroupVarInfo name), and got rid of
> the UpdateStatisticsForTypeChange.
>

I've looked over all that and I think it's in pretty good shape. I
particularly like how much simpler the ndistinct code has now become.

Some (hopefully final) review comments:

1). I don't think index.c is the right place for
StatisticsGetRelation(). I appreciate that it is very similar to the
adjacent IndexGetRelation() function, but index.c is really only for
index-related code, so I think StatisticsGetRelation() should go in
statscmds.c

2). Perhaps the error message at statscmds.c:293 should read

   "expression cannot be used in multivariate statistics because its
type %s has no default btree operator class"

(i.e., add the word "multivariate", since the same expression *can* be
used in univariate statistics even though it has no less-than
operator).

3). The comment for ATExecAddStatistics() should probably mention that
"ALTER TABLE ADD STATISTICS" isn't a command in the grammar, in a
similar way to other similar functions, e.g.:

/*
 * ALTER TABLE ADD STATISTICS
 *
 * This is no such command in the grammar, but we use this internally to add
 * AT_ReAddStatistics subcommands to rebuild extended statistics after a table
 * column type change.
 */

4). The comment at the start of ATPostAlterTypeParse() needs updating
to mention CREATE STATISTICS statements.

5). I think ATPostAlterTypeParse() should also attempt to preserve any
COMMENTs attached to statistics objects, i.e., something like:

--- src/backend/commands/tablecmds.c.orig2021-03-26 10:39:38.328631864 +
+++ src/backend/commands/tablecmds.c2021-03-26 10:47:21.042279580 +
@@ -12619,6 +12619,9 @@
 CreateStatsStmt  *stmt = (CreateStatsStmt *) stm;
 AlterTableCmd *newcmd;

+/* keep the statistics object's comment */
+stmt->stxcomment = GetComment(oldId, StatisticExtRelationId, 0);
+
 newcmd = makeNode(AlterTableCmd);
 newcmd->subtype = AT_ReAddStatistics;
 newcmd->def = (Node *) stmt;

6). Comment typo at extended_stats.c:2532 - s/statitics/statistics/

7). I don't think that the big XXX comment near the start of
estimate_multivariate_ndistinct() is really relevant anymore, now that
the code has been simplified and we no longer extract Vars from
expressions, so perhaps it can just be deleted.

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2021-03-25 Thread Dean Rasheed
On Thu, 25 Mar 2021 at 00:05, Tomas Vondra
 wrote:
>
> here's an updated patch. 0001

The change to the way that CreateStatistics() records dependencies
isn't quite right -- recordDependencyOnSingleRelExpr() will not create
any dependencies if the expression uses only a whole-row Var. However,
pull_varattnos() will include whole-row Vars, and so nattnums_exprs
will be non-zero, and CreateStatistics() will not create a whole-table
dependency when it should.

I suppose that could be fixed up by inspecting the bitmapset returned
by pull_varattnos() in more detail, but I think it's probably safer to
revert to the previous code, which matched what index_create() did.

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2021-03-25 Thread Dean Rasheed
On Thu, 25 Mar 2021 at 00:05, Tomas Vondra
 wrote:
>
> Actually, I think we need that block at all - there's no point in
> keeping the exact expression, because if there was a statistics matching
> it it'd be matched by the examine_variable. So if we get here, we have
> to just split it into the vars anyway. So the second block is entirely
> useless.

Good point.

> That however means we don't need the processing with GroupExprInfo and
> GroupVarInfo lists, i.e. we can revert back to the original simpler
> processing, with a bit of extra logic to match expressions, that's all.
>
> The patch 0003 does this (it's a bit crude, but hopefully enough to
> demonstrate).

Cool. I did wonder about that, but I didn't fully think it through.
I'll take a look.

> 0002 is an attempt to fix an issue I noticed today - we need to handle
> type changes.
>
> I think we have two options:
>
> a) Make UpdateStatisticsForTypeChange smarter to also transform and
> update the expression string, and reset pg_statistics[] data.
>
> b) Just recreate the statistics, just like we do for indexes. Currently
> this does not force analyze, so it just resets all the stats. Maybe it
> should do analyze, though.

I'd vote for (b) without an analyse, and I agree with getting rid of
UpdateStatisticsForTypeChange(). I've always been a bit skeptical
about trying to preserve extended statistics after a type change, when
we don't preserve regular per-column stats.

> BTW I wonder how useful the updated statistics actually is. Consider
> this example:
> ...
> the expression now looks like this:
>
> 
> "public"."s" (ndistinct) ON ((a + b)), b)::numeric)::double
> precision + c)) FROM t
> 
>
> But we're matching it to (((b)::double precision + c)), so that fails.
>
> This is not specific to extended statistics - indexes have exactly the
> same issue. Not sure how common this is in practice.

Hmm, that's unfortunate. Maybe it's not that common in practice
though. I'm not sure if there is any practical way to fix it, but if
there is, I guess we'd want to apply the same fix to both stats and
indexes, and that certainly seems out of scope for this patch.

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Tomas Vondra
On 3/25/21 1:05 AM, Tomas Vondra wrote:
> ...
>
> 0002 is an attempt to fix an issue I noticed today - we need to handle
> type changes. Until now we did not have problems with that, because we
> only had attnums - so we just reset the statistics (with the exception
> of functional dependencies, on the assumption that those remain valid).
> 
> With expressions it's a bit more complicated, though.
> 
> 1) we need to transform the expressions so that the Vars contain the
> right type info etc. Otherwise an analyze with the old pg_node_tree crashes
> 
> 2) we need to reset the pg_statistic[] data too, which however makes
> keeping the functional dependencies a bit less useful, because those
> rely on the expression stats :-(
> 
> So I'm wondering what to do about this. I looked into how ALTER TABLE
> handles indexes, and 0003 is a PoC to do the same thing for statistics.
> Of couse, this is a bit unfortunate because it recreates the statistics
> (so we don't keep anything, not even functional dependencies).
> 
> I think we have two options:
> 
> a) Make UpdateStatisticsForTypeChange smarter to also transform and
> update the expression string, and reset pg_statistics[] data.
> 
> b) Just recreate the statistics, just like we do for indexes. Currently
> this does not force analyze, so it just resets all the stats. Maybe it
> should do analyze, though.
> 
> Any opinions? I need to think about this a bit more, but maybe (b) with
> the analyze is the right thing to do. Keeping just some of the stats
> always seemed a bit weird. (This is why the 0002 patch breaks one of the
> regression tests.)
> 

After thinking about this a bit more I think (b) is the right choice,
and the analyze is not necessary. The reason is fairly simple - we drop
the per-column statistics, because ATExecAlterColumnType does

RemoveStatistics(RelationGetRelid(rel), attnum);

so the user has to run analyze anyway, to get any reasonable estimates
(we keep the functional dependencies, but those still rely on per-column
statistics quite a bit). And we'll have to do the same thing with
per-expression stats too. It was a nice idea to keep at least the stats
that are not outright broken, but unfortunately it's not a very useful
optimization. It increases the instability of the system, because now we
have estimates with all statistics, no statistics, and something in
between after the partial reset. Not nice.

So my plan is to get rid of UpdateStatisticsForTypeChange, and just do
mostly what we do for indexes. It's not perfect (as demonstrated in last
message), but that'd apply even to option (a).

Any better ideas?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Justin Pryzby
On Thu, Mar 25, 2021 at 01:05:37AM +0100, Tomas Vondra wrote:
> here's an updated patch. 0001 should address most of the today's review
> items regarding comments etc.

This is still an issue:

postgres=# CREATE STATISTICS xt ON a FROM t JOIN t ON true;
ERROR:  schema "i" does not exist

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Alvaro Herrera
On 2021-Mar-24, Justin Pryzby wrote:

> On Wed, Mar 24, 2021 at 05:15:46PM +, Dean Rasheed wrote:

> > Hmm, I think "univariate" and "multivariate" are pretty ubiquitous,
> > when used to describe statistics. You could use "single-column" and
> > "multi-column", but then "column" isn't really right anymore, since it
> > might be a column or an expression. I can't think of any other terms
> > that fit.

Agreed.  If we need to define the term, we can spend a sentence or two
in that.

> We already use "multivariate", just not in create-statistics.sgml

> So I think the answer is for create-statistics to expose that word in a
> user-facing way in its reference to multivariate-statistics-examples.

+1

-- 
Álvaro Herrera   Valdivia, Chile




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Justin Pryzby
On Wed, Mar 24, 2021 at 05:15:46PM +, Dean Rasheed wrote:
> On Wed, 24 Mar 2021 at 16:48, Tomas Vondra
>  wrote:
> >
> > As for the changes proposed in the create_statistics, do we really want
> > to use univariate / multivariate there? Yes, the terms are correct, but
> > I'm not sure how many people looking at CREATE STATISTICS will
> > understand them.
> >
> 
> Hmm, I think "univariate" and "multivariate" are pretty ubiquitous,
> when used to describe statistics. You could use "single-column" and
> "multi-column", but then "column" isn't really right anymore, since it
> might be a column or an expression. I can't think of any other terms
> that fit.

We already use "multivariate", just not in create-statistics.sgml

doc/src/sgml/perform.sgml:multivariate statistics, 
which can capture
doc/src/sgml/perform.sgml:it's impractical to compute multivariate 
statistics automatically.
doc/src/sgml/planstats.sgml: 
doc/src/sgml/planstats.sgml:   multivariate
doc/src/sgml/planstats.sgml:multivariate statistics on the two columns:
doc/src/sgml/planstats.sgml:  
doc/src/sgml/planstats.sgml:But without multivariate statistics, the 
estimate for the number of
doc/src/sgml/planstats.sgml:This section introduces multivariate variant of 
MCV
doc/src/sgml/ref/create_statistics.sgml:  and .
doc/src/sgml/release-13.sgml:2020-01-13 [eae056c19] Apply multiple multivariate 
MCV lists when possible

So I think the answer is for create-statistics to expose that word in a
user-facing way in its reference to multivariate-statistics-examples.

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Dean Rasheed
On Wed, 24 Mar 2021 at 16:48, Tomas Vondra
 wrote:
>
> As for the changes proposed in the create_statistics, do we really want
> to use univariate / multivariate there? Yes, the terms are correct, but
> I'm not sure how many people looking at CREATE STATISTICS will
> understand them.
>

Hmm, I think "univariate" and "multivariate" are pretty ubiquitous,
when used to describe statistics. You could use "single-column" and
"multi-column", but then "column" isn't really right anymore, since it
might be a column or an expression. I can't think of any other terms
that fit.

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Tomas Vondra



On 3/24/21 5:28 PM, Dean Rasheed wrote:
> On Wed, 24 Mar 2021 at 14:48, Tomas Vondra
>  wrote:
>>
>> AFAIK the primary issue here is that the two places disagree. While
>> estimate_num_groups does this
>>
>> varnos = pull_varnos(root, (Node *) varshere);
>> if (bms_membership(varnos) == BMS_SINGLETON)
>> { ... }
>>
>> the add_unique_group_expr does this
>>
>> varnos = pull_varnos(root, (Node *) groupexpr);
>>
>> That is, one looks at the group expression, while the other look at vars
>> extracted from it by pull_var_clause(). Apparently for PlaceHolderVar
>> this can differ, causing the crash.
>>
>> So we need to change one of those places - my fix tweaked the second
>> place to also look at the vars, but maybe we should change the other
>> place? Or maybe it's not the right fix for PlaceHolderVars ...
>>
> 
> I think that it doesn't make any difference which place is changed.
> 
> This is a case of an expression with no stats. With your change,
> you'll get a single GroupExprInfo containing a list of
> VariableStatData's for each of it's Var's, whereas if you changed it
> the other way, you'd get a separate GroupExprInfo for each Var. But I
> think they'd both end up being treated the same by
> estimate_multivariate_ndistinct(), since there wouldn't be any stats
> matching the expression, only the individual Var's. Maybe changing the
> first place would be the more bulletproof fix though.
> 

Yeah, I think that's true. I'll do a bit more research / experiments.

As for the changes proposed in the create_statistics, do we really want
to use univariate / multivariate there? Yes, the terms are correct, but
I'm not sure how many people looking at CREATE STATISTICS will
understand them.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Justin Pryzby
On Wed, Mar 24, 2021 at 01:24:46AM -0500, Justin Pryzby wrote:
> It seems like you're preferring to use pluralized "statistics" in a lot of
> places that sound wrong to me.  For example:
> > Currently the first statistics wins, which seems silly.
> I can write more separately, but I think this is resolved and clarified if you
> write "statistics object" and not just "statistics".  

In HEAD:catalogs.sgml, pg_statistic_ext (the table) says "object":
|Name of the statistics object
|Owner of the statistics object
|An array of attribute numbers, indicating which table columns are covered by 
this statistics object;

But pg_stats_ext (the view) doesn't say "object", which sounds wrong:
|Name of extended statistics
|Owner of the extended statistics
|Names of the columns the extended statistics is defined on

Other pre-existing issues: should be singular "statistic":
doc/src/sgml/perform.sgml: Another type of statistics stored for each 
column are most-common value
doc/src/sgml/ref/psql-ref.sgml:The status of each kind of extended 
statistics is shown in a column

Pre-existing issues: doesn't say "object" but I think it should:
src/backend/commands/statscmds.c:
errmsg("statistics creation on system columns is not supported")));
src/backend/commands/statscmds.c:
errmsg("cannot have more than %d columns in statistics",
src/backend/commands/statscmds.c:* If we got here and the OID is not 
valid, it means the statistics does
src/backend/commands/statscmds.c: * Select a nonconflicting name for a new 
statistics.
src/backend/commands/statscmds.c: * Generate "name2" for a new statistics given 
the list of column names for it
src/backend/statistics/extended_stats.c:/* compute statistics 
target for this statistics */
src/backend/statistics/extended_stats.c: * attributes the statistics is defined 
on, and then the default statistics
src/backend/statistics/mcv.c: * The input is the OID of the statistics, and 
there are no rows returned if

should say "for a statistics object" or "for statistics objects"
src/backend/statistics/extended_stats.c: * target for a statistics objects 
(from the object target, attribute targets

Your patch adds these:

Should say "object":
+* Check if we actually have a matching statistics for the expression.  

   
+   /* evaluate expressions (if the statistics has any) */  

   
+* for the extended statistics. The second option seems more 
reasonable. 
  
+* the statistics had all options enabled on the original 
version.
 
+* But if the statistics is defined on just a single column, it 
has to  
   
+   /* has the statistics expressions? */   

   
+   /* expression - see if it's in the statistics */

   
+* column(s) the statistics depends on. 
 Also require all   
   
+* statistics is defined on more than one column/expression).   

   
+* statistics is useless, but harmless).

   
+* If there are no simply-referenced columns, give the statistics an 
auto
  


+* Then the first statistics matches no expressions and 
3 vars, 
  

Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Tomas Vondra


On 3/24/21 7:24 AM, Justin Pryzby wrote:
> Most importantly, it looks like this forgets to update catalog documentation
> for stxexprs and stxkind='e'
> 

Good catch.

> It seems like you're preferring to use pluralized "statistics" in a lot of
> places that sound wrong to me.  For example:
>> Currently the first statistics wins, which seems silly.
> I can write more separately, but I think this is resolved and clarified if you
> write "statistics object" and not just "statistics".  
> 

OK "statistics object" seems better and more consistent.

>> +   Name of schema containing table
> 
> I don't know about the nearby descriptions, but this one sounds too much like 
> a
> "schema-containing" table.  Say "Name of the schema which contains the table" 
> ?
> 

I think the current spelling is OK / consistent with the other catalogs.

>> +   Name of table
> 
> Say "name of table on which the extended statistics are defined"
> 

I've used "Name of table the statistics object is defined on".

>> +   Name of extended statistics
> 
> "Name of the extended statistic object"
> 
>> +   Owner of the extended statistics
> 
> ..object
> 

OK

>> +   Expression the extended statistics is defined on
> 
> I think it should say "the extended statistic", or "the extended statistics
> object".  Maybe "..on which the extended statistic is defined"
> 

OK

>> +   of random access to the disk.  (This expression is null if the 
>> expression
>> +   data type does not have a  operator.)
> 
> expression's data type
> 

OK

>> +   much-too-small row count estimate in the first two queries. Moreover, the
> 
> maybe say "dramatically underestimates the rowcount"
> 

I've changed this to "... results in a significant underestimate of row
count".

>> +   planner has no information about relationship between the expressions, 
>> so it
> 
> the relationship
> 

OK

>> +   assumes the two WHERE and GROUP BY
>> +   conditions are independent, and multiplies their selectivities together 
>> to
>> +   arrive at a much-too-high group count estimate in the aggregate query.
> 
> severe overestimate ?
> 

OK

>> +   This is further exacerbated by the lack of accurate statistics for the
>> +   expressions, forcing the planner to use default ndistinct estimate for 
>> the
> 
> use *a default
> 

OK

>> +   expression derived from ndistinct for the column. With such statistics, 
>> the
>> +   planner recognizes that the conditions are correlated and arrives at much
>> +   more accurate estimates.
> 
> are correlated comma
> 

OK

>> +if (type->lt_opr == InvalidOid)
> 
> These could be !OidIsValid
> 

Maybe, but it's like this already. I'll leave this alone and then
fix/backpatch separately.

>> + * expressions. It's either expensive or very easy to defeat for
>> + * determined used, and there's no risk if we allow such statistics (the
>> + * statistics is useless, but harmless).
> 
> I think it's meant to say "for a determined user" ?
> 

Right.

>> + * If there are no simply-referenced columns, give the statistics an 
>> auto
>> + * dependency on the whole table.  In most cases, this will be 
>> redundant,
>> + * but it might not be if the statistics expressions contain no Vars
>> + * (which might seem strange but possible).
>> + */
>> +if (!nattnums)
>> +{
>> +ObjectAddressSet(parentobject, RelationRelationId, relid);
>> +recordDependencyOn(, , DEPENDENCY_AUTO);
>> +}
> 
> Can this be unconditional ?
> 

What would be the benefit? This behavior copied from index_create, so
I'd prefer keeping it the same for consistency reason. Presumably it's
like that for some reason (a bit of cargo cult programming, I know).

>> + * Translate the array of indexs to regular attnums for the dependency 
>> (we
> 
> sp: indexes
> 

OK

>> + * Not found a matching expression, so 
>> we can simply skip
> 
> Found no matching expr
> 

OK

>> +/* if found a matching, */
> 
> matching ..
> 

Matching dependency.

>> +examine_attribute(Node *expr)
> 
> Maybe you should rename this to something distinct ?  So it's easy to add a
> breakpoint there, for example.
> 

What would be a better name? It's not difficult to add a breakpoint
using line number, for example.

>> +stats->anl_context = CurrentMemoryContext;  /* XXX should be using
>> +
>>  * something else? */
> 
>> +boolnulls[Natts_pg_statistic];
> ...
>> + * Construct a new pg_statistic tuple
>> + */
>> +for (i = 0; i < Natts_pg_statistic; ++i)
>> +{
>> +nulls[i] = false;
>> +}
> 
> Shouldn't you just write nulls[Natts_pg_statistic] = {false};
> or at least: memset(nulls, 0, sizeof(nulls));
> 

Maybe, but it's a copy of what 

Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Dean Rasheed
On Wed, 24 Mar 2021 at 14:48, Tomas Vondra
 wrote:
>
> AFAIK the primary issue here is that the two places disagree. While
> estimate_num_groups does this
>
> varnos = pull_varnos(root, (Node *) varshere);
> if (bms_membership(varnos) == BMS_SINGLETON)
> { ... }
>
> the add_unique_group_expr does this
>
> varnos = pull_varnos(root, (Node *) groupexpr);
>
> That is, one looks at the group expression, while the other look at vars
> extracted from it by pull_var_clause(). Apparently for PlaceHolderVar
> this can differ, causing the crash.
>
> So we need to change one of those places - my fix tweaked the second
> place to also look at the vars, but maybe we should change the other
> place? Or maybe it's not the right fix for PlaceHolderVars ...
>

I think that it doesn't make any difference which place is changed.

This is a case of an expression with no stats. With your change,
you'll get a single GroupExprInfo containing a list of
VariableStatData's for each of it's Var's, whereas if you changed it
the other way, you'd get a separate GroupExprInfo for each Var. But I
think they'd both end up being treated the same by
estimate_multivariate_ndistinct(), since there wouldn't be any stats
matching the expression, only the individual Var's. Maybe changing the
first place would be the more bulletproof fix though.

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Tomas Vondra
On 3/24/21 2:36 PM, Dean Rasheed wrote:
> On Wed, 24 Mar 2021 at 10:22, Tomas Vondra
>  wrote:
>>
>> Thanks, it seems to be some thinko in handling in PlaceHolderVars, which
>> seem to break the code's assumptions about varnos. This fixes it for me,
>> but I need to look at it more closely.
>>
> 
> I think that makes sense.
> 

AFAIK the primary issue here is that the two places disagree. While
estimate_num_groups does this

varnos = pull_varnos(root, (Node *) varshere);
if (bms_membership(varnos) == BMS_SINGLETON)
{ ... }

the add_unique_group_expr does this

varnos = pull_varnos(root, (Node *) groupexpr);

That is, one looks at the group expression, while the other look at vars
extracted from it by pull_var_clause(). Apparently for PlaceHolderVar
this can differ, causing the crash.

So we need to change one of those places - my fix tweaked the second
place to also look at the vars, but maybe we should change the other
place? Or maybe it's not the right fix for PlaceHolderVars ...

> Reviewing the docs, I noticed a couple of omissions, and had a few
> other suggestions (attached).
> 

Thanks! I'll include that in the next version of the patch.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Dean Rasheed
On Wed, 24 Mar 2021 at 10:22, Tomas Vondra
 wrote:
>
> Thanks, it seems to be some thinko in handling in PlaceHolderVars, which
> seem to break the code's assumptions about varnos. This fixes it for me,
> but I need to look at it more closely.
>

I think that makes sense.

Reviewing the docs, I noticed a couple of omissions, and had a few
other suggestions (attached).

Regards,
Dean
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
new file mode 100644
index dadca67..382cbd7
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7377,6 +7377,15 @@ SCRAM-SHA-256$iteration
m for most common values (MCV) list statistics
   
  
+
+ 
+  
+   stxexprs pg_node_tree
+  
+  
+   A list of any expressions covered by this statistics object.
+  
+ 
 

   
@@ -7474,6 +7483,16 @@ SCRAM-SHA-256$iteration
pg_mcv_list type
   
  
+
+ 
+  
+   stxdexpr pg_statistic[]
+  
+  
+   Per-expression statistics, serialized as an array of
+   pg_statistic type
+  
+ 
 

   
@@ -12843,7 +12862,8 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_p
 
   
The view pg_stats_ext provides access to
-   the information stored in the pg_statistic_ext
and pg_statistic_ext_data
catalogs.  This view allows access only to rows of
@@ -12930,7 +12950,16 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_p
(references pg_attribute.attname)
   
   
-   Names of the columns the extended statistics is defined on
+   Names of the columns included in the extended statistics
+  
+ 
+
+ 
+  
+   exprs text[]
+  
+  
+   Expressions included in the extended statistics
   
  
 
@@ -13033,7 +13062,8 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_p
 
   
The view pg_stats_ext_exprs provides access to
-   the information stored in the pg_statistic_ext
and pg_statistic_ext_data
catalogs.  This view allows access only to rows of
@@ -13119,7 +13149,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_p
expr text
   
   
-   Expression the extended statistics is defined on
+   Expression included in the extended statistics
   
  
 
diff --git a/doc/src/sgml/ref/create_statistics.sgml b/doc/src/sgml/ref/create_statistics.sgml
new file mode 100644
index 5f3aefd..f561599
--- a/doc/src/sgml/ref/create_statistics.sgml
+++ b/doc/src/sgml/ref/create_statistics.sgml
@@ -27,7 +27,7 @@ CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
 [ ( statistics_kind [, ... ] ) ]
-ON { column_name | ( expression ) } [, ...]
+ON { column_name | ( expression ) }, { column_name | ( expression ) } [, ...]
 FROM table_name
 
 
@@ -45,12 +45,15 @@ CREATE STATISTICS [ IF NOT EXISTS ] 
The CREATE STATISTICS command has two basic forms. The
-   simple variant allows building statistics for a single expression, does
-   not allow specifying any statistics kinds and provides benefits similar
-   to an expression index. The full variant allows defining statistics objects
-   on multiple columns and expressions, and selecting which statistics kinds will
-   be built. The per-expression statistics are built automatically when there
-   is at least one expression.
+   first form allows univariate statistics for a single expression to be
+   collected, providing benefits similar to an expression index without the
+   overhead of index maintenance.  This form does not allow the statistics
+   kind to be specified, since the various statistics kinds refer only to
+   multivariate statistics.  The second form of the command allows
+   multivariate statistics on multiple columns and/or expressions to be
+   collected, optionally specifying which statistics kinds to include.  This
+   form will also automatically cause univariate statistics to be collected on
+   any expressions included in the list.
   
 
   
@@ -93,16 +96,16 @@ CREATE STATISTICS [ IF NOT EXISTS ] statistics_kind
 
  
-  A statistics kind to be computed in this statistics object.
+  A multivariate statistics kind to be computed in this statistics object.
   Currently supported kinds are
   ndistinct, which enables n-distinct statistics,
   dependencies, which enables functional
   dependency statistics, and mcv which enables
   most-common values lists.
   If this clause is omitted, all supported statistics kinds are
-  included in the statistics object. Expression statistics are built
-  automatically when the statistics definition includes complex
-  expressions and not just simple column references.
+  included in the statistics object. Univariate expression statistics are
+  built automatically if the statistics definition includes any complex
+  expressions rather than just simple column references.
   For more information, see 
   and .
  
@@ -114,8 +117,9 @@ CREATE STATISTICS [ IF NOT EXISTS ] 
 

Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Justin Pryzby
On Wed, Mar 24, 2021 at 09:54:22AM +0100, Tomas Vondra wrote:
> Hi Justin,
> 
> Unfortunately the query is incomplete, so I can't quite determine what
> went wrong. Can you extract the full query causing the crash, either
> from the server log or from a core file?

Oh, shoot, I didn't realize it was truncated, and I already destroyed the core
and moved on to something else...

But this fails well enough, and may be much shorter than the original :)

select distinct
 pg_catalog.variance(
   cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over 
(partition by subq_0.c2 order by subq_0.c0) as c0,
   subq_0.c2 as c1, subq_0.c0 as c2, subq_0.c2 as c3, subq_0.c1 as c4, 
subq_0.c1 as c5, subq_0.c0 as c6
 from
   (select
 ref_1.foreign_server_catalog as c0,
 ref_1.authorization_identifier as c1,
 sample_2.tgname as c2,
 ref_1.foreign_server_catalog as c3
   from
 pg_catalog.pg_stat_database_conflicts as ref_0
 left join information_schema._pg_user_mappings as ref_1
 on (ref_0.datname < ref_0.datname)
   inner join pg_catalog.pg_amproc as sample_0 tablesample system 
(5)
   on (cast(null as uuid) < cast(null as uuid))
 left join pg_catalog.pg_aggregate as sample_1 tablesample system 
(2.9)
 on (sample_0.amprocnum = sample_1.aggnumdirectargs )
   inner join pg_catalog.pg_trigger as sample_2 tablesample system (1) 
on true )subq_0;

TRAP: FailedAssertion("bms_num_members(varnos) == 1", File: "selfuncs.c", Line: 
3332, PID: 16422)

Also ... with this patch CREATE STATISTIC is no longer rejecting multiple
tables, and instead does this:

postgres=# CREATE STATISTICS xt ON a FROM t JOIN t ON true;
ERROR:  schema "i" does not exist

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Tomas Vondra
Hi Justin,

Unfortunately the query is incomplete, so I can't quite determine what
went wrong. Can you extract the full query causing the crash, either
from the server log or from a core file?

thanks

On 3/24/21 9:14 AM, Justin Pryzby wrote:
> I got this crash running sqlsmith:
> 
> #1  0x7f907574b801 in __GI_abort () at abort.c:79
> #2  0x5646b95a35f8 in ExceptionalCondition 
> (conditionName=conditionName@entry=0x5646b97411db "bms_num_members(varnos) == 
> 1", errorType=errorType@entry=0x5646b95fa00b "FailedAssertion", 
> fileName=fileName@entry=0x5646b9739dbe "selfuncs.c", 
> lineNumber=lineNumber@entry=3332) at assert.c:69
> #3  0x5646b955c9a1 in add_unique_group_expr (vars=0x5646bbd9e200, 
> expr=0x5646b9eb0c30, exprinfos=0x5646bbd9e100, root=0x5646ba9a0cb0) at 
> selfuncs.c:3332
> #4  add_unique_group_expr (root=0x5646ba9a0cb0, exprinfos=0x5646bbd9e100, 
> expr=0x5646b9eb0c30, vars=0x5646bbd9e200) at selfuncs.c:3307
> #5  0x5646b955d560 in estimate_num_groups () at selfuncs.c:3558
> #6  0x5646b93ad004 in create_distinct_paths (input_rel=, 
> root=0x5646ba9a0cb0) at planner.c:4808
> #7  grouping_planner () at planner.c:2238
> #8  0x5646b93ae0ef in subquery_planner (glob=glob@entry=0x5646ba9a0b98, 
> parse=parse@entry=0x5646ba905d80, parent_root=parent_root@entry=0x0, 
> hasRecursion=hasRecursion@entry=false, tuple_fraction=tuple_fraction@entry=0)
> at planner.c:1024
> #9  0x5646b93af543 in standard_planner (parse=0x5646ba905d80, 
> query_string=, cursorOptions=256, boundParams=) 
> at planner.c:404
> #10 0x5646b94873ac in pg_plan_query (querytree=0x5646ba905d80, 
> query_string=0x5646b9cd87e0 "select distinct \n  \n
> pg_catalog.variance(\n  
> cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over 
> (partition by subq_0.c2 order by subq_0.c0) as c0, \n  subq_0.c2 as c1, \n  
> sub"..., cursorOptions=256, boundParams=0x0) at postgres.c:821
> #11 0x5646b94874a1 in pg_plan_queries (querytrees=0x5646baaba250, 
> query_string=query_string@entry=0x5646b9cd87e0 "select distinct \n  \n
> pg_catalog.variance(\n  
> cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over 
> (partition by subq_0.c2 order by subq_0.c0) as c0, \n  subq_0.c2 as c1, \n  
> sub"..., cursorOptions=cursorOptions@entry=256, 
> boundParams=boundParams@entry=0x0) at postgres.c:912
> #12 0x5646b9487888 in exec_simple_query () at postgres.c:1104
> 
> 2021-03-24 03:06:12.489 CDT postmaster[11653] LOG:  server process (PID 
> 11696) was terminated by signal 6: Aborted
> 2021-03-24 03:06:12.489 CDT postmaster[11653] DETAIL:  Failed process was 
> running: select distinct 
>   
> pg_catalog.variance(
>   cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as 
> int8)) over (partition by subq_0.c2 order by subq_0.c0) as c0, 
>   subq_0.c2 as c1, 
>   subq_0.c0 as c2, 
>   subq_0.c2 as c3, 
>   subq_0.c1 as c4, 
>   subq_0.c1 as c5, 
>   subq_0.c0 as c6
> from 
>   (select  
> ref_1.foreign_server_catalog as c0, 
> ref_1.authorization_identifier as c1, 
> sample_2.tgname as c2, 
> ref_1.foreign_server_catalog as c3
>   from 
> pg_catalog.pg_stat_database_conflicts as ref_0
> left join information_schema._pg_user_mappings as 
> ref_1
> on (ref_0.datname < ref_0.datname)
>   inner join pg_catalog.pg_amproc as sample_0 tablesample 
> system (5) 
>   on (cast(null as uuid) < cast(null as uuid))
> left join pg_catalog.pg_aggregate as sample_1 tablesample 
> system (2.9) 
> on (sample_0.amprocnum = sample_1.aggnumdirectargs )
>   inner join pg_catalog.pg_trigger as sample_2 tablesampl
> 

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Justin Pryzby
I got this crash running sqlsmith:

#1  0x7f907574b801 in __GI_abort () at abort.c:79
#2  0x5646b95a35f8 in ExceptionalCondition 
(conditionName=conditionName@entry=0x5646b97411db "bms_num_members(varnos) == 
1", errorType=errorType@entry=0x5646b95fa00b "FailedAssertion", 
fileName=fileName@entry=0x5646b9739dbe "selfuncs.c", 
lineNumber=lineNumber@entry=3332) at assert.c:69
#3  0x5646b955c9a1 in add_unique_group_expr (vars=0x5646bbd9e200, 
expr=0x5646b9eb0c30, exprinfos=0x5646bbd9e100, root=0x5646ba9a0cb0) at 
selfuncs.c:3332
#4  add_unique_group_expr (root=0x5646ba9a0cb0, exprinfos=0x5646bbd9e100, 
expr=0x5646b9eb0c30, vars=0x5646bbd9e200) at selfuncs.c:3307
#5  0x5646b955d560 in estimate_num_groups () at selfuncs.c:3558
#6  0x5646b93ad004 in create_distinct_paths (input_rel=, 
root=0x5646ba9a0cb0) at planner.c:4808
#7  grouping_planner () at planner.c:2238
#8  0x5646b93ae0ef in subquery_planner (glob=glob@entry=0x5646ba9a0b98, 
parse=parse@entry=0x5646ba905d80, parent_root=parent_root@entry=0x0, 
hasRecursion=hasRecursion@entry=false, tuple_fraction=tuple_fraction@entry=0)
at planner.c:1024
#9  0x5646b93af543 in standard_planner (parse=0x5646ba905d80, 
query_string=, cursorOptions=256, boundParams=) 
at planner.c:404
#10 0x5646b94873ac in pg_plan_query (querytree=0x5646ba905d80, 
query_string=0x5646b9cd87e0 "select distinct \n  \n
pg_catalog.variance(\n  
cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over 
(partition by subq_0.c2 order by subq_0.c0) as c0, \n  subq_0.c2 as c1, \n  
sub"..., cursorOptions=256, boundParams=0x0) at postgres.c:821
#11 0x5646b94874a1 in pg_plan_queries (querytrees=0x5646baaba250, 
query_string=query_string@entry=0x5646b9cd87e0 "select distinct \n  \n
pg_catalog.variance(\n  
cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over 
(partition by subq_0.c2 order by subq_0.c0) as c0, \n  subq_0.c2 as c1, \n  
sub"..., cursorOptions=cursorOptions@entry=256, 
boundParams=boundParams@entry=0x0) at postgres.c:912
#12 0x5646b9487888 in exec_simple_query () at postgres.c:1104

2021-03-24 03:06:12.489 CDT postmaster[11653] LOG:  server process (PID 11696) 
was terminated by signal 6: Aborted
2021-03-24 03:06:12.489 CDT postmaster[11653] DETAIL:  Failed process was 
running: select distinct 
  
pg_catalog.variance(
  cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as 
int8)) over (partition by subq_0.c2 order by subq_0.c0) as c0, 
  subq_0.c2 as c1, 
  subq_0.c0 as c2, 
  subq_0.c2 as c3, 
  subq_0.c1 as c4, 
  subq_0.c1 as c5, 
  subq_0.c0 as c6
from 
  (select  
ref_1.foreign_server_catalog as c0, 
ref_1.authorization_identifier as c1, 
sample_2.tgname as c2, 
ref_1.foreign_server_catalog as c3
  from 
pg_catalog.pg_stat_database_conflicts as ref_0
left join information_schema._pg_user_mappings as ref_1
on (ref_0.datname < ref_0.datname)
  inner join pg_catalog.pg_amproc as sample_0 tablesample 
system (5) 
  on (cast(null as uuid) < cast(null as uuid))
left join pg_catalog.pg_aggregate as sample_1 tablesample 
system (2.9) 
on (sample_0.amprocnum = sample_1.aggnumdirectargs )
  inner join pg_catalog.pg_trigger as sample_2 tablesampl




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Justin Pryzby
Most importantly, it looks like this forgets to update catalog documentation
for stxexprs and stxkind='e'

It seems like you're preferring to use pluralized "statistics" in a lot of
places that sound wrong to me.  For example:
> Currently the first statistics wins, which seems silly.
I can write more separately, but I think this is resolved and clarified if you
write "statistics object" and not just "statistics".  

> +   Name of schema containing table

I don't know about the nearby descriptions, but this one sounds too much like a
"schema-containing" table.  Say "Name of the schema which contains the table" ?

> +   Name of table

Say "name of table on which the extended statistics are defined"

> +   Name of extended statistics

"Name of the extended statistic object"

> +   Owner of the extended statistics

..object

> +   Expression the extended statistics is defined on

I think it should say "the extended statistic", or "the extended statistics
object".  Maybe "..on which the extended statistic is defined"

> +   of random access to the disk.  (This expression is null if the 
> expression
> +   data type does not have a  operator.)

expression's data type

> +   much-too-small row count estimate in the first two queries. Moreover, the

maybe say "dramatically underestimates the rowcount"

> +   planner has no information about relationship between the expressions, so 
> it

the relationship

> +   assumes the two WHERE and GROUP BY
> +   conditions are independent, and multiplies their selectivities together to
> +   arrive at a much-too-high group count estimate in the aggregate query.

severe overestimate ?

> +   This is further exacerbated by the lack of accurate statistics for the
> +   expressions, forcing the planner to use default ndistinct estimate for the

use *a default

> +   expression derived from ndistinct for the column. With such statistics, 
> the
> +   planner recognizes that the conditions are correlated and arrives at much
> +   more accurate estimates.

are correlated comma

> + if (type->lt_opr == InvalidOid)

These could be !OidIsValid

> +  * expressions. It's either expensive or very easy to defeat for
> +  * determined used, and there's no risk if we allow such statistics (the
> +  * statistics is useless, but harmless).

I think it's meant to say "for a determined user" ?

> +  * If there are no simply-referenced columns, give the statistics an 
> auto
> +  * dependency on the whole table.  In most cases, this will be 
> redundant,
> +  * but it might not be if the statistics expressions contain no Vars
> +  * (which might seem strange but possible).
> +  */
> + if (!nattnums)
> + {
> + ObjectAddressSet(parentobject, RelationRelationId, relid);
> + recordDependencyOn(, , DEPENDENCY_AUTO);
> + }

Can this be unconditional ?

> +  * Translate the array of indexs to regular attnums for the dependency 
> (we

sp: indexes

> +  * Not found a matching expression, so 
> we can simply skip

Found no matching expr

> + /* if found a matching, */

matching ..

> +examine_attribute(Node *expr)

Maybe you should rename this to something distinct ?  So it's easy to add a
breakpoint there, for example.

> + stats->anl_context = CurrentMemoryContext;  /* XXX should be using
> + 
>  * something else? */

> + boolnulls[Natts_pg_statistic];
...
> +  * Construct a new pg_statistic tuple
> +  */
> + for (i = 0; i < Natts_pg_statistic; ++i)
> + {
> + nulls[i] = false;
> + }

Shouldn't you just write nulls[Natts_pg_statistic] = {false};
or at least: memset(nulls, 0, sizeof(nulls));

> +  * We don't store collations used to build the 
> statistics, but
> +  * we can use the collation for the attribute 
> itself, as
> +  * stored in varcollid. We do reset the 
> statistics after a
> +  * type change (including collation change), so 
> this is OK. We
> +  * may need to relax this after allowing 
> extended statistics
> +  * on expressions.

This text should be updated or removed ?

> @@ -2705,7 +2705,108 @@ describeOneTableDetails(const char *schemaname,
>   }
>  
>   /* print any extended statistics */
> - if (pset.sversion >= 10)
> + if (pset.sversion >= 14)
> + {
> + printfPQExpBuffer(,
> +   "SELECT oid, "
> +   
> "stxrelid::pg_catalog.regclass, "
> +

Re: PoC/WIP: Extended statistics on expressions

2021-03-18 Thread Dean Rasheed
On Wed, 17 Mar 2021 at 21:31, Tomas Vondra
 wrote:
>
> I agree applying at least the [(a+b),c] stats is probably the right
> approach, as it means we're considering at least the available
> information about dependence between the columns.
>
> I think to improve this, we'll need to teach the code to use overlapping
> statistics, a bit like conditional probability. In this case we might do
> something like this:
>
> ndistinct((a+b),c) * (ndistinct((c+d)) / ndistinct(c))

Yes, I was thinking the same thing. That would be equivalent to
applying a multiplicative "correction" factor of

  ndistinct(a,b,c,...) / ( ndistinct(a) * ndistinct(b) * ndistinct(c) * ... )

for each multivariate stat applicable to more than one
column/expression, regardless of whether those columns were already
covered by other multivariate stats. That might well simplify the
implementation, as well as probably produce better estimates.

> But that's clearly a matter for a future patch, and I'm sure there are
> cases where this will produce worse estimates.

Agreed.

> Anyway, I plan to go over the patches one more time, and start pushing
> them sometime early next week. I don't want to leave it until the very
> last moment in the CF.

+1. I think they're in good enough shape for that process to start.

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2021-03-17 Thread Tomas Vondra



On 3/17/21 9:58 PM, Dean Rasheed wrote:
> On Wed, 17 Mar 2021 at 20:48, Dean Rasheed  wrote:
>>
>> For reference, here is the test case I was using (which isn't really very 
>> good for
>> catching dependence between columns):
>>
> 
> And here's a test case with much more dependence between the columns:
> 
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo (a int, b int, c int, d int);
> INSERT INTO foo SELECT x%2, x%5, x%10, x%15 FROM generate_series(1,10) x;
> SELECT COUNT(DISTINCT a) FROM foo; -- 2
> SELECT COUNT(DISTINCT b) FROM foo; -- 5
> SELECT COUNT(DISTINCT c) FROM foo; -- 10
> SELECT COUNT(DISTINCT d) FROM foo; -- 15
> SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 6
> SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 20
> SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 10
> SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 30
> 
> -- First case: stats on [(a+b),c]
> CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo;
> ANALYSE foo;
> EXPLAIN ANALYSE
> SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
>   -- Estimate = 150, Actual = 30
>   -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 10*15,
>   -- which is much better than ndistinct((a+b)) * ndistinct(c) *
> ndistinct(d) = 6*10*15 = 900
>   -- Estimate with no stats = 1500
> 
> -- Second case: stats on (c+d) as well
> CREATE STATISTICS s2 ON (c+d) FROM foo;
> ANALYSE foo;
> EXPLAIN ANALYSE
> SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
>   -- Estimate = 120, Actual = 30
>   -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 6*20
> 
> Again, I'd say the current behaviour is pretty good.
> 

Thanks!

I agree applying at least the [(a+b),c] stats is probably the right
approach, as it means we're considering at least the available
information about dependence between the columns.

I think to improve this, we'll need to teach the code to use overlapping
statistics, a bit like conditional probability. In this case we might do
something like this:

ndistinct((a+b),c) * (ndistinct((c+d)) / ndistinct(c))

Which in this case would be either, for the "less correlated" case

228 * 24 / 12 = 446   (actual = 478, current estimate = 480)

or, for the "more correlated" case

10 * 20 / 10 = 20 (actual = 30, current estimate = 120)

But that's clearly a matter for a future patch, and I'm sure there are
cases where this will produce worse estimates.


Anyway, I plan to go over the patches one more time, and start pushing
them sometime early next week. I don't want to leave it until the very
last moment in the CF.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-03-17 Thread Dean Rasheed
On Wed, 17 Mar 2021 at 20:48, Dean Rasheed  wrote:
>
> For reference, here is the test case I was using (which isn't really very 
> good for
> catching dependence between columns):
>

And here's a test case with much more dependence between the columns:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (a int, b int, c int, d int);
INSERT INTO foo SELECT x%2, x%5, x%10, x%15 FROM generate_series(1,10) x;
SELECT COUNT(DISTINCT a) FROM foo; -- 2
SELECT COUNT(DISTINCT b) FROM foo; -- 5
SELECT COUNT(DISTINCT c) FROM foo; -- 10
SELECT COUNT(DISTINCT d) FROM foo; -- 15
SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 6
SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 20
SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 10
SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 30

-- First case: stats on [(a+b),c]
CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
  -- Estimate = 150, Actual = 30
  -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 10*15,
  -- which is much better than ndistinct((a+b)) * ndistinct(c) *
ndistinct(d) = 6*10*15 = 900
  -- Estimate with no stats = 1500

-- Second case: stats on (c+d) as well
CREATE STATISTICS s2 ON (c+d) FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
  -- Estimate = 120, Actual = 30
  -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 6*20

Again, I'd say the current behaviour is pretty good.

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2021-03-17 Thread Dean Rasheed
On Wed, 17 Mar 2021 at 19:07, Tomas Vondra
 wrote:
>
> On 3/17/21 7:54 PM, Dean Rasheed wrote:
> >
> > it might have been better to estimate the first case as
> >
> >  ndistinct((a+b)) * ndistinct(c) * ndistinct(d)
> >
> > and the second case as
> >
> >  ndistinct((a+b)) * ndistinct((c+d))
>
> OK. I might be confused, but isn't that what the algorithm currently
> does? Or am I just confused about what the first/second case refers to?
>

No, it currently estimates the first case as ndistinct((a+b),c) *
ndistinct(d). Having said that, maybe that's OK after all. It at least
makes an effort to account for any correlation between (a+b) and
(c+d), using the known correlation between (a+b) and c. For reference,
here is the test case I was using (which isn't really very good for
catching dependence between columns):

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (a int, b int, c int, d int);
INSERT INTO foo SELECT x%10, x%11, x%12, x%13 FROM generate_series(1,10) x;
SELECT COUNT(DISTINCT a) FROM foo; -- 10
SELECT COUNT(DISTINCT b) FROM foo; -- 11
SELECT COUNT(DISTINCT c) FROM foo; -- 12
SELECT COUNT(DISTINCT d) FROM foo; -- 13
SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 20
SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 24
SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 228
SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 478

-- First case: stats on [(a+b),c]
CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
  -- Estimate = 2964, Actual = 478
  -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 228*13

-- Second case: stats on (c+d) as well
CREATE STATISTICS s2 ON (c+d) FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
  -- Estimate = 480, Actual = 478
  -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 20*24

I think that's probably pretty reasonable behaviour, given incomplete
stats (the estimate with no extended stats is capped at 1).

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2021-03-17 Thread Tomas Vondra



On 3/17/21 7:54 PM, Dean Rasheed wrote:
> On Wed, 17 Mar 2021 at 17:26, Tomas Vondra
>  wrote:
>>
>> My concern is that the current behavior (where we prefer expression
>> stats over multi-column stats to some extent) works fine as long as the
>> parts are independent, but once there's dependency it's probably more
>> likely to produce underestimates. I think underestimates for grouping
>> estimates were a risk in the past, so let's not make that worse.
>>
> 
> I'm not sure the current behaviour really is preferring expression
> stats over multi-column stats. In this example, where we're grouping
> by (a+b), (c+d) and have stats on [(a+b),c] and (c+d), neither of
> those multi-column stats actually match more than one
> column/expression. If anything, I'd go the other way and say that it
> was wrong to use the [(a+b),c] stats in the first case, where they
> were the only stats available, since those stats aren't really
> applicable to (c+d), which probably ought to be treated as
> independent. IOW, it might have been better to estimate the first case
> as
> 
>  ndistinct((a+b)) * ndistinct(c) * ndistinct(d)
> 
> and the second case as
> 
>  ndistinct((a+b)) * ndistinct((c+d))
> 

OK. I might be confused, but isn't that what the algorithm currently
does? Or am I just confused about what the first/second case refers to?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-03-17 Thread Dean Rasheed
On Wed, 17 Mar 2021 at 17:26, Tomas Vondra
 wrote:
>
> My concern is that the current behavior (where we prefer expression
> stats over multi-column stats to some extent) works fine as long as the
> parts are independent, but once there's dependency it's probably more
> likely to produce underestimates. I think underestimates for grouping
> estimates were a risk in the past, so let's not make that worse.
>

I'm not sure the current behaviour really is preferring expression
stats over multi-column stats. In this example, where we're grouping
by (a+b), (c+d) and have stats on [(a+b),c] and (c+d), neither of
those multi-column stats actually match more than one
column/expression. If anything, I'd go the other way and say that it
was wrong to use the [(a+b),c] stats in the first case, where they
were the only stats available, since those stats aren't really
applicable to (c+d), which probably ought to be treated as
independent. IOW, it might have been better to estimate the first case
as

 ndistinct((a+b)) * ndistinct(c) * ndistinct(d)

and the second case as

 ndistinct((a+b)) * ndistinct((c+d))

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2021-03-17 Thread Tomas Vondra
Hi,

On 3/17/21 4:55 PM, Dean Rasheed wrote:
> On Sun, 7 Mar 2021 at 21:10, Tomas Vondra  
> wrote:
>>
>> 2) ndistinct
>>
>> There's one thing that's bugging me, in how we handle "partial" matches.
>> For each expression we track both the original expression and the Vars
>> we extract from it. If we can't find a statistics matching the whole
>> expression, we try to match those individual Vars, and we remove the
>> matching ones from the list. And in the end we multiply the estimates
>> for the remaining Vars.
>>
>> This works fine with one matching ndistinct statistics. Consider for example
>>
>>  GROUP BY (a+b), (c+d)
>>
>> with statistics on [(a+b),c] - that is, expression and one column.
> 
> I've just been going over this example, and I think it actually works
> slightly differently from how you described, though I haven't worked
> out the full general implications of that.
> 
>> We parse the expressions into two GroupExprInfo
>>
>>  {expr: (a+b), vars: [a, b]}
>>  {expr: (c+d), vars: [c, d]}
>>
> 
> Here, I think what you actually get, in the presence of stats on
> [(a+b),c] is actually the following two GroupExprInfos:
> 
>   {expr: (a+b), vars: []}
>   {expr: (c+d), vars: [c, d]}
> 

Yeah, right. To be precise, we get

{expr: (a+b), vars: [(a+b)]}

because in the first case we pass NIL, so add_unique_group_expr treats
the whole expression as a var (a bit strange, but OK).

> because of the code immediately after this comment in estimate_num_groups():
> 
> /*
>  * If examine_variable is able to deduce anything about the GROUP BY
>  * expression, treat it as a single variable even if it's really more
>  * complicated.
>  */
> 
> As it happens, that makes no difference in this case, where there is
> just this one stats object, but it does change things when there are
> two stats objects.
> 
>> and the statistics matches the first item exactly (the expression). The
>> second expression is not in the statistics, but we match "c". So we end
>> up with an estimate for "(a+b), c" and have one remaining GroupExprInfo:
>>
>>  {expr: (c+d), vars: [d]}
> 
> Right.
> 
>> Without any other statistics we estimate that as ndistinct for "d", so
>> we end up with
>>
>>  ndistinct((a+b), c) * ndistinct(d)
>>
>> which mostly makes sense. It assumes ndistinct(c+d) is product of the
>> ndistinct estimates, but that's kinda what we've been always doing.
> 
> Yes, that appears to be what happens, and it's probably the best that
> can be done with the available stats.
> 
>> But now consider we have another statistics on just (c+d). In the second
>> loop we end up matching this expression exactly, so we end up with
>>
>>  ndistinct((a+b), c) * ndistinct((c+d))
> 
> In this case, with stats on (c+d) as well, the two GroupExprInfos
> built at the start change to:
> 
>   {expr: (a+b), vars: []}
>   {expr: (c+d), vars: []}
> 
> so it actually ends up not using any multivariate stats, but instead
> uses the two univariate expression stats, giving
> 
>  ndistinct((a+b)) * ndistinct((c+d))
> 
> which actually seems pretty good, and gave very good estimates in the
> simple test case I tried.
> 

Yeah, that works pretty well in this case.

I wonder if we'd be better off extracting the Vars and doing what I
mistakenly described as the current behavior. That's essentially mean
extracting the Vars even in the case where we now pass NIL.

My concern is that the current behavior (where we prefer expression
stats over multi-column stats to some extent) works fine as long as the
parts are independent, but once there's dependency it's probably more
likely to produce underestimates. I think underestimates for grouping
estimates were a risk in the past, so let's not make that worse.

>> i.e. we kinda use the "c" twice. Which is a bit unfortunate. I think
>> what we should do after the first loop is just discarding the whole
>> expression and "expand" into per-variable GroupExprInfo, so in the
>> second step we would not match the (c+d) statistics.
> 
> Not using the (c+d) stats would give either
> 
>  ndistinct((a+b)) * ndistinct(c) * ndistinct(d)
> 
> or
> 
>  ndistinct((a+b), c) * ndistinct(d)
> 
> depending on exactly how the algorithm was changed. In my test, these
> both gave worse estimates, but there are probably other datasets for
> which they might do better. It all depends on how much correlation
> there is between (a+b) and c.
> 
> I suspect that there is no optimal strategy for handling overlapping
> stats that works best for all datasets, but the current algorithm
> seems to do a pretty decent job.
> 
>> Of course, maybe there's a better way to pick the statistics, but I
>> think our conclusion so far was that people should just create
>> statistics covering all the columns in the query, to not have to match
>> multiple statistics like this.
> 
> Yes, I think that's always likely to work better, especially for
> ndistinct stats, where 

Re: PoC/WIP: Extended statistics on expressions

2021-03-17 Thread Dean Rasheed
On Sun, 7 Mar 2021 at 21:10, Tomas Vondra  wrote:
>
> 2) ndistinct
>
> There's one thing that's bugging me, in how we handle "partial" matches.
> For each expression we track both the original expression and the Vars
> we extract from it. If we can't find a statistics matching the whole
> expression, we try to match those individual Vars, and we remove the
> matching ones from the list. And in the end we multiply the estimates
> for the remaining Vars.
>
> This works fine with one matching ndistinct statistics. Consider for example
>
>  GROUP BY (a+b), (c+d)
>
> with statistics on [(a+b),c] - that is, expression and one column.

I've just been going over this example, and I think it actually works
slightly differently from how you described, though I haven't worked
out the full general implications of that.

> We parse the expressions into two GroupExprInfo
>
>  {expr: (a+b), vars: [a, b]}
>  {expr: (c+d), vars: [c, d]}
>

Here, I think what you actually get, in the presence of stats on
[(a+b),c] is actually the following two GroupExprInfos:

  {expr: (a+b), vars: []}
  {expr: (c+d), vars: [c, d]}

because of the code immediately after this comment in estimate_num_groups():

/*
 * If examine_variable is able to deduce anything about the GROUP BY
 * expression, treat it as a single variable even if it's really more
 * complicated.
 */

As it happens, that makes no difference in this case, where there is
just this one stats object, but it does change things when there are
two stats objects.

> and the statistics matches the first item exactly (the expression). The
> second expression is not in the statistics, but we match "c". So we end
> up with an estimate for "(a+b), c" and have one remaining GroupExprInfo:
>
>  {expr: (c+d), vars: [d]}

Right.

> Without any other statistics we estimate that as ndistinct for "d", so
> we end up with
>
>  ndistinct((a+b), c) * ndistinct(d)
>
> which mostly makes sense. It assumes ndistinct(c+d) is product of the
> ndistinct estimates, but that's kinda what we've been always doing.

Yes, that appears to be what happens, and it's probably the best that
can be done with the available stats.

> But now consider we have another statistics on just (c+d). In the second
> loop we end up matching this expression exactly, so we end up with
>
>  ndistinct((a+b), c) * ndistinct((c+d))

In this case, with stats on (c+d) as well, the two GroupExprInfos
built at the start change to:

  {expr: (a+b), vars: []}
  {expr: (c+d), vars: []}

so it actually ends up not using any multivariate stats, but instead
uses the two univariate expression stats, giving

 ndistinct((a+b)) * ndistinct((c+d))

which actually seems pretty good, and gave very good estimates in the
simple test case I tried.

> i.e. we kinda use the "c" twice. Which is a bit unfortunate. I think
> what we should do after the first loop is just discarding the whole
> expression and "expand" into per-variable GroupExprInfo, so in the
> second step we would not match the (c+d) statistics.

Not using the (c+d) stats would give either

 ndistinct((a+b)) * ndistinct(c) * ndistinct(d)

or

 ndistinct((a+b), c) * ndistinct(d)

depending on exactly how the algorithm was changed. In my test, these
both gave worse estimates, but there are probably other datasets for
which they might do better. It all depends on how much correlation
there is between (a+b) and c.

I suspect that there is no optimal strategy for handling overlapping
stats that works best for all datasets, but the current algorithm
seems to do a pretty decent job.

> Of course, maybe there's a better way to pick the statistics, but I
> think our conclusion so far was that people should just create
> statistics covering all the columns in the query, to not have to match
> multiple statistics like this.

Yes, I think that's always likely to work better, especially for
ndistinct stats, where all possible permutations of subsets of the
columns are included, so a single ndistinct stat can work well for a
range of different queries.

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2021-03-05 Thread Dean Rasheed
On Thu, 4 Mar 2021 at 22:16, Tomas Vondra  wrote:
>
> Attached is a slightly improved version of the patch series, addressing
> most of the issues raised in the previous message.

Cool. Sorry for the delay replying.

> 0003-Extended-statistics-on-expressions-20210304.patch
>
> Mostly unchanged, The one improvement is removing some duplicate code in
> in mvc.c.
>
> 0004-WIP-rework-tracking-of-expressions-20210304.patch
>
> This is mostly unchanged of the patch reworking how we assign artificial
> attnums to expressions (negative instead of (MaxHeapAttributeNumber+i)).

Looks good.

I see you undid the change to get_relation_statistics() in plancat.c,
which offset the attnums of plain attributes in the StatisticExtInfo
struct. I was going to suggest that as a simplification to the
previous 0004 patch. Related to that, is this comment in
dependencies_clauselist_selectivity():

/*
 * Count matching attributes - we have to undo two attnum offsets.
 * First, the dependency is offset using the number of expressions
 * for that statistics, and then (if it's a plain attribute) we
 * need to apply the same offset as above, by unique_exprs_cnt.
 */

which needs updating, since there is now just one attnum offset, not
two. Only the unique_exprs_cnt offset is relevant now.

Also, related to that change, I don't think that
stat_covers_attributes() is needed anymore. I think that the code that
calls it can now just be reverted back to using bms_is_subset(), since
that bitmapset holds plain attributes that aren't offset.

> 0005-WIP-unify-handling-of-attributes-and-expres-20210304.patch
>
> This reworks how we build statistics on attributes and expressions.
> Instead of treating attributes and expressions separately, this  allows
> handling them uniformly.
>
> Until now, the various "build" functions (for different statistics
> kinds) extracted attribute values from sampled tuples, but expressions
> were pre-calculated in a separate array. Firstly to save CPU time (not
> having to evaluate expensive expressions repeatedly) and to keep the
> different stats consistent (there might be volatile functions etc.).
>
> So the build functions had to look at the attnum, determine if it's
> attribute or expression, and in some cases it was tricky / easy to get
> wrong.
>
> This patch replaces this "split" view with a simple "consistent"
> representation merging values from attributes and expressions, and just
> passes that to the build functions. There's no need to check the attnum,
> and handle expressions in some special way, so the build functions are
> much simpler / easier to understand (at least I think so).
>
> The build data is represented by "StatsBuildData" struct - not sure if
> there's a better name.
>
> I'm mostly happy with how this turned out. I'm sure there's a bit more
> cleanup needed (e.g. the merging/remapping of dependencies needs some
> refactoring, I think) but overall this seems reasonable.

Agreed. That's a nice improvement.

I wonder if dependency_is_compatible_expression() can be merged with
dependency_is_compatible_clause() to reduce code duplication. It
probably also ought to be possible to support "Expr IN Array" there,
in a similar way to the other code in statext_is_compatible_clause().
Also, should this check rinfo->clause_relids against the passed-in
relid to rule out clauses referencing other relations, in the same way
that statext_is_compatible_clause() does?

> I did some performance testing, I don't think there's any measurable
> performance degradation. I'm actually wondering if we need to transform
> the AttrNumber arrays into bitmaps in various places - maybe we should
> just do a plain linear search. We don't really expect many elements, as
> each statistics has 8 attnums at most. So maybe building the bitmapsets
> is a net loss? The one exception might be functional dependencies, where
> we can "merge" multiple statistics together. But even then it'd require
> many statistics objects to make a difference.

Possibly. There's a danger in trying to change too much at once
though. As it stands, I think it's fairly close to being committable,
with just a little more tidying up.

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2021-03-04 Thread Tomas Vondra
On 3/5/21 1:43 AM, Justin Pryzby wrote:
> On Mon, Jan 04, 2021 at 09:45:24AM -0600, Justin Pryzby wrote:
>> On Mon, Jan 04, 2021 at 03:34:08PM +, Dean Rasheed wrote:
>>> * I'm not sure I understand the need for 0001. Wasn't there an earlier
>>> version of this patch that just did it by re-populating the type
>>> array, but which still had it as an array rather than turning it into
>>> a list? Making it a list falsifies some of the comments and
>>> function/variable name choices in that file.
>>
>> This part is from me.
>>
>> I can review the names if it's desired , but it'd be fine to fall back to the
>> earlier patch.  I thought a pglist was cleaner, but it's not needed.
> 
> This updates the preliminary patches to address the issues Dean raised.
> 
> One advantage of using a pglist is that we can free it by calling
> list_free_deep(Typ), rather than looping to free each of its elements.
> But maybe for bootstrap.c it doesn't matter, and we can just write:
> | Typ = NULL; /* Leak the old Typ array */
> 

Thanks. I'll switch this in the next version of the patch series.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-03-04 Thread Justin Pryzby
On Mon, Jan 04, 2021 at 09:45:24AM -0600, Justin Pryzby wrote:
> On Mon, Jan 04, 2021 at 03:34:08PM +, Dean Rasheed wrote:
> > * I'm not sure I understand the need for 0001. Wasn't there an earlier
> > version of this patch that just did it by re-populating the type
> > array, but which still had it as an array rather than turning it into
> > a list? Making it a list falsifies some of the comments and
> > function/variable name choices in that file.
> 
> This part is from me.
> 
> I can review the names if it's desired , but it'd be fine to fall back to the
> earlier patch.  I thought a pglist was cleaner, but it's not needed.

This updates the preliminary patches to address the issues Dean raised.

One advantage of using a pglist is that we can free it by calling
list_free_deep(Typ), rather than looping to free each of its elements.
But maybe for bootstrap.c it doesn't matter, and we can just write:
| Typ = NULL; /* Leak the old Typ array */

-- 
Justin
>From 41ec12096cefc00484e7f2a0b3bfbc0f79cdd162 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 19 Nov 2020 20:48:48 -0600
Subject: [PATCH 1/2] bootstrap: convert Typ to a List*

---
 src/backend/bootstrap/bootstrap.c | 89 ++-
 1 file changed, 41 insertions(+), 48 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 6f615e6622..1b940d9d27 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -58,7 +58,7 @@ static void BootstrapModeMain(void);
 static void bootstrap_signals(void);
 static void ShutdownAuxiliaryProcess(int code, Datum arg);
 static Form_pg_attribute AllocateAttribute(void);
-static void populate_typ_array(void);
+static void populate_typ(void);
 static Oid	gettype(char *type);
 static void cleanup(void);
 
@@ -159,7 +159,7 @@ struct typmap
 	FormData_pg_type am_typ;
 };
 
-static struct typmap **Typ = NULL;
+static List *Typ = NIL; /* List of struct typmap* */
 static struct typmap *Ap = NULL;
 
 static Datum values[MAXATTR];	/* current row's attribute values */
@@ -595,10 +595,10 @@ boot_openrel(char *relname)
 
 	/*
 	 * pg_type must be filled before any OPEN command is executed, hence we
-	 * can now populate the Typ array if we haven't yet.
+	 * can now populate Typ if we haven't yet.
 	 */
-	if (Typ == NULL)
-		populate_typ_array();
+	if (Typ == NIL)
+		populate_typ();
 
 	if (boot_reldesc != NULL)
 		closerel(NULL);
@@ -688,7 +688,7 @@ DefineAttr(char *name, char *type, int attnum, int nullness)
 
 	typeoid = gettype(type);
 
-	if (Typ != NULL)
+	if (Typ != NIL)
 	{
 		attrtypes[attnum]->atttypid = Ap->am_oid;
 		attrtypes[attnum]->attlen = Ap->am_typ.typlen;
@@ -866,47 +866,36 @@ cleanup(void)
 }
 
 /* 
- *		populate_typ_array
+ *		populate_typ
  *
- * Load the Typ array by reading pg_type.
+ * Load Typ by reading pg_type.
  * 
  */
 static void
-populate_typ_array(void)
+populate_typ(void)
 {
 	Relation	rel;
 	TableScanDesc scan;
 	HeapTuple	tup;
-	int			nalloc;
-	int			i;
-
-	Assert(Typ == NULL);
+	MemoryContext old;
 
-	nalloc = 512;
-	Typ = (struct typmap **)
-		MemoryContextAlloc(TopMemoryContext, nalloc * sizeof(struct typmap *));
+	Assert(Typ == NIL);
 
 	rel = table_open(TypeRelationId, NoLock);
 	scan = table_beginscan_catalog(rel, 0, NULL);
-	i = 0;
+	old = MemoryContextSwitchTo(TopMemoryContext);
 	while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
 	{
 		Form_pg_type typForm = (Form_pg_type) GETSTRUCT(tup);
+		struct typmap *newtyp;
 
-		/* make sure there will be room for a trailing NULL pointer */
-		if (i >= nalloc - 1)
-		{
-			nalloc *= 2;
-			Typ = (struct typmap **)
-repalloc(Typ, nalloc * sizeof(struct typmap *));
-		}
-		Typ[i] = (struct typmap *)
-			MemoryContextAlloc(TopMemoryContext, sizeof(struct typmap));
-		Typ[i]->am_oid = typForm->oid;
-		memcpy(&(Typ[i]->am_typ), typForm, sizeof(Typ[i]->am_typ));
-		i++;
+		newtyp = (struct typmap *) palloc(sizeof(struct typmap));
+		Typ = lappend(Typ, newtyp);
+
+		newtyp->am_oid = typForm->oid;
+		memcpy(>am_typ, typForm, sizeof(newtyp->am_typ));
 	}
-	Typ[i] = NULL;/* Fill trailing NULL pointer */
+	MemoryContextSwitchTo(old);
 	table_endscan(scan);
 	table_close(rel, NoLock);
 }
@@ -916,25 +905,26 @@ populate_typ_array(void)
  *
  * NB: this is really ugly; it will return an integer index into TypInfo[],
  * and not an OID at all, until the first reference to a type not known in
- * TypInfo[].  At that point it will read and cache pg_type in the Typ array,
+ * TypInfo[].  At that point it will read and cache pg_type in Typ,
  * and subsequently return a real OID (and set the global pointer Ap to
  * point at the found row in Typ).  So caller must check whether Typ is
- * still NULL to determine what the return value is!
+ * still NIL to determine what the return value is!
  * 
  */
 static Oid
 gettype(char *type)
 {
-	if (Typ != NULL)
+	if (Typ != NIL)
 	{
-		struct typmap 

Re: PoC/WIP: Extended statistics on expressions

2021-02-16 Thread Tomas Vondra



On 1/27/21 12:02 PM, Dean Rasheed wrote:
> On Fri, 22 Jan 2021 at 03:49, Tomas Vondra
>  wrote:
>>
>> Whooops. A fixed version attached.
>>
> 
> The change to pg_stats_ext_exprs isn't quite right, because now it
> cross joins expressions and their stats, which leads to too many rows,
> with the wrong stats being listed against expressions. For example:
> 
> CREATE TABLE foo (a int, b text);
> INSERT INTO foo SELECT 1, 'xxx' FROM generate_series(1,1000);
> CREATE STATISTICS foo_s ON (a*10), upper(b) FROM foo;
> ANALYSE foo;
> 
> SELECT tablename, statistics_name, expr, most_common_vals
>   FROM pg_stats_ext_exprs;
> 
>  tablename | statistics_name |   expr   | most_common_vals
> ---+-+--+--
>  foo   | foo_s   | (a * 10) | {10}
>  foo   | foo_s   | (a * 10) | {XXX}
>  foo   | foo_s   | upper(b) | {10}
>  foo   | foo_s   | upper(b) | {XXX}
> (4 rows)
> 
> 
> More protection is still required for tables with no analysable
> columns. For example:
> 
> CREATE TABLE foo();
> CREATE STATISTICS foo_s ON (1) FROM foo;
> INSERT INTO foo SELECT FROM generate_series(1,1000);
> ANALYSE foo;
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x0090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598, attrs=0x0,
> exprs=0x216b258, nvacatts=0, vacatts=0x216cb40) at extended_stats.c:664
> 664stats[i]->tupDesc = vacatts[0]->tupDesc;
> 
> #0  0x0090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598,
> attrs=0x0, exprs=0x216b258, nvacatts=0, vacatts=0x216cb40)
> at extended_stats.c:664
> #1  0x0090da93 in BuildRelationExtStatistics (onerel=0x7f7766b37598,
> totalrows=1000, numrows=100, rows=0x216d040, natts=0,
> vacattrstats=0x216cb40) at extended_stats.c:161
> #2  0x0066ea97 in do_analyze_rel (onerel=0x7f7766b37598,
> params=0x7ffc06f7d450, va_cols=0x0,
> acquirefunc=0x66f71a , relpages=4, inh=false,
> in_outer_xact=false, elevel=13) at analyze.c:595
> 
> 
> Attached is an incremental update fixing those issues, together with a
> few more suggested improvements:
> 
> There was quite a bit of code duplication in extended_stats.c which I
> attempted to reduce by
> 
> 1). Deleting examine_opclause_expression() in favour of examine_clause_args().
> 2). Deleting examine_opclause_expression2() in favour of 
> examine_clause_args2().
> 3). Merging examine_clause_args() and examine_clause_args2(), renaming
> it examine_opclause_args() (which was actually the name it had in its
> original doc comment, despite the name in the code being different).
> 4). Merging statext_extract_expression() and
> statext_extract_expression_internal() into
> statext_is_compatible_clause() and
> statext_is_compatible_clause_internal() respectively.
> 
> That last change goes beyond just removing code duplication. It allows
> support for compound clauses that contain a mix of attribute and
> expression clauses, for example, this simple test case wasn't
> previously estimated well:
> 
> CREATE TABLE foo (a int, b int, c int);
> INSERT INTO foo SELECT x/100, x/100, x/100 FROM generate_series(1,1) g(x);
> CREATE STATISTICS foo_s on a,b,(c*c) FROM foo;
> ANALYSE foo;
> EXPLAIN ANALYSE SELECT * FROM foo WHERE a=1 AND (b=1 OR c*c=1);
> 
> I didn't add any new regression tests, but perhaps it would be worth
> adding something to test a case like that.
> 
> I changed choose_best_statistics() in a couple of ways. Firstly, I
> think it wants to only count expressions from fully covered clauses,
> just as we only count attributes if the stat covers all the attributes
> from a clause, since otherwise the stat cannot estimate the clause, so
> it shouldn't count. Secondly, I think the number of expressions in the
> stat needs to be added to it's number of keys, so that the choice of
> narrowest stat with the same number of matches counts expressions in
> the same way as attributes.
> 
> I simplified the code in statext_mcv_clauselist_selectivity(), by
> attempting to handle expressions and attributes together in the same
> way, making it much closer to the original code. I don't think that
> the check for the existence of a stat covering all the expressions in
> a clause was necessary when pre-processing the list of clauses, since
> that's checked later on, so it's enough to just detect compatible
> clauses. Also, it now checks for stats that cover both the attributes
> and the expressions from each clause, rather than one or the other, to
> cope with examples like the one above. I also updated the check for
> simple_clauses -- what's wanted there is to identify clauses that only
> reference a single column or a single expression, so that the later
> code doesn't apply multi-column estimates to it.
> 
> I'm attaching it as a incremental patch (0004) on top of your patches,
> but if 0003 and 0004 are collapsed together, the total number of diffs
> is less than 0003 alone.
> 

Thanks. All 

Re: PoC/WIP: Extended statistics on expressions

2021-01-22 Thread Tomas Vondra




On 1/22/21 5:01 AM, Justin Pryzby wrote:

On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote:

| Statistics objects:
| "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t


Umm, for me that prints:



 "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t

which I think is OK. But maybe there's something else to trigger the
problem?


Oh.  It's because I was using /usr/bin/psql and not ./src/bin/psql.
I think it's considered ok if old client's \d commands don't work on new
server, but it's not clear to me if it's ok if they misbehave.  It's almost
better it made an ERROR.



Well, how would the server know to throw an error? We can't quite patch 
the old psql (if we could, we could just tweak the query).



In any case, why are there so many parentheses ?



That's a bug in pg_get_statisticsobj_worker, probably. It shouldn't be 
adding extra parentheses, on top of what deparse_expression_pretty does. 
Will fix.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-01-22 Thread Tomas Vondra




On 1/22/21 10:00 AM, Dean Rasheed wrote:

On Fri, 22 Jan 2021 at 04:46, Justin Pryzby  wrote:


I think you'll maybe have to do something better - this seems a bit too weird:

| postgres=# CREATE STATISTICS s2 ON (i+1) ,i FROM t;
| postgres=# \d t
| ...
| "public"."s2" (ndistinct, dependencies, mcv) ON i FROM t



I guess that's not surprising, given that old psql knows nothing about
expressions in stats.

In general, I think connecting old versions of psql to newer servers
is not supported. You're lucky if \d works at all. So it shouldn't be
this patch's responsibility to make that output nicer.



Yeah. It's not clear to me what exactly could we do with this, without 
"backpatching" the old psql or making the ruleutils.c consider version 
of the psql. Neither of these seems possible/acceptable.


I'm sure this is not the only place showing "incomplete" information in 
old psql on new server.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-01-22 Thread Dean Rasheed
On Fri, 22 Jan 2021 at 04:46, Justin Pryzby  wrote:
>
> I think you'll maybe have to do something better - this seems a bit too weird:
>
> | postgres=# CREATE STATISTICS s2 ON (i+1) ,i FROM t;
> | postgres=# \d t
> | ...
> | "public"."s2" (ndistinct, dependencies, mcv) ON i FROM t
>

I guess that's not surprising, given that old psql knows nothing about
expressions in stats.

In general, I think connecting old versions of psql to newer servers
is not supported. You're lucky if \d works at all. So it shouldn't be
this patch's responsibility to make that output nicer.

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2021-01-21 Thread Justin Pryzby
On Thu, Jan 21, 2021 at 10:01:01PM -0600, Justin Pryzby wrote:
> On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote:
> > > > | Statistics objects:
> > > > | "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
> > 
> > Umm, for me that prints:
> 
> > "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t
> > 
> > which I think is OK. But maybe there's something else to trigger the
> > problem?
> 
> Oh.  It's because I was using /usr/bin/psql and not ./src/bin/psql.
> I think it's considered ok if old client's \d commands don't work on new
> server, but it's not clear to me if it's ok if they misbehave.  It's almost
> better it made an ERROR.

I think you'll maybe have to do something better - this seems a bit too weird:

| postgres=# CREATE STATISTICS s2 ON (i+1) ,i FROM t;
| postgres=# \d t
| ...
| "public"."s2" (ndistinct, dependencies, mcv) ON i FROM t

It suggests including additional columns in stxkeys for each expression.
Maybe that also helps give direction to response to Dean's concern?

That doesn't make old psql do anything more desirable, though.
Unless you also added attributes, all you can do is make it say things like
"columns: ctid".

> In any case, why are there so many parentheses ?

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-01-21 Thread Justin Pryzby
On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote:
> > > | Statistics objects:
> > > | "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
> 
> Umm, for me that prints:

> "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t
> 
> which I think is OK. But maybe there's something else to trigger the
> problem?

Oh.  It's because I was using /usr/bin/psql and not ./src/bin/psql.
I think it's considered ok if old client's \d commands don't work on new
server, but it's not clear to me if it's ok if they misbehave.  It's almost
better it made an ERROR.

In any case, why are there so many parentheses ?

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-01-21 Thread Justin Pryzby
This already needs to be rebased on 55dc86eca.

And needs to update rules.out.

And doesn't address this one:

On Sun, Jan 17, 2021 at 10:53:31PM -0600, Justin Pryzby wrote:
> | postgres=# CREATE TABLE t(i int);
> | postgres=# CREATE STATISTICS s2 ON (i+1) ,(i+1+0) FROM t;
> | postgres=# \d t
> |  Table "public.t"
> |  Column |  Type   | Collation | Nullable | Default 
> | +-+---+--+-
> |  i  | integer |   |  | 
> | Indexes:
> | "t_i_idx" btree (i)
> | Statistics objects:
> | "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
> 
> on ... what ?

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-01-21 Thread Dean Rasheed
On Tue, 19 Jan 2021 at 01:57, Tomas Vondra
 wrote:
>
> > A slightly bigger issue that I don't like is the way it assigns
> > attribute numbers for expressions starting from
> > MaxHeapAttributeNumber+1, so the first expression has an attnum of
> > 1601. That leads to pretty inefficient use of Bitmapsets, since most
> > tables only contain a handful of columns, and there's a large block of
> > unused space in the middle the Bitmapset.
> >
> > An alternative approach might be to use regular attnums for columns
> > and use negative indexes -1, -2, -3, ... for expressions in the stored
> > stats. Then when storing and retrieving attnums from Bitmapsets, it
> > could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
> > in the Bitmapsets, since there can't be more than that many
> > expressions (just like other code stores system attributes using
> > FirstLowInvalidHeapAttributeNumber).
>
> Well, I tried this but unfortunately it's not that simple. We still need
> to build the bitmaps, so I don't think add_expression_to_attributes can
> be just removed. I mean, we need to do the offsetting somewhere, even if
> we change how we do it.

Hmm, I was imagining that the offsetting would happen in each place
that adds or retrieves an attnum from a Bitmapset, much like a lot of
other code does for system attributes, and then you'd know you had an
expression if the resulting attnum was negative.

I was also thinking that it would be these negative attnums that would
be stored in the stats data, so instead of something like "1, 2 =>
1601", it would be "1, 2 => -1", so in some sense "-1" would be the
"real" attnum associated with the expression.

> But the main issue is that in some cases the number of expressions is
> not really limited by STATS_MAX_DIMENSIONS - for example when applying
> functional dependencies, we "merge" multiple statistics, so we may end
> up with more expressions. So we can't just use STATS_MAX_DIMENSIONS.

Ah, I see. I hadn't really fully understood what that code was doing.

ISTM though that this is really an internal problem to the
dependencies code, in that these "merged" Bitmapsets containing attrs
from multiple different stats objects do not (and must not) ever go
outside that local code that uses them. So that code would be free to
use a different offset for its own purposes -- e..g., collect all the
distinct expressions across all the stats objects and then offset by
the number of distinct expressions.

> Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts
> the order of processing (so far we've assumed expressions are after
> regular attnums). So the changes are more extensive - I tried doing that
> anyway, and I'm still struggling with crashes and regression failures.
> Of course, that doesn't mean we shouldn't do it, but it's far from
> mechanical. (Some of that is probably a sign this code needs a bit more
> work to polish.)

Interesting. What code assumes expressions come after attributes?
Ideally, I think it would be cleaner if no code assumed any particular
order, but I can believe that it might be convenient in some
circumstances.

> But I wonder if it'd be easier to just calculate the actual max attnum
> and then use it instead of MaxHeapAttributeNumber ...

Hmm, I'm not sure how that would work. There still needs to be an
attnum that gets stored in the database, and it has to continue to
work if the user adds columns to the table. That's why I was
advocating storing negative values, though I haven't actually tried it
to see what might go wrong.

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2021-01-18 Thread Tomas Vondra



On 1/18/21 4:48 PM, Dean Rasheed wrote:

Looking through extended_stats.c, I found a corner case that can lead
to a seg-fault:

CREATE TABLE foo();
CREATE STATISTICS s ON (1) FROM foo;
ANALYSE foo;

This crashes in lookup_var_attr_stats(), because it isn't expecting
nvacatts to be 0. I can't think of any case where building stats on a
table with no analysable columns is useful, so it should probably just
exit early in that case.


In BuildRelationExtStatistics(), it looks like min_attrs should be
declared assert-only.


In evaluate_expressions():

+   /* set the pointers */
+   result = (ExprInfo *) ptr;
+   ptr += sizeof(ExprInfo);

I think that should probably have a MAXALIGN().



Thanks, I'll fix all of that.



A slightly bigger issue that I don't like is the way it assigns
attribute numbers for expressions starting from
MaxHeapAttributeNumber+1, so the first expression has an attnum of
1601. That leads to pretty inefficient use of Bitmapsets, since most
tables only contain a handful of columns, and there's a large block of
unused space in the middle the Bitmapset.

An alternative approach might be to use regular attnums for columns
and use negative indexes -1, -2, -3, ... for expressions in the stored
stats. Then when storing and retrieving attnums from Bitmapsets, it
could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
in the Bitmapsets, since there can't be more than that many
expressions (just like other code stores system attributes using
FirstLowInvalidHeapAttributeNumber).

That would be a somewhat bigger change, but hopefully fairly
mechanical, and then some code like add_expressions_to_attributes()
would go away.



Well, I tried this but unfortunately it's not that simple. We still need 
to build the bitmaps, so I don't think add_expression_to_attributes can 
be just removed. I mean, we need to do the offsetting somewhere, even if 
we change how we do it.


But the main issue is that in some cases the number of expressions is 
not really limited by STATS_MAX_DIMENSIONS - for example when applying 
functional dependencies, we "merge" multiple statistics, so we may end 
up with more expressions. So we can't just use STATS_MAX_DIMENSIONS.


Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts 
the order of processing (so far we've assumed expressions are after 
regular attnums). So the changes are more extensive - I tried doing that 
anyway, and I'm still struggling with crashes and regression failures. 
Of course, that doesn't mean we shouldn't do it, but it's far from 
mechanical. (Some of that is probably a sign this code needs a bit more 
work to polish.)


But I wonder if it'd be easier to just calculate the actual max attnum 
and then use it instead of MaxHeapAttributeNumber ...




Looking at the new view pg_stats_ext_exprs, I noticed that it fails to
show expressions until the statistics have been built. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON (a+b), (a*b) FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

  statistics_name | tablename | expr | n_distinct
-+---+--+
  s   | foo   |  |
(1 row)

but after populating and analysing the table, this becomes:

  statistics_name | tablename |  expr   | n_distinct
-+---+-+
  s   | foo   | (a + b) | 11
  s   | foo   | (a * b) | 11
(2 rows)

I think it should show the expressions even before the stats have been built.

Another issue is that it returns rows for non-expression stats as
well. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON a, b FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

  statistics_name | tablename | expr | n_distinct
-+---+--+
  s   | foo   |  |
(1 row)

and those values will never be populated, since they're not
expressions, so I would expect them to not be shown in the view.

So basically, instead of

+ LEFT JOIN LATERAL (
+ SELECT
+ *
+ FROM (
+ SELECT
+
unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+ unnest(sd.stxdexpr)::pg_statistic AS a
+ ) x
+ ) stat ON sd.stxdexpr IS NOT NULL;

perhaps just

+ JOIN LATERAL (
+ SELECT
+ *
+ FROM (
+ SELECT
+
unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+ unnest(sd.stxdexpr)::pg_statistic AS a
+ ) x
+ ) stat ON true;


Will fix.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-01-18 Thread Dean Rasheed
Looking through extended_stats.c, I found a corner case that can lead
to a seg-fault:

CREATE TABLE foo();
CREATE STATISTICS s ON (1) FROM foo;
ANALYSE foo;

This crashes in lookup_var_attr_stats(), because it isn't expecting
nvacatts to be 0. I can't think of any case where building stats on a
table with no analysable columns is useful, so it should probably just
exit early in that case.


In BuildRelationExtStatistics(), it looks like min_attrs should be
declared assert-only.


In evaluate_expressions():

+   /* set the pointers */
+   result = (ExprInfo *) ptr;
+   ptr += sizeof(ExprInfo);

I think that should probably have a MAXALIGN().


A slightly bigger issue that I don't like is the way it assigns
attribute numbers for expressions starting from
MaxHeapAttributeNumber+1, so the first expression has an attnum of
1601. That leads to pretty inefficient use of Bitmapsets, since most
tables only contain a handful of columns, and there's a large block of
unused space in the middle the Bitmapset.

An alternative approach might be to use regular attnums for columns
and use negative indexes -1, -2, -3, ... for expressions in the stored
stats. Then when storing and retrieving attnums from Bitmapsets, it
could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
in the Bitmapsets, since there can't be more than that many
expressions (just like other code stores system attributes using
FirstLowInvalidHeapAttributeNumber).

That would be a somewhat bigger change, but hopefully fairly
mechanical, and then some code like add_expressions_to_attributes()
would go away.


Looking at the new view pg_stats_ext_exprs, I noticed that it fails to
show expressions until the statistics have been built. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON (a+b), (a*b) FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

 statistics_name | tablename | expr | n_distinct
-+---+--+
 s   | foo   |  |
(1 row)

but after populating and analysing the table, this becomes:

 statistics_name | tablename |  expr   | n_distinct
-+---+-+
 s   | foo   | (a + b) | 11
 s   | foo   | (a * b) | 11
(2 rows)

I think it should show the expressions even before the stats have been built.

Another issue is that it returns rows for non-expression stats as
well. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON a, b FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

 statistics_name | tablename | expr | n_distinct
-+---+--+
 s   | foo   |  |
(1 row)

and those values will never be populated, since they're not
expressions, so I would expect them to not be shown in the view.

So basically, instead of

+ LEFT JOIN LATERAL (
+ SELECT
+ *
+ FROM (
+ SELECT
+
unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+ unnest(sd.stxdexpr)::pg_statistic AS a
+ ) x
+ ) stat ON sd.stxdexpr IS NOT NULL;

perhaps just

+ JOIN LATERAL (
+ SELECT
+ *
+ FROM (
+ SELECT
+
unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+ unnest(sd.stxdexpr)::pg_statistic AS a
+ ) x
+ ) stat ON true;

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2021-01-17 Thread Justin Pryzby
On Sun, Jan 17, 2021 at 01:23:39AM +0100, Tomas Vondra wrote:
> > CREATE TABLE t AS SELECT generate_series(1,9) AS i;
> > CREATE STATISTICS s ON (i+1) ,(i+1+0) FROM t;
> > ANALYZE t;
> > SELECT i+1 FROM t GROUP BY 1;
> > ERROR:  corrupt MVNDistinct entry
> 
> Thanks. There was a thinko in estimate_multivariate_ndistinct, resulting in
> mismatching the ndistinct coefficient items. The attached patch fixes that,
> but I've realized the way we pick the "best" statistics may need some
> improvements (I added an XXX comment about that).

That maybe indicates a deficiency in testing and code coverage.

| postgres=# CREATE TABLE t(i int);
| postgres=# CREATE STATISTICS s2 ON (i+1) ,(i+1+0) FROM t;
| postgres=# \d t
|  Table "public.t"
|  Column |  Type   | Collation | Nullable | Default 
| +-+---+--+-
|  i  | integer |   |  | 
| Indexes:
| "t_i_idx" btree (i)
| Statistics objects:
| "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t

on ... what ?

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-01-17 Thread Tomas Vondra




On 1/17/21 6:29 AM, Justin Pryzby wrote:

On Sun, Jan 17, 2021 at 01:23:39AM +0100, Tomas Vondra wrote:

diff --git a/doc/src/sgml/ref/create_statistics.sgml 
b/doc/src/sgml/ref/create_statistics.sgml
index 4363be50c3..5b8eb8d248 100644
--- a/doc/src/sgml/ref/create_statistics.sgml
+++ b/doc/src/sgml/ref/create_statistics.sgml
@@ -21,9 +21,13 @@ PostgreSQL documentation
  
   

  
+CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
+ON ( expression )
+FROM table_name



  CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
  [ ( statistics_kind [, ... ] 
) ]
-ON column_name, column_name [, ...]
+ON { column_name | ( expression ) } [, ...]
  FROM table_name
  


I think this part is wrong, since it's not possible to specify a single column
in form#2.

If I'm right, the current way is:

  - form#1 allows expression statistics of a single expression, and doesn't
allow specifying "kinds";

  - form#2 allows specifying "kinds", but always computes expression statistics,
and requires multiple columns/expressions.

So it'd need to be column_name|expression, column_name|expression [,...]



Strictly speaking you're probably correct - there should be at least two 
elements. But I'm somewhat hesitant about making this more complex, 
because it'll be harder to understand.




@@ -39,6 +43,16 @@ CREATE STATISTICS [ IF NOT EXISTS ] statistics_na
 database and will be owned by the user issuing the command.

  
+  

+   The CREATE STATISTICS command has two basic forms. The
+   simple variant allows building statistics for a single expression, does
+   not allow specifying any statistics kinds and provides benefits similar
+   to an expression index. The full variant allows defining statistics objects
+   on multiple columns and expressions, and selecting which statistics kinds 
will
+   be built. The per-expression statistics are built automatically when there
+   is at least one expression.
+  



+   
+expression
+
+ 
+  The expression to be covered by the computed statistics. In this case
+  only a single expression is required, in which case only the expression
+  statistics kind is allowed. The order of expressions is insignificant.


I think this part is wrong now ?
I guess there's no user-facing expression "kind" left in the CREATE command.
I guess "in which case" means "if only one expr is specified".
"expression" could be either form#1 or form#2.

Maybe it should just say:

+  An expression to be covered by the computed statistics.


Maybe somewhere else, say:

In the second form of the command, the order of expressions is insignificant.




Yeah, this is a leftover from when there was "expressions" kind. I'll 
reword this a bit.



thanks

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-01-17 Thread Tomas Vondra




On 1/17/21 3:55 AM, Zhihong Yu wrote:

Hi,
+    * Check that only the base rel is mentioned.  (This should be dead code
+    * now that add_missing_from is history.)
+    */
+   if (list_length(pstate->p_rtable) != 1)

If it is dead code, it can be removed, right ?



Maybe. The question is whether it really is dead code. It's copied from 
transformIndexStmt so I kept it for now.



For statext_mcv_clauselist_selectivity:

+                   // bms_free(list_attnums[listidx]);
 > The commented line can be removed.



Actually, this should probably do list_free on the list_exprs.


+bool
+examine_clause_args2(List *args, Node **exprp, Const **cstp, bool 
*expronleftp)


Better add some comment for examine_clause_args2 since there 
is examine_clause_args() already.




Yeah, this probably needs a better name. I have a feeling this may need 
a refactoring to reuse more of the existing code for the expressions.



+   if (!ok || stats->compute_stats == NULL || stats->minrows <= 0)

When would stats->minrows have negative value ?



The typanalyze function (e.g. std_typanalyze) can set it to negative 
value. This is the same check as in examine_attribute, and we need the 
same protections I think.



For serialize_expr_stats():

+   sd = table_open(StatisticRelationId, RowExclusiveLock);
+
+   /* lookup OID of composite type for pg_statistic */
+   typOid = get_rel_type_id(StatisticRelationId);
+   if (!OidIsValid(typOid))
+       ereport(ERROR,

Looks like the table_open() call can be made after the typOid check.



Probably, but this failure is unlikely (should never happen) so I don't 
think this makes any real difference.



+       Datum       values[Natts_pg_statistic];
+       bool        nulls[Natts_pg_statistic];
+       HeapTuple   stup;
+
+       if (!stats->stats_valid)

It seems the local arrays can be declared after the validity check.



Nope, that would be against C99.


+           if (enabled[i] == STATS_EXT_NDISTINCT)
+               ndistinct_enabled = true;
+           if (enabled[i] == STATS_EXT_DEPENDENCIES)
+               dependencies_enabled = true;
+           if (enabled[i] == STATS_EXT_MCV)

the second and third if should be preceded with 'else'



Yeah, although this just moves existing code. But you're right it could 
use else.



+ReleaseDummy(HeapTuple tuple)
+{
+   pfree(tuple);

Since ReleaseDummy() is just a single pfree call, maybe you don't need 
this method - call pfree in its place.




No, that's not possible because the freefunc callback expects signature

void (*)(HeapTupleData *)

and pfree() does not match that.


thanks

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-01-16 Thread Justin Pryzby
On Sun, Jan 17, 2021 at 01:23:39AM +0100, Tomas Vondra wrote:
> diff --git a/doc/src/sgml/ref/create_statistics.sgml 
> b/doc/src/sgml/ref/create_statistics.sgml
> index 4363be50c3..5b8eb8d248 100644
> --- a/doc/src/sgml/ref/create_statistics.sgml
> +++ b/doc/src/sgml/ref/create_statistics.sgml
> @@ -21,9 +21,13 @@ PostgreSQL documentation
>  
>   
>  
> +CREATE STATISTICS [ IF NOT EXISTS ]  class="parameter">statistics_name
> +ON ( expression )
> +FROM table_name

>  CREATE STATISTICS [ IF NOT EXISTS ]  class="parameter">statistics_name
>  [ ( statistics_kind [, ... 
> ] ) ]
> -ON column_name, 
> column_name [, ...]
> +ON { column_name | ( 
> expression ) } [, ...]
>  FROM table_name
>  

I think this part is wrong, since it's not possible to specify a single column
in form#2.

If I'm right, the current way is:

 - form#1 allows expression statistics of a single expression, and doesn't
   allow specifying "kinds";

 - form#2 allows specifying "kinds", but always computes expression statistics,
   and requires multiple columns/expressions.

So it'd need to be column_name|expression, column_name|expression [,...]

> @@ -39,6 +43,16 @@ CREATE STATISTICS [ IF NOT EXISTS ]  class="parameter">statistics_na
> database and will be owned by the user issuing the command.
>
>  
> +  
> +   The CREATE STATISTICS command has two basic forms. The
> +   simple variant allows building statistics for a single expression, does
> +   not allow specifying any statistics kinds and provides benefits similar
> +   to an expression index. The full variant allows defining statistics 
> objects
> +   on multiple columns and expressions, and selecting which statistics kinds 
> will
> +   be built. The per-expression statistics are built automatically when there
> +   is at least one expression.
> +  

> +   
> +expression
> +
> + 
> +  The expression to be covered by the computed statistics. In this case
> +  only a single expression is required, in which case only the expression
> +  statistics kind is allowed. The order of expressions is insignificant.

I think this part is wrong now ?
I guess there's no user-facing expression "kind" left in the CREATE command.
I guess "in which case" means "if only one expr is specified".
"expression" could be either form#1 or form#2.

Maybe it should just say:
> +  An expression to be covered by the computed statistics.

Maybe somewhere else, say:
> In the second form of the command, the order of expressions is insignificant.

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-01-16 Thread Zhihong Yu
Hi,
+* Check that only the base rel is mentioned.  (This should be dead code
+* now that add_missing_from is history.)
+*/
+   if (list_length(pstate->p_rtable) != 1)

If it is dead code, it can be removed, right ?

For statext_mcv_clauselist_selectivity:

+   // bms_free(list_attnums[listidx]);

The commented line can be removed.

+bool
+examine_clause_args2(List *args, Node **exprp, Const **cstp, bool
*expronleftp)

Better add some comment for examine_clause_args2 since there
is examine_clause_args() already.

+   if (!ok || stats->compute_stats == NULL || stats->minrows <= 0)

When would stats->minrows have negative value ?

For serialize_expr_stats():

+   sd = table_open(StatisticRelationId, RowExclusiveLock);
+
+   /* lookup OID of composite type for pg_statistic */
+   typOid = get_rel_type_id(StatisticRelationId);
+   if (!OidIsValid(typOid))
+   ereport(ERROR,

Looks like the table_open() call can be made after the typOid check.

+   Datum   values[Natts_pg_statistic];
+   boolnulls[Natts_pg_statistic];
+   HeapTuple   stup;
+
+   if (!stats->stats_valid)

It seems the local arrays can be declared after the validity check.

+   if (enabled[i] == STATS_EXT_NDISTINCT)
+   ndistinct_enabled = true;
+   if (enabled[i] == STATS_EXT_DEPENDENCIES)
+   dependencies_enabled = true;
+   if (enabled[i] == STATS_EXT_MCV)

the second and third if should be preceded with 'else'

+ReleaseDummy(HeapTuple tuple)
+{
+   pfree(tuple);

Since ReleaseDummy() is just a single pfree call, maybe you don't need this
method - call pfree in its place.

Cheers

On Sat, Jan 16, 2021 at 4:24 PM Tomas Vondra 
wrote:

> On 1/17/21 12:22 AM, Justin Pryzby wrote:
> > On Sat, Jan 16, 2021 at 05:48:43PM +0100, Tomas Vondra wrote:
> >> +  
> >> +   expr text
> >> +  
> >> +  
> >> +   Expression the extended statistics is defined on
> >> +  
> >
> > Expression the extended statistics ARE defined on
> > Or maybe say "on which the extended statistics are defined"
> >
>
> I'm pretty sure "is" is correct because "expression" is singular.
>
> >> +  
> >> +   The CREATE STATISTICS command has two basic
> forms. The
> >> +   simple variant allows to build statistics for a single expression,
> does
> >
> > .. ALLOWS BUILDING statistics for a single expression, AND does (or BUT
> does)
> >
> >> +   Expression statistics are per-expression and are similar to
> creating an
> >> +   index on the expression, except that they avoid the overhead of the
> index.
> >
> > Maybe say "overhead of index maintenance"
> >
>
> Yeah, that sounds better.
>
> >> +   All functions and operators used in a statistics definition must be
> >> +   immutable, that is, their results must depend only on
> >> +   their arguments and never on any outside influence (such as
> >> +   the contents of another table or the current time).  This
> restriction
> >
> > say "outside factor" or "external factor"
> >
>
> In fact, we've removed the immutability restriction, so this paragraph
> should have been removed.
>
> >> +   results of those expression, and uses default estimates as
> illustrated
> >> +   by the first query.  The planner also does not realize the value of
> the
> >
> > realize THAT
> >
>
> OK, changed.
>
> >> +   second column fully defines the value of the other column, because
> date
> >> +   truncated to day still identifies the month. Then expression and
> >> +   ndistinct statistics are built on those two columns:
> >
> > I got an error doing this:
> >
> > CREATE TABLE t AS SELECT generate_series(1,9) AS i;
> > CREATE STATISTICS s ON (i+1) ,(i+1+0) FROM t;
> > ANALYZE t;
> > SELECT i+1 FROM t GROUP BY 1;
> > ERROR:  corrupt MVNDistinct entry
> >
>
> Thanks. There was a thinko in estimate_multivariate_ndistinct, resulting
> in mismatching the ndistinct coefficient items. The attached patch fixes
> that, but I've realized the way we pick the "best" statistics may need
> some improvements (I added an XXX comment about that).
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: PoC/WIP: Extended statistics on expressions

2021-01-16 Thread Justin Pryzby
On Sat, Jan 16, 2021 at 05:48:43PM +0100, Tomas Vondra wrote:
> +  
> +   expr text
> +  
> +  
> +   Expression the extended statistics is defined on
> +  

Expression the extended statistics ARE defined on
Or maybe say "on which the extended statistics are defined"

> +  
> +   The CREATE STATISTICS command has two basic forms. The
> +   simple variant allows to build statistics for a single expression, does

.. ALLOWS BUILDING statistics for a single expression, AND does (or BUT does)

> +   Expression statistics are per-expression and are similar to creating an
> +   index on the expression, except that they avoid the overhead of the index.

Maybe say "overhead of index maintenance"

> +   All functions and operators used in a statistics definition must be
> +   immutable, that is, their results must depend only on
> +   their arguments and never on any outside influence (such as
> +   the contents of another table or the current time).  This restriction

say "outside factor" or "external factor"

> +   results of those expression, and uses default estimates as illustrated
> +   by the first query.  The planner also does not realize the value of the

realize THAT

> +   second column fully defines the value of the other column, because date
> +   truncated to day still identifies the month. Then expression and
> +   ndistinct statistics are built on those two columns:

I got an error doing this:

CREATE TABLE t AS SELECT generate_series(1,9) AS i;
CREATE STATISTICS s ON (i+1) ,(i+1+0) FROM t;
ANALYZE t;
SELECT i+1 FROM t GROUP BY 1;
ERROR:  corrupt MVNDistinct entry

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-01-08 Thread Tomas Vondra
On 1/8/21 3:35 AM, Justin Pryzby wrote:
> On Fri, Jan 08, 2021 at 01:57:29AM +0100, Tomas Vondra wrote:
>> Attached is a patch fixing most of the issues. There are a couple
>> exceptions:
> 
> In the docs:
> 
> ...
>

Thanks! Checking the docs and comments is a tedious work, I appreciate
you going through all that. I'll fix that in the next version.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-01-07 Thread Justin Pryzby
On Fri, Jan 08, 2021 at 01:57:29AM +0100, Tomas Vondra wrote:
> Attached is a patch fixing most of the issues. There are a couple
> exceptions:

In the docs:

+at the cost that its schema must be extended whenever the structure 

 
+   of statistics pg_statistic 
changes.
 

should say "of statistics *IN* pg_statistics changes" ?

+   to an expression index. The full variant allows defining statistics objects 

 
+   on multiple columns and expressions, and pick which statistics kinds will   

 
+   be built. The per-expression statistics are built automatically when there  

 

"and pick" is wrong - maybe say "and selecting which.."

+   and run a query using an expression on that column.  Without the

 

remove "the" ?

+   extended statistics, the planner has no information about data  

 
+   distribution for reasults of those expression, and uses default 

 

*results

+   estimates as illustrated by the first query.  The planner also does 

 
+   not realize the value of the second column fully defines the value  

 
+   of the other column, because date truncated to day still identifies 

 
+   the month). Then expression and ndistinct statistics are built on   

 

The ")" is unbalanced

+   /* all parts of thi expression are covered by 
this statistics */  
   

this

+ * GrouExprInfos, but only if it's not known equal to any of the existing  

 

Group

+* we don't allow specifying any statistis kinds.  The simple variant   

 

statistics

+* If no statistic type was specified, build them all (but request  

 

Say "kind" not "type" ?

+ * expression is a simple Var. OTOH we check that there's at least one 

 
+ * statistics matching the expression. 

 

one statistic (singular) ?

+* the future, we might consider

 
+*/ 

 

consider ???

+-- (not it fails, when there are no simple column references)  

 

note?

There's some remaining copy/paste stuff from index expressions:

errmsg("statistics expressions and predicates can refer only to the table being 
indexed")));
left behind by evaluating the 

Re: PoC/WIP: Extended statistics on expressions

2021-01-07 Thread Dean Rasheed
Starting to look at the planner code, I found an oversight in the way
expression stats are read at the start of planning -- it is necessary
to call ChangeVarNodes() on any expressions if the relid isn't 1,
otherwise the stats expressions may contain Var nodes referring to the
wrong relation. Possibly the easiest place to do that would be in
get_relation_statistics(), if rel->relid != 1.

Here's a simple test case:

CREATE TABLE foo AS SELECT x FROM generate_series(1,10) g(x);
CREATE STATISTICS foo_s ON (x%10) FROM foo;
ANALYSE foo;

EXPLAIN SELECT * FROM foo WHERE x%10 = 0;
EXPLAIN SELECT * FROM (SELECT 1) t, foo WHERE x%10 = 0;

(in the second query, the stats don't get applied).

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2021-01-06 Thread Dean Rasheed
Looking over the statscmds.c changes, there are a few XXX's and
FIXME's that need resolving, and I had a couple of other minor
comments:

+   /*
+* An expression using mutable functions is probably wrong,
+* since if you aren't going to get the same result for the
+* same data every time, it's not clear what the index entries
+* mean at all.
+*/
+   if (CheckMutability((Expr *) expr))
+   ereport(ERROR,

That comment is presumably copied from the index code, so needs updating.


+   /*
+* Disallow data types without a less-than operator
+*
+* XXX Maybe allow this, but only for EXPRESSIONS stats and
+* prevent building e.g. MCV etc.
+*/
+   atttype = exprType(expr);
+   type = lookup_type_cache(atttype, TYPECACHE_LT_OPR);
+   if (type->lt_opr == InvalidOid)
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+errmsg("expression cannot be used in
statistics because its type %s has no default btree operator class",
+   format_type_be(atttype;

As the comment suggests, it's probably worth skipping this check if
numcols is 1 so that single-column stats can be built for more types
of expressions. (I'm assuming that it's basically no more effort to
make that work, so I think it falls into the might-as-well-do-it
category.)


+   /*
+* Parse the statistics kinds.  Firstly, check that this is not the
+* variant building statistics for a single expression, in which case
+* we don't allow specifying any statistis kinds.  The simple variant
+* only has one expression, and does not allow statistics kinds.
+*/
+   if ((list_length(stmt->exprs) == 1) && (list_length(stxexprs) == 1))
+   {

Typo: "statistis"
Nit-picking, this test could just be:

+   if ((numcols == 1) && (list_length(stxexprs) == 1))

which IMO is a little more readable, and matches a similar test a
little further down.


+   /*
+* If there are no simply-referenced columns, give the statistics an
+* auto dependency on the whole table.  In most cases, this will
+* be redundant, but it might not be if the statistics expressions
+* contain no Vars (which might seem strange but possible).
+*
+* XXX This is copied from index_create, not sure if it's applicable
+* to extended statistics too.
+*/

Seems right to me.


+   /*
+* FIXME use 'expr' for expressions, which have empty column names.
+* For indexes this is handled in ChooseIndexColumnNames, but we
+* have no such function for stats.
+*/
+   if (!name)
+   name = "expr";

In theory, this function could be made to duplicate the logic used for
indexes, creating names like "expr1", "expr2", etc. To be honest
though, I don't think it's worth the effort. The code for indexes
isn't really bulletproof anyway -- for example there might be a column
called "expr" that is or isn't included in the index, which would make
the generated name ambiguous. And in any case, a name like
"tbl_cola_expr_colb_expr1_colc_stat" isn't really any more useful than
"tbl_cola_expr_colb_expr_colc_stat". So I'd be tempted to leave that
code as it is.


+
+/*
+ * CheckMutability
+ * Test whether given expression is mutable
+ *
+ * FIXME copied from indexcmds.c, maybe use some shared function?
+ */
+static bool
+CheckMutability(Expr *expr)
+{

As the comment says, it's quite messy duplicating this code, but I'm
wondering whether it would be OK to just skip this check entirely. I
think someone else suggested that elsewhere, and I think it might not
be a bad idea.

For indexes, it could easily lead to wrong query results, but for
stats the most likely problem is that the stats would get out of date
(which they tend to do all by themselves anyway) and need rebuilding.

If you ignore intentionally crazy examples (which are still possible
even with this check), then there are probably many legitimate cases
where someone might want to use non-immutable functions in stats, and
this check just forces them to create an immutable wrapper function.

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2021-01-05 Thread Tomas Vondra




On 1/5/21 3:10 PM, Dean Rasheed wrote:

On Tue, 5 Jan 2021 at 00:45, Tomas Vondra  wrote:


On 1/4/21 4:34 PM, Dean Rasheed wrote:


* In src/bin/psql/describe.c, I think the \d output should also
exclude the "expressions" stats kind and just list the other kinds (or
have no kinds list at all, if there are no other kinds), to make it
consistent with the CREATE STATISTICS syntax.


Not sure I understand. Why would this make it consistent with CREATE
STATISTICS? Can you elaborate?



This isn't absolutely essential, but I think it would be neater. For
example, if I have a table with stats like this:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS foo_s_ab (mcv) ON a,b FROM foo;

then the \d output is as follows:

\d foo
 Table "public.foo"
  Column |  Type   | Collation | Nullable | Default
+-+---+--+-
  a  | integer |   |  |
  b  | integer |   |  |
Statistics objects:
 "public"."foo_s_ab" (mcv) ON a, b FROM foo

and the stats line matches the DDL used to create the stats. It could,
for example, be copy-pasted and tweaked to create similar stats on
another table, but even if that's not very likely, it's neat that it
reflects how the stats were created.

OTOH, if there are expressions in the list, it produces something like this:

 Table "public.foo"
  Column |  Type   | Collation | Nullable | Default
+-+---+--+-
  a  | integer |   |  |
  b  | integer |   |  |
Statistics objects:
 "public"."foo_s_ab" (mcv, expressions) ON a, b, ((a * b)) FROM foo

which no longer matches the DDL used, and isn't part of an accepted
syntax, so seems a bit inconsistent.

In general, if we're making the "expressions" kind an internal
implementation detail that just gets built automatically when needed,
then I think we should hide it from this sort of output, so the list
of kinds matches the list that the user used when the stats were
created.



Hmm, I see. You're probably right it's not necessary to show this, given 
the modified handling of expression stats (which makes them an internal 
detail, not exposed to users). I'll tweak this.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-01-05 Thread Dean Rasheed
On Tue, 5 Jan 2021 at 00:45, Tomas Vondra  wrote:
>
> On 1/4/21 4:34 PM, Dean Rasheed wrote:
> >
> > * In src/bin/psql/describe.c, I think the \d output should also
> > exclude the "expressions" stats kind and just list the other kinds (or
> > have no kinds list at all, if there are no other kinds), to make it
> > consistent with the CREATE STATISTICS syntax.
>
> Not sure I understand. Why would this make it consistent with CREATE
> STATISTICS? Can you elaborate?
>

This isn't absolutely essential, but I think it would be neater. For
example, if I have a table with stats like this:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS foo_s_ab (mcv) ON a,b FROM foo;

then the \d output is as follows:

\d foo
Table "public.foo"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 a  | integer |   |  |
 b  | integer |   |  |
Statistics objects:
"public"."foo_s_ab" (mcv) ON a, b FROM foo

and the stats line matches the DDL used to create the stats. It could,
for example, be copy-pasted and tweaked to create similar stats on
another table, but even if that's not very likely, it's neat that it
reflects how the stats were created.

OTOH, if there are expressions in the list, it produces something like this:

Table "public.foo"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 a  | integer |   |  |
 b  | integer |   |  |
Statistics objects:
"public"."foo_s_ab" (mcv, expressions) ON a, b, ((a * b)) FROM foo

which no longer matches the DDL used, and isn't part of an accepted
syntax, so seems a bit inconsistent.

In general, if we're making the "expressions" kind an internal
implementation detail that just gets built automatically when needed,
then I think we should hide it from this sort of output, so the list
of kinds matches the list that the user used when the stats were
created.

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2021-01-04 Thread Tomas Vondra

On 1/4/21 4:34 PM, Dean Rasheed wrote:


... 


Some other comments:

* I'm not sure I understand the need for 0001. Wasn't there an earlier
version of this patch that just did it by re-populating the type
array, but which still had it as an array rather than turning it into
a list? Making it a list falsifies some of the comments and
function/variable name choices in that file.



That's a bit done to Justin - I think it's fine to use the older version 
repopulating the type array, but that question is somewhat unrelated to 
this patch.



* There's a comment typo in catalog/Makefile -- "are are reputedly
other...", should be "there are reputedly other...".

* Looking at the pg_stats_ext view, I think perhaps expressions stats
should be omitted entirely from that view, since it doesn't show any
useful information about them. So it could remove "e" from the "kinds"
array, and exclude rows whose only kind is "e", since such rows have
no interesting data in them. Essentially, the new view
pg_stats_ext_exprs makes having any expression stats in pg_stats_ext
redundant. Hiding this data in pg_stats_ext would also be consistent
with making the "expressions" stats kind hidden from the user.



Hmmm, not sure. I'm not sure removing 'e' from the array is a good idea. 
On the one hand it's internal detail, on the other hand most of that 
view is internal detail too. Excluding rows with only 'e' seems 
reasonable, though. I need to think about this.



* In gram.y, it wasn't quite obvious why you converted the column list
for CREATE STATISTICS from an expr_list to a stats_params list. I
figured it out, and it makes sense, but I think it could use a
comment, perhaps something along the lines of the one for index_elem,
e.g.:

/*
  * Statistics attributes can be either simple column references, or arbitrary
  * expressions in parens.  For compatibility with index attributes permitted
  * in CREATE INDEX, we allow an expression that's just a function call to be
  * written without parens.
  */



OH, right. I'd have trouble figuring this myself, and I wrote that code 
myself only one or two months ago.



* In parse_func.c and parse_agg.c, there are a few new error strings
that use the abbreviation "stats expressions", whereas most other
errors refer to "statistics expressions". For consistency, I think
they should all be the latter.



OK, will fix.


* In generateClonedExtStatsStmt(), I think the "expressions" stats
kind needs to be explicitly excluded, otherwise CREATE TABLE (LIKE
...) fails if the source table has expression stats.



Yeah, will fix. I guess this also means we're missing some tests.


* CreateStatistics() uses ShareUpdateExclusiveLock, but in
tcop/utility.c the relation is opened with a ShareLock. ISTM that the
latter lock mode should be made to match CreateStatistics().



Not sure, will check.


* Why does the new code in tcop/utility.c not use
RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation?
That seems preferable to doing the ACL check in CreateStatistics().
For one thing, as it stands, it allows the lock to be taken even if
the user doesn't own the table. Is it intentional that the current
code allows extended stats to be created on system catalogs? That
would be one thing that using RangeVarCallbackOwnsRelation would
change, but I can't see a reason to allow it.



I think I copied the code from somewhere - probably expression indexes, 
or something like that. Not a proof that it's the right/better way to do 
this, though.



* In src/bin/psql/describe.c, I think the \d output should also
exclude the "expressions" stats kind and just list the other kinds (or
have no kinds list at all, if there are no other kinds), to make it
consistent with the CREATE STATISTICS syntax.



Not sure I understand. Why would this make it consistent with CREATE 
STATISTICS? Can you elaborate?



* The pg_dump output for a stats object whose only kind is
"expressions" is broken -- it includes a spurious "()" for the kinds
list.



Will fix. Again, this suggests there are TAP tests missing.


That's it for now. I'll look at the optimiser changes next, and try to
post more comments later this week.



Thanks!


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-01-04 Thread Justin Pryzby
On Mon, Jan 04, 2021 at 03:34:08PM +, Dean Rasheed wrote:
> * I'm not sure I understand the need for 0001. Wasn't there an earlier
> version of this patch that just did it by re-populating the type
> array, but which still had it as an array rather than turning it into
> a list? Making it a list falsifies some of the comments and
> function/variable name choices in that file.

This part is from me.

I can review the names if it's desired , but it'd be fine to fall back to the
earlier patch.  I thought a pglist was cleaner, but it's not needed.

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-01-04 Thread Dean Rasheed
On Fri, 11 Dec 2020 at 20:17, Tomas Vondra
 wrote:
>
> OK. Attached is an updated version, reworking it this way.

Cool. I think this is an exciting development, so I hope it makes it
into the next release.

I have started looking at it. So far I have only looked at the
catalog, parser and client changes, but I thought it's worth posting
my comments so far.

> I tried tweaking the grammar to differentiate these two syntax variants,
> but that led to shift/reduce conflicts with the existing ones. I tried
> fixing that, but I ended up doing that in CreateStatistics().

Yeah, that makes sense. I wasn't expecting the grammar to change.

> The other thing is that we probably can't tie this to just MCV, because
> functional dependencies need the per-expression stats too. So I simply
> build expression stats whenever there's at least one expression.

Makes sense.

> I also decided to keep the "expressions" statistics kind - it's not
> allowed to specify it in CREATE STATISTICS, but it's useful internally
> as it allows deciding whether to build the stats in a single place.
> Otherwise we'd need to do that every time we build the statistics, etc.

Yes, I thought that would be the easiest way to do it. Essentially the
"expressions" stats kind is an internal implementation detail, hidden
from the user, because it's built automatically when required, so you
don't need to (and can't) explicitly ask for it. This new behaviour
seems much more logical to me.

> I added a brief explanation to the sgml docs, not sure if that's good
> enough - maybe it needs more details.

Yes, I think that could use a little tidying up, but I haven't looked
too closely yet.


Some other comments:

* I'm not sure I understand the need for 0001. Wasn't there an earlier
version of this patch that just did it by re-populating the type
array, but which still had it as an array rather than turning it into
a list? Making it a list falsifies some of the comments and
function/variable name choices in that file.

* There's a comment typo in catalog/Makefile -- "are are reputedly
other...", should be "there are reputedly other...".

* Looking at the pg_stats_ext view, I think perhaps expressions stats
should be omitted entirely from that view, since it doesn't show any
useful information about them. So it could remove "e" from the "kinds"
array, and exclude rows whose only kind is "e", since such rows have
no interesting data in them. Essentially, the new view
pg_stats_ext_exprs makes having any expression stats in pg_stats_ext
redundant. Hiding this data in pg_stats_ext would also be consistent
with making the "expressions" stats kind hidden from the user.

* In gram.y, it wasn't quite obvious why you converted the column list
for CREATE STATISTICS from an expr_list to a stats_params list. I
figured it out, and it makes sense, but I think it could use a
comment, perhaps something along the lines of the one for index_elem,
e.g.:

/*
 * Statistics attributes can be either simple column references, or arbitrary
 * expressions in parens.  For compatibility with index attributes permitted
 * in CREATE INDEX, we allow an expression that's just a function call to be
 * written without parens.
 */

* In parse_func.c and parse_agg.c, there are a few new error strings
that use the abbreviation "stats expressions", whereas most other
errors refer to "statistics expressions". For consistency, I think
they should all be the latter.

* In generateClonedExtStatsStmt(), I think the "expressions" stats
kind needs to be explicitly excluded, otherwise CREATE TABLE (LIKE
...) fails if the source table has expression stats.

* CreateStatistics() uses ShareUpdateExclusiveLock, but in
tcop/utility.c the relation is opened with a ShareLock. ISTM that the
latter lock mode should be made to match CreateStatistics().

* Why does the new code in tcop/utility.c not use
RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation?
That seems preferable to doing the ACL check in CreateStatistics().
For one thing, as it stands, it allows the lock to be taken even if
the user doesn't own the table. Is it intentional that the current
code allows extended stats to be created on system catalogs? That
would be one thing that using RangeVarCallbackOwnsRelation would
change, but I can't see a reason to allow it.

* In src/bin/psql/describe.c, I think the \d output should also
exclude the "expressions" stats kind and just list the other kinds (or
have no kinds list at all, if there are no other kinds), to make it
consistent with the CREATE STATISTICS syntax.

* The pg_dump output for a stats object whose only kind is
"expressions" is broken -- it includes a spurious "()" for the kinds
list.

That's it for now. I'll look at the optimiser changes next, and try to
post more comments later this week.

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2020-12-11 Thread Dean Rasheed
On Tue, 8 Dec 2020 at 12:44, Tomas Vondra  wrote:
>
> Possibly. But I don't think it's worth the extra complexity. I don't
> expect people to have a lot of overlapping stats, so the amount of
> wasted space and CPU time is expected to be fairly limited.
>
> So I don't think it's worth spending too much time on this now. Let's
> just do what you proposed, and revisit this later if needed.
>

Yes, I think that's a reasonable approach to take. As long as the
documentation makes it clear that building MCV stats also causes
standard expression stats to be built on any expressions included in
the list, then the user will know and can avoid duplication most of
the time. I don't think there's any need for code to try to prevent
that -- just as we don't bother with code to prevent a user building
multiple indexes on the same column.

The only case where duplication won't be avoidable is where there are
multiple MCV stats sharing the same expression, but that's probably
quite unlikely in practice, and it seems acceptable to leave improving
that as a possible future optimisation.

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2020-12-08 Thread Tomas Vondra



On 12/7/20 5:02 PM, Dean Rasheed wrote:
> On Mon, 7 Dec 2020 at 14:15, Tomas Vondra  
> wrote:
>>
>> On 12/7/20 10:56 AM, Dean Rasheed wrote:
>>> it might actually be
>>> neater to have separate documented syntaxes for single- and
>>> multi-column statistics:
>>>
>>>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>> ON (expression)
>>> FROM table_name
>>>
>>>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>> [ ( statistics_kind [, ... ] ) ]
>>> ON { column_name | (expression) } , { column_name | (expression) } [, 
>>> ...]
>>> FROM table_name
>>
>> I think it makes sense in general. I see two issues with this approach,
>> though:
>>
>> * By adding expression/standard stats for individual statistics, it
>> makes the list of statistics longer - I wonder if this might have
>> measurable impact on lookups in this list.
>>
>> * I'm not sure it's a good idea that the second syntax would always
>> build the per-expression stats. Firstly, it seems a bit strange that it
>> behaves differently than the other kinds. Secondly, I wonder if there
>> are cases where it'd be desirable to explicitly disable building these
>> per-expression stats. For example, what if we have multiple extended
>> statistics objects, overlapping on a couple expressions. It seems
>> pointless to build the stats for all of them.
>>
> 
> Hmm, I'm not sure it would really be a good idea to build MCV stats on
> expressions without also building the standard stats for those
> expressions, otherwise the assumptions that
> mcv_combine_selectivities() makes about simple_sel and mcv_basesel
> wouldn't really hold. But then, if multiple MCV stats shared the same
> expression, it would be quite wasteful to build standard stats on the
> expression more than once.
> 

Yeah. You're right it'd be problematic to build MCV on expressions
without having the per-expression stats. In fact, that's exactly the
problem what forced me to add the per-expression stats to this patch.
Originally I planned to address it in a later patch, but I had to move
it forward.

So I think you're right we need to ensure we have standard stats for
each expression at least once, to make this work well.

> It feels like it should build a single extended stats object for each
> unique expression, with appropriate dependencies for any MCV stats
> that used those expressions, but I'm not sure how complex that would
> be. Dropping the last MCV stat object using a standard expression stat
> object might reasonably drop the expression stats ... except if they
> were explicitly created by the user, independently of any MCV stats.
> That could get quite messy.
> 

Possibly. But I don't think it's worth the extra complexity. I don't
expect people to have a lot of overlapping stats, so the amount of
wasted space and CPU time is expected to be fairly limited.

So I don't think it's worth spending too much time on this now. Let's
just do what you proposed, and revisit this later if needed.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2020-12-07 Thread Dean Rasheed
On Mon, 7 Dec 2020 at 14:15, Tomas Vondra  wrote:
>
> On 12/7/20 10:56 AM, Dean Rasheed wrote:
> > it might actually be
> > neater to have separate documented syntaxes for single- and
> > multi-column statistics:
> >
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> > ON (expression)
> > FROM table_name
> >
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> > [ ( statistics_kind [, ... ] ) ]
> > ON { column_name | (expression) } , { column_name | (expression) } [, 
> > ...]
> > FROM table_name
>
> I think it makes sense in general. I see two issues with this approach,
> though:
>
> * By adding expression/standard stats for individual statistics, it
> makes the list of statistics longer - I wonder if this might have
> measurable impact on lookups in this list.
>
> * I'm not sure it's a good idea that the second syntax would always
> build the per-expression stats. Firstly, it seems a bit strange that it
> behaves differently than the other kinds. Secondly, I wonder if there
> are cases where it'd be desirable to explicitly disable building these
> per-expression stats. For example, what if we have multiple extended
> statistics objects, overlapping on a couple expressions. It seems
> pointless to build the stats for all of them.
>

Hmm, I'm not sure it would really be a good idea to build MCV stats on
expressions without also building the standard stats for those
expressions, otherwise the assumptions that
mcv_combine_selectivities() makes about simple_sel and mcv_basesel
wouldn't really hold. But then, if multiple MCV stats shared the same
expression, it would be quite wasteful to build standard stats on the
expression more than once.

It feels like it should build a single extended stats object for each
unique expression, with appropriate dependencies for any MCV stats
that used those expressions, but I'm not sure how complex that would
be. Dropping the last MCV stat object using a standard expression stat
object might reasonably drop the expression stats ... except if they
were explicitly created by the user, independently of any MCV stats.
That could get quite messy.

Regards,
Dean




Re: PoC/WIP: Extended statistics on expressions

2020-12-07 Thread Tomas Vondra



On 12/7/20 10:56 AM, Dean Rasheed wrote:
> On Thu, 3 Dec 2020 at 15:23, Tomas Vondra  
> wrote:
>>
>> Attached is a patch series rebased on top of 25a9e54d2d.
> 
> After reading this thread and [1], I think I prefer the name
> "standard" rather than "expressions", because it is meant to describe
> the kind of statistics being built rather than what they apply to, but
> maybe that name doesn't actually need to be exposed to the end user:
> 
> Looking at the current behaviour, there are a couple of things that
> seem a little odd, even though they are understandable. For example,
> the fact that
> 
>   CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;
> 
> fails, but
> 
>   CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;
> 
> succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax
> 
>   CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;
> 
> tends to suggest that it's going to create statistics on the pair of
> expressions, describing their correlation, when actually it builds 2
> independent statistics. Also, this error text isn't entirely accurate:
> 
>   CREATE STATISTICS s ON col FROM tbl;
>   ERROR:  extended statistics require at least 2 columns
> 
> because extended statistics don't always require 2 columns, they can
> also just have an expression, or multiple expressions and 0 or 1
> columns.
> 
> I think a lot of this stems from treating "expressions" in the same
> way as the other (multi-column) stats kinds, and it might actually be
> neater to have separate documented syntaxes for single- and
> multi-column statistics:
> 
>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> ON (expression)
> FROM table_name
> 
>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> [ ( statistics_kind [, ... ] ) ]
> ON { column_name | (expression) } , { column_name | (expression) } [, ...]
> FROM table_name
> 
> The first syntax would create single-column stats, and wouldn't accept
> a statistics_kind argument, because there is only one kind of
> single-column statistic. Maybe that might change in the future, but if
> so, it's likely that the kinds of single-column stats will be
> different from the kinds of multi-column stats.
> 
> In the second syntax, the only accepted kinds would be the current
> multi-column stats kinds (ndistinct, dependencies, and mcv), and it
> would always build stats describing the correlations between the
> columns listed. It would continue to build standard/expression stats
> on any expressions in the list, but that's more of an implementation
> detail.
> 
> It would no longer be possible to do "CREATE STATISTICS s
> (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
> issue 2 separate "CREATE STATISTICS" commands, but that seems more
> logical, because they're independent stats.
> 
> The parsing code might not change much, but some of the errors would
> be different. For example, the errors "building only extended
> expression statistics on simple columns not allowed" and "extended
> expression statistics require at least one expression" would go away,
> and the error "extended statistics require at least 2 columns" might
> become more specific, depending on the stats kind.
> 

I think it makes sense in general. I see two issues with this approach,
though:

* By adding expression/standard stats for individual statistics, it
makes the list of statistics longer - I wonder if this might have
measurable impact on lookups in this list.

* I'm not sure it's a good idea that the second syntax would always
build the per-expression stats. Firstly, it seems a bit strange that it
behaves differently than the other kinds. Secondly, I wonder if there
are cases where it'd be desirable to explicitly disable building these
per-expression stats. For example, what if we have multiple extended
statistics objects, overlapping on a couple expressions. It seems
pointless to build the stats for all of them.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2020-12-07 Thread Dean Rasheed
On Thu, 3 Dec 2020 at 15:23, Tomas Vondra  wrote:
>
> Attached is a patch series rebased on top of 25a9e54d2d.

After reading this thread and [1], I think I prefer the name
"standard" rather than "expressions", because it is meant to describe
the kind of statistics being built rather than what they apply to, but
maybe that name doesn't actually need to be exposed to the end user:

Looking at the current behaviour, there are a couple of things that
seem a little odd, even though they are understandable. For example,
the fact that

  CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;

fails, but

  CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;

succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax

  CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;

tends to suggest that it's going to create statistics on the pair of
expressions, describing their correlation, when actually it builds 2
independent statistics. Also, this error text isn't entirely accurate:

  CREATE STATISTICS s ON col FROM tbl;
  ERROR:  extended statistics require at least 2 columns

because extended statistics don't always require 2 columns, they can
also just have an expression, or multiple expressions and 0 or 1
columns.

I think a lot of this stems from treating "expressions" in the same
way as the other (multi-column) stats kinds, and it might actually be
neater to have separate documented syntaxes for single- and
multi-column statistics:

  CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
ON (expression)
FROM table_name

  CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
[ ( statistics_kind [, ... ] ) ]
ON { column_name | (expression) } , { column_name | (expression) } [, ...]
FROM table_name

The first syntax would create single-column stats, and wouldn't accept
a statistics_kind argument, because there is only one kind of
single-column statistic. Maybe that might change in the future, but if
so, it's likely that the kinds of single-column stats will be
different from the kinds of multi-column stats.

In the second syntax, the only accepted kinds would be the current
multi-column stats kinds (ndistinct, dependencies, and mcv), and it
would always build stats describing the correlations between the
columns listed. It would continue to build standard/expression stats
on any expressions in the list, but that's more of an implementation
detail.

It would no longer be possible to do "CREATE STATISTICS s
(expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
issue 2 separate "CREATE STATISTICS" commands, but that seems more
logical, because they're independent stats.

The parsing code might not change much, but some of the errors would
be different. For example, the errors "building only extended
expression statistics on simple columns not allowed" and "extended
expression statistics require at least one expression" would go away,
and the error "extended statistics require at least 2 columns" might
become more specific, depending on the stats kind.

Regards,
Dean

[1] 
https://www.postgresql.org/message-id/flat/1009.1579038764%40sss.pgh.pa.us#8624792a20ae595683b574f5933dae53




Re: PoC/WIP: Extended statistics on expressions

2020-11-24 Thread Tomas Vondra



On 11/24/20 5:23 PM, Justin Pryzby wrote:
> On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
>> 0004 - Seems fine. IMHO not really "silly errors" but OK.
> 
> This is one of the same issues you pointed out - shadowing a variable.
> Could be backpatched.
> 
> On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
>>> +errmsg("statistics expressions and 
>>> predicates can refer only to the table being indexed")));
>>> +* partial-index predicates.  Create it in the per-index context to 
>>> be
>>>
>>> I think these are copied and shouldn't mention "indexes" or "predicates".  
>>> Or
>>> should statistics support predicates, too ?
>>>
>>
>> Right. Stupid copy-pasto.
> 
> Right, but then I was wondering if CREATE STATS should actually support
> predicates, since one use case is to do what indexes do without their 
> overhead.
> I haven't thought about it enough yet.
> 

Well, it's not supported now, so the message is bogus. I'm not against
supporting "partial statistics" with predicates in the future, but it's
going to be non-trivial project on it's own. It's not something I can
bolt onto the current patch easily.

>> 0006 - Not sure. I think CreateStatistics can be fixed with less code,
>> keeping it more like PG13 (good for backpatching). Not sure why rename
>> extended statistics to multi-variate statistics - we use "extended"
>> everywhere.
> 
> -   if (build_expressions && (list_length(stxexprs) == 0))
> +   if (!build_expressions_only && (list_length(stmt->exprs) < 2))
> ereport(ERROR,  
> (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> -errmsg("extended expression statistics 
> require at least one expression")));
> +errmsg("multi-variate statistics require at 
> least two columns")));
> 
> I think all of "CREATE STATISTICS" has been known as "extended stats", so I
> think it may be confusing to say that it requires two columns for the general
> facility.
> 
>> Not sure what's the point of serialize_expr_stats changes,
>> that's code is mostly copy-paste from update_attstats.
> 
> Right.  I think "i" is poor variable name when it isn't a loop variable and 
> not
> of limited scope.
> 

OK, I understand. I'll consider tweaking that.

>> 0007 - I suspect this makes the pg_stats_ext too complex to work with,
>> IMHO we should move this to a separate view.
> 
> Right - then unnest() the whole thing and return one row per expression rather
> than array, as you've done.  Maybe the docs should say that this returns one
> row per expression.
> 
> Looking quickly at your new patch: I guess you know there's a bunch of
> lingering references to "indexes" and "predicates":
> 
> I don't know if you want to go to the effort to prohibit this.
> postgres=# CREATE STATISTICS asf ON (tableoid::int+1) FROM t;
> CREATE STATISTICS
> 

Hmm, we're already rejecting system attributes, I suppose we should do
the same thing for expressions on system attributes.

> I think a lot of people will find this confusing:
> 
> postgres=# CREATE STATISTICS asf ON i FROM t;
> ERROR:  extended statistics require at least 2 columns
> postgres=# CREATE STATISTICS asf ON (i) FROM t;
> CREATE STATISTICS
> 
> postgres=# CREATE STATISTICS asf (expressions) ON i FROM t;
> ERROR:  extended expression statistics require at least one expression
> postgres=# CREATE STATISTICS asf (expressions) ON (i) FROM t;
> CREATE STATISTICS
> 
> I haven't looked, but is it possible to make it work without parens ?
> 

Hmm, you're right that may be surprising. I suppose we could walk the
expressions while creating the statistics, and replace such trivial
expressions with the nested variable, but I haven't tried. I wonder what
the CREATE INDEX behavior would be in these cases.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2020-11-24 Thread Justin Pryzby
On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
> 0004 - Seems fine. IMHO not really "silly errors" but OK.

This is one of the same issues you pointed out - shadowing a variable.
Could be backpatched.

On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
> > +errmsg("statistics expressions and 
> > predicates can refer only to the table being indexed")));
> > +* partial-index predicates.  Create it in the per-index context to 
> > be
> > 
> > I think these are copied and shouldn't mention "indexes" or "predicates".  
> > Or
> > should statistics support predicates, too ?
> > 
> 
> Right. Stupid copy-pasto.

Right, but then I was wondering if CREATE STATS should actually support
predicates, since one use case is to do what indexes do without their overhead.
I haven't thought about it enough yet.

> 0006 - Not sure. I think CreateStatistics can be fixed with less code,
> keeping it more like PG13 (good for backpatching). Not sure why rename
> extended statistics to multi-variate statistics - we use "extended"
> everywhere.

-   if (build_expressions && (list_length(stxexprs) == 0))
+   if (!build_expressions_only && (list_length(stmt->exprs) < 2))
ereport(ERROR,  
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
-errmsg("extended expression statistics require 
at least one expression")));
+errmsg("multi-variate statistics require at 
least two columns")));

I think all of "CREATE STATISTICS" has been known as "extended stats", so I
think it may be confusing to say that it requires two columns for the general
facility.

> Not sure what's the point of serialize_expr_stats changes,
> that's code is mostly copy-paste from update_attstats.

Right.  I think "i" is poor variable name when it isn't a loop variable and not
of limited scope.

> 0007 - I suspect this makes the pg_stats_ext too complex to work with,
> IMHO we should move this to a separate view.

Right - then unnest() the whole thing and return one row per expression rather
than array, as you've done.  Maybe the docs should say that this returns one
row per expression.

Looking quickly at your new patch: I guess you know there's a bunch of
lingering references to "indexes" and "predicates":

I don't know if you want to go to the effort to prohibit this.
postgres=# CREATE STATISTICS asf ON (tableoid::int+1) FROM t;
CREATE STATISTICS

I think a lot of people will find this confusing:

postgres=# CREATE STATISTICS asf ON i FROM t;
ERROR:  extended statistics require at least 2 columns
postgres=# CREATE STATISTICS asf ON (i) FROM t;
CREATE STATISTICS

postgres=# CREATE STATISTICS asf (expressions) ON i FROM t;
ERROR:  extended expression statistics require at least one expression
postgres=# CREATE STATISTICS asf (expressions) ON (i) FROM t;
CREATE STATISTICS

I haven't looked, but is it possible to make it work without parens ?

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2020-11-22 Thread Tomas Vondra



On 11/23/20 3:26 AM, Justin Pryzby wrote:
> On Sun, Nov 22, 2020 at 08:03:51PM +0100, Tomas Vondra wrote:
>> attached is a significantly improved version of the patch, allowing
>> defining extended statistics on expressions. This fixes most of the
>> problems in the previous WIP version and AFAICS it does pass all
>> regression tests (including under valgrind). There's a bunch of FIXMEs
>> and a couple loose ends, but overall I think it's ready for reviews.
> 
> I was looking at the previous patch, so now read this one instead, and attach
> some proposed fixes.
> 
> +  * This matters especially for * expensive expressions, of course.
> 

The point this was trying to make is that we evaluate the expressions
only once, and use the results to build all extended statistics. Instead
of leaving it up to every "build" to re-evaluate it.

> +   The expression can refer only to columns of the underlying table, but
> +   it can use all columns, not just the ones the statistics is defined
> +   on.
> 
> I don't know what these are trying to say?
> 

D'oh. That's bogus para, copied from the CREATE INDEX docs (where it
talked about the index predicate, which is irrelevant here).

> +errmsg("statistics expressions and 
> predicates can refer only to the table being indexed")));
> +* partial-index predicates.  Create it in the per-index context to be
> 
> I think these are copied and shouldn't mention "indexes" or "predicates".  Or
> should statistics support predicates, too ?
> 

Right. Stupid copy-pasto.

> Idea: if a user specifies no stakinds, and there's no expression specified,
> then you automatically build everything except for expressional stats.  But if
> they specify only one statistics "column", it gives an error.  If that's a
> non-simple column reference, should that instead build *only* expressional
> stats (possibly with a NOTICE, since the user might be intending to make MV
> stats).
> 

Right, that was the intention - but I messed up and it works only if you
specify the "expressions" kind explicitly (and I also added the ERROR
message to expected output by mistake). I agree we should handle this
automatically, so that

   CREATE STATISTICS s ON (a+b) FROM t

works and only creates statistics for the expression.


> I think pg_stats_ext should allow inspecting the pg_statistic data in
> pg_statistic_ext_data.stxdexprs.  I guess array_agg() should be ordered by
> something, so maybe it should use ORDINALITY (?)
> 

I agree we should expose the expression statistics, but I'm not
convinced we should do that in the pg_stats_ext view itself. The problem
is that it's a table bested in a table, essentially, with non-trivial
structure, so I was thinking about adding a separate view exposing just
this one part. Something like pg_stats_ext_expressions, with about the
same structure as pg_stats, or something.

> I hacked more on bootstrap.c so included that here.

Thanks. As for the 0004-0007 patches:

0004 - Seems fine. IMHO not really "silly errors" but OK.

0005 - Mostly OK. The docs wording mostly comes from CREATE INDEX docs,
though. The paragraph about "t1" is old, so if we want to reword it then
maybe we should backpatch too.

0006 - Not sure. I think CreateStatistics can be fixed with less code,
keeping it more like PG13 (good for backpatching). Not sure why rename
extended statistics to multi-variate statistics - we use "extended"
everywhere. Not sure what's the point of serialize_expr_stats changes,
that's code is mostly copy-paste from update_attstats.

0007 - I suspect this makes the pg_stats_ext too complex to work with,
IMHO we should move this to a separate view.


Thanks for the review! I'll try to look more closely at those patches
sometime next week, and merge most of it.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2020-11-16 Thread Tomas Vondra
On 11/16/20 2:49 PM, Tomas Vondra wrote:
> Hi,
>
> ...
>
> 4) apply the statistics
> 
>This is the hard part, really, and the exact state of the support
>depends on type of statistics.
> 
>For ndistinct coefficients, it generally works. I'm sure there may be
>bugs in estimate_num_groups, etc. but in principle it works.
> 
>For MCV lists, it generally works too - you can define statistics on
>the expressions and the estimates should improve. The main downside
>here is that it requires at least two expressions, otherwise we can't
>build/apply the extended statistics. So for example
> 
>   SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0
> 
>may be estimated "correctly", once you drop any of the conditions it
>gets much worse as we don't have stats for individual expressions.
>That's rather annoying - it does not break the extended MCV, but the
>behavior will certainly cause confusion.
> 
>For functional dependencies, the estimation does not work yet. Also,
>the missing per-column statistics have bigger impact than on MCV,
>because while MCV can work fine without it, the dependencies heavily
>rely on the per-column estimates. We only apply "corrections" based
>on the dependency degree, so we still need (good) per-column
>estimates, which does not quite work with the expressions.
> 
> 
>Of course, the lack of per-expression statistics may be somewhat
>fixed by adding indexes on expressions, but that's kinda expensive.
> 

FWIW after re-reading [1], I think the plan to build pg_statistic rows
for expressions and stash them in pg_statistic_ext_data is the way to
go. I was thinking that maybe we'll need some new statistics type to
request this, e.g.

   CREATE STATISTICS s (expressions) ON ...

but on second thought I think we should just build this whenever there
are expressions in the definition. It'll require some changes (e.g. we
require at least two items in the list, but we'll want to allow building
stats on a single expression too, I guess), but that's doable.

Of course, we don't have any catalogs with composite types yet, so it's
not 100% sure this will work, but it's worth a try.

regards


[1]
https://www.postgresql.org/message-id/flat/6331.1579041473%40sss.pgh.pa.us#5ec6af7583e84cef2ca6a9e8a713511e

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company