Re: Statistics Import and Export

2024-05-16 Thread Jeff Davis
On Thu, 2024-05-16 at 05:25 -0400, Corey Huinker wrote:
> 
> Per previous comments, it was suggested by others that:
> 
> - having them in SECTION_NONE was a grave mistake
> - Everything that could belong in SECTION_DATA should, and the rest
> should be in SECTION_POST_DATA

I don't understand the gravity of the choice here: what am I missing?

To be clear: I'm not arguing against it, but I'd like to understand it
better. Perhaps it has to do with the relationship between the sections
and the dependencies?

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-05-16 Thread Corey Huinker
>
> Can you explain what you did with the
> SECTION_NONE/SECTION_DATA/SECTION_POST_DATA over v19-v21 and why?
>

Initially, I got things to work by having statistics import behave like
COMMENTs, which meant that they were run immediately after the
table/matview/index/constraint that created the pg_class/pg_attribute
entries, but they could be suppressed with a --noX flag.

Per previous comments, it was suggested by others that:

- having them in SECTION_NONE was a grave mistake
- Everything that could belong in SECTION_DATA should, and the rest should
be in SECTION_POST_DATA
- This would almost certainly require the statistics import commands to be
TOC objects (one object per pg_class entry, not one object per function
call)

Turning them into TOC objects was a multi-phase process.

1. The TOC entries are generated with dependencies (the parent pg_class
object as well as the potential unique/pk constraint in the case of
indexes), but no statements are generated (in case the stats are filtered
out or the parent object is filtered out). This TOC entry must have
everything we'll need to later generate the function calls. So far, that
information is the parent name, parent schema, and relkind of the parent
object.

2. The TOC entries get sorted by dependencies, and additional dependencies
are added which enforce the PRE/DATA/POST boundaries. This is where knowing
the parent object's relkind is required, as that determines the DATA/POST
section.

3. Now the TOC entry is able to stand on its own, and generate the
statements if they survive the dump/restore filters. Most of the later
versions of the patch were efforts to get the objects to fall into the
right PRE/DATA/POST sections, and the central bug was that the dependencies
passed into ARCHIVE_OPTS were incorrect, as the dependent object passed in
was now the new TOC object, not the parent TOC object. Once that was
resolved, things fell into place.
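For illustration, a rough sketch of the kind of statements such a TOC entry
might emit for a single pg_class entry, using the function names and named
parameters that appear elsewhere in this thread (the table name and exact
argument list here are placeholders; the signatures were still being
discussed at this point):

SELECT pg_catalog.pg_set_relation_stats(
    'public.foo'::regclass, 999, 3.6::real, 15000);

SELECT pg_catalog.pg_set_attribute_stats(
    relation => 'public.foo'::regclass,
    attname => 'id'::name,
    inherited => false,
    null_frac => 0.1::real,
    avg_width => 4::integer,
    n_distinct => -1::real);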


Re: Statistics Import and Export

2024-05-15 Thread Jeff Davis
On Mon, 2024-05-06 at 23:43 -0400, Corey Huinker wrote:
> 
> v21 attached.
> 
> 0003 is the statistics changes to pg_dump, adding the options -X / --
> statistics-only, and the derivative boolean statisticsOnly. The -P
> option is already used by pg_restore, so instead I chose -X because
> of the passing resemblance to Chi as in the chi-square statistics
> test makes it vaguely statistics-ish. If someone has a better letter,
> I'm listening.
> 
> With that change, people should be able to use pg_dump -X --table=foo
> to dump existing stats for a table and its dependent indexes, and
> then tweak those calls to do tuning work. Have fun with it. If this
> becomes a common use-case then it may make sense to get functions to
> fetch relation/attribute stats for a given relation, either as a
> formed SQL statement or as the parameter values.

Can you explain what you did with the
SECTION_NONE/SECTION_DATA/SECTION_POST_DATA over v19-v21 and why?

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-04-24 Thread Matthias van de Meent
On Wed, 24 Apr 2024 at 21:31, Bruce Momjian  wrote:
>
> On Tue, Apr 23, 2024 at 06:33:48PM +0200, Matthias van de Meent wrote:
> > I've heard of use cases where dumping stats without data would help
> > with production database planner debugging on a non-prod system.
> >
> > Sure, some planner inputs would have to be taken into account too, but
> > having an exact copy of production stats is at least a start and can
> > help build models and alerts for what'll happen when the tables grow
> > larger with the current stats.
> >
> > As for other planner inputs: table size is relatively easy to shim
> > with sparse files; cumulative statistics can be copied from a donor
> > replica if needed, and btree indexes only really really need to
> > contain their highest and lowest values (and need their height set
> > correctly).
>
> Is it possible to prevent stats from being updated by autovacuum

You can set autovacuum_analyze_threshold and *_scale_factor to
excessively high values, which has the effect of disabling autoanalyze
until the table has seen similarly excessive tuple churn. But that won't
guarantee autoanalyze won't run; that guarantee only exists with
autovacuum = off.
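A minimal sketch of that workaround (the table name is a placeholder; the
numbers just need to be large enough that the thresholds are never reached
in practice):

ALTER TABLE foo SET (
    autovacuum_analyze_threshold = 2000000000,
    autovacuum_analyze_scale_factor = 100
);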

> and other methods?

No nice ways. AFAIK there is no command (or command sequence) that can
"disable" only ANALYZE and that also guarantees statistics won't be
updated until ANALYZE is manually "re-enabled" for that table. An
extension could maybe do this, but I'm not aware of any extension
points where this would hook into PostgreSQL in a nice way.

You can limit maintenance access on the table to only trusted roles
that you know won't go in and run ANALYZE for those tables, or even
only your superuser (so only they can run ANALYZE, and have them
promise they won't). Alternatively, you can also constantly keep a
lock on the table that conflicts with ANALYZE. The last few are just
workarounds though, and not something I'd suggest running on a
production database.
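A sketch of the lock-based workaround, assuming a table named foo: ANALYZE
takes SHARE UPDATE EXCLUSIVE, which conflicts with itself, so holding that
lock in an open transaction blocks ANALYZE for as long as the transaction
lasts (with all the usual downsides of a long-lived lock):

BEGIN;
LOCK TABLE foo IN SHARE UPDATE EXCLUSIVE MODE;
-- keep this transaction open for as long as ANALYZE must be kept out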

Kind regards,

Matthias van de Meent




Re: Statistics Import and Export

2024-04-24 Thread Bruce Momjian
On Tue, Apr 23, 2024 at 06:33:48PM +0200, Matthias van de Meent wrote:
> I've heard of use cases where dumping stats without data would help
> with production database planner debugging on a non-prod system.
> 
> Sure, some planner inputs would have to be taken into account too, but
> having an exact copy of production stats is at least a start and can
> help build models and alerts for what'll happen when the tables grow
> larger with the current stats.
> 
> As for other planner inputs: table size is relatively easy to shim
> with sparse files; cumulative statistics can be copied from a donor
> replica if needed, and btree indexes only really really need to
> contain their highest and lowest values (and need their height set
> correctly).

Is it possible to prevent stats from being updated by autovacuum and
other methods?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Statistics Import and Export

2024-04-23 Thread Matthias van de Meent
On Tue, 23 Apr 2024, 05:52 Tom Lane,  wrote:
> Jeff Davis  writes:
> > On Mon, 2024-04-22 at 16:19 -0400, Tom Lane wrote:
> >> Loading data without stats, and hoping
> >> that auto-analyze will catch up sooner not later, is exactly the
> >> current behavior that we're doing all this work to get out of.
>
> > That's the disconnect, I think. For me, the main reason I'm excited
> > about this work is as a way to solve the bad-plans-after-upgrade
> > problem and to repro planner issues outside of production. Avoiding the
> > need to ANALYZE at the end of a data load is also a nice convenience,
> > but not a primary driver (for me).
>
> Oh, I don't doubt that there are use-cases for dumping stats without
> data.  I'm just dubious about the reverse.  I think data+stats should
> be the default, even if only because pg_dump's default has always
> been to dump everything.  Then there should be a way to get stats
> only, and maybe a way to get data only.  Maybe this does argue for a
> four-section definition, despite the ensuing churn in the pg_dump API.

I've heard of use cases where dumping stats without data would help
with production database planner debugging on a non-prod system.

Sure, some planner inputs would have to be taken into account too, but
having an exact copy of production stats is at least a start and can
help build models and alerts for what'll happen when the tables grow
larger with the current stats.

As for other planner inputs: table size is relatively easy to shim
with sparse files; cumulative statistics can be copied from a donor
replica if needed, and btree indexes only really really need to
contain their highest and lowest values (and need their height set
correctly).

Kind regards,

Matthias van de Meent




Re: Statistics Import and Export

2024-04-22 Thread Tom Lane
Jeff Davis  writes:
> On Mon, 2024-04-22 at 16:19 -0400, Tom Lane wrote:
>> Loading data without stats, and hoping
>> that auto-analyze will catch up sooner not later, is exactly the
>> current behavior that we're doing all this work to get out of.

> That's the disconnect, I think. For me, the main reason I'm excited
> about this work is as a way to solve the bad-plans-after-upgrade
> problem and to repro planner issues outside of production. Avoiding the
> need to ANALYZE at the end of a data load is also a nice convenience,
> but not a primary driver (for me).

Oh, I don't doubt that there are use-cases for dumping stats without
data.  I'm just dubious about the reverse.  I think data+stats should
be the default, even if only because pg_dump's default has always
been to dump everything.  Then there should be a way to get stats
only, and maybe a way to get data only.  Maybe this does argue for a
four-section definition, despite the ensuing churn in the pg_dump API.

> Should we just itemize some common use cases for pg_dump, and then
> choose the defaults that are least likely to cause surprise?

Per above, I don't find any difficulty in deciding what should be the
default.  What I think we need to consider is what the pg_dump and
pg_restore switch sets should be.  There's certainly a few different
ways we could present that; maybe we should sketch out the details for
a couple of ways.

regards, tom lane




Re: Statistics Import and Export

2024-04-22 Thread Jeff Davis
On Mon, 2024-04-22 at 16:19 -0400, Tom Lane wrote:
> Loading data without stats, and hoping
> that auto-analyze will catch up sooner not later, is exactly the
> current behavior that we're doing all this work to get out of.

That's the disconnect, I think. For me, the main reason I'm excited
about this work is as a way to solve the bad-plans-after-upgrade
problem and to repro planner issues outside of production. Avoiding the
need to ANALYZE at the end of a data load is also a nice convenience,
but not a primary driver (for me).

Should we just itemize some common use cases for pg_dump, and then
choose the defaults that are least likely to cause surprise?

As for the section, I'm not sure what to do about that. Based on this
thread it seems that SECTION_NONE (or a SECTION_STATS?) is easiest to
implement, but I don't understand the long-term consequences of that
choice.

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-04-22 Thread Tom Lane
Jeff Davis  writes:
> Would it make sense to have a new SECTION_STATS?

Perhaps, but the implications for pg_dump's API would be nontrivial,
eg would we break any applications that know about the current
options for --section.  And you still have to face up to the question
"does --data-only include this stuff?".

> Philosophically, I suppose stats are data, but I still don't understand
> why considering stats to be data is so important in pg_dump.
> Practically, I want to dump stats XOR data. That's because, if I dump
> the data, it's so costly to reload and rebuild indexes that it's not
> very important to avoid a re-ANALYZE.

Hmm, interesting point.  But the counterargument to that is that
the cost of building indexes will also dwarf the cost of installing
stats, so why not do so?  Loading data without stats, and hoping
that auto-analyze will catch up sooner not later, is exactly the
current behavior that we're doing all this work to get out of.
I don't really think we want it to continue to be the default.

regards, tom lane




Re: Statistics Import and Export

2024-04-22 Thread Jeff Davis
On Wed, 2024-04-17 at 11:50 -0500, Nathan Bossart wrote:
> It looks like the problem is that the ACLs are getting dumped in the data
> section when we are also dumping stats.  I'm able to get the tests to pass
> by moving the call to dumpRelationStats() that's in dumpTableSchema() to
> dumpTableData().  I'm not entirely sure why that fixes it yet, but if we're
> treating stats as data, then it intuitively makes sense for us to dump it
> in dumpTableData().

Would it make sense to have a new SECTION_STATS?

>  However, that seems to prevent the stats from getting
> exported in the --schema-only/--binary-upgrade scenario, which presents a
> problem for pg_upgrade.  ISTM we'll need some extra hacks to get this to
> work as desired.

Philosophically, I suppose stats are data, but I still don't understand
why considering stats to be data is so important in pg_dump.

Practically, I want to dump stats XOR data. That's because, if I dump
the data, it's so costly to reload and rebuild indexes that it's not
very important to avoid a re-ANALYZE.

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-04-17 Thread Nathan Bossart
On Thu, Apr 11, 2024 at 03:54:07PM -0400, Corey Huinker wrote:
> At the request of a few people, attached is an attempt to move stats to
> DATA/POST-DATA, and the TAP test failure that results from that.
> 
> The relevant errors are confusing, in that they all concern GRANT/REVOKE,
> and the fact that I made no changes to the TAP test itself.
> 
> $ grep 'not ok' build/meson-logs/testlog.txt
> not ok 9347 - section_data: should not dump GRANT INSERT(col1) ON TABLE
> test_second_table

It looks like the problem is that the ACLs are getting dumped in the data
section when we are also dumping stats.  I'm able to get the tests to pass
by moving the call to dumpRelationStats() that's in dumpTableSchema() to
dumpTableData().  I'm not entirely sure why that fixes it yet, but if we're
treating stats as data, then it intuitively makes sense for us to dump it
in dumpTableData().  However, that seems to prevent the stats from getting
exported in the --schema-only/--binary-upgrade scenario, which presents a
problem for pg_upgrade.  ISTM we'll need some extra hacks to get this to
work as desired.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Statistics Import and Export

2024-04-06 Thread Corey Huinker
>
>
>
> I think it'll be a serious, serious error for this not to be
> SECTION_DATA.  Maybe POST_DATA is OK, but even that seems like
> an implementation compromise not "the way it ought to be".
>

We'd have to split them on account of when the underlying object is
created. Index statistics would be SECTION_POST_DATA, and everything else
would be SECTION_DATA. Looking ahead, statistics data for extended
statistics objects would also be POST. That's not a big change, but my
first attempt at that resulted in a bunch of unrelated grants dumping in
the wrong section.
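A minimal sketch (not the patch's actual code) of how that split could be
expressed in pg_dump, keying off the parent object's relkind:

static teSection
statsDumpSection(char relkind)
{
	/* index statistics can only be restored after the index exists */
	if (relkind == RELKIND_INDEX || relkind == RELKIND_PARTITIONED_INDEX)
		return SECTION_POST_DATA;
	/* table/matview statistics can go with the data */
	return SECTION_DATA;
}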


Re: Statistics Import and Export

2024-04-05 Thread Tom Lane
Jeff Davis  writes:
> Thank you, this has improved a lot and the fundamentals are very close.
> I think it could benefit from a bit more time to settle on a few
> issues:

Yeah ... it feels like we aren't quite going to manage to get this
over the line for v17.  We could commit with the hope that these
last details will get sorted later, but that path inevitably leads
to a mess.

> 1. SECTION_NONE. Conceptually, stats are more like data, and so
> intuitively I would expect this in the SECTION_DATA or
> SECTION_POST_DATA. However, the two most important use cases (in my
> opinion) don't involve dumping the data: pg_upgrade (data doesn't come
> from the dump) and planner simulations/repros. Perhaps the section we
> place it in is not a critical decision, but we will need to stick with
> it for a long time, and I'm not sure that we have consensus on that
> point.

I think it'll be a serious, serious error for this not to be
SECTION_DATA.  Maybe POST_DATA is OK, but even that seems like
an implementation compromise not "the way it ought to be".

> 2. We changed the stats import function API to be VARIADIC very
> recently. After we have a bit of time to think on it, I'm not 100% sure
> we will want to stick with that new API. It's not easy to document,
> which is something I always like to consider.

Perhaps.  I think the argument of wanting to be able to salvage
something even in the presence of unrecognized stats types is
stronger, but I agree this could use more time in the oven.
Unlike many other things in this patch, this would be nigh
impossible to reconsider later.

> 3. The error handling also changed recently to change soft errors (i.e.
> type input errors) to warnings. I like this change but I'd need a bit
> more time to get comfortable with how this is done, there is not a lot
> of precedent for doing this kind of thing.

I don't think there's much disagreement that that's the right thing,
but yeah there could be bugs or some more to do in this area.

regards, tom lane




Re: Statistics Import and Export

2024-04-05 Thread Jeff Davis
On Thu, 2024-04-04 at 00:30 -0400, Corey Huinker wrote:
> 
> v17
> 
> 0001
> - array_in now repackages cast errors as warnings and skips the stat,
> test added
> - version parameter added, though it's mostly for future
> compatibility, tests modified
> - both functions delay object/attribute locking until absolutely
> necessary
> - general cleanup
> 
> 0002
> - added version parameter to dumps
> - --schema-only will not dump stats unless in binary upgrade mode
> - stats are dumped SECTION_NONE
> - general cleanup
> 
> I think that covers the outstanding issues. 

Thank you, this has improved a lot and the fundamentals are very close.

I think it could benefit from a bit more time to settle on a few
issues:

1. SECTION_NONE. Conceptually, stats are more like data, and so
intuitively I would expect this in the SECTION_DATA or
SECTION_POST_DATA. However, the two most important use cases (in my
opinion) don't involve dumping the data: pg_upgrade (data doesn't come
from the dump) and planner simulations/repros. Perhaps the section we
place it in is not a critical decision, but we will need to stick with
it for a long time, and I'm not sure that we have consensus on that
point.

2. We changed the stats import function API to be VARIADIC very
recently. After we have a bit of time to think on it, I'm not 100% sure
we will want to stick with that new API. It's not easy to document,
which is something I always like to consider.

3. The error handling also changed recently to change soft errors (i.e.
type input errors) to warnings. I like this change but I'd need a bit
more time to get comfortable with how this is done, there is not a lot
of precedent for doing this kind of thing. This is connected to the
return value, as well as the machine-readability concern that Magnus
raised.

Additionally, a lot of people are simply very busy around this time,
and may not have had a chance to opine on all the recent changes yet.

Regards,
Jeff Davis






Re: Statistics Import and Export

2024-04-05 Thread Ashutosh Bapat
On Fri, Apr 5, 2024 at 10:07 AM Tom Lane  wrote:

> Ashutosh Bapat  writes:
> > I read that discussion, and it may be ok for pg_upgrade/pg_dump usecase and
> > maybe also for IMPORT foreign schema where the SQL is generated by
> > PostgreSQL itself. But not for simulating statistics. In that case, if the
> > function happily installs statistics cooked by the user and those aren't
> > used anywhere, users may be misled by the plans that are generated
> > subsequently. Thus negating the very purpose of simulating
> > statistics.
>
> I'm not sure what you think the "purpose of simulating statistics" is,
> but it seems like you have an extremely narrow-minded view of it.
> I think we should allow injecting any stats that won't actively crash
> the backend.  Such functionality could be useful for stress-testing
> the planner, for example, or even just to see what it would do in
> a situation that is not what you have.
>

My reply was in the following context

> For a partitioned table this value has to be true. For a normal table when
>> setting this value to true, it should at least make sure that the table has
>> at least one child. Otherwise it should throw an error. Blindly accepting
>> the given value may render the statistics unusable. Prologue of the
>> function needs to be updated accordingly.
>>
>
> I can see rejecting non-inherited stats for a partitioned table. The
> reverse, however, isn't true, because a table may end up being inherited by
> another, so those statistics may be legit. Having said that, a great deal
> of the data validation I was doing was seen as unnecessary, so I'm not sure
> where this check would fall on that line. It's a trivial check if we do add
> it.
>

If a user installs inherited stats for a non-inherited table by accidentally
passing true to the corresponding argument, those stats won't even be used.
The user wouldn't know that those stats are not used. Yet, they would think
that any change in the plans is the result of their stats. So whatever
simulation experiment they are running would lead to wrong conclusions.
This could be easily avoided by raising an error. Similarly for installing
non-inherited stats for a partitioned table. There might be other scenarios
where the error won't be required.

-- 
Best Wishes,
Ashutosh Bapat


Re: Statistics Import and Export

2024-04-04 Thread Tom Lane
Ashutosh Bapat  writes:
> I read that discussion, and it may be ok for pg_upgrade/pg_dump usecase and
> maybe also for IMPORT foreign schema where the SQL is generated by
> PostgreSQL itself. But not for simulating statistics. In that case, if the
> function happily installs statistics cooked by the user and those aren't
> used anywhere, users may be misled by the plans that are generated
> subsequently. Thus negating the very purpose of simulating
> statistics.

I'm not sure what you think the "purpose of simulating statistics" is,
but it seems like you have an extremely narrow-minded view of it.
I think we should allow injecting any stats that won't actively crash
the backend.  Such functionality could be useful for stress-testing
the planner, for example, or even just to see what it would do in
a situation that is not what you have.

Note that I don't think pg_dump or pg_upgrade need to support
injection of counterfactual statistics.  But direct calls of the
stats insertion functions should be able to do so.

regards, tom lane




Re: Statistics Import and Export

2024-04-04 Thread Ashutosh Bapat
On Fri, Apr 5, 2024 at 7:00 AM Corey Huinker 
wrote:

> For a partitioned table this value has to be true. For a normal table when
>> setting this value to true, it should at least make sure that the table has
>> at least one child. Otherwise it should throw an error. Blindly accepting
>> the given value may render the statistics unusable. Prologue of the
>> function needs to be updated accordingly.
>>
>
> I can see rejecting non-inherited stats for a partitioned table. The
> reverse, however, isn't true, because a table may end up being inherited by
> another, so those statistics may be legit. Having said that, a great deal
> of the data validation I was doing was seen as unnecessary, so I'm not sure
> where this check would fall on that line. It's a trivial check if we do add
> it.
>

I read that discussion, and it may be ok for pg_upgrade/pg_dump usecase and
maybe also for IMPORT foreign schema where the SQL is generated by
PostgreSQL itself. But not for simulating statistics. In that case, if the
function happily installs statistics cooked by the user and those aren't
used anywhere, users may be misled by the plans that are generated
subsequently. Thus negating the very purpose of simulating statistics. Once
the feature is out there, we won't be able to restrict its usage unless we
document the possible anomalies.

-- 
Best Wishes,
Ashutosh Bapat


Re: Statistics Import and Export

2024-04-04 Thread Corey Huinker
>
> For a partitioned table this value has to be true. For a normal table when
> setting this value to true, it should at least make sure that the table has
> at least one child. Otherwise it should throw an error. Blindly accepting
> the given value may render the statistics unusable. Prologue of the
> function needs to be updated accordingly.
>

I can see rejecting non-inherited stats for a partitioned table. The
reverse, however, isn't true, because a table may end up being inherited by
another, so those statistics may be legit. Having said that, a great deal
of the data validation I was doing was seen as unnecessary, so I'm not sure
where this check would fall on that line. It's a trivial check if we do add
it.


Re: Statistics Import and Export

2024-04-03 Thread Michael Paquier
On Mon, Apr 01, 2024 at 01:21:53PM -0400, Tom Lane wrote:
> I'm not sure.  I think if we put our heads down we could finish
> the changes I'm suggesting and resolve the other issues this week.
> However, it is starting to feel like the sort of large, barely-ready
> patch that we often regret cramming in at the last minute.  Maybe
> we should agree that the first v18 CF would be a better time to
> commit it.

There are still 4 days remaining, so there's still time, but my
overall experience on the matter with my RMT hat on is telling me that
we should not rush this patch set.  Redesigning portions close to the
end of a dev cycle is not a good sign, I am afraid, especially if the
sub-parts of the design don't fit well in the global picture as that
could mean more maintenance work on stable branches in the long term.
Still, it is very good to be aware of the problems because you'd know
what to tackle to reach the goals of this patch set.
--
Michael


signature.asc
Description: PGP signature


Re: Statistics Import and Export

2024-04-03 Thread Tom Lane
Corey Huinker  writes:
>> As far as that goes, it shouldn't be that hard to deal with, at least
>> not for "soft" errors which hopefully cover most input-function
>> failures these days.  You should be invoking array_in via
>> InputFunctionCallSafe and passing a suitably-set-up ErrorSaveContext.
>> (Look at pg_input_error_info() for useful precedent.)

> Ah, my understanding may be out of date. I was under the impression that
> that mechanism relied on the cooperation of the per-element input
> function, so even if we got all the builtin datatypes to play nice with
> *Safe(), we were always going to be at risk with a user-defined input
> function.

That's correct, but it's silly not to do what we can.  Also, I imagine
that there is going to be high evolutionary pressure on UDTs to
support soft error mode for COPY, so over time the problem will
decrease --- as long as we invoke the soft error mode.

> 1. NULL input =>  Return NULL. (because strict).
> 2. Actual error (permissions, cache lookup not found, etc) => Raise ERROR
> (thus ruining binary upgrade)
> 3. Call values are so bad (examples: attname not found, required stat
> missing) that nothing can recover => WARN, return FALSE.
> 4. At least one stakind-stat is wonky (impossible for datatype, missing
> stat pair, wrong type on input parameter), but that's the worst of it => 1
> to N WARNs, write stats that do make sense, return TRUE.
> 5. Hunky-dory. => No warns. Write all stats. return TRUE.

> Which of those seem like good ereturn candidates to you?

I'm good with all those behaviors.  On reflection, the design I was
vaguely imagining wouldn't cope with case 4 (multiple WARNs per call)
so never mind that.

regards, tom lane




Re: Statistics Import and Export

2024-04-03 Thread Corey Huinker
>
>
> As far as that goes, it shouldn't be that hard to deal with, at least
> not for "soft" errors which hopefully cover most input-function
> failures these days.  You should be invoking array_in via
> InputFunctionCallSafe and passing a suitably-set-up ErrorSaveContext.
> (Look at pg_input_error_info() for useful precedent.)
>

Ah, my understanding may be out of date. I was under the impression that
that mechanism relied on the cooperation of the per-element input
function, so even if we got all the builtin datatypes to play nice with
*Safe(), we were always going to be at risk with a user-defined input
function.


> There might be something to be said for handling all the error
> cases via an ErrorSaveContext and use of ereturn() instead of
> ereport().  Not sure if it's worth the trouble or not.
>

It would help us tailor the user experience. Right now we have several
endgames. To recap:

1. NULL input =>  Return NULL. (because strict).
2. Actual error (permissions, cache lookup not found, etc) => Raise ERROR
(thus ruining binary upgrade)
3. Call values are so bad (examples: attname not found, required stat
missing) that nothing can recover => WARN, return FALSE.
4. At least one stakind-stat is wonky (impossible for datatype, missing
stat pair, wrong type on input parameter), but that's the worst of it => 1
to N WARNs, write stats that do make sense, return TRUE.
5. Hunky-dory. => No warns. Write all stats. return TRUE.

Which of those seem like good ereturn candidates to you?


Re: Statistics Import and Export

2024-04-03 Thread Tom Lane
Corey Huinker  writes:
> - functions strive to not ERROR unless absolutely necessary. The biggest
> exposure is the call to array_in().

As far as that goes, it shouldn't be that hard to deal with, at least
not for "soft" errors which hopefully cover most input-function
failures these days.  You should be invoking array_in via
InputFunctionCallSafe and passing a suitably-set-up ErrorSaveContext.
(Look at pg_input_error_info() for useful precedent.)
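A minimal sketch of that pattern, assuming the stats value arrives as a C
string and atttypid is the array type to parse it as (variable names are
illustrative, not the patch's):

ErrorSaveContext escontext = {T_ErrorSaveContext};
Oid         typinput;
Oid         typioparam;
FmgrInfo    finfo;
Datum       value;

getTypeInputInfo(atttypid, &typinput, &typioparam);
fmgr_info(typinput, &finfo);

if (!InputFunctionCallSafe(&finfo, stat_str, typioparam, -1,
                           (Node *) &escontext, &value))
{
    /* soft failure: warn and skip this stat rather than erroring out */
    ereport(WARNING,
            (errmsg("invalid statistics value: \"%s\"", stat_str)));
}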

There might be something to be said for handling all the error
cases via an ErrorSaveContext and use of ereturn() instead of
ereport().  Not sure if it's worth the trouble or not.

regards, tom lane




Re: Statistics Import and Export

2024-04-02 Thread Corey Huinker
>
> Yeah, but that problem exists no matter what.  I haven't read enough
> of the patch to find where it's determining that, but I assume there's
> code in there to intuit the statistics storage type depending on the
> table column's data type and the statistics kind.
>

Correct. It borrows a lot from examine_attribute() and the *_typanalyze()
functions. Actually using VacAttrStats proved problematic, but that can be
revisited at some point.


> We could not trust the exporting side to tell us the correct answer;
> for one reason, it might be different across different releases.
> So "derive it reliably on the destination" is really the only option.
>

+1


> I think that it's impossible to do this in the general case, since
> type-specific typanalyze functions can store pretty nearly whatever
> they like.  However, the pg_stats view isn't going to show nonstandard
> statistics kinds anyway, so we are going to be lossy for custom
> statistics kinds.
>

Sadly true.


Re: Statistics Import and Export

2024-04-02 Thread Tom Lane
Jeff Davis  writes:
> We need to get the original element type on the import side somehow,
> right? Otherwise it will be hard to tell whether '{1, 2, 3, 4}' has
> element type "int4" or "text", which affects the binary representation
> of the anyarray value in pg_statistic.

Yeah, but that problem exists no matter what.  I haven't read enough
of the patch to find where it's determining that, but I assume there's
code in there to intuit the statistics storage type depending on the
table column's data type and the statistics kind.

> Either we need to get it at export time (which seems the most reliable
> in principle, but problematic for older versions) and pass it as an
> argument to pg_set_attribute_stats(); or we need to derive it reliably
> from the table schema on the destination side, right?

We could not trust the exporting side to tell us the correct answer;
for one reason, it might be different across different releases.
So "derive it reliably on the destination" is really the only option.

I think that it's impossible to do this in the general case, since
type-specific typanalyze functions can store pretty nearly whatever
they like.  However, the pg_stats view isn't going to show nonstandard
statistics kinds anyway, so we are going to be lossy for custom
statistics kinds.

regards, tom lane




Re: Statistics Import and Export

2024-04-02 Thread Jeff Davis
On Tue, 2024-04-02 at 17:31 -0400, Tom Lane wrote:
> And that means that we don't need the sending
> side to know the element type anyway.

We need to get the original element type on the import side somehow,
right? Otherwise it will be hard to tell whether '{1, 2, 3, 4}' has
element type "int4" or "text", which affects the binary representation
of the anyarray value in pg_statistic.

Either we need to get it at export time (which seems the most reliable
in principle, but problematic for older versions) and pass it as an
argument to pg_set_attribute_stats(); or we need to derive it reliably
from the table schema on the destination side, right?

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-04-02 Thread Corey Huinker
>
> side to know the element type anyway.  So, I apologize for sending
> us down a useless side path.  We may as well stick to the function
> signature as shown in the v15 patch --- although maybe variadic
> any is still worthwhile so that an unrecognized field name doesn't
> need to be a hard error?
>

Variadic is nearly done. This issue was the main blocking point. I can go
back to array_in() as we know that code works.


Re: Statistics Import and Export

2024-04-02 Thread Tom Lane
Jeff Davis  writes:
> On the export side, the problem is that the element type (and
> dimensionality and maybe hasnull) is an important part of the anyarray
> value, but it's not part of the output of anyarray_out(). For new
> versions, we can add a scalar function that simply outputs the
> information we need. For old versions, we can hack it by parsing the
> output of anyarray_send(), which contains the information we need
> (binary outputs are under-specified, but I believe they are specified
> enough in this case).

Yeah, I was thinking yesterday about pulling the anyarray columns in
binary and looking at the header fields.  However, I fear there is a
showstopper problem: anyarray_send will fail if the element type
doesn't have a typsend function, which is entirely possible for
user-defined types (and I'm not even sure we've provided them for
every type in the core distro).  I haven't thought of a good answer
to that other than a new backend function.  However ...

> On the import side, the problem is that there may not be an input
> function to go from a 1-D array of text to a 1-D array of any element
> type we want. For example, there's no input function that will create a
> 1-D array with element type float4[] (that's because Postgres doesn't
> really have arrays-of-arrays, it has multi-dimensional arrays).
> Instead, don't use the input function, pass each element of the 1-D
> text array to the element type's input function (which may be scalar or
> not) and then construct a 1-D array out of that with the appropriate
> element type (which may be scalar or not).

Yup.  I had hoped that we could avoid doing any array-munging inside
pg_set_attribute_stats, but this array-of-arrays problem seems to
mean we have to.  In turn, that means that the whole idea of
declaring the function inputs as anyarray rather than text[] is
probably pointless.  And that means that we don't need the sending
side to know the element type anyway.  So, I apologize for sending
us down a useless side path.  We may as well stick to the function
signature as shown in the v15 patch --- although maybe variadic
any is still worthwhile so that an unrecognized field name doesn't
need to be a hard error?

regards, tom lane




Re: Statistics Import and Export

2024-04-02 Thread Jeff Davis
On Tue, 2024-04-02 at 12:59 -0400, Corey Huinker wrote:
>  However, some of the ANYARRAYs have element types that are
> themselves arrays, and near as I can tell, such a construct is not
> expressible in SQL. So, rather than getting an anyarray of an array
> type, you instead get an array of one higher dimension.

Fundamentally, you want to recreate the exact same anyarray values on
the destination system as they existed on the source. There's some
complexity to that on both the export side as well as the import side,
but I believe the problems are solvable.

On the export side, the problem is that the element type (and
dimensionality and maybe hasnull) is an important part of the anyarray
value, but it's not part of the output of anyarray_out(). For new
versions, we can add a scalar function that simply outputs the
information we need. For old versions, we can hack it by parsing the
output of anyarray_send(), which contains the information we need
(binary outputs are under-specified, but I believe they are specified
enough in this case). There may be other hacks to get the information
from the older systems; that's just an idea. To get the actual data,
doing histogram_bounds::text::text[] seems to be enough: that seems to
always give a one-dimensional array with element type "text", even if
the element type is an array. (Note: this means we need the function's
API to also include this extra information about the anyarray values,
so it might be slightly more complex than name/value pairs).
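A sketch of such an export-side query (column list abbreviated): the double
cast flattens the anyarray into a one-dimensional text[] regardless of the
underlying element type:

SELECT schemaname, tablename, attname,
       histogram_bounds::text::text[] AS histogram_bounds
FROM pg_stats
WHERE histogram_bounds IS NOT NULL;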

On the import side, the problem is that there may not be an input
function to go from a 1-D array of text to a 1-D array of any element
type we want. For example, there's no input function that will create a
1-D array with element type float4[] (that's because Postgres doesn't
really have arrays-of-arrays, it has multi-dimensional arrays).
Instead, don't use the input function, pass each element of the 1-D
text array to the element type's input function (which may be scalar or
not) and then construct a 1-D array out of that with the appropriate
element type (which may be scalar or not).
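A rough sketch of that import-side approach (not the patch's code), assuming
textelems/nelems came from deconstructing the incoming text[] and elemtypid
is the desired element type, which may itself be an array type:

Oid         typinput;
Oid         typioparam;
int16       elmlen;
bool        elmbyval;
char        elmalign;
Datum      *outvals = palloc(nelems * sizeof(Datum));

getTypeInputInfo(elemtypid, &typinput, &typioparam);

for (int i = 0; i < nelems; i++)
    outvals[i] = OidInputFunctionCall(typinput,
                                      TextDatumGetCString(textelems[i]),
                                      typioparam, -1);

get_typlenbyvalalign(elemtypid, &elmlen, &elmbyval, &elmalign);

/* one-dimensional array whose elements are of type elemtypid */
ArrayType  *result = construct_array(outvals, nelems, elemtypid,
                                     elmlen, elmbyval, elmalign);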

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-04-02 Thread Corey Huinker
I have refactored pg_set_relation_stats to be variadic, and I'm working on
pg_set_attribute_stats, but I'm encountering an issue with the anyarray
values.

Jeff suggested looking at anyarray_send as a way of extracting the type,
and with some extra twiddling we can get and cast the type. However, some
of the ANYARRAYs have element types that are themselves arrays, and near as
I can tell, such a construct is not expressible in SQL. So, rather than
getting an anyarray of an array type, you instead get an array of one
higher dimension.  Like so:

# select schemaname, tablename, attname,
    substring(substring(anyarray_send(histogram_bounds) from 9 for 4)::text,2)::bit(32)::integer::regtype,
    substring(substring(anyarray_send(histogram_bounds::text::text[][]) from 9 for 4)::text,2)::bit(32)::integer::regtype
  from pg_stats
  where histogram_bounds is not null
    and tablename = 'pg_proc' and attname = 'proargnames';

 schemaname | tablename |   attname   | substring | substring
------------+-----------+-------------+-----------+-----------
 pg_catalog | pg_proc   | proargnames | text[]    | text

Luckily, passing in such a value would have done all of the element
typechecking for us, so we would just need to move the data to an array of
one less dimension, typed as elem[]. If there's an easy way to do that, I
don't know of it.

What remains is just checking the input types against the expected type of
the array, stepping down the dimension if need be, and skipping if the type
doesn't meet expectations.


Re: Statistics Import and Export

2024-04-02 Thread Jeff Davis
On Tue, 2024-04-02 at 05:38 -0400, Corey Huinker wrote:
> Here's a one-liner patch for disabling update of pg_class
> relpages/reltuples/relallvisible during a binary upgrade.

This change makes sense to me regardless of the rest of the work.
Updating the relpages/reltuples/relallvisible during pg_upgrade before
the data is there will store the wrong stats.

It could use a brief comment, though.

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-04-02 Thread Corey Huinker
Here's a one-liner patch for disabling update of pg_class
relpages/reltuples/relallvisible during a binary upgrade.

This was causing pg_upgrade tests to fail in the existing stats import
work.
From 322db8c9e8796ce673f7d7b63bd64e4c9403a144 Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Mon, 1 Apr 2024 18:25:18 -0400
Subject: [PATCH v1] Disable updating pg_class for CREATE INDEX during binary
 upgrade.

There is no point in setting these values because the table will always
be empty.
---
 src/backend/catalog/index.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b6a7c60e23..f83b35add5 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2874,7 +2874,7 @@ index_update_stats(Relation rel,
 		dirty = true;
 	}
 
-	if (reltuples >= 0)
+	if ((reltuples >= 0) && (!IsBinaryUpgrade))
 	{
 		BlockNumber relpages = RelationGetNumberOfBlocks(rel);
 		BlockNumber relallvisible;
-- 
2.44.0
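Per Jeff's note upthread that this change could use a brief comment, one
possible wording (a suggestion only, not committed text) would be:

	/*
	 * During binary upgrade the table is still empty, so skip updating
	 * relpages/reltuples/relallvisible here; doing so would clobber any
	 * statistics carried over from the old cluster.
	 */
	if ((reltuples >= 0) && (!IsBinaryUpgrade))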



Re: Statistics Import and Export

2024-04-01 Thread Tom Lane
Corey Huinker  writes:
> A boolean is what we had before, I'm quite comfortable with that, and it
> addresses my silent-failure concerns.

WFM.

regards, tom lane




Re: Statistics Import and Export

2024-04-01 Thread Corey Huinker
>
> If we are envisioning that the function might emit multiple warnings
> per call, a useful definition could be to return the number of
> warnings (so zero is good, not-zero is bad).  But I'm not sure that's
> really better than a boolean result.  pg_dump/pg_restore won't notice
> anyway, but perhaps other programs using these functions would care.
>

A boolean is what we had before, I'm quite comfortable with that, and it
addresses my silent-failure concerns.


Re: Statistics Import and Export

2024-04-01 Thread Tom Lane
Corey Huinker  writes:
> Any thoughts about going back to having a return value, a caller could then
> see that the function returned NULL rather than whatever the expected value
> was (example: TRUE)?

If we are envisioning that the function might emit multiple warnings
per call, a useful definition could be to return the number of
warnings (so zero is good, not-zero is bad).  But I'm not sure that's
really better than a boolean result.  pg_dump/pg_restore won't notice
anyway, but perhaps other programs using these functions would care.

regards, tom lane




Re: Statistics Import and Export

2024-04-01 Thread Corey Huinker
>
>
> I still think that we could just declare the function strict, if we
> use the variadic-any approach.  Passing a null in any position is
> indisputable caller error.  However, if you're allergic to silently
> doing nothing in such a case, we could have pg_set_attribute_stats
> check each argument and throw an error.  (Or warn and keep going;
> but according to the design principle I posited earlier, this'd be
> the sort of thing we don't need to tolerate.)
>

Any thoughts about going back to having a return value, a caller could then
see that the function returned NULL rather than whatever the expected value
was (example: TRUE)?


Re: Statistics Import and Export

2024-04-01 Thread Tom Lane
Corey Huinker  writes:
> So what's the behavior when the user fails to supply a parameter that is
> currently NOT NULL checked (example: avg_width)? Is that a WARN-and-exit?

I still think that we could just declare the function strict, if we
use the variadic-any approach.  Passing a null in any position is
indisputable caller error.  However, if you're allergic to silently
doing nothing in such a case, we could have pg_set_attribute_stats
check each argument and throw an error.  (Or warn and keep going;
but according to the design principle I posited earlier, this'd be
the sort of thing we don't need to tolerate.)

regards, tom lane




Re: Statistics Import and Export

2024-04-01 Thread Corey Huinker
>
> Ah, yeah, you could change the function to have more parameters,
> given the assumption that all calls will be named-parameter style.
> I still suggest that my proposal is more robust for the case where
> the dump lists parameters that the receiving system doesn't have.
>

So what's the behavior when the user fails to supply a parameter that is
currently NOT NULL checked (example: avg_width)? Is that a WARN-and-exit?


Re: Statistics Import and Export

2024-04-01 Thread Corey Huinker
>
> I think pg_upgrade could check for the existence of extended statistics
> in any database and adjust the analyze recommendation wording
> accordingly.
>

+1


Re: Statistics Import and Export

2024-04-01 Thread Corey Huinker
>
> That's not what I suggested at all.  The function parameters would
> be declared anyarray, but the values passed to them would be coerced
> to the correct concrete array types.  So as far as the coercion rules
> are concerned this'd be equivalent to the variadic-any approach.
>

+1



>
> > That's pretty persuasive. It also means that we need to trap for error in
> > the array_in() calls, as that function does not yet have a _safe() mode.
>
> Well, the approach I'm advocating for would have the array input and
> coercion done by the calling query before control ever reaches
> pg_set_attribute_stats, so that any incorrect-for-the-data-type values
> would result in hard errors.  I think that's okay for the same reason
> you probably figured you didn't have to trap array_in: it's the fault
> of the originating pg_dump if it offers a value that doesn't coerce to
> the datatype it claims the value is of.


+1


Re: Statistics Import and Export

2024-04-01 Thread Tom Lane
Jeff Davis  writes:
> On Sun, 2024-03-31 at 14:48 -0400, Tom Lane wrote:
>> What happens when
>> somebody adds a new stakind (and hence new pg_stats column)?

> Why would you need to overload in this case? Wouldn't we just define a
> new function with more optional named parameters?

Ah, yeah, you could change the function to have more parameters,
given the assumption that all calls will be named-parameter style.
I still suggest that my proposal is more robust for the case where
the dump lists parameters that the receiving system doesn't have.

regards, tom lane




Re: Statistics Import and Export

2024-04-01 Thread Tom Lane
Jeff Davis  writes:
> On Mon, 2024-04-01 at 13:11 -0400, Bruce Momjian wrote:
>> Reality check --- are we still targeting this feature for PG 17?

> I see a few useful pieces here:

> 1. Support import of statistics (i.e.
> pg_set_{relation|attribute}_stats()).

> 2. Support pg_dump of stats

> 3. Support pg_upgrade with stats

> It's possible that not all of them make it, but let's not dismiss the
> entire feature yet.

The unresolved questions largely have to do with the interactions
between these pieces.  I think we would seriously regret setting
any one of them in stone before all three are ready to go.

regards, tom lane




Re: Statistics Import and Export

2024-04-01 Thread Jeff Davis
On Sun, 2024-03-31 at 14:48 -0400, Tom Lane wrote:
> What happens when
> somebody adds a new stakind (and hence new pg_stats column)?
> You could try to add an overloaded pg_set_attribute_stats
> version with more parameters, but I'm pretty sure that would
> lead to "ambiguous function call" failures when trying to load
> old dump files containing only the original parameters.

Why would you need to overload in this case? Wouldn't we just define a
new function with more optional named parameters?

>   The
> present design is also fragile in that an unrecognized parameter
> will lead to a parse-time failure and no function call happening,
> which is less robust than I'd like.

I agree on this point; I found this annoying when testing the feature.

> So this leads me to suggest that we'd be best off with a VARIADIC
> ANY signature, where the variadic part consists of alternating
> parameter labels and values:

I didn't consider this and I think it has a lot of advantages. It's
slightly unfortunate that we can't make them explicitly name/value
pairs, but pg_dump can use whitespace or even SQL comments to make it
more readable.
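For concreteness, a sketch of what such a call might look like in a dump
(the leading fixed arguments and the exact label set are guesses; the stat
names are taken from elsewhere in this thread, and the comments illustrate
the readability trick mentioned above):

SELECT pg_catalog.pg_set_attribute_stats(
    'public.foo'::regclass, 'id'::name, false,
    /* label */ 'null_frac',  /* value */ 0.1::real,
    /* label */ 'avg_width',  /* value */ 4::integer,
    /* label */ 'n_distinct', /* value */ -1::real);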

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-04-01 Thread Bruce Momjian
On Sun, Mar 31, 2024 at 07:04:47PM -0400, Tom Lane wrote:
> Corey Huinker  writes:
> >> I can't quibble with that view of what has priority.  I'm just
> >> suggesting that redesigning what pg_upgrade does in this area
> >> should come later than doing something about extended stats.
> 
> > I mostly agree, with the caveat that pg_upgrade's existing message saying
> > that optimizer stats were not carried over wouldn't be 100% true anymore.
> 
> I think we can tweak the message wording.  I just don't want to be
> doing major redesign of the behavior, nor adding fundamentally new
> monitoring capabilities.

I think pg_upgrade could check for the existence of extended statistics
in any database and adjust the analyze recommendation wording
accordingly.
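A sketch of the kind of per-database check that could drive such wording
(whether pg_upgrade would query pg_statistic_ext directly is an assumption
here):

SELECT EXISTS (SELECT 1 FROM pg_catalog.pg_statistic_ext) AS has_extended_stats;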

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Statistics Import and Export

2024-04-01 Thread Jeff Davis
On Mon, 2024-04-01 at 13:11 -0400, Bruce Momjian wrote:
> Reality check --- are we still targeting this feature for PG 17?

I see a few useful pieces here:

1. Support import of statistics (i.e.
pg_set_{relation|attribute}_stats()).

2. Support pg_dump of stats

3. Support pg_upgrade with stats

It's possible that not all of them make it, but let's not dismiss the
entire feature yet.

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-04-01 Thread Tom Lane
Bruce Momjian  writes:
> Reality check --- are we still targeting this feature for PG 17?

I'm not sure.  I think if we put our heads down we could finish
the changes I'm suggesting and resolve the other issues this week.
However, it is starting to feel like the sort of large, barely-ready
patch that we often regret cramming in at the last minute.  Maybe
we should agree that the first v18 CF would be a better time to
commit it.

regards, tom lane




Re: Statistics Import and Export

2024-04-01 Thread Tom Lane
Jeff Davis  writes:
> On Sat, 2024-03-30 at 20:08 -0400, Tom Lane wrote:
>> I haven't looked at the details, but I'm really a bit surprised
>> by Jeff's assertion that CREATE INDEX destroys statistics on the
>> base table.  That seems wrong from here, and maybe something we
>> could have it not do.  (I do realize that it recalculates reltuples
>> and relpages, but so what?  If it updates those, the results should
>> be perfectly accurate.)

> In the v15 of the patch I was looking at, "pg_dump -s" included the
> statistics. The stats appeared first in the dump, followed by the
> CREATE INDEX commands. The latter overwrote the relpages/reltuples set
> by the former.

> While zeros are the right answers for a schema-only dump, it defeated
> the purpose of including relpages/reltuples stats in the dump, and
> caused the pg_upgrade TAP test to fail.

> You're right that there are a number of ways this could be resolved --
> I don't think it's an inherent problem.

I'm inclined to call it not a problem at all.  While I do agree there
are use-cases for injecting false statistics with these functions,
I do not think that pg_dump has to cater to such use-cases.

In any case, I remain of the opinion that stats are data and should
not be included in a -s dump (with some sort of exception for
pg_upgrade).  If the data has been loaded, then a subsequent
overwrite by CREATE INDEX should not be a problem.

regards, tom lane




Re: Statistics Import and Export

2024-04-01 Thread Bruce Momjian
Reality check --- are we still targeting this feature for PG 17?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Statistics Import and Export

2024-04-01 Thread Jeff Davis
On Sat, 2024-03-30 at 20:08 -0400, Tom Lane wrote:
> I haven't looked at the details, but I'm really a bit surprised
> by Jeff's assertion that CREATE INDEX destroys statistics on the
> base table.  That seems wrong from here, and maybe something we
> could have it not do.  (I do realize that it recalculates reltuples
> and relpages, but so what?  If it updates those, the results should
> be perfectly accurate.)

In the v15 of the patch I was looking at, "pg_dump -s" included the
statistics. The stats appeared first in the dump, followed by the
CREATE INDEX commands. The latter overwrote the relpages/reltuples set
by the former.

While zeros are the right answers for a schema-only dump, it defeated
the purpose of including relpages/reltuples stats in the dump, and
caused the pg_upgrade TAP test to fail.

You're right that there are a number of ways this could be resolved --
I don't think it's an inherent problem.

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-04-01 Thread Tom Lane
Corey Huinker  writes:
>> IIRC, "variadic any" requires having at least one variadic parameter.
>> But that seems fine --- what would be the point, or even the
>> semantics, of calling pg_set_attribute_stats with no data fields?

> If my pg_dump run emitted a bunch of stats that could never be imported,
> I'd want to know. With silent failures, I don't.

What do you think would be silent about that?  If there's a complaint
to be made, it's that it'd be a hard failure ("no such function").

To be clear, I'm ok with emitting ERROR for something that pg_dump
clearly did wrong, which in this case would be emitting a
set_statistics call for an attribute it had exactly no stats values
for.  What I think needs to be WARN is conditions that the originating
pg_dump couldn't have foreseen, for example cross-version differences.
If we do try to check things like sort order, that complaint obviously
has to be WARN, since it's checking something potentially different
from what was correct at the source server.

>> Perhaps we could
>> invent a new backend function that extracts the actual element type
>> of a non-null anyarray argument.

> A backend function that we can't guarantee exists on the source system. :(

[ shrug... ] If this doesn't work for source servers below v17, that
would be a little sad, but it wouldn't be the end of the world.
I see your point that that is an argument for finding another way,
though.

>> Another way we could get to no-coercions is to stick with your
>> signature but declare the relevant parameters as anyarray instead of
>> text.

> I'm a bit confused here. AFAIK we can't construct an anyarray in SQL:

> # select '{1,2,3}'::anyarray;
> ERROR:  cannot accept a value of type anyarray

That's not what I suggested at all.  The function parameters would
be declared anyarray, but the values passed to them would be coerced
to the correct concrete array types.  So as far as the coercion rules
are concerned this'd be equivalent to the variadic-any approach.

> That's pretty persuasive. It also means that we need to trap for error in
> the array_in() calls, as that function does not yet have a _safe() mode.

Well, the approach I'm advocating for would have the array input and
coercion done by the calling query before control ever reaches
pg_set_attribute_stats, so that any incorrect-for-the-data-type values
would result in hard errors.  I think that's okay for the same reason
you probably figured you didn't have to trap array_in: it's the fault
of the originating pg_dump if it offers a value that doesn't coerce to
the datatype it claims the value is of.  My formulation is a bit safer
though in that it's the originating pg_dump, not the receiving server,
that is in charge of saying which type that is.  (If that type doesn't
agree with what the receiving server thinks it should be, that's a
condition that pg_set_attribute_stats itself will detect, and then it
can WARN and move on to the next thing.)

regards, tom lane




Re: Statistics Import and Export

2024-04-01 Thread Ashutosh Bapat
Hi Corey,

Some more comments on v15.

+/*
+ * A more encapsulated version of can_modify_relation for when the the
+ * HeapTuple and Form_pg_class are not needed later.
+ */
+static void
+check_relation_permissions(Relation rel)

This function is used in exactly one place, so it usually wouldn't make much
sense to write a separate function. But given that the caller is so long,
this seems ok. If this function returned the cached tuple when the permission
checks succeed, it could be used in the other place as well. The caller would
then be responsible for releasing or updating the tuple.

The attached patch contains a test that invokes this function on a view. ANALYZE
throws a WARNING when a view is passed to it. Similarly, this function
should refuse to update the statistics on relations for which ANALYZE
throws a warning. A warning instead of an error seems fine.

+
+ const float4 min = 0.0;
+ const float4 max = 1.0;

When reading the validation condition, I have to look up the variable values.
Could that be avoided by using the values directly in the condition itself?
If there's some dependency elsewhere in the code, we can use macros. But I
have not seen constant variables used this way elsewhere in the code.

+ values[Anum_pg_statistic_starelid - 1] = ObjectIdGetDatum(relid);
+ values[Anum_pg_statistic_staattnum - 1] = Int16GetDatum(attnum);
+ values[Anum_pg_statistic_stainherit - 1] = PG_GETARG_DATUM(P_INHERITED);

For a partitioned table this value has to be true. For a normal table, when
setting this value to true, the function should at least make sure that the
table has at least one child; otherwise it should throw an error. Blindly
accepting the given value may render the statistics unusable. The prologue
of the function needs to be updated accordingly.

I have fixed a documentation error in the patch as well. Please incorporate
it in your next patchset.
-- 
Best Wishes,
Ashutosh Bapat
commit 353a4077d07cf097c5cb90c732b7dde2f16f04a6
Author: Ashutosh Bapat 
Date:   Mon Apr 1 13:04:40 2024 +0530

Review changes

Ashutosh Bapat

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 153d0dc6ac..6018e81486 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -29178,6 +29178,8 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
 be less than 0.


+   
+   

 
  pg_set_attribute_stats
diff --git a/src/test/regress/sql/stats_export_import.sql b/src/test/regress/sql/stats_export_import.sql
index 1a7d02a2c7..e3f42f85f0 100644
--- a/src/test/regress/sql/stats_export_import.sql
+++ b/src/test/regress/sql/stats_export_import.sql
@@ -15,6 +15,8 @@ CREATE TABLE stats_export_import.test(
 tags text[]
 );
 
+CREATE VIEW stats_export_import.test_view AS SELECT id, length(name), (comp).e FROM stats_export_import.test;
+
 SELECT relpages, reltuples, relallvisible FROM pg_class WHERE oid = 'stats_export_import.test'::regclass;
 
 SELECT pg_set_relation_stats('stats_export_import.test'::regclass, 999, 3.6::real, 15000);
@@ -26,6 +28,16 @@ SELECT pg_set_relation_stats('stats_export_import.test'::regclass, NULL, 3.6::re
 
 SELECT relpages, reltuples, relallvisible FROM pg_class WHERE oid = 'stats_export_import.test'::regclass;
 
+SELECT relpages, reltuples, relallvisible FROM pg_class WHERE oid = 'stats_export_import.test_view'::regclass;
+
+ANALYZE stats_export_import.test_view;
+
+SELECT relpages, reltuples, relallvisible FROM pg_class WHERE oid = 'stats_export_import.test_view'::regclass;
+
+SELECT pg_set_relation_stats('stats_export_import.test_view'::regclass, 999, 3.6::real, 15000);
+
+SELECT relpages, reltuples, relallvisible FROM pg_class WHERE oid = 'stats_export_import.test_view'::regclass;
+
 -- error: object doesn't exist
 SELECT pg_catalog.pg_set_attribute_stats(
 relation => '0'::oid,
@@ -62,6 +74,25 @@ SELECT pg_catalog.pg_set_attribute_stats(
 avg_width => 2::integer,
 n_distinct => 0.3::real);
 
+-- error: inherited true for a table which does not have any child
+SELECT pg_catalog.pg_set_attribute_stats(
+relation => 'stats_export_import.test'::regclass,
+attname => 'id'::name,
+inherited => true,
+null_frac => 0.1::real,
+avg_width => 2::integer,
+n_distinct => 0.3::real);
+
+CREATE TABLE stats_export_import.child() INHERITS(stats_export_import.test);
+
+ANALYZE VERBOSE stats_export_import.test;
+
+SELECT *
+FROM pg_stats
+WHERE schemaname = 'stats_export_import'
+AND tablename = 'test'
+AND attname = 'id';
+
 -- error: null_frac null
 SELECT pg_catalog.pg_set_attribute_stats(
 relation => 'stats_export_import.test'::regclass,


Re: Statistics Import and Export

2024-04-01 Thread Ashutosh Bapat
Hi Corey,


On Mon, Mar 25, 2024 at 3:38 PM Ashutosh Bapat 
wrote:

> Hi Corey,
>
>
> On Sat, Mar 23, 2024 at 7:21 AM Corey Huinker 
> wrote:
>
>> v12 attached.
>>
>> 0001 -
>>
>>
> Some random comments
>
> +SELECT
> +format('SELECT pg_catalog.pg_set_attribute_stats( '
> +|| 'relation => %L::regclass::oid, attname => %L::name, '
> +|| 'inherited => %L::boolean, null_frac => %L::real, '
> +|| 'avg_width => %L::integer, n_distinct => %L::real, '
> +|| 'most_common_vals => %L::text, '
> +|| 'most_common_freqs => %L::real[], '
> +|| 'histogram_bounds => %L::text, '
> +|| 'correlation => %L::real, '
> +|| 'most_common_elems => %L::text, '
> +|| 'most_common_elem_freqs => %L::real[], '
> +|| 'elem_count_histogram => %L::real[], '
> +|| 'range_length_histogram => %L::text, '
> +|| 'range_empty_frac => %L::real, '
> +|| 'range_bounds_histogram => %L::text) ',
> +'stats_export_import.' || s.tablename || '_clone', s.attname,
> +s.inherited, s.null_frac,
> +s.avg_width, s.n_distinct,
> +s.most_common_vals, s.most_common_freqs, s.histogram_bounds,
> +s.correlation, s.most_common_elems, s.most_common_elem_freqs,
> +s.elem_count_histogram, s.range_length_histogram,
> +s.range_empty_frac, s.range_bounds_histogram)
> +FROM pg_catalog.pg_stats AS s
> +WHERE s.schemaname = 'stats_export_import'
> +AND s.tablename IN ('test', 'is_odd')
> +\gexec
>
> Why do we need to construct the command and execute? Can we instead
> execute the function directly? That would also avoid ECHO magic.
>

Addressed in v15


>
> +   
> +Database Object Statistics Import Functions
> +
> + 
> +  
> +   
> +Function
> +   
> +   
> +Description
> +   
> +  
> + 
>
> COMMENT: The functions throw many validation errors. Do we want to list
> the acceptable/unacceptable input values in the documentation corresponding
> to those? I don't expect one line per argument validation. Something like
> "these, these and these arguments can not be NULL" or "both arguments in
> each of the pairs x and y, a and b, and c and d should be non-NULL or NULL
> respectively".
>

Addressed in v15.


> + /* Statistics are dependent on the definition, not the data */
> + /* Views don't have stats */
> + if ((tbinfo->dobj.dump & DUMP_COMPONENT_STATISTICS) &&
> + (tbinfo->relkind == RELKIND_VIEW))
> + dumpRelationStats(fout, &tbinfo->dobj, reltypename,
> +  tbinfo->dobj.dumpId);
> +
>
> Statistics are about data. Whenever pg_dump dumps some filtered data, the
> statistics collected for the whole table are useless. We should avoid
> dumping
> statistics in such a case. E.g. when only schema is dumped, what good is
> statistics? Similarly the statistics on a partitioned table may not be
> useful
> if some of its partitions are not dumped. That said, dumping statistics on
> foreign
> tables makes sense since they do not contain data but the statistics still
> make sense.
>

Dumping statistics without data is required for pg_upgrade. This is being
discussed in the same thread. But I don't see some of the suggestions, e.g.
using the binary-mode switch, applied in v15.

Also, should we handle sequences and composite types the same way? The latter
is probably not dumped, but just in case.


>
> Whether or not I pass --no-statistics, there is no difference in the dump
> output. Am I missing something?
> $ pg_dump -d postgres > /tmp/dump_no_arguments.out
> $ pg_dump -d postgres --no-statistics > /tmp/dump_no_statistics.out
> $ diff /tmp/dump_no_arguments.out /tmp/dump_no_statistics.out
> $
>
> IIUC, pg_dump includes statistics by default. That means all our pg_dump
> related tests will have statistics output by default. That's good since the
> functionality will always be tested. 1. We need additional tests to ensure
> that the statistics is installed after restore. 2. Some of those tests
> compare dumps before and after restore. In case the statistics is changed
> because of auto-analyze happening post-restore, these tests will fail.
>

Fixed.

Thanks for addressing those comments.

-- 
Best Wishes,
Ashutosh Bapat


Re: Statistics Import and Export

2024-03-31 Thread Corey Huinker
>
> IIRC, "variadic any" requires having at least one variadic parameter.
> But that seems fine --- what would be the point, or even the
> semantics, of calling pg_set_attribute_stats with no data fields?
>

If my pg_dump run emitted a bunch of stats that could never be imported,
I'd want to know. With silent failures, I don't.



> Perhaps we could
> invent a new backend function that extracts the actual element type
> of a non-null anyarray argument.
>

A backend function that we can't guarantee exists on the source system. :(


> Another way we could get to no-coercions is to stick with your
> signature but declare the relevant parameters as anyarray instead of
> text.  I still think though that we'd be better off to leave the
> parameter matching to runtime, so that we-don't-recognize-that-field
> can be a warning not an error.
>

I'm a bit confused here. AFAIK we can't construct an anyarray in SQL:

# select '{1,2,3}'::anyarray;
ERROR:  cannot accept a value of type anyarray


> I think you missed my point: you're doing that inefficiently,
> and maybe even with race conditions.  Use the relcache's copy
> of the pg_class row.
>

Roger Wilco.


> Well, I'm here to debate it if you want, but I'll just note that *one*
> error will be enough to abort a pg_upgrade entirely, and most users
> these days get scared by errors during manual dump/restore too.  So we
> had better not be throwing errors except for cases that we don't think
> pg_dump could ever emit.
>

That's pretty persuasive. It also means that we need to trap for error in
the array_in() calls, as that function does not yet have a _safe() mode.


Re: Statistics Import and Export

2024-03-31 Thread Tom Lane
Corey Huinker  writes:
>> and I really think that we need to provide
>> the source server's major version number --- maybe we will never
>> need that, but if we do and we don't have it we will be sad.

> The JSON had it, and I never did use it. Not against having it again.

Well, you don't need it now seeing that the definition of pg_stats
columns hasn't changed in the past ... but there's no guarantee we
won't want to change them in the future.

>> So this leads me to suggest that we'd be best off with a VARIADIC
>> ANY signature, where the variadic part consists of alternating
>> parameter labels and values:
>> pg_set_attribute_stats(table regclass, attribute name,
>>inherited bool, source_version int,
>>variadic "any") returns void

> I'm not aware of how strict works with variadics. Would the lack of any
> variadic parameters trigger it?

IIRC, "variadic any" requires having at least one variadic parameter.
But that seems fine --- what would be the point, or even the
semantics, of calling pg_set_attribute_stats with no data fields?

> Also going with strict means that an inadvertent explicit NULL in one
> parameter would cause the entire attribute import to fail silently. I'd
> rather fail loudly.

Not really convinced that that is worth any trouble...

> * We can require the calling statement to cast arguments, particularly
>> arrays, to the proper type, removing the need for conversions within
>> the stats-setting function.  (But instead, it'd need to check that the
>> next "any" argument is the type it ought to be based on the type of
>> the target column.)

> So, that's tricky. The type of the values is not always the attribute type,

Hmm.  You would need to have enough smarts in pg_set_attribute_stats
to identify the appropriate array type in any case: as coded, it needs
that for coercion, whereas what I'm suggesting would only require it
for checking, but either way you need it.  I do concede that pg_dump
(or other logic generating the calls) needs to know more under my
proposal than before.  I had been thinking that it would not need to
hard-code that because it could look to see what the actual type is
of the array it's dumping.  However, I see that pg_typeof() doesn't
work for that because it just returns anyarray.  Perhaps we could
invent a new backend function that extracts the actual element type
of a non-null anyarray argument.
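
Purely to illustrate the idea (the helper named below does not exist; it is
hypothetical):

    -- hypothetical function: report the concrete element type of an anyarray
    SELECT pg_anyarray_elemtype(histogram_bounds)
    FROM pg_stats
    WHERE schemaname = 'public'
      AND tablename = 'mytable'
      AND attname = 'id';

pg_dump could then emit the matching cast itself, instead of relying on
pg_typeof(), which only reports anyarray for these columns.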

Another way we could get to no-coercions is to stick with your
signature but declare the relevant parameters as anyarray instead of
text.  I still think though that we'd be better off to leave the
parameter matching to runtime, so that we-don't-recognize-that-field
can be a warning not an error.

>> * why is check_relation_permissions looking up the pg_class row?
>> There's already a copy of that in the Relation struct.

> To prove that the caller is the owner (or better) of the table.

I think you missed my point: you're doing that inefficiently,
and maybe even with race conditions.  Use the relcache's copy
of the pg_class row.

>> * I'm dubious that we can fully vet the contents of these arrays,
>> and even a little dubious that we need to try.

> A lot of the feedback I got on this patch over the months concerned giving
> inaccurate, nonsensical, or malicious data to the planner. Surely the
> planner does do *some* defensive programming when fetching these values,
> but this is the first time those values were potentially set by a user, not
> by our own internal code. We can try to match types, collations, etc from
> source to dest, but even that would fall victim to another glibc-level
> collation change.

That sort of concern is exactly why I think the planner has to, and
does, defend itself.  Even if you fully vet the data at the instant
of loading, we might have the collation change under us later.

It could be argued that feeding bogus data to the planner for testing
purposes is a valid use-case for this feature.  (Of course, as
superuser we could inject bogus data into pg_statistic manually,
so it's not necessary to have this feature for that purpose.)
I guess I'm a great deal more sanguine than other people about the
planner's ability to tolerate inconsistent data; but in any case
I don't have a lot of faith in relying on checks in
pg_set_attribute_stats to substitute for that ability.  That idea
mainly leads to having a whole lot of code that has to be kept in
sync with other code that's far away from it and probably isn't
coded in a parallel fashion either.

>> * There's a lot of ERROR cases that maybe we ought to downgrade
>> to WARN-and-press-on, in the service of not breaking the restore
>> completely in case of trouble.

> All cases were made error precisely to spark debate about which cases we'd
> want to continue from and which we'd want to error from.

Well, I'm here to debate it if you want, but I'll just note that *one*
error will be enough to abort a pg_upgrade entirely, and most users
these days get scared by errors during manual dump/restore too.  So we
had better not be throwing errors except for cases that we don't think
pg_dump could ever emit.

Re: Statistics Import and Export

2024-03-31 Thread Corey Huinker
>
>
> I concur with the plan of extracting data from pg_stats not
> pg_statistics, and with emitting a single "set statistics"
> call per attribute.  (I think at one point I'd suggested a call
> per stakind slot, but that would lead to a bunch of UPDATEs on
> existing pg_attribute tuples and hence a bunch of dead tuples
> at the end of an import, so it's not the way to go.  A series
> of UPDATEs would likely also play poorly with a background
> auto-ANALYZE happening concurrently.)
>

That was my reasoning as well.



> I do not like the current design for pg_set_attribute_stats' API
> though: I don't think it's at all future-proof.  What happens when
> somebody adds a new stakind (and hence new pg_stats column)?
> You could try to add an overloaded pg_set_attribute_stats
> version with more parameters, but I'm pretty sure that would
> lead to "ambiguous function call" failures when trying to load
> old dump files containing only the original parameters.


I don't think we'd overload, we'd just add new parameters to the function
signature.


>   The
> present design is also fragile in that an unrecognized parameter
> will lead to a parse-time failure and no function call happening,
> which is less robust than I'd like.


There was a lot of back-and-forth about what sorts of failures were
error-worthy, and which were warn-worthy. I'll discuss further below.


>   As lesser points,
> the relation argument ought to be declared regclass not oid for
> convenience of use,


+1


> and I really think that we need to provide
> the source server's major version number --- maybe we will never
> need that, but if we do and we don't have it we will be sad.
>

The JSON had it, and I never did use it. Not against having it again.


>
> So this leads me to suggest that we'd be best off with a VARIADIC
> ANY signature, where the variadic part consists of alternating
> parameter labels and values:
>
> pg_set_attribute_stats(table regclass, attribute name,
>inherited bool, source_version int,
>variadic "any") returns void
>
> where a call might look like
>
> SELECT pg_set_attribute_stats('public.mytable', 'mycolumn',
>   false, -- not inherited
>   16, -- source server major version
>   -- pairs of labels and values follow
>   'null_frac', 0.4,
>   'avg_width', 42,
>   'histogram_bounds',
>   array['a', 'b', 'c']::text[],
>   ...);
>
> Note a couple of useful things here:
>
> * AFAICS we could label the function strict and remove all those ad-hoc
> null checks.  If you don't have a value for a particular stat, you
> just leave that pair of arguments out (exactly as the existing code
> in 0002 does, just using a different notation).  This also means that
> we don't need any default arguments and so no need for hackery in
> system_functions.sql.
>

I'm not aware of how strict works with variadics. Would the lack of any
variadic parameters trigger it?

Also going with strict means that an inadvertent explicit NULL in one
parameter would cause the entire attribute import to fail silently. I'd
rather fail loudly.



> * If we don't recognize a parameter label at runtime, we can treat
> that as a warning rather than a hard error, and press on.  This case
> would mostly be useful in major version downgrades I suppose, but
> that will be something people will want eventually.
>

Interesting.

* We can require the calling statement to cast arguments, particularly
> arrays, to the proper type, removing the need for conversions within
> the stats-setting function.  (But instead, it'd need to check that the
> next "any" argument is the type it ought to be based on the type of
> the target column.)
>

So, that's tricky. The type of the values is not always the attribute type.
For expression indexes, we call exprType() and exprCollation(), in which
case we always use the expression type over the attribute type, but only
use the expression's collation if the attribute has no collation. This
mimics the behavior of ANALYZE.

Then, for the MCELEM and DECHIST stakinds we have to find the type's
element type, and that has special logic for tsvectors, ranges, and other
non-scalars, borrowing from the various *_typanalyze() functions. For that
matter, the existing typanalyze functions don't grab the < operator, which
I need for later data validations, so using examine_attribute() was
simultaneously overkill and insufficient.

None of this functionality is accessible from a client program, so we'd
have to pull in a lot of backend stuff to pg_dump to make it resolve the
typecasts correctly. Text and array_in() was just easier.


> pg_set_relation_stats is simpler in that the set of stats values
> to be set will probably remain fairly static, and there seems little
> reason to allow only part of them to be supplied (so personally I'd drop
> the business about accepting nulls there too).

Re: Statistics Import and Export

2024-03-31 Thread Tom Lane
Corey Huinker  writes:
>> I can't quibble with that view of what has priority.  I'm just
>> suggesting that redesigning what pg_upgrade does in this area
>> should come later than doing something about extended stats.

> I mostly agree, with the caveat that pg_upgrade's existing message saying
> that optimizer stats were not carried over wouldn't be 100% true anymore.

I think we can tweak the message wording.  I just don't want to be
doing major redesign of the behavior, nor adding fundamentally new
monitoring capabilities.

regards, tom lane




Re: Statistics Import and Export

2024-03-31 Thread Corey Huinker
>
> I wonder if the right answer to that is "let's enhance the I/O
> functions for those types".  But whether that helps or not, it's
> v18-or-later material for sure.
>

That was Stephen's take as well, and I agreed given that I had to throw the
kitchen-sink of source-side oid mappings (attname, types, collations,
operators) into the JSON to work around the limitation.


> I can't quibble with that view of what has priority.  I'm just
> suggesting that redesigning what pg_upgrade does in this area
> should come later than doing something about extended stats.
>

I mostly agree, with the caveat that pg_upgrade's existing message saying
that optimizer stats were not carried over wouldn't be 100% true anymore.


Re: Statistics Import and Export

2024-03-31 Thread Tom Lane
Corey Huinker  writes:
> On Sun, Mar 31, 2024 at 2:41 PM Tom Lane  wrote:
>> There's a bigger issue though: AFAICS this patch set does nothing
>> about dumping extended statistics.  I surely don't want to hold up
>> the patch insisting that that has to happen before we can commit the
>> functionality proposed here.  But we cannot rip out pg_upgrade's
>> support for post-upgrade ANALYZE processing before we do something
>> about extended statistics, and that means it's premature to be
>> designing any changes to how that works.  So I'd set that whole
>> topic on the back burner.

> So Extended Stats _were_ supported by earlier versions where the medium of
> communication was JSON. However, there were several problems with adapting
> that to the current model where we match params to stat types:

> * Several of the column types do not have functional input functions, so we
> must construct the data structure internally and pass them to
> statext_store().
> * The output functions for some of those column types have lists of
> attnums, with negative values representing positional expressions in the
> stat definition. This information is not translatable to another system
> without also passing along the attnum/attname mapping of the source system.

I wonder if the right answer to that is "let's enhance the I/O
functions for those types".  But whether that helps or not, it's
v18-or-later material for sure.

> At least three people told me "nobody uses extended stats" and to just drop
> that from the initial version.

I can't quibble with that view of what has priority.  I'm just
suggesting that redesigning what pg_upgrade does in this area
should come later than doing something about extended stats.

regards, tom lane




Re: Statistics Import and Export

2024-03-31 Thread Corey Huinker
On Sun, Mar 31, 2024 at 2:41 PM Tom Lane  wrote:

> Corey Huinker  writes:
> > Having given this some thought, I'd be inclined to create a view,
> > pg_stats_missing, with the same security barrier as pg_stats, but looking
> > for tables that lack stats on at least one column, or lack stats on an
> > extended statistics object.
>
> The week before feature freeze is no time to be designing something
> like that, unless you've abandoned all hope of getting this into v17.
>

It was a response to the suggestion that there be some way for
tools/automation to read the status of stats. I would view it as a separate
patch, as such a view would be useful now for knowing which tables to
ANALYZE, regardless of whether this patch goes in or not.


> There's a bigger issue though: AFAICS this patch set does nothing
> about dumping extended statistics.  I surely don't want to hold up
> the patch insisting that that has to happen before we can commit the
> functionality proposed here.  But we cannot rip out pg_upgrade's
> support for post-upgrade ANALYZE processing before we do something
> about extended statistics, and that means it's premature to be
> designing any changes to how that works.  So I'd set that whole
> topic on the back burner.
>

So Extended Stats _were_ supported by earlier versions where the medium of
communication was JSON. However, there were several problems with adapting
that to the current model where we match params to stat types:

* Several of the column types do not have functional input functions, so we
must construct the data structure internally and pass them to
statext_store().
* The output functions for some of those column types have lists of
attnums, with negative values representing positional expressions in the
stat definition. This information is not translatable to another system
without also passing along the attnum/attname mapping of the source system.

At least three people told me "nobody uses extended stats" and to just drop
that from the initial version. Unhappy with this assessment, I inquired as
to whether my employer (AWS) had some internal databases that used extended
stats so that I could get good test data, and came up with nothing, nor did
anyone know of customers who used the feature. So when the fourth person
told me that nobody uses extended stats, and not to let a rarely-used
feature get in the way of a feature that would benefit nearly 100% of
users, I dropped it.


> It's possible that we could drop the analyze-in-stages recommendation,
> figuring that this functionality will get people to the
> able-to-limp-along level immediately and that all that is needed is a
> single mop-up ANALYZE pass.  But I think we should leave that till we
> have a bit more than zero field experience with this feature.


It may be that we leave the recommendation exactly as it is.

Perhaps we enhance the error messages in pg_set_*_stats() to indicate what
command would remediate the issue.


Re: Statistics Import and Export

2024-03-31 Thread Tom Lane
My apologies for having paid so little attention to this thread for
months.  I got around to reading the v15 patches today, and while
I think they're going in more or less the right direction, there's
a long way to go IMO.

I concur with the plan of extracting data from pg_stats not
pg_statistics, and with emitting a single "set statistics"
call per attribute.  (I think at one point I'd suggested a call
per stakind slot, but that would lead to a bunch of UPDATEs on
existing pg_attribute tuples and hence a bunch of dead tuples
at the end of an import, so it's not the way to go.  A series
of UPDATEs would likely also play poorly with a background
auto-ANALYZE happening concurrently.)

I do not like the current design for pg_set_attribute_stats' API
though: I don't think it's at all future-proof.  What happens when
somebody adds a new stakind (and hence new pg_stats column)?
You could try to add an overloaded pg_set_attribute_stats
version with more parameters, but I'm pretty sure that would
lead to "ambiguous function call" failures when trying to load
old dump files containing only the original parameters.  The
present design is also fragile in that an unrecognized parameter
will lead to a parse-time failure and no function call happening,
which is less robust than I'd like.  As lesser points,
the relation argument ought to be declared regclass not oid for
convenience of use, and I really think that we need to provide
the source server's major version number --- maybe we will never
need that, but if we do and we don't have it we will be sad.

So this leads me to suggest that we'd be best off with a VARIADIC
ANY signature, where the variadic part consists of alternating
parameter labels and values:

pg_set_attribute_stats(table regclass, attribute name,
   inherited bool, source_version int,
   variadic "any") returns void

where a call might look like

SELECT pg_set_attribute_stats('public.mytable', 'mycolumn',
  false, -- not inherited
  16, -- source server major version
  -- pairs of labels and values follow
  'null_frac', 0.4,
  'avg_width', 42,
  'histogram_bounds',
  array['a', 'b', 'c']::text[],
  ...);

Note a couple of useful things here:

* AFAICS we could label the function strict and remove all those ad-hoc
null checks.  If you don't have a value for a particular stat, you
just leave that pair of arguments out (exactly as the existing code
in 0002 does, just using a different notation).  This also means that
we don't need any default arguments and so no need for hackery in
system_functions.sql.

* If we don't recognize a parameter label at runtime, we can treat
that as a warning rather than a hard error, and press on.  This case
would mostly be useful in major version downgrades I suppose, but
that will be something people will want eventually.

* We can require the calling statement to cast arguments, particularly
arrays, to the proper type, removing the need for conversions within
the stats-setting function.  (But instead, it'd need to check that the
next "any" argument is the type it ought to be based on the type of
the target column.)

If we write the labels as undecorated string literals as I show
above, I think they will arrive at the function as "unknown"-type
constants, which is a little weird but doesn't seem like it's
really a big problem.  The alternative is to cast them all to text
explicitly, but that's adding notation to no great benefit IMO.

pg_set_relation_stats is simpler in that the set of stats values
to be set will probably remain fairly static, and there seems little
reason to allow only part of them to be supplied (so personally I'd
drop the business about accepting nulls there too).  If we do grow
another value or values for it to set there shouldn't be much problem
with overloading it with another version with more arguments.
Still needs to take regclass not oid though ...

I've not read the patches in great detail, but I did note a
few low-level issues:

* why is check_relation_permissions looking up the pg_class row?
There's already a copy of that in the Relation struct.  Likewise
for the other caller of can_modify_relation (but why is that
caller not using check_relation_permissions?)  That all looks
overly complicated and duplicative.  I think you don't need two
layers of function there.

* I find the stuff with enums and "const char *param_names" to
be way too cute and unlike anything we do elsewhere.  Please
don't invent your own notations for coding patterns that have
hundreds of existing instances.  pg_set_relation_stats, for
example, has absolutely no reason not to look like the usual

Oid relid = PG_GETARG_OID(0);
float4  relpages = PG_GETARG_FLOAT4(1);
... etc ...

* The 

Re: Statistics Import and Export

2024-03-31 Thread Tom Lane
Corey Huinker  writes:
> Having given this some thought, I'd be inclined to create a view,
> pg_stats_missing, with the same security barrier as pg_stats, but looking
> for tables that lack stats on at least one column, or lack stats on an
> extended statistics object.

The week before feature freeze is no time to be designing something
like that, unless you've abandoned all hope of getting this into v17.

There's a bigger issue though: AFAICS this patch set does nothing
about dumping extended statistics.  I surely don't want to hold up
the patch insisting that that has to happen before we can commit the
functionality proposed here.  But we cannot rip out pg_upgrade's
support for post-upgrade ANALYZE processing before we do something
about extended statistics, and that means it's premature to be
designing any changes to how that works.  So I'd set that whole
topic on the back burner.

It's possible that we could drop the analyze-in-stages recommendation,
figuring that this functionality will get people to the
able-to-limp-along level immediately and that all that is needed is a
single mop-up ANALYZE pass.  But I think we should leave that till we
have a bit more than zero field experience with this feature.

regards, tom lane




Re: Statistics Import and Export

2024-03-31 Thread Corey Huinker
>
> That will make it *really* hard for any form of automation or drivers of
> this. The information needs to go somewhere where such tools can easily
> consume it, and an informational message during runtime (which is also
> likely to be translated in many environments) is the exact opposite of that.
>

Having given this some thought, I'd be inclined to create a view,
pg_stats_missing, with the same security barrier as pg_stats, but looking
for tables that lack stats on at least one column, or lack stats on an
extended statistics object.

Table structure would be

schemaname name
tablename name
attnames text[]
ext_stats text[]
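
A rough sketch of what such a view could look like (plain column statistics
only; the security barrier and the ext_stats column are left out, and none
of this is part of the current patch set):

    CREATE VIEW pg_stats_missing AS
    SELECT n.nspname AS schemaname,
           c.relname AS tablename,
           array_agg(a.attname ORDER BY a.attnum) AS attnames
      FROM pg_class c
      JOIN pg_namespace n ON n.oid = c.relnamespace
      JOIN pg_attribute a ON a.attrelid = c.oid
                         AND a.attnum > 0
                         AND NOT a.attisdropped
      LEFT JOIN pg_statistic s ON s.starelid = c.oid
                              AND s.staattnum = a.attnum
     WHERE c.relkind IN ('r', 'm')
       AND s.starelid IS NULL
     GROUP BY n.nspname, c.relname;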


The informational message, if it changes at all, could reference this new
view as the place to learn about how well the stats import went.

vacuumdb might get a --missing-only option in addition to
--analyze-in-stages.

For that matter, we could add --analyze-missing options to pg_restore and
pg_upgrade to do the mopping up themselves.


Re: Statistics Import and Export

2024-03-30 Thread Tom Lane
Corey Huinker  writes:
>> I didn't have any specific proposal in mind, was just trying to think
>> outside the box.

> What if we added a separate section SECTION_STATISTICS which is run
> following post-data?

Maybe, but that would have a lot of side-effects on pg_dump's API
and probably on some end-user scripts.  I'd rather not.

I haven't looked at the details, but I'm really a bit surprised
by Jeff's assertion that CREATE INDEX destroys statistics on the
base table.  That seems wrong from here, and maybe something we
could have it not do.  (I do realize that it recalculates reltuples
and relpages, but so what?  If it updates those, the results should
be perfectly accurate.)

regards, tom lane




Re: Statistics Import and Export

2024-03-30 Thread Corey Huinker
>
> I didn't have any specific proposal in mind, was just trying to think
> outside the box.
>

What if we added a separate section SECTION_STATISTICS which is run
following post-data?


Re: Statistics Import and Export

2024-03-30 Thread Corey Huinker
>
> I'm getting late into this discussion and I apologize if I've missed this
> being discussed before. But.
>
> Please don't.
>
> That will make it *really* hard for any form of automation or drivers of
> this. The information needs to go somewhere where such tools can easily
> consume it, and an informational message during runtime (which is also
> likely to be translated in many environments) is the exact opposite of that.
>

That makes a lot of sense. I'm not sure what form it would take (file,
pseudo-table, something else?). Open to suggestions.


Re: Statistics Import and Export

2024-03-30 Thread Jeff Davis
On Sat, 2024-03-30 at 13:39 -0400, Tom Lane wrote:
> (You could also imagine an explicit positive --stats switch that
> would
> override --schema-only, but I don't see that it's worth the trouble.)

That would have its own utility for reproducing planner problems
outside of production systems.

(That could be a separate feature, though, and doesn't need to be a
part of this patch set.)

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-03-30 Thread Tom Lane
Jeff Davis  writes:
> On Sat, 2024-03-30 at 13:18 -0400, Tom Lane wrote:
>> Surely they are data, not schema.  It would make zero sense to
>> restore them if you aren't restoring the data they describe.

> The complexity is that pg_upgrade does create the data, but relies on a
> schema-only dump. So we'd need to at least account for that somehow,
> either with a separate stats-only dump, or make a special case in
> binary upgrade mode that dumps schema+stats (and resolves the CREATE
> INDEX issue).

Ah, good point.  But binary-upgrade mode is special in tons of ways
already.  I don't see a big problem with allowing it to dump stats
even though --schema-only would normally imply not doing that.

(You could also imagine an explicit positive --stats switch that would
override --schema-only, but I don't see that it's worth the trouble.)

>> Maybe we need to revisit CREATE INDEX's behavior rather
>> than assuming it's graven in stone?

> Would there be a significant cost to just not doing that? Or are you
> suggesting that we special-case the behavior, or turn it off during
> restore with a GUC?

I didn't have any specific proposal in mind, was just trying to think
outside the box.

regards, tom lane




Re: Statistics Import and Export

2024-03-30 Thread Jeff Davis
On Sat, 2024-03-30 at 13:18 -0400, Tom Lane wrote:
> Surely they are data, not schema.  It would make zero sense to
> restore
> them if you aren't restoring the data they describe.

The complexity is that pg_upgrade does create the data, but relies on a
schema-only dump. So we'd need to at least account for that somehow,
either with a separate stats-only dump, or make a special case in
binary upgrade mode that dumps schema+stats (and resolves the CREATE
INDEX issue).

> Maybe we need to revisit CREATE INDEX's behavior rather
> than assuming it's graven in stone?

Would there be a significant cost to just not doing that? Or are you
suggesting that we special-case the behavior, or turn it off during
restore with a GUC?

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-03-30 Thread Tom Lane
Jeff Davis  writes:
> This re-raises the question of whether stats are part of a schema-only
> dump or not. Did we settle conclusively that they are?

Surely they are data, not schema.  It would make zero sense to restore
them if you aren't restoring the data they describe.

Hence, it'll be a bit messy if we can't put them in the dump's DATA
section.  Maybe we need to revisit CREATE INDEX's behavior rather
than assuming it's graven in stone?

regards, tom lane




Re: Statistics Import and Export

2024-03-30 Thread Jeff Davis
On Sat, 2024-03-30 at 01:34 -0400, Corey Huinker wrote:
> 
> - 002_pg_upgrade.pl now dumps before/after databases with --no-
> statistics. I tried to find out why some tables were getting their
> relstats either not set, or set and reset, never affecting the
> attribute stats. I even tried turning autovacuum off for both
> instances, but nothing seemed to change the fact that the same tables
> were having their relstats reset.

I think I found out why this is happening: a schema-only dump first
creates the table, then sets the relstats, then creates indexes. The
index creation updates the relstats, but because the dump was schema-
only, it overwrites the relstats with zeros.

That exposes an interesting dependency, which is that relstats must be
set after index creation, otherwise they will be lost -- at least in
the case of pg_upgrade.
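
A quick way to reproduce the ordering effect by hand (this assumes the
patch's pg_set_relation_stats is installed; the table itself is arbitrary):

    CREATE TABLE t (a int);
    SELECT pg_set_relation_stats('t'::regclass, 999, 3.6::real, 15000);
    SELECT relpages, reltuples FROM pg_class WHERE oid = 't'::regclass;
    -- the index build recomputes relpages/reltuples for the (empty) heap,
    -- clobbering the values that were just set
    CREATE INDEX t_a_idx ON t (a);
    SELECT relpages, reltuples FROM pg_class WHERE oid = 't'::regclass;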

This re-raises the question of whether stats are part of a schema-only
dump or not. Did we settle conclusively that they are?

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-03-30 Thread Magnus Hagander
On Sat, Mar 30, 2024 at 1:26 AM Corey Huinker 
wrote:

>
>
> On Fri, Mar 29, 2024 at 7:34 PM Jeff Davis  wrote:
>
>> On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
>> > I’d certainly think “with stats” would be the preferred default of
>> > our users.
>>
>> I'm concerned there could still be paths that lead to an error. For
>> pg_restore, or when loading a SQL file, a single error isn't fatal
>> (unless -e is specified), but it still could be somewhat scary to see
>> errors during a reload.
>>
>
> To that end, I'm going to be modifying the "Optimizer statistics are not
> transferred by pg_upgrade..." message when stats _were_ transferred,
> with additional instructions that the user should treat any stats-ish
> error messages encountered as a reason to manually analyze that table. We
> should probably say something about extended stats as well.
>
>

I'm getting late into this discussion and I apologize if I've missed this
being discussed before. But.

Please don't.

That will make it *really* hard for any form of automation or drivers of
this. The information needs to go somewhere where such tools can easily
consume it, and an informational message during runtime (which is also
likely to be translated in many environments) is the exact opposite of that.

Surely we can come up with something better. Otherwise, I think all those
tools are just going to have to end up assuming that it always failed and
proceed based on that, and that would be a shame.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Statistics Import and Export

2024-03-29 Thread Jeff Davis
On Fri, 2024-03-29 at 20:54 -0400, Stephen Frost wrote:
> What’s different, given the above arguments, in making the change
> with 18 instead of now?

Acknowledged. You, Tom, and Corey (and perhaps everyone else) seem to
be aligned here, so that's consensus enough for me. Default is with
stats, --no-statistics to disable them.

> Independently, I had a thought around doing an analyze as the data is
> being loaded ..

Right, I think there are some interesting things to pursue here. I also
had an idea to use logical decoding to get a streaming sample, which
would be better randomness than block sampling. At this point that's
just an idea, I haven't looked into it seriously.

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-03-29 Thread Stephen Frost
Greetings,

On Fri, Mar 29, 2024 at 19:35 Jeff Davis  wrote:

> On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> > I’d certainly think “with stats” would be the preferred default of
> > our users.
>
> I'm concerned there could still be paths that lead to an error. For
> pg_restore, or when loading a SQL file, a single error isn't fatal
> (unless -e is specified), but it still could be somewhat scary to see
> errors during a reload.


I understand that point.

Also, it's new behavior, so it may cause some minor surprises, or there
> might be minor interactions to work out. For instance, dumping stats
> doesn't make a lot of sense if pg_upgrade (or something else) is just
> going to run analyze anyway.


But we don’t expect anything to run analyze … do we?  So I’m not sure why
it makes sense to raise this as a concern.

What do you think about starting off with it as non-default, and then
> switching it to default in 18?


What’s different, given the above arguments, in making the change with 18
instead of now?  I also suspect that if we say “we will change the default
later” … that later won’t ever come and we will end up making our users
always have to remember to say “with-stats” instead.

The stats are important which is why the effort is being made in the first
place. If just doing an analyze after loading the data was good enough then
this wouldn’t be getting worked on.

Independently, I had a thought around doing an analyze as the data is being
loaded .. but we can’t do that for indexes (but we could perhaps analyze
the indexed values as we build the index..).  This works when we do a
truncate or create the table in the same transaction, so we would tie into
some of the existing logic that we have around that.  Would also adjust
COPY to accept an option that specifies the anticipated number of rows
being loaded (which we can figure out during the dump phase reasonably..).
Perhaps this would lead to a pg_dump option to do the data load as a
transaction with a truncate before the copy (point here being to be able to
still do parallel load while getting the benefits from knowing that we are
completely reloading the table). Just some other thoughts- which I don’t
intend to take away from the current effort at all, which I see as valuable
and should be enabled by default.

Thanks!

Stephen

>


Re: Statistics Import and Export

2024-03-29 Thread Corey Huinker
>
> (I've not read the patch yet, but I assume the switch works like
> other pg_dump filters in that you can apply it on the restore
> side?)
>

Correct. It follows the existing --no-something pattern.


Re: Statistics Import and Export

2024-03-29 Thread Corey Huinker
On Fri, Mar 29, 2024 at 7:34 PM Jeff Davis  wrote:

> On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> > I’d certainly think “with stats” would be the preferred default of
> > our users.
>
> I'm concerned there could still be paths that lead to an error. For
> pg_restore, or when loading a SQL file, a single error isn't fatal
> (unless -e is specified), but it still could be somewhat scary to see
> errors during a reload.
>

To that end, I'm going to be modifying the "Optimizer statistics are not
transferred by pg_upgrade..." message when stats _were_ transferred,
with additional instructions that the user should treat any stats-ish
error messages encountered as a reason to manually analyze that table. We
should probably say something about extended stats as well.


Re: Statistics Import and Export

2024-03-29 Thread Tom Lane
Jeff Davis  writes:
> On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
>> I’d certainly think “with stats” would be the preferred default of
>> our users.

> What do you think about starting off with it as non-default, and then
> switching it to default in 18?

I'm with Stephen: I find it very hard to imagine that there's any
users who wouldn't want this as default.  If we do what you suggest,
then there will be three historical behaviors to cope with not two.
That doesn't sound like it'll make anyone's life better.

As for the "it might break" argument, that could be leveled against
any nontrivial patch.  You're at least offering an opt-out switch,
which is something we more often don't do.

(I've not read the patch yet, but I assume the switch works like
other pg_dump filters in that you can apply it on the restore
side?)

regards, tom lane




Re: Statistics Import and Export

2024-03-29 Thread Jeff Davis
On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> I’d certainly think “with stats” would be the preferred default of
> our users.

I'm concerned there could still be paths that lead to an error. For
pg_restore, or when loading a SQL file, a single error isn't fatal
(unless -e is specified), but it still could be somewhat scary to see
errors during a reload.

Also, it's new behavior, so it may cause some minor surprises, or there
might be minor interactions to work out. For instance, dumping stats
doesn't make a lot of sense if pg_upgrade (or something else) is just
going to run analyze anyway.

What do you think about starting off with it as non-default, and then
switching it to default in 18?

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-03-29 Thread Corey Huinker
>
>
> There's still a failure in the pg_upgrade TAP test. One seems to be
> ordering, so perhaps we need to ORDER BY the attribute number. Others
> seem to be missing relstats and I'm not sure why yet. I suggest doing
> some manual pg_upgrade tests and comparing the before/after dumps to
> see if you can reproduce a smaller version of the problem.
>

That's fixed in my current working version, as is a tsvector-specific
issue. Working on the TAP issue.


Re: Statistics Import and Export

2024-03-29 Thread Stephen Frost
Greetings,

On Fri, Mar 29, 2024 at 11:05 Jeff Davis  wrote:

> On Fri, 2024-03-29 at 05:32 -0400, Corey Huinker wrote:
> > 0002:
> > - All relstats and attrstats calls are now their own statement
> > instead of a compound statement
> > - moved the archive TOC entry from post-data back to SECTION_NONE (as
> > it was modeled on object COMMENTs), which seems to work better.
> > - remove meta-query in favor of more conventional query building
> > - removed all changes to fe_utils/
>
> Can we get a consensus on whether the default should be with stats or
> without? That seems like the most important thing remaining in the
> pg_dump changes.


I’d certainly think “with stats” would be the preferred default of our
users.

Thanks!

Stephen


Re: Statistics Import and Export

2024-03-29 Thread Jeff Davis
On Fri, 2024-03-29 at 05:32 -0400, Corey Huinker wrote:
> That is fairly close to what I came up with per our conversation
> (attached below), but I really like the att_stats_arginfo construct
> and I definitely want to adopt that and expand it to a third
> dimension that flags the fields that cannot be null. I will
> incorporate that into v15.

Sounds good. I think it cuts down on the boilerplate.

> 0002:
> - All relstats and attrstats calls are now their own statement
> instead of a compound statement
> - moved the archive TOC entry from post-data back to SECTION_NONE (as
> it was modeled on object COMMENTs), which seems to work better.
> - remove meta-query in favor of more conventional query building
> - removed all changes to fe_utils/

Can we get a consensus on whether the default should be with stats or
without? That seems like the most important thing remaining in the
pg_dump changes.

There's still a failure in the pg_upgrade TAP test. One seems to be
ordering, so perhaps we need to ORDER BY the attribute number. Others
seem to be missing relstats and I'm not sure why yet. I suggest doing
some manual pg_upgrade tests and comparing the before/after dumps to
see if you can reproduce a smaller version of the problem.

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-03-29 Thread Jeff Davis
On Tue, 2024-03-26 at 00:16 +0100, Tomas Vondra wrote:
> I did take a closer look at v13 today. I have a bunch of comments and
> some minor whitespace fixes in the attached review patches.

I also attached a patch implementing a different approach to the
pg_dump support. Instead of trying to create a query that uses SQL
"format()" to create more SQL, I did all the formatting in C. It turned
out to be about 30% fewer lines, and I find it more understandable and
consistent with the way other stuff in pg_dump happens.

The attached patch is pretty rough -- not many comments, and perhaps
some things should be moved around. I only tested very basic
dump/reload in SQL format.

Regards,
Jeff Davis

From 7ca575e5a02bf380af92b6144622468a501f7636 Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Sat, 16 Mar 2024 17:21:10 -0400
Subject: [PATCH vjeff] Enable dumping of table/index stats in pg_dump.

For each table/matview/index dumped, it will also generate a statement
that calls all of the pg_set_relation_stats() and
pg_set_attribute_stats() calls necessary to restore the statistics of
the current system onto the destination system.

As is the pattern with pg_dump options, this can be disabled with
--no-statistics.
---
 src/bin/pg_dump/pg_backup.h  |   2 +
 src/bin/pg_dump/pg_backup_archiver.c |   5 +
 src/bin/pg_dump/pg_dump.c| 229 ++-
 src/bin/pg_dump/pg_dump.h|   1 +
 src/bin/pg_dump/pg_dumpall.c |   5 +
 src/bin/pg_dump/pg_restore.c |   3 +
 6 files changed, 243 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 9ef2f2017e..1db5cf52eb 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -112,6 +112,7 @@ typedef struct _restoreOptions
 	int			no_publications;	/* Skip publication entries */
 	int			no_security_labels; /* Skip security label entries */
 	int			no_subscriptions;	/* Skip subscription entries */
+	int			no_statistics;		/* Skip statistics import */
 	int			strict_names;
 
 	const char *filename;
@@ -179,6 +180,7 @@ typedef struct _dumpOptions
 	int			no_security_labels;
 	int			no_publications;
 	int			no_subscriptions;
+	int			no_statistics;
 	int			no_toast_compression;
 	int			no_unlogged_table_data;
 	int			serializable_deferrable;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index d97ebaff5b..d5f61399d9 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -2833,6 +2833,10 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	if (ropt->no_subscriptions && strcmp(te->desc, "SUBSCRIPTION") == 0)
 		return 0;
 
+	/* If it's a stats dump, maybe ignore it */
+	if (ropt->no_statistics && strcmp(te->desc, "STATISTICS") == 0)
+		return 0;
+
 	/* Ignore it if section is not to be dumped/restored */
 	switch (curSection)
 	{
@@ -2862,6 +2866,7 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	 */
 	if (strcmp(te->desc, "ACL") == 0 ||
 		strcmp(te->desc, "COMMENT") == 0 ||
+		strcmp(te->desc, "STATISTICS") == 0 ||
 		strcmp(te->desc, "SECURITY LABEL") == 0)
 	{
 		/* Database properties react to createDB, not selectivity options. */
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index b1c4c3ec7f..d483122998 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -428,6 +428,7 @@ main(int argc, char **argv)
 		{"no-comments", no_argument, _comments, 1},
 		{"no-publications", no_argument, _publications, 1},
 		{"no-security-labels", no_argument, _security_labels, 1},
+		{"no-statistics", no_argument, _statistics, 1},
 		{"no-subscriptions", no_argument, _subscriptions, 1},
 		{"no-toast-compression", no_argument, _toast_compression, 1},
 		{"no-unlogged-table-data", no_argument, _unlogged_table_data, 1},
@@ -1144,6 +1145,7 @@ help(const char *progname)
 	printf(_("  --no-commentsdo not dump comments\n"));
 	printf(_("  --no-publicationsdo not dump publications\n"));
 	printf(_("  --no-security-labels do not dump security label assignments\n"));
+	printf(_("  --no-statistics  do not dump statistics\n"));
 	printf(_("  --no-subscriptions   do not dump subscriptions\n"));
 	printf(_("  --no-table-access-method do not dump table access methods\n"));
 	printf(_("  --no-tablespaces do not dump tablespace assignments\n"));
@@ -7001,6 +7003,7 @@ getTables(Archive *fout, int *numTables)
 
 		/* Tables have data */
 		tblinfo[i].dobj.components |= DUMP_COMPONENT_DATA;
+		tblinfo[i].dobj.components |= DUMP_COMPONENT_STATISTICS;
 
 		/* Mark whether table has an ACL */
 		if (!PQgetisnull(res, i, i_relacl))
@@ -7498,6 +7501,7 @@ getIndexes(Archive *fout, TableInfo tblinfo[], int numTables)
 			indxinfo[j].dobj.catId.tableoid = atooid(PQgetvalue(res, j, i_tableoid));
 			indxinfo[j].dobj.catId.oid = 

Re: Statistics Import and Export

2024-03-28 Thread Jeff Davis
Hi Tom,

Comparing the current patch set to your advice below:

On Tue, 2023-12-26 at 14:19 -0500, Tom Lane wrote:
> I had things set up with simple functions, which
> pg_dump would invoke by writing more or less
> 
> SELECT pg_catalog.load_statistics();
> 
> This has a number of advantages, not least of which is that an
> extension
> could plausibly add compatible functions to older versions.

Check.

>   The trick,
> as you say, is to figure out what the argument lists ought to be.
> Unfortunately I recall few details of what I wrote for Salesforce,
> but I think I had it broken down in a way where there was a separate
> function call occurring for each pg_statistic "slot", thus roughly
> 
> load_statistics(table regclass, attname text, stakind int, stavalue
> ...);

The problem with basing the function on pg_statistic directly is that
it can only be exported by the superuser.

The current patches instead base it on the pg_stats view, which already
does the privilege checking. Technically, information about which
stakinds go in which slots is lost, but I don't think that's a problem
as long as the stats make it in, right? It's also more user-friendly to
have nice names for the function arguments. The only downside I see is
that it's slightly asymmetric: exporting from pg_stats and importing
into pg_statistic.
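
For example, the export side only needs a query along these lines (the
table and column names are placeholders), which an ordinary table owner can
run because pg_stats already hides rows they are not allowed to see:

    SELECT null_frac, avg_width, n_distinct,
           most_common_vals, most_common_freqs,
           histogram_bounds, correlation
      FROM pg_stats
     WHERE schemaname = 'public'
       AND tablename = 'mytable'
       AND attname = 'mycolumn';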

I do have some concerns about letting non-superusers import their own
statistics: how robust is the rest of the code to handle malformed
stats once they make it into pg_statistic? Corey has addressed that
with basic input validation, so I think it's fine, but perhaps I'm
missing something.

>  As mentioned already, we'd also need some sort of
> version identifier, and we'd expect the load_statistics() functions
> to be able to transform the data if the old version used a different
> representation.

You mean a version argument to the function, which would appear in the
exported stats data? That's not in the current patch set.

It's relying on the new version of pg_dump understanding the old
statistics data, and dumping it out in a form that the new server will
understand.

>   I agree with the idea that an explicit representation
> of the source table attribute's type would be wise, too.

That's not in the current patch set, either. 

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-03-27 Thread Corey Huinker
>
> 1) The docs say this:
>
>   
>The purpose of this function is to apply statistics values in an
>upgrade situation that are "good enough" for system operation until
>they are replaced by the next ANALYZE, usually via
>autovacuum This function is used by
>pg_upgrade and pg_restore to
>convey the statistics from the old system version into the new one.
>   
>
> I find this a bit confusing, considering the pg_dump/pg_restore changes
> are only in 0002, not in this patch.
>

True, I'll split the docs.


>
> 2) Also, I'm not sure about this:
>
>   relation, the parameters in this are all
>   derived from pg_stats, and the values
>   given are most often extracted from there.
>
> How do we know where do the values come from "most often"? I mean, where
> else would it come from?
>

The next most likely sources would be 1. stats from another similar table
and 2. the imagination of a user testing hypothetical query plans.
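
As a rough sketch of the second case (table name invented, v11-era named
parameters assumed), one could inflate the row counts and then look at the
plans; since the pg_class update is done in-place per the patch notes, a
fresh ANALYZE restores the real values afterwards:

    -- Pretend a small table is huge and see how the planner reacts.
    SELECT pg_catalog.pg_set_relation_stats(
        relation      => 'public.orders'::regclass::oid,
        relpages      => 125000,
        reltuples     => 10000000,
        relallvisible => 125000);

    EXPLAIN SELECT * FROM public.orders WHERE order_id = 42;

    -- Restore reality when done experimenting.
    ANALYZE public.orders;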


>
> 3) The function pg_set_attribute_stats() is very long - 1000 lines
> or so, that's way too many for me to think about. I agree the flow is
> pretty simple, but I still wonder if there's a way to maybe split it
> into some smaller "meaningful" steps.
>

I wrestle with that myself. I think there are some pieces that can be
factored out.


> 4) It took me *ages* to realize the enums at the beginning of some of
> the functions are actually indexes of arguments in PG_FUNCTION_ARGS.
> That'd surely deserve a comment explaining this.
>

My apologies, it definitely deserves a comment.


>
> 5) The comment for param_names in pg_set_attribute_stats says this:
>
> /* names of columns that cannot be null */
> const char *param_names[] = { ... }
>
> but isn't that actually incorrect? I think that applies only to a couple
> initial arguments, but then other fields (MCV, mcelem stats, ...) can be
> NULL, right?
>

Yes, that is vestigial, I'll remove it.


>
> 6) There's a couple minor whitespace fixes or comments etc.
>
>
> 0002
> 
>
> 1) I don't understand why we have exportExtStatsSupported(). Seems
> pointless - nothing calls it, even if it did we don't know how to export
> the stats.
>

It's not strictly necessary.


>
> 2) I think this condition in dumpTableSchema() is actually incorrect:
>
> IMO that's wrong, the statistics should be delayed to the post-data
> section. Which probably means there needs to be a separate dumpable
> object for statistics on table/index, with a dependency on the object.
>

Good points.


>
> 3) I don't like the "STATS IMPORT" description. For extended statistics
> we dump the definition as "STATISTICS" so why to shorten it to "STATS"
> here? And "IMPORT" seems more like the process of loading data, not the
> data itself. So I suggest "STATISTICS DATA".
>

+1


Re: Statistics Import and Export

2024-03-27 Thread Corey Huinker
>
>
>
> +\gexec
>
> Why do we need to construct the command and execute? Can we instead
> execute the function directly? That would also avoid ECHO magic.
>

We don't strictly need it, but I've found the set-difference operation to
be incredibly useful in diagnosing problems. Additionally, the values are
subject to change due to changes in test data, there is no guarantee that
the output of ANALYZE is deterministic, etc. But most of all, it's because
the test cares about the correct copying of values, not the values
themselves.
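
For example, a set-difference probe along these lines (limited to columns
with ordinary equality, since the anyarray columns of pg_stats can't be
compared directly) turns up any attribute whose statistics didn't copy over;
empty output means the clone matches:

    (SELECT attname, inherited, null_frac, avg_width, n_distinct, most_common_freqs
       FROM pg_catalog.pg_stats
      WHERE schemaname = 'stats_export_import' AND tablename = 'test'
     EXCEPT
     SELECT attname, inherited, null_frac, avg_width, n_distinct, most_common_freqs
       FROM pg_catalog.pg_stats
      WHERE schemaname = 'stats_export_import' AND tablename = 'test_clone');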


>
> +   
> +Database Object Statistics Import Functions
> +
> + 
> +  
> +   
> +Function
> +   
> +   
> +Description
> +   
> +  
> + 
>
> COMMENT: The functions throw many validation errors. Do we want to list
> the acceptable/unacceptable input values in the documentation corresponding
> to those? I don't expect one line per argument validation. Something like
> "these, these and these arguments can not be NULL" or "both arguments in
> each of the pairs x and y, a and b, and c and d should be non-NULL or NULL
> respectively".
>

Yes. It should.


>  Statistics are about data. Whenever pg_dump dumps some filtered data, the
> statistics collected for the whole table are useless. We should avoid
> dumping statistics in such a case. E.g. when only the schema is dumped,
> what good are statistics? Similarly, the statistics on a partitioned table
> may not be useful if some of its partitions are not dumped. That said,
> dumping statistics on a foreign table makes sense, since it does not
> contain data but the statistics are still meaningful.
>

Good points, but I'm not immediately sure how to enforce those rules.


>
>
>>
>> Key areas where I'm seeking feedback:
>>
>> - What level of errors in a restore will a user tolerate, and what should
>> be done to the error messages to indicate that the data itself is fine, but
>> a manual operation to update stats on that particular table is now
>> warranted?
>> - To what degree could pg_restore/pg_upgrade take that recovery action
>> automatically?
>> - Should the individual attribute/class set function calls be grouped by
>> relation, so that they all succeed/fail together, or should they be called
>> separately, each able to succeed or fail on their own?
>> - Any other concerns about how to best use these new functions.
>>
>>
>>
> Whether or not I pass --no-statistics, there is no difference in the dump
> output. Am I missing something?
> $ pg_dump -d postgres > /tmp/dump_no_arguments.out
> $ pg_dump -d postgres --no-statistics > /tmp/dump_no_statistics.out
> $ diff /tmp/dump_no_arguments.out /tmp/dump_no_statistics.out
> $
>
> IIUC, pg_dump includes statistics by default. That means all our pg_dump
> related tests will have statistics output by default. That's good since the
> functionality will always be tested. 1. We need additional tests to ensure
> that the statistics are installed after restore. 2. Some of those tests
> compare dumps before and after restore. If the statistics change because
> of auto-analyze running post-restore, these tests will fail.
>

+1


> I believe, in order to import statistics through IMPORT FOREIGN SCHEMA,
> postgresImportForeignSchema() will need to add SELECT commands invoking
> pg_set_relation_stats() on each imported table and pg_set_attribute_stats()
> on each of its attributes. Am I right? Do we want to make that happen in the
> first cut of the feature? How do you expect these functions to be used to
> update statistics of foreign tables?
>

I don't think there's time to get it into this release. I think we'd want
to extend this functionality to both IMPORT FOREIGN SCHEMA and ANALYZE for
foreign tables, in both cases with a server/table option to do regular
remote sampling. In both cases, they'd do a remote query very similar to
what pg_dump does (hence putting it in fe_utils), with some filters on
which columns/tables it believes it can trust. The remote table might
itself be a view (in which case the query would turn up nothing) or column
data types may change across the wire, and in those cases we'd have to fall
back to sampling.


Re: Statistics Import and Export

2024-03-25 Thread Ashutosh Bapat
Hi Corey,


On Sat, Mar 23, 2024 at 7:21 AM Corey Huinker 
wrote:

> v12 attached.
>
> 0001 -
>
>
Some random comments

+SELECT
+format('SELECT pg_catalog.pg_set_attribute_stats( '
+|| 'relation => %L::regclass::oid, attname => %L::name, '
+|| 'inherited => %L::boolean, null_frac => %L::real, '
+|| 'avg_width => %L::integer, n_distinct => %L::real, '
+|| 'most_common_vals => %L::text, '
+|| 'most_common_freqs => %L::real[], '
+|| 'histogram_bounds => %L::text, '
+|| 'correlation => %L::real, '
+|| 'most_common_elems => %L::text, '
+|| 'most_common_elem_freqs => %L::real[], '
+|| 'elem_count_histogram => %L::real[], '
+|| 'range_length_histogram => %L::text, '
+|| 'range_empty_frac => %L::real, '
+|| 'range_bounds_histogram => %L::text) ',
+'stats_export_import.' || s.tablename || '_clone', s.attname,
+s.inherited, s.null_frac,
+s.avg_width, s.n_distinct,
+s.most_common_vals, s.most_common_freqs, s.histogram_bounds,
+s.correlation, s.most_common_elems, s.most_common_elem_freqs,
+s.elem_count_histogram, s.range_length_histogram,
+s.range_empty_frac, s.range_bounds_histogram)
+FROM pg_catalog.pg_stats AS s
+WHERE s.schemaname = 'stats_export_import'
+AND s.tablename IN ('test', 'is_odd')
+\gexec

Why do we need to construct the command and execute? Can we instead execute
the function directly? That would also avoid ECHO magic.

+   
+Database Object Statistics Import Functions
+
+ 
+  
+   
+Function
+   
+   
+Description
+   
+  
+ 

COMMENT: The functions throw many validation errors. Do we want to list the
acceptable/unacceptable input values in the documentation corresponding to
those? I don't expect one line per argument validation. Something like
"these, these and these arguments can not be NULL" or "both arguments in
each of the pairs x and y, a and b, and c and d should be non-NULL or NULL
respectively".



> The functions pg_set_relation_stats() and pg_set_attribute_stats() now
> return void. There just weren't enough conditions where a condition was
> considered recoverable to justify having it. This may mean that combining
> multiple pg_set_attribute_stats calls into one compound statement may no
> longer be desirable, but that's just one of the places where I'd like
> feedback on how pg_dump/pg_restore use these functions.
>
>
> 0002 -
>
> This patch concerns invoking the functions in 0001 via
> pg_restore/pg_upgrade. Little has changed here. Dumping statistics is
> currently the default for pg_dump/pg_restore/pg_upgrade, and can be
> switched off with the switch --no-statistics. Some have expressed concern
> about whether stats dumping should be the default. I have a slight
> preference for making it the default, for the following reasons:
>
>
+ /* Statistics are dependent on the definition, not the data */
+ /* Views don't have stats */
+ if ((tbinfo->dobj.dump & DUMP_COMPONENT_STATISTICS) &&
+ (tbinfo->relkind == RELKIND_VIEW))
+ dumpRelationStats(fout, &tbinfo->dobj, reltypename,
+  tbinfo->dobj.dumpId);
+

Statistics are about data. Whenever pg_dump dumps some filtered data, the
statistics collected for the whole table are useless. We should avoid
dumping statistics in such a case. E.g. when only the schema is dumped,
what good are statistics? Similarly, the statistics on a partitioned table
may not be useful if some of its partitions are not dumped. That said,
dumping statistics on a foreign table makes sense, since it does not
contain data but the statistics are still meaningful.


>
> Key areas where I'm seeking feedback:
>
> - What level of errors in a restore will a user tolerate, and what should
> be done to the error messages to indicate that the data itself is fine, but
> a manual operation to update stats on that particular table is now
> warranted?
> - To what degree could pg_restore/pg_upgrade take that recovery action
> automatically?
> - Should the individual attribute/class set function calls be grouped by
> relation, so that they all succeed/fail together, or should they be called
> separately, each able to succeed or fail on their own?
> - Any other concerns about how to best use these new functions.
>
>
>
Whether or not I pass --no-statistics, there is no difference in the dump
output. Am I missing something?
$ pg_dump -d postgres > /tmp/dump_no_arguments.out
$ pg_dump -d postgres --no-statistics > /tmp/dump_no_statistics.out
$ diff /tmp/dump_no_arguments.out /tmp/dump_no_statistics.out
$

IIUC, pg_dump includes statistics by default. That means all our pg_dump
related tests will have statistics output by default. That's good since the
functionality will always be tested. 1. We need additional tests to ensure
that the statistics are installed after restore. 2. Some of those tests
compare dumps before and after restore. If the statistics change because
of auto-analyze running post-restore, these tests will fail.

Re: Statistics Import and Export

2024-03-21 Thread Corey Huinker
>
>
>
> But ideally we'd just make it safe to dump and reload stats on your own
> tables, and then not worry about it.
>

That is my strong preference, yes.


>
> > Not off hand, no.
>
> To me it seems like inconsistent data to have most_common_freqs in
> anything but descending order, and we should prevent it.
>

Sorry, I misunderstood, I thought we were talking about values, not the
frequencies. Yes, the frequencies should only be monotonically
non-increasing (i.e. each value can go down or stay flat from N to N+1).
I'll add a test case for that.


Re: Statistics Import and Export

2024-03-21 Thread Jeff Davis
On Thu, 2024-03-21 at 15:10 -0400, Corey Huinker wrote:
> 
> In which case wouldn't the checkCanModify on pg_statistic be a
> proxy for is_superuser/has_special_role_we_havent_created_yet.

So if someone pg_dumps their table and gets the statistics in the SQL,
then they will get errors loading it unless they are a member of a
special role?

If so we'd certainly need to make --no-statistics the default, and have
some way of skipping stats during reload of the dump (perhaps make the
set function a no-op based on a GUC?).

But ideally we'd just make it safe to dump and reload stats on your own
tables, and then not worry about it.

> Not off hand, no.

To me it seems like inconsistent data to have most_common_freqs in
anything but descending order, and we should prevent it.

> > 
Regards,
Jeff Davis





Re: Statistics Import and Export

2024-03-21 Thread Corey Huinker
>
> How about just some defaults then? Many of them have a reasonable
> default, like NULL or an empty array. Some are parallel arrays and
> either both should be specified or neither (e.g.
> most_common_vals+most_common_freqs), but you can check for that.
>

+1
Default NULL has been implemented for all parameters after n_distinct.
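
For instance, a call shaped like this sketch (hypothetical table, signature
as in the current patches) supplies the MCV pair together and simply omits
everything after it, letting those parameters default to NULL:

    SELECT pg_catalog.pg_set_attribute_stats(
        relation          => 'public.test'::regclass::oid,
        attname           => 'id'::name,
        inherited         => false,
        null_frac         => 0.0::real,
        avg_width         => 4,
        n_distinct        => 10.0::real,
        most_common_vals  => '{1,2,3}'::text,
        most_common_freqs => '{0.3,0.2,0.1}'::real[]);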


>
> > > Why are you calling checkCanModifyRelation() twice?
> >
> > Once for the relation itself, and once for pg_statistic.
>
> Nobody has the privileges to modify pg_statistic except superuser,
> right? I thought the point of a privilege check is that users could
> modify statistics for their own tables, or the tables they maintain.
>

In which case wouldn't the checkCanModify on pg_statistic be a proxy
for is_superuser/has_special_role_we_havent_created_yet.



>
> >
> > I can see making it void and returning an error for everything that
> > we currently return false for, but if we do that, then a statement
> > with one pg_set_relation_stats, and N pg_set_attribute_stats (which
> > we lump together in one command for the locking benefits and atomic
> > transaction) would fail entirely if one of the set_attributes named a
> > column that we had dropped. It's up for debate whether that's the
> > right behavior or not.
>
> I'd probably make the dropped column a WARNING with a message like
> "skipping dropped column whatever". Regardless, have some kind of
> explanatory comment.
>

That's certainly do-able.




>
> >
> > I pulled most of the hardcoded values from pg_stats itself. The
> > sample set is trivially small, and the values inserted were in-order-
> > ish. So maybe that's why.
>
> In my simple test, most_common_freqs is descending:
>
>CREATE TABLE a(i int);
>INSERT INTO a VALUES(1);
>INSERT INTO a VALUES(2);
>INSERT INTO a VALUES(2);
>INSERT INTO a VALUES(3);
>INSERT INTO a VALUES(3);
>INSERT INTO a VALUES(3);
>INSERT INTO a VALUES(4);
>INSERT INTO a VALUES(4);
>INSERT INTO a VALUES(4);
>INSERT INTO a VALUES(4);
>ANALYZE a;
>SELECT most_common_vals, most_common_freqs
>  FROM pg_stats WHERE tablename='a';
> most_common_vals | most_common_freqs
>--+---
> {4,3,2}  | {0.4,0.3,0.2}
>(1 row)
>
> Can you show an example where it's not?
>

Not off hand, no.



>
> >
> > Maybe we could have the functions restricted to a role or roles:
> >
> > 1. pg_write_all_stats (can modify stats on ANY table)
> > 2. pg_write_own_stats (can modify stats on tables owned by user)
>
> If we go that route, we are giving up on the ability for users to
> restore stats on their own tables. Let's just be careful about
> validating data to mitigate this risk.
>

A great many test cases coming in the next patch.


Re: Statistics Import and Export

2024-03-21 Thread Jeff Davis
On Thu, 2024-03-21 at 03:27 -0400, Corey Huinker wrote:
> 
> They can, but part of what I wanted to show was that the values that
> aren't directly passed in as parameters (staopN, stacollN) get set to
> the correct values, and those values aren't guaranteed to match
> across databases, hence testing them in the regression test rather
> than in a TAP test. I'd still like to be able to test that.

OK, that's fine.

> > The function signature for pg_set_attribute_stats could be more
> > friendly 
...
> 1. We'd have to compare the stats provided against the stats that are
> already there, make that list in-memory, and then re-order what
> remains
> 2. There would be no way to un-set statistics of a given stakind,
> unless we added an "actually set it null" boolean for each parameter
> that can be null. 
> 3. I tried that with the JSON formats, it made the code even messier
> than it already was.

How about just some defaults then? Many of them have a reasonable
default, like NULL or an empty array. Some are parallel arrays and
either both should be specified or neither (e.g.
most_common_vals+most_common_freqs), but you can check for that.

> > Why are you calling checkCanModifyRelation() twice?
> 
> Once for the relation itself, and once for pg_statistic.

Nobody has the privileges to modify pg_statistic except superuser,
right? I thought the point of a privilege check is that users could
modify statistics for their own tables, or the tables they maintain.

> 
> I can see making it void and returning an error for everything that
> we currently return false for, but if we do that, then a statement
> with one pg_set_relation_stats, and N pg_set_attribute_stats (which
> we lump together in one command for the locking benefits and atomic
> transaction) would fail entirely if one of the set_attributes named a
> column that we had dropped. It's up for debate whether that's the
> right behavior or not.

I'd probably make the dropped column a WARNING with a message like
"skipping dropped column whatever". Regardless, have some kind of
explanatory comment.

> 
> I pulled most of the hardcoded values from pg_stats itself. The
> sample set is trivially small, and the values inserted were in-order-
> ish. So maybe that's why.

In my simple test, most_common_freqs is descending:

   CREATE TABLE a(i int);
   INSERT INTO a VALUES(1);
   INSERT INTO a VALUES(2);
   INSERT INTO a VALUES(2);
   INSERT INTO a VALUES(3);
   INSERT INTO a VALUES(3);
   INSERT INTO a VALUES(3);
   INSERT INTO a VALUES(4);
   INSERT INTO a VALUES(4);
   INSERT INTO a VALUES(4);
   INSERT INTO a VALUES(4);
   ANALYZE a;
   SELECT most_common_vals, most_common_freqs
 FROM pg_stats WHERE tablename='a';
most_common_vals | most_common_freqs 
   --+---
{4,3,2}  | {0.4,0.3,0.2}
   (1 row)

Can you show an example where it's not?

> 
> Maybe we could have the functions restricted to a role or roles:
> 
> 1. pg_write_all_stats (can modify stats on ANY table)
> 2. pg_write_own_stats (can modify stats on tables owned by user)

If we go that route, we are giving up on the ability for users to
restore stats on their own tables. Let's just be careful about
validating data to mitigate this risk.

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-03-21 Thread Corey Huinker
On Thu, Mar 21, 2024 at 2:29 AM Jeff Davis  wrote:

> On Tue, 2024-03-19 at 05:16 -0400, Corey Huinker wrote:
> > v11 attached.
>
> Thank you.
>
> Comments on 0001:
>
> This test:
>
>+SELECT
>+format('SELECT pg_catalog.pg_set_attribute_stats( '
>...
>
> seems misplaced. It's generating SQL that can be used to restore or
> copy the stats -- that seems like the job of pg_dump, and shouldn't be
> tested within the plain SQL regression tests.
>

Fair enough.


>
> And can the other tests use pg_stats rather than pg_statistic?
>

They can, but part of what I wanted to show was that the values that aren't
directly passed in as parameters (staopN, stacollN) get set to the correct
values, and those values aren't guaranteed to match across databases, hence
testing them in the regression test rather than in a TAP test. I'd still
like to be able to test that.
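
For reference, the slot metadata being checked lives only in pg_statistic,
not in pg_stats, so the regression test has to read something like the
following (requires superuser; the table name here is the one used by the
test schema):

    SELECT staattnum, stainherit,
           stakind1, staop1, stacoll1,
           stakind2, staop2, stacoll2
      FROM pg_catalog.pg_statistic
     WHERE starelid = 'stats_export_import.test_clone'::regclass
     ORDER BY staattnum, stainherit;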


>
> The function signature for pg_set_attribute_stats could be more
> friendly -- how about there are a few required parameters, and then it
> only sets the stats that are provided and the other ones are either
> left to the existing value or get some reasonable default?
>

That would be problematic.

1. We'd have to compare the stats provided against the stats that are
already there, make that list in-memory, and then re-order what remains
2. There would be no way to un-set statistics of a given stakind, unless we
added an "actually set it null" boolean for each parameter that can be
null.
3. I tried that with the JSON formats, it made the code even messier than
it already was.

> Make sure all error paths ReleaseSysCache().
>

+1


>
> Why are you calling checkCanModifyRelation() twice?
>

Once for the relation itself, and once for pg_statistic.


> I'm confused about when the function should return false and when it
> should throw an error. I'm inclined to think the return type should be
> void and all failures should be reported as ERROR.
>

I go back and forth on that. I can see making it void and returning an
error for everything that we currently return false for, but if we do that,
then a statement with one pg_set_relation_stats, and N
pg_set_attribute_stats (which we lump together in one command for the
locking benefits and atomic transaction) would fail entirely if one of the
set_attributes named a column that we had dropped. It's up for debate
whether that's the right behavior or not.

replaces[] is initialized to {true}, which means only the first element
> is initialized to true. Try following the pattern in AlterDatabase (or
> similar) which reads the catalog tuple first, then updates a few fields
> selectively, setting the corresponding element of replaces[] along the
> way.
>

+1.


>
> The test also sets the most_common_freqs in an ascending order, which
> is weird.
>

I pulled most of the hardcoded values from pg_stats itself. The sample set
is trivially small, and the values inserted were in-order-ish. So maybe
that's why.


> Relatedly, I got worried recently about the idea of plain users
> updating statistics. In theory, that should be fine, and the planner
> should be robust to whatever pg_statistic contains; but in practice
> there's some risk of mischief there until everyone understands that the
> contents of pg_stats should not be trusted. Fortunately I didn't find
> any planner crashes or even errors after a brief test.
>

Maybe we could have the functions restricted to a role or roles:

1. pg_write_all_stats (can modify stats on ANY table)
2. pg_write_own_stats (can modify stats on tables owned by user)

I'm iffy on the need for the first one, I list it first purely to show how
I derived the name for the second.


> One thing we can do is some extra validation for consistency, like
> checking that the arrays are properly sorted, checking for negative
> numbers in the wrong place, or fractions larger than 1.0, etc.
>

+1. All suggestions of validation checks welcome.
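
As one concrete sketch of such a check, a query like this flags any pg_stats
row whose most_common_freqs is not in non-increasing order or contains a
frequency outside [0, 1]; the same shape of test could be applied to the
input arrays inside the set function:

    SELECT schemaname, tablename, attname, most_common_freqs
      FROM pg_catalog.pg_stats s
     WHERE s.most_common_freqs IS NOT NULL
       AND (s.most_common_freqs <> (SELECT array_agg(f ORDER BY f DESC)
                                      FROM unnest(s.most_common_freqs) AS f)
            OR EXISTS (SELECT 1
                         FROM unnest(s.most_common_freqs) AS f
                        WHERE f < 0 OR f > 1));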


Re: Statistics Import and Export

2024-03-21 Thread Jeff Davis
On Tue, 2024-03-19 at 05:16 -0400, Corey Huinker wrote:
> v11 attached.

Thank you.

Comments on 0001:

This test:

   +SELECT
   +format('SELECT pg_catalog.pg_set_attribute_stats( '
   ...

seems misplaced. It's generating SQL that can be used to restore or
copy the stats -- that seems like the job of pg_dump, and shouldn't be
tested within the plain SQL regression tests.

And can the other tests use pg_stats rather than pg_statistic?

The function signature for pg_set_attribute_stats could be more
friendly -- how about there are a few required parameters, and then it
only sets the stats that are provided and the other ones are either
left to the existing value or get some reasonable default?

Make sure all error paths ReleaseSysCache().

Why are you calling checkCanModifyRelation() twice?

I'm confused about when the function should return false and when it
should throw an error. I'm inclined to think the return type should be
void and all failures should be reported as ERROR.

replaces[] is initialized to {true}, which means only the first element
is initialized to true. Try following the pattern in AlterDatabase (or
similar) which reads the catalog tuple first, then updates a few fields
selectively, setting the corresponding element of replaces[] along the
way.

The test also sets the most_common_freqs in an ascending order, which
is weird.

Relatedly, I got worried recently about the idea of plain users
updating statistics. In theory, that should be fine, and the planner
should be robust to whatever pg_statistic contains; but in practice
there's some risk of mischief there until everyone understands that the
contents of pg_stats should not be trusted. Fortunately I didn't find
any planner crashes or even errors after a brief test.

One thing we can do is some extra validation for consistency, like
checking that the arrays are properly sorted, checking for negative
numbers in the wrong place, or fractions larger than 1.0, etc.

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-03-19 Thread Corey Huinker
v11 attached.

- TAP tests passing (the big glitch was that indexes that are used in
constraints should have their stats dependent on the constraint, not the
index, thanks Jeff)
- The new range-specific statistics types are now supported. I'm not happy
with the typid machinations I do to get them to work, but it is working so
far. These are stored out-of-stakind-order (7 before 6), which is odd
because all other types seem to store stakinds in ascending order. It
shouldn't matter, it was just odd.
- regression tests now make simpler calls with arbitrary stats to
demonstrate the function usage more cleanly
- pg_set_*_stats function now have all of their parameters in the same
order as the table/view they pull from
From 5c63ed5748eb3817d193b64329b57dc590e0196e Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Mon, 11 Mar 2024 14:18:39 -0400
Subject: [PATCH v11 1/2] Create pg_set_relation_stats, pg_set_attribute_stats.

These functions will be used by pg_dump/restore and pg_upgrade to convey
relation and attribute statistics from the source database to the
target. This would be done instead of vacuumdb --analyze-in-stages.

Both functions take an oid to identify the target relation that will
receive the statistics. There is nothing requiring that relation to be
the same one as the one exported, though the statistics would have to
make sense in the context of the new relation.

The parameters of pg_set_attribute_stats intentionally mirror the columns
in the view pg_stats, with the ANYARRAY types cast to TEXT. Those
values will be cast to arrays of the basetype of the attribute, and that
operation may fail if the attribute type has changed.

The statistics imported by pg_set_attribute_stats are imported
transactionally like any other operation. However, pg_set_relation_stats
does its update in-place, which is to say non-transactionally. This is
in line with what ANALYZE does to avoid table bloat in pg_class.

These functions also allow for tweaking of table statistics in-place,
allowing the user to inflate rowcounts, skew histograms, etc., to see
what those changes will evoke from the query planner.
---
 src/include/catalog/pg_proc.dat   |  15 +
 src/include/statistics/statistics.h   |   2 +
 src/backend/statistics/Makefile   |   3 +-
 src/backend/statistics/meson.build|   1 +
 src/backend/statistics/statistics.c   | 603 ++
 .../regress/expected/stats_export_import.out  | 283 
 src/test/regress/parallel_schedule|   2 +-
 src/test/regress/sql/stats_export_import.sql  | 246 +++
 doc/src/sgml/func.sgml|  91 +++
 9 files changed, 1244 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/statistics/statistics.c
 create mode 100644 src/test/regress/expected/stats_export_import.out
 create mode 100644 src/test/regress/sql/stats_export_import.sql

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 177d81a891..f31412d4a6 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12177,4 +12177,19 @@
   proargtypes => 'int2',
   prosrc => 'gist_stratnum_identity' },
 
+# Statistics Import
+{ oid => '8048',
+  descr => 'set statistics on relation',
+  proname => 'pg_set_relation_stats', provolatile => 'v', proisstrict => 't',
+  proparallel => 'u', prorettype => 'bool',
+  proargtypes => 'oid int4 float4 int4',
+  proargnames => '{relation,relpages,reltuples,relallvisible}',
+  prosrc => 'pg_set_relation_stats' },
+{ oid => '8049',
+  descr => 'set statistics on attribute',
+  proname => 'pg_set_attribute_stats', provolatile => 'v', proisstrict => 'f',
+  proparallel => 'u', prorettype => 'bool',
+  proargtypes => 'oid name bool float4 int4 float4 text _float4 text float4 text _float4 _float4 text float4 text',
+  proargnames => '{relation,attname,inherited,null_frac,avg_width,n_distinct,most_common_vals,most_common_freqs,histogram_bounds,correlation,most_common_elems,most_common_elem_freqs,elem_count_histogram,range_length_histogram,range_empty_frac,range_bounds_histogram}',
+  prosrc => 'pg_set_attribute_stats' },
 ]
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 7f2bf18716..1dddf96576 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -127,4 +127,6 @@ extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 int nclauses);
 extern HeapTuple statext_expressions_load(Oid stxoid, bool inh, int idx);
 
+extern Datum pg_set_relation_stats(PG_FUNCTION_ARGS);
+extern Datum pg_set_attribute_stats(PG_FUNCTION_ARGS);
 #endif			/* STATISTICS_H */
diff --git a/src/backend/statistics/Makefile b/src/backend/statistics/Makefile
index 89cf8c2797..e4f8ab7c4f 100644
--- a/src/backend/statistics/Makefile
+++ b/src/backend/statistics/Makefile
@@ -16,6 +16,7 @@ OBJS = \
 	dependencies.o \
 	extended_stats.o \
 	mcv.o \
-	

Re: Statistics Import and Export

2024-03-18 Thread Corey Huinker
>
>
>
>
> From testrun/pg_dump/002_pg_dump/log/regress_log_002_pg_dump, search
> for the "not ok" and then look at what it tried to do right before
> that. I see:
>
> pg_dump: error: prepared statement failed: ERROR:  syntax error at or
> near "%"
> LINE 1: ..._histogram => %L::real[]) coalesce($2, format('%I.%I',
> a.nsp...
>

Thanks. Unfamiliar turf for me.


>
> > All those changes are available in the patches attached.
>
> How about if you provided "get" versions of the functions that return a
> set of rows that match what the "set" versions expect? That would make
> 0001 essentially a complete feature itself.
>

That's tricky. At the base level, those functions would just be an
encapsulation of "SELECT * FROM pg_stats WHERE schemaname = $1 AND
tablename = $2" which isn't all that much of a savings. Perhaps we can make
the documentation more explicit about the source and nature of the
parameters going into the pg_set_ functions.

Per conversation, it would be trivial to add helper functions that
replace the parameters after the initial oid with a pg_class rowtype, and
that would dissect the values needed and call the more complex function:

pg_set_relation_stats( oid, pg_class)
pg_set_attribute_stats( oid, pg_stats)
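
On the "get" side, a hypothetical wrapper (not part of the patch set, name
invented) really would be little more than that encapsulation:

    CREATE FUNCTION get_attribute_stats(p_schemaname name, p_tablename name)
    RETURNS SETOF pg_catalog.pg_stats
    LANGUAGE sql STABLE
    AS $$
        SELECT * FROM pg_catalog.pg_stats s
         WHERE s.schemaname = p_schemaname
           AND s.tablename  = p_tablename;
    $$;

    -- e.g. SELECT * FROM get_attribute_stats('public', 'test');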


>
> I think it would also make the changes in pg_dump simpler, and the
> tests in 0001 a lot simpler.
>

I agree. The tests are currently showing that a fidelity copy can be made
from one table to another, but to do so we have to conceal the actual stats
values because those are 1. not deterministic/known and 2. subject to
change from version to version.

I can add some sets to arbitrary values, as was done for
pg_set_relation_stats().


Re: Statistics Import and Export

2024-03-18 Thread Jeff Davis
On Sun, 2024-03-17 at 23:33 -0400, Corey Huinker wrote:
> 
> A few TAP tests are still failing and I haven't been able to diagnose
> why, though the failures in parallel dump seem to be that it tries to
> import stats on indexes that haven't been created yet, which is odd
> because I sent the dependency.

From testrun/pg_dump/002_pg_dump/log/regress_log_002_pg_dump, search
for the "not ok" and then look at what it tried to do right before
that. I see:

pg_dump: error: prepared statement failed: ERROR:  syntax error at or
near "%"
LINE 1: ..._histogram => %L::real[]) coalesce($2, format('%I.%I',
a.nsp...

> All those changes are available in the patches attached.

How about if you provided "get" versions of the functions that return a
set of rows that match what the "set" versions expect? That would make
0001 essentially a complete feature itself.

I think it would also make the changes in pg_dump simpler, and the
tests in 0001 a lot simpler.

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-03-17 Thread Corey Huinker
>
>
> * pg_set_relation_stats() needs to do an ACL check so you can't set the
> stats on someone else's table. I suggest honoring the new MAINTAIN
> privilege as well.
>

Added.


>
> * If possible, reading from pg_stats (instead of pg_statistic) would be
> ideal because pg_stats already does the right checks at read time, so a
> non-superuser can export stats, too.
>

Done. That was sorta how it was originally, so returning to that wasn't too
hard.


>
> * If reading from pg_stats, should you change the signature of
> pg_set_relation_stats() to have argument names matching the columns of
> pg_stats (e.g. most_common_vals instead of stakind/stavalues)?
>

Done.


>
> In other words, make this a slightly higher level: conceptually
> exporting/importing pg_stats rather than pg_statistic. This may also
> make the SQL export queries simpler.
>

Eh, about the same.


> Also, I'm wondering about error handling. Is some kind of error thrown
> by pg_set_relation_stats() going to abort an entire restore? That might
> be easy to prevent with pg_restore, because it can just omit the stats,
> but harder if it's in a SQL file.
>

Aside from the oid being invalid, there's not a whole lot that can go wrong
in set_relation_stats(). The error checking I did closely mirrors that in
analyze.c.

Aside from the changes you suggested, as well as the error reporting change
you suggested for pg_dump, I also filtered out attempts to dump stats on
views.

A few TAP tests are still failing and I haven't been able to diagnose why,
though the failures in parallel dump seem to be that it tries to import
stats on indexes that haven't been created yet, which is odd because I sent
the dependency.

All those changes are available in the patches attached.
From ba411ce31c25193cf05bc4206dcb8f2bf32af0c7 Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Mon, 11 Mar 2024 14:18:39 -0400
Subject: [PATCH v10 1/2] Create pg_set_relation_stats, pg_set_attribute_stats.

These functions will be used by pg_dump/restore and pg_upgrade to convey
relation and attribute statistics from the source database to the
target. This would be done instead of vacuumdb --analyze-in-stages.

Both functions take an oid to identify the target relation that will
receive the statistics. There is nothing requiring that relation to be
the same one as the one exported, though the statistics would have to
make sense in the context of the new relation.

The parameters of pg_set_attribute_stats intentionally mirror the columns
in the view pg_stats, with the ANYARRAY types cast to TEXT. Those
values will be cast to arrays of the basetype of the attribute, and that
operation may fail if the attribute type has changed.

The statistics imported by pg_set_attribute_stats are imported
transactionally like any other operation. However, pg_set_relation_stats
does its update in-place, which is to say non-transactionally. This is
in line with what ANALYZE does to avoid table bloat in pg_class.

These functions also allow for tweaking of table statistics in-place,
allowing the user to inflate rowcounts, skew histograms, etc., to see
what those changes will evoke from the query planner.
---
 src/include/catalog/pg_proc.dat   |  15 +
 src/include/statistics/statistics.h   |   2 +
 src/backend/statistics/Makefile   |   3 +-
 src/backend/statistics/meson.build|   1 +
 src/backend/statistics/statistics.c   | 521 ++
 .../regress/expected/stats_export_import.out  | 211 +++
 src/test/regress/parallel_schedule|   2 +-
 src/test/regress/sql/stats_export_import.sql  | 188 +++
 doc/src/sgml/func.sgml|  83 +++
 9 files changed, 1024 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/statistics/statistics.c
 create mode 100644 src/test/regress/expected/stats_export_import.out
 create mode 100644 src/test/regress/sql/stats_export_import.sql

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 700f7daf7b..3070d72d7a 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12170,4 +12170,19 @@
   proargtypes => 'int2',
   prosrc => 'gist_stratnum_identity' },
 
+# Statistics Import
+{ oid => '8048',
+  descr => 'set statistics on relation',
+  proname => 'pg_set_relation_stats', provolatile => 'v', proisstrict => 't',
+  proparallel => 'u', prorettype => 'bool',
+  proargtypes => 'oid float4 int4 int4',
+  proargnames => '{relation,reltuples,relpages,relallvisible}',
+  prosrc => 'pg_set_relation_stats' },
+{ oid => '8049',
+  descr => 'set statistics on attribute',
+  proname => 'pg_set_attribute_stats', provolatile => 'v', proisstrict => 'f',
+  proparallel => 'u', prorettype => 'bool',
+  proargtypes => 'oid name bool float4 int4 float4 text _float4 text float4 text _float4 _float4',
+  proargnames => 

Re: Statistics Import and Export

2024-03-15 Thread Jeff Davis
On Fri, 2024-03-15 at 15:30 -0700, Jeff Davis wrote:
> Still looking, but one quick comment is that the third argument of
> dumpRelationStats() should be const, which eliminates a warning.

A few other comments:

* pg_set_relation_stats() needs to do an ACL check so you can't set the
stats on someone else's table. I suggest honoring the new MAINTAIN
privilege as well.

* If possible, reading from pg_stats (instead of pg_statistic) would be
ideal because pg_stats already does the right checks at read time, so a
non-superuser can export stats, too.

* If reading from pg_stats, should you change the signature of
pg_set_relation_stats() to have argument names matching the columns of
pg_stats (e.g. most_common_vals instead of stakind/stavalues)?

In other words, make this a slightly higher level: conceptually
exporting/importing pg_stats rather than pg_statistic. This may also
make the SQL export queries simpler.

Also, I'm wondering about error handling. Is some kind of error thrown
by pg_set_relation_stats() going to abort an entire restore? That might
be easy to prevent with pg_restore, because it can just omit the stats,
but harder if it's in a SQL file.

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-03-15 Thread Jeff Davis
On Fri, 2024-03-15 at 03:55 -0400, Corey Huinker wrote:
> 
> Statistics are preserved by default, but this can be disabled with
> the option --no-statistics. This follows the prevailing option
> pattern in pg_dump, etc.

I'm not sure if saving statistics should be the default in 17. I'm
inclined to make it opt-in.

> There are currently several failing TAP tests around
> pg_dump/pg_restore/pg_upgrade.

It is a permissions problem. When the user running pg_dump is not the
superuser, they don't have permission to access pg_statistic. That
causes an error in exportRelationStatsStmt(), which returns NULL, and
then the caller segfaults.

> I'm looking at those, but in the mean time I'm seeking feedback on
> the progress so far.

Still looking, but one quick comment is that the third argument of
dumpRelationStats() should be const, which eliminates a warning.

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-03-15 Thread Corey Huinker
>
> ANALYZE takes out one lock on StatisticRelationId per relation, not per
> attribute like we do now. If we didn't release the lock after every
> attribute, and we only called the function outside of a larger transaction
> (as we plan to do with pg_restore) then that is the closest we're going to
> get to being consistent with ANALYZE.
>

v9 attached. This adds pg_dump support. It works in tests against existing
databases such as dvdrental, though I was surprised at how few indexes have
attribute stats there.

Statistics are preserved by default, but this can be disabled with the
option --no-statistics. This follows the prevailing option pattern in
pg_dump, etc.

There are currently several failing TAP tests around
pg_dump/pg_restore/pg_upgrade. I'm looking at those, but in the mean
time I'm seeking feedback on the progress so far.
From bf291e323fc01215264d41b75d579c11bd22d2ec Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Mon, 11 Mar 2024 14:18:39 -0400
Subject: [PATCH v9 1/2] Create pg_set_relation_stats, pg_set_attribute_stats.

These functions will be used by pg_dump/restore and pg_upgrade to convey
relation and attribute statistics from the source database to the
target. This would be done instead of vacuumdb --analyze-in-stages.

Both functions take an oid to identify the target relation that will
receive the statistics. There is nothing requiring that relation to be
the same one as the one exported, though the statistics would have to
make sense in the context of the new relation. Typecasts for stavaluesN
parameters may fail if the destination column is not of the same type as
the source column.

The parameters of pg_set_attribute_stats identify the attribute by name
rather than by attnum. This is intentional because the column order may
be different in situations other than binary upgrades. For example,
dropped columns do not carry over in a restore.

The statistics imported by pg_set_attribute_stats are imported
transactionally like any other operation. However, pg_set_relation_stats
does its update in-place, which is to say non-transactionally. This is
in line with what ANALYZE does to avoid table bloat in pg_class.

These functions also allow for tweaking of table statistics in-place,
allowing the user to inflate rowcounts, skew histograms, etc., to see
what those changes will evoke from the query planner.
---
 src/include/catalog/pg_proc.dat   |  15 +
 src/include/statistics/statistics.h   |   2 +
 src/backend/statistics/Makefile   |   3 +-
 src/backend/statistics/meson.build|   1 +
 src/backend/statistics/statistics.c   | 410 ++
 .../regress/expected/stats_export_import.out  | 211 +
 src/test/regress/parallel_schedule|   2 +-
 src/test/regress/sql/stats_export_import.sql  | 198 +
 doc/src/sgml/func.sgml|  95 
 9 files changed, 935 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/statistics/statistics.c
 create mode 100644 src/test/regress/expected/stats_export_import.out
 create mode 100644 src/test/regress/sql/stats_export_import.sql

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 700f7daf7b..a726451a6f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12170,4 +12170,19 @@
   proargtypes => 'int2',
   prosrc => 'gist_stratnum_identity' },
 
+# Statistics Import
+{ oid => '8048',
+  descr => 'set statistics on relation',
+  proname => 'pg_set_relation_stats', provolatile => 'v', proisstrict => 't',
+  proparallel => 'u', prorettype => 'bool',
+  proargtypes => 'oid float4 int4 int4',
+  proargnames => '{relation,reltuples,relpages,relallvisible}',
+  prosrc => 'pg_set_relation_stats' },
+{ oid => '8049',
+  descr => 'set statistics on attribute',
+  proname => 'pg_set_attribute_stats', provolatile => 'v', proisstrict => 'f',
+  proparallel => 'u', prorettype => 'bool',
+  proargtypes => 'oid name bool float4 int4 float4 int2 int2 int2 int2 int2 _float4 _float4 _float4 _float4 _float4 text text text text text',
+  proargnames => '{relation,attname,stainherit,stanullfrac,stawidth,stadistinct,stakind1,stakind2,stakind3,stakind4,stakind5,stanumbers1,stanumbers2,stanumbers3,stanumbers4,stanumbers5,stavalues1,stavalues2,stavalues3,stavalues4,stavalues5}',
+  prosrc => 'pg_set_attribute_stats' },
 ]
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 7f2bf18716..1dddf96576 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -127,4 +127,6 @@ extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 int nclauses);
 extern HeapTuple statext_expressions_load(Oid stxoid, bool inh, int idx);
 
+extern Datum pg_set_relation_stats(PG_FUNCTION_ARGS);
+extern Datum pg_set_attribute_stats(PG_FUNCTION_ARGS);
 #endif			/* STATISTICS_H */
diff --git 

Re: Statistics Import and Export

2024-03-13 Thread Corey Huinker
>
> Note that there's two different things we're talking about here- the
> lock on the relation that we're analyzing and then the lock on the
> pg_statistic (or pg_class) catalog itself.  Currently, at least, it
> looks like in the three places in the backend that we open
> StatisticRelationId, we release the lock when we close it rather than
> waiting for transaction end.  I'd be inclined to keep it that way in
> these functions also.  I doubt that one lock will end up causing much in
> the way of issues to acquire/release it multiple times and it would keep
> the code consistent with the way ANALYZE works.
>

ANALYZE takes out one lock on StatisticRelationId per relation, not per
attribute like we do now. If we didn't release the lock after every
attribute, and we only called the function outside of a larger transaction
(as we plan to do with pg_restore) then that is the closest we're going to
get to being consistent with ANALYZE.


Re: Statistics Import and Export

2024-03-13 Thread Stephen Frost
Greetings,

* Corey Huinker (corey.huin...@gmail.com) wrote:
> > No, we should be keeping the lock until the end of the transaction
> > (which in this case would be just the one statement, but it would be the
> > whole statement and all of the calls in it).  See analyze.c:268 or
> > so, where we call relation_close(onerel, NoLock); meaning we're closing
> > the relation but we're *not* releasing the lock on it- it'll get
> > released at the end of the transaction.
>
> If that's the case, then changing the two table_close() statements to
> NoLock should resolve any remaining concern.

Note that there's two different things we're talking about here- the
lock on the relation that we're analyzing and then the lock on the
pg_statistic (or pg_class) catalog itself.  Currently, at least, it
looks like in the three places in the backend that we open
StatisticRelationId, we release the lock when we close it rather than
waiting for transaction end.  I'd be inclined to keep it that way in
these functions also.  I doubt that one lock will end up causing much in
the way of issues to acquire/release it multiple times and it would keep
the code consistent with the way ANALYZE works.

If it can be shown to be an issue then we could certainly revisit this.

Thanks,

Stephen




Re: Statistics Import and Export

2024-03-12 Thread Corey Huinker
>
> No, we should be keeping the lock until the end of the transaction
> (which in this case would be just the one statement, but it would be the
> whole statement and all of the calls in it).  See analyze.c:268 or
> so, where we call relation_close(onerel, NoLock); meaning we're closing
> the relation but we're *not* releasing the lock on it- it'll get
> released at the end of the transaction.
>
>
If that's the case, then changing the two table_close() statements to
NoLock should resolve any remaining concern.

