Re: improve --with-lz4 install documentation

2022-02-17 Thread Michael Paquier
On Wed, Feb 16, 2022 at 11:36:31PM +0530, Jeevan Ladhe wrote:
> Now that lz4 also supports backup compression, decompression, and more
> future enhancements that can be done here, I think it is better to make
> the --with-lz4 description more generic than adding specific details in
> there.
> 
> Attached is the patch to remove the specifics related to WAL and TOAST
> compression.

With everything that has been merged recently based on LZ4, it makes
sense to me to simplify things as you are suggesting.

install-windows.sgml includes the same description, so we'd better do
the same thing and simplify the paragraph of LZ4 there as well, no?
--
Michael




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-17 Thread Julien Rouhaud
Hi,

On Fri, Feb 18, 2022 at 12:20:26PM +0530, Nitin Jadhav wrote:
> >
> > If there's a timed-triggered checkpoint and then someone calls
> > pg_start_backup() which then waits for the end of the current checkpoint
> > (possibly after changing the flags), I think the view should reflect that in
> > some way.  Maybe storing an array of (pid, flags) is too much, but at least a
> > counter with the number of processes actively waiting for the end of the
> > checkpoint.
> 
> Okay. I feel this can be added as an additional field, but it will not
> replace the backend_pid field, as that represents the pid of the backend
> which triggered the current checkpoint.

I don't think that's true.  Requesting a checkpoint means telling the
checkpointer that it should wake up and start a checkpoint (or restore point)
if it's not already doing so, so the pid will always be the checkpointer pid.
The only exception is a standalone backend, but in that case you won't be able
to query that view anyway.

Also, while looking at the patch I see the same problem that I
mentioned in the previous thread, which is that the effective flags can be
updated once the checkpoint has started, and as-is the view won't reflect that.
It also means that you can't simply display one of wal, time or force, but
rather a possible combination of the flags (including the one not handled in v1).
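
Since the checkpointer effectively ORs the flags of all pending requests
together, the view probably needs to decode a bitmask rather than show a
single value.  A minimal standalone sketch of that idea (the flag bit values
below are placeholders, not the ones from xlog.h):

#include <stdio.h>
#include <string.h>

/* Placeholder flag bits, for illustration only (not copied from xlog.h). */
#define CKPT_IS_SHUTDOWN     (1 << 0)
#define CKPT_END_OF_RECOVERY (1 << 1)
#define CKPT_IMMEDIATE       (1 << 2)
#define CKPT_FORCE           (1 << 3)
#define CKPT_WAIT            (1 << 4)
#define CKPT_CAUSE_XLOG      (1 << 5)
#define CKPT_CAUSE_TIME      (1 << 6)

/* Build a comma-separated description of every flag that is set. */
static void
describe_flags(int flags, char *buf, size_t buflen)
{
	static const struct { int bit; const char *name; } names[] = {
		{CKPT_IS_SHUTDOWN, "shutdown"}, {CKPT_END_OF_RECOVERY, "end-of-recovery"},
		{CKPT_IMMEDIATE, "immediate"}, {CKPT_FORCE, "force"},
		{CKPT_WAIT, "wait"}, {CKPT_CAUSE_XLOG, "wal"}, {CKPT_CAUSE_TIME, "time"},
	};

	buf[0] = '\0';
	for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++)
		if (flags & names[i].bit)
		{
			if (buf[0] != '\0')
				strncat(buf, ", ", buflen - strlen(buf) - 1);
			strncat(buf, names[i].name, buflen - strlen(buf) - 1);
		}
}

int
main(void)
{
	char	desc[128];

	/* e.g. a timed checkpoint whose flags were later upgraded by a backup */
	describe_flags(CKPT_CAUSE_TIME | CKPT_FORCE | CKPT_WAIT, desc, sizeof(desc));
	printf("%s\n", desc);		/* prints: force, wait, time */
	return 0;
}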

> Probably a new field named 'processes_waiting' or 'events_waiting' can be
> added for this purpose.

Maybe num_process_waiting?

> > > > > 'checkpoint or restartpoint?'
> > > >
> > > > Do you actually need to store that?  Can't it be inferred from
> > > > pg_is_in_recovery()?
> > >
> > > AFAIK we cannot use pg_is_in_recovery() to predict whether it is a
> > > checkpoint or restartpoint because if the system exits from recovery
> > > mode during restartpoint then any query to pg_stat_progress_checkpoint
> > > view will return it as a checkpoint which is ideally not correct. Please
> > > correct me if I am wrong.
> >
> > Recovery ends with an end-of-recovery checkpoint that has to finish before the
> > promotion can happen, so I don't think that a restartpoint can still be in progress
> > if pg_is_in_recovery() returns false.
> 
> Probably the writing of buffers or syncing of files may complete before
> pg_is_in_recovery() returns false. But there are some cleanup
> operations that happen as part of the checkpoint. In this scenario, we
> may get a false value for pg_is_in_recovery(). Please refer to the
> following piece of code in CreateRestartpoint().
> 
> if (!RecoveryInProgress())
> replayTLI = XLogCtl->InsertTimeLineID;

Then maybe we could store the timeline rather than the kind of checkpoint?
You should still be able to compute the information while giving a bit more
information for the same memory usage.




Re: Time to drop plpython2?

2022-02-17 Thread Mikael Kjellström




On 2022-02-17 19:08, Andres Freund wrote:

I've pinged the owners of the animals failing so far:


Now also pinged:
- curculio


Should be fixed by now.

I did install the python3-package but the binary was called:

/usr/local/bin/python3.5

for some reason so configure didn't pick it up.

Fixed it by creating a symlink to:

/usr/local/bin/python3

/Mikael




Re: Timeout control within tests

2022-02-17 Thread Noah Misch
On Thu, Feb 17, 2022 at 09:48:25PM -0800, Andres Freund wrote:
> On 2022-02-17 21:28:42 -0800, Noah Misch wrote:
> > I propose to have environment variable PG_TEST_TIMEOUT_DEFAULT control the
> > timeout used in the places that currently hard-code 180s.
> 
> Meson's test runner has the concept of a "timeout multiplier" for ways of
> running tests. Meson's stuff is about entire tests (i.e. one tap test), so
> doesn't apply here, but I wonder if we shouldn't do something similar?

Hmmm.  It is good if the user can express an intent that continues to make
sense if we change the default timeout.  For the buildfarm use case, a
multiplier is moderately better on that axis (PG_TEST_TIMEOUT_MULTIPLIER=100
beats PG_TEST_TIMEOUT_DEFAULT=18000).  For the hacker use case, an absolute
value is substantially better on that axis (PG_TEST_TIMEOUT_DEFAULT=3 beats
PG_TEST_TIMEOUT_MULTIPLIER=.01).
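
For illustration, a tiny standalone sketch of how the absolute-value variable
would resolve (plain C; the TAP helpers would do the equivalent in Perl, so
this is only meant to show the composition with the 180s default):

#include <stdio.h>
#include <stdlib.h>

/* Use PG_TEST_TIMEOUT_DEFAULT if set and sane, else the historical 180s. */
static long
test_timeout_default(void)
{
	const char *val = getenv("PG_TEST_TIMEOUT_DEFAULT");

	if (val != NULL)
	{
		char   *end;
		long	t = strtol(val, &end, 10);

		if (*end == '\0' && t > 0)
			return t;
	}
	return 180;
}

int
main(void)
{
	printf("timeout: %ld s\n", test_timeout_default());
	return 0;
}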

> That
> way we could adjust different timeouts with one setting, instead of many
> different fobs to adjust?

I expect multiplier vs. absolute value doesn't change the expected number of
settings.  If this change proceeds, we'd have three: PG_TEST_TIMEOUT_DEFAULT,
PGCTLTIMEOUT, and PGISOLATIONTIMEOUT.  PGCTLTIMEOUT is separate for conceptual
reasons, and PGISOLATIONTIMEOUT is separate for historical reasons.  There's
little use case for setting them to unequal values.  If Meson can pass down
the overall timeout in effect for the test file, we could compute all three
variables from the passed-down value.  Orthogonal to Meson, as I mentioned, we
could eliminate PGISOLATIONTIMEOUT.

timeouts.spec used to have substantial timeouts that had to elapse for the
test to pass.  (Commit 741d7f1 ended that era.)  A multiplier would have been
a good fit for that use case.  If a similar test came back, we'd likely want
two multipliers, a low one for elapsing timeouts and a high one for
non-elapsing timeouts.  A multiplier of 10-100 is reasonable for non-elapsing
timeouts, with the exact value being irrelevant on the buildfarm.  Setting an
elapsing timeout higher than necessary causes measurable waste.

One could argue for offering both a multiplier variable and an absolute-value
variable.  If there's just one variable, I think the absolute-value variable
is more compelling, due to the aforementioned hacker use case.  What do you
think?




Re: automatically generating node support functions

2022-02-17 Thread Peter Eisentraut

On 14.02.22 18:09, Tom Lane wrote:

* It's time for action on the business about extracting comments
from the to-be-deleted code.


done


* The Perl script is kind of under-commented for my taste.
It lacks a copyright notice, too.


done


* In the same vein, I should not have to reverse-engineer what
the available pg_node_attr() properties are or do.  Perhaps they
could be documented in the comment for the pg_node_attr macro
in nodes.h.


done


* Maybe the generated file names could be chosen less opaquely,
say ".funcs" and ".switch" instead of ".inc1" and ".inc2".


done


* I don't understand why there are changes in the #include
lists in copyfuncs.c etc?


Those are #include lines required for the definitions of various 
structs.  Since the generation script already knows which header files 
are relevant (they are its input files), it can just generate the 
required #include lines as well.  That way, the remaining copyfuncs.c 
only has #include lines for things that the (remaining) file itself 
needs, not what the files included by it need, and if a new header file 
were to be added, it doesn't have to be added in 4+ places.



* I think that more thought needs to be put into the format
of the *nodes.h struct declarations, because I fear pgindent
is going to make a hash of what you've done here.  When we
did similar stuff in the catalog headers, I think we ended
up moving a lot of end-of-line comments onto their own lines.


I have tested pgindent repeatedly throughout this project, and it 
doesn't look too bad.  You are right that some manual curation of 
comment formatting would be sensible, but I think that might be better 
done as a separate patch.



* I assume the pg_config_manual.h changes are not meant for
commit?


right

From cdd9f2a6738b8c85cec8ba6169f281ebdc4d2e4d Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 18 Feb 2022 07:39:42 +0100
Subject: [PATCH v5] Automatically generate node support functions

Add a script to automatically generate the node support functions
(copy, equal, out, and read, as well as the node tags enum) from the
struct definitions.

For each of the four node support files, it creates two include files,
e.g., copyfuncs.funcs.c and copyfuncs.switch.c, to include in the main
file.  All the scaffolding of the main file stays in place.

TODO: In this patch, I have only ifdef'ed out the code that could be
removed, mainly so that it won't constantly have merge conflicts.
Eventually, that should all be changed to delete the code.  All the
code comments that are worth keeping from those sections have already
been moved to the header files where the structs are defined.

I have tried to mostly make the coverage of the output match what is
currently there.  For example, one could now do out/read coverage of
utility statement nodes, but I have manually excluded those for now.
The reason is mainly that it's easier to diff the before and after,
and adding a bunch of stuff like this might require a separate
analysis and review.

Subtyping (TidScan -> Scan) is supported.

For the hard cases, you can just write a manual function and exclude
generating one.  For the not so hard cases, there is a way of
annotating struct fields to get special behaviors.  For example,
pg_node_attr(equal_ignore) has the field ignored in equal functions.

Discussion: 
https://www.postgresql.org/message-id/flat/c1097590-a6a4-486a-64b1-e1f9cc0533ce%40enterprisedb.com
---
 src/backend/Makefile  |   8 +-
 src/backend/nodes/.gitignore  |   4 +
 src/backend/nodes/Makefile|  46 ++
 src/backend/nodes/copyfuncs.c |  19 +-
 src/backend/nodes/equalfuncs.c|  22 +-
 src/backend/nodes/gen_node_support.pl | 729 ++
 src/backend/nodes/outfuncs.c  |  34 +-
 src/backend/nodes/readfuncs.c |  23 +-
 src/include/nodes/.gitignore  |   2 +
 src/include/nodes/nodes.h |  27 +
 src/include/nodes/parsenodes.h|   3 +-
 src/include/nodes/pathnodes.h | 170 +++---
 src/include/nodes/plannodes.h |  90 ++--
 src/include/nodes/primnodes.h |  33 +-
 src/include/utils/rel.h   |   6 +-
 src/tools/msvc/Solution.pm|  46 ++
 16 files changed, 1113 insertions(+), 149 deletions(-)
 create mode 100644 src/backend/nodes/.gitignore
 create mode 100644 src/backend/nodes/gen_node_support.pl
 create mode 100644 src/include/nodes/.gitignore

diff --git a/src/backend/Makefile b/src/backend/Makefile
index 4a02006788..821bef2694 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -143,11 +143,15 @@ storage/lmgr/lwlocknames.h: storage/lmgr/generate-lwlocknames.pl storage/lmgr/lw
 submake-catalog-headers:
$(MAKE) -C catalog distprep generated-header-symlinks
 
+# run this unconditionally to avoid needing to know its dependencies here:
+submake-nodes-headers:
+   $(MAKE) -C nodes distprep generated-header-symlinks
+
 # run this unconditionally to avoid 

Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-17 Thread Nitin Jadhav
> > > > Thank you for sharing the information.  'triggering backend PID' (int)
> > > > - can be stored without any problem.
> > >
> > > There can be multiple processes triggering a checkpoint, or at least wanting it
> > > to happen or happen faster.
> >
> > Yes. There can be multiple processes but there will be one checkpoint
> > operation at a time. So the backend PID corresponds to the current
> > checkpoint operation. Let me know if I am missing something.
>
> If there's a timed-triggered checkpoint and then someone calls
> pg_start_backup() which then waits for the end of the current checkpoint
> (possibly after changing the flags), I think the view should reflect that in
> some way.  Maybe storing an array of (pid, flags) is too much, but at least a
> counter with the number of processes actively waiting for the end of the
> checkpoint.

Okay. I feel this can be added as an additional field, but it will not
replace the backend_pid field, as that represents the pid of the backend
which triggered the current checkpoint. Probably a new field named
'processes_waiting' or 'events_waiting' can be added for this purpose.
Thoughts?

> > > > 'checkpoint or restartpoint?'
> > >
> > > Do you actually need to store that?  Can't it be inferred from
> > > pg_is_in_recovery()?
> >
> > AFAIK we cannot use pg_is_in_recovery() to predict whether it is a
> > checkpoint or restartpoint because if the system exits from recovery
> > mode during restartpoint then any query to pg_stat_progress_checkpoint
> > view will return it as a checkpoint which is ideally not correct. Please
> > correct me if I am wrong.
>
> Recovery ends with an end-of-recovery checkpoint that has to finish before the
> promotion can happen, so I don't think that a restartpoint can still be in progress
> if pg_is_in_recovery() returns false.

Probably the writing of buffers or syncing of files may complete before
pg_is_in_recovery() returns false. But there are some cleanup
operations that happen as part of the checkpoint. In this scenario, we
may get a false value for pg_is_in_recovery(). Please refer to the
following piece of code in CreateRestartpoint().

if (!RecoveryInProgress())
replayTLI = XLogCtl->InsertTimeLineID;

Thanks & Regards,
Nitin Jadhav

On Thu, Feb 17, 2022 at 10:57 PM Julien Rouhaud  wrote:
>
> Hi,
>
> On Thu, Feb 17, 2022 at 10:39:02PM +0530, Nitin Jadhav wrote:
> > > > Thank you for sharing the information.  'triggering backend PID' (int)
> > > > - can be stored without any problem.
> > >
> > > There can be multiple processes triggering a checkpoint, or at least wanting it
> > > to happen or happen faster.
> >
> > Yes. There can be multiple processes but there will be one checkpoint
> > operation at a time. So the backend PID corresponds to the current
> > checkpoint operation. Let me know if I am missing something.
>
> If there's a timed-triggered checkpoint and then someone calls
> pg_start_backup() which then waits for the end of the current checkpoint
> (possibly after changing the flags), I think the view should reflect that in
> some way.  Maybe storing an array of (pid, flags) is too much, but at least a
> counter with the number of processes actively waiting for the end of the
> checkpoint.
>
> > > > 'checkpoint or restartpoint?'
> > >
> > > Do you actually need to store that?  Can't it be inferred from
> > > pg_is_in_recovery()?
> >
> > AFAIK we cannot use pg_is_in_recovery() to predict whether it is a
> > checkpoint or restartpoint because if the system exits from recovery
> > mode during restartpoint then any query to pg_stat_progress_checkpoint
> > view will return it as a checkpoint which is ideally not correct. Please
> > correct me if I am wrong.
>
> Recovery ends with an end-of-recovery checkpoint that has to finish before the
> promotion can happen, so I don't think that a restartpoint can still be in progress
> if pg_is_in_recovery() returns false.




Re: row filtering for logical replication

2022-02-17 Thread Peter Smith
On Thu, Feb 17, 2022 at 5:37 PM Amit Kapila  wrote:
...
> As there is a new version, I would like to wait for a few more days
> before committing. I am planning to commit this early next week (by
> Tuesday) unless others or I see any more things that can be improved.

I have no more review comments.

This Row Filter patch v85 LGTM.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Fix CheckIndexCompatible comment

2022-02-17 Thread Yugo NAGATA
On Fri, 18 Feb 2022 12:22:32 +0900
Fujii Masao  wrote:

> 
> 
> On 2022/02/07 19:14, Yugo NAGATA wrote:
> > Agreed. I updated the patch to add a comment about 'oldId'.
> 
> Thanks for updating the patch! I slightly modified the patch and pushed it.

Thanks!

> 
> Regards,
> 
> -- 
> Fujii Masao
> Advanced Computing Technology Center
> Research and Development Headquarters
> NTT DATA CORPORATION


-- 
Yugo NAGATA 




RE: Failed transaction statistics to measure the logical replication progress

2022-02-17 Thread tanghy.f...@fujitsu.com
On Wed, Jan 12, 2022 8:35 PM osumi.takami...@fujitsu.com 
 wrote:
> 
> The attached v21 has a couple of other minor updates
> like a modification of error message text.
> 
> 

Thanks for updating the patch. Here are some comments.

1) I saw the following description about pg_stat_subscription_workers view in
existing doc:

   The pg_stat_subscription_workers view will contain
   one row per subscription worker on which errors have occurred, ...

It only says "which errors have occurred"; maybe we should also mention
transactions here, right?

2)
/* --
+ * pgstat_send_subworker_xact_stats() -
+ *
+ * Send a subworker's transaction stats to the collector.
+ * The statistics are cleared upon return.

Should "The statistics are cleared upon return" be changed to "The statistics are
cleared upon sending"? Because if PGSTAT_STAT_INTERVAL hasn't elapsed and the
transaction stats are not sent, the function will return without clearing out
the statistics.  (See the sketch after these comments.)

3)
+   Assert(command == LOGICAL_REP_MSG_COMMIT ||
+  command == LOGICAL_REP_MSG_STREAM_COMMIT ||
+  command == LOGICAL_REP_MSG_COMMIT_PREPARED ||
+  command == LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+   switch (command)
+   {
+   case LOGICAL_REP_MSG_COMMIT:
+   case LOGICAL_REP_MSG_STREAM_COMMIT:
+   case LOGICAL_REP_MSG_COMMIT_PREPARED:
+   MyLogicalRepWorker->commit_count++;
+   break;
+   case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+   MyLogicalRepWorker->abort_count++;
+   break;
+   default:
+   ereport(ERROR,
+   errmsg("invalid logical message type 
for transaction statistics of subscription"));
+   break;
+   }

I'm not sure whether we need the assert, because it will report an error later
if command is an invalid value.

4) I noticed that the abort_count doesn't include aborted streaming 
transactions.
Should we take this case into consideration?
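
On point 2), here is a minimal standalone sketch (plain C, placeholder names,
not the patch's actual code) of the interval-gated pattern I mean: the
counters are cleared only on the path that actually sends them.

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define STAT_INTERVAL_MS 500	/* placeholder, analogous to PGSTAT_STAT_INTERVAL */

static int64_t commit_count = 0;
static int64_t abort_count = 0;
static struct timespec last_send;

static int64_t
elapsed_ms(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1000 + (b->tv_nsec - a->tv_nsec) / 1000000;
}

/* Try to send accumulated xact stats; clear them only if we actually sent. */
static bool
maybe_send_xact_stats(bool force)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	if (!force && elapsed_ms(&last_send, &now) < STAT_INTERVAL_MS)
		return false;			/* too soon: keep the counters, i.e. "cleared upon sending" */

	/* ... send commit_count / abort_count to the collector here ... */
	commit_count = 0;
	abort_count = 0;
	last_send = now;
	return true;
}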

Regards,
Tang



Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-17 Thread Nitin Jadhav
> Interesting idea, and overall a nice addition to the
> pg_stat_progress_* reporting infrastructure.
>
> Could you add your patch to the current commitfest at
> https://commitfest.postgresql.org/37/?
>
> See below for some comments on the patch:

Thank you for reviewing.
I have added it to the commitfest - https://commitfest.postgresql.org/37/3545/

> > xlog.c @ checkpoint_progress_start, checkpoint_progress_update_param, 
> > checkpoint_progress_end
> > +/* In bootstrap mode, we don't actually record anything. */
> > +if (IsBootstrapProcessingMode())
> > +return;
>
> Why do you check against the state of the system?
> pgstat_progress_update_* already provides protections against updating
> the progress tables if the progress infrastructure is not loaded; and
> otherwise (in the happy path) the cost of updating the progress fields
> will be quite a bit higher than normal. Updating stat_progress isn't
> very expensive (quite cheap, really), so I don't quite get why you
> guard against reporting stats when you expect no other client to be
> listening.

Nice point. I agree that the extra guards (IsBootstrapProcessingMode()
and (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) ==
0) are not needed for the view, as the progress reporting mechanism
handles that internally (it only updates when there is an access to the
pg_stat_progress_activity view). However, I am planning to report the
progress of shutdown and end-of-recovery checkpoints in the server logs,
since we don't have access to the view in those cases, and there these
guards are necessary. checkpoint_progress_update_param() is a generic
function to report progress either to the view or to the server logs.
Thoughts?
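
To spell out what I have in mind, a rough sketch only (the function name
follows the patch, and the exact guards and log format are not settled):

#include "postgres.h"
#include "access/xlog.h"		/* CHECKPOINT_* flag bits */
#include "miscadmin.h"			/* IsBootstrapProcessingMode() */
#include "pgstat.h"				/* pgstat_progress_update_param() */

static void
checkpoint_progress_update_param(int flags, int index, int64 val)
{
	/* Bootstrap mode has no progress machinery to report to. */
	if (IsBootstrapProcessingMode())
		return;

	if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
	{
		/* Nobody can query the view here, so report to the server log. */
		ereport(LOG,
				(errmsg("checkpoint progress: parameter %d = " INT64_FORMAT,
						index, val)));
		return;
	}

	pgstat_progress_update_param(index, val);
}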

> I think you can simplify this a lot by directly using
> pgstat_progress_update_param() instead.
>
> > xlog.c @ checkpoint_progress_start
> > +pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
> > +checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
> > +                                 PROGRESS_CHECKPOINT_PHASE_INIT);
> > +if (flags & CHECKPOINT_CAUSE_XLOG)
> > +checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> > +                                 PROGRESS_CHECKPOINT_KIND_WAL);
> > +else if (flags & CHECKPOINT_CAUSE_TIME)
> > +checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> > +                                 PROGRESS_CHECKPOINT_KIND_TIME);
> > + [...]
>
> Could you assign the kind of checkpoint to a local variable, and then
> update the "phase" and "kind" parameters at the same time through
> pgstat_progress_update_multi_param(2, ...)? See
> BuildRelationExtStatistics in extended_stats.c for an example usage.

I will make use of pgstat_progress_update_multi_param() in the next
patch to replace multiple calls to checkpoint_progress_update_param().
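
For example, something along these lines (a sketch only; the
PROGRESS_CHECKPOINT_* constants come from my patch and are still subject to
change, the KIND_FORCE fallback is just a placeholder, and the snippet
assumes the xlog.c context where pgstat.h is already included --
pgstat_progress_update_multi_param() itself is the existing API):

/* Report checkpoint "phase" and "kind" with a single progress call. */
static void
report_checkpoint_start(int flags)
{
	const int	index[] = {PROGRESS_CHECKPOINT_PHASE, PROGRESS_CHECKPOINT_KIND};
	int64		val[2];

	val[0] = PROGRESS_CHECKPOINT_PHASE_INIT;
	if (flags & CHECKPOINT_CAUSE_XLOG)
		val[1] = PROGRESS_CHECKPOINT_KIND_WAL;
	else if (flags & CHECKPOINT_CAUSE_TIME)
		val[1] = PROGRESS_CHECKPOINT_KIND_TIME;
	else
		val[1] = PROGRESS_CHECKPOINT_KIND_FORCE;	/* placeholder for "other" */

	pgstat_progress_update_multi_param(2, index, val);
}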

> Note that regardless of whether checkpoint_progress_update* will
> remain, the checks done in that function already have been checked in
> this function as well, so you can use the pgstat_* functions directly.

As I mentioned before, I am planning to add progress reporting to the
server logs as well; checkpoint_progress_update_param() is required for
that and makes the job easier.

> > monitoring.sgml
> > +   pg_stat_progress_checkpoint view will contain a
> > +   single row indicating the progress of checkpoint operation.
>
>... add "if a checkpoint is currently active".

I feel adding extra words here to indicate "if a checkpoint is
currently active" is not necessary, as the view description already
provides that information, and the current wording aligns with the
documentation of the existing progress views.

> > +   total_buffer_writes bigint
> > +   total_file_syncs bigint
>
> The other progress tables use [type]_total as column names for counter
> targets (e.g. backup_total for backup_streamed, heap_blks_total for
> heap_blks_scanned, etc.). I think that `buffers_total` and
> `files_total` would be better column names.

I agree and I will update this in the next patch.

> > +   The checkpoint operation is requested due to XLOG filling.
>
> + The checkpoint was started because >max_wal_size< of WAL was written.

How about this "The checkpoint is started because max_wal_size is reached".

> > +   The checkpoint operation is requested due to timeout.
>
> + The checkpoint was started due to the expiration of a
> >checkpoint_timeout< interval

"The checkpoint is started because checkpoint_timeout expired".

> > +   The checkpoint operation is forced even if no XLOG activity has 
> > occurred
> > +   since the last one.
>
> + Some operation forced a checkpoint.

"The checkpoint is started because some operation forced a checkpoint".

> > +  checkpointing CommitTs pages
>
> CommitTs -> Commit time stamp

I will handle this in the next patch.

Thanks & Regards,
Nitin Jadhav
> On Thu, 10 Feb 2022 at 07:53, Nitin Jadhav
>  wrote:
> >
> > > > We 

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-17 Thread Dilip Kumar
On Fri, Feb 18, 2022 at 5:09 AM Andres Freund  wrote:

Thanks a lot Andres for taking time to read the thread and patch.

> > The only other kind of hazard I can think of is: could it be unsafe to
> > try to interpret the contents of a database to which no one else is
> > connected at present due to any of the issues you mention? But what's
> > the hazard exactly?
>
> I don't quite know. But I don't think it's impossible to run into trouble in
> this area. E.g. xid horizons are computed in a database specific way. If the
> routine reading pg_class did hot pruning, you could end up removing data
> that's actually needed for a logical slot in the other database because the
> backend local horizon state was computed for the "local" database?

I agree that while computing the xid horizon (ComputeXidHorizons()),
we only consider the backends that are connected to the same database
we are connected to.  But we don't need to worry here, because we
know for a fact that there can be absolutely no backend connected to
the database we are trying to copy, so we don't need to worry about
pruning tuples that are visible to other backends.

As for replication slots, we also take the xmin horizon from the
replication slots into account, so I don't think we have a problem
there either.  And we also consider the walsenders when computing the
xid horizon.

> Could there be problems because other backends wouldn't see the backend
> accessing the other database as being connected to that database
> (PGPROC->databaseId)?

You mean that other backends will not consider this backend (which is
copying the database) as connected to the source database?  Yeah, that
is correct, but what is the problem with that?  Other backends cannot
connect to the source database, so what problem can they create for
the backend that is copying the database?

> Or what if somebody optimized snapshots to disregard readonly transactions in
> other databases?

Can you elaborate on this point?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?

2022-02-17 Thread Andy Fan
On Fri, Feb 18, 2022 at 4:15 AM Robert Haas  wrote:

> On Tue, Feb 1, 2022 at 10:08 AM Andy Fan  wrote:
> > To address the row estimation issue, the most straightforward way to fix
> > this is to ignore the derived clauses when figuring out the RelOptInfo->rows
> > on the base relation.  To note which clause is derived from this patch, I
> > added a new field "EquivalenceClass *derived" in RestrictInfo, and then added
> > an included_derived option in clauselist_selectivity_ext; during the
> > set_xxx_rel_size functions, we can pass included_derived=false.  This
> > strategy should be used in get_parameterized_baserel_size.  In all the other
> > cases, include_derived=true is used.  This is finished in commit 2.
> > (Commit 1 is David's patch, I just rebased it.)
>
> That doesn't sound correct to me.
>
> Suppose that we have A.x = B.x and also A.x < 42. We can choose to
> enforce A.x < 42 or we can choose to enforce B.x < 42 or we can do
> both. In general, any of those could be right:


This is impressive.  To achieve this, we have to treat a.x < 42 and
b.x < 42 equally rather than treating b.x < 42 as derived from a.x < 42,
and enforce that the final plan executes at least one qual from such a
group.  This introduces some more complexity at first glance, but I think
it is a great aspect to think about.


> .., which is good. However, the
> row count estimate for B will be too high, because it will not include
> the effect of B.x < 42. And that means that the cost estimate for
> join(A, B) will be wrong. It will be too high, because it's going to
> think that it has more rows coming from the B side of the join than
> what is actually the case. And that can also mess up the plan at
> higher levels.
>
>
IIUC, this would not happen if we apply commit 3.

In commit 3, the real rows for the scan path are adjusted by
RelOptInfo->filter_rows, which takes the effect of B.x < 42 into account.
It is later used in the cost_xxx_path functions.  Here is an example:

regression=# explain analyze select * from tenk1 a, tenk1 b where
a.thousand = b.thousand and a.thousand < 100;
                                   QUERY PLAN
------------------------------------------------------------------------------
 Nested Loop  (cost=24.90..459.16 rows=10740 width=488) (actual time=0.416..17.459 rows=1 loops=1)
   ->  Bitmap Heap Scan on tenk1 b  (cost=24.61..383.03 rows=1074 width=244) (actual time=0.369..1.801 rows=1000 loops=1)
         Recheck Cond: (thousand < 100)
         Heap Blocks: exact=324
         ->  Bitmap Index Scan on tenk1_thous_tenthous  (cost=0.00..24.34 rows=1074 width=0) (actual time=0.251..0.251 rows=1000 loops=1)
               Index Cond: (thousand < 100)
   ->  Memoize  (cost=0.30..0.47 rows=1 width=244) (actual time=0.002..0.006 rows=10 loops=1000)
         Cache Key: b.thousand
         Cache Mode: logical
         Hits: 900  Misses: 100  Evictions: 0  Overflows: 0  Memory Usage: 277kB
         ->  Index Scan using tenk1_thous_tenthous on tenk1 a  (cost=0.29..0.46 rows=1 width=244) (actual time=0.012..0.033 rows=10 loops=100)
               Index Cond: ((thousand = b.thousand) AND (thousand < 100))
 Planning Time: 0.934 ms
 Execution Time: 18.496 ms
(14 rows)

b.thousand < 100 is derived from a.thousand < 100, and the final path cost
is:

   ->  Bitmap Heap Scan on tenk1 b  (cost=24.61..383.03 rows=1074 width=244) (actual time=0.369..1.801 rows=1000 loops=1)
         Recheck Cond: (thousand < 100)
         Heap Blocks: exact=324
         ->  Bitmap Index Scan on tenk1_thous_tenthous  (cost=0.00..24.34 rows=1074 width=0) (actual time=0.251..0.251 rows=1000 loops=1)
               Index Cond: (thousand < 100)

Which is exactly the same as for "select * from tenk1 where thousand < 100;".

== Commit 3 [1] message with some modification ==

Introduce RelOptInfo.filtered_rows.

Previously, Path.rows (shown in the EXPLAIN output) and RelOptInfo.rows
(which is used to calculate the joinrel's estimated rows) were the same
for many scan paths, like SeqScan, IndexScan, BitmapHeapScan and so on.
But they would be different after distributing a new restrictinfo from
ec_filter.  So I developed RelOptInfo.filtered_rows to take some duty
out of RelOptInfo.rows.  RelOptInfo.filtered_rows counts the effect of
the derived quals and is used by the cost_xxx functions.



> On the other hand, I completely agree with David's comments on the
> other thread to the effect that holding our breath is not getting us
> anywhere.


+1 to this as well.  The PostgreSQL community has a great reputation
in the world, mostly because the authority figures in this community
have set a good example for those who follow, and that is not easy.
But if the authority figures are too strict about code quality, that
is not good for the community either; we should encourage more people
to have a try, to some extent.

Taking the current feature as an example, the estimation issue absolutely
needs a fix.  While I do listen and think carefully about how to reduce the
planner's extra cost, and rethink whether any important items are missed by
me, I also think David's method is not 

Re: Race conditions in 019_replslot_limit.pl

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-17 19:41:40 -0800, Andres Freund wrote:
> At this point the psql "walsender" is *not* terminated, because the fatal
> message is waiting behind all the WAL data sent. Which also means that the
> slot isn't yet released.

Isn't it pretty bonkers that we allow error processing to get stuck behind
network traffic, *before* we have released resources (locks etc)?

I wonder if we should try to send, but do it in a nonblocking way.
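
For what it's worth, a minimal standalone sketch of the "try to send, but
don't block" idea at the plain-socket level (not the pq_* layer): attempt the
write with MSG_DONTWAIT and give up on EAGAIN so the process can go on to
release its resources.

#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Best-effort send of an error message: never blocks.  Returns true if the
 * whole message went out, false if the socket buffer was full (or the peer
 * is gone) and we chose to give up rather than wait.
 * Note: MSG_NOSIGNAL is not available on every platform.
 */
static bool
send_error_nonblocking(int sock, const char *msg, size_t len)
{
	while (len > 0)
	{
		ssize_t		n = send(sock, msg, len, MSG_DONTWAIT | MSG_NOSIGNAL);

		if (n > 0)
		{
			msg += n;
			len -= (size_t) n;
			continue;
		}
		if (n < 0 && errno == EINTR)
			continue;
		/* EAGAIN/EWOULDBLOCK, EPIPE, etc.: don't wait behind the network */
		return false;
	}
	return true;
}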

Greetings,

Andres Freund




Re: Timeout control within tests

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-17 21:28:42 -0800, Noah Misch wrote:
> I propose to have environment variable PG_TEST_TIMEOUT_DEFAULT control the
> timeout used in the places that currently hard-code 180s.

Meson's test runner has the concept of a "timeout multiplier" for ways of
running tests. Meson's stuff is about entire tests (i.e. one tap test), so
doesn't apply here, but I wonder if we shouldn't do something similar?  That
way we could adjust different timeouts with one setting, instead of many
different fobs to adjust?

Greetings,

Andres Freund




Timeout control within tests

2022-02-17 Thread Noah Misch
On Fri, Feb 05, 2021 at 03:55:20PM -0500, Tom Lane wrote:
> We have, almost invariably, regretted it when we tried to use short
> timeouts in test cases.

> More generally, sometimes people want to do things like run a test
> under valgrind.  So it's not just "underpowered machines" that may
> need a generous timeout.  Even if we did reduce the default, I'd
> want a way (probably via an environment variable, cf PGCTLTIMEOUT)
> to kick it back up.

I have a few use cases for that:

1. My buildfarm members have more and more competition for CPU and I/O.
   Examples where I suspect animal slowness caused a 180s timeout:
   
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow=2021-04-11%2003%3A11%3A39
   
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern=2021-07-26%2004%3A38%3A29

2. When I'm developing a change and I locally break a test in a way that leads
   to a timeout, I like to be able to lower that timeout.

3. I want more tests to use the right timeout from the start.  Low-timeout
   tests appear at least a few times per year:
   d03eeab Mon May 31 00:29:58 2021 -0700 Raise a timeout to 180s, in test 010_logical_decoding_timelines.pl.
   388b959 Sat Feb 27 07:02:56 2021 -0800 Raise a timeout to 180s, in contrib/test_decoding.
   08dde1b Tue Dec 22 11:10:12 2020 -0500 Increase timeout in 021_row_visibility.pl.
   8961355 Sat Apr 25 18:45:27 2020 -0700 Raise a timeout to 180s, in test 003_recovery_targets.pl.
   1db439a Mon Dec 10 20:15:42 2018 -0800 Raise some timeouts to 180s, in test code.

I propose to have environment variable PG_TEST_TIMEOUT_DEFAULT control the
timeout used in the places that currently hard-code 180s.  TAP tests should
retrieve the value via $PostgreSQL::Test::Utils::timeout_default.  pg_regress
tests should retrieve it via \getenv.  I would like to back-patch the TAP
part, for cause (1).  (The pg_regress part hasn't been a buildfarm problem,
and \getenv is new in v15.)  Patches attached.  I considered and excluded
other changes, for now:

a. I considered consolidating this with PGISOLATIONTIMEOUT (default 300).  One
   could remove the older variable entirely or make isolationtester use the
   first-available of [PGISOLATIONTIMEOUT, 2 * PG_TEST_TIMEOUT_DEFAULT, 360].
   Does anyone have an opinion on what, if anything, to do there?

b. I briefly made stats.sql accept PG_TEST_TIMEOUT_DEFAULT to override its
   hard-coded 30s timeout.  However, a higher timeout won't help when a UDP
   buffer fills.  If the test were structured to observe evidence of a vacant
   UDP buffer before proceeding with the test stat messages, a higher timeout
   could make more sense.  I added a comment.

c. One could remove timeout-duration function arguments (e.g. from
   pg_recvlogical_upto) and just have the function consult timeout_default.
   This felt like highly-optional refactoring.
Author: Noah Misch 
Commit: Noah Misch 

Introduce PG_TEST_TIMEOUT_DEFAULT for TAP suite can't-happen timeouts.

Slow hosts may avoid load-induced, spurious failures by setting
environment variable PG_TEST_TIMEOUT_DEFAULT to some number of seconds
greater than 180.  Developers may see faster failures by setting that
environment variable to some lesser number of seconds.  In tests, write
$PostgreSQL::Test::Utils::timeout_default wherever the convention has
been to write 180.  This change raises the default for some briefer
timeouts.  Back-patch to v10 (all supported versions).

Reviewed by FIXME.

Discussion: https://postgr.es/m/FIXME

diff --git a/contrib/amcheck/t/002_cic.pl b/contrib/amcheck/t/002_cic.pl
index d604def..b8e4ac7 100644
--- a/contrib/amcheck/t/002_cic.pl
+++ b/contrib/amcheck/t/002_cic.pl
@@ -18,7 +18,8 @@ my ($node, $result);
 #
 $node = PostgreSQL::Test::Cluster->new('CIC_test');
 $node->init;
-$node->append_conf('postgresql.conf', 'lock_timeout = 18');
+$node->append_conf('postgresql.conf',
+   'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
 $node->start;
 $node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
 $node->safe_psql('postgres', q(CREATE TABLE tbl(i int)));
diff --git a/contrib/amcheck/t/003_cic_2pc.pl b/contrib/amcheck/t/003_cic_2pc.pl
index f668ed3..e66ccd9 100644
--- a/contrib/amcheck/t/003_cic_2pc.pl
+++ b/contrib/amcheck/t/003_cic_2pc.pl
@@ -22,7 +22,8 @@ my ($node, $result);
 $node = PostgreSQL::Test::Cluster->new('CIC_2PC_test');
 $node->init;
 $node->append_conf('postgresql.conf', 'max_prepared_transactions = 10');
-$node->append_conf('postgresql.conf', 'lock_timeout = 18');
+$node->append_conf('postgresql.conf',
+   'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
 $node->start;
 $node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
 $node->safe_psql('postgres', q(CREATE TABLE tbl(i int)));
@@ -38,7 +39,7 @@ $node->safe_psql('postgres', q(CREATE TABLE tbl(i int)));
 
 my $main_in= '';
 my $main_out   = '';
-my 

Re: Fix a comment in worker.c

2022-02-17 Thread Masahiko Sawada
On Fri, Feb 18, 2022 at 1:48 PM Amit Kapila  wrote:
>
> On Wed, Feb 16, 2022 at 4:56 PM Amit Kapila  wrote:
> >
> > On Wed, Feb 16, 2022 at 5:57 AM Masahiko Sawada  
> > wrote:
> > >
> > > While reading the code, I realized that the second sentence of the
> > > following comment in worker.c is not correct:
> > >
> > >/*
> > > * Exit if the subscription was disabled. This normally should not 
> > > happen
> > > * as the worker gets killed during ALTER SUBSCRIPTION ... DISABLE.
> > > */
> > > if (!newsub->enabled)
> > > {
> > > ereport(LOG,
> > > (errmsg("logical replication apply worker for
> > > subscription \"%s\" will "
> > > "stop because the subscription was disabled",
> > > MySubscription->name)));
> > >
> > > proc_exit(0);
> > > }
> > >
> > > IIUC the apply worker normally exits here when the subscription is
> > > disabled since we don't stop the apply worker during ALTER
> > > SUBSCRIPTION DISABLE. I've attached a patch to remove it.
> > >
> >
> > Yes, I also have the same understanding. Your patch LGTM. I'll push
> > this unless someone thinks otherwise.
> >
>
> Pushed.

Thanks!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Fix a comment in worker.c

2022-02-17 Thread Amit Kapila
On Wed, Feb 16, 2022 at 4:56 PM Amit Kapila  wrote:
>
> On Wed, Feb 16, 2022 at 5:57 AM Masahiko Sawada  wrote:
> >
> > While reading the code, I realized that the second sentence of the
> > following comment in worker.c is not correct:
> >
> >/*
> > * Exit if the subscription was disabled. This normally should not happen
> > * as the worker gets killed during ALTER SUBSCRIPTION ... DISABLE.
> > */
> > if (!newsub->enabled)
> > {
> > ereport(LOG,
> > (errmsg("logical replication apply worker for
> > subscription \"%s\" will "
> > "stop because the subscription was disabled",
> > MySubscription->name)));
> >
> > proc_exit(0);
> > }
> >
> > IIUC the apply worker normally exits here when the subscription is
> > disabled since we don't stop the apply worker during ALTER
> > SUBSCRIPTION DISABLE. I've attached a patch to remove it.
> >
>
> Yes, I also have the same understanding. Your patch LGTM. I'll push
> this unless someone thinks otherwise.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-17 Thread Dilip Kumar
On Fri, Feb 18, 2022 at 4:30 AM Robert Haas  wrote:
>
>
> > This thread is long. Could you summarize what lead you to consider other
> > approaches (e.g. looking in the filesystem for relfilenodes) as not 
> > feasible /
> > too ugly / ...?
>
> I don't think it's infeasible to look at the filesystem for files and
> just copy whatever files we find. It's a plausible alternate design. I
> just don't like it as well. I think that relying on the filesystem
> contents to tell us what's going on is kind of hacky. The only
> technical issue I see there is that the WAL logging might require more
> kludgery, since that mechanism is kind of intertwined with
> shared_buffers. You'd have to get the right block references into the
> WAL record, and you have to make sure that checkpoints don't move the
> redo pointer at an inopportune moment.


Actually, based on the previous discussion, I also tried to write a
POC with the file system scanning approach to identify the relations to
be copied; see patch 0007 in this thread [1].  And later we identified
one issue [2]: while scanning the disk files directly we will only
know the relfilenode, but we cannot identify the relation OID, which
means we cannot lock the relation.  Now, I am not saying that there
is no way to work around that issue, but it was also one of the
reasons for not pursuing that approach.

[1] 
https://www.postgresql.org/message-id/CAFiTN-v1KYsVAhq_fOWFa27LZiw9uK4n4cz5XmQJxJpsVcfq1w%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAFiTN-v%3DU58by_BeiZruNhykxk1q9XUxF%2BqLzD2LZAsEn2EBkg%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Race conditions in 019_replslot_limit.pl

2022-02-17 Thread Tom Lane
Andres Freund  writes:
> There've not been any new reports in the last 18 hours, but that's probably
> just due to lower commit activity triggering fewer runs.

myna just showed the same symptom:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=myna=2022-02-18%2003%3A00%3A17

#   Failed test 'have walsender pid 13681
# 13670'
#   at t/019_replslot_limit.pl line 335.
#   '13681
# 13670'
# doesn't match '(?^:^[0-9]+$)'
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 18.
[03:13:09] t/019_replslot_limit.pl .. 

I think that's the first one on something outside your menagerie.

regards, tom lane




Re: Race conditions in 019_replslot_limit.pl

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-17 16:34:34 -0800, Andres Freund wrote:
> I've done this for a couple hundred iterations, under load, subsequently with
> additional assertions, without being able to reproduce.

Playing around with this further I did get into one interesting state:

I started psql with replication=1, created a slot
  SELECT pg_drop_replication_slot('t'); SELECT pg_create_physical_replication_slot('t', true);
which was at 0/22C8 at this point
  START_REPLICATION SLOT "t" 0/22C8 TIMELINE 1
then did
  SELECT pg_switch_wal();checkpoint;
in another session, which tried to invalidate that slot because of
max_slot_wal_keep_size=1MB

At this point the psql "walsender" is *not* terminated, because the fatal
message is waiting behind all the WAL data sent. Which also means that the
slot isn't yet released.

At this point checkpointer is stuck, because it is waiting for the connection
to end.

Quitting the "psql walsender" or terminating it again resolves the issue.

Not sure how this could be related, but it seems interesting.


"psql walsender":

(gdb) bt
#0  0x7faf4a570ec6 in epoll_wait (epfd=4, events=0x7faf4c201458, 
maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x7faf4b8ced5c in WaitEventSetWaitBlock (set=0x7faf4c2013e0, 
cur_timeout=-1, occurred_events=0x7ffe47df2320, nevents=1)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1465
#2  0x7faf4b8cebe3 in WaitEventSetWait (set=0x7faf4c2013e0, timeout=-1, 
occurred_events=0x7ffe47df2320, nevents=1, wait_event_info=100663297)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1411
#3  0x7faf4b70f48b in secure_write (port=0x7faf4c22da50, 
ptr=0x7faf4c2f1210, len=21470) at 
/home/andres/src/postgresql/src/backend/libpq/be-secure.c:298
#4  0x7faf4b71aecb in internal_flush () at 
/home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1380
#5  0x7faf4b71ada1 in internal_putbytes (s=0x7ffe47df23dc "E\177", len=1) 
at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1326
#6  0x7faf4b71b0cf in socket_putmessage (msgtype=69 'E', s=0x7faf4c201700 
"SFATAL", len=112)
at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1507
#7  0x7faf4b71c151 in pq_endmessage (buf=0x7ffe47df2460) at 
/home/andres/src/postgresql/src/backend/libpq/pqformat.c:301
#8  0x7faf4babbb5e in send_message_to_frontend (edata=0x7faf4be2f1c0 
) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:3253
#9  0x7faf4bab8aa0 in EmitErrorReport () at 
/home/andres/src/postgresql/src/backend/utils/error/elog.c:1541
#10 0x7faf4bab5ec0 in errfinish (filename=0x7faf4bc9770d "postgres.c", 
lineno=3192, funcname=0x7faf4bc99170 <__func__.8> "ProcessInterrupts")
at /home/andres/src/postgresql/src/backend/utils/error/elog.c:592
#11 0x7faf4b907e73 in ProcessInterrupts () at 
/home/andres/src/postgresql/src/backend/tcop/postgres.c:3192
#12 0x7faf4b8920af in WalSndLoop (send_data=0x7faf4b8927f2 
) at 
/home/andres/src/postgresql/src/backend/replication/walsender.c:2404
#13 0x7faf4b88f82e in StartReplication (cmd=0x7faf4c293fc0) at 
/home/andres/src/postgresql/src/backend/replication/walsender.c:834
#14 0x7faf4b89136f in exec_replication_command (cmd_string=0x7faf4c2073c0 
"START_REPLICATION SLOT \"t\" 0/1B00 TIMELINE 1\n;")
at /home/andres/src/postgresql/src/backend/replication/walsender.c:1771
#15 0x7faf4b909842 in PostgresMain (dbname=0x7faf4c22fce8 "", 
username=0x7faf4c22fcc8 "andres")
at /home/andres/src/postgresql/src/backend/tcop/postgres.c:4494
#16 0x7faf4b830e7e in BackendRun (port=0x7faf4c22da50) at 
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4593
#17 0x7faf4b83075f in BackendStartup (port=0x7faf4c22da50) at 
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4321
#18 0x7faf4b82c55b in ServerLoop () at 
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1801
#19 0x7faf4b82bd71 in PostmasterMain (argc=49, argv=0x7faf4c1ff1e0) at 
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1473


checkpointer:
(gdb) bt
#0  0x7f0d4ee1fec6 in epoll_wait (epfd=10, events=0x7f0d51489b78, 
maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x7f0d5017dd5c in WaitEventSetWaitBlock (set=0x7f0d51489b18, 
cur_timeout=-1, occurred_events=0x7fffe7c5f410, nevents=1)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1465
#2  0x7f0d5017dbe3 in WaitEventSetWait (set=0x7f0d51489b18, timeout=-1, 
occurred_events=0x7fffe7c5f410, nevents=1, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1411
#3  0x7f0d5017cec4 in WaitLatch (latch=0x7f0d4c27db04, wakeEvents=33, 
timeout=-1, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:472
#4  0x7f0d5018fe34 in ConditionVariableTimedSleep (cv=0x7f0d4c342268, 
timeout=-1, wait_event_info=134217772)
at 

Re: Fix CheckIndexCompatible comment

2022-02-17 Thread Fujii Masao




On 2022/02/07 19:14, Yugo NAGATA wrote:

Agreed. I updated the patch to add a comment about 'oldId'.


Thanks for updating the patch! I slightly modified the patch and pushed it.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Assert in pageinspect with NULL pages

2022-02-17 Thread Michael Paquier
On Thu, Feb 17, 2022 at 09:00:20PM -0600, Justin Pryzby wrote:
> BRIN can also crash if passed a non-brin index.
> 
> I've been sitting on this one for awhile.  Feel free to include it in your
> patchset.

Ugh.  Thanks!  I am keeping a note about this one.
--
Michael




Fix formatting of Interval output

2022-02-17 Thread Joseph Koshakow
When formatting the output of an Interval, we call abs() on the hours
field of the Interval. Calling abs(INT_MIN) returns INT_MIN,
causing the output to contain two '-' characters. The attached patch
fixes that issue by special-casing INT_MIN hours.

Here is an example of the issue:
postgres=# SELECT INTERVAL '-2147483648 hrs';
      interval      
--------------------
 --2147483648:00:00
(1 row)
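
A tiny standalone reproduction of the formatting logic, independent of the
server: abs(INT_MIN) is strictly undefined behavior, but on typical
two's-complement builds it comes back as INT_MIN, so the explicit '-' plus
the sign already embedded in the number yields the double dash.  Printing the
INT_MIN value directly, as the patch does, avoids that.

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	int		hour = INT_MIN;
	bool	minus = hour < 0;

	/* buggy: "-" plus abs(INT_MIN), which still prints as a negative number */
	printf("%s%02d:00:00\n", minus ? "-" : "", abs(hour));

	/* fixed: special-case INT_MIN and let %d supply the sign itself */
	bool	int_min_hours = (hour == INT_MIN);
	printf("%s%02d:00:00\n",
		   (minus && !int_min_hours) ? "-" : "",
		   int_min_hours ? hour : abs(hour));
	return 0;
}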


Cheers,
Joe Koshakow
From 720f0903c71a428695672aa866fc8fb0146a472e Mon Sep 17 00:00:00 2001
From: Joseph Koshakow 
Date: Thu, 17 Feb 2022 21:57:48 -0500
Subject: [PATCH] Fix Interval formatting with INT_MIN hours

When formatting the output of an Interval, we call abs() on the hours
field of the Interval. Calling abs(INT_MIN) returns back INT_MIN
causing the output to contain two '-' characters. This commit fixes
that issue by special casing INT_MIN hours.
---
 src/backend/utils/adt/datetime.c   | 7 +--
 src/test/regress/expected/interval.out | 8 
 src/test/regress/sql/interval.sql  | 4 
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index 7926258c06..368df715e6 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -4376,11 +4376,14 @@ EncodeInterval(struct pg_tm *tm, fsec_t fsec, int style, char *str)
 			if (is_zero || hour != 0 || min != 0 || sec != 0 || fsec != 0)
 			{
 bool		minus = (hour < 0 || min < 0 || sec < 0 || fsec < 0);
+// need to special case hours being INT_MIN because you can't call abs(INT_MIN)
+bool		int_min_hours = (hour == INT_MIN);
 
 sprintf(cp, "%s%s%02d:%02d:",
 		is_zero ? "" : " ",
-		(minus ? "-" : (is_before ? "+" : "")),
-		abs(hour), abs(min));
+		(minus ? (int_min_hours ? "" : "-") : (is_before ? "+" : "")),
+		(int_min_hours ? hour : abs(hour)),
+		abs(min));
 cp += strlen(cp);
 cp = AppendSeconds(cp, sec, fsec, MAX_INTERVAL_PRECISION, true);
 *cp = '\0';
diff --git a/src/test/regress/expected/interval.out b/src/test/regress/expected/interval.out
index accd4a7d90..bb0d59ccfa 100644
--- a/src/test/regress/expected/interval.out
+++ b/src/test/regress/expected/interval.out
@@ -1043,3 +1043,11 @@ SELECT extract(epoch from interval '10 days');
  864000.00
 (1 row)
 
+-- test that INT_MIN number of hours is formatted properly
+SET IntervalStyle to postgres;
+SELECT INTERVAL '-2147483648 hrs';
+ interval  
+---
+ -2147483648:00:00
+(1 row)
+
diff --git a/src/test/regress/sql/interval.sql b/src/test/regress/sql/interval.sql
index 6d532398bd..f1fe0a5ea0 100644
--- a/src/test/regress/sql/interval.sql
+++ b/src/test/regress/sql/interval.sql
@@ -355,3 +355,7 @@ SELECT f1,
 
 -- internal overflow test case
 SELECT extract(epoch from interval '10 days');
+
+-- test that INT_MIN number of hours is formatted properly
+SET IntervalStyle to postgres;
+SELECT INTERVAL '-2147483648 hrs';
-- 
2.25.1



Re: Assert in pageinspect with NULL pages

2022-02-17 Thread Michael Paquier
On Thu, Feb 17, 2022 at 05:40:41PM +0800, Julien Rouhaud wrote:
> About the patch, it's incorrectly using a hardcoded 8192 block-size rather 
> than
> using the computed :block_size (or computing one when it's not already the
> case).

Well, the pageinspect tests would already fail when using a
different page size than 8k, like the checksum ones :)

Anyway, I agree with your point that this is not a reason not to
make the tests more portable if we can do so easily.

> I'm also wondering if it wouldn't be better to return NULL rather than 
> throwing
> an error for uninitialized pages.

Agreed that this is a sensible choice.  NULL would be helpful for the
case where one computes all the checksums of a relation with a full
scan based on the relation size, for example.  All these behaviors
ought to be documented properly, as well.  For SRFs, this should just
return no rows rather than generating an error.
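
For the scalar functions that would look roughly like the following.  This is
a sketch only: the function name is a stand-in for the real pageinspect ones,
and it assumes the usual pageinspect/backend headers; only get_page_from_raw(),
PageIsNew() and pg_checksum_page() are the existing helpers.

#include "postgres.h"
#include "fmgr.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"
#include "pageinspect.h"		/* get_page_from_raw() */

PG_FUNCTION_INFO_V1(page_checksum_sketch);

Datum
page_checksum_sketch(PG_FUNCTION_ARGS)
{
	bytea	   *raw_page = PG_GETARG_BYTEA_P(0);
	uint32		blkno = PG_GETARG_UINT32(1);
	Page		page = get_page_from_raw(raw_page);

	/* An all-zero (uninitialized) page has no meaningful checksum: return NULL. */
	if (PageIsNew(page))
		PG_RETURN_NULL();

	PG_RETURN_INT16(pg_checksum_page((char *) page, blkno));
}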

> While at it, it could also be worthwhile to add tests for invalid blkno and
> block size in page_checksum().

If we are on that, we could also have tests for the various "input
page too small" errors.  Now, these are just improvements, so I'd
rather treat this part separately of the issue reported, and add the
extra tests only on HEAD.

I can't help but notice that we should have a similar protection on
PageIsNew() in heap_page_items()?  Now the code happens to work for an
empty page thanks to PageGetMaxOffsetNumber(), but this extra
protection would make the code more solid in the long-term IMO for the
full-zero case?  It is worth noting that, in pageinspect, two of the
heap functions are the only ones to not be strict..

Something similar could be said about page_header() in rawpage.c,
though it is not wrong to return only zeros for a page full of zeros.

Shouldn't we similarly care about bt_metap() and
bt_page_stats_internal(), both callers of ReadBuffer().  The root
point is that the RBM_NORMAL mode of ReadBufferExtended() considers a
page full of zeros as valid, as per its top comments.

For the same reason, get_raw_page_internal() would work just fine and
return a page full of zeros.

Also, instead of an error in get_page_from_raw(), we should make sure
that all its callers are able to cope with the case of a new page,
where we return 8kB worth of zeros from this inner routine.
--
Michael




Re: Assert in pageinspect with NULL pages

2022-02-17 Thread Justin Pryzby
BRIN can also crash if passed a non-brin index.

I've been sitting on this one for awhile.  Feel free to include it in your
patchset.

commit 08010a6037fc4e24a9ba05e5386e766f4310d35e
Author: Justin Pryzby 
Date:   Tue Jan 19 00:25:15 2021 -0600

pageinspect: brin_page_items(): check that given relation is not only an 
index, but a brin one

diff --git a/contrib/pageinspect/brinfuncs.c b/contrib/pageinspect/brinfuncs.c
index f1e64a39ef2..3de6dd943ef 100644
--- a/contrib/pageinspect/brinfuncs.c
+++ b/contrib/pageinspect/brinfuncs.c
@@ -16,6 +16,7 @@
 #include "access/brin_tuple.h"
 #include "access/htup_details.h"
 #include "catalog/index.h"
+#include "catalog/pg_am.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "lib/stringinfo.h"
@@ -59,7 +60,7 @@ brin_page_type(PG_FUNCTION_ARGS)
if (raw_page_size != BLCKSZ)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-errmsg("input page too small"),
+errmsg("input page wrong size"),
 errdetail("Expected size %d, got %d",
   BLCKSZ, raw_page_size)));
 
@@ -97,7 +98,7 @@ verify_brin_page(bytea *raw_page, uint16 type, const char 
*strtype)
if (raw_page_size != BLCKSZ)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-errmsg("input page too small"),
+errmsg("input page wrong size"),
 errdetail("Expected size %d, got %d",
   BLCKSZ, raw_page_size)));
 
@@ -169,7 +170,14 @@ brin_page_items(PG_FUNCTION_ARGS)
MemoryContextSwitchTo(oldcontext);
 
indexRel = index_open(indexRelid, AccessShareLock);
-   bdesc = brin_build_desc(indexRel);
+
+   /* Must be a BRIN index */
+   if (indexRel->rd_rel->relkind != RELKIND_INDEX ||
+   indexRel->rd_rel->relam != BRIN_AM_OID)
+   ereport(ERROR,
+   (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+errmsg("\"%s\" is not a BRIN index",
+RelationGetRelationName(indexRel))));
 
/* minimally verify the page we got */
page = verify_brin_page(raw_page, BRIN_PAGETYPE_REGULAR, "regular");
@@ -178,6 +186,7 @@ brin_page_items(PG_FUNCTION_ARGS)
 * Initialize output functions for all indexed datatypes; simplifies
 * calling them later.
 */
+   bdesc = brin_build_desc(indexRel);
columns = palloc(sizeof(brin_column_state *) * 
RelationGetDescr(indexRel)->natts);
for (attno = 1; attno <= bdesc->bd_tupdesc->natts; attno++)
{




Re: Logical replication timeout problem

2022-02-17 Thread Ajin Cherian
On Tue, Feb 8, 2022 at 1:59 PM wangw.f...@fujitsu.com
 wrote:
>
> On Wed, Jan 26, 2022 at 11:37 AM I wrote:
> > On Sat, Jan 22, 2022 at 7:12 PM Amit Kapila  wrote:
> > > Now, one idea to solve this problem could be that whenever we skip
> > > sending any change we do try to update the plugin progress via
> > > OutputPluginUpdateProgress(for walsender, it will invoke
> > > WalSndUpdateProgress), and there it tries to process replies and send
> > > keep_alive if necessary as we do when we send some data via
> > > OutputPluginWrite(for walsender, it will invoke WalSndWriteData). I
> > > don't know whether it is a good idea to invoke such a mechanism for
> > > every change we skip to send or we should do it after we skip sending
> > > some threshold of continuous changes. I think later would be
> > > preferred. Also, we might want to introduce a new parameter
> > > send_keep_alive to this API so that there is flexibility to invoke
> > > this mechanism as we don't need to invoke it while we are actually
> > > sending data and before that, we just update the progress via this
> > > API.
> > ..
> > Based on above, I think the second idea that sending some threshold of
> > continuous changes might be better, I will do some research about this
> > approach.
> Based on the second idea, I wrote a new patch(see attachment).

Hi Wang,

Some comments:
I see you only track skipped Inserts/Updates and Deletes. What about
DDL operations that are skipped, and what about truncate?
What about changes made to unpublished tables? I wonder: if you created
a test script that only did DDL operations and truncates, would this
timeout happen?
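
For reference, the threshold idea quoted above might look roughly like this
(the helper name and threshold value are placeholders, not the patch; only
OutputPluginUpdateProgress() is the existing callback):

#include "postgres.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

#define SKIPPED_CHANGES_THRESHOLD 100	/* placeholder value */

static int	skipped_changes = 0;

/*
 * Hypothetical helper: call this wherever a change is skipped rather than
 * sent, so that long runs of skipped changes still update progress (and, for
 * a walsender, give WalSndUpdateProgress a chance to process replies and
 * send a keepalive before the subscriber times out).
 */
static void
update_progress_on_skip(LogicalDecodingContext *ctx)
{
	if (++skipped_changes >= SKIPPED_CHANGES_THRESHOLD)
	{
		skipped_changes = 0;
		OutputPluginUpdateProgress(ctx);
	}
}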

regards,
Ajin Cherian
Fujitsu Australia




Re: Fix overflow in DecodeInterval

2022-02-17 Thread Joseph Koshakow
Ok, so I've attached a patch with my final unprompted changes. It
contains the following two changes:

1. I found some more overflows with the ISO8601 formats and have
included some fixes.
2. I reverted the overflow checks for the seconds field. It's actually a
bit more complicated than I thought. For example, consider the following
query:
postgres=# SELECT INTERVAL '0. min 2147483647 sec';
interval
--
-596523:13:09.01
(1 row)
This query will overflow the tm_sec field of the struct pg_tm; however,
it's not actually out of range for the Interval. I'm not sure of the
best way to handle this right now, but I think it would be best to leave
it for a future patch. Perhaps the time-related fields in struct pg_tm
need to be changed to 64-bit ints.
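
To make that distinction concrete, here is a tiny standalone C program
(illustrative only -- this is not PostgreSQL code, and I'm assuming
pg_add_s32_overflow is essentially a wrapper around the same compiler
builtin) showing how a value can overflow the 32-bit per-field
representation while still fitting comfortably into a 64-bit count of
microseconds:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	int32_t		tm_sec = INT32_MAX;	/* 2147483647, as in the query above */
	int32_t		sum;
	int64_t		usecs;

	/* folding in even one more second overflows the 32-bit field... */
	if (__builtin_add_overflow(tm_sec, 1, &sum))	/* GCC/Clang builtin */
		printf("tm_sec + 1 overflows int32\n");

	/* ...but the same quantity, as 64-bit microseconds, is nowhere near the limit */
	usecs = (int64_t) tm_sec * 1000000;
	printf("%lld usecs, limit %lld\n",
		   (long long) usecs, (long long) INT64_MAX);
	return 0;
}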

- Joe Koshakow
From 5750dcfbc00cb1259263e9986898f1960edf9c7f Mon Sep 17 00:00:00 2001
From: Joseph Koshakow 
Date: Fri, 11 Feb 2022 14:18:32 -0500
Subject: [PATCH] Check for overflow when decoding an interval

When decoding an interval there are various date units which are
aliases for some multiple of another unit. For example, a week is 7 days
and a decade is 10 years. When decoding these specific units, there is
no check for overflow, allowing the interval to overflow. This commit
adds an overflow check for all of these units.

Additionally, fractional date/time units are rolled down into the
next-smaller unit. For example, 0.1 months is 3 days. When adding these
fractional units, there is no check for overflow, allowing the interval
to overflow. This commit adds an overflow check for all of the
fractional units.

Additionally, adding the word "ago" to the interval negates every
field. However, the negative of INT_MIN is still INT_MIN. This
commit adds a check to make sure that we don't try to negate
a field if it's INT_MIN.

Signed-off-by: Joseph Koshakow 
---
 src/backend/utils/adt/datetime.c   | 115 +++--
 src/test/regress/expected/interval.out | 112 
 src/test/regress/sql/interval.sql  |  29 +++
 3 files changed, 232 insertions(+), 24 deletions(-)

diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index 7926258c06..1c2ab58f82 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -21,6 +21,7 @@
 #include "access/htup_details.h"
 #include "access/xact.h"
 #include "catalog/pg_type.h"
+#include "common/int.h"
 #include "common/string.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -46,8 +47,11 @@ static char *AppendSeconds(char *cp, int sec, fsec_t fsec,
 		   int precision, bool fillzeros);
 static void AdjustFractSeconds(double frac, struct pg_tm *tm, fsec_t *fsec,
 			   int scale);
-static void AdjustFractDays(double frac, struct pg_tm *tm, fsec_t *fsec,
+static bool AdjustFractDays(double frac, struct pg_tm *tm, fsec_t *fsec,
 			int scale);
+static bool AdjustFractMonths(double frac, struct pg_tm *tm, int scale);
+static bool AdjustDays(int val, struct pg_tm *tm, int multiplier);
+static bool AdjustYears(int val, struct pg_tm *tm, int multiplier);
 static int	DetermineTimeZoneOffsetInternal(struct pg_tm *tm, pg_tz *tzp,
 			pg_time_t *tp);
 static bool DetermineTimeZoneAbbrevOffsetInternal(pg_time_t t,
@@ -515,18 +519,56 @@ AdjustFractSeconds(double frac, struct pg_tm *tm, fsec_t *fsec, int scale)
 }
 
 /* As above, but initial scale produces days */
-static void
+static bool
 AdjustFractDays(double frac, struct pg_tm *tm, fsec_t *fsec, int scale)
 {
 	int			extra_days;
 
 	if (frac == 0)
-		return;
+		return true;
 	frac *= scale;
 	extra_days = (int) frac;
-	tm->tm_mday += extra_days;
+	if (pg_add_s32_overflow(tm->tm_mday, extra_days, &tm->tm_mday))
+		return false;
 	frac -= extra_days;
 	AdjustFractSeconds(frac, tm, fsec, SECS_PER_DAY);
+	return true;
+}
+
+/* As above, but initial scale produces months */
+static bool
+AdjustFractMonths(double frac, struct pg_tm *tm, int scale)
+{
+	int months = rint(frac * MONTHS_PER_YEAR * scale);
+	return !pg_add_s32_overflow(tm->tm_mon, months, &tm->tm_mon);
+}
+
+/*
+ * Multiply val by multiplier (to produce days) and add to *tm.
+ * Returns true if successful, false if tm overflows.
+ */
+static bool
+AdjustDays(int val, struct pg_tm *tm, int multiplier)
+{
+	int			extra_days;
+	return !pg_mul_s32_overflow(val, multiplier, &extra_days) &&
+		!pg_add_s32_overflow(tm->tm_mday, extra_days, &tm->tm_mday);
+}
+
+/* As above, but initial val produces months */
+static bool
+AdjustMonths(int val, struct pg_tm *tm)
+{
+	return !pg_add_s32_overflow(tm->tm_mon, val, &tm->tm_mon);
+}
+
+/* As above, but initial val produces years */
+static bool
+AdjustYears(int val, struct pg_tm *tm, int multiplier)
+{
+	int			years;
+	return !pg_mul_s32_overflow(val, multiplier, &years) &&
+		!pg_add_s32_overflow(tm->tm_year, years, &tm->tm_year);
 }
 
 /* Fetch a fractional-second value with suitable error checking */
@@ -3287,44 +3329,56 @@ DecodeInterval(char **field, int 

Re: Support escape sequence for cluster_name in postgres_fdw.application_name

2022-02-17 Thread Fujii Masao




On 2022/02/15 8:52, r.takahash...@fujitsu.com wrote:

Hi Fujii san,


Thank you for updating the patch.
I have no additional comments.


Thanks for the review! Pushed.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: adding 'zstd' as a compression algorithm

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-17 13:34:08 +0900, Michael Paquier wrote:
> %define needs to include HAVE_LIBZSTD, HAVE_ZSTD_H and USE_ZSTD, so
> this version fails the sanity check between pg_config.h.in and the
> MSVC scripts checking that all flags exist.

Do we really need all three defines? How about using AC_CHECK_HEADER() instead
of AC_CHECK_HEADERS()? That wouldn't define HAVE_ZSTD_H. Cases where we error
out if a header isn't found make it a bit pointless to then still define
HAVE_*_H.  Plenty of other cases in configure.ac just use AC_CHECK_HEADER.

Greetings,

Andres Freund




Re: Small and unaffected typo in pg_logical_slot_get_changes_guts()

2022-02-17 Thread Michael Paquier
On Thu, Feb 17, 2022 at 11:33:55AM +, Dagfinn Ilmari Mannsåker wrote:
> Would it be possible to make that macro only defined when building
> extensions, but not when building Postgres itself?  For example, Perl
> has a PERL_CORE macro that's only defined when building Perl itself, but
> not when building CPAN modules, and backwards-compatibility macros and
> functions are guarded by `#ifndef PERL_CORE`.

That could be interesting in the long term, as compatibility macros
have accumulated over the years, and from time to time patches get
merged that make use of things they should not.

With the build we have now, I am not completely sure how you would do
the split, though.  Extensions are one thing, and we could set a flag
as part of USE_PGXS.  There is also the popular case where extensions
are copied into the tree, where we would not have USE_PGXS for them;
that would cause compilation failures.  Another option would be to
limit the removal of the compatibility macros to src/backend/ and
src/bin/, leaving any in-core modules out of the picture.
--
Michael


signature.asc
Description: PGP signature


Re: adding 'zstd' as a compression algorithm

2022-02-17 Thread Michael Paquier
On Thu, Feb 17, 2022 at 09:40:13AM -0500, Robert Haas wrote:
> Ah, OK, cool. It seems like we have consensus on this being a good
> direction, so I plan to commit this later today unless objections or
> additional review comments turn up.

So, will there be a part of the system where we'll make use of that in
15?  pg_basebackup for server-side and client-side compression, at
least, as far as I can guess?
--
Michael


signature.asc
Description: PGP signature


Re: Silencing upcoming warning about stack_base_ptr

2022-02-17 Thread Tom Lane
Andres Freund  writes:
> On 2022-02-17 18:44:27 -0500, Tom Lane wrote:
>> (that's visible now on buildfarm member caiman).  We probably
>> should take some thought for silencing this before it starts
>> to be in people's faces during routine development.

> We could try using __builtin_frame_address(0) when available, presumably gcc
> won't warn about that...

Oh, I didn't know about that.  Seems like a better option since
then we don't need any API changes at all.  Maybe at some point
some non-gcc-alike will start delivering a comparable warning,
but then we could fall back to what I did here.

regards, tom lane




Re: Race conditions in 019_replslot_limit.pl

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-16 20:03:00 -0800, Andres Freund wrote:
> I'll run the test in a loop, perhaps I can reproduce...

I've done this for a couple hundred iterations, under load, and subsequently
with additional assertions, without being able to reproduce it.

There've not been any new reports in the last 18 hours, but that's probably
just due to lower commit activity triggering fewer runs.

Greetings,

Andres Freund




Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM

2022-02-17 Thread Peter Geoghegan
On Thu, Feb 17, 2022 at 4:18 PM Andres Freund  wrote:
> > What substantial changes are you referring to? The one thing that did
> > change was the commit message, which framed everything in terms of the
> > later work. It really is true that the patch that I committed was
> > essentially the same patch as the one posted on November 22, in both
> > style and substance. Before I really even began to think about the
> > freezing stuff. This is a matter of record.
>
> Here I was referencing your description of how the patch started ("purely as
> refactoring work"), and then evolved into something not just a refactoring.

Of course.

> If helpful I can give a go at showing how I think it could be split up. Or
> perhaps more productively, do that on a not-yet-committed larger patch.

Any help is appreciated.

-- 
Peter Geoghegan




Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-17 14:23:51 -0800, Peter Geoghegan wrote:
> On Thu, Feb 17, 2022 at 10:33 AM Andres Freund  wrote:
> > On 2022-02-16 19:43:09 -0800, Peter Geoghegan wrote:
> > > It makes the code in vacuumlazy.c much cleaner. In fact, that's how commit
> > > 44fa8488 started off -- purely as refactoring work.
> >
> > The problem is that it didn't end up as that. You combined refactoring with
> > substantial changes. And described it large and generic terms.
> 
> What substantial changes are you referring to? The one thing that did
> change was the commit message, which framed everything in terms of the
> later work. It really is true that the patch that I committed was
> essentially the same patch as the one posted on November 22, in both
> style and substance. Before I really even began to think about the
> freezing stuff. This is a matter of record.

Here I was referencing your description of how the patch started ("purely as
refactoring work"), and then evolved into something not just a refactoring.


> > It's a contentious thread. I asked for things to be split. In that context,
> > it's imo common courtesy for non-trivial patches to write something like
> 
> I didn't see a way to usefully split up 0001 any further (having
> already split it up into 0001 and 0002). That's not particularly
> obvious, but it's true.

If helpful I can give a go at showing how I think it could be split up. Or
perhaps more productively, do that on a not-yet-committed larger patch.

Greetings,

Andres Freund




Re: Silencing upcoming warning about stack_base_ptr

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-17 18:44:27 -0500, Tom Lane wrote:
> (that's visible now on buildfarm member caiman).  We probably
> should take some thought for silencing this before it starts
> to be in people's faces during routine development.

Agreed.

One annoying thing I recently encountered, related to this, is that our stack
check fails to work with some -fsanitize* options because they end up using
multiple stacks (e.g. -fsanitize-address-use-after-scope). Not sure we can
really do anything about that...


> Fixing this is a bit harder than one could wish because we export
> set_stack_base() for use by PL/Java, so it would be better to not
> change that API.  I ended up with the attached patch, which works
> to silence the warning so long as the new subroutine
> set_stack_base_from() is marked pg_noinline.  I'm a little worried
> that in a year or two GCC will be smart enough to complain anyway.
> If that happens, we could probably silence the warning again by
> moving set_stack_base() to a different source file --- but at some
> point we might have to give up and change its API, I suppose.

We could try using __builtin_frame_address(0) when available, presumably gcc
won't warn about that...

Greetings,

Andres Freund




Silencing upcoming warning about stack_base_ptr

2022-02-17 Thread Tom Lane
GCC 12, coming soon to a distro near you, complains like this:

postgres.c: In function 'set_stack_base':
postgres.c:3430:24: warning: storing the address of local variable 'stack_base' 
in 'stack_base_ptr' [-Wdangling-pointer=]
 3430 | stack_base_ptr = &stack_base;
  | ~~~^
postgres.c:3419:25: note: 'stack_base' declared here
 3419 | char        stack_base;
  | ^~
postgres.c:136:13: note: 'stack_base_ptr' declared here
  136 | char   *stack_base_ptr = NULL;
  | ^~

(that's visible now on buildfarm member caiman).  We probably
should take some thought for silencing this before it starts
to be in people's faces during routine development.
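
For context, the reason we store the address of a local at all is that
the stack depth check estimates depth as the distance between that
reference point and the address of a local in the current frame.  A
standalone sketch of the idea (illustrative only, not the backend code,
which lives around check_stack_depth()):

#include <stdio.h>
#include <stddef.h>

static char *stack_base_ptr = NULL;

static void
set_base(char *reference)
{
	/* remember where (roughly) the bottom of the stack is */
	stack_base_ptr = reference;
}

static size_t
approx_stack_depth(void)
{
	char		here;
	ptrdiff_t	depth = stack_base_ptr - &here;

	/* the stack may grow in either direction, so take the absolute value */
	return (size_t) (depth < 0 ? -depth : depth);
}

int
main(void)
{
	char		stack_reference = '\0';

	set_base(&stack_reference);
	printf("approximate stack depth: %zu bytes\n", approx_stack_depth());
	return 0;
}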

Fixing this is a bit harder than one could wish because we export
set_stack_base() for use by PL/Java, so it would be better to not
change that API.  I ended up with the attached patch, which works
to silence the warning so long as the new subroutine
set_stack_base_from() is marked pg_noinline.  I'm a little worried
that in a year or two GCC will be smart enough to complain anyway.
If that happens, we could probably silence the warning again by
moving set_stack_base() to a different source file --- but at some
point we might have to give up and change its API, I suppose.

I also took this opportunity to re-static-ify the stack_base_ptr
variable itself, as PL/Java seems to have adopted set_stack_base
since 1.5.0.  That part perhaps should not be back-patched.

regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 735fed490b..a05d84c0f4 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -582,6 +582,7 @@ HANDLE		PostmasterHandle;
 void
 PostmasterMain(int argc, char *argv[])
 {
+	char		stack_reference = '\0';
 	int			opt;
 	int			status;
 	char	   *userDoption = NULL;
@@ -1083,7 +1084,7 @@ PostmasterMain(int argc, char *argv[])
 	/*
 	 * Set reference point for stack-depth checking.
 	 */
-	set_stack_base();
+	set_stack_base_from(&stack_reference);
 
 	/*
 	 * Initialize pipe (or process handle on Windows) that allows children to
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 38d8b97894..b228948045 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -129,17 +129,15 @@ static long max_stack_depth_bytes = 100 * 1024L;
 
 /*
  * Stack base pointer -- initialized by PostmasterMain and inherited by
- * subprocesses. This is not static because old versions of PL/Java modify
- * it directly. Newer versions use set_stack_base(), but we want to stay
- * binary-compatible for the time being.
+ * subprocesses (but see also InitPostmasterChild).
  */
-char	   *stack_base_ptr = NULL;
+static char *stack_base_ptr = NULL;
 
 /*
  * On IA64 we also have to remember the register stack base.
  */
 #if defined(__ia64__) || defined(__ia64)
-char	   *register_stack_base_ptr = NULL;
+static char *register_stack_base_ptr = NULL;
 #endif
 
 /*
@@ -3408,15 +3406,36 @@ ia64_get_bsp(void)
 #endif			/* IA64 */
 
 
+/*
+ * set_stack_base_from: set up reference point for stack depth checking
+ *
+ * The passed-in value must be the address of a local variable in the caller.
+ * Ideally the caller is the process's outermost function, but it's also
+ * acceptable for it to be not too many stack levels down from there.
+ */
+pg_noinline void
+set_stack_base_from(char *stack_reference)
+{
+	/* Set up reference point for stack depth checking */
+	stack_base_ptr = stack_reference;
+#if defined(__ia64__) || defined(__ia64)
+	register_stack_base_ptr = ia64_get_bsp();
+#endif
+}
+
 /*
  * set_stack_base: set up reference point for stack depth checking
  *
+ * This is exported for use by PL/Java.  At some point we might have to
+ * require it to take a stack_reference pointer to silence compiler warnings.
+ * For now, we don't, so don't break the API unnecessarily.
+ *
  * Returns the old reference point, if any.
  */
 pg_stack_base_t
 set_stack_base(void)
 {
-	char		stack_base;
+	char		stack_reference = '\0';
 	pg_stack_base_t old;
 
 #if defined(__ia64__) || defined(__ia64)
@@ -3426,11 +3445,7 @@ set_stack_base(void)
 	old = stack_base_ptr;
 #endif
 
-	/* Set up reference point for stack depth checking */
-	stack_base_ptr = &stack_base;
-#if defined(__ia64__) || defined(__ia64)
-	register_stack_base_ptr = ia64_get_bsp();
-#endif
+	set_stack_base_from(&stack_reference);
 
 	return old;
 }
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 0868e5a24f..1cb15d54d4 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -94,6 +94,8 @@ bool		IgnoreSystemIndexes = false;
 void
 InitPostmasterChild(void)
 {
+	char		stack_reference = '\0';
+
 	IsUnderPostmaster = true;	/* we are a postmaster subprocess now */
 
 	/*
@@ -106,13 +108,14 @@ InitPostmasterChild(void)
 #endif

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-17 18:00:19 -0500, Robert Haas wrote:
> Now you pointed out earlier one problem that it doesn't fix: unlike
> the current method, this method involves reading buffers from the
> template database into shared_buffers, and those buffers, once read,
> stick around even after the operation finishes.

Yea, I don't see a problem with that. A concurrent DROP DATABASE or such would
be problematic, but the locking prevents that.


> The only other kind of hazard I can think of is: could it be unsafe to
> try to interpret the contents of a database to which no one else is
> connected at present due to any of the issues you mention? But what's
> the hazard exactly?

I don't quite know. But I don't think it's impossible to run into trouble in
this area. E.g. xid horizons are computed in a database-specific way. If the
routine reading pg_class did hot pruning, you could end up removing data
that's actually needed for a logical slot in the other database because the
backend-local horizon state was computed for the "local" database?

Could there be problems because other backends wouldn't see the backend
accessing the other database as being connected to that database
(PGPROC->databaseId)?

Or what if somebody optimized snapshots to disregard readonly transactions in
other databases?


> It can't be a problem if we've failed to process sinval messages for the
> target database, because we're not using any caches anyway.

Could you end up with an outdated relmap entry? Probably not.


> We can't. It can't be unsafe to test visibility of XIDs for that database,
> because in an alternate universe some backend could have connected to that
> database and seen the same XIDs.

That's a weak argument, because in that alternative universe a PGPROC entry
with the PGPROC->databaseId = template_databases_oid would exist.

Greetings,

Andres Freund




Re: O(n) tasks cause lengthy startups and checkpoints

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-17 14:58:38 -0800, Nathan Bossart wrote:
> On Thu, Feb 17, 2022 at 02:28:29PM -0800, Andres Freund wrote:
> > As far as I understand, the primary concern are logical decoding serialized
> > snapshots, because a lot of them can accumulate if there e.g. is an old 
> > unused
> > / far behind slot. It should be easy to reduce the number of those snapshots
> > by e.g. eliding some redundant ones. Perhaps we could also make backends in
> > logical decoding occasionally do a bit of cleanup themselves.
> > 
> > I've not seen reports of the number of mapping files to be an real issue?
> 
> I routinely see all four of these tasks impacting customers, but I'd say
> the most common one is the temporary file cleanup.

I took temp file cleanup and StartupReorderBuffer() "out of consideration" for
custodian, because they're not needed during normal running.


> Besides eliminating some redundant files and having backends perform some
> cleanup, what do you think about skipping the logical decoding cleanup
> during end-of-recovery/shutdown checkpoints?

I strongly disagree with it. Then you might never get the cleanup done, but
keep on operating until you hit corruption issues.


> > The improvements around deleting temporary files and serialized snapshots
> > afaict don't require a dedicated process - they're only relevant during
> > startup. We could use the approach of renaming the directory out of the way 
> > as
> > done in this patchset but perform the cleanup in the startup process after
> > we're up.
> 
> Perhaps this is a good place to start.  As I mentioned above, IME the
> temporary file cleanup is the most common problem, so I think even getting
> that one fixed would be a huge improvement.

Cool.

Greetings,

Andres Freund




Re: Use generation context to speed up tuplesorts

2022-02-17 Thread David Rowley
On Sun, 13 Feb 2022 at 09:56, Tomas Vondra
 wrote:
> I'm not against pushing the generation context improvements, e.g.
> something like the patches posted in [1], because those seem reasonable
> in general. But I'm somewhat confused about the last patch (adjusting
> allocChunkLimit) causing fairly significant regressions.

My thoughts here are that we should pursue the patch that allows
growing of the block size in the generation context.  I do think the
investigation of the malloc() behaviour and performance is worth some
pursuit, I just don't think it should be done here or as part of this
patch.  I think it's a fairly large undertaking to ensure any changes
to the malloc options are an overall win and not just a win for
whatever benchmark we test them on. If they're really an overall win,
then shouldn't glibc know about them and maybe adopt them as the
standard options?

To get the ball rolling again on the changes to the generation
context, I took your patches, Tomas, and fixed a few things around the
keeper blocks not working correctly.  I've now made the keeper block
behaviour match how aset.c works.  There the keeper block is allocated
in the same allocation as the context itself. That reduces the number
of malloc() calls when setting up a new memory context.

On testing this, I discovered a problem regarding the use of
generation contexts for storing tuples in tuplesort.c.  The problem is
when the sort is bounded (e.g. SELECT * FROM ... ORDER BY .. LIMIT N),
we'll store and directly throw away any tuples that sort greater than
the existing set of N tuples. This causes a problem because once we
move away from using the keeper block, initially, we'll at some point
end up storing a tuple in a newly allocated generation block then
proceed to pfree() that tuple due to it sorting greater than the
existing N tuples. Because we currently always free() newly empty
blocks, we end up thrashing free()/malloc().  This behaviour results
in one of the regression test queries in tuplesort.sql going from
taking 10 ms to 1 minute, on my machine.

I had a few thoughts about how best to work around this problem.
Firstly, I thought that we should just *not* use the Generation
context when the sort is bounded.  That turns out to be a bit icky as
we only make the sort bounding when we call tuplesort_set_bound(),
which is after we've allocated the memory context for storing tuples.
I thought about maybe just adding another bool flag named
"allowBounding" and use a Generation context if that flag is not set.
I just don't really like that as a solution.  We also cannot make the
set bound part of the tuplesort_begin_heap() call as
nodeIncrementalSort.c does reset the bound. Possibly we could ditch
tuplesort_set_bound() and invent tuplesort_reset_bound() and change
the API so the bound is set when making the Tuplesortstate.

The other way I thought of to fix it was by changing the logic for when
generation blocks are freed.  In the problem case mentioned above, the
block being freed is the current block (which was just allocated).  I
made some changes to adjust this behaviour so that we no longer free
the block when the final chunk is pfree()'d. Instead, the block now
lingers and can be reused by future allocations, provided they fit
inside it. If they don't fit then we must free the block, otherwise it
would just remain empty forever.  The problem I see with this method is that
there still could be some pathological case that causes us to end up
storing just a single tuple per generation block.
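
To make that a bit more concrete, here is a deliberately simplified
sketch of the idea, with made-up struct and function names rather than
the actual generation.c code:

#include <stdlib.h>
#include <stddef.h>

/* stand-ins for the real generation-context structures */
typedef struct DemoBlock
{
	struct DemoBlock *next;
	size_t		nchunks;	/* live allocations in this block */
	char	   *freeptr;	/* next free byte */
	char	   *startptr;	/* first usable byte */
} DemoBlock;

typedef struct DemoContext
{
	DemoBlock  *current;	/* block new allocations are served from */
	DemoBlock  *blocks;		/* all blocks, newest first */
} DemoContext;

/*
 * Called when a chunk is pfree()'d.  A block that becomes empty is only
 * released if it is *not* the block we're currently allocating from; the
 * current block is simply reset so its space can be reused, which avoids
 * the free()/malloc() thrashing described above.
 */
static void
demo_chunk_freed(DemoContext *cxt, DemoBlock *block)
{
	if (--block->nchunks > 0)
		return;				/* block still has live chunks */

	if (block == cxt->current)
	{
		block->freeptr = block->startptr;	/* keep it; reuse the space */
		return;
	}

	/* unlink and release any other newly-empty block */
	if (cxt->blocks == block)
		cxt->blocks = block->next;
	else
	{
		DemoBlock  *prev = cxt->blocks;

		while (prev->next != block)
			prev = prev->next;
		prev->next = block->next;
	}
	free(block);
}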

Hitting the worst case here does seem pretty unlikely, so I'm a bit
torn on whether we should just play it safe and stick to the standard
memory allocator for bounded sorts.

I've attached 2 patches. The 0001 patch adds the new logic to
generation.c to allow the block to grow and also adds the logic to
skip freeing the current block when it becomes empty.  The 0002 patch
simply makes tuplesort.c use the generation context for tuples
unconditionally.

I've also attached the benchmark results I got from this patch, running
the attached sortbenchall script. It's fairly obvious that there are
performance improvements here even when the malloc() usage pattern is
similar to aset.c's.

David
#!/bin/bash

dbname=postgres
mod=100
sec=5

psql -c "alter system set max_parallel_workers_per_gather TO 0;" $dbname
psql -c "select pg_reload_conf();" $dbname

for mem in "4MB" "4GB"
do

psql -c "alter system set work_mem TO '$mem';" $dbname
psql -c "select pg_reload_conf();" $dbname
psql -c "show work_mem" $dbname

sec=10

for sql in "select * from tenk1 order by two offset 100" "select * 
from tenk1 order by tenthous offset 100" "select * from tenk1 order by 
stringu1 offset 100" "select * from tenk1 order by stringu2 offset 100" 
"select * from tenk1 order by twenty offset 100" "select twenty,sum(unique1 
order by unique1) from tenk1 group by twenty" "select unique1,sum(twenty order 
by twenty) 

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-17 Thread Robert Haas
On Thu, Feb 17, 2022 at 4:13 PM Andres Freund  wrote:
> Could you or Dilip outline how it now works, and what exactly makes it safe
> etc (e.g. around locking, invalidation processing, snapshots, xid horizons)?
>
> I just scrolled through the patchset without finding such an explanation, so
> it's a bit hard to judge.

That's a good question and it's making me think about a few things I
hadn't considered before.

Dilip can add more here, but my impression is that most problems are
prevented by the fact that CREATE DATABASE, with or without this patch, starts by
acquiring a ShareLock on the database, preventing new connections, and
then making sure there are no existing connections. That means nothing
in the target database can be changing, which I think makes a lot of
the stuff on your list a non-issue. Any problems that remain have to
be the result of something that CREATE DATABASE does having a bad
interaction either with something that is completed beforehand or
something that begins afterward. There just can't be overlap, and I
think that rules out most problems.

Now you pointed out earlier one problem that it doesn't fix: unlike
the current method, this method involves reading buffers from the
template database into shared_buffers, and those buffers, once read,
stick around even after the operation finishes. That's not an
intrinsic problem, though, because a connection to the database could
do the same thing. However, again as you pointed out, it is a problem
if we do it with less locking than a real database connection
would have done. It seems to me that if there are other problems here,
they have to be basically of the same sort: they have to leave the
system in some state which is otherwise impossible. Do you see some
other kind of hazard, or more examples of how that could happen? I
think the leftover buffers in shared_buffers have to be basically the
only thing, because apart from that, how is this any different than a
file copy?

The only other kind of hazard I can think of is: could it be unsafe to
try to interpret the contents of a database to which no one else is
connected at present due to any of the issues you mention? But what's
the hazard exactly? It can't be a problem if we've failed to process
sinval messages for the target database, because we're not using any
caches anyway. We can't. It can't be unsafe to test visibility of XIDs
for that database, because in an alternate universe some backend could
have connected to that database and seen the same XIDs. One thing we
COULD be doing wrong is using the wrong snapshot to test the
visibility of XIDs. The patch uses GetActiveSnapshot(), and I'm
thinking that is probably wrong. Shouldn't it be GetLatestSnapshot()?
And do we need to worry about snapshots being database-specific? Maybe
so.

> > But if it is the latter then there's really no point to that kind of cleanup
> > work and we should probably just give up now.
>
> This thread is long. Could you summarize what lead you to consider other
> approaches (e.g. looking in the filesystem for relfilenodes) as not feasible /
> too ugly / ...?

I don't think it's infeasible to look at the filesystem for files and
just copy whatever files we find. It's a plausible alternate design. I
just don't like it as well. I think that relying on the filesystem
contents to tell us what's going on is kind of hacky. The only
technical issue I see there is that the WAL logging might require more
kludgery, since that mechanism is kind of intertwined with
shared_buffers. You'd have to get the right block references into the
WAL record, and you have to make sure that checkpoints don't move the
redo pointer at an inopportune moment.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: O(n) tasks cause lengthy startups and checkpoints

2022-02-17 Thread Nathan Bossart
On Thu, Feb 17, 2022 at 02:28:29PM -0800, Andres Freund wrote:
> As far as I understand, the primary concern are logical decoding serialized
> snapshots, because a lot of them can accumulate if there e.g. is an old unused
> / far behind slot. It should be easy to reduce the number of those snapshots
> by e.g. eliding some redundant ones. Perhaps we could also make backends in
> logical decoding occasionally do a bit of cleanup themselves.
> 
> I've not seen reports of the number of mapping files to be an real issue?

I routinely see all four of these tasks impacting customers, but I'd say
the most common one is the temporary file cleanup.  Besides eliminating
some redundant files and having backends perform some cleanup, what do you
think about skipping the logical decoding cleanup during
end-of-recovery/shutdown checkpoints?  This was something that Bharath
brought up a while back [0].  As I noted in that thread, startup and
shutdown could still take a while if checkpoints are regularly delayed due
to logical decoding cleanup, but that might still help avoid a bit of
downtime.

> The improvements around deleting temporary files and serialized snapshots
> afaict don't require a dedicated process - they're only relevant during
> startup. We could use the approach of renaming the directory out of the way as
> done in this patchset but perform the cleanup in the startup process after
> we're up.

Perhaps this is a good place to start.  As I mentioned above, IME the
temporary file cleanup is the most common problem, so I think even getting
that one fixed would be a huge improvement.

[0] 
https://postgr.es/m/CALj2ACXkkSL8EBpR7m%3DMt%3DyRGBhevcCs3x4fsp3Bc-D13yyHOg%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: O(n) tasks cause lengthy startups and checkpoints

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-17 13:00:22 -0800, Nathan Bossart wrote:
> Okay.  So IIUC the problem might already exist today, but offloading these
> tasks to a separate process could make it more likely.

Vastly more, yes. Before, checkpoints not happening would be a (not
great) form of backpressure. You can't cancel them without triggering a
crash-restart, whereas the custodian can be cancelled, etc.


As I said before, I think this is tackling things from the wrong end. Instead
of moving the sometimes-expensive task out of the way while leaving it just as
expensive, the focus should be on making the expensive task cheaper.

As far as I understand, the primary concern are logical decoding serialized
snapshots, because a lot of them can accumulate if there e.g. is an old unused
/ far behind slot. It should be easy to reduce the number of those snapshots
by e.g. eliding some redundant ones. Perhaps we could also make backends in
logical decoding occasionally do a bit of cleanup themselves.

I've not seen reports of the number of mapping files to be an real issue?


The improvements around deleting temporary files and serialized snapshots
afaict don't require a dedicated process - they're only relevant during
startup. We could use the approach of renaming the directory out of the way as
done in this patchset but perform the cleanup in the startup process after
we're up.
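
For illustration, the "rename out of the way, clean up later" idea in
plain POSIX terms -- just a sketch, with made-up directory names, not
what the patchset literally does:

#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <sys/stat.h>
#include <ftw.h>

/* Early in startup: make the stale temp directory invisible right away. */
static int
stash_old_tempdir(void)
{
	return rename("pgsql_tmp", "pgsql_tmp.stale");
}

static int
unlink_cb(const char *path, const struct stat *sb, int typeflag,
		  struct FTW *ftwbuf)
{
	(void) sb;
	(void) typeflag;
	(void) ftwbuf;
	return remove(path);
}

/* Later, once we're up: remove the stashed directory at leisure. */
static int
cleanup_stashed_tempdir(void)
{
	/* FTW_DEPTH visits a directory's contents before the directory itself */
	return nftw("pgsql_tmp.stale", unlink_cb, 16, FTW_DEPTH | FTW_PHYS);
}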

Greetings,

Andres Freund




Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM

2022-02-17 Thread Peter Geoghegan
On Thu, Feb 17, 2022 at 10:33 AM Andres Freund  wrote:
> On 2022-02-16 19:43:09 -0800, Peter Geoghegan wrote:
> > It makes the code in vacuumlazy.c much cleaner. In fact, that's how commit
> > 44fa8488 started off -- purely as refactoring work.
>
> The problem is that it didn't end up as that. You combined refactoring with
> substantial changes. And described it large and generic terms.

What substantial changes are you referring to? The one thing that did
change was the commit message, which framed everything in terms of the
later work. It really is true that the patch that I committed was
essentially the same patch as the one posted on November 22, in both
style and substance. Before I really even began to think about the
freezing stuff. This is a matter of record.

> It's a contentious thread. I asked for things to be split. In that context,
> it's imo common courtesy for non-trivial patches to write something like

I didn't see a way to usefully split up 0001 any further (having
already split it up into 0001 and 0002). That's not particularly
obvious, but it's true.

>   While the rest of the patchset is contentious, I think 0001 is ready to
>   go. I'd like to commit it tomorrow, unless somebody protests.

Okay. I'll be more explicit about it in this way in the future.

> On 2021-11-22 11:29:56 -0800, Andres Freund wrote:
> > I think this is a change mostly in the right direction. But as formulated 
> > this
> > commit does *WAY* too much at once.
>
> I do not know how to state more clearly that I think this is not a patch in a
> committable shape.

As I said, I dispute the idea that it could have been split up even
further. My reasons are complicated, and I can get into it if you'd
like.

> but also:
>
> On 2021-11-22 11:29:56 -0800, Andres Freund wrote:
> > I think it should be doable to add an isolation test for this path. There 
> > have
> > been quite a few bugs around the wider topic...

Yeah, I really should have included that in 0001 -- I accept that.
That didn't happen in the initial commit, but was high on my todo
list. I can take care of it in the next few days.

--
Peter Geoghegan




Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM

2022-02-17 Thread Peter Geoghegan
On Thu, Feb 17, 2022 at 1:36 PM Robert Haas  wrote:
> > It's not that simple. As I said in the fix-up commit message, and in
> > the opening email to this thread, it basically isn't a new behavior at
> > all. It would be much more accurate to describe it as a behavior that
> > originated with commit e8429082, from late 2015. There was a subtle
> > interaction with that existing behavior, and the refactoring work,
> > that caused the reltuples problem.
>
> Honestly, I really think it's that simple. It really is possible to
> describe what changes a commit is making in the commit message -- and,
> in my opinion, you're not doing it.

The text of the commit message from commit e8429082 (Tom's 2015
commit) talks about "force examination of the table's last page...".
And on that basis I'm saying that I didn't invent the behavior
involving scanning the last page specifically -- which is clearly
true. I *did* overlook an issue that led to this nonrandom
scanned_pages bug, of course, but what should the take away be now?
What does that have to do with the commit message? The significance of
this oversight only became apparent after I committed the patch.

> > > I think you *really* need to put more effort into
> > > making your patches, and the emails about your patches, and the commit
> > > messages for your patches understandable to other people. Otherwise,
> > > waiting 3 months between when you post the patch and when you commit
> > > it means nothing. You can wait 10 years to commit and still get
> > > objections, if other people don't understand what you're doing.
> >
> > I don't think it's fair to just suppose that much of what I write is
> > incomprehensible, just like that.
>
> I'm not supposing anything. I'm telling you what I can understand, and
> what I can't. Unless you intend to accuse me of pretending not to
> understand things that I actually do understand, I feel like my word
> on that topic should be treated as pretty much conclusive.

I'm not disputing that (how could I?). What I'm saying is that from my
perspective, the commit message of 44fa8488 was quite descriptive. I'm
certainly not asking you to agree with that, and I even think that the
fact that you disagree with it should be something that specifically
concerns me. But, there might be some specific instance in the future,
where you find some merit in the way I frame things, having first been
skeptical.

Doesn't that at least seem possible? And aren't you more or less
asking me to give you the same consideration? It's something to
consider -- that's all.

> I do think that you are doing some things right, for sure. But I don't
> think that you are following the community process particularly well.
> It doesn't really feel like you feel the need to convince other people
> that your changes are in good shape before committing them; and it is
> really hard for me to understand in detail what is being changed.

There is almost always ambiguity about these things, which I navigate
to the best of my ability, knowing that I will be judged in the future
based on my track record. It feels as if this discussion has been
muddied by the later work in the patch series, which really does make
pretty big, relatively risky changes to how we do certain things in
vacuumlazy.c. Maybe I shouldn't have referenced that work in the
commit message.

--
Peter Geoghegan




Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?

2022-02-17 Thread Robert Haas
On Thu, Feb 17, 2022 at 4:17 PM Tomas Vondra
 wrote:
> IMHO the whole problem is we're unable to estimate the join clause as a
> conditional probability, i.e.
>
>P(A.x = B.x | (A.x < 42) & (B.x < 42))
>
> so maybe instead of trying to generate additional RelOptInfo items we
> should think about improving that. The extra RelOptInfos don't really
> solve this, because even if you decide to join A|x<42 to B|x<42 it does
> nothing to improve the join clause estimate.

I guess I hadn't considered that angle. I think the extra RelOptInfos
(or whatever) actually do solve a problem, because enforcing a
high-selectivity join qual against both sides is potentially quite
wasteful, and you need some way to decide whether to do it on one
side, the other, or both. But it's also true that I was wrong to
assume independence ... and if we could avoid assuming that, then the
join selectivity would work itself out without any of the machinery
that I just proposed.

> It actually deals with a more general form of this case, because the
> clauses don't need to reference the same attribute - so for example this
> would work too, assuming there is extended stats object on the columns
> on each side:
>
>   P(A.c = B.d | (A.e < 42) & (B.f < 42))

That'd be cool.

> Not sure. In my experience queries with both a join clause and other
> clauses referencing the same attribute are pretty rare. But I agree if
> we can do the expensive stuff only when actually needed, with no cost in
> the 99.999% other cases, I don't see why not. Of course, code complexity
> is a cost too.

Right. I mean, we could have a planner GUC to control whether the
optimization is used even in cases where we see that it's possible.
But Tom keeps arguing that it is possible in many queries and would
benefit few queries, and I'm not seeing why that should be so. I think
it's likely to benefit many of the queries to which it applies.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM

2022-02-17 Thread Robert Haas
On Thu, Feb 17, 2022 at 4:10 PM Peter Geoghegan  wrote:
> > I would say it differently: I think the commit message does a poor job
> > describing what the commit actually does. For example, it says nothing
> > about changing VACUUM to always scan the last page of every heap
> > relation. This whole thread is about fixing a problem that was caused
> > by a significant behavior change that was *not even mentioned* in the
> > original commit message.
>
> It's not that simple. As I said in the fix-up commit message, and in
> the opening email to this thread, it basically isn't a new behavior at
> all. It would be much more accurate to describe it as a behavior that
> originated with commit e8429082, from late 2015. There was a subtle
> interaction with that existing behavior, and the refactoring work,
> that caused the reltuples problem.

Honestly, I really think it's that simple. It really is possible to
describe what changes a commit is making in the commit message -- and,
in my opinion, you're not doing it.

> > I think you *really* need to put more effort into
> > making your patches, and the emails about your patches, and the commit
> > messages for your patches understandable to other people. Otherwise,
> > waiting 3 months between when you post the patch and when you commit
> > it means nothing. You can wait 10 years to commit and still get
> > objections, if other people don't understand what you're doing.
>
> I don't think it's fair to just suppose that much of what I write is
> incomprehensible, just like that.

I'm not supposing anything. I'm telling you what I can understand, and
what I can't. Unless you intend to accuse me of pretending not to
understand things that I actually do understand, I feel like my word
on that topic should be treated as pretty much conclusive.

I do think that you are doing some things right, for sure. But I don't
think that you are following the community process particularly well.
It doesn't really feel like you feel the need to convince other people
that your changes are in good shape before committing them; and it is
really hard for me to understand in detail what is being changed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?

2022-02-17 Thread Tomas Vondra
On 2/17/22 21:15, Robert Haas wrote:
> On Tue, Feb 1, 2022 at 10:08 AM Andy Fan  wrote:
>> To address the row estimation issue, The most straightforward way to fix 
>> this is to
>> ignore the derived clauses when figuring out the RelOptInfo->rows on base 
>> relation.
>> To note which clause is derived from this patch, I added a new field 
>> "EquivalenceClass *
>> derived" in RestrictInfo. and then added a  included_derived  option in 
>> clauselist_selectivity_ext,
>> during the set_xxx_rel_size function, we can pass the 
>> included_derived=false.  This strategy
>> should be used in get_parameterized_baserel_size.   In all the other cases, 
>> include_derived=true
>> is used. which are finished in commit 2. (Commit 1 is Daivd's patch, I just  
>> rebased it)
> 
> That doesn't sound correct to me.
> 
> Suppose that we have A.x = B.x and also A.x < 42. We can choose to
> enforce A.x < 42 or we can choose to enforce B.x < 42 or we can do
> both. In general, any of those could be right: if either one of those
> two is highly selective while the other is not very selective at all,
> it's going to be fastest to enforce only the more selective qual. But
> if both are selective then it may be best to enforce both, so let's
> suppose we do that. If we don't adopt the proposal above and just do
> nothing, then our row count estimates for both A and B will include
> the effect of checking x < 42, and so they will be correct, but the
> row count estimate for join(A, B) will include the effect of checking
> x < 42 twice, and so it will be too low, which can mess up the plan at
> higher levels.
> 
> But discounting the effect of B.x < 42 when estimating the size of B
> is also incorrect. Now, the row count estimate for join(A, B) will
> include the effect of x < 42 only once, which is good. However, the
> row count estimate for B will be too high, because it will not include
> the effect of B.x < 42. And that means that the cost estimate for
> join(A, B) will be wrong. It will be too high, because it's going to
> think that it has more rows coming from the B side of the join than
> what is actually the case. And that can also mess up the plan at
> higher levels.
> 
> I think we could get around this problem by having multiple
> RelOptInfos (or something similar that is lighter-weight) for each
> relation. Today, we'd create a RelOptInfo for A, one for B, and one
> for join(A, B), and the paths for the join are created by joining a
> path for A to a path for B. Now imagine that we have instead 5
> RelOptInfos, for {A}, {A|x<42}, {B}, {B|x<42}, and join(A, B). The
> legal paths for that last one can be created by joining {A} to
> {B|x<42} or {A|x<42} to {B} or {A|x<42} to {B|x<42}. Each of those 5
> RelOptInfos can have its own cardinality estimate, and it seems pretty
> straightforward to see how to get both the scan cardinality and the
> join cardinality correct. Now I think this is decidedly non-trivial to
> implement, and I also hear the voice of Tom Lane saying that's going
> to be expensive in both time and memory, and he's not wrong.
> 

IMHO the whole problem is we're unable to estimate the join clause as a
conditional probability, i.e.

   P(A.x = B.x | (A.x < 42) & (B.x < 42))

so maybe instead of trying to generate additional RelOptInfo items we
should think about improving that. The extra RelOptInfos don't really
solve this, because even if you decide to join A|x<42 to B|x<42 it does
nothing to improve the join clause estimate.

With equality clauses we don't have this issue, because if you derive
clauses at the baserel level, the join clause becomes a no-op with
selectivity 1.0. But for inequalities that does not work ...
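
To put some back-of-the-envelope numbers on it (purely illustrative,
not planner code): say both sides have 1M rows, x is uniform over 100
distinct values, and the derived clause x < 42 is applied on both sides.

#include <stdio.h>

int
main(void)
{
	double		rows_a = 1e6;
	double		rows_b = 1e6;
	double		ndistinct = 100.0;	/* distinct values of x on each side */
	double		sel = 0.41;			/* selectivity of x < 42 */

	/* rows surviving the derived clauses on each side */
	double		a = rows_a * sel;
	double		b = rows_b * sel;

	/*
	 * Plausible actual join size: only ~41 distinct values remain after
	 * the restriction, so the equijoin matches 1-in-41, not 1-in-100.
	 */
	double		actual = a * b / (ndistinct * sel);

	/*
	 * What an independence assumption gives: the join clause is still
	 * estimated at 1/ndistinct, so the effect of x < 42 is counted once
	 * too often and the estimate comes out too low.
	 */
	double		estimate = a * b / ndistinct;

	printf("actual ~ %.3g rows, independence estimate ~ %.3g rows\n",
		   actual, estimate);
	return 0;
}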

Interestingly enough, the patch [1] tries to do something like this by
applying extended statistics to joins, and using baserestrictinfos as
"conditions" for statistics on both sides.

It actually deals with a more general form of this case, because the
clauses don't need to reference the same attribute - so for example this
would work too, assuming there is an extended stats object on the columns
on each side:

  P(A.c = B.d | (A.e < 42) & (B.f < 42))



[1] https://commitfest.postgresql.org/36/3055/


> On the other hand, I completely agree with David's comments on the
> other thread to the effect that holding our breath is not getting us
> anywhere. People don't keep asking for this feature because it's a
> stupid thing that nobody really wants, and when Tom alleges that it
> will rarely pay off, I think he's pretty far off the mark. The only
> time we need to consider doing any extra work is when we have
> something like the example discussed here, namely A.x = B.x and A.x <
> 42. If there is a variable that is part of an equivalence class and
> also is used in a scan qual, what are the chances that the implied
> inequality is useful? There's no way to estimate that mathematically -
> it's all about what you think human beings are typically going to do -
> but I'd say it's probably 

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-17 14:27:09 -0500, Robert Haas wrote:
> The other one is trickier, because AFAICT it's basically an opinion
> question: is accessing pg_class in the template database from some backend
> that is connected to another database too ugly to be acceptable? Several
> people have expressed concerns about that, but it's not clear to me whether
> they are essentially saying "that is not what I would do if I were doing
> this project" or more like "if you commit something that does it that way I
> will be enraged and demand an immediate revert and the removal of your
> commit bit." If it's the former, I think it's possible to clean up various
> details of these patches to make them look nicer than they do at present and
> get something committed for PostgreSQL 15.

Could you or Dilip outline how it now works, and what exactly makes it safe
etc (e.g. around locking, invalidation processing, snapshots, xid horizons)?

I just scrolled through the patchset without finding such an explanation, so
it's a bit hard to judge.


> But if it is the latter then there's really no point to that kind of cleanup
> work and we should probably just give up now.

This thread is long. Could you summarize what lead you to consider other
approaches (e.g. looking in the filesystem for relfilenodes) as not feasible /
too ugly / ...?

Greetings,

Andres Freund




Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM

2022-02-17 Thread Peter Geoghegan
On Thu, Feb 17, 2022 at 6:17 AM Robert Haas  wrote:
> Let's back up a minute and talk about the commit of $SUBJECT. The
> commit message contains a Discussion link to this thread. This thread,
> at the time you put that link in there, had exactly one post: from
> you. That's not much of a discussion, although I do acknowledge that
> sometimes we commit things that have bugs, and those bugs need to be
> fixed even if nobody has responded.

All true. I don't like to leave bugs unfixed for very long. I'm happy
to revise the fix in light of new information, or opinions.

> I would say it differently: I think the commit message does a poor job
> describing what the commit actually does. For example, it says nothing
> about changing VACUUM to always scan the last page of every heap
> relation. This whole thread is about fixing a problem that was caused
> by a significant behavior change that was *not even mentioned* in the
> original commit message.

It's not that simple. As I said in the fix-up commit message, and in
the opening email to this thread, it basically isn't a new behavior at
all. It would be much more accurate to describe it as a behavior that
originated with commit e8429082, from late 2015. There was a subtle
interaction with that existing behavior, and the refactoring work,
that caused the reltuples problem.

> If it had been mentioned, I likely would have
> complained, because it's very similar to behavior that Tom eliminated
> in b503da135ab0bdd97ac3d3f720c35854e084e525, which he did because it
> was distorting reltuples estimates.

I cited that commit myself. I think that it's a good idea to not be
completely unconcerned about how the visibility map skipping behavior
tends to affect the reltuples estimate -- especially if it's easy to
do it in a way that doesn't cause problems anyway. And so I agree that
the "looks ahead rather than behind" behavior added by that commit
makes sense. Even still, the whole idea that scanned_pages is a random
sample from the table is kinda bogus, and always has been. The
consequences of this are usually not too bad, but even still: why
wouldn't it be possible for there to be all kinds of distortion to
reltuples, since the tuple density logic is built on an assumption
that's self evidently wrong?
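
A toy example of what I mean -- this is deliberately *not* the
vac_estimate_reltuples() math, just a naive density extrapolation, to
show how a nonrandom choice of scanned_pages can distort a table-wide
estimate:

#include <stdio.h>

/* naive estimator: assume the scanned pages are representative */
static double
naive_reltuples(double total_pages, double scanned_pages,
				double scanned_tuples)
{
	return (scanned_tuples / scanned_pages) * total_pages;
}

int
main(void)
{
	/*
	 * A 1000-page table averaging ~100 tuples per page, but the only page
	 * scanned is a nearly empty last page holding a single tuple.
	 */
	printf("estimate: %.0f tuples (true count is about 100000)\n",
		   naive_reltuples(1000.0, 1.0, 1.0));
	return 0;
}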

It's not reasonable to expect vacuumlazy.c to truly contort itself,
just to conceal that shaky assumption. At the same time, we need the
model used by vac_estimate_reltuples() to work as well as it can, with
the limited information that it has close at hand. That doesn't seem
to leave a lot of options, as far as addressing the bug goes. Happy to
discuss it further if you have any ideas, though.

> I think you *really* need to put more effort into
> making your patches, and the emails about your patches, and the commit
> messages for your patches understandable to other people. Otherwise,
> waiting 3 months between when you post the patch and when you commit
> it means nothing. You can wait 10 years to commit and still get
> objections, if other people don't understand what you're doing.

I don't think it's fair to just suppose that much of what I write is
incomprehensible, just like that. Justin expressed the exact opposite
view about the commit message of 44fa8488, for one. I'm not saying
that there is no room at all for improvement, or that my reputation
for being hard to follow is completely undeserved. Just that it's
frustrating to be accused of not having put enough effort into
explaining what's going on with commit 44fa8488 -- I actually put a
great deal of effort into it.

I think that you'd acknowledge that overall, on balance, I must be
doing something right. Have you considered that that might have
happened because of my conceptual style, not in spite of it? I have a
tendency to define things in abstract terms, because it really works
for me. I try to abstract away inessential details, and emphasize
invariance, so that the concepts are easier to manipulate in many
different contexts -- something that is almost essential when working
on B-Tree stuff. It's harder at first, but much easier later on (at
least for me). If I ever seem to go overboard with it, I ask that you
consider this perspective -- though you should also push back when
that seems appropriate.

Separately, I think that you're missing how hard it would have been to
"just say what you did" while making much sense, given the specifics
here. The complexity in commit 44fa8488 is more or less all due to
historical factors -- it's quite a messy, winding story. For example,
commit 1914c5ea from 2017 added the tupcount_pages field (that I just
removed) to fix a reltuples estimation related bug, that could be
thought of as a bug in the aforementioned commit e8429082 (from 2015).
Now, commit 1914c5ea didn't actually argue that it was fixing a bug in
that earlier commit directly -- whether or not you'd call it that is a
matter of perspective.

This is one of those situations where it's far easier to 

Re: O(n) tasks cause lengthy startups and checkpoints

2022-02-17 Thread Nathan Bossart
On Thu, Feb 17, 2022 at 11:27:09AM -0800, Andres Freund wrote:
> On 2022-02-17 10:23:37 -0800, Nathan Bossart wrote:
>> On Wed, Feb 16, 2022 at 10:59:38PM -0800, Andres Freund wrote:
>> > They're accessed by xid. The LSN is just for cleanup. Accessing files
>> > left over from a previous transaction with the same xid wouldn't be
>> > good - we'd read wrong catalog state for decoding...
>> 
>> Okay, that part makes sense to me.  However, I'm still confused about how
>> this is handled today and why moving cleanup to a separate auxiliary
>> process makes matters worse.
> 
> Right now cleanup happens every checkpoint. So cleanup can't be deferred all
> that far. We currently include a bunch of 32bit xids inside checkpoints, so
> if they're rarer than 2^31-1, we're in trouble independent of logical
> decoding.
> 
> But with this patch cleanup of logical decoding mapping files (and other
> pieces) can be *indefinitely* deferred, without being noticeable.

I see.  The custodian should ordinarily remove the files as quickly as
possible.  In fact, I bet it will typically line up with checkpoints for
most users, as the checkpointer will set the latch.  However, if there are
many temporary files to clean up, removing the logical decoding files could
be delayed for some time, as you said.

> One possible way to improve this would be to switch the on-disk filenames to
> be based on 64bit xids. But that might also present some problems (file name
> length, cost of converting 32bit xids to 64bit xids).

Okay.

>> I've done quite a bit of reading, and I haven't found anything that seems
>> intended to prevent this problem.  Do you have any pointers?
> 
> I don't know if we have an iron-clad enforcement of checkpoints happening
> every 2^31-1 xids. It's very unlikely to happen - you'd run out of space
> etc. But it'd be good to have something better than that.

Okay.  So IIUC the problem might already exist today, but offloading these
tasks to a separate process could make it more likely.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: buildfarm warnings

2022-02-17 Thread Robert Haas
On Thu, Feb 17, 2022 at 3:51 PM Andres Freund  wrote:
> On 2022-02-17 15:22:08 -0500, Robert Haas wrote:
> > OK, sounds good, thanks. I couldn't (and still can't) think of a good
> > way of testing the progress-reporting code either. I mean I guess if
> > you could convince pg_basebackup not to truncate the filenames, maybe
> > by convincing it that your terminal is as wide as your garage door,
> > then you could capture the output and do some tests against it. But I
> > feel like the test code would be two orders of magnitude larger than
> > the code it intends to exercise, and I'm not sure it would be entirely
> > robust, either.
>
> How about just running pg_basebackup with --progress in one or two of the
> tests? Of course that's not testing very much, but at least it verifies not
> crashing...

True. That would be easy enough.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: buildfarm warnings

2022-02-17 Thread Andres Freund
On 2022-02-17 15:22:08 -0500, Robert Haas wrote:
> OK, sounds good, thanks. I couldn't (and still can't) think of a good
> way of testing the progress-reporting code either. I mean I guess if
> you could convince pg_basebackup not to truncate the filenames, maybe
> by convincing it that your terminal is as wide as your garage door,
> then you could capture the output and do some tests against it. But I
> feel like the test code would be two orders of magnitude larger than
> the code it intends to exercise, and I'm not sure it would be entirely
> robust, either.

How about just running pg_basebackup with --progress in one or two of the
tests? Of course that's not testing very much, but at least it verifies not
crashing...




Re: killing perl2host

2022-02-17 Thread Andres Freund
On 2022-02-17 15:23:36 -0500, Andrew Dunstan wrote:
> Very well. I think the easiest way will be to stash $host_os in the
> environment and let the script pick it up similarly to what I suggested
> with MSYSTEM.

WFM.




Re: killing perl2host

2022-02-17 Thread Andrew Dunstan


On 2/17/22 15:09, Andres Freund wrote:
> Hi,
>
> On 2022-02-17 14:40:01 -0500, Andrew Dunstan wrote:
>> Sure, that could be done, but what's the issue? Msys2 normally defines
>> MSYSTEM for you - see /etc/msystem which is sourced by /etc/profile.
> It seems not a great idea to me to use different sources of truth about build
> target. And I think it's not hard to get the actual host_os and MSYSTEM to
> disagree:
> - afaics it's quite possible to run configure outside of a login msys shell
> - you could do an msys build in a ucrt shell, but specify 
> --host=x86_64-pc-msys,
>   or the other way round


Very well. I think the easiest way will be to stash $host_os in the
environment and let the script pick it up similarly to what I suggested
with MSYSTEM.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: buildfarm warnings

2022-02-17 Thread Robert Haas
On Thu, Feb 17, 2022 at 2:36 PM Tom Lane  wrote:
> Yeah, I came to the same conclusion while out doing some errands.
> There's no very convincing reason to believe that what's passed to
> progress_update_filename has got adequate lifespan either, or that
> that would remain true even if it's true today, or that testing
> would detect such a problem (even if we had any, ahem).  The file
> names I was seeing in testing just now tended to contain fragments
> of temporary directory names, which'd be mighty hard to tell from
> random garbage in any automated way:
>
> 3/22774 kB (0%), 0/2 tablespaces (...pc1/PG_15_202202141/5/16392_init)
> 3/22774 kB (0%), 1/2 tablespaces (...pc1/PG_15_202202141/5/16392_init)
> 22785/22785 kB (100%), 1/2 tablespaces (...t_T8u0/backup1/global/pg_control)
>
> So I think we should make progress_update_filename strdup each
> input while freeing the last value, and insist that every update
> go through that logic.  (This is kind of a lot of cycles to spend
> in support of an optional mode, but maybe we could do it only if
> showprogress && verbose.)  I'll go make it so.

OK, sounds good, thanks. I couldn't (and still can't) think of a good
way of testing the progress-reporting code either. I mean I guess if
you could convince pg_basebackup not to truncate the filenames, maybe
by convincing it that your terminal is as wide as your garage door,
then you could capture the output and do some tests against it. But I
feel like the test code would be two orders of magnitude larger than
the code it intends to exercise, and I'm not sure it would be entirely
robust, either.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?

2022-02-17 Thread Robert Haas
On Tue, Feb 1, 2022 at 10:08 AM Andy Fan  wrote:
> To address the row estimation issue, the most straightforward way to fix
> this is to ignore the derived clauses when figuring out the
> RelOptInfo->rows on the base relation. To note which clause is derived
> from this patch, I added a new field "EquivalenceClass *derived" in
> RestrictInfo, and then added an included_derived option in
> clauselist_selectivity_ext; during the set_xxx_rel_size functions we can
> pass included_derived=false. This strategy should be used in
> get_parameterized_baserel_size. In all the other cases,
> include_derived=true is used, which is finished in commit 2. (Commit 1 is
> David's patch, I just rebased it.)

That doesn't sound correct to me.

Suppose that we have A.x = B.x and also A.x < 42. We can choose to
enforce A.x < 42 or we can choose to enforce B.x < 42 or we can do
both. In general, any of those could be right: if either one of those
two is highly selective while the other is not very selective at all,
it's going to be fastest to enforce only the more selective qual. But
if both are selective then it may be best to enforce both, so let's
suppose we do that. If we don't adopt the proposal above and just do
nothing, then our row count estimates for both A and B will include
the effect of checking x < 42, and so they will be correct, but the
row count estimate for join(A, B) will include the effect of checking
x < 42 twice, and so it will be too low, which can mess up the plan at
higher levels.

But discounting the effect of B.x < 42 when estimating the size of B
is also incorrect. Now, the row count estimate for join(A, B) will
include the effect of x < 42 only once, which is good. However, the
row count estimate for B will be too high, because it will not include
the effect of B.x < 42. And that means that the cost estimate for
join(A, B) will be wrong. It will be too high, because it's going to
think that it has more rows coming from the B side of the join than
what is actually the case. And that can also mess up the plan at
higher levels.

I think we could get around this problem by having multiple
RelOptInfos (or something similar that is lighter-weight) for each
relation. Today, we'd create a RelOptInfo for A, one for B, and one
for join(A, B), and the paths for the join are created by joining a
path for A to a path for B. Now imagine that we have instead 5
RelOptInfos, for {A}, {A|x<42}, {B}, {B|x<42}, and join(A, B). The
legal paths for that last one can be created by joining {A} to
{B|x<42} or {A|x<42} to {B} or {A|x<42} to {B|x<42}. Each of those 5
RelOptInfos can have its own cardinality estimate, and it seems pretty
straightforward to see how to get both the scan cardinality and the
join cardinality correct. Now I think this is decidedly non-trivial to
implement, and I also hear the voice of Tom Lane saying that's going
to be expensive in both time and memory, and he's not wrong.

On the other hand, I completely agree with David's comments on the
other thread to the effect that holding our breath is not getting us
anywhere. People don't keep asking for this feature because it's a
stupid thing that nobody really wants, and when Tom alleges that it
will rarely pay off, I think he's pretty far off the mark. The only
time we need to consider doing any extra work is when we have
something like the example discussed here, namely A.x = B.x and A.x <
42. If there is a variable that is part of an equivalence class and
also is used in a scan qual, what are the chances that the implied
inequality is useful? There's no way to estimate that mathematically -
it's all about what you think human beings are typically going to do -
but I'd say it's probably better than 50%. I know that when I was
regularly doing application programming on top of PostgreSQL I was
VERY aware of this limitation of the optimizer and habitually thought
about which table to write the inequality against. That kept me out of
trouble most of the time, but it sure seems like we're punting the
optimizer's job to the end user.

And even then, I still sometimes couldn't stay out of trouble, because
sometimes I knew that the implied inequality really ought to be
enforced against both sides of the join to get a decent plan. In that
case, the only way to get the optimizer to do what I wanted was to
duplicate the qual. But that runs headlong into the exact problem that
we're talking about here: now the join selectivity is going to be
messed up, and then some other part of the plan would get messed up. I
still remember the frustration associated with that scenario more than
10 years later. You can't even fix it by uglifying your query with a
planner hint, because we don't support those either. Which brings me
to another point: it's incoherent to simultaneously argue that we
shouldn't have planner hints but rather focus on improving the
planner, and at the same time refuse to improve the planner because it
would make planning too 

Re: killing perl2host

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-17 14:40:01 -0500, Andrew Dunstan wrote:
> Sure, that could be done, but what's the issue? Msys2 normally defines
> MSYSTEM for you - see /etc/msystem which is sourced by /etc/profile.

It seems not a great idea to me to use different sources of truth about build
target. And I think it's not hard to get the actual host_os and MSYSTEM to
disagree:
- afaics it's quite possible to run configure outside of a login msys shell
- you could do an msys build in a ucrt shell, but specify --host=x86_64-pc-msys,
  or the other way round

There's probably also some stuff about cross building from linux, but that
doesn't matter much, because right now wine doesn't get through even the base
regression tests (although it gets through initdb these days, which is nice).


> > Or the test just implemented in
> > configure, as I suggested.
> >
> 
> No, because we don't know which perl will be invoked by $PROVE. That's
> why we set up check_modules.pl in the first place.

Ah.

Greetings,

Andres Freund




Re: killing perl2host

2022-02-17 Thread Andrew Dunstan


On 2/17/22 13:10, Andres Freund wrote:
> On 2022-02-17 13:08:01 -0500, Andrew Dunstan wrote:
>>>> perhaps something like:
>>>>
>>>>
>>>>     my $msystem = $ENV{MSYSTEM} || 'undef';
>>>>
>>>>     die "incompatible perl" if $Config{osname} eq 'msys' && $msystem ne
>>>> 'MSYS';
>>> Why tests MSYSTEM instead of $host_os?
>> Is that available in check_modules.pl? AFAICT it's an unexported shell
>> variable.
> No, but it could just be passed to it? 


Sure, that could be done, but what's the issue? Msys2 normally defines
MSYSTEM for you - see /etc/msystem which is sourced by /etc/profile.


> Or the test just implemented in
> configure, as I suggested.
>

No, because we don't know which perl will be invoked by $PROVE. That's
why we set up check_modules.pl in the first place.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: buildfarm warnings

2022-02-17 Thread Tom Lane
Robert Haas  writes:
> If not, I think that your quick-and-dirty fix is about right, except
> that we probably need to do it every place where we set
> progress_filename, and we should arrange to free the memory later.
> With -Ft, you won't have enough archives to matter, but with -Fp, you
> might have enough archive members to matter.

Yeah, I came to the same conclusion while out doing some errands.
There's no very convincing reason to believe that what's passed to
progress_update_filename has got adequate lifespan either, or that
that would remain true even if it's true today, or that testing
would detect such a problem (even if we had any, ahem).  The file
names I was seeing in testing just now tended to contain fragments
of temporary directory names, which'd be mighty hard to tell from
random garbage in any automated way:

3/22774 kB (0%), 0/2 tablespaces (...pc1/PG_15_202202141/5/16392_init)
3/22774 kB (0%), 1/2 tablespaces (...pc1/PG_15_202202141/5/16392_init)
22785/22785 kB (100%), 1/2 tablespaces (...t_T8u0/backup1/global/pg_control)

So I think we should make progress_update_filename strdup each
input while freeing the last value, and insist that every update
go through that logic.  (This is kind of a lot of cycles to spend
in support of an optional mode, but maybe we could do it only if
showprogress && verbose.)  I'll go make it so.

regards, tom lane




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-17 Thread Robert Haas
On Tue, Feb 15, 2022 at 6:49 AM Dilip Kumar  wrote:
> Here is the updated version of the patch, the changes are 1) Fixed
> review comments given by Robert and one open comment from Ashutosh.
> 2) Preserved the old create db method. 3) As agreed upthread for now
> we are using the new strategy only for createdb not for movedb so I
> have removed the changes in ForgetDatabaseSyncRequests() and
> DropDatabaseBuffers().  4) Provided a database creation strategy
> option as of now I have kept it as below.
>
> CREATE DATABASE ... WITH (STRATEGY = WAL_LOG);  -- default if
> option is omitted
> CREATE DATABASE ... WITH (STRATEGY = FILE_COPY);

All right. I think there have been two design-level objections to this
patch, and this resolves one of them. The other one is trickier,
because AFAICT it's basically an opinion question: is accessing
pg_class in the template database from some backend that is connected
to another database too ugly to be acceptable? Several people have
expressed concerns about that, but it's not clear to me whether they
are essentially saying "that is not what I would do if I were doing
this project" or more like "if you commit something that does it that
way I will be enraged and demand an immediate revert and the removal
of your commit bit." If it's the former, I think it's possible to
clean up various details of these patches to make them look nicer than
they do at present and get something committed for PostgreSQL 15. But
if it is the latter then there's really no point to that kind of
cleanup work and we should probably just give up now. So, Andres,
Heikki, and anybody else with a strong opinion, can you clarify how
vigorously you hate this design, or don't?

My own opinion is that this is actually rather elegant. It just makes
sense to me that the right way to figure out what relations are in a
database is to get that list from the database rather than from the
filesystem. Nobody would accept the idea of making \d work by listing
out the directory contents rather than by walking pg_class, and so the
only reason we ought to accept that in the case of cloning a database
is if the code is too ugly any other way. But is it really? It's true
that this patch set does some refactoring of interfaces in order to
make that work, and there's a few things about how it does that that I
think could be improved, but on the whole, it's seems like a
remarkably small amount of code to do something that we've long
considered absolutely taboo. Now, it's nowhere close to being
something that could be used to allow fully general cross-database
access, and there are severe problems with the idea of allowing any
such thing. In particular, there are various places that test for
connections to a database, and they aren't going to be expecting processes
not connected to the database to be touching it. My belief is that a
heavyweight lock on the database is a suitable surrogate, because we
actually take such a lock when connecting to a database, and that
forms part of the regular interlock. Taking such locks routinely for
short periods would be expensive and might create other problems, but
doing it for a maintenance operation seems OK. Also, if we wanted to
actually support full cross-database access, locking wouldn't be the
only problem by far. We'd have to deal with things like the relcache
and the catcache, which would be hard, and might increase the cost of
very common things that we need to be cheap. But none of that is
implicated in this patch, which only generalizes code paths that are
not so commonly taken as to pose a problem, and manages to reuse quite
a bit of code rather than introducing entirely new code to do the same
things.
It does introduce some new code here and there, though: there isn't
zero duplication. The biggest chunk of that FWICS is in 0006, in
GetDatabaseRelationList and GetRelListFromPage. I just can't get
excited about that. It's literally about two screens worth of code.
We're not talking about duplicating src/backend/access/heapam or
something like that. I do think it would be a good idea to split it up
just a bit more: I think the code inside GetRelListFromPage that is
guarded by HeapTupleSatisfiesVisibility() could be moved into a
separate subroutine, and I think that would likely look a big nicer.
But fundamentally I just don't see a huge issue here. That is not to
say there isn't a huge issue here: just that I don't see it.

Comments?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: O(n) tasks cause lengthy startups and checkpoints

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-17 10:23:37 -0800, Nathan Bossart wrote:
> On Wed, Feb 16, 2022 at 10:59:38PM -0800, Andres Freund wrote:
> > They're accessed by xid. The LSN is just for cleanup. Accessing files
> > left over from a previous transaction with the same xid wouldn't be
> > good - we'd read wrong catalog state for decoding...
> 
> Okay, that part makes sense to me.  However, I'm still confused about how
> this is handled today and why moving cleanup to a separate auxiliary
> process makes matters worse.

Right now cleanup happens every checkpoint. So cleanup can't be deferred all
that far. We currently include a bunch of 32bit xids inside checkpoints, so
if they're rarer than 2^31-1, we're in trouble independent of logical
decoding.

But with this patch cleanup of logical decoding mapping files (and other
pieces) can be *indefinitely* deferred, without being noticeable.


One possible way to improve this would be to switch the on-disk filenames to
be based on 64bit xids. But that might also present some problems (file name
length, cost of converting 32bit xids to 64bit xids).


> I've done quite a bit of reading, and I haven't found anything that seems
> intended to prevent this problem.  Do you have any pointers?

I don't know if we have an iron-clad enforcement of checkpoints happening
every 2^31-1 xids. It's very unlikely to happen - you'd run out of space
etc. But it'd be good to have something better than that.

Greetings,

Andres Freund




Proposal: Support custom authentication methods using hooks

2022-02-17 Thread samay sharma
Hi all,

I wanted to submit a patch to expose 2 new hooks (one for the
authentication check and another one for error reporting) in auth.c. These
will allow users to implement their own authentication methods for Postgres
or add custom logic around authentication.

A use case where this is useful are environments where you want
authentication to be centrally managed across different services. This is a
common deployment model for cloud providers where customers like to use
single sign on and authenticate across different services including
Postgres. Implementing this now is tricky as it requires syncing that
authentication method's credentials with Postgres (and that gets trickier
with TTL/expiry etc.). With these hooks, you can implement an extension to
check credentials directly using the authentication provider's APIs.

To enable this, I've proposed adding a new authentication method "custom"
which can be specified in pg_hba.conf and takes a mandatory argument
"provider" specifying which authentication provider to use. I've also moved
a couple static functions to headers so that extensions can call them.

Sample pg_hba.conf line to use a custom provider:

host    all             all             ::1/128                 custom  provider=test


As an example and a test case, I've added an extension named
test_auth_provider in src/test/modules which fetches credentials from
a pre-defined array. I've also added tap tests for the extension to test
this functionality.


One constraint in the current implementation is that we allow only one
authentication provider to be loaded at a time. In the future, we can add
more functionality to maintain an array of hooks and call the appropriate
one based on the provider name in the pg_hba line.


A couple of my tests are flaky and sometimes fail in CI. I think the reason
for that is I don't wait for pg_hba reload to be processed before checking
logs for error messages. I didn't find an immediate way to address that and
I'm looking into it but wanted to get an initial version out for
feedback on the approach taken and interfaces. Once those get finalized, I
can submit a patch to add docs as well.


Looking forward to your feedback.


Regards,

Samay


0001-Add-support-for-custom-authentication-methods.patch
Description: Binary data


0002-Add-sample-extension-to-test-custom-auth-provider-ho.patch
Description: Binary data


0003-Add-tests-for-test_auth_provider-extension.patch
Description: Binary data


Re: [PATCH] Support pg_ident mapping for LDAP

2022-02-17 Thread Jacob Champion
On Fri, 2021-10-29 at 17:38 +, Jacob Champion wrote:
> v3 attached, which uses the above naming scheme and removes the stale
> TODO. Changes in since-v2.

v4 rebases over the recent TAP changes.

--Jacob
From e0f36725013610eade9bc83414c4d1f5adea17e2 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 30 Aug 2021 16:29:59 -0700
Subject: [PATCH v4] Allow user name mapping with LDAP

Enable the `map` HBA option for the ldap auth method. To make effective
use of this from the client side, the authuser connection option (and a
corresponding environment variable, PGAUTHUSER) has been added; it
defaults to the PGUSER.

For more advanced mapping, the ldap_map_dn HBA option can be set to use
the full user Distinguished Name during user mapping. (This parallels
the include_realm=1 and clientname=DN options.)
---
 .../postgres_fdw/expected/postgres_fdw.out|   2 +-
 contrib/postgres_fdw/option.c |   8 +-
 doc/src/sgml/client-auth.sgml |  45 +-
 doc/src/sgml/libpq.sgml   |  32 
 doc/src/sgml/postgres-fdw.sgml|   3 +-
 src/backend/libpq/auth.c  |  38 +++--
 src/backend/libpq/hba.c   |  29 +++-
 src/backend/postmaster/postmaster.c   |   2 +
 src/include/libpq/hba.h   |   1 +
 src/include/libpq/libpq-be.h  |   6 +
 src/interfaces/libpq/fe-connect.c |   4 +
 src/interfaces/libpq/fe-protocol3.c   |   2 +
 src/interfaces/libpq/libpq-int.h  |   1 +
 src/test/ldap/t/001_auth.pl   | 143 ++
 14 files changed, 297 insertions(+), 19 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index b2e02caefe..c2eca6521a 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -188,7 +188,7 @@ ALTER USER MAPPING FOR public SERVER testserver1
 ALTER USER MAPPING FOR public SERVER testserver1
 	OPTIONS (ADD sslmode 'require');
 ERROR:  invalid option "sslmode"
-HINT:  Valid options in this context are: user, password, sslpassword, password_required, sslcert, sslkey
+HINT:  Valid options in this context are: user, password, sslpassword, authuser, password_required, sslcert, sslkey
 -- But we can add valid ones fine
 ALTER USER MAPPING FOR public SERVER testserver1
 	OPTIONS (ADD sslpassword 'dummy');
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index fc3ce6a53a..d073418a5d 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -316,10 +316,12 @@ InitPgFdwOptions(void)
 		popt->keyword = lopt->keyword;
 
 		/*
-		 * "user" and any secret options are allowed only on user mappings.
-		 * Everything else is a server option.
+		 * "user", "authuser", and any secret options are allowed only on user
+		 * mappings.  Everything else is a server option.
 		 */
-		if (strcmp(lopt->keyword, "user") == 0 || strchr(lopt->dispchar, '*'))
+		if (strcmp(lopt->keyword, "user") == 0 ||
+			strcmp(lopt->keyword, "authuser") == 0 ||
+			strchr(lopt->dispchar, '*'))
 			popt->optcontext = UserMappingRelationId;
 		else
 			popt->optcontext = ForeignServerRelationId;
diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index 02f0489112..d902eb9d01 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -1624,10 +1624,11 @@ omicron bryanh  guest1
 This authentication method operates similarly to
 password except that it uses LDAP
 as the password verification method. LDAP is used only to validate
-the user name/password pairs. Therefore the user must already
-exist in the database before LDAP can be used for
-authentication.
-   
+the user name/password pairs. Therefore database users must already
+exist in the database before LDAP can be used for authentication. User name
+mapping can be used to allow the LDAP user name to be different from the
+database user name.
+
 

 LDAP authentication can operate in two modes. In the first mode,
@@ -1703,6 +1704,42 @@ omicron bryanh  guest1

   
  
+ 
+  map
+  
+   
+Allows for mapping between LDAP and database user names. See
+ for details.
+   
+  
+ 
+ 
+  ldap_map_dn
+  
+   
+Set to 1 to use the user's full Distinguished Name (e.g.
+uid=someuser,dc=example,dc=com) during user name
+mapping. When set to 0 (the default), only the bare username (in this
+example, someuser) will be mapped. This option may
+only be used in conjunction with the map option.
+   
+   
+
+ When using regular expression matching in combination with this option,
+ care should be taken to correctly anchor the regular expression in
+ 

Re: support for CREATE MODULE

2022-02-17 Thread Bruce Momjian
On Tue, Feb 15, 2022 at 12:29:54PM -0800, Swaha Miller wrote:
> On Mon, Feb 14, 2022 at 4:58 PM Bruce Momjian  wrote:
> > I was working on a talk about microservices today and decided to create
> > two schemas --- a public one that has USAGE permission for other 
> services
> > with views and SECURITY DEFINER functions that referenced a private
> > schema that can only be accessed by the owning service.  Is that a usage
> > pattern that requires modules since it already works with schemas and
> > just uses schema permissions to designate public/private schema
> > interfaces.
> 
> 
> I think the reason for modules would be to make it a better experience for
> PostgreSQL users to do what they need. Yours is a great example

...

> Yes, anything a user may want to do with modules is likely possible to
> do already with schemas. But just because it is possible doesn't mean
> it is not tedious and awkward because of the mechanisms available to do 
> them now. And I would advocate for more expressive constructs to enable
> users of PostgreSQL to focus and reason about more of the "what" than 
> the "how" of the systems they are trying to build or migrate.

Well, I think there are two downsides of having modules that do
some things like schemas, and some not:

*  You now have two ways of controlling namespaces and permissions
*  It is unclear how the two methods would interact

Some people might find that acceptable, but historically more people have
rejected that, partly due to user API complexity and partly due to
server code complexity.

Given what you can do with Postgres already, what is trying to be
accomplished?  Having public and private objects in the same schema? 
Having public objects in a schema as visible to some users and private
objects invisible to them?  The visibility control is currently not
possible without using two schemas, but access control is possible.
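
For reference, a minimal sketch of the two-schema pattern mentioned above,
as it already works today (all names are illustrative):

    CREATE ROLE other_service LOGIN;
    CREATE SCHEMA private_impl;
    CREATE SCHEMA api;
    CREATE TABLE private_impl.orders (id bigint PRIMARY KEY, status text);

    -- other services may look into api, but not into private_impl
    GRANT USAGE ON SCHEMA api TO other_service;

    CREATE VIEW api.orders AS
        SELECT id, status FROM private_impl.orders;
    GRANT SELECT ON api.orders TO other_service;

    CREATE FUNCTION api.cancel_order(order_id bigint) RETURNS void
    LANGUAGE sql SECURITY DEFINER
    AS $$ UPDATE private_impl.orders SET status = 'cancelled'
          WHERE id = order_id $$;
    REVOKE EXECUTE ON FUNCTION api.cancel_order(bigint) FROM PUBLIC;
    GRANT EXECUTE ON FUNCTION api.cancel_order(bigint) TO other_service;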

Now, if we could throw away our current schema/permission controls and
just use one based on modules, that might be more acceptable, but it
would cause too much breakage and be non-standard, and also be less
flexible than what we already support.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Isn't gist_page_items broken for opclasses with compress methods?

2022-02-17 Thread Tomas Vondra
Hi,

I've been looking at a report of a crash after an in-place upgrade [1],
related to a GiST index, and I tried to use the new GiST support added
to pageinspect in 756ab29124 (so it's in 14). My knowledge of GiST is
somewhat limited, but after struggling with it for a while I wonder if
it's actually correct for GiST indexes with "compress" methods.

Consider a simple example with ltree:

  CREATE TABLE test (a ltree);
  INSERT INTO test VALUES ('Top.Collections.Pictures.Astronomy');

  SELECT * FROM test;

                   a
  ------------------------------------
   Top.Collections.Pictures.Astronomy
  (1 row)

  CREATE INDEX ON test USING gist (a);

so far everything seems fine, but let's use the pageinspect stuff.

  SELECT * FROM gist_page_items(get_raw_page('test_a_idx', 0),
   'test_a_idx');

  WARNING:  problem in alloc set ExprContext: detected write past chunk
end in block 0x1b1b110, chunk 0x1b1d2c0
  WARNING:  problem in alloc set ExprContext: bad single-chunk 0x1b1d358
in block 0x1b1b110
  WARNING:  problem in alloc set ExprContext: found inconsistent memory
block 0x1b1b110

   itemoffset | ctid  | itemlen | dead |  keys
  ------------+-------+---------+------+--------
            1 | (0,1) |      96 | f    | (a)=()
  (1 row)

Well, that's not great, right? I know pageinspect may misbehave if you
supply incorrect info or for corrupted pages, but that's not what's
happening here - the heap/index are fine, we have the right descriptor
and all of that. Yet we corrupt memory, and generate bogus result (I
mean, the keys clearly are not empty).

The simple truth is that this assumes the GiST index simply stores the
input data as is, and that we can happily call the output procedure on it,
so we read stuff from the index and call ltree_out() on it. But that's
nonsense, because what the index actually stores is ltree_gist, which is
essentially "ltree" with extra 8B header. Which (of course) utterly
confuses ltree_out, producing nonsense result.


The comment before BuildIndexValueDescription(), which is used to print
the keys, says this:

  * The passed-in values/nulls arrays are the "raw" input to the
  * index AM, e.g. results of FormIndexDatum --- this is not
  * necessarily what is stored in the index, but it's what the
  * user perceives to be stored.

Which I think simply means it's fine to call BuildIndexValueDescription
on the values passed to index AMs, not on the stuff we read from the index.
Because it's up to the AM to transform it arbitrarily.


In the regression tests it seemingly works, but I believe that's merely
due to using a Point. gist/point_ops defines gist_point_compress, but
that merely builds BOX, which is two points next to each other. So we
simply print the first one ...

I've tried various other GiST opclasses (e.g. btree_gist) and all of that
prints just bogus values.


Not sure what to do about this. BuildIndexValueDescription seems like
the wrong place to deal with this, and it was intended for entirely
different purpose (essentially just error messages).

So perhaps pageinspect could add a variant that knows how to decompress
the value? Or just not print the keys, because the current state seems a
bit useless and confusing.


regards

[1] https://www.postgresql.org/message-id/17406-71e02820ae79bb40%40postgresql.org

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?

2022-02-17 Thread Finnerty, Jim


So I think knowing what bad it is to have this feature is the key point to 
discussion now.


> While I've only read your description of the patch not the patch itself,

This comment applies to me also.

Is the join selectivity properly calculated in all cases, e.g. in the n:m join 
case in particular, or in general when you’re not joining to a unique key? 
(this would be the usual situation here, since it adds a range qual to a join 
qual)

>> Furthermore, there are some cases involving parameterized paths where
>> enforcing the inequality multiple times is definitely bad


  *   This has been done by committing 4.

What remaining cases are there where the qual is evaluated redundantly?



  *   Anyone still have interest in this?  Or is a better solution really 
possible?
Or is the current method  too bad to rescue?

As you’ve shown, this can potentially be very important, though I don’t think 
you’ll often see equijoins with an additional range restriction on the join 
keys.  When it happens, though, it could be especially important for joins to 
partitioned tables with many remote fdw partitions when the join can’t be 
pushed down to the remote server.


Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-16 19:43:09 -0800, Peter Geoghegan wrote:
> It makes the code in vacuumlazy.c much cleaner. In fact, that's how commit
> 44fa8488 started off -- purely as refactoring work.

The problem is that it didn't end up as that. You combined refactoring with
substantial changes. And described it in large and generic terms.


> > You didn't really give a heads up that you're about to commit either. 
> > There's
> > a note at the bottom of [1], a very long email, talking about committing in 
> > a
> > couple of weeks. After which there was substantial discussion of the 
> > patchset.
>
> How can you be surprised that I committed 44fa8488? It's essentially
> the same patch as the first version, posted November 22 -- almost 3
> months ago.

It's a contentious thread. I asked for things to be split. In that context,
it's imo common courtesy for non-trivial patches to write something like

  While the rest of the patchset is contentious, I think 0001 is ready to
  go. I'd like to commit it tomorrow, unless somebody protests.


> And it's certainly not a big patch (though it is complicated).

For me a vacuum change with the following diffstat is large:
3 files changed, 515 insertions(+), 297 deletions(-)


> > It doesn't look to me like there was a lot of review for 44fa8488, despite 
> > it
> > touching very critical parts of the code. I'm listed as a reviewer, that 
> > was a
> > few months ago, and rereading my review I don't think it can be read to 
> > agree
> > with the state of the code back then.
>
> Can you be more specific?

Most importantly:

On 2021-11-22 11:29:56 -0800, Andres Freund wrote:
> I think this is a change mostly in the right direction. But as formulated this
> commit does *WAY* too much at once.

I do not know how to state more clearly that I think this is not a patch in a
committable shape.

but also:

On 2021-11-22 11:29:56 -0800, Andres Freund wrote:
> I think it should be doable to add an isolation test for this path. There have
> been quite a few bugs around the wider topic...


Greetings,

Andres Freund




Re: O(n) tasks cause lengthy startups and checkpoints

2022-02-17 Thread Nathan Bossart
On Wed, Feb 16, 2022 at 10:59:38PM -0800, Andres Freund wrote:
> On 2022-02-16 20:14:04 -0800, Nathan Bossart wrote:
>> >> - while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
>> >> + while (!ShutdownRequestPending &&
>> >> +(spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != 
>> >> NULL)
>> >
>> > Uh, huh? It strikes me as a supremely bad idea to have functions *silently*
>> > not do their jobs when ShutdownRequestPending is set, particularly without 
>> > a
>> > huge fat comment.
>>
>> The idea was to avoid delaying shutdown because we're waiting for the
>> custodian to finish relatively nonessential tasks.  Another option might be
>> to just exit immediately when the custodian receives a shutdown request.
> 
> I think we should just not do either of these and let the functions
> finish. For the cases where shutdown really needs to be immediate
> there's, uhm, immediate mode shutdowns.

Alright.

>> > Why does this not open us up to new xid wraparound issues? Before there 
>> > was a
>> > hard bound on how long these files could linger around. Now there's not
>> > anymore.
>>
>> Sorry, I'm probably missing something obvious, but I'm not sure how this
>> adds transaction ID wraparound risk.  These files are tied to LSNs, and
>> AFAIK they won't impact slots' xmins.
> 
> They're accessed by xid. The LSN is just for cleanup. Accessing files
> left over from a previous transaction with the same xid wouldn't be
> good - we'd read wrong catalog state for decoding...

Okay, that part makes sense to me.  However, I'm still confused about how
this is handled today and why moving cleanup to a separate auxiliary
process makes matters worse.  I've done quite a bit of reading, and I
haven't found anything that seems intended to prevent this problem.  Do you
have any pointers?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-17 09:17:04 -0500, Robert Haas wrote:
> Commit messages need to describe what the commit actually changes.
> Theoretical ideas are fine, but if I, as a committer who have done
> significant work in this area in the past, can't read the commit
> message and understand what is actually different, it's not a good
> commit message. I think you *really* need to put more effort into
> making your patches, and the emails about your patches, and the commit
> messages for your patches understandable to other people. Otherwise,
> waiting 3 months between when you post the patch and when you commit
> it means nothing. You can wait 10 years to commit and still get
> objections, if other people don't understand what you're doing.
>
> I would guess that's really the root of Andres's concern here.

Yes.


> I believe that both Andres and I are in favor of the kinds of things you
> want to do here *in principle*.

Yea. And I might even agree more with Peter than with you on some of the
contended design bits. But it's too hard to know right now, because too many
things are changed at once and the descriptions are high level. And often
about some vague ideal principles that are nearly impossible to concretely
disagree with.


> But in practice I feel like it's not working well, and thereby putting the
> project at risk. What if some day one of us needs to fix a bug in your code?
> It's not like VACUUM is some peripheral system where bugs aren't that
> critical -- and it's also not the case that the risk of introducing new bugs
> is low.  Historically, it's anything but.

+1

Greetings,

Andres Freund




Re: killing perl2host

2022-02-17 Thread Andres Freund
On 2022-02-17 13:08:01 -0500, Andrew Dunstan wrote:
> >> perhaps something like:
> >>
> >>
> >>     my $msystem = $ENV{MSYSTEM} || 'undef';
> >>
> >>     die "incompatible perl" if $Config{osname} eq 'msys' && $msystem ne
> >> 'MSYS';
> > Why tests MSYSTEM instead of $host_os?

> Is that available in check_modules.pl? AFAICT it's an unexported shell
> variable.

No, but it could just be passed to it? Or the test just implemented in
configure, as I suggested.

Greetings,

Andres Freund




Re: Time to drop plpython2?

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-16 23:14:46 -0800, Andres Freund wrote:
> >
> > Done. Curious how red the BF will turn out to be. Let's hope it's not
> > too bad.
>
> I've pinged the owners of the animals failing so far:

Now also pinged:
- curculio
- guaibasaurus
- snapper
- gadwall, takin


> - snakefly, massasauga

Fixed.


> - jay, trilobite, hippopotamus, avocet

Nothing yet.


> - myna, butterflyfish

Fixed, as noted by Michael on this thread.


> - rhinoceros

Joe replied that he is afk, looking into it tomorrow.


Greetings,

Andres Freund




Re: killing perl2host

2022-02-17 Thread Andrew Dunstan


On 2/17/22 12:12, Andres Freund wrote:
> Hi,
>
> On 2022-02-17 09:20:56 -0500, Andrew Dunstan wrote:
>> I don't think we have or have ever had a buildfarm animal targeting
>> msys. In general I think of msys as a build environment to create native
>> binaries. But if we want to support targeting msys we should have an
>> animal doing that.
> It's pretty much cygwin. Wouldn't hurt to have a dedicated animal though, I
> agree. We do have a dedicated path for it in configure.ac:
>
> case $host_os in
> ...
>   cygwin*|msys*) template=cygwin ;;
>
>
>
>>> I think this means we should do the msys test in configure, inside
>>>
>>> if test "$enable_tap_tests" = yes; then
>>>
>>> and verify that $host_os != msys.
>> config/check_modules.pl is probably the right place for the test, as it
>> will be running with the perl we should be testing, and is only called
>> if we configure with TAP tests enabled.
> Makes sense.
>
>
>> perhaps something like:
>>
>>
>>     my $msystem = $ENV{MSYSTEM} || 'undef';
>>
>>     die "incompatible perl" if $Config{osname} eq 'msys' && $msystem ne
>> 'MSYS';
> Why tests MSYSTEM instead of $host_os?



Is that available in check_modules.pl? AFAICT it's an unexported shell
variable.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: buildfarm warnings

2022-02-17 Thread Robert Haas
On Thu, Feb 17, 2022 at 12:08 PM Tom Lane  wrote:
> I wrote:
> > Justin Pryzby  writes:
> >> pg_basebackup.c:1261:35: warning: storing the address of local variable 
> >> archive_filename in progress_filename [-Wdangling-pointer=]
> >> => new in 23a1c6578 - looks like a real error
>
> > I saw that one a few days ago but didn't get around to looking
> > more closely yet.  It does look fishy, but it might be okay
> > depending on when the global variable can be accessed.
>
> I got around to looking at this, and it absolutely is a bug.
> The test scripts don't reach it, because they don't exercise -P
> mode, let alone -P -v which is what you need to get progress_report()
> to try to print the filename.  However, if you modify the test
> to do so, then you find that the output differs depending on
> whether or not you add "progress_filename = NULL;" at the bottom
> of CreateBackupStreamer().  So the variable is continuing to be
> referenced, not only after it goes out of scope within that
> function, but even after the function returns entirely.
> (Interestingly, the output isn't obvious garbage, at least not
> on my machine; so somehow the array's part of the stack is not
> getting overwritten very soon.)
>
> A quick-and-dirty fix is
>
> -   progress_filename = archive_filename;
> +   progress_filename = pg_strdup(archive_filename);
>
> but perhaps this needs more thought.  How long is that filename
> actually reasonable to show in the progress reports?

Man, you couldn't have asked a question that made me any happier.
Preserving that behavior was one of the most annoying parts of this
whole refactoring. The server always sends a series of tar files, but
the pre-existing behavior is that progress reports show the archive
name with -Ft and the name of the archive member with -Fp. I think
that's sort of useful, but it's certainly annoying from an
implementation perspective because we only know the name of the
archive member if we're extracting the tarfile, possibly after
decompressing it, whereas the name of the archive itself is easily
known. This results in annoying gymnastics to try to update the global
variable that we use to store this information from very different
places depending on what the user requested, and evidently I messed
that up, and also, why in the world does this code think that "more
global variables" is the right answer to every problem?

I'm not totally convinced that it's OK to just hit this progress
reporting stuff with a hammer and remove some functionality in the
interest of simplifying the code. But I will shed no tears if that's
the direction other people would like to go.

If not, I think that your quick-and-dirty fix is about right, except
that we probably need to do it every place where we set
progress_filename, and we should arrange to free the memory later.
With -Ft, you won't have enough archives to matter, but with -Fp, you
might have enough archive members to matter.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Per-table storage parameters for TableAM/IndexAM extensions

2022-02-17 Thread Sadhuprasad Patro
> On Sat, Feb 12, 2022 at 2:35 AM Robert Haas  wrote:
>>
>>
>> Imagine that I am using the "foo" tableam with "compression=lots" and
>> I want to switch to the "bar" AM which does not support that option.
>> If I remove the "compression=lots" option using a separate command,
>> the "foo" table AM may rewrite my whole table and decompress
>> everything. Then when I convert to the "bar" AM it's going to have to
>> be rewritten again. That's painful. I clearly need some way to switch
>> AMs without having to rewrite the table twice.
>
Agreed. Better to avoid multiple rewrites here. Thank you for figuring this out.


> You'd need to be able to do multiple things with one command e.g.

> ALTER TABLE mytab SET ACCESS METHOD baz RESET compression, SET
> preferred_fruit = 'banana';

+1
Silently dropping some options is not right and it may confuse users
too. So I would like to go
for the command you have suggested, where the user should be able to
SET & RESET multiple
options in a single command for an object.

Thanks & Regards
SadhuPrasad
http://www.enterprisedb.com




Re: CREATEROLE and role ownership hierarchies

2022-02-17 Thread Robert Haas
On Mon, Jan 31, 2022 at 1:57 PM Joshua Brindle
 wrote:
> This is precisely the use case I am trying to accomplish with this
> patchset, roughly:
>
> - An automated bot that creates users and adds them to the employees role
> - Bot cannot access any employee (or other roles) table data
> - Bot cannot become any employee
> - Bot can disable the login of any employee
>
> Yes there are attack surfaces around the fringes of login, etc but
> those can be mitigated with certificate authentication. My pg_hba
> would require any role in the employees role to use cert auth.
>
> This would adequately mitigate many threats while greatly enhancing
> user management.

So, where do we go from here?

I've been thinking about this comment a bit. On the one hand, I have
some reservations about the phrase "the use case I am trying to
accomplish with this patchset," because in the end, this is not your
patch set. It's not reasonable to complain that a patch someone else
wrote doesn't solve your problem; of course everyone writes patches to
solve their own problems, or those of their employer, not other
people's problems. And that's as it should be, else we will have few
contributors. On the other hand, to the extent that this patch set
makes things worse for a reasonable use case which you have in mind,
that's an entirely legitimate complaint.

After a bit of testing, it seems to me that as things stand today,
things are nearly perfect for the use case that you have in mind. I
would be interested to know whether you agree. If I set up an account
and give it CREATEROLE, it can create users, and it can put them into
the employees role, but it can't SET ROLE to any of those accounts. It
can also ALTER ROLE ... NOLOGIN on any of those accounts. The only gap
I see is that there are certain role-based flags which the CREATEROLE
account cannot set: SUPERUSER, BYPASSRLS, REPLICATION. You might
prefer a system where your bot account had the option to grant those
privileges also, and I think that's a reasonable thing to want.
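
For illustration, the kind of setup being described, with hypothetical
role names:

    CREATE ROLE employees NOLOGIN;
    CREATE ROLE userbot LOGIN CREATEROLE;

    -- connected as userbot:
    CREATE ROLE alice LOGIN IN ROLE employees;  -- allowed
    ALTER ROLE alice NOLOGIN;                   -- allowed
    SET ROLE alice;                             -- fails, userbot is not a member
    CREATE ROLE mallory SUPERUSER;              -- fails, requires superuser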

However, I *also* think it's reasonable to want an account that can
create roles but can't give to those roles membership in roles that it
does not itself possess. Likewise, I think it's reasonable to want an
account that can only drop roles which that account itself created.
These kinds of requirements stem from a different use case than what
you are talking about here, but they seem like fine things to want,
and as far as I know we have pretty broad agreement that they are
reasonable. It seems extremely difficult to make a convincing argument
that this is not a thing which anyone should want to block:

rhaas=> create role bob role pg_execute_server_program;
CREATE ROLE

Honestly, that seems like a big yikes from here. How is it OK to block
"create role bob superuser" yet allow that command? I'm inclined to
think that's just broken. Even if the role were pg_write_all_data
rather than pg_execute_server_program, it's still a heck of a lot of
power to be handing out, and I don't see how anyone could make a
serious argument that we shouldn't have an option to restrict that.

Let me separate the two features that I just mentioned and talk about
them individually:

1. Don't allow a CREATEROLE user to give out membership in groups
which that user does not possess. Leaving aside the details of any
previously-proposed patches and just speaking theoretically, how can
this be implemented? I can think of a few ideas. We could (1A) just
change CREATEROLE to work that way, but IIUC that would break the use
case you outline here, so I guess that's off the table unless I am
misunderstanding the situation. We could also (1B) add a second role
attribute with a different name, like, err, CREATEROLEWEAKLY, that
behaves in that way, leaving the existing one untouched. But we could
also take it a lot further, because someone might want to let an
account hand out a set of privileges which corresponds neither to the
privileges of that account nor to the full set of available
privileges. That leads to another idea: (1C) implement an in-database
system that lets you specify which privileges an account has, and,
separately, which ones it can assign to others. I am skeptical of that
idea because it seems really, really complicated, not only from an
implementation standpoint but even just from a user-experience
standpoint. Suppose user 'userbot' has rights to grant a suitable set
of groups to the new users that it creates -- but then someone creates
a new group. Should that also be added to the things 'userbot' can
grant or not? What if we have 'userbot1' through 'userbot6' and each
of them can grant a different set of roles? I wouldn't mind (1D)
providing a hook that allows the system administrator to install a
loadable module that can enforce any rules it likes, but it seems way
too complicated to me to expose all of this configuration as SQL,
especially because for what I want to do, either (1A) or (1B) is
adequate, and (1B) is a LOT 

Re: [PATCH] Accept IP addresses in server certificate SANs

2022-02-17 Thread Jacob Champion
On Tue, 2022-02-15 at 15:16 +0900, Kyotaro Horiguchi wrote:
> (This needs rebasing)

Done in v6, attached.

> # I forgot to mention that, the test fails for me even without the
> # change.  I didn't checked what is wrong there, though.

Ah. We should probably figure that out, then -- what failures do you
see?

> Mmm. after the end of the loop, rc is non-zero only when the loop has
> exited by the break and otherwise rc is zero. Why isn't it equivalent
> to setting check_cn to false at the break?

check_cn can be false if rc is zero, too; it means that we found a SAN
of the correct type but it didn't match.

> Anyway, apart from that detail, I reconfirmed the spec the patch is
> going to implement.
> 
>   * If connhost contains a DNS name, and the certificate's SANs contain any
>   * dNSName entries, then we'll ignore the Subject Common Name entirely;
>   * otherwise, we fall back to checking the CN. (This behavior matches the
>   * RFC.)
> 
> Sure.
> 
>   * If connhost contains an IP address, and the SANs contain iPAddress
>   * entries, we again ignore the CN. Otherwise, we allow the CN to match,
>   * EVEN IF there is a dNSName in the SANs. (RFC 6125 prohibits this: "A
>   * client MUST NOT seek a match for a reference identifier of CN-ID if the
>   * presented identifiers include a DNS-ID, SRV-ID, URI-ID, or any
>   * application-specific identifier types supported by the client.")
> 
> Actually the patch searches for a match of IP address connhost from
> dNSName SANs even if iPAddress SANs exist.  I think we've not
> explicitly defined the behavior in that case.

That's a good point; I didn't change the prior behavior. I feel more
comfortable leaving that check, since it is technically possible to
push something that looks like an IP address into a dNSName SAN. We
should probably make an explicit decision on that, as you say.

But I don't think that contradicts the code comment, does it? The
comment is just talking about CN fallback scenarios. If you find a
match in a dNSName, there's no reason to fall back to the CN.

> I supposed that we only
> be deviant in the case "IP address connhost and no SANs of any type
> exists". What do you think about it?

We fall back in the case of "IP address connhost and dNSName SANs
exist", which is prohibited by that part of RFC 6125. I don't think we
deviate in the case you described; can you explain further?

> - For the certificate that have only dNSNames or no SANs presented, we
>   serach for a match from all dNSNames if any or otherwise try CN
>   regardless of the type of connhost.

Correct. (I don't find that way of dividing up the cases very
intuitive, though.)

> - Otherwise (the cert has at least one iPAddress SANs) we follow the RFCs.
> 
>   - For IP-addr connhost, we search only the iPAddress SANs.

We search the dNSNames as well, as you pointed out above. But we don't
fall back to the CN.

>   - For DNSName connhost, we search only dNSName SANs if any or
> otherwise try CN.

Effectively, yes. (We call the IP address verification functions too,
to get alt_name, but they can't match. If that's too confusing, we'd
need to pull the alt_name handling up out of the verification layer.)

> Honestly I didn't consider to that detail. On second thought, with
> this specification we cannot decide the behavior unless we scanned all
> SANs.

Right.

> Maybe we can find an elegant implement but I don't think people
> here would welcome even that level of complexity needed only for that
> dubious existing use case.

Which use case do you mean?

> What do you think about this?  And I'd like to hear from others.

I think we need to decide whether or not to keep the current "IP
address connhost can match a dNSName SAN" behavior, and if so I need to
add it to the test cases. (And we need to figure out why the tests are
failing in your build, of course.)

Thanks!
--Jacob
From fa93bc48654164d36fc163ee84d8f9b78fe29e96 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Wed, 24 Nov 2021 14:46:11 -0800
Subject: [PATCH v6 1/3] Move inet_net_pton() to src/port

This will be helpful for IP address verification in libpq.
---
 src/backend/utils/adt/Makefile  |  1 -
 src/include/port.h  |  3 +++
 src/include/utils/builtins.h|  4 
 src/port/Makefile   |  1 +
 src/{backend/utils/adt => port}/inet_net_pton.c | 16 +++-
 src/tools/msvc/Mkvcbuild.pm |  2 +-
 6 files changed, 20 insertions(+), 7 deletions(-)
 rename src/{backend/utils/adt => port}/inet_net_pton.c (96%)

diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 41b486bcef..d173d52157 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -43,7 +43,6 @@ OBJS = \
 	geo_selfuncs.o \
 	geo_spgist.o \
 	inet_cidr_ntop.o \
-	inet_net_pton.o \
 	int.o \
 	int8.o \
 	json.o \
diff --git a/src/include/port.h b/src/include/port.h
index 

Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-17 Thread Julien Rouhaud
Hi,

On Thu, Feb 17, 2022 at 10:39:02PM +0530, Nitin Jadhav wrote:
> > > Thank you for sharing the information.  'triggering backend PID' (int)
> > > - can be stored without any problem.
> >
> > There can be multiple processes triggering a checkpoint, or at least 
> > wanting it
> > to happen or happen faster.
> 
> Yes. There can be multiple processes but there will be one checkpoint
> operation at a time. So the backend PID corresponds to the current
> checkpoint operation. Let me know if I am missing something.

If there's a timed checkpoint triggered and then someone calls
pg_start_backup(), which then waits for the end of the current checkpoint
(possibly after changing the flags), I think the view should reflect that in
some way.  Maybe storing an array of (pid, flags) is too much, but at least a
counter with the number of processes actively waiting for the end of the
checkpoint.
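Purely as a sketch of that counter idea (hypothetical names; the real field
would presumably live in the checkpointer's shared memory rather than in a
process-local atomic):

#include <stdatomic.h>
#include <stdio.h>

/* hypothetical: would be a field in checkpointer shared state */
static atomic_int ckpt_num_waiting;

static void
request_checkpoint_and_wait(void)
{
    atomic_fetch_add(&ckpt_num_waiting, 1);
    /* ... signal the checkpointer, then sleep until the checkpoint ends ... */
    atomic_fetch_sub(&ckpt_num_waiting, 1);
}

int
main(void)
{
    request_checkpoint_and_wait();
    /* the progress view would simply expose the current counter value */
    printf("waiting processes = %d\n", atomic_load(&ckpt_num_waiting));
    return 0;
}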

> > > 'checkpoint or restartpoint?'
> >
> > Do you actually need to store that?  Can't it be inferred from
> > pg_is_in_recovery()?
> 
> AFAIK we cannot use pg_is_in_recovery() to predict whether it is a
> checkpoint or restartpoint because if the system exits from recovery
> mode during restartpoint then any query to pg_stat_progress_checkpoint
> view will return it as a checkpoint which is ideally not correct. Please
> correct me if I am wrong.

Recovery ends with an end-of-recovery checkpoint that has to finish before the
promotion can happen, so I don't think that a restart can still be in progress
if pg_is_in_recovery() returns false.




Re: killing perl2host

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-17 09:20:56 -0500, Andrew Dunstan wrote:
> I don't think we have or have ever had a buildfarm animal targeting
> msys. In general I think of msys as a build environment to create native
> binaries. But if we want to support targeting msys we should have an
> animal doing that.

It's pretty much cygwin. Wouldn't hurt to have a dedicated animal though, I
agree. We do have a dedicated path for it in configure.ac:

case $host_os in
...
  cygwin*|msys*) template=cygwin ;;



> > I think this means we should do the msys test in configure, inside
> >
> > if test "$enable_tap_tests" = yes; then
> >
> > and verify that $host_os != msys.

> config/check_modules.pl is probably the right place for the test, as it
> will be running with the perl we should be testing, and is only called
> if we configure with TAP tests enabled.

Makes sense.


> perhaps something like:
> 
> 
>     my $msystem = $ENV{MSYSTEM} || 'undef';
> 
>     die "incompatible perl" if $Config{osname} eq 'msys' && $msystem ne
> 'MSYS';

Why test MSYSTEM instead of $host_os?

Greetings,

Andres Freund




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-17 Thread Nitin Jadhav
> > Thank you for sharing the information.  'triggering backend PID' (int)
> > - can be stored without any problem.
>
> There can be multiple processes triggering a checkpoint, or at least wanting 
> it
> to happen or happen faster.

Yes. There can be multiple processes but there will be one checkpoint
operation at a time. So the backend PID corresponds to the current
checkpoint operation. Let me know if I am missing something.

> > 'checkpoint or restartpoint?'
>
> Do you actually need to store that?  Can't it be inferred from
> pg_is_in_recovery()?

AFAIK we cannot use pg_is_in_recovery() to predict whether it is a
checkpoint or restartpoint because if the system exits from recovery
mode during restartpoint then any query to pg_stat_progress_checkpoint
view will return it as a checkpoint which is ideally not correct. Please
correct me if I am wrong.

Thanks & Regards,
Nitin Jadhav

On Thu, Feb 17, 2022 at 4:35 PM Julien Rouhaud  wrote:
>
> Hi,
>
> On Thu, Feb 17, 2022 at 12:26:07PM +0530, Nitin Jadhav wrote:
> >
> > Thank you for sharing the information.  'triggering backend PID' (int)
> > - can be stored without any problem.
>
> There can be multiple processes triggering a checkpoint, or at least wanting 
> it
> to happen or happen faster.
>
> > 'checkpoint or restartpoint?'
>
> Do you actually need to store that?  Can't it be inferred from
> pg_is_in_recovery()?




Re: buildfarm warnings

2022-02-17 Thread Tom Lane
I wrote:
> Justin Pryzby  writes:
>> pg_basebackup.c:1261:35: warning: storing the address of local variable 
>> archive_filename in progress_filename [-Wdangling-pointer=]
>> => new in 23a1c6578 - looks like a real error

> I saw that one a few days ago but didn't get around to looking
> more closely yet.  It does look fishy, but it might be okay
> depending on when the global variable can be accessed.

I got around to looking at this, and it absolutely is a bug.
The test scripts don't reach it, because they don't exercise -P
mode, let alone -P -v which is what you need to get progress_report()
to try to print the filename.  However, if you modify the test
to do so, then you find that the output differs depending on
whether or not you add "progress_filename = NULL;" at the bottom
of CreateBackupStreamer().  So the variable is continuing to be
referenced, not only after it goes out of scope within that
function, but even after the function returns entirely.
(Interestingly, the output isn't obvious garbage, at least not
on my machine; so somehow the array's part of the stack is not
getting overwritten very soon.)

A quick-and-dirty fix is

-   progress_filename = archive_filename;
+   progress_filename = pg_strdup(archive_filename);

but perhaps this needs more thought.  How long is that filename
actually reasonable to show in the progress reports?
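For anyone following along, the dangling-pointer part boils down to this
standalone illustration (plain strdup standing in for pg_strdup):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *progress_filename;  /* read after the function returns */

static void
create_backup_streamer(void)
{
    char        archive_filename[32];   /* local: gone once this returns */

    snprintf(archive_filename, sizeof(archive_filename), "base.tar");

    /* Buggy: progress_filename = archive_filename; would dangle. */
    progress_filename = strdup(archive_filename);   /* quick-and-dirty fix */
}

int
main(void)
{
    create_backup_streamer();
    printf("%s\n", progress_filename);  /* safe only because of the copy */
    return 0;
}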

regards, tom lane




Re: initdb / bootstrap design

2022-02-17 Thread Andrew Dunstan


On 2/17/22 10:36, Robert Haas wrote:
> On Wed, Feb 16, 2022 at 2:50 PM Andres Freund  wrote:
>>> initdb is already plenty fast enough for any plausible production
>>> usage; it's cases like check-world where we wish it were faster.
>> It's not just our own usage though. I've seen it be a noticeable time in test
>> suites of applications using postgres.
> I'd just like to second this point.
>
> I was working on an EDB proprietary software project for a while
> which, because of the nature of what it did, ran initdb frequently in
> its test suite. And it was unbelievably painful. The test suite just
> took forever. Fortunately, it always ran initdb with the same options,
> so somebody invented a mechanism for doing one initdb and saving the
> results someplace and just copying them every time, and it made a huge
> difference. Before that experience, I probably would have agreed with
> the idea that there was no need at all for initdb to be any faster
> than it is already. But, like, what if we'd been trying to run initdb
> with different options for different tests, the way the core code
> does? That seems like an entirely plausible thing to want to do, and
> then caching becomes a real pain.


Indeed. When initdb.c was written the testing landscape was very
different both for the community and for projects that used Postgres. So
we need to catch up.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: postgres_fdw: commit remote (sub)transactions in parallel during pre-commit

2022-02-17 Thread Fujii Masao




On 2022/02/11 21:59, Etsuro Fujita wrote:

I tweaked comments/docs a little bit as well.


Thanks for updating the patches!

I reviewed the 0001 patch. It looks good to me except for the following minor
things. If these are addressed, I think that the 0001 patch can be marked as
ready for committer.

+* Also determine to commit (sub)transactions opened on the remote 
server
+* in parallel at (sub)transaction end.

Like the comment "Determine whether to keep the connection ...", "determine to commit" 
should be "determine whether to commit"?

"remote server" should be "remote servers"?


+   curlevel = GetCurrentTransactionNestLevel();
+   snprintf(sql, sizeof(sql), "RELEASE SAVEPOINT s%d", curlevel);

Why does pgfdw_finish_pre_subcommit_cleanup() need to call
GetCurrentTransactionNestLevel() and construct the "RELEASE SAVEPOINT" query
string again? pgfdw_subxact_callback() already does both, and we could probably
make it pass either of them to pgfdw_finish_pre_subcommit_cleanup() as an argument.
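Something along these lines, purely as a sketch with made-up names and
signatures (the real functions obviously carry the connection state as well):

#include <stdio.h>

static void
pre_subcommit_cleanup(const char *release_sql)
{
    /* would send release_sql to each involved connection, in parallel */
    printf("would send: %s\n", release_sql);
}

static void
subxact_callback(int curlevel)
{
    char        sql[100];

    snprintf(sql, sizeof(sql), "RELEASE SAVEPOINT s%d", curlevel);
    /* built once here, so the callee no longer has to rebuild it */
    pre_subcommit_cleanup(sql);
}

int
main(void)
{
    subxact_callback(2);
    return 0;
}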


+   This option controls whether postgres_fdw commits
+   remote (sub)transactions opened on a foreign server in a local
+   (sub)transaction in parallel when the local (sub)transaction commits.

"a foreign server" should be "foreign servers"?

"in a local (sub)transaction" part seems not to be necessary.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: pgsql: pg_upgrade: Preserve relfilenodes and tablespace OIDs.

2022-02-17 Thread Robert Haas
On Sun, Feb 13, 2022 at 6:51 AM Christoph Berg  wrote:
> Re: Robert Haas
> > pg_upgrade: Preserve relfilenodes and tablespace OIDs.
>
> > src/bin/pg_dump/pg_dumpall.c   |   3 +
>
> --- a/src/bin/pg_dump/pg_dumpall.c
> +++ b/src/bin/pg_dump/pg_dumpall.c
> @@ -1066,6 +1066,9 @@ dumpTablespaces(PGconn *conn)
> /* needed for buildACLCommands() */
> fspcname = pg_strdup(fmtId(spcname));
>
> +   appendPQExpBufferStr(buf, "\n-- For binary upgrade, must 
> preserve pg_table
> +   appendPQExpBuffer(buf, "SELECT 
> pg_catalog.binary_upgrade_set_next_pg_table
>
> This needs to be guarded with "if (binary_upgrade)".

Right. Sorry about that, and sorry for not responding sooner also. Fix
pushed now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Fix out-of-bouds access (src/common/wchar.c)

2022-02-17 Thread Daniel Gustafsson
> On 17 Feb 2022, at 16:05, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> Question remains though, should src/interfaces/ecpg/test/sql/sqlda.pgc really
>> be using WIN32 and not _WIN32, or doesn't it matter?  (or does it only matter
>> for consistency?) WIN32 and _WIN32 aren't very informative searchterms to use
>> for finding more information.
> 
> I find this in src/include/port/win32.h:
> 
> /*
> * We always rely on the WIN32 macro being set by our build system,
> * but _WIN32 is the compiler pre-defined macro. So make sure we define
> * WIN32 whenever _WIN32 is set, to facilitate standalone building.
> */
> #if defined(_WIN32) && !defined(WIN32)
> #define WIN32
> #endif
> 
> So for most of our code it shouldn't matter.  However, I'm not sure
> that the ECPG test cases include our port.h --- they probably shouldn't
> if they're to reflect actual use-cases.  [ pokes around... ]  See
> 517bf2d91 which added this code.

Judging by the info in the Stack Overflow post and the commit message for
517bf2d91, it seems perfectly correct to use _WIN32 in the ECPG test.  Thanks!

--
Daniel Gustafsson   https://vmware.com/





Re: initdb / bootstrap design

2022-02-17 Thread Robert Haas
On Wed, Feb 16, 2022 at 2:50 PM Andres Freund  wrote:
> > initdb is already plenty fast enough for any plausible production
> > usage; it's cases like check-world where we wish it were faster.
>
> It's not just our own usage though. I've seen it be a noticeable time in test
> suites of applications using postgres.

I'd just like to second this point.

I was working on an EDB proprietary software project for a while
which, because of the nature of what it did, ran initdb frequently in
its test suite. And it was unbelievably painful. The test suite just
took forever. Fortunately, it always ran initdb with the same options,
so somebody invented a mechanism for doing one initdb and saving the
results someplace and just copying them every time, and it made a huge
difference. Before that experience, I probably would have agreed with
the idea that there was no need at all for initdb to be any faster
than it is already. But, like, what if we'd been trying to run initdb
with different options for different tests, the way the core code
does? That seems like an entirely plausible thing to want to do, and
then caching becomes a real pain.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Fix out-of-bouds access (src/common/wchar.c)

2022-02-17 Thread Tom Lane
Daniel Gustafsson  writes:
> Question remains though, should src/interfaces/ecpg/test/sql/sqlda.pgc really
> be using WIN32 and not _WIN32, or doesn't it matter?  (or does it only matter
> for consistency?) WIN32 and _WIN32 aren't very informative searchterms to use
> for finding more information.

I find this in src/include/port/win32.h:

/*
 * We always rely on the WIN32 macro being set by our build system,
 * but _WIN32 is the compiler pre-defined macro. So make sure we define
 * WIN32 whenever _WIN32 is set, to facilitate standalone building.
 */
#if defined(_WIN32) && !defined(WIN32)
#define WIN32
#endif

So for most of our code it shouldn't matter.  However, I'm not sure
that the ECPG test cases include our port.h --- they probably shouldn't
if they're to reflect actual use-cases.  [ pokes around... ]  See
517bf2d91 which added this code.
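If anyone wants to see the distinction concretely, a trivial standalone check
does it; build it once with and once without -DWIN32 supplied by the build
system:

#include <stdio.h>

int
main(void)
{
#ifdef _WIN32
    puts("_WIN32 defined (compiler predefined macro)");
#else
    puts("_WIN32 not defined");
#endif
#ifdef WIN32
    puts("WIN32 defined (supplied by the build system, or via win32.h)");
#else
    puts("WIN32 not defined");
#endif
    return 0;
}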

regards, tom lane




Re: do only critical work during single-user vacuum?

2022-02-17 Thread Robert Haas
On Wed, Feb 16, 2022 at 10:08 PM Peter Geoghegan  wrote:
> > 6. Sometimes the user decides to run VACUUM FULL instead of plain
> > VACUUM because it sounds better.
>
> It's a pity that the name suggests otherwise. If only we'd named it
> something that suggests "option of last resort". Oh well.

Unfortunately, such a name would also be misleading, just in a
different way. It is really not at all difficult to have a workload
that demands routine use of VACUUM FULL. I suppose technically it is a
last resort in such situations, because what would you resort to after
trying VF? But it's not like some kind of in-emergency-break-glass
kind of thing, it's just the right tool for the job.

Some years ago I worked with a customer who had a table that was being
used as an update-heavy queue. I don't remember all the details any
more, but I think the general pattern was that they would insert rows,
update them A TON, and then eventually delete them. And they got
really bad table bloat, because vacuum just wasn't running often
enough to keep up. Reducing autovacuum_naptime to 15s fixed the issue,
fortunately, but I was initially thinking that it might be completely
unfixable, because what if they'd also been running a series of
4-minute-long reporting queries in a loop on some other table? More
frequent vacuuming wouldn't have helped then, because xmin would not
have been able to advance until the current instance of the reporting
query finished, and then vacuuming more often would have done nothing
useful. I think, anyway.

That's just one example that comes to mind. I think there are lots of
workloads where it's simply not possible to make VACUUM keep up.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: PGEventProcs must not be allowed to break libpq

2022-02-17 Thread Tom Lane
Alvaro Herrera  writes:
> Not really related to this complaint and patch, but as far as I can see,
> libpq events go completely untested in the core source.  Maybe we should
> come up with a test module or something?

Yeah, I suppose.  The libpq part of it is pretty simple, but still...
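As a starting point, the event proc in such a test module could be as trivial
as counting callbacks; an untested sketch:

#include <stdio.h>
#include "libpq-fe.h"
#include "libpq-events.h"

static int  event_counts[PGEVT_RESULTDESTROY + 1];

static int
count_events(PGEventId evtId, void *evtInfo, void *passThrough)
{
    (void) evtInfo;
    (void) passThrough;
    event_counts[evtId]++;
    return 1;                   /* nonzero means success */
}

int
main(void)
{
    PGconn     *conn = PQconnectdb("");

    if (!PQregisterEventProc(conn, count_events, "count_events", NULL))
        fprintf(stderr, "could not register event proc\n");

    PQclear(PQexec(conn, "SELECT 1"));
    PQfinish(conn);

    printf("PGEVT_RESULTCREATE fired %d time(s)\n",
           event_counts[PGEVT_RESULTCREATE]);
    return 0;
}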

regards, tom lane




Re: PGEventProcs must not be allowed to break libpq

2022-02-17 Thread Alvaro Herrera
Not really related to this complaint and patch, but as far as I can see,
libpq events go completely untested in the core source.  Maybe we should
come up with a test module or something?

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/




Re: adding 'zstd' as a compression algorithm

2022-02-17 Thread Robert Haas
On Wed, Feb 16, 2022 at 11:34 PM Michael Paquier  wrote:
> Thanks.  This looks pretty much right, except for two things that I
> have taken the freedom to fix as of the v3 attached.

Ah, OK, cool. It seems like we have consensus on this being a good
direction, so I plan to commit this later today unless objections or
additional review comments turn up.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: killing perl2host

2022-02-17 Thread Andrew Dunstan


On 2/16/22 21:17, Andres Freund wrote:
> Hi,
>
> On 2022-02-16 14:42:30 -0800, Andres Freund wrote:
>> On February 16, 2022 1:10:35 PM PST, Andrew Dunstan  
>> wrote:
>>> So something like this in Utils.pm:
>>>
>>>
>>> die "Msys targeted perl is unsupported for running TAP tests" if
>>> $Config{osname}eq 'msys';
>> I don't think we should reject msys in general - it's fine as long as the
>> target is msys, no? Msys includes postgres fwiw...



I don't think we have or have ever had a buildfarm animal targeting
msys. In general I think of msys as a build environment to create native
binaries. But if we want to support targeting msys we should have an
animal doing that.


> I think this means we should do the msys test in configure, inside
>
> if test "$enable_tap_tests" = yes; then
>
> and verify that $host_os != msys.
>


config/check_modules.pl is probably the right place for the test, as it
will be running with the perl we should be testing, and is only called
if we configure with TAP tests enabled.

perhaps something like:


    my $msystem = $ENV{MSYSTEM} || 'undef';

    die "incompatible perl" if $Config{osname} eq 'msys' && $msystem ne
'MSYS';


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM

2022-02-17 Thread Robert Haas
On Wed, Feb 16, 2022 at 10:43 PM Peter Geoghegan  wrote:
> How can you be surprised that I committed 44fa8488? It's essentially
> the same patch as the first version, posted November 22 -- almost 3
> months ago. And it's certainly not a big patch (though it is
> complicated).

Let's back up a minute and talk about the commit of $SUBJECT. The
commit message contains a Discussion link to this thread. This thread,
at the time you put that link in there, had exactly one post: from
you. That's not much of a discussion, although I do acknowledge that
sometimes we commit things that have bugs, and those bugs need to be
fixed even if nobody has responded.

That brings us to the question of whether 44fa8488 was improvidently
committed. I don't know the answer to that question, and here's why:

> The commit message is high level.

I would say it differently: I think the commit message does a poor job
describing what the commit actually does. For example, it says nothing
about changing VACUUM to always scan the last page of every heap
relation. This whole thread is about fixing a problem that was caused
by a significant behavior change that was *not even mentioned* in the
original commit message. If it had been mentioned, I likely would have
complained, because it's very similar to behavior that Tom eliminated
in b503da135ab0bdd97ac3d3f720c35854e084e525, which he did because it
was distorting reltuples estimates.

Commit messages need to describe what the commit actually changes.
Theoretical ideas are fine, but if I, as a committer who have done
significant work in this area in the past, can't read the commit
message and understand what is actually different, it's not a good
commit message. I think you *really* need to put more effort into
making your patches, and the emails about your patches, and the commit
messages for your patches understandable to other people. Otherwise,
waiting 3 months between when you post the patch and when you commit
it means nothing. You can wait 10 years to commit and still get
objections, if other people don't understand what you're doing.

I would guess that's really the root of Andres's concern here. I
believe that both Andres and I are in favor of the kinds of things you
want to do here *in principle*. But in practice I feel like it's not
working well, and thereby putting the project at risk. What if some
day one of us needs to fix a bug in your code? It's not like VACUUM is
some peripheral system where bugs aren't that critical -- and it's
also not the case that the risk of introducing new bugs is low.
Historically, it's anything but.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM

2022-02-17 Thread Justin Pryzby
On Wed, Feb 16, 2022 at 07:43:09PM -0800, Peter Geoghegan wrote:
> > I also have a hard time making heads or tails out of the commit message of
> > 44fa84881ff. It's quite long without being particularly descriptive. The
> > commit just changes a lot of things at once, making it hard to precisely
> > describe and very hard to review and understand.
> 
> The commit message is high level. The important point is that we can
> more or less treat all scanned_pages the same, without generally
> caring about whether or not a cleanup lock could be acquired (since
> the exceptions where that isn't quite true are narrow and rare). That
> is the common theme, for everything.

As for myself, I particularly appreciated the clarity and detail of this commit
message.  (I have looked at the associated change a bit, but not deeply).

Thanks,
-- 
Justin




Re: postgres_fdw and skip locked

2022-02-17 Thread Ashutosh Bapat
On Wed, Feb 16, 2022 at 8:38 PM Alexander Pyhalov
 wrote:
>
> Ashutosh Bapat wrote on 2022-02-16 16:40:
> > On Mon, Feb 14, 2022 at 4:23 PM Alexander Pyhalov
> >  wrote:
> >>
> >> Hi.
> >>
> >> Now select ... for update ... [skip locked|nowait] options are not
> >> pushed down to remote servers. I see the only reason is that we can
> >> speak to pre-9.5 server, which doesn't understand skip locked option.
> >> Are there any other possible issues? Should we add foreign table
> >> option
> >> to control this behavior?
> >
> > Should we always push these clauses if remote server's version is
> > newer than 9.5? There are quite a few options already. It will be good
> > not to add one more.
> >
> > I see that these options will work for all kinds of relations. So no
> > problem if foreign table is pointing to something other than a table.
>
> Hi.
> The issue is that we perform deparsing while planning; we haven't
> connected to the server yet.
> Are there any ideas on how to find out its version without specifying it,
> for example, in server options?

Yes, you are right. I wish foreign servers, especially PostgreSQL, were
more integrated and that we could defer deparsing until the execution phase.
But that's out of scope for this patch.
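If the remote version were known at deparse time, the check itself would be
the easy part; a purely hypothetical sketch, not the actual deparser code:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* hypothetical helper: append the locking clause for servers that support it */
static void
append_locking_clause(char *buf, size_t buflen, int remoteversion, bool skip_locked)
{
    strncat(buf, " FOR UPDATE", buflen - strlen(buf) - 1);
    if (skip_locked && remoteversion >= 90500)  /* SKIP LOCKED needs 9.5+ */
        strncat(buf, " SKIP LOCKED", buflen - strlen(buf) - 1);
}

int
main(void)
{
    char        query[128] = "SELECT * FROM jobs";

    append_locking_clause(query, sizeof(query), 140005, true);
    puts(query);
    return 0;
}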

-- 
Best Wishes,
Ashutosh Bapat




Re: [PATCH] Fix out-of-bouds access (src/common/wchar.c)

2022-02-17 Thread Ranier Vilela
On Thu, Feb 17, 2022 at 10:18 AM Daniel Gustafsson wrote:

> > On 17 Feb 2022, at 13:59, Ranier Vilela  wrote:
> >
> > On Thu, Feb 17, 2022 at 9:52 AM Daniel Gustafsson wrote:
> > > On 17 Feb 2022, at 13:19, Ranier Vilela  ranier...@gmail.com>> wrote:
> >
> > > 1. One #ifdef with a mistake, the correct is _WIN32 and not WIN32.
> >
> > Can you elaborate on this, we are using WIN32 pretty extensively in the
> code:
> >
> >  $ git grep "if[n]\{0,1\}def WIN32$"|wc -l
> >  511
> >  $ git grep "if[n]\{0,1\}def _WIN32$"|wc -l
> >2
> >
> > The _WIN32 cases are in the same ECPG testcase.
> >
> > Why would _WIN32 be correct in this case?
> > Sorry, my fault.
> >
> > I only use _WIN32 and I jumped to conclusions very quickly.
>
> Question remains though, should src/interfaces/ecpg/test/sql/sqlda.pgc really
> be using WIN32 and not _WIN32, or doesn't it matter?  (or does it only matter
> for consistency?) WIN32 and _WIN32 aren't very informative searchterms to use
> for finding more information.
>
According to Stack Overflow:
https://stackoverflow.com/questions/662084/whats-the-difference-between-the-win32-and-win32-defines-in-c
WIN32 -> SDK
_WIN32 -> Compiler (MSVC)

Is WIN32 0 with MinGW (Windows)?

For consistency, I think we should use only WIN32, since Postgres uses it
extensively.

regards,

Ranier Vilela


Re: [PATCH] Fix out-of-bouds access (src/common/wchar.c)

2022-02-17 Thread Daniel Gustafsson
> On 17 Feb 2022, at 13:59, Ranier Vilela  wrote:
> 
> On Thu, Feb 17, 2022 at 9:52 AM Daniel Gustafsson wrote:
> > On 17 Feb 2022, at 13:19, Ranier Vilela  > > wrote:
> 
> > 1. One #ifdef with a mistake, the correct is _WIN32 and not WIN32.
> 
> Can you elaborate on this, we are using WIN32 pretty extensively in the code:
> 
>  $ git grep "if[n]\{0,1\}def WIN32$"|wc -l
>  511
>  $ git grep "if[n]\{0,1\}def _WIN32$"|wc -l
>2
> 
> The _WIN32 cases are in the same ECPG testcase.
> 
> Why would _WIN32 be correct in this case?
> Sorry, my fault.
> 
> I only use _WIN32 and I jumped to conclusions very quickly.

Question remains though, should src/interfaces/ecpg/test/sql/sqlda.pgc really
be using WIN32 and not _WIN32, or doesn't it matter?  (or does it only matter
for consistency?) WIN32 and _WIN32 aren't very informative searchterms to use
for finding more information.

--
Daniel Gustafsson   https://vmware.com



Re: [PATCH] Fix out-of-bouds access (src/common/wchar.c)

2022-02-17 Thread Ranier Vilela
On Thu, Feb 17, 2022 at 9:52 AM Daniel Gustafsson wrote:

> > On 17 Feb 2022, at 13:19, Ranier Vilela  wrote:
>
> > 1. One #ifdef with a mistake, the correct is _WIN32 and not WIN32.
>
> Can you elaborate on this, we are using WIN32 pretty extensively in the
> code:
>
>  $ git grep "if[n]\{0,1\}def WIN32$"|wc -l
>  511
>  $ git grep "if[n]\{0,1\}def _WIN32$"|wc -l
>2
>
> The _WIN32 cases are in the same ECPG testcase.
>
> Why would _WIN32 be correct in this case?
>
Sorry, my fault.

I only use _WIN32 and I jumped to conclusions very quickly.

regards,
Ranier Vilela


Re: [PATCH] Fix out-of-bouds access (src/common/wchar.c)

2022-02-17 Thread Daniel Gustafsson
> On 17 Feb 2022, at 13:19, Ranier Vilela  wrote:

> 1. One #ifdef with a mistake, the correct is _WIN32 and not WIN32.

Can you elaborate on this, we are using WIN32 pretty extensively in the code:

 $ git grep "if[n]\{0,1\}def WIN32$"|wc -l
 511
 $ git grep "if[n]\{0,1\}def _WIN32$"|wc -l
   2

The _WIN32 cases are in the same ECPG testcase.

Why would _WIN32 be correct in this case?

--
Daniel Gustafsson   https://vmware.com/





[PATCH] Fix possible minor memory leak (src/backend/catalog/heap.c)

2022-02-17 Thread Ranier Vilela
Hi,

Per Coverity.

Maybe this is a mistake, but is it necessary to free the memory allocated
by nodeToString() or not?

If so, the attached patch fixes it.

regards,

Ranier Vilela


fix_minor_memory_leak_heap.patch
Description: Binary data

