Re: refactoring relation extension and BufferAlloc(), faster COPY

2023-02-28 Thread Andres Freund
Hi,

On 2023-02-21 17:33:31 +0100, Alvaro Herrera wrote:
> On 2023-Feb-21, Heikki Linnakangas wrote:
> 
> > > +static BlockNumber
> > > +BulkExtendSharedRelationBuffered(Relation rel,
> > > +                                 SMgrRelation smgr,
> > > +                                 bool skip_extension_lock,
> > > +                                 char relpersistence,
> > > +                                 ForkNumber fork, ReadBufferMode mode,
> > > +                                 BufferAccessStrategy strategy,
> > > +                                 uint32 *num_pages,
> > > +                                 uint32 num_locked_pages,
> > > +                                 Buffer *buffers)
> > 
> > Ugh, that's a lot of arguments, some are inputs and some are outputs. I
> > don't have any concrete suggestions, but could we simplify this somehow?
> > Needs a comment at least.
> 
> Yeah, I noticed this too.  I think it would be easy enough to add a new
> struct that can be passed as a pointer, which can be stack-allocated
> by the caller, and which holds the input arguments that are common to
> both functions, as is sensible.

I played a fair bit with various options. I ended up not using a struct to
pass most options, but instead going with a flags argument. However, I did use
a struct for passing either the relation or the smgr.


typedef enum ExtendBufferedFlags {
    EB_SKIP_EXTENSION_LOCK = (1 << 0),
    EB_IN_RECOVERY = (1 << 1),
    EB_CREATE_FORK_IF_NEEDED = (1 << 2),
    EB_LOCK_FIRST = (1 << 3),

    /* internal flags follow */
    EB_RELEASE_PINS = (1 << 4),
} ExtendBufferedFlags;

/*
 * To identify the relation - either relation or smgr + relpersistence has to
 * be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us
 * to use the same function for both crash recovery and normal operation.
 */
typedef struct ExtendBufferedWhat
{
    Relation rel;
    struct SMgrRelationData *smgr;
    char relpersistence;
} ExtendBufferedWhat;

#define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel})
/* requires use of EB_SKIP_EXTENSION_LOCK */
#define EB_SMGR(p_smgr, p_relpersistence) \
    ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence})


extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb,
                                ForkNumber forkNum,
                                BufferAccessStrategy strategy,
                                uint32 flags);
extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb,
                                       ForkNumber fork,
                                       BufferAccessStrategy strategy,
                                       uint32 flags,
                                       uint32 extend_by,
                                       Buffer *buffers,
                                       uint32 *extended_by);
extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb,
                                  ForkNumber fork,
                                  BufferAccessStrategy strategy,
                                  uint32 flags,
                                  BlockNumber extend_to,
                                  ReadBufferMode mode);

As you can see, I removed ReadBufferMode from most of the functions (as
suggested by Heikki earlier). When extending by one or multiple pages, we only
need to know whether to lock the new buffer(s) or not.

The reason ExtendBufferedRelTo() has a 'mode' argument is that it allows
falling back to reading the page normally if there was a concurrent relation
extension.

The reason EB_CREATE_FORK_IF_NEEDED exists is to remove the duplicated,
gnarly code for doing that from vm_extend() and fsm_extend().


I'm not sure about the function naming pattern. I do like 'By' a lot more than
the Bulk prefix I used before.


What do you think?


Greetings,

Andres Freund




Combine pg_walinspect till_end_of_wal functions with others

2023-02-28 Thread Bharath Rupireddy
Hi,

In a recent discussion [1], Michael Paquier asked if we can combine the
pg_walinspect till_end_of_wal functions with their counterparts
pg_get_wal_records_info and pg_get_wal_stats. The code is currently heavily
duplicated, and the number of functions that pg_walinspect exposes to users is
bloated. The point was that the till_end_of_wal functions determine the end
LSN themselves, and everything else they do is the same as their counterpart
functions. The original idea was to keep things simple, avoid cluttering the
APIs, and have better and more consistent validation of the user-supplied
end_lsn, at the cost of usability and code redundancy. However, I now tend to
agree with the feedback received.

I'm attaching a patch doing the $subject with the following behavior:
1. If start_lsn is NULL, error out/return NULL.
2. If end_lsn isn't specified, default to NULL, then determine the end_lsn.
3. If end_lsn is specified as NULL, then determine the end_lsn.
4. If end_lsn is specified as non-NULL, then check whether it is greater
than start_lsn; if so, go ahead and do the job, otherwise error out. (An
illustrative example follows this list.)
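
For illustration, here is roughly how the combined functions would behave under
the proposal (the LSNs are placeholders and the exact signatures are up to the
patch, so treat this as a sketch rather than the final API):

-- end_lsn omitted or passed as NULL: determine the end LSN automatically
SELECT count(*) FROM pg_get_wal_records_info('0/1E913618');
SELECT count(*) FROM pg_get_wal_records_info('0/1E913618', NULL);
-- explicit end_lsn: must be greater than start_lsn, otherwise an error is raised
SELECT * FROM pg_get_wal_stats('0/1E913618', '0/1E913740');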

Another idea is to convert the till_end_of_wal flavors to SQL-only functions
and remove the C code from pg_walinspect.c. However, I prefer $subject and
completely removing the till_end_of_wal flavors for better usability in the
long term.

Thoughts?

[1] 
https://www.postgresql.org/message-id/CALj2ACV-WBN%3DEUgUPyYOGitp%2Brn163vMnQd%3DHcWrnKt-uqFYFA%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 156e1d9148af79b905095b627e2db7919e8489ca Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sun, 26 Feb 2023 13:34:33 +
Subject: [PATCH v1] Reduce pg_walinspect's functions without losing
 functionality

---
 contrib/pg_walinspect/Makefile|   6 +-
 .../pg_walinspect/expected/oldextversions.out |  64 +++
 .../pg_walinspect/expected/pg_walinspect.out  |  32 +++-
 contrib/pg_walinspect/meson.build |   4 +-
 .../pg_walinspect/pg_walinspect--1.1--1.2.sql |  78 
 contrib/pg_walinspect/pg_walinspect.c | 174 +-
 contrib/pg_walinspect/pg_walinspect.control   |   2 +-
 contrib/pg_walinspect/sql/oldextversions.sql  |  27 +++
 contrib/pg_walinspect/sql/pg_walinspect.sql   |  28 ++-
 doc/src/sgml/pgwalinspect.sgml|  54 ++
 10 files changed, 336 insertions(+), 133 deletions(-)
 create mode 100644 contrib/pg_walinspect/expected/oldextversions.out
 create mode 100644 contrib/pg_walinspect/pg_walinspect--1.1--1.2.sql
 create mode 100644 contrib/pg_walinspect/sql/oldextversions.sql

diff --git a/contrib/pg_walinspect/Makefile b/contrib/pg_walinspect/Makefile
index 7033878a79..9527a0b268 100644
--- a/contrib/pg_walinspect/Makefile
+++ b/contrib/pg_walinspect/Makefile
@@ -7,9 +7,11 @@ OBJS = \
 PGFILEDESC = "pg_walinspect - functions to inspect contents of PostgreSQL Write-Ahead Log"
 
 EXTENSION = pg_walinspect
-DATA = pg_walinspect--1.0.sql pg_walinspect--1.0--1.1.sql
+DATA =  pg_walinspect--1.0.sql \
+		pg_walinspect--1.0--1.1.sql \
+		pg_walinspect--1.1--1.2.sql
 
-REGRESS = pg_walinspect
+REGRESS = pg_walinspect oldextversions
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_walinspect/walinspect.conf
 
diff --git a/contrib/pg_walinspect/expected/oldextversions.out b/contrib/pg_walinspect/expected/oldextversions.out
new file mode 100644
index 00..65963c77f4
--- /dev/null
+++ b/contrib/pg_walinspect/expected/oldextversions.out
@@ -0,0 +1,64 @@
+-- test old extension version entry points
+DROP EXTENSION pg_walinspect;
+CREATE EXTENSION pg_walinspect VERSION '1.1';
+-- New function introduced in 1.1
+SELECT pg_get_functiondef('pg_get_wal_fpi_info'::regproc);
+  pg_get_functiondef   
+---
+ CREATE OR REPLACE FUNCTION public.pg_get_wal_fpi_info(start_lsn pg_lsn, end_lsn pg_lsn, OUT lsn pg_lsn, OUT reltablespace oid, OUT reldatabase oid, OUT relfilenode oid, OUT relblocknumber bigint, OUT forkname text, OUT fpi bytea)+
+  RETURNS SETOF record+
+  LANGUAGE c  +
+  PARALLEL SAFE STRICT 

Re: Add LZ4 compression in pg_dump

2023-02-28 Thread Michael Paquier
On Tue, Feb 28, 2023 at 05:58:34PM -0600, Justin Pryzby wrote:
> I found that e9960732a broke writing of empty gzip-compressed data,
> specifically LOs.  pg_dump succeeds, but then the restore fails:

The number of issues you have been reporting here begins to worry
me..  How many of them have you found?  Is it right to assume that all
of them have basically 03d02f5 as their oldest origin point?
--
Michael




Re: meson: Add equivalent of configure --disable-rpath option

2023-02-28 Thread Peter Eisentraut

On 22.02.23 13:56, Peter Eisentraut wrote:
The configure option --disable-rpath currently has no equivalent in 
meson.  This option is used by packagers, so I think it would be good to 
have it in meson as well.  I came up with the attached patch.


committed




Re: Track Oldest Initialized WAL Buffer Page

2023-02-28 Thread Bharath Rupireddy
On Wed, Mar 1, 2023 at 9:49 AM Nathan Bossart  wrote:
>
> On Tue, Feb 28, 2023 at 11:12:29AM +0530, Bharath Rupireddy wrote:
> > On Tue, Feb 28, 2023 at 5:52 AM Nathan Bossart  
> > wrote:
> >> It's confusing to me that OldestInitializedPage is set to NewPageBeginPtr.
> >> Doesn't that set it to the beginning of the newest initialized page?
> >
> > Yes, that's the intention, see below. OldestInitializedPage points to
> > the start address of the oldest initialized page whereas the
> > InitializedUpTo points to the end address of the latest initialized
> > page. With this, one can easily track all the WAL between
> > OldestInitializedPage and InitializedUpTo.
>
> This is where I'm confused.  Why would we set the variable for the start
> address of the _oldest_ initialized page to the start address of the
> _newest_ initialized page?  I must be missing something obvious, so sorry
> if this is a silly question.

That's the crux of the patch. Let me clarify it a bit.

Firstly, we try to set OldestInitializedPage at the end of recovery, but
that's conditional, that is, only when the last replayed WAL record partially
spans into the last block.

Secondly, we set OldestInitializedPage while initializing a page for the
first time, so the case missed by the above condition gets covered too.

And OldestInitializedPage isn't updated for every newly initialized page,
only when the previous OldestInitializedPage is being reused, i.e., when
the WAL buffers are full and we wrap around. Please see the comment and
the condition
XLogRecPtrToBufIdx(XLogCtl->OldestInitializedPage) == nextidx, which
holds true if we're wrapping around past the previous
OldestInitializedPage.

+        /*
+         * Try updating oldest initialized XLog buffer page.
+         *
+         * Update it if we are initializing an XLog buffer page for the first
+         * time or if XLog buffers are full and we are wrapping around.
+         */
+        if (XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage) ||
+            XLogRecPtrToBufIdx(XLogCtl->OldestInitializedPage) == nextidx)
+        {
+            Assert(XLogCtl->OldestInitializedPage < NewPageBeginPtr);
+
+            XLogCtl->OldestInitializedPage = NewPageBeginPtr;
+        }

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade and logical replication

2023-02-28 Thread Julien Rouhaud
On Wed, Mar 01, 2023 at 11:51:49AM +0530, Amit Kapila wrote:
> On Tue, Feb 28, 2023 at 10:18 AM Julien Rouhaud  wrote:
> >
> > Well, as I mentioned I'm *not* interested in a logical-replication-only
> > scenario.  Logical replication is nice but it will always be less efficient
> > than physical replication, and some workloads also don't really play well 
> > with
> > it.  So while it can be a huge asset in some cases I'm for now looking at
> > leveraging logical replication for the purpose of major upgrade only for a
> > physical replication cluster, so the publications and subscriptions are only
> > temporary and trashed after use.
> >
> > That being said I was only saying that if I had to do a major upgrade of a
> > logical replication cluster this is probably how I would try to do it, to
> > minimize downtime, even if there are probably *a lot* difficulties to
> > overcome.
> >
>
> Okay, but it would be better if you list out your detailed steps. It
> would be useful to support the new mechanism in this area if others
> also find your steps to upgrade useful.

Sure.  Here are the overly detailed steps (a sketch of the key SQL commands
follows the list):

 1) setup a normal physical replication cluster (pg_basebackup, restoring PITR,
whatever), let's call the primary node "A" and replica node "B"
 2) ensure WAL level is "logical" on the primary node A
 3) create a logical replication slot on every (connectable) database (or just
the one you're interested in if you don't want to preserve everything) on A
 4) create a FOR ALL TABLE publication (again for every databases or just the
one you're interested in)
 5) wait for replication to be reasonably if not entirely up to date
 6) promote the standby node B
 7) retrieve the promotion LSN (from the .history file,
pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()...)
 8) call pg_replication_slot_advance() with that LSN for all previously created
logical replication slots on A
 9) create a normal subscription on all wanted databases on the promoted node
10) wait for it to catch up if needed on B
12) stop the node B
13) run pg_upgrade on B, creating the new node C
14) start C, run the global ANALYZE and any sanity check needed (hopefully you
would have validated that your application is compatible with that new
version before this point)
15) re-enable the subscription on C.  This is currently not possible without
losing data, the patch fixes that
16) wait for it to catch up if needed
17) create any missing relation and do the ALTER SUBSCRIPTION ... REFRESH if
needed
18) trash B
19) create new nodes D, E... as physical replicas from C if needed, possibly
using a cheaper approach like pg_start_backup() / rsync / pg_stop_backup() if
needed
20) switchover to C and trash A (or convert it to another replica if you want)
21) trash the publications on C on all databases
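
For steps 3, 4, 8 and 9, the key SQL commands would look roughly like the
following sketch (slot, publication and subscription names and the LSN are
just placeholders; repeat per database as needed):

-- step 3, on A, in each wanted database
SELECT pg_create_logical_replication_slot('upgrade_slot_db1', 'pgoutput');
-- step 4, on A, in each wanted database
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;
-- step 8, on A, using the promotion LSN retrieved from B in step 7
SELECT pg_replication_slot_advance('upgrade_slot_db1', '0/5000028');
-- step 9, on the promoted node B, reusing the pre-created slot
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=nodeA dbname=db1'
    PUBLICATION upgrade_pub
    WITH (create_slot = false, slot_name = 'upgrade_slot_db1', copy_data = false);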

As noted, step 15 is currently problematic, and it is also problematic in any
variation of that scenario that doesn't require you to entirely recreate the
node C from scratch using logical replication, which is what I want to avoid.

This isn't terribly complicated but requires you to be really careful if you
don't want to end up with an incorrect node C.  This approach is also currently
not entirely ideal, but hopefully logical replication of sequences and DDL will
remove the main sources of downtime when upgrading using logical replication.

My ultimate goal is to provide some tooling to do that in a much simpler way.
Maybe a new "promote to logical" action that would take care of steps 2 to 9.
Users would therefore only have to do this "promotion to logical", and then run
pg_upgrade and create a new physical replication cluster if they want.




Re: Commitfest 2023-03 starting tomorrow!

2023-02-28 Thread Michael Paquier
On Tue, Feb 28, 2023 at 01:45:27PM -0500, Greg Stark wrote:
> So I'm not sure if I'll be CFM this month but I'm assuming I will be
> at this point

Okay, that's OK for me!  Thanks for helping out.

> The next pass would be to grab any patches not marked Ready for
> Committer and if they look like they'll need more than one round of
> feedback and a couple weeks to polish they'll probably get bounced to
> the next commitfest too. It sucks not getting feedback on your patches
> for so long but there are really just sooo many patches and so few
> eyeballs... It would be great if people could do initial reviews of
> these patches before we bounce them because it really is discouraging
> for developers to send patches and not get feedback. But realistically
> it's going to happen to a lot of patches.

I don't have many patches registered this time for the sole reason of
being able to spend more cycles on reviews and see what could make the
cut.  So we'll see how it goes, I guess..

The CF will begin in more or less 5 hours from the moment of this
message:
https://www.timeanddate.com/time/zones/aoe
--
Michael




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-02-28 Thread John Naylor
On Tue, Feb 28, 2023 at 10:09 PM Masahiko Sawada  wrote:
>
> On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada  wrote:
> >
> > On Tue, Feb 28, 2023 at 3:42 PM John Naylor
> >  wrote:
> > >
> > >
> > > On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> > > >
> > > > On Wed, Feb 22, 2023 at 6:55 PM John Naylor
> > > >  wrote:
> > > > >
> > > > > That doesn't seem useful to me. If we've done enough testing to
> > > > > reassure us the new way always gives the same answer, the old way is not
> > > > > needed at commit time. If there is any doubt it will always give the same
> > > > > answer, then the whole patchset won't be committed.
> > >
> > > > My idea is to make the bug investigation easier but on
> > > > reflection, it seems not the best idea given this purpose.
> > >
> > > My concern with TIDSTORE_DEBUG is that it adds new code that mimics
> > > the old tid array. As I've said, that doesn't seem like a good thing to
> > > carry forward forevermore, in any form. Plus, comparing new code with new
> > > code is not the same thing as comparing existing code with new code. That
> > > was my idea upthread.
> > >
> > > Maybe the effort my idea requires is too much vs. the likelihood of
> > > finding a problem. In any case, it's clear that if I want that level of
> > > paranoia, I'm going to have to do it myself.
> > >
> > > > What do you think
> > > > about the attached patch? Please note that it also includes the
> > > > changes for minimum memory requirement.
> > >
> > > Most of the asserts look logical, or at least harmless.
> > >
> > > - int max_off; /* the maximum offset number */
> > > + OffsetNumber max_off; /* the maximum offset number */
> > >
> > > I agree with using the specific type for offsets here, but I'm not
> > > sure why this change belongs in this patch. If we decided against the new
> > > asserts, this would be easy to lose.
> >
> > Right. I'll separate this change as a separate patch.
> >
> > >
> > > This change, however, defies common sense:
> > >
> > > +/*
> > > + * The minimum amount of memory required by TidStore is 2MB, the current minimum
> > > + * valid value for the maintenance_work_mem GUC. This is required to allocate the
> > > + * DSA initial segment, 1MB, and some meta data. This number is applied also to
> > > + * the local TidStore cases for simplicity.
> > > + */
> > > +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
> > >
> > > + /* Sanity check for the max_bytes */
> > > + if (max_bytes < TIDSTORE_MIN_MEMORY)
> > > +     elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
> > > +          TIDSTORE_MIN_MEMORY, max_bytes);
> > >
> > > Aside from the fact that this elog's something that would never get
> > > past development, the #define just adds a hard-coded copy of something that
> > > is already hard-coded somewhere else, whose size depends on an
> > > implementation detail in a third place.
> > >
> > > This also assumes that all users of tid store are limited by
> > > maintenance_work_mem. Andres thought of an example of some day unifying
> > > with tidbitmap.c, and maybe other applications will be limited by work_mem.
> > >
> > > But now that I'm looking at the guc tables, I am reminded that
> > > work_mem's minimum is 64kB, so this highlights a design problem: There is
> > > obviously no requirement that the minimum work_mem has to be >= a single
> > > DSA segment, even though operations like parallel hash and parallel bitmap
> > > heap scan are limited by work_mem.
> >
> > Right.
> >
> > >  It would be nice to find out what happens with these parallel
> > > features when work_mem is tiny (maybe parallelism is not even considered?).
> >
> > IIUC both don't care about the allocated DSA segment size. Parallel
> > hash accounts actual tuple (+ header) size as used memory but doesn't
> > consider how much DSA segment is allocated behind. Both parallel hash
> > and parallel bitmap scan can work even with work_mem = 64kB, but when
> > checking the total DSA segment size allocated during these operations,
> > it was 1MB.
> >
> > I realized that there is a similar memory limit design issue also on
> > the non-shared tidstore cases. We deduct 70kB from max_bytes but it
> > won't work fine with work_mem = 64kB.  Probably we need to reconsider
> > it. FYI 70kB comes from the maximum slab block size for node256.
>
> Currently, we calculate the slab block size to be enough to allocate 32
> chunks from it. For node256, the leaf node is 2,088 bytes and the
> slab block size is 66,816 bytes. One idea to fix this issue is to
> decrease it.

I think we're trying to solve the wrong problem here. I need to study this
more, but it seems that code that needs to stay within a memory limit only
needs to track what's been allocated in chunks within a block, since
writing there is what invokes a page fault. If we're not keeping track of
each and every chunk space, for speed, it doesn't follow that we need to
keep every block allocation within the configured limit. I'm guessing we
can just ask the context if the block space has gone *over* the limit, and
we can assume that the last 

Re: LOG: invalid record length at : wanted 24, got 0

2023-02-28 Thread Bharath Rupireddy
On Wed, Mar 1, 2023 at 10:51 AM Harinath Kanchu  wrote:
>
> Hello,
>
> We are seeing an interesting STANDBY behavior, that’s happening once in 3-4 
> days.
>
> The standby suddenly disconnects from the primary, and it throws the error 
> “LOG: invalid record length at : wanted 24, got 0”.

Firstly, this isn't an error per se, especially for a standby, as it
can get/retry the same WAL record from other sources. It's a bit hard
to say anything further just by looking at this LOG message; one needs
to look at what's happening around the same time. You mentioned that
the connection to the primary was lost, so you need to dig into why
it was lost. If the connection was lost half-way through fetching a
WAL record, the standby may emit such a LOG message.

Secondly, you definitely need to understand why the connection to the
primary keeps getting lost: network disruption, parameter changes, the
primary going down, the standby going down, etc.

> And then it tries to restore the WAL file from the archive. Due to low write 
> activity on primary, the WAL file will be switched and archived only after 1 
> hr.
>
> So, it stuck in a loop of switching the WAL sources from STREAM and ARCHIVE 
> without replicating the primary.
>
> Due to this there will be write outage as the standby is synchronous standby.

I understand this problem and there's a proposed patch to help with
this - 
https://www.postgresql.org/message-id/CALj2ACVryN_PdFmQkbhga1VeW10VgQ4Lv9JXO=3njkvzt8q...@mail.gmail.com.

It basically allows one to set a timeout controlling how long the
standby keeps restoring from the archive before switching to streaming.
Therefore, in your case, the standby doesn't have to wait for 1 hr to
connect to the primary; it can connect before that.

> We are using “wal_sync_method” as “fsync” assuming WAL file not getting 
> flushed correctly.
>
> But this is happening even after making it as “fsync” instead of “fdatasync”.

I don't think wal_sync_method is the problem here, unless it was changed
to something else in between.

> Restarting the STANDBY sometimes fixes this problem, but detecting this 
> automatically is a big problem as the postgres standby process will be still 
> running fine, but WAL RECEIVER process is up and down continuously due to 
> switching of WAL sources.

Yes, after the standby fails to connect to the primary, it switches to
the archive and stays there until it exhausts all the WAL from the
archive, and only then switches back to streaming. You can monitor the
standby's replication slot on the primary; if it's inactive, then one
needs to jump in. As mentioned above, there's an in-progress feature
that helps in these cases.
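
For example, a minimal check on the primary (assuming the standby uses a
replication slot; names depend on your setup) could be:

-- active = false means nothing is currently streaming from that slot
SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;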

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: add PROCESS_MAIN to VACUUM

2023-02-28 Thread Michael Paquier
On Thu, Jan 19, 2023 at 11:08:07AM -0800, Nathan Bossart wrote:
> rebased

PROCESS_TOAST has that:
    /* sanity check for PROCESS_TOAST */
    if ((params->options & VACOPT_FULL) != 0 &&
        (params->options & VACOPT_PROCESS_TOAST) == 0)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("PROCESS_TOAST required with VACUUM FULL")));
[...]
-   if (params->options & VACOPT_FULL)
+   if (params->options & VACOPT_FULL &&
+       params->options & VACOPT_PROCESS_MAIN)
    {

Shouldn't we apply the same rule for PROCESS_MAIN?  One of the
regression tests added shows that FULL takes priority over
PROCESS_MAIN=FALSE, which is a bit confusing IMO.
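
To illustrate with the existing PROCESS_TOAST rule (sketch only, reusing the
vactst table from the regression tests):

-- today: PROCESS_TOAST cannot be turned off together with FULL
VACUUM (PROCESS_TOAST FALSE, FULL) vactst;  -- ERROR:  PROCESS_TOAST required with VACUUM FULL
-- with the patch this is accepted and FULL silently wins; applying the same
-- rule would make it error out as well
VACUUM (PROCESS_MAIN FALSE, FULL) vactst;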

@@ -190,6 +190,7 @@ typedef struct VacAttrStats
 #define VACOPT_DISABLE_PAGE_SKIPPING 0x80  /* don't skip any pages */
 #define VACOPT_SKIP_DATABASE_STATS 0x100   /* skip vac_update_datfrozenxid() */
 #define VACOPT_ONLY_DATABASE_STATS 0x200   /* only vac_update_datfrozenxid() */
+#define VACOPT_PROCESS_MAIN 0x400  /* process main relation */

Perhaps the options had better be reorganized so as PROCESS_MAIN is
just before PROCESS_TOAST?

+-- PROCESS_MAIN option
+VACUUM (PROCESS_MAIN FALSE) vactst;
+VACUUM (PROCESS_MAIN FALSE, PROCESS_TOAST FALSE) vactst;
+VACUUM (PROCESS_MAIN FALSE, FULL) vactst;

Thinking a bit here.  This set of tests does not make sure that the
main relation and/or the toast relation have actually been processed.
pg_stat_user_tables does not track what's happening on the toast
relations.  So...  What about adding some tests in 100_vacuumdb.pl
that rely on vacuumdb --verbose and check the logs produced?  We
should make sure that the toast or the main relation is processed,
by tracking, for example, logs like vacuuming "schema.table".  When
FULL is involved, we may want to track the changes in relfilenodes
depending on what's wanted.
--
Michael




Re: pg_upgrade and logical replication

2023-02-28 Thread Amit Kapila
On Tue, Feb 28, 2023 at 10:18 AM Julien Rouhaud  wrote:
>
> On Tue, Feb 28, 2023 at 08:56:37AM +0530, Amit Kapila wrote:
> > On Tue, Feb 28, 2023 at 7:55 AM Julien Rouhaud  wrote:
> > >
> > >
> > > The scenario I'm interested in is to rely on logical replication only for 
> > > the
> > > upgrade, so the end state (and start state) is to go back to physical
> > > replication.  In that case, I would just create new physical replica from 
> > > the
> > > pg_upgrade'd server and failover to that node, or rsync the previous 
> > > publisher
> > > node to make it a physical replica.
> > >
> > > But even if you want to only rely on logical replication, I'm not sure 
> > > why you
> > > would want to keep the publisher node as a publisher node?  I think that 
> > > doing
> > > it this way will lead to a longer downtime compared to doing a failover 
> > > on the
> > > pg_upgrade'd node, make it a publisher and then move the former publisher 
> > > node
> > > to a subscriber.
> > >
> >
> > I am not sure if this is what everyone usually follows because it sounds
> > like a lot of work to me. IIUC, to achieve this, one needs to recreate
> > all the publications and subscriptions after changing the roles of
> > publisher and subscriber. Can you please write steps to show exactly
> > what you have in mind to avoid any misunderstanding?
>
> Well, as I mentioned I'm *not* interested in a logical-replication-only
> scenario.  Logical replication is nice but it will always be less efficient
> than physical replication, and some workloads also don't really play well with
> it.  So while it can be a huge asset in some cases I'm for now looking at
> leveraging logical replication for the purpose of major upgrade only for a
> physical replication cluster, so the publications and subscriptions are only
> temporary and trashed after use.
>
> That being said I was only saying that if I had to do a major upgrade of a
> logical replication cluster this is probably how I would try to do it, to
> minimize downtime, even if there are probably *a lot* difficulties to
> overcome.
>

Okay, but it would be better if you list out your detailed steps. It
would be useful to support the new mechanism in this area if others
also find your steps to upgrade useful.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Amit Kapila
On Wed, Mar 1, 2023 at 10:57 AM Masahiko Sawada  wrote:
>
> On Wed, Mar 1, 2023 at 1:55 PM Amit Kapila  wrote:
> >
>
> > Won't a malicious user can block the
> > replication in other ways as well and let the publisher stall (or
> > crash the publisher) even without setting min_send_delay? Basically,
> > one needs to either disable the subscription or create a
> > constraint-violating row in the table to make that happen. If the
> > system is exposed for arbitrarily allowing the creation of a
> > subscription then a malicious user can create a subscription similar
> > to one existing subscription and block the replication due to
> > constraint violations. I don't think it would be so easy to bypass the
> > current system that a malicious user will be allowed to create/alter
> > subscriptions arbitrarily.
>
> Right. But a difference is that with min_send_delay, it's just to
> create a subscription.
>

But, currently, only superusers would be allowed to create
subscriptions. Even if we change it and allow it based on some
pre-defined role, it won't be possible to create a subscription
arbitrarily. So, I am not sure if any malicious user can easily bypass
it as you are envisioning.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Masahiko Sawada
On Wed, Mar 1, 2023 at 1:55 PM Amit Kapila  wrote:
>
> On Wed, Mar 1, 2023 at 8:18 AM Masahiko Sawada  wrote:
> >
> > On Wed, Mar 1, 2023 at 12:51 AM Hayato Kuroda (Fujitsu)
> >  wrote:
> >
> > Thinking of side effects of this feature (no matter where we delay
> > applying the changes), on the publisher, vacuum cannot collect garbage
> > and WAL cannot be recycled. Is that okay in the first place? The point
> > is that the subscription setting affects the publisher. That is,
> > min_send_delay is specified on the subscriber but the symptoms that
> > could ultimately lead to a server crash appear on the publisher, which
> > sounds dangerous to me.
> >
> > Imagine a service or system like where there is a publication server
> > and it's somewhat exposed so that a user (or a subsystem) arbitrarily
> > can create a subscriber to replicate a subset of the data. A malicious
> > user can have the publisher crash by creating a subscription with,
> > say, min_send_delay = 20d. max_slot_wal_keep_size helps this situation
> > but it's -1 by default.
> >
>
> By publisher crash, do you mean due to the disk full situation, it can
> lead the publisher to stop/panic?

Exactly.

> Won't a malicious user can block the
> replication in other ways as well and let the publisher stall (or
> crash the publisher) even without setting min_send_delay? Basically,
> one needs to either disable the subscription or create a
> constraint-violating row in the table to make that happen. If the
> system is exposed for arbitrarily allowing the creation of a
> subscription then a malicious user can create a subscription similar
> to one existing subscription and block the replication due to
> constraint violations. I don't think it would be so easy to bypass the
> current system that a malicious user will be allowed to create/alter
> subscriptions arbitrarily.

Right. But a difference is that with min_send_delay, it just takes
creating a subscription.

> Similarly, if there is a network issue
> (unreachable or slow), one will see similar symptoms. I think
> retention of data and WAL on publisher do rely on acknowledgment from
> subscribers and delay in that due to any reason can lead to the
> symptoms you describe above.

I think that piling up WAL files due to a slow network is a different
story since it's a problem not only on the subscriber side.

>  We have documented at least one such case
> already where during Drop Subscription, if the network is not
> reachable then also, a similar problem can happen and users need to be
> careful about it [1].

Apart from the bad-use-case example I mentioned, in general, piling up
WAL files due to a replication slot has many bad effects on the
system. I'm concerned that the side effect of this feature (at least
of the current design) is too large compared to the benefit, and I'm
afraid that users might end up using this feature without understanding
the side effect well. It might be okay if we thoroughly document it but
I'm not sure.
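
For what it's worth, the mitigation mentioned above can be sketched like this;
note that min_send_delay is the option proposed in this thread (not an existing
parameter) and the names and values are only illustrative:

-- on the publisher: cap how much WAL a replication slot may retain
-- (the default is -1, i.e. unlimited)
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();
-- on the subscriber: the kind of subscription being discussed
CREATE SUBSCRIPTION delayed_sub CONNECTION 'host=pub dbname=postgres'
    PUBLICATION pub WITH (min_send_delay = '20d');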

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




LOG: invalid record length at : wanted 24, got 0

2023-02-28 Thread Harinath Kanchu
Hello,

We are seeing an interesting STANDBY behavior that's happening once in 3-4
days.

The standby suddenly disconnects from the primary, and it throws the error
“LOG: invalid record length at : wanted 24, got 0”.

And then it tries to restore the WAL file from the archive. Due to low write
activity on the primary, the WAL file will be switched and archived only after 1 hr.

So, it is stuck in a loop of switching the WAL sources between STREAM and ARCHIVE
without replicating from the primary.

Due to this there will be a write outage, as the standby is a synchronous standby.

We are using “wal_sync_method” as “fsync”, assuming the WAL file was not getting
flushed correctly.

But this is happening even after changing it to “fsync” from “fdatasync”.

Restarting the STANDBY sometimes fixes this problem, but detecting this
automatically is a big problem as the postgres standby process will still be
running fine, but the WAL receiver process is continuously going up and down due
to the switching of WAL sources.


How can we fix this? Any suggestions regarding this will be appreciated.


Postgres Version: 13.6
OS: RHEL Linux


Thank you,


Best,
Harinath.



Re: Support logical replication of global object commands

2023-02-28 Thread Zheng Li
On Fri, Feb 17, 2023 at 4:48 AM Amit Kapila  wrote:
>
> On Fri, Feb 17, 2023 at 10:58 AM Zheng Li  wrote:
> >
> > > > > Actually, I intend something for global objects. But the main thing
> > > > > that is worrying me about this is that we don't have a clean way to
> > > > > untie global object replication from database-specific object
> > > > > replication.
> > > >
> > > > I think ultimately we need a clean and efficient way to publish (and
> > > > subscribe to) any changes in all databases, preferably in one logical
> > > > replication slot.
> > > >
> > >
> > > Agreed. I was thinking currently for logical replication both
> > > walsender and slot are database-specific. So we need a way to
> > > distinguish the WAL for global objects and then avoid filtering based
> > > on the slot's database during decoding.
> >
> > But which WALSender should handle the WAL for global objects if we
> > don't filter by database? Is there any specific problem you see for
> > decoding global objects commands in a database specific WALSender?
> >
>
> I haven't verified but I was concerned about the below check:
> logicalddl_decode
> {
> ...
> +
> + if (message->dbId != ctx->slot->data.database ||

OK, let's suppose we don't filter by database for global commands when
decoding DDL records, which is roughly what the following code does:
 logicalddl_decode
 {
 ...

  if (message->dbId != ctx->slot->data.database ||
  +   message->cmdtype != DCT_GlobalObjectCmd

But this is not enough; we also need the subsequent commit record of
the txn to be decoded in order to replicate the global command. So I
think we also need to make DecodeCommit bypass the filter by database
if global object replication is turned on and we have decoded a global
command in the txn.

Regards,
Zane




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Amit Kapila
On Tue, Feb 28, 2023 at 9:21 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> > 1.
> > + /*
> > + * If we've requested to shut down, exit the process.
> > + *
> > + * Note that WalSndDone() cannot be used here because the delaying
> > + * changes will be sent in the function.
> > + */
> > + if (got_STOPPING)
> > + {
> > + QueryCompletion qc;
> > +
> > + /* Inform the standby that XLOG streaming is done */
> > + SetQueryCompletion(, CMDTAG_COPY, 0);
> > + EndCommand(, DestRemote, false);
> > + pq_flush();
> >
> > Do we really need to do anything except for breaking the loop and let
> > the exit handling happen in the main loop when 'got_STOPPING' is set?
> > AFAICS, this is what we are doing in some other palces (See
> > WalSndWaitForWal). Won't that work? It seems that will help us sending
> > all the pending WAL.
>
> If we exit the loop after got_STOPPING is set, as you said, the walsender will
> send the delayed changes and then exit. The behavior is the same as the case
> where WalSndDone() is called. But I think it is not suitable for the motivation
> of the feature. If users notice a mistaken operation like TRUNCATE, they must
> shut down the publisher once and then recover from a backup or the old
> subscriber. If the walsender sends all pending changes, mistaken operations
> will also be propagated to the subscriber and the data cannot be protected.
> So currently I want to keep the style.
> FYI - In case of physical replication, received WALs are not applied when the
> secondary is shut down.
>

Fair point but I think the current comment should explain why we are
doing something different here. How about extending the existing
comments to something like: "If we've requested to shut down, exit the
process. This is unlike handling at other places where we allow
complete WAL to be sent before shutdown because we don't want the
delayed transactions to be applied downstream. This will allow one to
use the data from downstream in case of some unwanted operations on
the current node."

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Amit Kapila
On Wed, Mar 1, 2023 at 8:18 AM Masahiko Sawada  wrote:
>
> On Wed, Mar 1, 2023 at 12:51 AM Hayato Kuroda (Fujitsu)
>  wrote:
>
> Thinking of side effects of this feature (no matter where we delay
> applying the changes), on the publisher, vacuum cannot collect garbage
> and WAL cannot be recycled. Is that okay in the first place? The point
> is that the subscription setting affects the publisher. That is,
> min_send_delay is specified on the subscriber but the symptoms that
> could ultimately lead to a server crash appear on the publisher, which
> sounds dangerous to me.
>
> Imagine a service or system like where there is a publication server
> and it's somewhat exposed so that a user (or a subsystem) arbitrarily
> can create a subscriber to replicate a subset of the data. A malicious
> user can have the publisher crash by creating a subscription with,
> say, min_send_delay = 20d. max_slot_wal_keep_size helps this situation
> but it's -1 by default.
>

By publisher crash, do you mean that due to the disk-full situation, it can
lead the publisher to stop/panic? Won't a malicious user be able to block the
replication in other ways as well and let the publisher stall (or
crash the publisher) even without setting min_send_delay? Basically,
one needs to either disable the subscription or create a
constraint-violating row in the table to make that happen. If the
system is exposed for arbitrarily allowing the creation of a
subscription then a malicious user can create a subscription similar
to one existing subscription and block the replication due to
constraint violations. I don't think it would be so easy to bypass the
current system that a malicious user will be allowed to create/alter
subscriptions arbitrarily. Similarly, if there is a network issue
(unreachable or slow), one will see similar symptoms. I think
retention of data and WAL on the publisher does rely on acknowledgment
from subscribers, and a delay in that for any reason can lead to the
symptoms you describe above. We have documented at least one such case
already where, during DROP SUBSCRIPTION, if the network is not
reachable, a similar problem can happen, and users need to be
careful about it [1].

[1] - 
https://www.postgresql.org/docs/devel/logical-replication-subscription.html
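
For reference, the workaround documented there for an unreachable publisher is
to dissociate the slot before dropping the subscription, roughly (subscription
and slot names are placeholders):

-- on the subscriber
ALTER SUBSCRIPTION mysub DISABLE;
ALTER SUBSCRIPTION mysub SET (slot_name = NONE);
DROP SUBSCRIPTION mysub;
-- later, once the publisher is reachable, drop the remote slot manually there
SELECT pg_drop_replication_slot('mysub');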

-- 
With Regards,
Amit Kapila.




Re: Normalization of utility queries in pg_stat_statements

2023-02-28 Thread Michael Paquier
On Mon, Feb 20, 2023 at 11:34:59AM +0900, Michael Paquier wrote:
> With the patches..

Attached is an updated patch set, where I have done more refactoring
work for the regression tests of pg_stat_statements, splitting
pg_stat_statements.sql into the following files:
- user_activity.sql for the role-level resets.
- wal.sql for the WAL generation tracking.
- dml.sql for insert/update/delete/merge and row counts.
- The main file is renamed to select.sql, as it now only covers SELECT
patterns.

There is no change in the code coverage or the patterns tested.  And
with that, I am rather comfortable with the shape of the regression
tests moving forward.

0002 and 0003 are equivalent to the previous 0003 and 0004 in v4, which
switch pg_stat_statements to apply the normalization to utilities that
use Const nodes.
--
Michael
From 23476ee28a8e5bb82859369a8f54e9cd3c45fc32 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 1 Mar 2023 13:34:11 +0900
Subject: [PATCH v5 1/3] Split more regression tests of pg_stat_statements

This commit expands more the refactoring of the regression tests of
pg_stat_statements, with tests moved out of pg_stat_statements.sql to
separate files, as of:
- select.sql is mainly the former pg_stat_statements.sql, renamed.
- dml.sql for INSERT/UPDATE/DELETE and MERGE
- user_activity, to test role-level checks and stat resets.
- wal, to check the WAL generation.
---
 contrib/pg_stat_statements/Makefile   |   4 +-
 contrib/pg_stat_statements/expected/dml.out   | 148 
 .../expected/pg_stat_statements.out   | 768 --
 .../pg_stat_statements/expected/select.out| 411 ++
 .../expected/user_activity.out| 199 +
 contrib/pg_stat_statements/expected/wal.out   |  35 +
 contrib/pg_stat_statements/meson.build|   5 +-
 contrib/pg_stat_statements/sql/dml.sql|  78 ++
 .../sql/pg_stat_statements.sql| 300 ---
 contrib/pg_stat_statements/sql/select.sql | 146 
 .../pg_stat_statements/sql/user_activity.sql  |  65 ++
 contrib/pg_stat_statements/sql/wal.sql|  22 +
 12 files changed, 1110 insertions(+), 1071 deletions(-)
 create mode 100644 contrib/pg_stat_statements/expected/dml.out
 delete mode 100644 contrib/pg_stat_statements/expected/pg_stat_statements.out
 create mode 100644 contrib/pg_stat_statements/expected/select.out
 create mode 100644 contrib/pg_stat_statements/expected/user_activity.out
 create mode 100644 contrib/pg_stat_statements/expected/wal.out
 create mode 100644 contrib/pg_stat_statements/sql/dml.sql
 delete mode 100644 contrib/pg_stat_statements/sql/pg_stat_statements.sql
 create mode 100644 contrib/pg_stat_statements/sql/select.sql
 create mode 100644 contrib/pg_stat_statements/sql/user_activity.sql
 create mode 100644 contrib/pg_stat_statements/sql/wal.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 69fbc6a858..5578a9dd4e 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -17,8 +17,8 @@ PGFILEDESC = "pg_stat_statements - execution statistics of SQL statements"
 LDFLAGS_SL += $(filter -lm, $(LIBS))
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf
-REGRESS = pg_stat_statements cursors utility level_tracking planning \
-	cleanup oldextversions
+REGRESS = select dml cursors utility level_tracking planning \
+	user_activity wal cleanup oldextversions
 # Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
 # which typical installcheck users do not have (e.g. buildfarm clients).
 NO_INSTALLCHECK = 1
diff --git a/contrib/pg_stat_statements/expected/dml.out b/contrib/pg_stat_statements/expected/dml.out
new file mode 100644
index 00..803d993e85
--- /dev/null
+++ b/contrib/pg_stat_statements/expected/dml.out
@@ -0,0 +1,148 @@
+--
+-- DMLs on test table
+--
+SET pg_stat_statements.track_utility = FALSE;
+-- utility "create table" should not be shown
+CREATE TEMP TABLE pgss_dml_tab (a int, b char(20));
+INSERT INTO pgss_dml_tab VALUES(generate_series(1, 10), 'aaa');
+UPDATE pgss_dml_tab SET b = 'bbb' WHERE a > 7;
+DELETE FROM pgss_dml_tab WHERE a > 9;
+-- explicit transaction
+BEGIN;
+UPDATE pgss_dml_tab SET b = '111' WHERE a = 1 ;
+COMMIT;
+BEGIN \;
+UPDATE pgss_dml_tab SET b = '222' WHERE a = 2 \;
+COMMIT ;
+UPDATE pgss_dml_tab SET b = '333' WHERE a = 3 \;
+UPDATE pgss_dml_tab SET b = '444' WHERE a = 4 ;
+BEGIN \;
+UPDATE pgss_dml_tab SET b = '555' WHERE a = 5 \;
+UPDATE pgss_dml_tab SET b = '666' WHERE a = 6 \;
+COMMIT ;
+-- many INSERT values
+INSERT INTO pgss_dml_tab (a, b) VALUES (1, 'a'), (2, 'b'), (3, 'c');
+-- SELECT with constants
+SELECT * FROM pgss_dml_tab WHERE a > 5 ORDER BY a ;
+ a |  b   
+---+--
+ 6 | 666 
+ 7 | aaa 
+ 8 | bbb 
+ 9 | bbb 
+(4 rows)
+
+SELECT *
+  FROM pgss_dml_tab
+  WHERE a > 

Re: stopgap fix for signal handling during restore_command

2023-02-28 Thread Nathan Bossart
On Sun, Feb 26, 2023 at 12:12:27PM -0800, Andres Freund wrote:
> On 2023-02-26 11:39:00 -0800, Nathan Bossart wrote:
>> What precisely did you have in mind?  AFAICT you are asking for a wrapper
>> around write().
> 
> Partially I just want something that can easily be searched for, that can have
> comments attached to it documenting why what it is doing is safe.
> 
> It'd not be a huge amount of work to have a slow and restricted string
> interpolation support, to make it easier to write messages. Converting floats
> is probably too hard to do safely, and I'm not sure %m can safely be
> supported. But basic things like %d would be pretty simple.
> 
> Basically a loop around the format string that directly writes to stderr using
> write(), and only supports a signal safe subset of normal format strings.

Got it, thanks.  I will try to put something together along these lines,
although I don't know if I'll pick up the interpolation support in this
thread.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Weird failure with latches in curculio on v15

2023-02-28 Thread Nathan Bossart
On Sat, Feb 25, 2023 at 11:00:31AM -0800, Andres Freund wrote:
> TBH, I think the current archive and restore module APIs aren't useful. I
> think it was a mistake to add archive modules without having demonstrated that
> one can do something useful with them that the restore_command didn't already
> do. If anything, archive modules have made it harder to improve archiving
> performance via concurrency.

I must respectfully disagree that this work is useless.  Besides the
performance and security benefits of not shelling out for every WAL file,
I've found it very useful to be able to use the standard module framework
to develop archive modules.  It's relatively easy to make use of GUCs,
background workers, compression, etc.  Of course, there is room for
improvement in areas like concurrency support as you rightly point out, but
I don't think that makes the current state worthless.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Track Oldest Initialized WAL Buffer Page

2023-02-28 Thread Nathan Bossart
On Tue, Feb 28, 2023 at 11:12:29AM +0530, Bharath Rupireddy wrote:
> On Tue, Feb 28, 2023 at 5:52 AM Nathan Bossart  
> wrote:
>> It's confusing to me that OldestInitializedPage is set to NewPageBeginPtr.
>> Doesn't that set it to the beginning of the newest initialized page?
> 
> Yes, that's the intention, see below. OldestInitializedPage points to
> the start address of the oldest initialized page whereas the
> InitializedUpTo points to the end address of the latest initialized
> page. With this, one can easily track all the WAL between
> OldestInitializedPage and InitializedUpTo.

This is where I'm confused.  Why would we set the variable for the start
address of the _oldest_ initialized page to the start address of the
_newest_ initialized page?  I must be missing something obvious, so sorry
if this is a silly question.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-02-28 Thread Nathan Bossart
On Tue, Feb 28, 2023 at 10:38:31AM +0530, Bharath Rupireddy wrote:
> On Tue, Feb 28, 2023 at 6:14 AM Nathan Bossart  
> wrote:
>> Why do we only read a page at a time in XLogReadFromBuffersGuts()?  What is
>> preventing us from copying all the data we need in one go?
> 
> Note that most of the WALRead() callers request a single page of
> XLOG_BLCKSZ bytes even if the server has less or more available WAL
> pages. It's the streaming replication wal sender that can request less
> than XLOG_BLCKSZ bytes and upto MAX_SEND_SIZE (16 * XLOG_BLCKSZ). And,
> if we read, say, MAX_SEND_SIZE at once while holding
> WALBufMappingLock, that might impact concurrent inserters (at least, I
> can say it in theory) - one of the main intentions of this patch is
> not to impact inserters much.

Perhaps we should test both approaches to see if there is a noticeable
difference.  It might not be great for concurrent inserts to repeatedly
take the lock, either.  If there's no real difference, we might be able to
simplify the code a bit.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Amit Kapila
On Wed, Mar 1, 2023 at 8:06 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 28 Feb 2023 08:35:11 +0530, Amit Kapila  
> wrote in
> > On Tue, Feb 28, 2023 at 8:14 AM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Mon, 27 Feb 2023 14:56:19 +0530, Amit Kapila  
> > > wrote in
> > > > The one difference w.r.t recovery_min_apply_delay is that here we will
> > > > hold locks for the duration of the delay which didn't seem to be a
> > > > good idea. This will also probably lead to more bloat as we will keep
> > > > transactions open for a long time. Doing it before DecodePrepare won't
> > >
> > > I don't have a concrete picture but could we tell reorder buffer to
> > > retain a PREPAREd transaction until a COMMIT PREPARED comes?
> > >
> >
> > Yeah, we could do that and that is what is the behavior unless the
> > user enables 2PC via 'two_phase' subscription option. But, I don't see
> > the need to unnecessarily delay the prepare till the commit if a user
> > has specified 'two_phase' option. It is quite possible that COMMIT
> > PREPARED happens at a much later time frame than the amount of delay
> > the user is expecting.
>
> It looks like the user should decide between potential long locks or
> extra delays, and this choice ought to be documented.
>

Sure, we can do that. However, I think the way this feature works is
that we keep the standby/subscriber behind the primary/publisher by a
certain time period, and if there is any unwanted transaction (say, a
DELETE without a WHERE clause), we can recover it from the receiver
side. So, it may not matter much even if we wait at PREPARE to avoid
long locks instead of documenting it.

> > >  If
> > > delaying non-prepared transactions until COMMIT is adequate, then the
> > > same thing seems to work for prepared transactions.
> > >
> > > > have such problems. This is the reason that we decide to perform a
> > > > delay at the start of the transaction instead at commit/prepare in the
> > > > subscriber-side approach.
> > >
> > > It seems that there are no technical obstacles to do that on the
> > > publisher side. The only observable difference would be that
> > > relatively large prepared transactions may experience noticeable
> > > additional delays.  IMHO I don't think it's a good practice
> > > protocol-wise to intentionally choke a stream at the receiving end
> > > when it has not been flow-controlled on the transmitting end.
> > >
> >
> > But in this proposal, we are not choking/delaying anything on the receiving 
> > end.
>
> I didn't say that to the latest patch.  I interpreted the quote of
> your description as saying that the subscriber-side solution is
> effective in solving the long-lock problems, so I replied that that
> can be solved with the publisher-side solution and the subscriber-side
> solution could cause some unwanted behavior.
>
> Do you think we have decided to go with the publisher-side solution?
> I'm fine if so.
>

I am fine too unless we discover any major challenges with
publisher-side implementation.

-- 
With Regards,
Amit Kapila.




Re: Auth extensions, with an LDAP/SCRAM example [was: Proposal: Support custom authentication methods using hooks]

2023-02-28 Thread Stephen Frost
Greetings,

* Jacob Champion (jchamp...@timescale.com) wrote:
> On Mon, Feb 27, 2023 at 12:43 PM Stephen Frost  wrote:
> > * Jacob Champion (jchamp...@timescale.com) wrote:
> > > This patchset should ideally have required zero client side changes, but
> > > because our SCRAM implementation is slightly nonstandard too -- it
> > > doesn't embed the username into the SCRAM data -- libpq can't talk to
> > > the OpenLDAP/Cyrus SASL implementation. I included a quick-and-bad patch
> > > to fix it in libpq; that needs its own conversation.
> >
> > Indeed it does... as there were specific reasons that what we
> > implemented for PG's SCRAM doesn't embed the username into the SCRAM
> > data and my recollection is that we don't because SCRAM (or maybe SASL?
> > either way though...) requires it to be utf-8 encoded and we support a
> > lot of other encoding that aren't that, so it wouldn't work for a lot
> > of our users.
> 
> Yes. Interoperable authentication is going to have to solve those
> sorts of problems somehow, though. And there are a bunch of levers to
> pull: clients aren't required to run their sent usernames through
> SASLprep; our existing servers don't mind having it sent, so we could
> potentially backport a client change; and then after that it's a game
> of balancing compatibility and compliance. A deeper conversation for
> sure.

We did solve those problems with SCRAM, by not exactly following the
unfortunately short-sighted standard.  What we have gets us the benefits
of SCRAM and generally follows the standard without giving up non-utf8
encodings of roles.  If there's a compelling reason to support another
authentication method then I'd hope we would look for similar ways to
support it without giving up too much.

I certainly don't see it making sense to backport some change in this
area.

> > > I think this is exactly the sort of thing that's too niche to be in core
> > > but might be extremely useful for the few people it applies to, so I'm
> > > proposing it as an argument in favor of general authn/z extensibility.
> >
> > If it was able to actually work for our users (and maybe it is if we
> > make it optional somehow) and we have users who want it (which certainly
> > seems plausible) then I'd say that it's something we would benefit from
> > having in core.  While it wouldn't solve all the issues with 'ldap'
> > auth, if it works with AD's LDAP servers and doesn't require the
> > password to be passed from the client to the server (in any of this, be
> > the client psql, pgadmin, or the PG server when it goes to talk to the
> > LDAP server..) then that would be a fantastic thing and we could
> > replace the existing 'ldap' auth with that and be *much* better off for
> > it, and so would our users.
> 
> I think the argument you're making here boils down to "if it's not
> niche, it should be in core", and I agree with that general sentiment.

Good.  That certainly seems like a start.

> But not all of the prerequisites you stated are met. I see no evidence
> that Active Directory supports SCRAM [1], for instance, unless the MS
> documentation is just omitting it.

That's certainly unfortunate if that's the case.  The question then
becomes- are a lot of people using SCRAM to connect to OpenLDAP, to the
extent that it's a sensible thing for us to support and gives us a more
secure option for 'ldap' auth than what exists today where we're passing
around cleartext passwords?  That seems like it's at least plausible.

> Even if it were applicable to every single LDAP deployment, I'd still
> like users to be able to choose the best authentication available in
> their setups without first having to convince the community. They can
> maintain the things that make sense for them, like they do with
> extensions. And the authors can iterate on better authentication out
> of cycle, like extension authors do.

Considering how few well-maintained outside-of-core extensions for PG
exist, I have an extremely hard time buying the argument that the
authentication system is a smart area for us to encourage this approach in.

Broadly speaking, I feel confident saying that authentication systems,
rather like encryption libraries, should be standardized, well
documented, heavily reviewed, and carefully maintained.  If the argument
is that there are these teams out there who are itching to write amazing
new authentication methods for PG who are going to spend time
documenting them, reviewing them, and maintaining them, then we should
try to get those folks to work with the community to add support for
these new methods so that everyone can benefit from them- not
encouraging them to be independent projects which have to work through
hooks that limit how those authentication methods are able to be
integrated.  Consider things like pg_stat_ssl and pg_stat_gssapi.  What
about pg_ident maps?  Will each of these end up with their own version
of that?  And those are questions beyond just the issue around making
sure 

Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Masahiko Sawada
On Wed, Mar 1, 2023 at 12:51 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Amit,
>
> > Few comments:
>
> Thank you for reviewing! PSA new version.
> Note that the starting point of delay for 2PC was not changed,
> I think it has been under discussion.
>
> > 1.
> > + /*
> > + * If we've requested to shut down, exit the process.
> > + *
> > + * Note that WalSndDone() cannot be used here because the delaying
> > + * changes will be sent in the function.
> > + */
> > + if (got_STOPPING)
> > + {
> > + QueryCompletion qc;
> > +
> > + /* Inform the standby that XLOG streaming is done */
> > + SetQueryCompletion(&qc, CMDTAG_COPY, 0);
> > + EndCommand(&qc, DestRemote, false);
> > + pq_flush();
> >
> > Do we really need to do anything except for breaking the loop and let
> > the exit handling happen in the main loop when 'got_STOPPING' is set?
> > AFAICS, this is what we are doing in some other palces (See
> > WalSndWaitForWal). Won't that work? It seems that will help us sending
> > all the pending WAL.
>
> If we exit the loop after got_STOPPING is set, as you said, the walsender will
> send the delayed changes and then exit. The behavior is the same as when
> WalSndDone() is called. But I think it is not suitable for the motivation of
> the feature. If users notice a mistaken operation like TRUNCATE, they must
> shut down the publisher once and then recover from a backup or from the old
> subscriber. If the walsender sends all pending changes, the mistaken
> operations will also be propagated to the subscriber and the data cannot be
> protected. So currently I want to keep the style.
> FYI - In case of physical replication, received WALs are not applied when the
> secondary is shut down.
>
> > 2.
> > + /* Try to flush pending output to the client */
> > + if (pq_flush_if_writable() != 0)
> > + WalSndShutdown();
> >
> > Is there a reason to try flushing here?
>
> IIUC if pq_flush_if_writable() returns non-zero (EOF), it means that there is
> trouble and the walsender fails to send messages to the subscriber.
>
> In Linux, the call chain from pq_flush_if_writable() finally reaches the
> send() system call, and according to the man page[1], that can fail because
> of some unexpected state or because the connection is closed.
>
> Based on above, I think the returned value should be confirmed.
>
> > Apart from the above, I have made a few changes in the comments in the
> > attached diff patch. If you agree with those then please include them
> > in the next version.
>
> Thanks! I checked and I thought all of them should be included.
>
> Moreover, I used grammar checker and slightly reworded the commit message.

Thinking of side effects of this feature (no matter where we delay
applying the changes), on the publisher, vacuum cannot collect garbage
and WAL cannot be recycled. Is that okay in the first place? The point
is that the subscription setting affects the publisher. That is,
min_send_delay is specified on the subscriber but the symptoms that
could ultimately lead to a server crash appear on the publisher, which
sounds dangerous to me.

Imagine a service or system where there is a publication server that is
somewhat exposed, so that a user (or a subsystem) can arbitrarily create a
subscriber to replicate a subset of the data. A malicious user could crash
the publisher by creating a subscription with, say, min_send_delay = 20d.
max_slot_wal_keep_size helps with this situation, but it's -1 by default.
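
To make the exposure concrete, a sketch (min_send_delay below is only the
option proposed in this thread, not an existing parameter; object names are
made up):

  -- publisher: capping retained WAL bounds the damage a delayed consumer
  -- can cause; note that the default of -1 means "no limit"
  ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
  SELECT pg_reload_conf();

  -- subscriber: the proposed option would be set roughly like this, and a
  -- hostile or careless value is exactly the scenario described above
  CREATE SUBSCRIPTION sub_delayed
      CONNECTION 'host=publisher dbname=postgres'
      PUBLICATION pub
      WITH (min_send_delay = '1d');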

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Kyotaro Horiguchi
At Tue, 28 Feb 2023 08:35:11 +0530, Amit Kapila  wrote 
in 
> On Tue, Feb 28, 2023 at 8:14 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Mon, 27 Feb 2023 14:56:19 +0530, Amit Kapila  
> > wrote in
> > > The one difference w.r.t recovery_min_apply_delay is that here we will
> > > hold locks for the duration of the delay which didn't seem to be a
> > > good idea. This will also probably lead to more bloat as we will keep
> > > transactions open for a long time. Doing it before DecodePrepare won't
> >
> > I don't have a concrete picture but could we tell reorder buffer to
> > retain a PREPAREd transaction until a COMMIT PREPARED comes?
> >
> 
> Yeah, we could do that and that is what is the behavior unless the
> user enables 2PC via 'two_phase' subscription option. But, I don't see
> the need to unnecessarily delay the prepare till the commit if a user
> has specified 'two_phase' option. It is quite possible that COMMIT
> PREPARED happens at a much later time frame than the amount of delay
> the user is expecting.

It looks like the user should decide between potential long locks and
extra delays, and this choice ought to be documented.

> >  If
> > delaying non-prepared transactions until COMMIT is adequate, then the
> > same thing seems to work for prepared transactions.
> >
> > > have such problems. This is the reason that we decide to perform a
> > > delay at the start of the transaction instead at commit/prepare in the
> > > subscriber-side approach.
> >
> > It seems that there are no technical obstacles to do that on the
> > publisher side. The only observable difference would be that
> > relatively large prepared transactions may experience noticeable
> > additional delays.  IMHO I don't think it's a good practice
> > protocol-wise to intentionally choke a stream at the receiving end
> > when it has not been flow-controlled on the transmitting end.
> >
> 
> But in this proposal, we are not choking/delaying anything on the receiving 
> end.

I wasn't saying that about the latest patch.  I interpreted the quote of
your description as saying that the subscriber-side solution is
effective in solving the long-lock problems, so I replied that that
can be solved with the publisher-side solution, while the subscriber-side
solution could cause some unwanted behavior.

Do you think we have decided to go with the publisher-side solution?
I'm fine if so.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: [Proposal] Add foreign-server health checks infrastructure

2023-02-28 Thread Katsuragi Yuta

Hi Kuroda-san,

Thank you for updating the patch!



I rethought the pqSocketPoll part. Current interpretation of
arguments seems a little bit confusing because a specific pattern
of arguments has a different meaning. What do you think about
introducing a new argument like `int forConnCheck`? This seems
straightforward and readable.


I think it may be better, so fixed.
But now we must consider another thing - will we support a combination of
conncheck and {read|write}, or a timeout? Especially regarding the timeout:
if someone calls pqSocketPoll() with forConnCheck = 1 and end_time = -1, the
process may get stuck because it waits until the socket peer closes the
connection.
Currently forConnCheck can be specified together with the other requests,
but the timeout must be zero.


Yes, we need to consider these.
I'm wondering whether we need special care for the combination
of event and timeout. Surely, if forConnCheck is set and end_time = -1,
pqSocketPoll blocks until the connection closes. However, the
behavior matches the meaning of the arguments and does not seem
confusing (it is also not an error state). Do we need to restrict this
kind of usage on the pqSocketPoll side? I think something like the
following might be fine.

```
if (!forRead && !forWrite)
{
if (!(forConnCheck && PQconnCheckable()))
return 0;
}
```
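
FWIW, with the signature used in the patch (pqSocketPoll(sock, forRead,
forWrite, forConnCheck, end_time)), a caller that only wants the
connection-health check without blocking would look roughly like this
fragment (a sketch, not code from the patch):

    /* pure connection check: don't wait, just ask whether the peer went away */
    r = pqSocketPoll(conn->sock,
                     0,    /* forRead */
                     0,    /* forWrite */
                     1,    /* forConnCheck */
                     0);   /* end_time = 0, i.e. no blocking */
    if (r > 0)
        /* POLLRDHUP (or equivalent) reported: peer closed the connection */ ;
    else if (r < 0)
        /* error while polling */ ;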




While making the patch, I came up with the idea that
postgres_fdw_verify_connection_states* should return NULL if PQconnCheck()
cannot work on this platform. This can be done by adding PQconnCheckable().
It may be reasonable because the check is not really done in this
environment. What do you think?


I agree with you.

Followings are comments for v33. Please check.
0001:
1. the comment of PQconnCheck
+/*
+ * Check whether the socket peer closed connection or not.
+ *

Check whether the socket peer closed 'the' connection or not?

2. the comment of pqSocketCheck
- * or both.  Returns >0 if one or more conditions are met, 0 if it timed
- * out, -1 if an error occurred.
+ * or both. Moreover, this function can check the health of socket on some
+ * limited platforms if end_time is 0.

the health of socket -> the health of the connection?
if end_time is 0 -> if forConnCheck is specified?


3. the comment of pqSocketPoll
- * If neither forRead nor forWrite are set, immediately return a timeout
- * condition (without waiting).  Return >0 if condition is met, 0
- * if a timeout occurred, -1 if an error or interrupt occurred.
+ * Moreover, this function can check the health of socket on some limited
+ * platforms if end_time is 0.

the health of socket -> the health of the connection?
if end_time is 0 -> if forConnCheck is specified?

4. the code of pqSocketPoll
+#if defined(POLLRDHUP)
+   if (forConnCheck)
+   input_fd.events |= POLLRDHUP;
+#endif

I think it is better to use PQconnCheckable() to remove the macro.

5. the code of pqSocketCheck and the definition of pqSocketPoll
-   result = pqSocketPoll(conn->sock, forRead, forWrite, end_time);
+   result = pqSocketPoll(conn->sock, forRead, forWrite, forConnCheck, end_time);


-pqSocketPoll(int sock, int forRead, int forWrite, time_t end_time)
+pqSocketPoll(int sock, int forRead, int forWrite, int forConnCheck, time_t end_time)


Should these be divided into two lines?


6. the comment of verify_cached_connections
+ * This function emits warnings if a disconnection is found, and this returns
+ * false only when the lastly verified server seems to be disconnected.

It seems better to write the case where this function returns
true.

7. the comment of postgres_fdw_verify_connection_states
+ * This function emits a warning if a disconnection is found. This returns
+ * false only when the verified server seems to be disconnected, and reutrns
+ * NULL if the connection check had not been done.

It seems better to write the case where this function returns
true.

8. the code of
+   (errcode(ERRCODE_CONNECTION_FAILURE),
+errmsg("%s", str.data),
+errdetail("Socket close is detected."),
+errhint("Plsease check the health of server.")));

Is it better to use "Connection close is detected" rather than
"Socket close is detected"?

9. the document of postgres_fdw
The document of postgres_fdw_verify_connection_states_* is a little
bit old. Could you update it?

regards,

--
Katsuragi Yuta
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Non-superuser subscription owners

2023-02-28 Thread Stephen Frost
Greetings,

* Jeff Davis (pg...@j-davis.com) wrote:
> On Tue, 2023-02-28 at 08:37 -0500, Robert Haas wrote:
> > The existing SECURITY_RESTRICTED_OPERATION flag basically prevents
> > you
> > from tinkering with the session state.
> 
> Currently, every time we set that flag we also run all the code as the
> table owner.
> 
> You're suggesting using the SECURITY_RESTRICTED_OPERATION flag, along
> with the new security flags, but not switch to the table owner, right?

I'm having trouble following this too, I have to admit.  If we aren't
changing who we're running the code under, but are making it so that the
code isn't actually able to do anything, then that doesn't strike me as
likely to be useful.  Surely things like triggers which are used to update
or insert into another table to record what happened on the table with the
trigger need to be allowed to modify the database - how do we make that
possible when the code runs as the invoker rather than the table owner, yet
the table owner is the one who gets to write the code?

> >  If we also had a similar flags
> > like DATABASE_READS_PROHIBITED and DATABASE_WRITES_PROHIBITED (or
> > just
> > a combined DATABASE_ACCESS_PROHIBITED flag) I think that would be
> > pretty close to what we need. The idea would be that, when a user
> > executes a function or procedure 
> 
> Or default expressions, I presume. If we at least agree on this point,
> then I think we should try to find a way to treat these other hunks of
> code in a secure way (which I think is what Andres was suggesting).

Would need to apply to functions in views and functions in RLS too,
along with default expressions and everything else that could be defined
by one person and run by another.

> > owned by a user that they don't trust
> > completely, we'd set
> > SECURITY_RESTRICTED_OPERATION|DATABASE_READS_PROHIBITED|DATABASE_WRIT
> > ES_PROHIBITED.
> 
> It seems like you're saying to basically just keep the user ID the
> same, and maybe keep USAGE privileges, but not be able to do anything
> else? Might be useful. Kind of like running it as a nobody user but
> without the problems you mentioned. Some details to think about, I'm
> sure.

While there are certainly some use-cases where a completely unprivileged
user would work, there are an awful lot where it wouldn't.
Having that as an option might be interesting for those much more
limited use-cases - maybe you could even say "only run functions which
are owned by a superuser or X roles" - but it's certainly not a general
solution to the problem.

> > And we could provide a user with a way to express the degree of trust
> > they have in some other user or perhaps even some specific function,
> > e.g.
> > 
> > SET trusted_roles='alice:read';
> > 
> > ...could mean that I trust alice to read from the database with my
> > permissions, should I happen to run code provided by her in SECURITY
> > INVOKER mode.
> 
> I'm not very excited about inventing a new privilege language inside a
> GUC, but perhaps a simpler form could be a reasonable mitigation (or at
> least a starting place).

I'm pretty far down the path of "wow that looks really difficult to work
with", to put it nicely.

> > I'm sure there's some details to sort out here, e.g. around security
> > related to the trusted_roles GUC itself. But I don't really see a
> > fundamental problem. We can invent arbitrary flags that prohibit
> > classes of operations that are of concern, set them by default in
> > cases where concern is justified, and then give users who want the
> > current behavior some kind of escape hatch that causes those flags to
> > not get set after all. Not only does such a solution not seem
> > impossible, I can possibly even imagine back-patching it, depending
> > on
> > exactly what the shape of the final solution is, how important we
> > think it is to get a fix out there, and how brave I'm feeling that
> > day.
> 
> Unless the trusted roles defaults to '*', then I think it will still
> break some things.

Defaulting to an option that is "don't break anything" while giving
users flexibility to test out other, more secure, options seems like it
would be a pretty reasonable way forward, generally.  That said.. I
don't really think this particular approach ends up being a good
direction to go in...

> One of my key tests for user-facing proposals is whether the
> documentation will be reasonable or not. Most of these proposals to
> make SECURITY INVOKER less bad fail that test.

and this is certainly a very good point as to why.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Peter Smith
Here are some review comments for v9-0001, but these are only very trivial.

==
Commit Message

1.
Nitpick. The new text is jagged-looking. It should wrap at ~80 chars.

~~~

2.
2. Another reason is for that parallel streaming, the transaction will be opened
immediately by the parallel apply worker. Therefore, if the walsender
is delayed in sending the final record of the transaction, the
parallel apply worker must wait to receive it with an open
transaction. This would result in the locks acquired during the
transaction not being released until the min_send_delay has elapsed.

~

The text already said there are "two reasons", and already this is
numbered as reason 2. So it doesn't need to keep saying "Another
reason" here.

"Another reason is for that parallel streaming" --> "For parallel streaming..."

==
src/backend/replication/walsender.c

3. WalSndDelay

+ /* die if timeout was reached */
+ WalSndCheckTimeOut();

Other nearby comments start uppercase, so this should too.

==
src/include/replication/walreceiver.h

4. WalRcvStreamOptions

@@ -187,6 +187,7 @@ typedef struct
  * prepare time */
  char*origin; /* Only publish data originating from the
  * specified origin */
+ int32 min_send_delay; /* The minimum send delay */
  } logical;
  } proto;
 } WalRcvStreamOptions;

~

Should that comment mention that the units are "(ms)"?

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Doc update for pg_stat_statements normalization

2023-02-28 Thread Michael Paquier
On Wed, Mar 01, 2023 at 12:43:40AM +, Imseih (AWS), Sami wrote:
> Yes, that makes sense.

Okay, thanks.  Done that now on HEAD.
--
Michael


signature.asc
Description: PGP signature


Proposal: %T Prompt parameter for psql for current time (like Oracle has)

2023-02-28 Thread Kirk Wolak
I cannot get the last email to show up for the commitfest.
This is version 2 of the original patch. [1]
Thanks Jim!

[1]
https://postgr.es/m/CACLU5mSRwHr_8z%3DenMj-nXF1tmC7%2BJn5heZQNiKuLyxYUtL2fg%40mail.gmail.com

Regards Kirk.
From b9db157177bbdeeeb6d35c3623ca9355141419d7 Mon Sep 17 00:00:00 2001
From: Jim Jones 
Date: Wed, 1 Mar 2023 00:07:55 +0100
Subject: [PATCH v2] Time option added to psql prompt

This adds a useful time option to the prompt: %T. Which does not
require a wasteful backquoted shell command which is also not
compatible between operating systems.
The format is simply HH24:MI:SS no other options available by design!

Author: Kirk Wolak 
Reviewed-By: Andrey Borodin 
Reviewed-By: Nikolay Samokhvalov 
Thread: 
https://postgr.es/m/CACLU5mSRwHr_8z%3DenMj-nXF1tmC7%2BJn5heZQNiKuLyxYUtL2fg%40mail.gmail.com
---
 doc/src/sgml/ref/psql-ref.sgml |  9 +
 src/bin/psql/prompt.c  | 10 +-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index dc6528dc11..04ab9eeb8c 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -4575,6 +4575,15 @@ testdb= INSERT INTO my_table VALUES 
(:'content');
 
   
 
+  
+%T
+
+ 
+  The current time on the client in HH24:MI:SS format.
+ 
+
+  
+
   
 %x
 
diff --git a/src/bin/psql/prompt.c b/src/bin/psql/prompt.c
index 969cd9908e..0c0c725df5 100644
--- a/src/bin/psql/prompt.c
+++ b/src/bin/psql/prompt.c
@@ -41,6 +41,7 @@
  * or a ! if session is not connected to a database;
  * in prompt2 -, *, ', or ";
  * in prompt3 nothing
+ * %T - time in HH24:MI:SS format
  * %x - transaction status: empty, *, !, ? (unknown or no connection)
  * %l - The line number inside the current statement, starting from 1.
  * %? - the error code of the last query (not yet implemented)
@@ -223,7 +224,14 @@ get_prompt(promptStatus_t status, ConditionalStack cstack)
break;
}
break;
-
+   /* output HH24:MI:SS */
+   case 'T':
+   {
+   time_t current_time = time(NULL);
+   struct tm *tm_info = localtime(&current_time);
+   sprintf(buf, "%02d:%02d:%02d", tm_info->tm_hour, tm_info->tm_min, tm_info->tm_sec);
+   }
+   break;
case 'x':
if (!pset.db)
buf[0] = '?';
-- 
2.25.1
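
As a usage illustration (not part of the patch itself), the new escape would
be combined with the existing ones like any other prompt substitution:

  -- e.g. in ~/.psqlrc
  \set PROMPT1 '%T %n@%/%R%# '

  -- producing a prompt such as:
  -- 14:32:07 postgres@testdb=#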



Re: Minimal logical decoding on standbys

2023-02-28 Thread Jeff Davis
On Mon, 2023-02-27 at 09:40 +0100, Drouvot, Bertrand wrote:
> Please find attached V51 tiny rebase due to a6cd1fc692 (for 0001) and
> 8a8661828a (for 0005).

[ Jumping into this thread late, so I apologize if these comments have
already been covered. ]

Regarding v51-0004:

* Why is the CV sleep not being canceled?
* Comments on WalSndWaitForWal need to be updated to explain the
difference between the flush (primary) and the replay (standby) cases.

Overall, it seems like what you really want for the sleep/wakeup logic
in WalSndWaitForLSN is something like this:

   condVar = RecoveryInProgress() ? replayCV : flushCV;
   waitEvent = RecoveryInProgress() ? 
   WAIT_EVENT_WAL_SENDER_WAIT_REPLAY :
   WAIT_EVENT_WAL_SENDER_WAIT_FLUSH;

   ConditionVariablePrepareToSleep(condVar);
   for(;;)
   {
  ...
  sleeptime = WalSndComputeSleepTime(GetCurrentTimestamp());
  socketEvents = WL_SOCKET_READABLE;
  if (pq_is_send_pending())
  socketEvents = WL_SOCKET_WRITABLE;
  ConditionVariableTimedSleepOrEvents(
  condVar, sleeptime, socketEvents, waitEvent);
   }
   ConditionVariableCancelSleep();


But the problem is that ConditionVariableTimedSleepOrEvents() doesn't
exist, and I think that's what Andres was suggesting here[1].
WalSndWait() only waits for a timeout or a socket event, but not a CV;
ConditionVariableTimedSleep() only waits for a timeout or a CV, but not
a socket event.
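
To spell out the shape of the missing primitive (purely hypothetical, nothing
like this exists today), what the loop above wants is roughly:

   /* wait for a CV broadcast, a socket event, or a timeout, whichever
    * comes first, reporting which condition ended the wait */
   int ConditionVariableTimedSleepOrEvents(ConditionVariable *cv,
                                           long timeout,
                                           uint32 socket_events,
                                           uint32 wait_event_info);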

I'm also missing how WalSndWait() works currently. It calls
ModifyWaitEvent() with NULL for the latch, so how does WalSndWakeup()
wake it up?

Assuming I'm wrong, and WalSndWait() does use the latch, then I guess
it could be extended by having two different latches in the WalSnd
structure, and waking them up separately and waiting on the right one.
Not sure if that's a good idea though.

[1]
https://www.postgresql.org/message-id/20230106034036.2m4qnn7ep7b5i...@awork3.anarazel.de

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Proposal: %T Prompt parameter for psql for current time (like Oracle has)

2023-02-28 Thread Kirk Wolak
On Thu, Feb 23, 2023 at 2:05 PM Kirk Wolak  wrote:

> On Thu, Feb 23, 2023 at 9:52 AM Tom Lane  wrote:
>
>> Heikki Linnakangas  writes:
>> > On 23/02/2023 13:20, Peter Eisentraut wrote:
>> >> If you don't have \timing turned on before the query starts, psql won't
>> >> record what the time was before the query, so you can't compute the run
>> >> time afterwards.  This kind of feature would only work if you always
>> >> take the start time, even if \timing is turned off.
>>
>> > Correct. That seems acceptable though? gettimeofday() can be slow on
>> > some platforms, but I doubt it's *that* slow, that we couldn't call it
>> > two times per query.
>>
>> Yeah, you'd need to capture both the start and stop times even if
>> \timing isn't on, in case you get asked later.  But the backend is
>> going to call gettimeofday at least once per query, likely more
>> depending on what features you use.  And there are inherently
>> multiple kernel calls involved in sending a query and receiving
>> a response.  I tend to agree with Heikki that this overhead would
>> be unnoticeable.  (Of course, some investigation proving that
>> wouldn't be unwarranted.)
>>
>> regards, tom lane
>>
>
> Note, for the above feature, I was thinking: we already have a ROW_COUNT
> variable that I use \set to see.
> The simplest way to add this is maybe another such variable:  EXEC_TIME,
> set whenever ROW_COUNT gets set.
> +1
>
> ==
> Now, since this opened a lively discussion, I am officially submitting my
> first patch.
> This includes the small change to prompt.c and the documentation.  I had
> help from Andrey Borodin,
> and Pavel Stehule, who have supported me in how to propose, and use
> gitlab, etc.
>
> We are programmers... It's literally our job to sharpen our tools.  And
> PSQL is one of my most used.
> A small frustration, felt regularly, was the motive.
>
> Regards, Kirk
> PS: If I am supposed to edit the subject to say there is a patch here, I
> did not know
> PPS: I appreciate ANY and ALL feedback... This is how we learn!
>

Patch posted with one edit, for line endings (Thanks Jim!)
From b9db157177bbdeeeb6d35c3623ca9355141419d7 Mon Sep 17 00:00:00 2001
From: Jim Jones 
Date: Wed, 1 Mar 2023 00:07:55 +0100
Subject: [PATCH v2] Time option added to psql prompt

This adds a useful time option to the prompt: %T. Which does not
require a wasteful backquoted shell command which is also not
compatible between operating systems.
The format is simply HH24:MI:SS no other options available by design!

Author: Kirk Wolak 
Reviewed-By: Andrey Borodin 
Reviewed-By: Nikolay Samokhvalov 
Thread: 
https://postgr.es/m/CACLU5mSRwHr_8z%3DenMj-nXF1tmC7%2BJn5heZQNiKuLyxYUtL2fg%40mail.gmail.com
---
 doc/src/sgml/ref/psql-ref.sgml |  9 +
 src/bin/psql/prompt.c  | 10 +-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index dc6528dc11..04ab9eeb8c 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -4575,6 +4575,15 @@ testdb= INSERT INTO my_table VALUES 
(:'content');
 
   
 
+  
+%T
+
+ 
+  The current time on the client in HH24:MI:SS format.
+ 
+
+  
+
   
 %x
 
diff --git a/src/bin/psql/prompt.c b/src/bin/psql/prompt.c
index 969cd9908e..0c0c725df5 100644
--- a/src/bin/psql/prompt.c
+++ b/src/bin/psql/prompt.c
@@ -41,6 +41,7 @@
  * or a ! if session is not connected to a database;
  * in prompt2 -, *, ', or ";
  * in prompt3 nothing
+ * %T - time in HH24:MI:SS format
  * %x - transaction status: empty, *, !, ? (unknown or no connection)
  * %l - The line number inside the current statement, starting from 1.
  * %? - the error code of the last query (not yet implemented)
@@ -223,7 +224,14 @@ get_prompt(promptStatus_t status, ConditionalStack cstack)
break;
}
break;
-
+   /* output HH24:MI:SS */
+   case 'T':
+   {
+   time_t current_time = time(NULL);
+   struct tm *tm_info = localtime(&current_time);
+   sprintf(buf, "%02d:%02d:%02d", tm_info->tm_hour, tm_info->tm_min, tm_info->tm_sec);
+   }
+   break;
case 'x':
if (!pset.db)
buf[0] = '?';
-- 
2.25.1



Re: Doc update for pg_stat_statements normalization

2023-02-28 Thread Imseih (AWS), Sami
> + 
> + Queries on which normalization can be applied may be observed with constant
> + values in pg_stat_statements, especially when there
> + is a high rate of entry deallocations. To reduce the likelihood of this
> + happening, consider increasing pg_stat_statements.max.
> + The pg_stat_statements_info view, discussed below
> + in ,
> + provides statistics about entry deallocations.
> + 

> Are you OK with this text?

Yes, that makes sense.


Regards,

--
Sami Imseih
Amazon Web Services





Re: pg_upgrade and logical replication

2023-02-28 Thread Julien Rouhaud
On Tue, Feb 28, 2023 at 08:02:13AM -0800, Nikolay Samokhvalov wrote:
> On Fri, Feb 17, 2023 at 7:35 AM Julien Rouhaud  wrote:
> >
> >  Any table later added in the
> > publication is either already fully replicated until that LSN on the 
> > upgraded
> > node, so only the delta is needed, or has been created after that LSN.  In 
> > the
> > latter case, the entirety of the table will be replicated with the logical
> > replication as a delta right?
>
> What if we consider a slightly adjusted procedure?
>
> 0. Temporarily, forbid running any DDL on the source cluster.

This is (at least for me) a non-starter, as I want an approach that doesn't
impact the primary node, at least not too much.

Also, how would you do that?  If you need some new infrastructure it means that
you can only upgrade nodes starting from pg16+, while my approach can upgrade
any node that supports publications as long as the target version is pg16+.

It also raises some concerns: why prevent all DDL when e.g. creating a
temporary table shouldn't be a problem, and the same goes for renaming some
underlying object, adding indexes...  You would have to curate a list of
exactly what is allowed, which is never great.

Also, how exactly would you ensure that DDL has indeed been forbidden since a
long enough point in time, rather than just being "currently" forbidden at
the time you do some check?




Re: Doc update for pg_stat_statements normalization

2023-02-28 Thread Michael Paquier
On Tue, Feb 28, 2023 at 11:11:30PM +, Imseih (AWS), Sami wrote:
> I agree. We introduce the concept of a plannable statement in a 
> previous section and we can then make this distinction in the new
> paragraph. 
> 
> I also added a link to pg_stat_statements_info since that is introduced
> later on in the doc.

I have reworded the paragraph a bit to be more general so that it would
not need an update once more normalization is applied to utility
queries (I am going to fix the part where we mention that we use the
strings for utilities, which is no longer the case):
+  
+   Queries on which normalization can be applied may be observed with constant
+   values in pg_stat_statements, especially when there
+   is a high rate of entry deallocations. To reduce the likelihood of this
+   happening, consider increasing pg_stat_statements.max.
+   The pg_stat_statements_info view, discussed below
+   in ,
+   provides statistics about entry deallocations.
+  

Are you OK with this text?
--
Michael
From dd8938e4ba1b1f29d14b3fa2dc76301f42592cad Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 1 Mar 2023 09:05:08 +0900
Subject: [PATCH v3] doc update regarding pg_stat_statements normalization.

It is quite possible that a query text will not be normalized
(constants replaced with $1, $2, ...) when there is a high rate of
pgss_hash deallocations. This commit calls this out in the docs.
---
 doc/src/sgml/pgstatstatements.sgml | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index efc36da602..f1ba78c8cb 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -516,6 +516,16 @@
pg_stat_statements entry.
   
 
+  
+   Queries on which normalization can be applied may be observed with constant
+   values in pg_stat_statements, especially when there
+   is a high rate of entry deallocations. To reduce the likelihood of this
+   happening, consider increasing pg_stat_statements.max.
+   The pg_stat_statements_info view, discussed below
+   in ,
+   provides statistics about entry deallocations.
+  
+
   
In some cases, queries with visibly different texts might get merged into a
single pg_stat_statements entry.  Normally this will happen
-- 
2.39.2



signature.asc
Description: PGP signature


Re: Add LZ4 compression in pg_dump

2023-02-28 Thread Justin Pryzby
On Thu, Feb 23, 2023 at 09:24:46PM +0100, Tomas Vondra wrote:
> On 2/23/23 16:26, Tomas Vondra wrote:
> > Thanks for v30 with the updated commit messages. I've pushed 0001 after
> > fixing a comment typo and removing (I think) an unnecessary change in an
> > error message.
> > 
> > I'll give the buildfarm a bit of time before pushing 0002 and 0003.
> > 
> 
> I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
> and marked the CF entry as committed. Thanks for the patch!

I found that e9960732a broke writing of empty gzip-compressed data,
specifically LOs.  pg_dump succeeds, but then the restore fails:

postgres=# SELECT lo_create(1234);
lo_create | 1234

$ time ./src/bin/pg_dump/pg_dump -h /tmp -d postgres -Fc 
|./src/bin/pg_dump/pg_restore -f /dev/null -v 
pg_restore: implied data-only restore
pg_restore: executing BLOB 1234
pg_restore: processing BLOBS
pg_restore: restoring large object with OID 1234
pg_restore: error: could not uncompress data: (null)

The inline patch below fixes it, but you won't be able to apply it
directly, as it's on top of other patches which rename the functions
back to "Zlib" and rearrange the functions to their original order, to
allow running:

git diff --diff-algorithm=minimal -w e9960732a~:./src/bin/pg_dump/compress_io.c 
./src/bin/pg_dump/compress_gzip.c

The current function order avoids 3 lines of declarations, but it's
obviously pretty useful to be able to run that diff command.  I already
argued for not calling the functions "Gzip" on the grounds that the name
was inaccurate.

I'd want to create an empty large object in src/test/sql/largeobject.sql
so this gets exercised during pg_upgrade testing.  But unfortunately that
doesn't use -Fc, so this isn't hit.  Empty input is an important enough
test case to justify a TAP test, if there's no better way.

diff --git a/src/bin/pg_dump/compress_gzip.c b/src/bin/pg_dump/compress_gzip.c
index f3f5e87c9a8..68f3111b2fe 100644
--- a/src/bin/pg_dump/compress_gzip.c
+++ b/src/bin/pg_dump/compress_gzip.c
@@ -55,6 +55,32 @@ InitCompressorZlib(CompressorState *cs,
gzipcs = (ZlibCompressorState *) pg_malloc0(sizeof(ZlibCompressorState));
 
cs->private_data = gzipcs;
+
+   if (cs->writeF)
+   {
+   z_streamp   zp;
+   zp = gzipcs->zp = (z_streamp) pg_malloc0(sizeof(z_stream));
+   zp->zalloc = Z_NULL;
+   zp->zfree = Z_NULL;
+   zp->opaque = Z_NULL;
+
+   /*
+* outsize is the buffer size we tell zlib it can output to.  We
+* actually allocate one extra byte because some routines want 
to append a
+* trailing zero byte to the zlib output.
+*/
+
+   gzipcs->outbuf = pg_malloc(ZLIB_OUT_SIZE + 1);
+   gzipcs->outsize = ZLIB_OUT_SIZE;
+
+   if (deflateInit(gzipcs->zp, cs->compression_spec.level) != Z_OK)
+   pg_fatal("could not initialize compression library: %s",
+   zp->msg);
+
+   /* Just be paranoid - maybe End is called after Start, with no Write */
+   zp->next_out = gzipcs->outbuf;
+   zp->avail_out = gzipcs->outsize;
+   }
 }
 
 static void
@@ -63,7 +89,7 @@ EndCompressorZlib(ArchiveHandle *AH, CompressorState *cs)
ZlibCompressorState *gzipcs = (ZlibCompressorState *) cs->private_data;
z_streamp   zp;
 
-   if (gzipcs->zp)
+   if (cs->writeF != NULL)
{
zp = gzipcs->zp;
zp->next_in = NULL;
@@ -131,29 +157,6 @@ WriteDataToArchiveZlib(ArchiveHandle *AH, CompressorState *cs,
   const void *data, size_t dLen)
 {
ZlibCompressorState *gzipcs = (ZlibCompressorState *) cs->private_data;
-   z_streamp   zp;
-
-   if (!gzipcs->zp)
-   {
-   zp = gzipcs->zp = (z_streamp) pg_malloc(sizeof(z_stream));
-   zp->zalloc = Z_NULL;
-   zp->zfree = Z_NULL;
-   zp->opaque = Z_NULL;
-
-   /*
-* outsize is the buffer size we tell zlib it can output to.  We
-* actually allocate one extra byte because some routines want to
-* append a trailing zero byte to the zlib output.
-*/
-   gzipcs->outbuf = pg_malloc(ZLIB_OUT_SIZE + 1);
-   gzipcs->outsize = ZLIB_OUT_SIZE;
-
-   if (deflateInit(zp, cs->compression_spec.level) != Z_OK)
-   pg_fatal("could not initialize compression library: %s", zp->msg);
-
-   zp->next_out = gzipcs->outbuf;
-   zp->avail_out = gzipcs->outsize;
-   }
 
gzipcs->zp->next_in = (void *) unconstify(void *, data);
gzipcs->zp->avail_in = dLen;
>From 1c707279596f3cffde9c97b450dcbef0b6ddbd94 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 26 Feb 2023 17:12:03 -0600
Subject: 

Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2023-02-28 Thread Jacob Champion
On 2/16/23 10:38, Jacob Champion wrote:
> Thanks for the nudge, v7 rebases over the configure conflict from 9244c11afe.

I think/hope this is well-baked enough for a potential commit this CF,
so I've adjusted the target version. Let me know if there are any
concerns about the approach.

Thanks!
--Jacob




Re: Ability to reference other extensions by schema in extension scripts

2023-02-28 Thread Sandro Santilli
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

I've applied the patch attached to message 
https://www.postgresql.org/message-id/000401d94bc8%2448dff700%24da9fe500%24%40pcorp.us
 (md5sum a7d45a32c054919d94cd4a26d7d07c20) onto current tip of the master 
branch being 128dd9f9eca0b633b51ffcd5b0f798fbc48ec4c0

The review written in 
https://www.postgresql.org/message-id/20230228224608.ak7br5shev4wic5a%40c19 all 
still applies.

The `make installcheck-world`  test fails for me but the failures seem 
unrelated to the patch (many occurrences of "+ERROR:  function 
pg_input_error_info(unknown, unknown) does not exist" in the regression.diff).

Documentation exists for the new feature

The new status of this patch is: Ready for Committer


Re: Beautify pg_walinspect docs a bit

2023-02-28 Thread Michael Paquier
On Tue, Feb 28, 2023 at 11:57:40AM -0800, Nathan Bossart wrote:
> It looks like 58597ed accidentally added an "end_lsn" to the docs for
> pg_get_wal_stats_till_end_of_wal().

Indeed.  Fixed, thanks!
--
Michael


signature.asc
Description: PGP signature


Re: [PoC] Let libpq reject unexpected authentication requests

2023-02-28 Thread Jacob Champion
On 2/16/23 10:57, Jacob Champion wrote:
> v14 rebases over the test and solution conflicts from 9244c11afe2.

Since we're to the final CF for PG16, here's a rough summary.

This patchset provides two related features: 1) the ability for a client
to explicitly allow or deny particular methods of in-band authentication
(that is, things like password exchange), and 2) the ability to withhold
a client certificate from a server that asks for it.

Feature 1 was originally proposed to mitigate abuse where a successful
MITM attack can then be used to fish for client credentials [1]. It also
lets users disable undesirable authentication types (like plaintext) by
default, which seems to be a common interest. Both features came up
again in the context of proxies such as postgres_fdw, where it's
sometimes important that users authenticate using only their credentials
and not piggyback on the authority of the proxy host [2]. And another
use case for feature 2 just came up independently [3], to fix
connections where the default client certificate isn't valid for a
particular server.

Since this is all client-side, it's compatible with existing servers.
Also since it's client-side, it can't prevent connections from being
established by an eager server; it can only drop the connection once it
sees that its requirement was not met, similar to how we handle
target_session_attrs. That means it can't prevent a login trigger from
being processed on behalf of a confused proxy. (I think that would
require server-side support.)

0001 and 0002 are the core features. 0003 is a more future-looking
refactoring of the internals, to make it easier to handle more SASL
mechanisms, but it's not required and contains some unexercised code.

Thanks,
--Jacob

[1]
https://www.postgresql.org/message-id/fcc3ebeb7f05775b63f3207ed52a54ea5d17fb42.camel%40vmware.com
[2]
https://www.postgresql.org/message-id/20230123015255.h3jro3yyitlsqykp%40awork3.anarazel.de
[3]
https://www.postgresql.org/message-id/CAAWbhmh_QqCnRVV8ct3gJULReQjWxLTaTBqs%2BfV7c7FpH0zbew%40mail.gmail.com




Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

2023-02-28 Thread Tom Lane
Daniel Gustafsson  writes:
> The attached refactoring introduce SysCacheGetAttrNotNull as a wrapper around
> SysCacheGetAttr where a NULL value triggers an elog().

+1, seems like a good idea.  (I didn't review the patch details.)

> This will reduce the granularity of error messages, and as the patch sits now it
> does so a lot since the message is still left to work on - I wanted to see if this
> was at all seen as a net positive before spending time on that part.  I chose
> an elog since I as a user would prefer to hit an elog instead of silently
> keeping going with an assert; this is of course debatable.

I'd venture that the Assert cases are mostly from laziness, and
that once we centralize this it's plenty worthwhile to generate
a decent elog message.  You ought to be able to look up the
table and column name from the info that is at hand.

Also ... at least in assert-enabled builds, maybe we could check that
the column being fetched this way is actually marked attnotnull?
That would help to catch misuse.
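
For readers who haven't looked at the patch, a minimal sketch of such a
wrapper (illustrative only, not the submitted code) would be along the lines
of:

    Datum
    SysCacheGetAttrNotNull(int cacheId, HeapTuple tup, AttrNumber attributeNumber)
    {
        bool        isnull;
        Datum       value;

        value = SysCacheGetAttr(cacheId, tup, attributeNumber, &isnull);
        if (isnull)
            elog(ERROR, "unexpected null value in cached tuple (cache %d, attribute %d)",
                 cacheId, attributeNumber);
        return value;
    }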

regards, tom lane




Re: Marking options deprecated in help output

2023-02-28 Thread Daniel Gustafsson
> On 24 Feb 2023, at 21:31, Heikki Linnakangas  wrote:
> On 05/12/2022 11:42, Daniel Gustafsson wrote:

> How about putting the deprecated option on a separate line like this:
> 
>>  -h, --host=HOSTNAMEdatabase server host or socket directory
>>  -H (same as -h, deprecated)

If nothing else, it helps to keep it shorter and avoids breaking the line
between command and description, so I agree with you.  Done in the attached v2.

--
Daniel Gustafsson



deprecated_options_v2.diff
Description: Binary data


Re: Doc update for pg_stat_statements normalization

2023-02-28 Thread Imseih (AWS), Sami
> I am OK with an addition to the documentation to warn that one may
> have to increase the maximum number of entries that can be stored if
> seeing a non-normalized entry that should have been normalized.

I agree. We introduce the concept of a plannable statement in a 
previous section and we can then make this distinction in the new
paragraph. 

I also added a link to pg_stat_statements_info since that is introduced
later on in the doc.


Regards,

-- 
Sami Imseih
Amazon Web Services




0001-doc-update-regarding-pg_stat_statements-normalizatio.patch
Description: 0001-doc-update-regarding-pg_stat_statements-normalizatio.patch


Re: Stale references to guc.c in comments/tests

2023-02-28 Thread Tom Lane
Daniel Gustafsson  writes:
> Specifying the GUCs in question is a good idea, done in the attached.  I'm not
> sure the phrasing is spot-on though, but I can't think of a better one.  If 
> you
> can think of a better one I'm all ears.

I'd just change "the definition of" to "the definitions of".
LGTM otherwise.

regards, tom lane




RE: Ability to reference other extensions by schema in extension scripts

2023-02-28 Thread Regina Obe

> On Sun, Feb 26, 2023 at 01:39:24AM -0500, Regina Obe wrote:
> 
> > > 1) Just don't allow any extensions referenced by other
> > >extensions to be relocatable.
> >
> > Attached is my revision 3 patch, which follows the proposed #1.
> > Don't allow schema relocation of an extension if another extension
> > requires it.
> 
> I've built a version of PostgreSQL with this patch applied and I confirm
it
> works as expected.
> 
> The "ext1" is relocatable and creates a function ext1log():
> 
>   =# create extension ext1 schema n1;
>   CREATE EXTENSION
> 
> The "ext2" is relocatable and creates a function ext2log() relying on the
> ext1log() function from "ext1" extension, referencing it via
> @extschema:ext1@:
> 
>   =# create extension ext2 schema n2;
>   CREATE EXTENSION
>   =# select n2.ext2log('hello'); -- things work here
>   ext1: ext2: hello
> 
> By creating "ext2", "ext1" becomes effectively non-relocatable:
> 
>   =# alter extension ext1 set schema n2;
>   ERROR:  cannot SET SCHEMA of extension ext1 because other extensions
>   require it
>   DETAIL:  extension ext2 requires extension ext1
> 
> Drop "ext2" makes "ext1" relocatable again:
> 
>   =# drop extension ext2;
>   DROP EXTENSION
>   =# alter extension ext1 set schema n2;
>   ALTER EXTENSION
> 
> Upon re-creating "ext2" the referenced ext1 schema will be the correct
one:
> 
>   =# create extension ext2 schema n1;
>   CREATE EXTENSION
>   =# select n1.ext2log('hello');
>   ext1: ext2: hello
> 
> The code itself builds w/out warnings with:
> 
>   mkdir build
>   cd build
>   ../configure
>   make 2> ERR # ERR is empty
> 
> The testsuite reports all successes:
> 
>   make check
>   [...]
>   ===
>All 213 tests passed.
>   ===
> 
> Since I didn't see the tests for extension in there, I've also explicitly
run that
> portion:
> 
>   make -C src/test/modules/test_extensions/ check
>   [...]
>   test test_extensions  ... ok   32 ms
>   test test_extdepend   ... ok   12 ms
>   [...]
>   =
>All 2 tests passed.
>   =
> 
> 
> As mentioned already, the downside of this patch is that it would not be
> possible to change the schema of an otherwise relocatable extension once
> other extensions depend on it, but I can't think of any good reason to allow
> that, as it would mean dependent code would need to always dynamically
> determine the install location of the objects in that extension, which sounds
> dangerous, security wise.
> 
> --strk;
> 
>   Libre GIS consultant/developer
>   https://strk.kbt.io/services.html

Oops I had forgotten to submit the updated patch strk was testing against in
my fork.
He had asked me to clean up the warnings off list and the description.

Attached is the revised.
Thanks strk for the patient help and guidance.

Thanks,
Regina




0004-Allow-use-of-extschema-reqextname-to-reference.patch
Description: Binary data


Re: Stale references to guc.c in comments/tests

2023-02-28 Thread Daniel Gustafsson
> On 27 Feb 2023, at 17:59, Tom Lane  wrote:

> The grammar is a bit off ("the GUC definition" would read better),
> but really I think the wording was vague already and we should tighten
> it up.  Can we specify exactly which GUC variable(s) we're talking about?

Specifying the GUCs in question is a good idea, done in the attached.  I'm not
sure the phrasing is spot-on though, but I can't think of a better one.  If you
can think of a better one I'm all ears.

--
Daniel Gustafsson



v3-0001-Fix-outdated-references-to-guc.c.patch
Description: Binary data


Re: RFC: logical publication via inheritance root?

2023-02-28 Thread Jacob Champion
Hi,

I'm going to register this in CF for feedback.

Summary for potential reviewers: we don't use declarative partitions in
the Timescale partitioning scheme, but it'd be really nice to be able to
replicate between our tables and standard tables, or between two
Timescale-partitioned tables with different layouts. This patch lets
extensions (or savvy users) upgrade an existing inheritance relationship
between two tables into a "logical partition" relationship, so that they
can be handled with the publish_via_partition_root machinery.

I hope this might also help pg_partman users migrate between old- and
new-style partition schemes, but that's speculation.
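
To illustrate the intended workflow (a sketch only; the argument order of
pg_set_logical_root() is an assumption based on the WIP patch below):

  -- an existing plain-inheritance setup ...
  CREATE TABLE metrics (ts timestamptz, value int);
  CREATE TABLE metrics_2023 () INHERITS (metrics);

  -- ... is upgraded so the child is treated as a logical partition of the root
  SELECT pg_set_logical_root('metrics_2023'::regclass, 'metrics'::regclass);

  -- after which publishing via the root behaves like declarative partitioning
  CREATE PUBLICATION metrics_pub FOR TABLE metrics
      WITH (publish_via_partition_root = true);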

On 1/20/23 09:53, Jacob Champion wrote:
>> 2) While this strategy works well for ongoing replication, it's not
>> enough to get the initial synchronization correct. The subscriber
>> still does a COPY of the root table directly, missing out on all the
>> logical descendant data. The publisher will have to tell the
>> subscriber about the relationship somehow, and older subscriber
>> versions won't understand how to use that (similar to how old
>> subscribers can't correctly handle row filters).
> 
> I partially solved this by having the subscriber pull the logical
> hierarchy from the publisher to figure out which tables to COPY. This
> works when publish_via_partition_root=true, but it doesn't correctly
> return to the previous behavior when the setting is false. I need to
> check the publication setting from the subscriber, too, but that opens
> up the question of what to do if two different publications conflict.

Second draft attached, which fixes that bug. I kept thinking to myself
that this would be much easier if the publisher told the subscriber what
data to copy rather than having the subscriber hardcode the initial sync
process... and then I realized that I could, sort of, move in that
direction.

This version adds a SQL function to determine the list of source tables
to COPY into a subscriber's target table. Now the publisher can make use
of whatever catalogs it needs to make that list and the subscriber
doesn't need to couple to them. (This could also provide a way for
publishers to provide more generic "table indirection" in the future,
but I'm wary of selling genericism as a feature here.)

I haven't solved the problem where two publications of the same table
have different settings for publish_via_partition_root. I was curious to
see how the existing partition code prevented problems, but I'm not
really sure that it does... Here are some situations where the existing
implementation duplicates data on the initial sync:

1) A single subscription to two publications, one with
publish_via_partition_root on and the other off, which publish the same
partitioned table

2) A single subscription to two publications with
publish_via_partition_root on, one of which publishes a root partition
and the other of which publishes a descendant/leaf

3) A single subscription to two publications with
publish_via_partition_root on, one of which publishes FOR ALL TABLES and
the other of which publishes a descendant/leaf

Is it expected that DBAs should avoid these cases, or are they worth
pursuing with a bug fix?

Thanks,
--JacobFrom 6d31d13f9d7c400cf717426c4f336f7ad19a4391 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 26 Sep 2022 13:23:51 -0700
Subject: [PATCH] WIP: introduce pg_set_logical_root for use with pubviaroot

Allows regular inherited tables to be published via their root table,
just like partitions. This works by hijacking pg_inherit's inhseqno
column, and replacing a (single) existing entry for the child with the
value zero, indicating that it should be treated as a logical partition
by the publication machinery.

Initial sync works by asking the publisher for a list of logical
descendants of the published table, then COPYing them one-by-one into
the root. The publisher reuses the existing pubviaroot logic, adding the
new logical roots to code that previously looked only for partition
roots.

Known bugs/TODOs:
- The pg_inherits machinery doesn't prohibit changes to inheritance
  after an entry has been marked as a logical root.
- I haven't given any thought to interactions with row filters, or to
  column lists, or to multiple publications with conflicting pubviaroot
  settings.
- pg_set_logical_root() doesn't check for table ownership yet. Anyone
  can muck with pg_inherits through it.
- I'm not sure that I'm taking all the necessary locks yet, and those I
  do take may be taken in the wrong order.
---
 src/backend/catalog/pg_inherits.c   | 202 -
 src/backend/catalog/pg_publication.c| 120 +-
 src/backend/commands/publicationcmds.c  |  10 +
 src/backend/partitioning/partdesc.c |   3 +-
 src/backend/replication/logical/tablesync.c | 235 +--
 src/backend/replication/pgoutput/pgoutput.c |  54 ++---
 src/include/catalog/pg_inherits.h   |   7 +-
 

Re: Ability to reference other extensions by schema in extension scripts

2023-02-28 Thread Sandro Santilli
On Sun, Feb 26, 2023 at 01:39:24AM -0500, Regina Obe wrote:

> > 1) Just don't allow any extensions referenced by other
> >extensions to be relocatable.
> 
> Attached is my revision 3 patch, which follows the proposed #1.
> Don't allow schema relocation of an extension if another extension
> requires it.

I've built a version of PostgreSQL with this patch applied and I
confirm it works as expected.

The "ext1" is relocatable and creates a function ext1log():

  =# create extension ext1 schema n1;
  CREATE EXTENSION

The "ext2" is relocatable and creates a function ext2log() relying
on the ext1log() function from "ext1" extension, referencing
it via @extschema:ext1@:

  =# create extension ext2 schema n2;
  CREATE EXTENSION
  =# select n2.ext2log('hello'); -- things work here
  ext1: ext2: hello

By creating "ext2", "ext1" becomes effectively non-relocatable:

  =# alter extension ext1 set schema n2;
  ERROR:  cannot SET SCHEMA of extension ext1 because other extensions
  require it
  DETAIL:  extension ext2 requires extension ext1

Drop "ext2" makes "ext1" relocatable again:

  =# drop extension ext2;
  DROP EXTENSION
  =# alter extension ext1 set schema n2;
  ALTER EXTENSION

Upon re-creating "ext2" the referenced ext1 schema will be
the correct one:

  =# create extension ext2 schema n1;
  CREATE EXTENSION
  =# select n1.ext2log('hello');
  ext1: ext2: hello
  
The code itself builds w/out warnings with:

  mkdir build
  cd build
  ../configure
  make 2> ERR # ERR is empty

The testsuite reports all successes:

  make check
  [...]
  ===
   All 213 tests passed.
  ===

Since I didn't see the tests for extension in there, I've also
explicitly run that portion:

  make -C src/test/modules/test_extensions/ check
  [...]
  test test_extensions  ... ok   32 ms
  test test_extdepend   ... ok   12 ms
  [...]
  =
   All 2 tests passed.
  =


As mentioned already, the downside of this patch is that it would
not be possible to change the schema of an otherwise relocatable
extension once other extensions depend on it, but I can't think of
any good reason to allow that, as it would mean dependent code
would need to always dynamically determine the install location
of the objects in that extension, which sounds dangerous, security
wise.

--strk; 

  Libre GIS consultant/developer
  https://strk.kbt.io/services.html




Re: Ability to reference other extensions by schema in extension scripts

2023-02-28 Thread Sandro Santilli
On Sat, Feb 25, 2023 at 03:40:24PM -0500, Regina Obe wrote:
> > On Mon, Feb 06, 2023 at 05:19:39AM -0500, Regina Obe wrote:
> > 
> > I was thinking: how about using the "refobjsubid" to encode the "level" of
> > dependency on an extension ? Right now "refobjsubid" is always 0 when the
> > referenced object is an extension.
> > Could we consider subid=1 to mean the dependency is not only on the
> > extension but ALSO on it's schema location ?
> 
> I like that idea.  It's only been ever used for tables I think, but I don't
> see why it wouldn't apply in this case as the concept is kinda the same.
> Only concern if other parts rely on this being 0.

This has to be verified, yes. But it feels to me like "must be 0" was
mostly to _allow_ for future extensions like the proposed one.
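
For context, the extension-to-extension rows in question can be inspected
with something like the query below (illustrative only; refobjsubid = 1 is
exactly the proposed meaning, today the column is always 0 for these rows):

  SELECT e.extname  AS dependent_extension,
         re.extname AS referenced_extension,
         d.refobjsubid
    FROM pg_depend d
    JOIN pg_extension e  ON d.classid = 'pg_extension'::regclass
                        AND d.objid = e.oid
    JOIN pg_extension re ON d.refclassid = 'pg_extension'::regclass
                        AND d.refobjid = re.oid
   WHERE d.deptype = 'n';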

> The other question, should this just update the existing DEPENDENCY_NORMAL
> extension or add a new DEPENDENCY_NORMAL between the extensions with
> subid=1?

I'd use the existing record.

--strk;

  Libre GIS consultant/developer
  https://strk.kbt.io/services.html




Making empty Bitmapsets always be NULL

2023-02-28 Thread Tom Lane
When I designed the Bitmapset module, I set things up so that an empty
Bitmapset could be represented either by a NULL pointer, or by an
allocated object all of whose bits are zero.  I've recently come to
the conclusion that that was a bad idea and we should instead have
a convention like the longstanding invariant for Lists: an empty
list is represented by NIL and nothing else.

To do this, we need to fix bms_intersect, bms_difference, and a couple
of other functions to check for having produced an empty result; but
then we can replace bms_is_empty() by a simple NULL test.  I originally
guessed that that would be a bad tradeoff, but now I think it likely
is a win performance-wise, because we call bms_is_empty() many more
times than those other functions put together.

However, any performance gain would likely be marginal; the real
reason why I'm pushing this is that we have various places that have
hand-implemented a rule about "this Bitmapset variable must be exactly
NULL if empty", so that they can use checks-for-null in place of
bms_is_empty() calls in particularly hot code paths.  That is a really
fragile, mistake-prone way to do things, and I'm surprised that we've
seldom been bitten by it.  It's not well documented at all which
variables have this property, so you can't readily tell which code
might be violating those conventions.

So basically I'd like to establish that convention everywhere and
get rid of these ad-hoc reduce-to-NULL checks.  I put together the
attached draft patch to do so.  I've not done any hard performance
testing on it --- I did do one benchmark that showed maybe 0.8%
speedup, but I'd regard that as below the noise.

I found just a few places that have issues with this idea.  One thing
that is problematic is bms_first_member(): assuming you allow it to
loop to completion, it ends with the passed Bitmapset being empty,
which is now an invariant violation.  I made it pfree the argument
at that point, and fixed a couple of callers that would be broken
thereby; but I wonder if it wouldn't be better to get rid of that
function entirely and convert all its callers to use bms_next_member.
There are only about half a dozen.
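
For reference, the conversion is the usual bms_next_member() idiom, and under
the proposed invariant an emptiness check becomes a plain NULL test (sketch):

    int         x = -1;

    /* iterate without destroying the set, unlike bms_first_member() */
    while ((x = bms_next_member(someset, x)) >= 0)
    {
        /* ... process member x ... */
    }

    /* with the new invariant, "is it empty?" is simply: */
    if (someset == NULL)
    {
        /* empty */
    }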

I also discovered that nodeAppend.c is relying on bms_del_members
not reducing a non-empty set to NULL, because it uses the nullness
of appendstate->as_valid_subplans as a state boolean.  That was
probably acceptable when it was written, but whoever added
classify_matching_subplans made a hash of the state invariants here,
because that can set as_valid_subplans to empty.  I added a separate
boolean as an easy way out, but maybe that code could do with a more
thorough revisit.

I'll add this to the about-to-start CF.

regards, tom lane

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7eb79cee58..a5d74cc462 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3046,6 +3046,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
 	modified_attrs = HeapDetermineColumnsInfo(relation, interesting_attrs,
 			  id_attrs, ,
 			  newtup, _has_external);
+	interesting_attrs = NULL;	/* don't try to free it again */
 
 	/*
 	 * If we're not updating any "key" column, we can grab a weaker lock type.
@@ -3860,8 +3861,9 @@ heap_attr_equals(TupleDesc tupdesc, int attrnum, Datum value1, Datum value2,
  * listed as interesting) of the old tuple is a member of external_cols and is
  * stored externally.
  *
- * The input interesting_cols bitmapset is destructively modified; that is OK
- * since this is invoked at most once in heap_update.
+ * The input interesting_cols bitmapset is destructively modified, and freed
+ * before we return; that is OK since this is invoked at most once in
+ * heap_update.
  */
 static Bitmapset *
 HeapDetermineColumnsInfo(Relation relation,
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index c33a3c0bec..cfa95a07e4 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -877,15 +877,7 @@ UpdateChangedParamSet(PlanState *node, Bitmapset *newchg)
 	 * include anything else into its chgParam set.
 	 */
 	parmset = bms_intersect(node->plan->allParam, newchg);
-
-	/*
-	 * Keep node->chgParam == NULL if there's not actually any members; this
-	 * allows the simplest possible tests in executor node files.
-	 */
-	if (!bms_is_empty(parmset))
-		node->chgParam = bms_join(node->chgParam, parmset);
-	else
-		bms_free(parmset);
+	node->chgParam = bms_join(node->chgParam, parmset);
 }
 
 /*
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 20d23696a5..51a30ddf65 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1642,7 +1642,7 @@ find_hash_columns(AggState *aggstate)
 			perhash->hashGrpColIdxHash[i] = i + 1;
 			perhash->numhashGrpCols++;
 			/* delete already mapped columns */
-			bms_del_member(colnos, 

Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2023-02-28 Thread Alexander Korotkov
On Wed, Mar 1, 2023 at 12:03 AM Gregory Stark  wrote:
> It looks like this patch received some feedback from Andres and hasn't
> had any further work posted. I'm going to move it to "Waiting on
> Author".

I'll post the updated version in the next couple of days.

> It doesn't sound like this is likely to get committed this release
> cycle unless responding to Andres's points simpler than I expect.

I wouldn't think ahead that much.

--
Regards,
Alexander Korotkov




Re: Non-superuser subscription owners

2023-02-28 Thread Jeff Davis
On Tue, 2023-02-28 at 08:37 -0500, Robert Haas wrote:
> The existing SECURITY_RESTRICTED_OPERATION flag basically prevents
> you
> from tinkering with the session state.

Currently, every time we set that flag we also run all the code as the
table owner.

You're suggesting using the SECURITY_RESTRICTED_OPERATION flag, along
with the new security flags, but not switch to the table owner, right?

>  If we also had a similar flags
> like DATABASE_READS_PROHIBITED and DATABASE_WRITES_PROHIBITED (or
> just
> a combined DATABASE_ACCESS_PROHIBITED flag) I think that would be
> pretty close to what we need. The idea would be that, when a user
> executes a function or procedure 

Or default expressions, I presume. If we at least agree on this point,
then I think we should try to find a way to treat these other hunks of
code in a secure way (which I think is what Andres was suggesting).

> owned by a user that they don't trust
> completely, we'd set
> SECURITY_RESTRICTED_OPERATION|DATABASE_READS_PROHIBITED|DATABASE_WRITES_PROHIBITED.

It seems like you're saying to basically just keep the user ID the
same, and maybe keep USAGE privileges, but not be able to do anything
else? Might be useful. Kind of like running it as a nobody user but
without the problems you mentioned. Some details to think about, I'm
sure.

> And we could provide a user with a way to express the degree of trust
> they have in some other user or perhaps even some specific function,
> e.g.
> 
> SET trusted_roles='alice:read';
> 
> ...could mean that I trust alice to read from the database with my
> permissions, should I happen to run code provided by her in SECURITY
> INVOKER mode.

I'm not very excited about inventing a new privilege language inside a
GUC, but perhaps a simpler form could be a reasonable mitigation (or at
least a starting place).

> I'm sure there's some details to sort out here, e.g. around security
> related to the trusted_roles GUC itself. But I don't really see a
> fundamental problem. We can invent arbitrary flags that prohibit
> classes of operations that are of concern, set them by default in
> cases where concern is justified, and then give users who want the
> current behavior some kind of escape hatch that causes those flags to
> not get set after all. Not only does such a solution not seem
> impossible, I can possibly even imagine back-patching it, depending
> on
> exactly what the shape of the final solution is, how important we
> think it is to get a fix out there, and how brave I'm feeling that
> day.

Unless trusted_roles defaults to '*', I think it will still
break some things.


One of my key tests for user-facing proposals is whether the
documentation will be reasonable or not. Most of these proposals to
make SECURITY INVOKER less bad fail that test.

Each of these ideas and sub-ideas affect the semantics, and should be
documented. But how do we document that some code runs as you, some as
the person who wrote it, sometimes we obey SECURITY INVOKER and
sometimes we ignore it and use DEFINER semantics, some code is outside
a function and always executes as the invoker, some code has some
security flags, and some code has more security flags, code can change
between the time you look at it and the time it runs, and it's all
filtered through GUCs with their own privilege sub-language?

OK, let's assume that we have all of that documented, then how do we
guide users on what reasonable best practices are for the GUC settings,
etc.? Or do we just say "this is mechanically how all these parts work,
good luck assembling it into a secure system!". [ Note: I feel like
this is the state we are in now. Even if technically we don't have live
security bugs that I'm aware of, we are setting users up for security
problems. ]

On the other hand, if we focus on executing code as the user who wrote
it in most places, then the documentation will be something like: "you
defined the table, you wrote the code, it runs as you, here are some
best practices for writing secure code". And we have some different
documentation for writing a cool SECURITY INVOKER function and how to
get other users to trust you enough to run it. That sounds a LOT more
understandable for users.

Regards,
Jeff Davis





Re: Non-superuser subscription owners

2023-02-28 Thread Jeff Davis
On Tue, 2023-02-28 at 11:28 -0800, Andres Freund wrote:
> I can only repeat myself in stating that SECURITY DEFINER solves none
> of the
> relevant issues. I included several examples of why it doesn't in the
> recent
> thread about "blocking SECURITY INVOKER". E.g. that default arguments
> of
> SECDEF functions are evaluated with the current user's privileges,
> not the
> function owner's privs:
> 
> https://postgr.es/m/20230113032943.iyxdu7bnxe4cmbld%40awork3.anarazel.de

I was speaking a bit loosely, using "SECURITY DEFINER" to mean the
semantics of executing code as the one who wrote it. I didn't
specifically mean the function marker, because as you pointed out in
the other thread, that's not enough.

>From your email it looks like there is still a path forward:

"The proposal to not trust any expressions controlled by untrusted
users at least allows to prevent execution of code, even if it doesn't
provide a way to execute the code in a safe manner.  Given that we
don't have the former, it seems foolish to shoot for the latter."

And later:

"I think the combination of
a) a setting that restricts evaluation of any non-trusted expressions,
   independent of the origin
b) an easy way to execute arbitrary statements within
   SECURITY_RESTRICTED_OPERATION"

My takeaway from that thread was that we need a mechanism to deal with
non-function code (e.g. default expressions) first; but once we have
that, it opens up the design space to better solutions or at least
mitigations. Is that right?

Regards,
Jeff Davis





Re: WIN32 pg_import_system_collations

2023-02-28 Thread Andrew Dunstan


On 2023-02-28 Tu 11:40, Juan José Santamaría Flecha wrote:


On Tue, Feb 28, 2023 at 12:55 PM Andrew Dunstan  
wrote:


On 2023-02-27 Mo 17:20, Juan José Santamaría Flecha wrote:


Hmm, yeah. I'm not sure I understand the point of this test anyway:


SELECT to_char(date '2010-03-01', 'DD TMMON ' COLLATE "de_DE");


Uhm, they probably don't make much sense except for "tr_TR", so I'm 
fine with removing them. PFA a patch doing so.





I think you missed my point, which was that the COLLATE clause above 
seemed particularly pointless. But I agree that all these are not much 
use, so I'll remove them as you suggest.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Refactoring SysCacheGetAttr to know when attr cannot be NULL

2023-02-28 Thread Daniel Gustafsson
Today we have two fairly common patterns around extracting an attr from a
cached tuple:

  a = SysCacheGetAttr(OID, tuple, Anum_pg_foo_bar, );
  Assert(!isnull);

  a = SysCacheGetAttr(OID, tuple, Anum_pg_foo_bar, );
  if (isnull)
elog(ERROR, "..");

The error messages in the elog() cases also vary quite a lot.  I've been unable
to find much in terms of guidelines for when to use an elog or an Assert, with
the likelihood of a NULL value seemingly being the guiding principle (but not
in all cases IIUC).

The attached refactoring introduces SysCacheGetAttrNotNull as a wrapper around
SysCacheGetAttr where a NULL value triggers an elog().  This removes a lot of
boilerplate error handling, which IMO leads to increased readability, as the
error handling *in these cases* doesn't add much (there are other cases where
checking isnull does a lot of valuable work, of course).  Personally I much
prefer the error-out-automatically style of API: just as palloc saves a ton of
checking the returned allocation for NULL, this aims at providing a similar
abstraction.
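
In rough terms, the wrapper is no more than this (a sketch of the idea, not
the code from the attached patch; the exact signature and message may differ):

Datum
SysCacheGetAttrNotNull(int cacheId, HeapTuple tup, AttrNumber attributeNumber)
{
	bool		isnull;
	Datum		attr;

	attr = SysCacheGetAttr(cacheId, tup, attributeNumber, &isnull);
	if (isnull)
		elog(ERROR, "unexpected null value in cached tuple");
	return attr;
}

so each caller shrinks from the patterns above down to a single line.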

This will reduce the granularity of error messages, and as the patch sits now
it does so a lot, since the message is still left to be worked on - I wanted to
see if this was at all seen as a net positive before spending time on that
part.  I chose an elog since I as a user would prefer to hit an elog rather
than silently keep going, as would happen with an assert; this is of course
debatable.

Thoughts?

--
Daniel Gustafsson



0001-Add-SysCacheGetAttrNotNull-for-guarnteed-not-null-at.patch
Description: Binary data


Re: Beautify pg_walinspect docs a bit

2023-02-28 Thread Nathan Bossart
It looks like 58597ed accidentally added an "end_lsn" to the docs for
pg_get_wal_stats_till_end_of_wal().

diff --git a/doc/src/sgml/pgwalinspect.sgml b/doc/src/sgml/pgwalinspect.sgml
index 22677e54f2..3d7cdb95cc 100644
--- a/doc/src/sgml/pgwalinspect.sgml
+++ b/doc/src/sgml/pgwalinspect.sgml
@@ -174,7 +174,7 @@ combined_size_percentage | 2.8634072910530795
 
 
  
-  pg_get_wal_stats_till_end_of_wal(start_lsn pg_lsn, end_lsn pg_lsn, 
per_record boolean DEFAULT false)
+  pg_get_wal_stats_till_end_of_wal(start_lsn pg_lsn, per_record boolean 
DEFAULT false)
   returns setof record
  
 

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Memory leak from ExecutorState context?

2023-02-28 Thread Tomas Vondra
On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:
> Hello all,
> 
> A customer is facing out of memory query which looks similar to this 
> situation:
> 
>   
> https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b
> 
> This PostgreSQL version is 11.18. Some settings:
> 
> * shared_buffers: 8GB
> * work_mem: 64MB
> * effective_cache_size: 24GB
> * random/seq_page_cost are by default
> * physical memory: 32GB
> 
> The query is really large and actually update kind of a materialized view.
> 
> The customer records the plans of this query on a regular basis. The explain
> analyze of this query before running out of memory was:
> 
>   https://explain.depesz.com/s/sGOH
> 
> The customer is aware he should rewrite this query to optimize it, but it's a
> long time process he can not start immediately. To make it run in the 
> meantime,
> he actually removed the top CTE to a dedicated table. According to their
> experience, it's not the first time they had to split a query this way to make
> it work.
> 
> I've been able to run this query on a standby myself.  I've "call
> MemoryContextStats(TopMemoryContext)" every 10s on a run, see the data parsed
> (best view with "less -S") and the graph associated with it in attachment. It
> shows:
> 
> * HashBatchContext goes up to 1441MB after 240s then stay flat until the end
>   (400s as the last record)

That's interesting. We're using HashBatchContext for very few things, so
what could be consuming so much memory? But e.g. the number of buckets
should be limited by work_mem, so how could it get to 1.4GB?

Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
and print how many batches/buckets there are?

> * ALL other context are stable before 240s, but ExecutorState
> * ExecutorState keeps rising up to 13GB with no interruption until the memory
>   exhaustion
> 
> I did another run with interactive gdb session (see the messy log session in
> attachment, for what it worth). Looking at some backtraces during the memory
> inflation close to the end of the query, all of them were having these frames 
> in
> common:
> 
>   [...]
>   #6  0x00621ffc in ExecHashJoinImpl (parallel=false, 
> pstate=0x31a3378)
>   at nodeHashjoin.c:398 [...]
> 
> ...which is not really helpful but at least, it seems to come from a hash join
> node or some other hash related code. See the gdb session log for more 
> details.
> After the out of mem, pmap of this process shows:
> 
>   430:   postgres: postgres  [local] EXPLAIN
>   Address   Kbytes RSS   Dirty Mode  Mapping
>   [...]
>   02c5e000 13719620 8062376 8062376 rw---   [ anon ]
>   [...]
> 
> Is it usual a backend is requesting such large memory size (13 GB) and
> actually use less of 60% of it (7.7GB of RSS)?
> 

No idea. Interpreting this info is pretty tricky, in my experience. It
might mean the memory is no longer used but sbrk couldn't return it to
the OS yet, or something like that.

> Sadly, the database is 1.5TB large and I can not test on a newer major 
> version.
> I did not try to check how large would be the required data set to reproduce
> this, but it moves 10s of million of rows from multiple tables anyway...
> 
> Any idea? How could I help to have a better idea if a leak is actually
> occurring and where exactly?
> 

Investigating memory leaks is tough, especially for generic memory
contexts like ExecutorState :-( Even more so when you can't reproduce it
on a machine with custom builds.

What I'd try is this:

1) attach breakpoints to all returns in AllocSetAlloc(), printing the
pointer and size for ExecutorState context, so something like

  break aset.c:783 if strcmp("ExecutorState",context->header.name) == 0
  commands
  print MemoryChunkGetPointer(chunk) size
  cont
  end

2) do the same for AllocSetFree()

3) Match the palloc/pfree calls (using the pointer address), to
determine which ones are not freed and do some stats on the size.
Usually there's only a couple distinct sizes that account for most of
the leaked memory.

4) Break AllocSetAlloc on those leaked sizes, to determine where the
calls come from.

This usually gives enough info about the leak or at least allows
focusing the investigation to a particular area of code.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Auth extensions, with an LDAP/SCRAM example [was: Proposal: Support custom authentication methods using hooks]

2023-02-28 Thread Jacob Champion
On Mon, Feb 27, 2023 at 12:43 PM Stephen Frost  wrote:
> * Jacob Champion (jchamp...@timescale.com) wrote:
> > This patchset should ideally have required zero client side changes, but
> > because our SCRAM implementation is slightly nonstandard too -- it
> > doesn't embed the username into the SCRAM data -- libpq can't talk to
> > the OpenLDAP/Cyrus SASL implementation. I included a quick-and-bad patch
> > to fix it in libpq; that needs its own conversation.
>
> Indeed it does... as there were specific reasons that what we
> implemented for PG's SCRAM doesn't embed the username into the SCRAM
> data and my recollection is that we don't because SCRAM (or maybe SASL?
> either way though...) requires it to be utf-8 encoded and we support a
> lot of other encoding that aren't that, so it wouldn't work for a lot
> of our users.

Yes. Interoperable authentication is going to have to solve those
sorts of problems somehow, though. And there are a bunch of levers to
pull: clients aren't required to run their sent usernames through
SASLprep; our existing servers don't mind having it sent, so we could
potentially backport a client change; and then after that it's a game
of balancing compatibility and compliance. A deeper conversation for
sure.

> > I think this is exactly the sort of thing that's too niche to be in core
> > but might be extremely useful for the few people it applies to, so I'm
> > proposing it as an argument in favor of general authn/z extensibility.
>
> If it was able to actually work for our users (and maybe it is if we
> make it optional somehow) and we have users who want it (which certainly
> seems plausible) then I'd say that it's something we would benefit from
> having in core.  While it wouldn't solve all the issues with 'ldap'
> auth, if it works with AD's LDAP servers and doesn't require the
> password to be passed from the client to the server (in any of this, be
> the client psql, pgadmin, or the PG server when it goes to talk to the
> LDAP server..) then that would be a fantastic thing and we could
> replace the existing 'ldap' auth with that and be *much* better off for
> it, and so would our users.

I think the argument you're making here boils down to "if it's not
niche, it should be in core", and I agree with that general sentiment.
But not all of the prerequisites you stated are met. I see no evidence
that Active Directory supports SCRAM [1], for instance, unless the MS
documentation is just omitting it.

Even if it were applicable to every single LDAP deployment, I'd still
like users to be able to choose the best authentication available in
their setups without first having to convince the community. They can
maintain the things that make sense for them, like they do with
extensions. And the authors can iterate on better authentication out
of cycle, like extension authors do.

--Jacob

[1] 
https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-adts/e7d814a5-4cb5-4b0d-b408-09d79988b550




Re: Non-superuser subscription owners

2023-02-28 Thread Andres Freund
Hi,

On 2023-02-22 09:18:34 -0800, Jeff Davis wrote:
> I can't resist mentioning that these are all SECURITY INVOKER problems.
> SECURITY INVOKER is insecure unless the invoker absolutely trusts the
> definer, and that only really makes sense if the definer is a superuser
> (or something very close). That's why we keep adding exceptions with
> SECURITY_RESTRICTED_OPERATION, which is really just a way to silently
> ignore the SECURITY INVOKER label and use SECURITY DEFINER instead.
> 
> At some point we need to ask: "when is SECURITY INVOKER both safe and
> useful?" and contain it to those cases, rather than silently ignoring
> it in an expanding list of cases.

I can only repeat myself in stating that SECURITY DEFINER solves none of the
relevant issues. I included several examples of why it doesn't in the recent
thread about "blocking SECURITY INVOKER". E.g. that default arguments of
SECDEF functions are evaluated with the current user's privileges, not the
function owner's privs:

https://postgr.es/m/20230113032943.iyxdu7bnxe4cmbld%40awork3.anarazel.de

Greetings,

Andres Freund




Re: Experiments with Postgres and SSL

2023-02-28 Thread Jacob Champion
On Tue, Feb 28, 2023 at 10:33 AM Greg Stark  wrote:
> On Wed, 22 Feb 2023 at 07:27, Heikki Linnakangas  wrote:
> > Good idea. Do we want to just require the protocol to be "postgres", or
> > perhaps "postgres/3.0"? Need to register that with IANA, I guess.
>
> I had never heard of this before, it does seem useful. But if I
> understand it right it's entirely independent of this patch.

It can be. If you want to use it in the strongest possible way,
though, you'd have to require its use by clients. Introducing that
requirement later would break existing ones, so I think it makes sense
to do it at the same time as the initial implementation, if there's
interest.

> We can
> add it to all our Client/Server exchanges whether they're the initial
> direct SSL connection or the STARTTLS negotiation?

I'm not sure it would buy you anything during the STARTTLS-style
opening. You already know what protocol you're speaking in that case.
(So with the ALPACA example, the damage is already done.)

Thanks,
--Jacob




Re: Memory leak from ExecutorState context?

2023-02-28 Thread Tomas Vondra



On 2/28/23 19:25, Justin Pryzby wrote:
> On Tue, Feb 28, 2023 at 07:06:43PM +0100, Jehan-Guillaume de Rorthais wrote:
>> Hello all,
>>
>> A customer is facing out of memory query which looks similar to this 
>> situation:
>>
>>   
>> https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b
>>
>> This PostgreSQL version is 11.18. Some settings:
> 
> hash joins could exceed work_mem until v13:
> 
> |Allow hash aggregation to use disk storage for large aggregation result
> |sets (Jeff Davis)
> |

That's hash aggregate, not hash join.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Commitfest 2023-03 starting tomorrow!

2023-02-28 Thread Greg Stark
So I'm not sure if I'll be CFM this month but I'm assuming I will be
at this point

Regardless, Commitfest 2023-03 starts tomorrow!

So this is a good time to check your submitted patches and ensure
they're actually building and don't need a rebase. Take a look at
http://cfbot.cputube.org/ for example. I'll do an initial pass marking
anything red here as Waiting on Author

The next pass would be to grab any patches not marked Ready for
Committer, and if they look like they'll need more than one round of
feedback and a couple of weeks to polish, they'll probably get bounced to
the next commitfest too. It sucks not getting feedback on your patches
for so long but there are really just sooo many patches and so few
eyeballs... It would be great if people could do initial reviews of
these patches before we bounce them because it really is discouraging
for developers to send patches and not get feedback. But realistically
it's going to happen to a lot of patches.


-- 
greg




Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-02-28 Thread Andres Freund
Hi,

On 2023-02-25 16:00:05 +0530, Amit Kapila wrote:
> On Tue, Feb 21, 2023 at 7:55 PM Önder Kalacı  wrote:
> >> I think this overhead seems to be mostly due to the need to perform
> >> tuples_equal multiple times for duplicate values.

I think more work needs to be done to determine the source of the
overhead. It's not clear to me why there'd be an increase in tuples_equal()
calls in the tests upthread.


> Wouldn't a table-level option like 'apply_index_scan' be better than a
> subscription-level option with a default value as false? Anyway, the
> bigger point is that we don't see a better way to proceed here than to
> introduce some option to control this behavior.

I don't think this should default to false. The quadratic apply performance
that the sequential scans cause is a much bigger hazard for users than some
apply performance regression.


> I see this as a way to provide this feature for users but I would
> prefer to proceed with this if we can get some more buy-in from senior
> community members (at least one more committer) and some user(s) if
> possible. So, I once again request others to chime in and share their
> opinion.

I'd prefer not having an option until we figure out the cause of the
performance regression (reducing it to be small enough to not care about).
After that, an option defaulting to using indexes. I don't think an option
defaulting to false makes sense.

I don't care whether it's subscription or relation level option.

Greetings,

Andres Freund




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-02-28 Thread Kuntal Ghosh
On Tue, Feb 28, 2023 at 10:38 AM Bharath Rupireddy
 wrote:
>
+/*
+ * Guts of XLogReadFromBuffers().
+ *
+ * Read 'count' bytes into 'buf', starting at location 'ptr', from WAL
+ * fetched WAL buffers on timeline 'tli' and return the read bytes.
+ */
s/fetched WAL buffers/fetched from WAL buffers


+ else if (nread < nbytes)
+ {
+ /*
+ * We read some of the requested bytes. Continue to read remaining
+ * bytes.
+ */
+ ptr += nread;
+ nbytes -= nread;
+ dst += nread;
+ *read_bytes += nread;
+ }

The 'if' condition should always be true. You can replace the same
with an assertion instead.
s/Continue to read remaining/Continue to read the remaining
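
Putting both suggestions together, the hunk might end up roughly like this
(a sketch based only on the snippet quoted above, not on the full patch):

		else
		{
			/*
			 * We read some of the requested bytes. Continue to read the
			 * remaining bytes.
			 */
			Assert(nread < nbytes);
			ptr += nread;
			nbytes -= nread;
			dst += nread;
			*read_bytes += nread;
		}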

The good thing about this patch is that it reduces read IO calls
without impacting the write performance (at least not that
noticeable). It also takes us one step forward towards the
enhancements mentioned in the thread. If performance is a concern, we
can introduce a GUC to enable/disable this feature.

-- 
Thanks & Regards,
Kuntal Ghosh




Re: Experiments with Postgres and SSL

2023-02-28 Thread Greg Stark
On Wed, 22 Feb 2023 at 07:27, Heikki Linnakangas  wrote:
>
> On 20/01/2023 03:28, Jacob Champion wrote:
> > On Wed, Jan 18, 2023 at 7:16 PM Greg Stark  wrote:
> >> * "Service Mesh" type tools that hide multiple services behind a
> >> single host/port ("Service Mesh" is just a new buzzword for "proxy").
> >
> > If you want to multiplex protocols on a port, now is an excellent time
> > to require clients to use ALPN on implicit-TLS connections. (There are
> > no clients that can currently connect via implicit TLS, so you'll
> > never have another chance to force the issue without breaking
> > backwards compatibility.) That should hopefully make it harder to
> > ALPACA yourself or others [2].
>
> Good idea. Do we want to just require the protocol to be "postgres", or
> perhaps "postgres/3.0"? Need to register that with IANA, I guess.

I had never heard of this before, it does seem useful. But if I
understand it right it's entirely independent of this patch. We can
add it to all our Client/Server exchanges whether they're the initial
direct SSL connection or the STARTTLS negotiation?


> We implemented a protocol version negotiation mechanism in the libpq
> protocol itself, how would this interact with it? If it's just
> "postgres", then I guess we'd still negotiate the protocol version and
> list of extensions after the TLS handshake.
>
> >> Incidentally I find the logic in ProcessStartupPacket incredibly
> >> confusing. It took me a while before I realized it's using tail
> >> recursion to implement the startup logic. I think it would be way more
> >> straightforward and extensible if it used a much more common iterative
> >> style. I think it would make it possible to keep more state than just
> >> ssl_done and gss_done without changing the function signature every
> >> time for example.
> >
> > +1. The complexity of the startup logic, both client- and server-side,
> > is a big reason why I want implicit TLS in the first place. That way,
> > bugs in that code can't be exploited before the TLS handshake
> > completes.
>
> +1. We need to support explicit TLS for a long time, so we can't
> simplify by just removing it. But let's refactor the code somehow, to
> make it more clear.
>
> Looking at the patch, I think it accepts an SSLRequest packet even if
> implicit TLS has already been established. That's surely wrong, and
> shows how confusing the code is. (Or I'm reading it incorrectly, which
> also shows how confusing it is :-) )

I'll double check it but I think I tested that that wasn't the case. I
think it accepts the SSL request packet and sends back an N which the
client libpq just interprets as the server not supporting SSL and does
an unencrypted connection (which is tunneled over stunnel unbeknownst
to libpq).

I agree I would want to flatten this logic to an iterative approach
but having wrapped my head around it now I'm not necessarily rushing
to do it now. The main advantage of flattening it would be to make it
easy to support other protocol types which I think could be really
interesting. It would be much clearer to document the state machine if
all the state is in one place and the code just loops through
processing startup packets and going to a new state until the
connection is established. That's true now but you have to understand
how the state is passed in the function parameters and notice that all
the recursion is tail recursive (I think). And extending that state
would require extending the function signature which would get awkward
quickly.

> Regarding Vladimir's comments on how clients can migrate to this, I
> don't have any great suggestions. To summarize, there are several options:
>
> - Add an "fast_tls" option that the user can enable if they know the
> server supports it
>
> - First connect in old-fashioned way, and remember the server version.
> Later, if you reconnect to the same server, use implicit TLS if the
> server version was high enough. This would be most useful for connection
> pools.

Vladimir pointed out that this doesn't necessarily work. The server
may be new enough to support it but it could be behind a proxy like
pgbouncer or something. The same would be true if the server reported
a "connection option" instead of depending on version.

> - Connect both ways at the same time, and continue with the fastest,
> i.e. "happy eyeballs"

That seems way too complex for us to bother with imho.

> - Try implicit TLS first, and fall back to explicit TLS if it fails.

> For libpq, we don't necessarily need to do anything right now. We can
> add the implicit TLS support in a later version. Not having libpq
> support makes it hard to test the server codepath, though. Maybe just
> test it with 'stunnel' or 'openssl s_client'.

I think we should have an option to explicitly enable it in psql, if
only for testing. And then wait five years and switch the default on
it then. In the meantime users can just set it based on their setup.
That's not the way to the quickest adoption but 

Re: Memory leak from ExecutorState context?

2023-02-28 Thread Justin Pryzby
On Tue, Feb 28, 2023 at 07:06:43PM +0100, Jehan-Guillaume de Rorthais wrote:
> Hello all,
> 
> A customer is facing out of memory query which looks similar to this 
> situation:
> 
>   
> https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b
> 
> This PostgreSQL version is 11.18. Some settings:

hash joins could exceed work_mem until v13:

|Allow hash aggregation to use disk storage for large aggregation result
|sets (Jeff Davis)
|
|Previously, hash aggregation was avoided if it was expected to use more
|than work_mem memory. Now, a hash aggregation plan can be chosen despite
|that. The hash table will be spilled to disk if it exceeds work_mem
|times hash_mem_multiplier.
|
|This behavior is normally preferable to the old behavior, in which once
|hash aggregation had been chosen, the hash table would be kept in memory
|no matter how large it got — which could be very large if the planner
|had misestimated. If necessary, behavior similar to that can be obtained
|by increasing hash_mem_multiplier.

>   https://explain.depesz.com/s/sGOH

This shows multiple plan nodes underestimating the row counts by factors
of ~50,000, which could lead to the issue fixed in v13.

I think you should try to improve the estimates, which might improve
other queries in addition to this one, in addition to maybe avoiding the
issue with joins.

> The customer is aware he should rewrite this query to optimize it, but it's a
> long time process he can not start immediately. To make it run in the 
> meantime,
> he actually removed the top CTE to a dedicated table.

Is the table analyzed?

> Is it usual a backend is requesting such large memory size (13 GB) and
> actually use less of 60% of it (7.7GB of RSS)?

It's possible it's "using less" simply because it's not available.  Is
the process swapping?

-- 
Justin




Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2023-02-28 Thread Heikki Linnakangas

On 28/02/2023 16:17, Maxim Orlov wrote:
Either I do not understand something, or the files from pg_commit_ts 
directory are not copied.


Huh, yeah you're right, pg_upgrade doesn't copy pg_commit_ts.

- Heikki





Re: WIN32 pg_import_system_collations

2023-02-28 Thread Juan José Santamaría Flecha
On Tue, Feb 28, 2023 at 12:55 PM Andrew Dunstan  wrote:

> On 2023-02-27 Mo 17:20, Juan José Santamaría Flecha wrote:
>
>
> Hmm, yeah. I'm not sure I understand the point of this test anyway:
>
>
> SELECT to_char(date '2010-03-01', 'DD TMMON ' COLLATE "de_DE");
>

Uhm, they probably don't make much sense except for "tr_TR", so I'm fine
with removing them. PFA a patch doing so.

Regards,

Juan José Santamaría Flecha


0002-remove-useless-test-from-collate.windows.win1252.patch
Description: Binary data


0001-change-locale-name-for-test-collate.windows.win1252.patch
Description: Binary data


Re: pg_upgrade and logical replication

2023-02-28 Thread Nikolay Samokhvalov
On Fri, Feb 17, 2023 at 7:35 AM Julien Rouhaud  wrote:
>
>  Any table later added in the
> publication is either already fully replicated until that LSN on the upgraded
> node, so only the delta is needed, or has been created after that LSN.  In the
> latter case, the entirety of the table will be replicated with the logical
> replication as a delta right?


What if we consider a slightly adjusted procedure?

0. Temporarily, forbid running any DDL on the source cluster.
1. On the source, create publication, replication slot and remember
the LSN for it
2. Restore the target cluster to that LSN using recovery_target_lsn (PITR)
3. Run pg_upgrade on the target cluster
4. Only now, create subscription to target
5. Wait until logical replication catches up
6. Perform a switchover to the new cluster taking care of lags in sequences, etc
7. Resume DDL when needed

Do you see any data loss happening in this approach?




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> Few comments:

Thank you for reviewing! PSA new version.
Note that the starting point of delay for 2PC was not changed,
I think it has been under discussion.

> 1.
> + /*
> + * If we've requested to shut down, exit the process.
> + *
> + * Note that WalSndDone() cannot be used here because the delaying
> + * changes will be sent in the function.
> + */
> + if (got_STOPPING)
> + {
> + QueryCompletion qc;
> +
> + /* Inform the standby that XLOG streaming is done */
> + SetQueryCompletion(, CMDTAG_COPY, 0);
> + EndCommand(, DestRemote, false);
> + pq_flush();
> 
> Do we really need to do anything except for breaking the loop and let
> the exit handling happen in the main loop when 'got_STOPPING' is set?
> AFAICS, this is what we are doing in some other palces (See
> WalSndWaitForWal). Won't that work? It seems that will help us sending
> all the pending WAL.

If we exit the loop after got_STOPPING is set, as you said, the walsender will
send the delayed changes and then exit. The behavior is the same as when
WalSndDone() is called. But I think that is not suitable for the motivation of
the feature. If users notice a mistaken operation like TRUNCATE, they must shut
down the publisher once and then recover from a backup or from the old
subscriber. If the walsender sends all pending changes, the mistaken operations
will also be propagated to the subscriber and the data cannot be protected. So
currently I want to keep this behavior.
FYI - in the case of physical replication, received WAL is not applied when the
standby is shut down.

> 2.
> + /* Try to flush pending output to the client */
> + if (pq_flush_if_writable() != 0)
> + WalSndShutdown();
> 
> Is there a reason to try flushing here?

IIUC, if pq_flush_if_writable() returns non-zero (EOF), it means that there is
a problem and the walsender fails to send messages to the subscriber.

On Linux, the call chain from pq_flush_if_writable() finally reaches the
send() system call, and according to the man page[1], an error indicates some
unexpected state or that the connection is closed.

Based on the above, I think the returned value should be checked.

> Apart from the above, I have made a few changes in the comments in the
> attached diff patch. If you agree with those then please include them
> in the next version.

Thanks! I checked, and I think all of them should be included.

Moreover, I used a grammar checker and slightly reworded the commit message.

[1]: https://man7.org/linux/man-pages/man3/send.3p.html

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v9-0001-Time-delayed-logical-replication-on-publisher-sid.patch
Description:  v9-0001-Time-delayed-logical-replication-on-publisher-sid.patch


Re: Maybe we can remove the type cast in typecache.c

2023-02-28 Thread Tom Lane
qinghao huang  writes:
> When I was reading postgres code, I found there is a weird type cast. I'm 
> pondering if it is necessary.

> ```
> /* Allocate a new typmod number.  This will be wasted if we error out. */
> typmod = (int)
> 
> pg_atomic_fetch_add_u32(>shared_typmod_registry->next_typmod,
> 1);

> ```
> typmod has u32 type, but we cast it to int first.

typmods really ought to be int32, not uint32, so IMO none of this is
exactly right.  But it's also true that it makes no real difference.
Postgres pretty much assumes that "int" is 32 bits.

regards, tom lane




Avoid multiple SetLatch() calls in procsignal_sigusr1_handler()

2023-02-28 Thread Bharath Rupireddy
Hi,

Most of the multiplexed SIGUSR1 handlers set the latch explicitly even
though procsignal_sigusr1_handler() can do that for them at the end.
These multiplexed handlers are currently only invoked via the SIGUSR1
handler, not installed as independent handlers, so there is no problem
if SetLatch() is removed from them. A few others already do it right by
saying /* latch will be set by procsignal_sigusr1_handler */. Although
calling SetLatch() in quick succession does no harm (it just returns if
the latch was already set), it seems unnecessary.
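
To illustrate, each affected handler becomes something like this (the handler
and flag names here are made up; the real ones are in the attached patch):

static volatile sig_atomic_t foo_interrupt_pending = false;

static void
HandleFooInterrupt(void)
{
	foo_interrupt_pending = true;

	/* latch will be set by procsignal_sigusr1_handler */
}

i.e. the handler just records the request and relies on the single SetLatch()
at the end of procsignal_sigusr1_handler().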

I'm attaching a patch that avoids multiple SetLatch() calls.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v1-0001-Avoid-multiple-SetLatch-calls-in-procsignal_sigus.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-02-28 Thread Masahiko Sawada
On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada  wrote:
>
> On Tue, Feb 28, 2023 at 3:42 PM John Naylor
>  wrote:
> >
> >
> > On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Feb 22, 2023 at 6:55 PM John Naylor
> > >  wrote:
> > > >
> > > > That doesn't seem useful to me. If we've done enough testing to 
> > > > reassure us the new way always gives the same answer, the old way is 
> > > > not needed at commit time. If there is any doubt it will always give 
> > > > the same answer, then the whole patchset won't be committed.
> >
> > > My idea is to make the bug investigation easier but on
> > > reflection, it seems not the best idea given this purpose.
> >
> > My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old 
> > tid array. As I've said, that doesn't seem like a good thing to carry 
> > forward forevermore, in any form. Plus, comparing new code with new code is 
> > not the same thing as comparing existing code with new code. That was my 
> > idea upthread.
> >
> > Maybe the effort my idea requires is too much vs. the likelihood of finding 
> > a problem. In any case, it's clear that if I want that level of paranoia, 
> > I'm going to have to do it myself.
> >
> > > What do you think
> > > about the attached patch? Please note that it also includes the
> > > changes for minimum memory requirement.
> >
> > Most of the asserts look logical, or at least harmless.
> >
> > - int max_off; /* the maximum offset number */
> > + OffsetNumber max_off; /* the maximum offset number */
> >
> > I agree with using the specific type for offsets here, but I'm not sure why 
> > this change belongs in this patch. If we decided against the new asserts, 
> > this would be easy to lose.
>
> Right. I'll separate this change as a separate patch.
>
> >
> > This change, however, defies common sense:
> >
> > +/*
> > + * The minimum amount of memory required by TidStore is 2MB, the current 
> > minimum
> > + * valid value for the maintenance_work_mem GUC. This is required to 
> > allocate the
> > + * DSA initial segment, 1MB, and some meta data. This number is applied 
> > also to
> > + * the local TidStore cases for simplicity.
> > + */
> > +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
> >
> > + /* Sanity check for the max_bytes */
> > + if (max_bytes < TIDSTORE_MIN_MEMORY)
> > + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
> > + TIDSTORE_MIN_MEMORY, max_bytes);
> >
> > Aside from the fact that this elog's something that would never get past 
> > development, the #define just adds a hard-coded copy of something that is 
> > already hard-coded somewhere else, whose size depends on an implementation 
> > detail in a third place.
> >
> > This also assumes that all users of tid store are limited by 
> > maintenance_work_mem. Andres thought of an example of some day unifying 
> > with tidbitmap.c, and maybe other applications will be limited by work_mem.
> >
> > But now that I'm looking at the guc tables, I am reminded that work_mem's 
> > minimum is 64kB, so this highlights a design problem: There is obviously no 
> > requirement that the minimum work_mem has to be >= a single DSA segment, 
> > even though operations like parallel hash and parallel bitmap heap scan are 
> > limited by work_mem.
>
> Right.
>
> >  It would be nice to find out what happens with these parallel features 
> > when work_mem is tiny (maybe parallelism is not even considered?).
>
> IIUC both don't care about the allocated DSA segment size. Parallel
> hash accounts actual tuple (+ header) size as used memory but doesn't
> consider how much DSA segment is allocated behind. Both parallel hash
> and parallel bitmap scan can work even with work_mem = 64kB, but when
> checking the total DSA segment size allocated during these operations,
> it was 1MB.
>
> I realized that there is a similar memory limit design issue also on
> the non-shared tidstore cases. We deduct 70kB from max_bytes but it
> won't work fine with work_mem = 64kB.  Probably we need to reconsider
> it. FYI 70kB comes from the maximum slab block size for node256.

Currently, we calculate the slab block size so that it is enough to allocate 32
chunks from it. For node256, the leaf node is 2,088 bytes and the
slab block size is 66,816 bytes. One idea to fix this issue is to
decrease it. For example, with 16 chunks the slab block size is 33,408
bytes and with 8 chunks it's 16,704 bytes. I ran a brief benchmark
test with a 70kB block size and a 16kB block size:

* 70kB slab blocks:
select * from bench_search_random_nodes(20 * 1000 * 1000, '0xFF');
height = 2, n3 = 0, n15 = 0, n32 = 0, n125 = 0, n256 = 65793
 mem_allocated | load_ms | search_ms
---+-+---
 143085184 |1216 |   750
(1 row)

* 16kB slab blocks:
select * from bench_search_random_nodes(20 * 1000 * 1000, '0xFF');
height = 2, n3 = 0, n15 = 0, n32 = 0, n125 = 0, n256 = 65793
 mem_allocated | load_ms | search_ms

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-02-28 Thread Damir Belyalov
Hello

I tested the patch on all cases: CIM_SINGLE, CIM_MULTI, CIM_MULTI_CONDITION. As
expected, it works.
I also added a description to copy.sgml and reviewed the patch.

I added an 'ignored_errors' integer counter that is output after the
COPY finishes.
All errors are written to the system logfile with full detailed context.
Maybe it's better to log only the error message.
file:///home/abc13/Documents/todo_copy/postgres/v2-0001-Add-COPY-option-IGNORE_DATATYPE_ERRORS.patch



Regards, Damir Belyalov
Postgres Professional
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index c25b52d0cb..706b929947 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -34,6 +34,7 @@ COPY { table_name [ ( format_name
 FREEZE [ boolean ]
+IGNORE_DATATYPE_ERRORS [ boolean ]
 DELIMITER 'delimiter_character'
 NULL 'null_string'
 HEADER [ boolean | MATCH ]
@@ -233,6 +234,17 @@ COPY { table_name [ ( 

 
+   
+IGNORE_DATATYPE_ERRORS
+
+ 
+  Drops rows that contain malformed data while copying. These are rows
+  with columns where the data type's input-function raises an error.
+  Outputs warnings about rows with incorrect data to system logfile.
+ 
+
+   
+

 DELIMITER
 
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index e34f583ea7..0334894014 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -410,6 +410,7 @@ ProcessCopyOptions(ParseState *pstate,
 	bool		format_specified = false;
 	bool		freeze_specified = false;
 	bool		header_specified = false;
+	bool		ignore_datatype_errors_specified= false;
 	ListCell   *option;
 
 	/* Support external use for option sanity checking */
@@ -449,6 +450,13 @@ ProcessCopyOptions(ParseState *pstate,
 			freeze_specified = true;
 			opts_out->freeze = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "ignore_datatype_errors") == 0)
+		{
+			if (ignore_datatype_errors_specified)
+errorConflictingDefElem(defel, pstate);
+			ignore_datatype_errors_specified = true;
+			opts_out->ignore_datatype_errors = defGetBoolean(defel);
+		}
 		else if (strcmp(defel->defname, "delimiter") == 0)
 		{
 			if (opts_out->delim)
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index af52faca6d..ecaa750568 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -955,10 +955,14 @@ CopyFrom(CopyFromState cstate)
 	errcallback.previous = error_context_stack;
 	error_context_stack = 
 
+	if (cstate->opts.ignore_datatype_errors)
+		cstate->ignored_errors = 0;
+
 	for (;;)
 	{
 		TupleTableSlot *myslot;
 		bool		skip_tuple;
+		ErrorSaveContext escontext = {T_ErrorSaveContext};
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -991,9 +995,26 @@ CopyFrom(CopyFromState cstate)
 
 		ExecClearTuple(myslot);
 
+		if (cstate->opts.ignore_datatype_errors)
+		{
+			escontext.details_wanted = true;
+			cstate->escontext = escontext;
+		}
+
 		/* Directly store the values/nulls array in the slot */
 		if (!NextCopyFrom(cstate, econtext, myslot->tts_values, myslot->tts_isnull))
+		{
+			if (cstate->opts.ignore_datatype_errors && cstate->ignored_errors > 0)
+ereport(WARNING, errmsg("Errors: %ld", cstate->ignored_errors));
 			break;
+		}
+
+		/* Soft error occured, skip this tuple */
+		if (cstate->escontext.error_occurred)
+		{
+			ExecClearTuple(myslot);
+			continue;
+		}
 
 		ExecStoreVirtualTuple(myslot);
 
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 91b564c2bc..9c36b0dc8b 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -70,6 +70,7 @@
 #include "libpq/pqformat.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "nodes/miscnodes.h"
 #include "pgstat.h"
 #include "port/pg_bswap.h"
 #include "utils/builtins.h"
@@ -938,10 +939,23 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
 
 			cstate->cur_attname = NameStr(att->attname);
 			cstate->cur_attval = string;
-			values[m] = InputFunctionCall(_functions[m],
-		  string,
-		  typioparams[m],
-		  att->atttypmod);
+
+			/* If IGNORE_DATATYPE_ERRORS is enabled skip rows with datatype errors */
+			if (!InputFunctionCallSafe(_functions[m],
+	   string,
+	   typioparams[m],
+	   att->atttypmod,
+	   (Node *) >escontext,
+	   [m]))
+			{
+cstate->ignored_errors++;
+
+ereport(LOG,
+		errmsg("%s", cstate->escontext.error_data->message));
+
+return true;
+			}
+
 			if (string != NULL)
 nulls[m] = false;
 			cstate->cur_attname = NULL;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index a0138382a1..d79d293c0d 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -701,7 +701,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	HANDLER HAVING HEADER_P HOLD HOUR_P
 
-	IDENTITY_P IF_P ILIKE 

Re: Allow logical replication to copy tables in binary format

2023-02-28 Thread Melih Mutlu
Hi,

Attached v11.

vignesh C , 28 Şub 2023 Sal, 13:59 tarihinde şunu
yazdı:

> Thanks for the patch, Few comments:
> 1) Are primary key required for the tables, if not required we could
> remove the primary key which will speed up the test by not creating
> the indexes and inserting in the indexes. Even if required just create
> for one of the tables:
>

I think that having a primary key in tables for logical replication tests
is good for checking if log. rep. duplicates any row. Other tests also have
primary keys in almost all tables.

Bharath Rupireddy , 28 Şub 2023
Sal, 15:27 tarihinde şunu yazdı:

> 1.
> +  
> +   If true, initial data synchronization will be performed in binary
> format
> +  
> It's not just the initial table sync right? The table sync can happen
> at any other point of time when ALTER SUBSCRIPTION ... REFRESH
> PUBLICATION WITH (copy = true) is run.
> How about - "If true, the subscriber requests publication for
> pre-existing data in binary format"?
>

I changed it as you suggested.
I sometimes feel like the phrase "initial sync" is used for initial sync of
a table, not a subscription. Or table syncs triggered by ALTER SUBSCRIPTION
are ignored in some places where "initial sync" is used.

> 2.
> Perhaps, this should cover the recommended cases for enabling this new
> option - something like below (may not need to have exact wording, but
> the recommended cases?):
> "It is recommended to enable this option only when 1) the column data
> types have appropriate binary send/receive functions, 2) not
> replicating between different major versions or different platforms,
> 3) both publisher and subscriber tables have the exact same column
> types (not when replicating from smallint to int or numeric to int8
> and so on), 4) both publisher and subscriber supports COPY with binary
> option, otherwise the table copy can fail."
>

I added a line stating that the binary format is less portable across machine
architectures and versions, as stated in COPY [1].
I don't think we should add a line saying "recommended", but rather state the
restrictions clearly instead. It's similar in the COPY docs as well.
I think the explanation now covers all your points, right?

Jelte Fennema , 28 Şub 2023 Sal, 16:25 tarihinde şunu
yazdı:

> > 3. I think the newly added tests must verify if the binary COPY is
> > picked up when enabled. Perhaps, looking at the publisher's server log
> > for 'COPY ... WITH BINARY format'? Maybe it's an overkill, otherwise,
> > we have no way of testing that the option took effect.
>
> Another way to test that BINARY is enabled could be to trigger one
> of the failure cases.


Yes, there is already a failure case for binary copy which is resolved by
switching binary_copy to false.
But I have also added checks of the publisher logs now.


[1] https://www.postgresql.org/docs/devel/sql-copy.html

Thanks,
-- 
Melih Mutlu
Microsoft


v11-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2023-02-28 Thread Maxim Orlov
On Mon, 27 Feb 2023 at 12:10, Heikki Linnakangas  wrote:

> (CC list trimmed, gmail wouldn't let me send otherwise)
>
> That sounds right for pg_serial, pg_notify, and pg_subtrans. But not for
> pg_commit_ts and the pg_multixacts.
>
> This needs tests for pg_upgrading those SLRUs, after 0, 1 and N
> wraparounds.
>
Yep, that's my fault. I've forgotten about pg_multixacts. But for
pg_commit_ts it's a bit of a complicated story.
The thing is, if we do an upgrade, the old files from pg_commit_ts are not
copied into the new server.

For example, I've checked one more time on the current master branch:
1). initdb
2). add "track_commit_timestamp = on" into postgresql.conf
3). pgbench
4). $ ls  pg_commit_ts/
  0005  000A  000F  0014  0019  001E  0023...
...009A  009F  00A4  00A9  00AE  00B3  00B8  00BD  00C2
5). do pg_upgrade
6). $ ls  pg_commit_ts/
  00C2

Either I do not understand something, or the files from pg_commit_ts
directory are not copied.

-- 
Best regards,
Maxim Orlov.


Re: Adding CommandID to heap xlog records

2023-02-28 Thread Heikki Linnakangas
I took another stab at this from a different angle, and tried to use 
this to simplify logical decoding. The theory was that if we included 
the command ID in the WAL records, we wouldn't need the separate 
HEAP2_NEW_CID record anymore, and could remove much of the code in 
reorderbuffer.c that's concerned with tracking ctid->(cmin,cmax) 
mapping. Unfortunately, it didn't work out.


Here's one problem:

Insert with cmin 1
Commit
Delete the same tuple with cmax 2.
Abort

Even if we store the cmin in the INSERT record, and set it on the tuple 
on replay, the DELETE overwrites it. That's OK for the original 
transactions, because they only look at the cmin/cmax of their own 
transaction, but it's a problem for logical decoding. If we see the 
inserted tuple during logical decoding, we need the cmin of the tuple.


We could still just replace the HEAP2_NEW_CID records with the CIDs in 
the heap INSERT/UPDATE/DELETE records, and use that information to 
maintain the ctid->(cmin,cmax) mapping in reorderbuffer.c like we do 
today. But that doesn't really simplify reorderbuffer.c much. Attached 
is a patch for that, for the archives sake.


Another problem with that is that logical decoding needs slightly 
different information than what we store on the tuples on disk. My 
original motivation for this was for Neon, which needs the WAL replay to 
restore the same CID as what's stored on disk, whether it's cmin, cmax 
or combocid. But for logical decoding, we need the cmin or cmax, *not* 
the combocid. To cater for both uses, we'd need to include both the 
original cmin/cmax and the possible combocid, which again makes it more 
complicated.


So unfortunately I don't see much opportunity to simplify logical 
decoding with this. However, please take a look at the first two patches 
attached. They're tiny cleanups that make sense on their own.


- Heikki
From d666309a4cffc48e5091c0a97b8e2e7ec668f058 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 28 Feb 2023 13:22:40 +0200
Subject: [PATCH 1/4] Improve comment on why we need ctid->(cmin,cmax) mapping

---
 src/backend/replication/logical/snapbuild.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 62542827e4b..93755646564 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -41,10 +41,15 @@
  * transactions we need Snapshots that see intermediate versions of the
  * catalog in a transaction. During normal operation this is achieved by using
  * CommandIds/cmin/cmax. The problem with that however is that for space
- * efficiency reasons only one value of that is stored
- * (cf. combocid.c). Since combo CIDs are only available in memory we log
- * additional information which allows us to get the original (cmin, cmax)
- * pair during visibility checks. Check the reorderbuffer.c's comment above
+ * efficiency reasons, the cmin and cmax are not included in WAL records. We
+ * cannot read the cmin/cmax from the tuple itself, either, because it is
+ * reset on crash recovery. Even if we could, we could not decode combocids
+ * which are only tracked in the original backend's memory. To work around
+ * that, heapam writes an extra WAL record (XLOG_HEAP2_NEW_CID) every time a
+ * catalog row is modified, which includes the cmin and cmax of the
+ * tuple. During decoding, we insert the ctid->(cmin,cmax) mappings into the
+ * reorder buffer, and use them at visibility checks instead of the cmin/cmax
+ * on the tuple itself. Check the reorderbuffer.c's comment above
  * ResolveCminCmaxDuringDecoding() for details.
  *
  * To facilitate all this we need our own visibility routine, as the normal
-- 
2.30.2

From 44435eb4121425f4babd9ed8847e6ddc137ba436 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 28 Feb 2023 12:55:49 +0200
Subject: [PATCH 2/4] Remove redundant check for fast_forward

---
 src/backend/replication/logical/decode.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 8fe7bb65f1f..d8962345da4 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -394,8 +394,7 @@ heap2_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	switch (info)
 	{
 		case XLOG_HEAP2_MULTI_INSERT:
-			if (!ctx->fast_forward &&
-SnapBuildProcessChange(builder, xid, buf->origptr))
+			if (SnapBuildProcessChange(builder, xid, buf->origptr))
 DecodeMultiInsert(ctx, buf);
 			break;
 		case XLOG_HEAP2_NEW_CID:
-- 
2.30.2

From bc5103128fb3b6583a0b2b248fd205b59a347fe5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 28 Feb 2023 13:02:56 +0200
Subject: [PATCH 3/4] Remove combocid field in logical decode that was just for
 debugging.

(I don't think this is worth doing without the next patch)
---
 

Re: Non-superuser subscription owners

2023-02-28 Thread Robert Haas
On Mon, Feb 27, 2023 at 7:37 PM Jeff Davis  wrote:
> > Yeah. That's the idea I was floating, at least.
>
> Isn't that a hard problem; maybe impossible?

It doesn't seem that hard to me; maybe I'm missing something.

The existing SECURITY_RESTRICTED_OPERATION flag basically prevents you
from tinkering with the session state. If we also had a similar flags
like DATABASE_READS_PROHIBITED and DATABASE_WRITES_PROHIBITED (or just
a combined DATABASE_ACCESS_PROHIBITED flag) I think that would be
pretty close to what we need. The idea would be that, when a user
executes a function or procedure owned by a user that they don't trust
completely, we'd set
SECURITY_RESTRICTED_OPERATION|DATABASE_READS_PROHIBITED|DATABASE_WRITES_PROHIBITED.
And we could provide a user with a way to express the degree of trust
they have in some other user or perhaps even some specific function,
e.g.

SET trusted_roles='alice:read';

...could mean that I trust alice to read from the database with my
permissions, should I happen to run code provided by her in SECURITY
INVOKER mode.
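
To make that concrete, here is a minimal sketch of what such flags and a
check could look like; the flag values, the helper name
CheckDatabaseAccessAllowed(), and where it would be called from are all
hypothetical, not existing code:

/* hypothetical additions next to SECURITY_RESTRICTED_OPERATION in miscadmin.h */
#define DATABASE_READS_PROHIBITED	0x0008	/* hypothetical bit value */
#define DATABASE_WRITES_PROHIBITED	0x0010	/* hypothetical bit value */

/*
 * Hypothetical check that relation-access paths could perform while running
 * code owned by an untrusted role in SECURITY INVOKER mode.
 */
static void
CheckDatabaseAccessAllowed(bool for_write)
{
	Oid			save_userid;
	int			sec_context;

	GetUserIdAndSecContext(&save_userid, &sec_context);

	if (!for_write && (sec_context & DATABASE_READS_PROHIBITED))
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("database reads are not allowed in this context")));

	if (for_write && (sec_context & DATABASE_WRITES_PROHIBITED))
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("database writes are not allowed in this context")));
}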

I'm sure there's some details to sort out here, e.g. around security
related to the trusted_roles GUC itself. But I don't really see a
fundamental problem. We can invent arbitrary flags that prohibit
classes of operations that are of concern, set them by default in
cases where concern is justified, and then give users who want the
current behavior some kind of escape hatch that causes those flags to
not get set after all. Not only does such a solution not seem
impossible, I can possibly even imagine back-patching it, depending on
exactly what the shape of the final solution is, how important we
think it is to get a fix out there, and how brave I'm feeling that
day.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Allow logical replication to copy tables in binary format

2023-02-28 Thread Jelte Fennema
> 3. I think the newly added tests must verify if the binary COPY is
> picked up when enabled. Perhaps, looking at the publisher's server log
> for 'COPY ... WITH BINARY format'? Maybe it's an overkill, otherwise,
> we have no way of testing that the option took effect.

Another way to test that BINARY is enabled could be to trigger one
of the failure cases.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-02-28 Thread Masahiko Sawada
On Tue, Feb 28, 2023 at 3:42 PM John Naylor
 wrote:
>
>
> On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada  
> wrote:
> >
> > On Wed, Feb 22, 2023 at 6:55 PM John Naylor
> >  wrote:
> > >
> > > That doesn't seem useful to me. If we've done enough testing to reassure 
> > > us the new way always gives the same answer, the old way is not needed at 
> > > commit time. If there is any doubt it will always give the same answer, 
> > > then the whole patchset won't be committed.
>
> > My idea is to make the bug investigation easier but on
> > reflection, it seems not the best idea given this purpose.
>
> My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old 
> tid array. As I've said, that doesn't seem like a good thing to carry forward 
> forevermore, in any form. Plus, comparing new code with new code is not the 
> same thing as comparing existing code with new code. That was my idea 
> upthread.
>
> Maybe the effort my idea requires is too much vs. the likelihood of finding a 
> problem. In any case, it's clear that if I want that level of paranoia, I'm 
> going to have to do it myself.
>
> > What do you think
> > about the attached patch? Please note that it also includes the
> > changes for minimum memory requirement.
>
> Most of the asserts look logical, or at least harmless.
>
> - int max_off; /* the maximum offset number */
> + OffsetNumber max_off; /* the maximum offset number */
>
> I agree with using the specific type for offsets here, but I'm not sure why 
> this change belongs in this patch. If we decided against the new asserts, 
> this would be easy to lose.

Right. I'll separate this change as a separate patch.

>
> This change, however, defies common sense:
>
> +/*
> + * The minimum amount of memory required by TidStore is 2MB, the current 
> minimum
> + * valid value for the maintenance_work_mem GUC. This is required to 
> allocate the
> + * DSA initial segment, 1MB, and some meta data. This number is applied also 
> to
> + * the local TidStore cases for simplicity.
> + */
> +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
>
> + /* Sanity check for the max_bytes */
> + if (max_bytes < TIDSTORE_MIN_MEMORY)
> + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
> + TIDSTORE_MIN_MEMORY, max_bytes);
>
> Aside from the fact that this elog's something that would never get past 
> development, the #define just adds a hard-coded copy of something that is 
> already hard-coded somewhere else, whose size depends on an implementation 
> detail in a third place.
>
> This also assumes that all users of tid store are limited by 
> maintenance_work_mem. Andres thought of an example of some day unifying with 
> tidbitmap.c, and maybe other applications will be limited by work_mem.
>
> But now that I'm looking at the guc tables, I am reminded that work_mem's 
> minimum is 64kB, so this highlights a design problem: There is obviously no 
> requirement that the minimum work_mem has to be >= a single DSA segment, even 
> though operations like parallel hash and parallel bitmap heap scan are 
> limited by work_mem.

Right.

>  It would be nice to find out what happens with these parallel features when 
> work_mem is tiny (maybe parallelism is not even considered?).

IIUC both don't care about the allocated DSA segment size. Parallel
hash accounts for the actual tuple (+ header) size as used memory but doesn't
consider how much DSA segment is allocated behind the scenes. Both parallel hash
and parallel bitmap scan can work even with work_mem = 64kB, but when
checking the total DSA segment size allocated during these operations,
it was 1MB.

I realized that there is a similar memory limit design issue also on
the non-shared tidstore cases. We deduct 70kB from max_bytes but it
won't work fine with work_mem = 64kB.  Probably we need to reconsider
it. FYI 70kB comes from the maximum slab block size for node256.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: pg_rewind: warn when checkpoint hasn't happened after promotion

2023-02-28 Thread James Coleman
On Mon, Feb 27, 2023 at 2:33 AM Heikki Linnakangas  wrote:
>
> On 16/11/2022 07:17, kuroda.keis...@nttcom.co.jp wrote:
> >> The issue here is pg_rewind looks into control file to determine the
> >> soruce timeline, because the control file is not updated until the
> >> first checkpoint ends after promotion finishes, even though file
> >> blocks are already diverged.
> >>
> >> Even in that case history file for the new timeline is already
> >> created, so searching for the latest history file works.
> >
> > I think this change is a good one because if I want
> > pg_rewind to run automatically after a promotion,
> > I don't have to wait for the checkpoint to complete.
> >
> > The attached patch is Horiguchi-san's patch with
> > additional tests. The tests are based on James's tests,
> > "010_no_checkpoint_after_promotion.pl" tests that
> > pg_rewind is successfully executed without running
> > checkpoint after promote.
>
> I fixed this last week in commit 009746, see thread [1]. I'm sorry I
> didn't notice this thread earlier.
>
> I didn't realize that we had a notice about this in the docs. I'll go
> and remove that. Thanks!
>
> - Heikki
>

Thanks; I think the missing [1] (for reference) is:
https://www.postgresql.org/message-id/9f568c97-87fe-a716-bd39-65299b8a60f4%40iki.fi

James




Re: Add shared buffer hits to pg_stat_io

2023-02-28 Thread Drouvot, Bertrand

Hi,

On 2/25/23 9:16 PM, Melanie Plageman wrote:

Hi,

As suggested in [1], the attached patch adds shared buffer hits to
pg_stat_io.



Thanks for the patch!

 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
-bool *foundPtr, IOContext *io_context)
+bool *foundPtr, IOContext io_context)
 {
BufferTag   newTag; /* identity of requested block 
*/
LocalBufferLookupEnt *hresult;
@@ -128,14 +128,6 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, 
BlockNumber blockNum,
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, , HASH_FIND, NULL);

-   /*
-* IO Operations on local buffers are only done in IOCONTEXT_NORMAL. Set
-* io_context here (instead of after a buffer hit would have returned) 
for
-* convenience since we don't have to worry about the overhead of 
calling
-* IOContextForStrategy().
-*/
-   *io_context = IOCONTEXT_NORMAL;


It looks like that io_context is not used in LocalBufferAlloc() anymore and 
then can be removed as an argument.
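
For illustration, if no caller needs the value anymore, the prototype could
simply become (a sketch, assuming the hit accounting now happens entirely in
the caller):

BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
				 bool *foundPtr);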



I am looking for input as to the order of this column in the view. I
think it should go after op_bytes since it is not relevant for
non-block-oriented IO. 


Agree.


However, I'm not sure what the order of hits,
evictions, and reuses should be (all have to do with buffers).



I'm not sure there is a strong "correct" ordering but the proposed one looks 
natural to me.


While adding this, I noticed that I had made all of the IOOP columns
int8 in the view, and I was wondering if this is sufficient for hits (I
imagine you could end up with quite a lot of those).



I think that's ok and bigint is what is already used for 
pg_statio_user_tables.heap_blks_hit for example.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Allow logical replication to copy tables in binary format

2023-02-28 Thread Bharath Rupireddy
On Tue, Feb 28, 2023 at 1:22 AM Melih Mutlu  wrote:
>
> Hi,
>
> Thanks for all of your reviews!
>
> So, I made some changes in the v10 according to your comments.

Thanks. Some quick comments on v10:

1.
+  
+   If true, initial data synchronization will be performed in binary format
+  
It's not just the initial table sync right? The table sync can happen
at any other point of time when ALTER SUBSCRIPTION ... REFRESH
PUBLICATION WITH (copy = true) is run.
How about - "If true, the subscriber requests publication for
pre-existing data in binary format"?

2.
+  Specifies whether pre-existing data on the publisher will be copied
+  to the subscriber in binary format. The default is
false.
+  Binary format is very data type specific, it will not allow copying
+  between different column types as opposed to text format. Note that
+  if this option is enabled, all data types which will be copied during
+  the initial synchronization should have binary send and
receive functions.
+  If this option is disabled, data format for the initial
synchronization
+  will be text.
Perhaps, this should cover the recommended cases for enabling this new
option - something like below (may not need to have exact wording, but
the recommended cases?):

"It is recommended to enable this option only when 1) the column data
types have appropriate binary send/receive functions, 2) not
replicating between different major versions or different platforms,
3) both publisher and subscriber tables have the exact same column
types (not when replicating from smallint to int or numeric to int8
and so on), 4) both publisher and subscriber supports COPY with binary
option, otherwise the table copy can fail."

3. I think the newly added tests must verify if the binary COPY is
picked up when enabled. Perhaps, looking at the publisher's server log
for 'COPY ... WITH BINARY format'? Maybe it's an overkill, otherwise,
we have no way of testing that the option took effect.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: WIN32 pg_import_system_collations

2023-02-28 Thread Andrew Dunstan


On 2023-02-27 Mo 17:20, Juan José Santamaría Flecha wrote:



El lun, 27 feb 2023, 23:05, Juan José Santamaría Flecha 
 escribió:



The command that's failing is "SET lc_time TO 'de_DE';", and that
area of code is untouched by this patch. As mentioned in [1],
the problem seems to come from a Windows bug that the CI images
and my development machines have patched out.


What I wanted to post as [1]:

https://www.postgresql.org/message-id/CAC%2BAXB1agvrgpyHEfqbDr2MOpcON3d%2BWYte_SLzn1E4TamLs9g%40mail.gmail.com



Hmm, yeah. I'm not sure I understand the point of this test anyway:


SELECT to_char(date '2010-03-01', 'DD TMMON ' COLLATE "de_DE");


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: Allow logical replication to copy tables in binary format

2023-02-28 Thread vignesh C
On Tue, 28 Feb 2023 at 01:22, Melih Mutlu  wrote:
>
> Hi,
>
> Thanks for all of your reviews!
>
> So, I made some changes in the v10 according to your comments.
>
> 1- copy_format is changed to a boolean parameter copy_binary.
> It sounds easier to use a boolean to enable/disable binary copy. Default 
> value is false, so nothing changes in the current behaviour if copy_binary is 
> not specified.
> It's still persisted into the catalog. This is needed since its value will be 
> needed by tablesync workers later. And tablesync workers fetch subscription 
> configurations from the catalog.
> In the copy_data case, it is not directly stored anywhere but it affects the 
> state of the table which is stored in the catalog. So, I guess this is the 
> convenient way if we decide to go with a new parameter.
>
> 2- copy_binary is not tied to another parameter
> The patch does not disallow any combination of copy_binary with binary and 
> copy_data options. I tried to explain what binary copy can and cannot do in 
> the docs. Rest is up to the user now.
>
> 3- Removed version check for copy_binary
> HEAD already does not have any check for binary option. Making binary copy 
> work only if publisher and subscriber are the same major version can be too 
> restricted.
>
> 4- Added separate test file
> Although I believe 002_types.pl and 014_binary.pl can be improved more for 
> binary enabled subscription cases, this patch might not be the best place to 
> make those changes.
> 032_binary_copy.pl now has the tests for binary copy option. There are also 
> some regression tests in subscription.sql.
>
> Finally, some other small fixes are done according to the reviews.
>
>  Also, thanks Bharath for performance testing [1]. It can be seen that the 
> patch has some benefits.
>
> I appreciate further thoughts/reviews.

Thanks for the patch, Few comments:
1) Are primary keys required for the tables? If not required, we could
remove the primary keys, which will speed up the test by not creating
the indexes and not inserting into them. Even if required, just create
one for one of the tables:
+# Create tables on both sides of the replication
+my $ddl = qq(
+   CREATE TABLE public.test_numerical (
+   a INTEGER PRIMARY KEY,
+   b NUMERIC,
+   c FLOAT,
+   d BIGINT
+   );
+   CREATE TABLE public.test_arrays (
+   a INTEGER[] PRIMARY KEY,
+   b NUMERIC[],
+   c TEXT[]
+   );
+CREATE TABLE public.test_range_array (
+   a INTEGER PRIMARY KEY,
+   b TSTZRANGE,
+   c int8range[]
+   );
+CREATE TYPE public.test_comp_basic_t AS (a FLOAT, b TEXT, c INTEGER);
+CREATE TABLE public.test_one_comp (
+   a INTEGER PRIMARY KEY,
+   b public.test_comp_basic_t
+   ););
+

2) We can have the Insert/Select of tables in the same order so that
it is easy to verify. test_range_array/test_one_comp
insertion/selection order was different.
+# Insert some content and before creating a subscription
+$node_publisher->safe_psql(
+   'postgres', qq(
+INSERT INTO public.test_numerical (a, b, c, d) VALUES
+   (1, 1.2, 1.3, 10),
+(2, 2.2, 2.3, 20);
+   INSERT INTO public.test_arrays (a, b, c) VALUES
+   ('{1,2,3}', '{1.1, 1.2, 1.3}', '{"one", "two", "three"}'),
+('{3,1,2}', '{1.3, 1.1, 1.2}', '{"three", "one", "two"}');
+INSERT INTO test_range_array (a, b, c) VALUES
+   (1, tstzrange('Mon Aug 04 00:00:00 2014
CEST'::timestamptz, 'infinity'), '{"[1,2]", "[10,20]"}'),
+   (2, tstzrange('Sat Aug 02 00:00:00 2014
CEST'::timestamptz, 'Mon Aug 04 00:00:00 2014 CEST'::timestamptz),
'{"[2,3]", "[20,30]"}');
+   INSERT INTO test_one_comp (a, b) VALUES
+   (1, ROW(1.0, 'a', 1)),
+   (2, ROW(2.0, 'b', 2));
+   ));
+
+# Create the subscription with copy_binary = true
+my $publisher_connstring = $node_publisher->connstr . ' dbname=postgres';
+$node_subscriber->safe_psql('postgres',
+   "CREATE SUBSCRIPTION tsub CONNECTION '$publisher_connstring' "
+ . "PUBLICATION tpub WITH (slot_name = tpub_slot, copy_binary
= true)");
+
+# Ensure nodes are in sync with each other
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'tsub');
+
+my $sync_check =  qq(
+   SET timezone = '+2';
+   SELECT a, b, c, d FROM test_numerical ORDER BY a;
+   SELECT a, b, c FROM test_arrays ORDER BY a;
+   SELECT a, b FROM test_one_comp ORDER BY a;
+   SELECT a, b, c FROM test_range_array ORDER BY a;
+);

3) Should we check the results only for the test_myvarchar table? Since
there is no change in the other tables, we need not check them
again:
+# Now tablesync should succeed
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'tsub');
+
+$sync_check =  qq(
+   SET timezone = '+2';
+   SELECT a, b, c, d FROM test_numerical ORDER BY a;
+   SELECT a, b, c 

Re: Make some xlogreader messages more accurate

2023-02-28 Thread Jeevan Ladhe
+1 for the changes.

>1. Why is "wanted >=%u" any better than "wanted at least %u"? IMO, the
>wording as opposed to >= symbol in the user-facing messages works
>better.

I think I agree with Bharath on this: "wanted at least %u" sounds better
for user error than "wanted >=%u".

Regards,
Jeevan Ladhe

On Tue, 28 Feb 2023 at 11:46, Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Thu, Feb 23, 2023 at 1:06 PM Peter Eisentraut
>  wrote:
> >
> > Here is a small patch to make some invalid-record error messages in
> > xlogreader a bit more accurate (IMO).
>
> +1 for these changes.
>
> > My starting point was that when you have some invalid WAL, you often get
> > a message like "wanted 24, got 0".  This is a bit incorrect, since it
> > really wanted *at least* 24, not exactly 24.  So I have updated the
> > messages to that effect, and
>
> Yes, it's not exactly "wanted", but "wanted at least" because
> xl_tot_len is the total length of the entire record including header
> and payload.
>
> > also added that detail to one message where
> > it was available but not printed.
>
> Looks okay.
>
> > Going through the remaining report_invalid_record() calls I then
> > adjusted the use of "invalid" vs. "incorrect" in one case.  The message
> > "record with invalid length" makes it sound like the length was
> > something like -5, but really we know what the length should be and what
> > we got wasn't it, so "incorrect" sounded better and is also used in
> > other error messages in that file.
>
> I have no strong opinion about this change. We seem to be using
> "invalid length" and "incorrect length" interchangeably [1] without
> distinguishing between "invalid" if length is < 0 and "incorrect" if
> length >= 0 and not something we're expecting.
>
> Another comment on the patch:
> 1. Why is "wanted >=%u" any better than "wanted at least %u"? IMO, the
> wording as opposed to >= symbol in the user-facing messages works
> better.
> +report_invalid_record(state, "invalid record offset at %X/%X:
> wanted >=%u, got %u",
> +  "invalid record length at %X/%X:
> wanted >=%u, got %u",
> +  "invalid record length at %X/%X: wanted
> >=%u, got %u",
>
> [1]
> elog(ERROR, "incorrect length %d in streaming transaction's changes
> file \"%s\"",
> "record with invalid length at %X/%X",
> (errmsg("invalid length of checkpoint record")));
> errmsg("invalid length of startup packet")));
> errmsg("invalid length of startup packet")));
> elog(ERROR, "invalid zero-length dimension array in MCVList");
> elog(ERROR, "invalid length (%d) dimension array in MCVList",
> errmsg("invalid length in external \"%s\" value",
> errmsg("invalid length in external bit string")));
> libpq_append_conn_error(conn, "certificate contains IP address with
> invalid length %zu
>
> --
> Bharath Rupireddy
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>
>
>


Re: [HACKERS] make async slave to wait for lsn to be replayed

2023-02-28 Thread Kartyshov Ivan

Intro
==
The main purpose of the feature is to achieve
read-your-writes-consistency, while using an async replica for reads and
the primary for writes. In that case the lsn of the last modification is
stored inside the application. We cannot store this lsn inside the
database, since reads are distributed across all replicas and the primary.

https://www.postgresql.org/message-id/195e2d07ead315b1620f1a053313f490%40postgrespro.ru

Suggestions
==
Lots of proposals were made about how this feature might look.
I aggregate them into the following four types.

1) Classic (wait_classic_v1.patch)
https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru
==
advantages: multiple events, standalone WAIT
disadvantages: new words in grammar

WAIT FOR  [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
    [ WAIT FOR [ANY | ALL] event [, ...]]
where event is one of:
LSN value
TIMEOUT number_of_milliseconds
timestamp

2) After style: Kyotaro and Freund (wait_after_within_v1.patch)
https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru
==
advantages: no new words in grammar, standalone AFTER
disadvantages: a little harder to understand

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
    [ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
    [ AFTER lsn_event [ WITHIN delay_milliseconds ]]

3) Procedure style: Tom Lane and Kyotaro (wait_proc_v1.patch)
https://www.postgresql.org/message-id/27171.1586439221%40sss.pgh.pa.us
https://www.postgresql.org/message-id/20210121.173009.235021120161403875.horikyota.ntt%40gmail.com
==
advantages: no new words in grammar, similar to pg_last_wal_replay_lsn,
no snapshots needed

disadvantages: a little harder to remember names
SELECT pg_waitlsn(‘LSN’, timeout);
SELECT pg_waitlsn_infinite(‘LSN’);
SELECT pg_waitlsn_no_wait(‘LSN’);

4) Brackets style: Kondratov
https://www.postgresql.org/message-id/a8bff0350a27e0a87a6eaf0905d6737f%40postgrespro.ru
 
==
advantages: only one new word in grammar, similar to VACUUM and
REINDEX, ability to extend parameters without grammar fixes

disadvantages: 
WAIT (LSN '16/B374D848', TIMEOUT 100);
BEGIN WAIT (LSN '16/B374D848' [, etc_options]);
...
COMMIT;

Consequence
==
Below I provide the implementation of patches for the first three types.
I propose to discuss this feature again.

Regards

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..18695e013e 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..a2794763b1 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  
 
-BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] wait_event
 
 where transaction_mode is one of:
 
 ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
 READ WRITE | READ ONLY
 [ NOT ] DEFERRABLE
+
+where wait_event is:
+AFTER lsn_value [ WITHIN number_of_milliseconds ]
 
  
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..46a3bcf1a8 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  
 
-START TRANSACTION [ transaction_mode [, ...] ]
+START TRANSACTION [ transaction_mode [, ...] ] wait_event
 
 where transaction_mode is one of:
 
 ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
 READ WRITE | READ ONLY
 [ NOT ] DEFERRABLE
+
+where wait_event is:
+AFTER lsn_value [ WITHIN number_of_milliseconds ]
 
  
 
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a83ff4551e 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@



+   
 
  
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index dbe9394762..e4d2d8d0a1 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1752,6 +1753,15 @@ PerformWalRecovery(void)
 break;
 			}
 
+			/*
+			* If we replayed an LSN that someone was waiting for,
+			* 

Maybe we can remove the type cast in typecache.c

2023-02-28 Thread qinghao huang
Hi hackers,
When I was reading postgres code, I found there is a weird type cast. I'm
pondering if it is necessary.

```
/* Allocate a new typmod number.  This will be wasted if we error out. */
typmod = (int)
	pg_atomic_fetch_add_u32(&CurrentSession->shared_typmod_registry->next_typmod,
							1);

```
typmod has u32 type, but we cast it to int first.

And I am also confused about why `NextRecordTypmod` and
`TupleDescData.tdtypmod` have type int32, but `SharedTypmodTableEntry.typmod`
has type uint32.

Best regards,
Qinghao Huang



Re: Track IO times in pg_stat_io

2023-02-28 Thread Drouvot, Bertrand

Hi,

On 2/26/23 5:03 PM, Melanie Plageman wrote:

Hi,

As suggested in [1], the attached patch adds IO times to pg_stat_io;


Thanks for the patch!

I started to have a look at it and figured out that a tiny rebase was needed 
(due to
728560db7d and b9f0e54bc9), so please find the rebase (aka V2) attached.


The timings will only be non-zero when track_io_timing is on


That could lead to incorrect interpretation if one wants to divide the timing
by the number of operations, say:

- track_io_timing is set to on while there are already counted operations
- or set to off while it was on (and the number of operations keeps growing)

Might be worth warning/highlighting that in the "track_io_timing" doc?


+   if (track_io_timing)
+   {
+   INSTR_TIME_SET_CURRENT(io_time);
+   INSTR_TIME_SUBTRACT(io_time, io_start);
+   pgstat_count_io_time(io_object, io_context, 
IOOP_EXTEND, io_time);
+   }
+
+
pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);

vs

@@ -1042,6 +1059,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, 
ForkNumber forkNum,
INSTR_TIME_SUBTRACT(io_time, io_start);

pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
INSTR_TIME_ADD(pgBufferUsage.blk_read_time, 
io_time);
+   pgstat_count_io_time(io_object, io_context, 
IOOP_READ, io_time);
}

That leads to pgstat_count_io_time() being called before pgstat_count_io_op()
(for the IOOP_EXTEND case) and
after pgstat_count_io_op() (for the IOOP_READ case).

What about calling them in the same order, so that pgstat_count_io_time() is
called before pgstat_count_io_op()?

If so, the ordering would also need to be changed in:

- FlushRelationBuffers()
- register_dirty_segment()
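
For example, the IOOP_READ site could mirror the IOOP_EXTEND one, roughly
like this (a sketch against the v2 patch, keeping its variable names):

	if (track_io_timing)
	{
		INSTR_TIME_SET_CURRENT(io_time);
		INSTR_TIME_SUBTRACT(io_time, io_start);
		pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
		INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
		/* count the time first, then the operation, as in the extend path */
		pgstat_count_io_time(io_object, io_context, IOOP_READ, io_time);
	}

	pgstat_count_io_op(io_object, io_context, IOOP_READ);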



There is one minor question (in the code as a TODO) which is whether or
not it is worth cross-checking that IO counts and times are either both
zero or neither zero in the validation function
pgstat_bktype_io_stats_valid().



As pgstat_bktype_io_stats_valid() is called only in Assert(), I think it
would be a good idea
to also check that if counts are not zero then times are not zero.
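
A minimal sketch of such a cross-check (the backend_io parameter and the
counts[]/times[] field names are assumptions about the patched
PgStat_BktypeIO layout):

	/* a non-zero time for an operation that was never counted is a bug */
	for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
	{
		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
		{
			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
			{
				if (backend_io->counts[io_object][io_context][io_op] == 0 &&
					backend_io->times[io_object][io_context][io_op] != 0)
					return false;
			}
		}
	}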

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6249bb50d0..2c62b0a437 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3814,6 +3814,18 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i
   
  
 
+ 
+  
+   
+read_time double precision
+   
+   
+Time spent in read operations in milliseconds (if  is enabled, otherwise zero)
+   
+  
+ 
+
  
   

@@ -3826,6 +3838,18 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i
   
  
 
+ 
+  
+   
+write_time double precision
+   
+   
+Time spent in write operations in milliseconds (if  is enabled, otherwise zero)
+   
+  
+ 
+
  
   

@@ -3838,6 +3862,18 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i
   
  
 
+ 
+  
+   
+extend_time double precision
+   
+   
+Time spent in extend operations in milliseconds (if  is enabled, otherwise zero)
+   
+  
+ 
+
  
   

@@ -3902,6 +3938,18 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i
   
  
 
+ 
+  
+   
+fsync_time double precision
+   
+   
+Time spent in fsync operations in milliseconds (if  is enabled, otherwise zero)
+   
+  
+ 
+
  
   

diff --git a/src/backend/catalog/system_views.sql 
b/src/backend/catalog/system_views.sql
index 34ca0e739f..39391bc2fc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1123,12 +1123,16 @@ SELECT
b.io_object,
b.io_context,
b.reads,
+   b.read_time,
b.writes,
+   b.write_time,
b.extends,
+   b.extend_time,
b.op_bytes,
b.evictions,
b.reuses,
b.fsyncs,
+   b.fsync_time,
b.stats_reset
 FROM pg_stat_get_io() b;
 
diff --git a/src/backend/storage/buffer/bufmgr.c 
b/src/backend/storage/buffer/bufmgr.c
index 0a05577b68..bbd2af9fae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1000,11 +1000,26 @@ ReadBuffer_common(SMgrRelation smgr, char 
relpersistence, ForkNumber forkNum,
 
if (isExtend)
{
+   instr_time  io_start,
+   io_time;
/* new 

Re: Provide PID data for "cannot wait on a latch owned by another process" in latch.c

2023-02-28 Thread Michael Paquier
On Tue, Feb 28, 2023 at 08:18:16AM +0100, Peter Eisentraut wrote:
> I would also have asked for some kind of prefix that introduces the numbers.

Okay.

> I wonder what these numbers are useful for though?  Is this a development
> aid?

Yes.

> Can you do anything with these numbers?

Yes.  They immediately pointed out that I had missed marking a latch as
owned in a process, hence the owner_pid was showing up as 0 when
trying to use it.  The second showed me the process that was involved,
which was still useful once cross-checked with the contents of the
logs prefixed with %p.
--
Michael


signature.asc
Description: PGP signature


Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-02-28 Thread Bharath Rupireddy
On Thu, Feb 23, 2023 at 3:03 AM Melanie Plageman
 wrote:
>
> Hi,
>
> So, I attached a rough implementation of both the autovacuum failsafe
> reverts to shared buffers and the vacuum option (no tests or docs or
> anything).

Thanks for the patches. I have some comments.

0001:
1. I don't quite understand the need for this 0001 patch. Firstly,
bstrategy is allocated once per autovacuum worker in AutovacMemCxt, which
goes away with the process. Secondly, the worker exits after
do_autovacuum(), at which point the memory context is gone. I think this patch
is unnecessary unless I'm missing something.

0002:
1. Don't we need to remove vac_strategy for analyze.c as well? It's
pretty-meaningless there than vacuum.c as we're passing bstrategy to
all required functions.

0004:
1. I think we should not have multiple sentences in a single error message.
How about "of %d, changing it to %d"?
+elog(WARNING, "buffer_usage_limit %d is below the
minimum buffer_usage_limit of %d. setting it to %d",

2. Typically, postgres error messages start with lowercase letters,
hints and detail messages start with uppercase letters.
+if (buffers == 0)
+elog(ERROR, "Use of shared buffers unsupported for buffer
access strategy: %s. nbuffers must be -1.",
+strategy_name);
+
+if (buffers > 0)
+elog(ERROR, "Specification of ring size in buffers
unsupported for buffer access strategy: %s. nbuffers must be -1.",
+strategy_name);

3. A function for this seems unnecessary, especially when a static
array would do the needful, something like forkNames[] (see the sketch
below the quoted function).
+static const char *
+btype_get_name(BufferAccessStrategyType btype)
+{
+switch (btype)
+{
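
A minimal sketch of such an array, using designated initializers so it cannot
get out of sync with the BufferAccessStrategyType enum order (the names
themselves are illustrative, not taken from the patch):

static const char *const strategy_names[] = {
	[BAS_NORMAL] = "normal",
	[BAS_BULKREAD] = "bulkread",
	[BAS_BULKWRITE] = "bulkwrite",
	[BAS_VACUUM] = "vacuum",
};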

4. Why are these assumptions needed? Can't we simplify by doing
validations on the new buffers parameter only when the btype is
BAS_VACUUM?
+if (buffers == 0)
+elog(ERROR, "Use of shared buffers unsupported for buffer
access strategy: %s. nbuffers must be -1.",
+strategy_name);

+// TODO: DEBUG logging message for dev?
+if (buffers == 0)
+btype = BAS_NORMAL;

5. Is this change needed for this patch?
 default:
 elog(ERROR, "unrecognized buffer access strategy: %d",
- (int) btype);
-return NULL;/* keep compiler quiet */
+(int) btype);
+
+pg_unreachable();

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Add documentation for coverage reports with meson

2023-02-28 Thread Michael Paquier
Hi all,

I have mentioned on a different thread of -docs that we have no
documentation to achieve $subject, so attached is a patch to add
something.  This can be done with the following steps:
meson setup -Db_coverage=true .. blah
ninja
meson test
ninja coverage-html

As far as I can see, there is no option to generate anything other than
an HTML report?  This page suggests the contrary, but it does not seem to
work here, and ninja only offers coverage-html and coverage as the
available targets:
https://mesonbuild.com/howtox.html#producing-a-coverage-report

Side issue: the current code generates no reports for the files that
are automatically generated in src/backend/nodes/, which are actually
part of src/include/ for a meson build.  I have not looked into that
yet.

Thoughts?
--
Michael
diff --git a/doc/src/sgml/regress.sgml b/doc/src/sgml/regress.sgml
index a08c7a78af..dd9a004a6d 100644
--- a/doc/src/sgml/regress.sgml
+++ b/doc/src/sgml/regress.sgml
@@ -825,53 +825,76 @@ PG_TEST_NOCLEAN=1 make -C src/bin/pg_dump check
 instrumentation, so that it becomes possible to examine which
 parts of the code are covered by the regression tests or any other
 test suite that is run with the code.  This is currently supported
-when compiling with GCC, and it requires the gcov
-and lcov programs.
+when compiling with GCC, and it requires the gcov,
+lcov and genhtml programs.

 
-   
-A typical workflow looks like this:
+   
+Coverage with configure
+
+ A typical workflow looks like this:
 
 ./configure --enable-coverage ... OTHER OPTIONS ...
 make
 make check # or other test suite
 make coverage-html
 
-Then point your HTML browser
-to coverage/index.html.
-   
+ Then point your HTML browser
+ to coverage/index.html.
+
 
-   
-If you don't have lcov or prefer text output over an
-HTML report, you can run
+
+ If you don't have lcov or prefer text output over an
+ HTML report, you can run
 
 make coverage
 
-instead of make coverage-html, which will
-produce .gcov output files for each source file
-relevant to the test.  (make coverage and make
-coverage-html will overwrite each other's files, so mixing them
-might be confusing.)
-   
+ instead of make coverage-html, which will
+ produce .gcov output files for each source file
+ relevant to the test.  (make coverage and make
+ coverage-html will overwrite each other's files, so mixing them
+ might be confusing.)
+
 
-   
-You can run several different tests before making the coverage report;
-the execution counts will accumulate.  If you want
-to reset the execution counts between test runs, run:
+
+ You can run several different tests before making the coverage report;
+ the execution counts will accumulate.  If you want
+ to reset the execution counts between test runs, run:
 
 make coverage-clean
 
-   
+
 
-   
-You can run the make coverage-html or make
-coverage command in a subdirectory if you want a coverage
-report for only a portion of the code tree.
-   
+
+ You can run the make coverage-html or make
+ coverage command in a subdirectory if you want a coverage
+ report for only a portion of the code tree.
+
 
-   
-Use make distclean to clean up when done.
-   
+
+ Use make distclean to clean up when done.
+
+   
+
+   
+Coverage with meson
+
+ A typical workflow looks like this:
+
+meson setup -Db_coverage=true ... OTHER OPTIONS ...
+ninja
+meson test
+ninja coverage-html
+
+ Then point your HTML browser
+ to ./meson-logs/coveragereport/index.html.
+
+
+
+ You can run several different tests before making the coverage report;
+ the execution counts will accumulate.
+
+   
   
 
 


signature.asc
Description: PGP signature


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Amit Kapila
On Mon, Feb 27, 2023 at 2:21 PM Hayato Kuroda (Fujitsu)
 wrote:
>

Few comments:
1.
+ /*
+ * If we've requested to shut down, exit the process.
+ *
+ * Note that WalSndDone() cannot be used here because the delaying
+ * changes will be sent in the function.
+ */
+ if (got_STOPPING)
+ {
+ QueryCompletion qc;
+
+ /* Inform the standby that XLOG streaming is done */
+ SetQueryCompletion(, CMDTAG_COPY, 0);
+ EndCommand(, DestRemote, false);
+ pq_flush();

Do we really need to do anything except for breaking the loop and let
the exit handling happen in the main loop when 'got_STOPPING' is set?
AFAICS, this is what we are doing in some other places (see
WalSndWaitForWal()). Won't that work? It seems that would help us send
all the pending WAL.
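
In other words, something as simple as the following might suffice at that
point (a sketch; the rest of the shutdown would then be driven by the
existing main loop):

	if (got_STOPPING)
		break;			/* let WalSndLoop()/WalSndDone() handle the shutdown */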

2.
+ /* Try to flush pending output to the client */
+ if (pq_flush_if_writable() != 0)
+ WalSndShutdown();

Is there a reason to try flushing here?

Apart from the above, I have made a few changes in the comments in the
attached diff patch. If you agree with those then please include them
in the next version.

-- 
With Regards,
Amit Kapila.


changes_amit_1.patch
Description: Binary data


Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-02-28 Thread Bharath Rupireddy
On Thu, Jan 12, 2023 at 6:06 AM Andres Freund  wrote:
>
> On 2023-01-11 17:26:19 -0700, David G. Johnston wrote:
> > Should we just add "ring_buffers" to the existing "shared_buffers" and
> > "temp_buffers" settings?
>
> The different types of ring buffers have different sizes, for good reasons. So
> I don't see that working well. I also think it'd be more often useful to
> control this on a statement basis - if you have a parallel import tool that
> starts NCPU COPYs you'd want a smaller buffer than a single threaded COPY. Of
> course each session can change the ring buffer settings, but still.

How about having GUCs for each ring buffer (bulk_read_ring_buffers,
bulk_write_ring_buffers, vacuum_ring_buffers - ah, 3 more new GUCs)?
These options can help especially when statement level controls aren't
easy to add (COPY, CREATE TABLE AS/CTAS, REFRESH MAT VIEW/RMV)? If
needed users can also set them at the system level. For instance, one
can set bulk_write_ring_buffers to other than 16MB or -1 to disable
the ring buffer to use shared_buffers and run a bunch of bulk write
queries.
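
As a sketch, each of these could be a plain integer GUC in guc_tables.c,
along these lines (the name, default, and the vacuum_ring_buffers variable
are illustrative only; -1 here would mean "use shared buffers"):

	/* hypothetical entry in src/backend/utils/misc/guc_tables.c */
	{
		{"vacuum_ring_buffers", PGC_USERSET, RESOURCES_MEM,
			gettext_noop("Sets the size of the buffer ring used by VACUUM."),
			NULL,
			GUC_UNIT_KB
		},
		&vacuum_ring_buffers,
		256, -1, MAX_KILOBYTES,
		NULL, NULL, NULL
	},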

Although I'm not quite opposing the idea of statement level controls
(like the VACUUM one proposed here), it is better to make these ring
buffer sizes configurable across the system to help with the other
similar cases e.g., a CTAS or RMV can help subsequent reads from
shared buffers if ring buffer is skipped.

Thoughts?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com