RE: Rework LogicalOutputPluginWriterUpdateProgress

2023-03-06 Thread Hayato Kuroda (Fujitsu)
Dear Wang,

Thank you for updating the patch! The following are my comments.

---
01. missing comments
You might have missed the comment from Peter[1]. Or could you point me to the related one?

---
02. LogicalDecodingProcessRecord()

Don't we have to call UpdateDecodingProgressAndKeepalive() when there is no
decoding function? Suppose the timeout parameter is set to a short period and
there are many sequential operations in the transaction. In that case, a
timeout may occur while ReorderBufferProcessXid() is called several times.
It may be a bad example, but my point is that we may have to consider the case
where a decoding function has not been implemented yet.
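
To illustrate, the shape I have in mind is something like the following (a
rough sketch, not the actual patch; the dispatch code mirrors
LogicalDecodingProcessRecord(), and the helper name comes from this patch, so
its exact signature and placement are my assumption):

    rmgr = GetRmgr(XLogRecGetRmid(record));

    if (rmgr.rm_decode != NULL)
        rmgr.rm_decode(ctx, &buf);
    else
    {
        /* No decoding function: just track the xid for the reorder buffer */
        ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record), buf.origptr);

        /*
         * ...but still update progress/keepalive here, so that a long run
         * of such records cannot let the timeout expire.
         */
        UpdateDecodingProgressAndKeepalive(ctx);
    }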

---
03. stream_*_cb_wrapper

Only the stream_*_cb_wrapper functions have the comment "don't call update
progress, we didn't really make any", but there are more functions that do not
send updates. Is there a reason why only these have it?

[1]: 
https://www.postgresql.org/message-id/CAHut%2BPsksiQHuv4A54R4w79TAvCu__PcuffKYY0V96e2z_sEvA%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: pg_rewind: Skip log directory for file type check like pg_wal

2023-03-06 Thread Soumyadeep Chakraborty
On Mon, Mar 6, 2023 at 11:33 PM Alexander Kukushkin  wrote:
>
>
> Let's assume that on the source we have "pg_log" and on the target we have
> "my_log" (they are configured using the "log_directory" GUC).
> When doing a rewind in this case, we want neither to remove the content of
> "my_log" on the target nor to copy the content of "pg_log" from the source.
> This can't be achieved just by introducing a static string "log". The
> "log_directory" GUC must be examined on both source and target.

The trouble with doing that is when pg_rewind is run in non-libpq (offline)
mode. Then we would have to parse it out of the conf file(s). Is there a
standard way of doing that?

Regards,
Soumyadeep (VMware)




Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-06 Thread Peter Smith
On Mon, Mar 6, 2023 at 10:18 PM Önder Kalacı  wrote:
>
>> 4. IdxIsRelationIdentityOrPK
>>
>> +/*
>> + * Given a relation and OID of an index, returns true if the
>> + * index is relation's replica identity index or relation's
>> + * primary key's index.
>> + *
>> + * Returns false otherwise.
>> + */
>> +bool
>> +IdxIsRelationIdentityOrPK(Relation rel, Oid idxoid)
>> +{
>> + Assert(OidIsValid(idxoid));
>> +
>> + return GetRelationIdentityOrPK(rel) == idxoid;
>> +}
>>
>> I think you've "simplified" this function in v28 but AFAICT now it has
>> a different logic to v27.
>>
>> PREVIOUSLY it was coded like
>> + return RelationGetReplicaIndex(rel) == idxoid ||
>> + RelationGetPrimaryKeyIndex(rel) == idxoid;
>>
>> But now in that scenario, it won't
>> even check the PK if there was a valid RI. So it might return false
>> when previously it returned true. Is it deliberate?
>>
>
> Thanks for the detailed review/investigation on this. But, I also agree that
> there is no difference in terms of correctness. Also, it is probably better
> to be consistent with the existing code. So, making IdxIsRelationIdentityOrPK()
> rely on GetRelationIdentityOrPK() still sounds better to me.
>
> You can see that if 'idxoid' did NOT match the RI but DID match the PK,
> previously it would return true.
>
>
> Still, I cannot see how this check would yield a different result given how
> RI/PK works -- as Amit also noted in the next e-mail.
>
> Do you see any cases where this check would produce a different result?
> I cannot, but wanted to double check with you.
>
>

Let me give an example to demonstrate why I thought something was fishy here:

Imagine rel has a (non-default) REPLICA IDENTITY with Oid=<oid_RI>.
Imagine the same rel has a PRIMARY KEY with Oid=<oid_PK>.

---

+/*
+ * Get replica identity index or if it is not defined a primary key.
+ *
+ * If neither is defined, returns InvalidOid
+ */
+Oid
+GetRelationIdentityOrPK(Relation rel)
+{
+ Oid idxoid;
+
+ idxoid = RelationGetReplicaIndex(rel);
+
+ if (!OidIsValid(idxoid))
+ idxoid = RelationGetPrimaryKeyIndex(rel);
+
+ return idxoid;
+}
+
+/*
+ * Given a relation and OID of an index, returns true if the
+ * index is relation's replica identity index or relation's
+ * primary key's index.
+ *
+ * Returns false otherwise.
+ */
+bool
+IdxIsRelationIdentityOrPK(Relation rel, Oid idxoid)
+{
+ Assert(OidIsValid(idxoid));
+
+ return GetRelationIdentityOrPK(rel) == idxoid;
+}
+


So, according to the above function comment/name, I would expect that calling
IdxIsRelationIdentityOrPK passing Oid=<oid_RI> or Oid=<oid_PK> will both
return true, right?

But AFAICT

IdxIsRelationIdentityOrPK(rel, <oid_RI>) --> GetRelationIdentityOrPK(rel)
returns <oid_RI> (the valid oid of the RI) --> <oid_RI> == <oid_RI> --> true;

IdxIsRelationIdentityOrPK(rel, <oid_PK>) --> GetRelationIdentityOrPK(rel)
returns <oid_RI> (the valid oid of the RI) --> <oid_RI> == <oid_PK> --> false;

~

Now two people are telling me this is OK, but I've been staring at it
for too long and I just don't see how it can be. (??)

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-06 Thread Thomas Munro
On Tue, Mar 7, 2023 at 7:47 PM Julien Rouhaud  wrote:
> I registered lapwing as a 32b Debian 7 so I thought it would be expected to
> keep it as-is rather than upgrading to all newer major Debian versions,
> especially since there were newer Debian animals registered (no 32b though
> AFAICS).

Animals do get upgraded: see the "w. e. f." ("with effect from") line
in https://buildfarm.postgresql.org/cgi-bin/show_members.pl which
comes from people running something like ./update_personality.pl
--os-version "11" so that it shows up on the website.

>  I'm not opposed to upgrading it but I think there's still value in
> having somewhat old package versions being tested, especially since there
> isn't much 32b coverage of those.  I would be happy to register a newer 32b
> version, or even sid, if needed but the -m32 part on the CI makes me think
> there isn't much value doing that now.

Totally up to you as an animal zoo keeper but in my humble opinion the
interesting range of Debian releases currently is 11-13, or maybe 10
if you really want to test the LTS/old-stable release (and CI is
testing 11).

> I think this is the first time that a problem raised by -Werror on that old
> animal is actually a false positive, while there were many times it reported
> useful stuff.  Now this has been up for years before we got better CI tooling,
> especially with -m32 support, so there might not be any value in having it
> anymore.  As I mentioned at [1] I don't mind removing it or just working on
> upgrading any dependency (or removing known buggy compiler flags) to keep it
> without being annoying.  In any case I'm usually quite fast at reacting to any
> problem/complaint on that animal, so you don't have to worry about the
> buildfarm being red too long if it came to that.

Yeah, it's given us lots of useful data, thanks.  Personally I would
upgrade it so it keeps telling us useful things but I feel like I've
said enough about that so I'll shut up now :-)  Re: being red too
long... yeah that reminds me, I really need to fix seawasp ASAP...




Re: pg_rewind: Skip log directory for file type check like pg_wal

2023-03-06 Thread Alexander Kukushkin
On Mon, 6 Mar 2023 at 19:37, Soumyadeep Chakraborty <
soumyadeep2...@gmail.com> wrote:

>
> > 2. Unlike "pg_wal", the "log" directory is not necessarily located
> inside PGDATA. The actual value is configured using "log_directory" GUC,
> which just happened to be "log" by default. And in fact actual values on
> source and target could be different.
>
> I think we only care about files/dirs inside the datadir. Anything
> outside is out of scope for
> pg_rewind AFAIU. We can only address the common case here. As mentioned in
> this
> comment:
>
>  * XXX: There is no backend function to get a symbolic link's target in
>  * general, so if the admin has put any custom symbolic links in the data
>  * directory, they won't be copied correctly.
>

That's exactly my point. Users are very creative.
On one node they could set log_directory to, for example, "pg_log" and on
another one to "my_log".
They would then be writing logs to $PGDATA/pg_log and $PGDATA/my_log
respectively, and both are located inside the datadir.

Let's assume that on the source we have "pg_log" and on the target we have
"my_log" (they are configured using the "log_directory" GUC).
When doing a rewind in this case, we want neither to remove the content of
"my_log" on the target nor to copy the content of "pg_log" from the source.
This can't be achieved just by introducing a static string "log". The
"log_directory" GUC must be examined on both source and target.

Regards,
--
Alexander Kukushkin


Re: Add pg_walinspect function with block info columns

2023-03-06 Thread Michael Paquier
On Tue, Mar 07, 2023 at 03:49:02PM +0900, Kyotaro Horiguchi wrote:
> Ah. Yes, that expansion sounds sensible.

Okay, so, based on this idea, I have hacked on this stuff and finished with
the attached, which shows block data if it exists, as well as FPI stuff if
any.  bimg_info is shown as a text[] for its flags.

I guess that I'd better add a test that correctly shows a record with some
block data attached to it, on top of the existing one for FPIs.
Any suggestions?  Perhaps just a heap/heap2 record?

Thoughts?
--
Michael
From 237be9d768c20f9c8fec0a11a8fcbee6c4e9bf8a Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 7 Mar 2023 16:06:28 +0900
Subject: [PATCH] Rework pg_walinspect to retrieve more block information

---
 doc/src/sgml/pgwalinspect.sgml|  36 +++--
 .../pg_walinspect/expected/pg_walinspect.out  |  16 +--
 .../pg_walinspect/pg_walinspect--1.0--1.1.sql |  16 ++-
 contrib/pg_walinspect/pg_walinspect.c | 134 +-
 contrib/pg_walinspect/sql/pg_walinspect.sql   |  16 +--
 5 files changed, 145 insertions(+), 73 deletions(-)

diff --git a/doc/src/sgml/pgwalinspect.sgml b/doc/src/sgml/pgwalinspect.sgml
index 3d7cdb95cc..3a65beb08d 100644
--- a/doc/src/sgml/pgwalinspect.sgml
+++ b/doc/src/sgml/pgwalinspect.sgml
@@ -190,31 +190,39 @@ combined_size_percentage | 2.8634072910530795
 

 
- pg_get_wal_fpi_info(start_lsn pg_lsn, end_lsn pg_lsn) returns setof record
+ pg_get_wal_block_info(start_lsn pg_lsn, end_lsn pg_lsn) returns setof record
 
 
 
  
-  Gets a copy of full page images as bytea values (after
-  applying decompression when necessary) and their information associated
-  with all the valid WAL records between
+  Gets a copy of the block information stored in WAL records. This includes
+  copies of the block data (NULL if none) and full page
+  images as bytea values (after
+  applying decompression when necessary, or NULL if none)
+  and their information associated with all the valid WAL records between
   start_lsn and
-  end_lsn. Returns one row per full page image.
-  If start_lsn or
+  end_lsn. Returns one row per block registered
+  in a WAL record. If start_lsn or
   end_lsn are not yet available, the function
   will raise an error. For example:
 
-postgres=# SELECT lsn, reltablespace, reldatabase, relfilenode, relblocknumber,
-  forkname, substring(fpi for 24) as fpi_trimmed
- FROM pg_get_wal_fpi_info('0/1801690', '0/1825C60');
+postgres=# SELECT lsn, blockid, reltablespace, reldatabase, relfilenode,
+  relblocknumber, forkname,
+  substring(blockdata for 24) as block_trimmed,
+  substring(fpi for 24) as fpi_trimmed, fpilen, fpiinfo
+ FROM pg_get_wal_block_info('0/20BC498', '0/20BCD80');
 -[ RECORD 1 ]--+---
-lsn| 0/1807E20
+lsn| 0/20BC498
+blockid| 0
 reltablespace  | 1663
-reldatabase| 5
-relfilenode| 16396
-relblocknumber | 43
+reldatabase| 16384
+relfilenode| 24576
+relblocknumber | 44
 forkname   | main
-fpi_trimmed| \xb89e6601a003c00300200420
+block_trimmed  | null
+fpi_trimmed| \x18ed0a020401a01800200420
+fpilen | 2148
+fpiinfo| {HAS_HOLE,APPLY}
 
  
 
diff --git a/contrib/pg_walinspect/expected/pg_walinspect.out b/contrib/pg_walinspect/expected/pg_walinspect.out
index 9bcb05354e..b1682582e8 100644
--- a/contrib/pg_walinspect/expected/pg_walinspect.out
+++ b/contrib/pg_walinspect/expected/pg_walinspect.out
@@ -74,7 +74,7 @@ SELECT COUNT(*) >= 1 AS ok FROM pg_get_wal_records_info(:'wal_lsn1', :'wal_lsn2'
 (1 row)
 
 -- ===
--- Tests to get full page image (FPI) from WAL record
+-- Tests to get block information from WAL record
 -- ===
 SELECT pg_current_wal_lsn() AS wal_lsn3 \gset
 -- Force FPI on the next update.
@@ -83,8 +83,8 @@ CHECKPOINT;
 UPDATE sample_tbl SET col1 = col1 * 100 WHERE col1 = 1;
 SELECT pg_current_wal_lsn() AS wal_lsn4 \gset
 -- Check if we get FPI from WAL record.
-SELECT COUNT(*) >= 1 AS ok FROM pg_get_wal_fpi_info(:'wal_lsn3', :'wal_lsn4')
-  WHERE relfilenode = :'sample_tbl_oid';
+SELECT COUNT(*) >= 1 AS ok FROM pg_get_wal_block_info(:'wal_lsn3', :'wal_lsn4')
+  WHERE relfilenode = :'sample_tbl_oid' AND fpi IS NOT NULL;
  ok 
 
  t
@@ -116,7 +116,7 @@ SELECT has_function_privilege('regress_pg_walinspect',
 (1 row)
 
 SELECT has_function_privilege('regress_pg_walinspect',
-  'pg_get_wal_fpi_info(pg_lsn, pg_lsn) ', 'EXECUTE'); -- no
+  'pg_get_wal_block_info(pg_lsn, pg_lsn) ', 'EXECUTE'); -- no
  has_function_privilege 
 
  f
@@ -146,7 +146,7 @@ SELECT has_function_privilege('regress_pg_walinspect',
 (1 row)
 
 

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-03-06 Thread Bharath Rupireddy
On Tue, Mar 7, 2023 at 3:30 AM Nathan Bossart  wrote:
>
> +void
> +XLogReadFromBuffers(XLogRecPtr startptr,
>
> Since this function presently doesn't return anything, can we have it
> return the number of bytes read instead of storing it in a pointer
> variable?

Done.

> +ptr = startptr;
> +nbytes = count;
> +dst = buf;
>
> These variables seem superfluous.

I needed startptr and count for the DEBUG1 message and the assertion at the
end. I removed dst and now use buf in the new patch.

> +/*
> + * Requested WAL isn't available in WAL buffers, so return with
> + * what we have read so far.
> + */
> +break;
>
> nitpick: I'd move this to the top so that you can save a level of
> indentation.

Done.

> +/*
> + * All the bytes are not in one page. Read available bytes on
> + * the current page, copy them over to output buffer and
> + * continue to read remaining bytes.
> + */
>
> Is it possible to memcpy more than a page at a time?

It would complicate things a lot there; the logic to figure out how many of
the last page's bytes fit within the page boundary gets complicated. Also, the
logic to verify each page's header gets complicated. We might lose out if we
memcpy all the pages at once and then verify each page's header in another
loop.

I would like to keep it simple - read a single page from WAL buffers, verify
it, and continue.
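
In other words, the loop has roughly this shape (just a sketch; the buffer
lookup providing 'page' and the header verification are elided, and the
variable names are the ones from the patch):

    while (nbytes > 0)
    {
        Size        bytes_on_page;

        /* how much of the request lies on the page containing 'ptr' */
        bytes_on_page = XLOG_BLCKSZ - (ptr % XLOG_BLCKSZ);
        if (bytes_on_page > nbytes)
            bytes_on_page = nbytes;

        /* ... locate the WAL buffer page holding 'ptr'; stop on a miss ... */

        /* copy the bytes available on this page, then verify its header */
        memcpy(buf, page + (ptr % XLOG_BLCKSZ), bytes_on_page);

        buf += bytes_on_page;
        ptr += bytes_on_page;
        nbytes -= bytes_on_page;
    }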

> +/*
> + * The fact that we acquire WALBufMappingLock while reading the WAL
> + * buffer page itself guarantees that no one else initializes it or
> + * makes it ready for next use in AdvanceXLInsertBuffer().
> + *
> + * However, we perform basic page header checks for ensuring that
> + * we are not reading a page that just got initialized. Callers
> + * will anyway perform extensive page-level and record-level
> + * checks.
> + */
>
> Hm.  I wonder if we should make these assertions instead.

Okay. I added XLogReaderValidatePageHeader for assert-only builds, which will
help catch any issues there. But we can't perform record-level checks here
because this function doesn't know where the record starts; it knows only
pages. This change required us to pass XLogReaderState to XLogReadFromBuffers.
I marked it as PG_USED_FOR_ASSERTS_ONLY and did the page header checks only
when it is passed as non-null, so that someone who doesn't have an
XLogReaderState can still read from buffers.
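
The check itself reduces to something like this (a sketch; pageptr and page
stand for the LSN and the start of the page just copied):

#ifdef USE_ASSERT_CHECKING
    if (state != NULL)
        Assert(XLogReaderValidatePageHeader(state, pageptr, page));
#endif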

> +elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
> + *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
>
> I definitely don't think we should put an elog() in this code path.
> Perhaps this should be guarded behind WAL_DEBUG.

Placing it behind WAL_DEBUG doesn't help users/developers. My intention was to
let users know that the WAL read hit the buffers; it'll help them report any
issue that occurs and also help developers debug that issue.

On a different note - I was recently looking at the code around the
WAL_DEBUG macro and the wal_debug GUC. It looks so complex that one
needs to build the source code with the WAL_DEBUG macro and enable the
GUC to see the extended logs for WAL. IMO, the best way forward is either:
1) unify all the code under WAL_DEBUG macro and get rid of wal_debug GUC, or
2) unify all the code under wal_debug GUC (it is developer-only and
superuser-only so there shouldn't be a problem even if we ship it out
of the box).

If someone is concerned about the GUC being enabled on production
servers knowingly or unknowingly with option (2), we can go ahead with
option (1). I will discuss this separately to see what others think.

> I think we can simplify this.  We effectively take the same action any time
> "count" doesn't equal "read_bytes", so there's no need for the "else if".
>
> if (count == read_bytes)
> return true;
>
> buf += read_bytes;
> startptr += read_bytes;
> count -= read_bytes;

I wanted to avoid setting these unnecessarily for buffer misses.
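
That is, something like the following shape, so that a complete miss doesn't
touch them at all (names as in the snippet above):

    if (count == read_bytes)
        return true;            /* fully served from WAL buffers */

    if (read_bytes != 0)
    {
        /* partial hit: advance past what the buffers gave us */
        buf += read_bytes;
        startptr += read_bytes;
        count -= read_bytes;
    }

    /* ... fall through to reading the remainder from the WAL file ... */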

Thanks a lot for reviewing. I'm attaching the v8 patch for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v8-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch
Description: Binary data


v8-0002-Add-test-module-for-verifying-WAL-read-from-WAL-b.patch
Description: Binary data


Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-06 Thread Amit Kapila
On Tue, Mar 7, 2023 at 9:16 AM shiy.f...@fujitsu.com
 wrote:
>
> On Monday, Mar 6, 2023 7:19 PM Önder Kalacı  wrote:
> >
> > Yeah, seems easier to follow to me as well. Reflected it in the comment as 
> > well.
> >

Few comments:
=
1.
+get_usable_indexoid(ApplyExecutionData *edata, ResultRelInfo *relinfo)
{
...
+ if (targetrelkind == RELKIND_PARTITIONED_TABLE)
+ {
+ /* Target is a partitioned table, so find relmapentry of the partition */
+ TupleConversionMap *map = ExecGetRootToChildMap(relinfo, edata->estate);
+ AttrMap*attrmap = map ? map->attrMap : NULL;
+
+ relmapentry =
+ logicalrep_partition_open(relmapentry, relinfo->ri_RelationDesc,
+   attrmap);


When will we hit this part of the code? As per my understanding, for
partitioned tables, we always follow apply_handle_tuple_routing()
which should call logicalrep_partition_open(), and do the necessary
work for us.

2. In logicalrep_partition_open(), it would be better to set
localrelvalid after finding the required index. The entry should be
marked valid after initializing/updating all the required members. I
have changed this in the attached.

3.
@@ -31,6 +32,7 @@ typedef struct LogicalRepRelMapEntry
  Relation localrel; /* relcache entry (NULL when closed) */
  AttrMap*attrmap; /* map of local attributes to remote ones */
  bool updatable; /* Can apply updates/deletes? */
+ Oid usableIndexOid; /* which index to use, or InvalidOid if none */

Would it be better to name this new variable localindexoid to match the
existing variable localreloid? Also, the camel case for this variable appears
odd.

4. If you agree with the above, then we should probably change the
name of functions get_usable_indexoid() and
FindLogicalRepUsableIndex() accordingly.

5.
+ {
+ /*
+ * If we had a primary key or relation identity with a unique index,
+ * we would have already found and returned that oid. At this point,
+ * the remote relation has replica identity full.

These comments are not required, as they just state what the code above is
doing.

Apart from the above, I have made some modifications in the other
comments. If you are convinced with those, then kindly include them in
the next version.

-- 
With Regards,
Amit Kapila.


use_index_sub_amit_1.patch
Description: Binary data


Re: [HACKERS] make async slave to wait for lsn to be replayed

2023-03-06 Thread Kartyshov Ivan

Fix meson.build troubles

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Companydiff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index dbe9394762..422bb1ed82 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1752,6 +1753,15 @@ PerformWalRecovery(void)
 break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for,
+			 * set latches in shared memory array to notify the waiter.
+			 */
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWaitedLSN())
+			{
+				WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 42cced9ebe..ec6ab7722a 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00..3465673f0a
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,275 @@
+/*-
+ *
+ * wait.c
+ *	  Implements wait lsn, which allows waiting for events such as
+ *	  LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include 
+
+#include "access/xact.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogdefs.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/backendid.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+
+/* Add to shared memory array */
+static void AddWaitedLSN(XLogRecPtr lsn_to_wait);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64 min_lsn; /* XLogRecPtr of minimal waited for LSN */
+	slock_t		mutex;
+	/* LSNs that different backends are waiting */
+	XLogRecPtr	lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static WaitState *state;
+
+/*
+ * Add the wait event of the current backend to shared memory array
+ */
+static void
+AddWaitedLSN(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn.value)
+		state->min_lsn.value = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete wait event of the current backend from the shared memory array.
+ */
+void
+DeleteWaitedLSN(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete;
+
+	SpinLockAcquire(&state->mutex);
+
+	lsn_to_delete = state->lsn[MyBackendId];
+	state->lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	/* If we are deleting the minimal LSN, then choose the next min_lsn */
+	if (lsn_to_delete != InvalidXLogRecPtr &&
+		lsn_to_delete == state->min_lsn.value)
+	{
+		state->min_lsn.value = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->lsn[i] != InvalidXLogRecPtr &&
+				state->lsn[i] < state->min_lsn.value)
+				state->min_lsn.value = state->lsn[i];
+	}
+
+	/* If deleting from the end of the array, shorten the array's used part */
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/*
+ * Initialize an array of events to wait for in shared memory
+ */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = 

Re: Add pg_walinspect function with block info columns

2023-03-06 Thread Kyotaro Horiguchi
At Tue, 7 Mar 2023 14:44:49 +0900, Michael Paquier  wrote 
in 
> On Tue, Mar 07, 2023 at 11:17:45AM +0900, Kyotaro Horiguchi wrote:
> > Thus I'm inclined to agree with Michael's suggestion of creating a new
> > normalized set-returning function that returns information of
> > "blocks".
> 
> Just to be clear here, I am not suggesting to add a new function for
> only the block information, just a rename of the existing
> pg_get_wal_fpi_info() to something like pg_get_wal_block_info() that
> includes both the FPI (if any, or NULL if none) and the block data (if
> any, or NULL if none) so that all of them are governed by the same lookup
> at pg_wal/.  The fpi information (aka compression type) is displayed
> if there is a FPI in the block.

Ah. Yes, that expansion sounds sensible.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-06 Thread Julien Rouhaud
On Tue, Mar 07, 2023 at 07:14:51PM +1300, Thomas Munro wrote:
> On Tue, Mar 7, 2023 at 5:32 PM Michael Paquier  wrote:
> > On Tue, Mar 07, 2023 at 03:44:46PM +1300, Thomas Munro wrote:
> > > Oh, you already pushed a fix.  But now I'm wondering if it's useful to
> > > have old buggy compilers set to run with -Werror.
> >
> > Yes, as far as I can see when investigating the issue, this is an old
> > bug of gcc when detecting where the initialization needs to be
> > applied.  And at the same time the fix is deadly simple, so the
> > current status quo does not sound that bad to me.  Note that lapwing is
> > one of the only animals testing 32b builds, and it has saved us from
> > quite a few bugs over the years.
>
> Yeah, but I'm just wondering, why not run a current release on it[1]?
> Debian is one of the few distributions still supporting 32 bit
> kernels, and it's good to test rare things, but AFAIK the primary
> reason we finish up with EOL'd OSes in the 'farm is because they have
> been forgotten (the secondary reason is because they couldn't be
> upgraded because the OS dropped the [micro]architecture).

I registered lapwing as a 32b Debian 7 so I thought it would be expected to
keep it as-is rather than upgrading to all newer major Debian versions,
especially since there were newer Debian animals registered (no 32b though
AFAICS).  I'm not opposed to upgrading it but I think there's still value in
having somewhat old package versions being tested, especially since there
isn't much 32b coverage of those.  I would be happy to register a newer 32b
version, or even sid, if needed but the -m32 part on the CI makes me think
there isn't much value doing that now.

Now about the -Werror:

> BTW CI also tests 32 bit with -m32 on Debian, but with a 64 bit
> kernel, which probably doesn't change much at the level we care about,
> so maybe this doesn't matter much... just sharing an observation that
> we're wasting time thinking about an OS release that gave up the ghost
> in 2016, because it is running with -Werror.  *shrug*

I think this is the first time that a problem raised by -Werror on that old
animal is actually a false positive, while there were many times it reported
useful stuff.  Now this has been up for years before we got better CI tooling,
especially with -m32 support, so there might not be any value in having it
anymore.  As I mentioned at [1] I don't mind removing it or just working on
upgrading any dependency (or removing known buggy compiler flags) to keep it
without being annoying.  In any case I'm usually quite fast at reacting to any
problem/complaint on that animal, so you don't have to worry about the
buildfarm being red too long if it came to that.

[1] 
https://www.postgresql.org/message-id/20220921155025.wdixzbrt2uzbi6vz%40jrouhaud




Re: Make mesage at end-of-recovery less scary.

2023-03-06 Thread Kyotaro Horiguchi
At Mon, 6 Mar 2023 14:58:15 -0500, "Gregory Stark (as CFM)" 
 wrote in 
> It looks like this needs a rebase and at a quick glance it looks like
> more than a trivial conflict. I'll mark it Waiting on Author. Please
> update it back when it's rebased

Thanks for checking it!

I think 4ac30ba4f2 is the one; it changes a few error messages. In addition
to rebasing, I rewrote some code comments in xlogreader.c and revised the
additional test script.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From d8b7ff96f48ede26390fa2208460ee2a8ea9cd87 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 7 Mar 2023 14:55:58 +0900
Subject: [PATCH v25] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message like
"invalid record length" that suggests something serious is happening.
Actually, if recovery meets a record with length = 0, that usually means it
has finished applying all available WAL records.

Make this message less scary, as "reached end of WAL", and instead raise the
error level for other kinds of WAL failures to WARNING.  To make sure that
the detection is correct, this patch checks that all trailing bytes in the
same page are zeroed in that case.
---
 src/backend/access/transam/xlogreader.c   | 135 ++
 src/backend/access/transam/xlogrecovery.c |  94 +++
 src/backend/replication/walreceiver.c |   7 +-
 src/bin/pg_waldump/pg_waldump.c   |  13 ++-
 src/include/access/xlogreader.h   |   1 +
 src/test/recovery/t/035_recovery.pl   | 130 +
 6 files changed, 329 insertions(+), 51 deletions(-)
 create mode 100644 src/test/recovery/t/035_recovery.pl

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cadea21b37..5a27c10bbb 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 			 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
   XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -558,6 +561,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -641,16 +645,12 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Verify the record header.
 	 *
 	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
+	 * header might not fit on this page.
 	 */
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
 
 	/*
 	 * If the whole record header is on this page, validate it immediately.
@@ -669,18 +669,21 @@ restart:
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-  "invalid record length at %X/%X: expected at least %u, got %u",
-  LSN_FORMAT_ARGS(RecPtr),
-  (uint32) SizeOfXLogRecord, total_len);
+		/*
+		 * xl_tot_len is the first field of the struct, so it must be on this
+		 * page (the records are MAXALIGNed), but we cannot access any other
+		 * fields until we've verified that we got the whole header.
+		 *
+		 * XXX: more validation should be done here
+		 */
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Find space to decode this record.  Don't allow oversized allocation if
 	 * the caller requested nonblocking.  Otherwise, we *have* to try to
@@ -1106,25 +1109,81 @@ XLogReaderInvalReadState(XLogReaderState *state)
 	state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ *
+ * Returns true if the xl_tot_len header field has a seemingly valid value,
+ * which means the caller 

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-06 Thread Thomas Munro
On Tue, Mar 7, 2023 at 5:32 PM Michael Paquier  wrote:
> On Tue, Mar 07, 2023 at 03:44:46PM +1300, Thomas Munro wrote:
> > Oh, you already pushed a fix.  But now I'm wondering if it's useful to
> > have old buggy compilers set to run with -Werror.
>
> Yes, as far as I can see when investigating the issue, this is an old
> bug of gcc when detecting where the initialization needs to be
> applied.  And at the same time the fix is deadly simple, so the
> current status quo does not sound that bad to me.  Note that lapwing is
> one of the only animals testing 32b builds, and it has saved us from
> quite a few bugs over the years.

Yeah, but I'm just wondering, why not run a current release on it[1]?
Debian is one of the few distributions still supporting 32 bit
kernels, and it's good to test rare things, but AFAIK the primary
reason we finish up with EOL'd OSes in the 'farm is because they have
been forgotten (the secondary reason is because they couldn't be
upgraded because the OS dropped the [micro]architecture).  Unlike
vintage SPARC, actual users might plausibly be running a current
release on a 32 bit Intel system, I guess (maybe on a Quark
microcontroller)?

BTW CI also tests 32 bit with -m32 on Debian, but with a 64 bit
kernel, which probably doesn't change much at the level we care about,
so maybe this doesn't matter much... just sharing an observation that
we're wasting time thinking about an OS release that gave up the ghost
in 2016, because it is running with -Werror.  *shrug*

[1] https://wiki.debian.org/DebianReleases




Re: Combine pg_walinspect till_end_of_wal functions with others

2023-03-06 Thread Julien Rouhaud
On Tue, Mar 07, 2023 at 01:56:24PM +0900, Michael Paquier wrote:
> On Tue, Mar 07, 2023 at 12:42:20PM +0800, Julien Rouhaud wrote:
> > ah right I should have checked. but the same ABI compatibility concern
> > still exists for version 1.0 of the extension.
>
> Yes, we'd better make sure that the past code is able to run, at
> least.  Now I am not really convinced that we have the need to enforce
> an error with the new code even if 1.0 is still installed,

So keep this "deprecated" C function working, as it would only be a few lines
of code?

> so as it is
> possible to remove all the traces of the code that triggers errors if
> an end LSN is higher than the current insert LSN for primaries or
> replayed LSN for standbys.

+1 for that




Re: Add pg_walinspect function with block info columns

2023-03-06 Thread Michael Paquier
On Tue, Mar 07, 2023 at 11:17:45AM +0900, Kyotaro Horiguchi wrote:
> Thus I'm inclined to agree with Michael's suggestion of creating a new
> normalized set-returning function that returns information of
> "blocks".

Just to be clear here, I am not suggesting to add a new function for
only the block information, just a rename of the existing
pg_get_wal_fpi_info() to something like pg_get_wal_block_info() that
includes both the FPI (if any, or NULL if none) and the block data (if
any, or NULL if none) so that all of them are governed by the same lookup
at pg_wal/.  The fpi information (aka compression type) is displayed
if there is a FPI in the block.
--
Michael


signature.asc
Description: PGP signature


Re: NumericShort vs NumericLong format

2023-03-06 Thread Tom Lane
"David G. Johnston"  writes:
> As an aside, for anyone more fluent than I who reads this, is the use of
> the word "dynamic scale" in this code comment supposed to be "display
> scale"?
> https://github.com/postgres/postgres/blob/cf96907aadca454c4094819c2ecddee07eafe203/src/backend/utils/adt/numeric.c#L121

Yeah, I think you're right.

regards, tom lane




Re: Should vacuum process config file reload more often

2023-03-06 Thread Masahiko Sawada
On Mon, Mar 6, 2023 at 5:26 AM Melanie Plageman
 wrote:
>
> On Thu, Mar 2, 2023 at 6:37 PM Melanie Plageman
>  wrote:
> >
> > On Thu, Mar 2, 2023 at 2:36 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Thu, Mar 2, 2023 at 10:41 AM Melanie Plageman
> > >  wrote:
> > > > On another topic, I've just realized that when autovacuuming we only
> > > > update tab->at_vacuum_cost_delay/limit from
> > > > autovacuum_vacuum_cost_delay/limit for each table (in
> > > > table_recheck_autovac()) and then use that to update
> > > > MyWorkerInfo->wi_cost_delay/limit. MyWorkerInfo->wi_cost_delay/limit is
> > > > what is used to update VacuumCostDelay/Limit in AutoVacuumUpdateDelay().
> > > > So, even if we reload the config file in vacuum_delay_point(), if we
> > > > don't use the new value of autovacuum_vacuum_cost_delay/limit it will
> > > > have no effect for autovacuum.
> > >
> > > Right, but IIUC wi_cost_limit (and VacuumCostDelayLimit) might be
> > > updated. After the autovacuum launcher reloads the config file, it
> > > calls autovac_balance_cost() that updates that value of active
> > > workers. I'm not sure why we don't update workers' wi_cost_delay,
> > > though.
> >
> > Ah yes, I didn't realize this. Thanks. I went back and did more code
> > reading/analysis, and I see no reason why we shouldn't update
> > worker->wi_cost_delay to the new value of autovacuum_vac_cost_delay in
> > autovac_balance_cost(). Then, as you said, the autovac launcher will
> > call autovac_balance_cost() when it reloads the configuration file.
> > Then, the next time the autovac worker calls AutoVacuumUpdateDelay(), it
> > will update VacuumCostDelay.
> >
> > > > I started writing a little helper that could be used to update these
> > > > workerinfo->wi_cost_delay/limit in vacuum_delay_point(),
> > >
> > > Since we set vacuum delay parameters for autovacuum workers so that we
> > > ration out I/O equally, I think we should keep the current mechanism
> > > that the autovacuum launcher sets workers' delay parameters and they
> > > update accordingly.
> >
> > Yes, agreed, it should go in the same place as where we update
> > wi_cost_limit (autovac_balance_cost()). I think we should potentially
> > rename autovac_balance_cost() because its name and all its comments
> > point to its only purpose being to balance the total of the workers
> > wi_cost_limits to no more than autovacuum_vacuum_cost_limit. And the
> > autovacuum_vacuum_cost_delay doesn't need to be balanced in this way.
> >
> > Though, since this change on its own would make autovacuum pick up new
> > values of autovacuum_vacuum_cost_limit (without having the worker reload
> > the config file), I wonder if it makes sense to try and have
> > vacuum_delay_point() only reload the config file if it is an explicit
> > vacuum or an analyze not being run in an outer transaction (to avoid
> > overhead of reloading config file)?
> >
> > The lifecycle of this different vacuum delay-related gucs and how it
> > differs between autovacuum workers and explicit vacuum is quite tangled
> > already, though.
>
> So, I've attached a new version of the patch which is quite different
> from the previous versions.

Thank you for updating the patch!

>
> In this version I've removed wi_cost_delay from WorkerInfoData. There is
> no synchronization of cost_delay amongst workers, so there is no reason
> to keep it in shared memory.
>
> One consequence of not updating VacuumCostDelay from wi_cost_delay is
> that we have to have a way to keep track of whether or not autovacuum
> table options are in use.
>
> This patch does this in a cringeworthy way. I added two global
> variables, one to track whether or not cost delay table options are in
> use and the other to store the value of the table option cost delay. I
> didn't want to use a single variable with a special value to indicate
> that table option cost delay is in use because
> autovacuum_vacuum_cost_delay already has special values that mean
> certain things. My code needs a better solution.

While it's true that wi_cost_delay doesn't need to be shared, it seems to
make the logic somewhat complex. We need to handle cost_delay in a different
way from the other vacuum-related parameters, and we need to make sure
av[_use]_table_option_cost_delay are set properly. Removing wi_cost_delay
from WorkerInfoData saves 8 bytes of shared memory per autovacuum worker, but
it might be worth considering keeping wi_cost_delay for simplicity.
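
For reference, keeping wi_cost_delay means the update path can stay as simple
as the current function (quoted from memory, so treat this as a sketch):

    void
    AutoVacuumUpdateDelay(void)
    {
        /* copy the per-worker values set by the launcher into the globals */
        if (MyWorkerInfo)
        {
            VacuumCostDelay = MyWorkerInfo->wi_cost_delay;
            VacuumCostLimit = MyWorkerInfo->wi_cost_limit;
        }
    }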

---
 void
 AutoVacuumUpdateDelay(void)
 {
-if (MyWorkerInfo)
+/*
+ * We are using autovacuum-related GUCs to update
VacuumCostDelay, so we
+ * only want autovacuum workers and autovacuum launcher to do this.
+ */
+if (!(am_autovacuum_worker || am_autovacuum_launcher))
+return;

Is there any case where the autovacuum launcher calls the
AutoVacuumUpdateDelay() function?

---
In autovac_balance_cost(), we have,

int vac_cost_limit = (autovacuum_vac_cost_limit > 0 ?
   

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-06 Thread Michael Paquier
On Mon, Mar 06, 2023 at 02:29:50PM -0800, Andres Freund wrote:
> Thanks.

Sure, no problem.  If there is anything else needed for this thread,
feel free to ping me here.
--
Michael


signature.asc
Description: PGP signature


Re: Combine pg_walinspect till_end_of_wal functions with others

2023-03-06 Thread Michael Paquier
On Tue, Mar 07, 2023 at 12:42:20PM +0800, Julien Rouhaud wrote:
> Ah right, I should have checked. But the same ABI compatibility concern
> still exists for version 1.0 of the extension.

Yes, we'd better make sure that the past code is able to run, at
least.  Now I am not really convinced that we have the need to enforce
an error with the new code even if 1.0 is still installed, so as it is
possible to remove all the traces of the code that triggers errors if
an end LSN is higher than the current insert LSN for primaries or
replayed LSN for standbys.
--
Michael


signature.asc
Description: PGP signature


Re: Raising the SCRAM iteration count

2023-03-06 Thread Michael Paquier
On Fri, Mar 03, 2023 at 11:13:36PM +0100, Daniel Gustafsson wrote:
> That would indeed be nice, but is there a way to do this without a complicated
> pump TAP expression?  I was unable to think of a way but I might be missing
> something?

A SET command immediately refreshes the connection's cached parameter
information in pqSaveParameterStatus()@libpq, so a test in password.sql with
\password would be enough to check that the computation happens in
pg_fe_scram_build_secret() with the correct iteration number.  Say, like:
=# SET scram_iterations = 234;
SET
=# \password
Enter new password for user "postgres": TYPEME 
Enter it again: TYPEME
=# select substr(rolpassword, 1, 18) from pg_authid
 where oid::regrole::name = current_role;
       substr       
--------------------
 SCRAM-SHA-256$234:
(1 row)

Or perhaps I am missing something?

Thanks,
--
Michael


signature.asc
Description: PGP signature


Re: NumericShort vs NumericLong format

2023-03-06 Thread David G. Johnston
I'll give this a go as a learning exercise for myself...

On Mon, Mar 6, 2023 at 8:47 PM Amin  wrote:

>
> - How can I determine which format will be used for a numeric type?
>

https://github.com/postgres/postgres/blob/cf96907aadca454c4094819c2ecddee07eafe203/src/backend/utils/adt/numeric.c#L491

(the three constants are decimal 63, 63, and -64 respectively)
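
That test is the NUMERIC_CAN_BE_SHORT macro (paraphrased here from memory of
numeric.c, so double-check the source; dscale is the display scale and weight
counts base-10000 digit groups before the decimal point):

    /* short header is possible iff dscale <= 63 and -64 <= weight <= 63 */
    #define NUMERIC_CAN_BE_SHORT(dscale,weight) \
        ((dscale) <= NUMERIC_SHORT_DSCALE_MAX && \
         (weight) <= NUMERIC_SHORT_WEIGHT_MAX && \
         (weight) >= NUMERIC_SHORT_WEIGHT_MIN)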

> - What should the precision and scale values be for pgsql to use the long
> format? Is there a threshold?
>
>
Ones that cause the linked-to test to return false I suppose.

David J.

As an aside, for anyone more fluent than I who reads this, is the use of
the word "dynamic scale" in this code comment supposed to be "display
scale"?

https://github.com/postgres/postgres/blob/cf96907aadca454c4094819c2ecddee07eafe203/src/backend/utils/adt/numeric.c#L121


Re: Combine pg_walinspect till_end_of_wal functions with others

2023-03-06 Thread Julien Rouhaud
On Tue, 7 Mar 2023, 12:36 Michael Paquier,  wrote:

> On Tue, Mar 07, 2023 at 09:13:46AM +0800, Julien Rouhaud wrote:
> > It's problematic to install the extension if we rely on upgrade scripts only.
> > We could also provide a pg_walinspect--1.2.sql file and it would just work, and
> > that may have been a good idea if there wasn't also the problem of people still
> > having the version 1.1 locally installed, as we don't want them to see random
> > failures like "could not find function ... in file ...", or keeping the ability
> > to install the former 1.1 version (with those functions bypassed).
>
> Why would we need a 1.2?  HEAD is the only branch with pg_walinspect
> 1.1, and it has not been released yet.
>

Ah right, I should have checked. But the same ABI compatibility concern
still exists for version 1.0 of the extension.

>


Re: Combine pg_walinspect till_end_of_wal functions with others

2023-03-06 Thread Michael Paquier
On Tue, Mar 07, 2023 at 09:13:46AM +0800, Julien Rouhaud wrote:
> It's problematic to install the extension if we rely on upgrade scripts only.
> We could also provide a pg_walinspect--1.2.sql file and it would just work, and
> that may have been a good idea if there wasn't also the problem of people still
> having the version 1.1 locally installed, as we don't want them to see random
> failures like "could not find function ... in file ...", or keeping the ability
> to install the former 1.1 version (with those functions bypassed).

Why would we need a 1.2?  HEAD is the only branch with pg_walinspect
1.1, and it has not been released yet.
--
Michael


signature.asc
Description: PGP signature


Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-06 Thread Michael Paquier
On Tue, Mar 07, 2023 at 03:44:46PM +1300, Thomas Munro wrote:
> On Tue, Mar 7, 2023 at 3:42 PM Thomas Munro  wrote:
>> Apparently ye olde GCC 4.7 on "lapwing" doesn't like the way you
>> initialised that struct.  I guess it wants {{0}} instead of {0}.
>> Apparently old GCC was wrong about that warning[1], but that system
>> doesn't have the back-patched fixes?  Not sure.

6392f2a was one such case.

> Oh, you already pushed a fix.  But now I'm wondering if it's useful to
> have old buggy compilers set to run with -Werror.

Yes, as far as I can see when investigating the issue, this is an old
bug of gcc when detecting where the initialization needs to be
applied.  And at the same time the fix is deadly simple, so the
current status quo does not sound that bad to me.  Note that lapwing is
one of the only animals testing 32b builds, and it has saved us from
quite a few bugs over the years.
--
Michael


signature.asc
Description: PGP signature


RE: [Proposal] Add foreign-server health checks infrastructure

2023-03-06 Thread Hayato Kuroda (Fujitsu)
Dear Katsuragi-san,

Thank you for reviewing! PSA new version patches.

> I think we can update the status to ready for committer after
> this fix, if there is no objection.

That's very good news for me! How about other people?

> >> 7. the document of postgres_fdw_verify_connection_states_all
> >> NULL
> >> +  is returned if the local session does not have connection caches, or this
> >> +  function is not available on this platform.
> >>
> >> I think there is a case where a connection cache exists but valid
> >> connections do not exist and NULL is returned (disconnection case).
> >> Almost the same document as the postgres_fdw_verify_connection_states
> >> case (comment no.5) seems better.
> >
> > Yes, but the exact same statement cannot be used because there is no
> > specified foreign server. How about:
> > NULL is returned if there are no established connections or this
> > function ...
> 
> Yes, to align with the postgres_fdw_verify_connection_states()
> case, how about writing that the connection is not valid, like the
> following?
> 'NULL is returned if no valid connections are established or
> this function...'

Prefer yours, fixed.

> This is my comment for v35. Please check.
> 0002:
> 1. the comment of verify_cached_connections (I missed one minor point.)
> + * This function emits warnings if a disconnection is found. This
> returns true
> + * if disconnections cannot be found, otherwise returns false.
> 
> I think false is returned only if disconnections are found and
> true is returned in all other cases. So, modifying the description
> like the following seems better.
> 'This returns false if disconnections are found, otherwise
> returns true.'

Fixed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v36-0001-Add-PQconnCheck-and-PQconnCheckable-to-libpq.patch
Description:  v36-0001-Add-PQconnCheck-and-PQconnCheckable-to-libpq.patch


v36-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch
Description:  v36-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch


v36-0003-add-test.patch
Description: v36-0003-add-test.patch


Re: Making empty Bitmapsets always be NULL

2023-03-06 Thread David Rowley
On Sat, 4 Mar 2023 at 11:08, Tom Lane  wrote:
>
> David Rowley  writes:
> > On Fri, 3 Mar 2023 at 15:17, Tom Lane  wrote:
> > It's true that the majority of Bitmapsets are going to be just 1 word,
> > but it's important to acknowledge that we do suffer in some more
> > extreme cases when Bitmapsets become large. Partition with large
> > numbers of partitions is one such case.
>
> Maybe, but optimizing for that while pessimizing every other case
> doesn't sound very attractive from here.  I think we need some
> benchmarks on normal-size bitmapsets before considering doing much
> in this area.

After thinking about this again and looking at the code, I'm not
really sure where the pessimism has been added.  For the changes made
to, say, bms_equal(), there was already a branch that checked the nwords
column so that we could find the shorter and longer sets out of the
two input sets. That's been replaced with a comparison of both input
set's nwords, which does not really seem any more expensive. For
bms_compare() we needed to do Min(a->nwords, b->nwords) to find the
shortest set, likewise for bms_nonempty_difference() and
bms_is_subset().  That does not seem less expensive than the
replacement code.

I think the times where we have sets that we do manage to trim down
the nword count with that we actually end up having to expand again
are likely fairly rare.
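
For example, with the invariant that a non-NULL set's last word is never
zero, bms_equal() can conceptually become (a sketch of the idea, not the
committed code):

    bool
    bms_equal(const Bitmapset *a, const Bitmapset *b)
    {
        if (a == NULL || b == NULL)
            return a == b;          /* empty sets are always NULL now */
        if (a->nwords != b->nwords)
            return false;           /* trailing zero words are trimmed */
        return memcmp(a->words, b->words,
                      a->nwords * sizeof(bitmapword)) == 0;
    }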

I also wondered if we could shave off a few instructions by utilising
the knowledge that nwords is never 0.  That would mean that some of
the loops could be written as:

i = 0; do { ; } while (++i < set->nwords);

instead of:

for (i = 0; i < set->nwords; i++) { ; }

if we do assume that the vast majority of sets are nwords==1 sets,
then this reduces the loop condition checks by half for all those.

I see that gcc manages to optimize: for (i = 0; i < set->nwords || i
== 0; i++) { ; } into the same code as the do while loop. clang
does not seem to manage that.
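
Applied to a concrete function, the rewrite would look something like this
(illustrative only; bmw_popcount() is the existing per-word popcount wrapper
in bitmapset.c):

    int
    bms_num_members(const Bitmapset *a)
    {
        int         result = 0;
        int         wordnum = 0;

        if (a == NULL)
            return 0;

        /* nwords >= 1 always holds, so test the condition at the bottom */
        do
        {
            result += bmw_popcount(a->words[wordnum]);
        } while (++wordnum < a->nwords);

        return result;
    }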

> Also, if we're going to make any sort of changes here it'd probably
> behoove us to make struct Bitmapset private in bitmapset.c, so that
> we can have confidence that nobody is playing games with them.
> I had a go at that and was pleasantly surprised to find that
> actually nobody has; the attached passes check-world.  It'd probably
> be smart to commit this as a follow-on to 00b41463c, whether or not
> we go any further.

That seems like a good idea. This will give us extra reassurance that
nobody is making up their own Bitmapsets through some custom function
that don't follow the rules.

> Also, given that we do this, I don't think that check_bitmapset_invariants
> as you propose it is worth the trouble.

I wondered if maybe just Assert(a == NULL || a->words[a->nwords - 1]
!= 0); would be worthwhile anywhere. However, I don't see any
particular function that is more likely to catch those errors, so it's
maybe not worth doing anywhere if we're not doing it everywhere.

I adjusted the patch to remove the invariant checks and fixed up a
couple of things I'd missed.  The 0002 patch changes the for loops
into do while loops. I wanted to see if we could see any performance
gains from doing this.

The performance numbers are nowhere near as stable as I'd like them to
have been, but testing the patch shows:

Test 1:

setup:
create table t1 (a int) partition by list(a);
select 'create table t1_'||x||' partition of t1 for values
in('||x||');' from generate_series(0,9)x;
\gexec

Test 1's sql:
select * from t1 where a > 1 and a < 3;

for i in {1..3}; do pgbench -n -f test1.sql -T 15 postgres | grep tps; done

master (cf96907aad):
tps = 29534.189309
tps = 30465.722545
tps = 30328.290553

master + 0001:
tps = 28915.174536
tps = 29817.950994
tps = 29387.084581

master + 0001 + 0002:
tps = 29438.216512
tps = 29951.905408
tps = 31445.191414

Test 2:

setup:
create table t2 (a int) partition by list(a);
select 'create table t2_'||x||' partition of t2 for values
in('||x||');' from generate_series(0,)x;
\gexec

Test 2's sql:
select * from t2 where a > 1 and a < 3;

for i in {1..3}; do pgbench -n -f test2.sql -T 15 postgres | grep tps; done

master (cf96907aad):
tps = 28470.504990
tps = 29175.450905
tps = 28123.699176

master + 0001:
tps = 28056.256805
tps = 28380.401746
tps = 28384.395217

master + 0001 + 0002:
tps = 29365.992219
tps = 28418.374923
tps = 28303.924129

Test 3:

setup:
create table t3a (a int primary key);
create table t3b (a int primary key);

Test 3's sql:
select * from t3a inner join t3b on t3a.a = t3b.a;

for i in {1..3}; do pgbench -n -f test3.sql -T 15 postgres | grep tps; done

master (cf96907aad):
tps = 20458.710550
tps = 20527.898929
tps = 20284.165277

master + 0001:
tps = 20700.340713
tps = 20571.913956
tps = 20541.771589

master + 0001 + 0002:
tps = 20046.674601
tps = 20016.649536
tps = 19487.999853

I've attached a graph of this too. It shows that there might be a
small increase in performance with tests 1 and 2. It seems like test 3
regresses a bit. I suspect this might just be a 

NumericShort vs NumericLong format

2023-03-06 Thread Amin
Hi,

- How can I determine which format will be used for a numeric type?
- What should the precision and scale values be for pgsql to use the long
format? Is there a threshold?


RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-06 Thread shiy.f...@fujitsu.com
On Monday, Mar 6, 2023 7:19 PM Önder Kalacı  wrote:
> 
> Yeah, seems easier to follow to me as well. Reflected it in the comment as 
> well.  
> 

Thanks for updating the patch. Here are some comments on v33-0001 patch.

1.
+   if (RelationReplicaIdentityFullIndexScanEnabled(localrel) &&
+   remoterel->replident == REPLICA_IDENTITY_FULL)

RelationReplicaIdentityFullIndexScanEnabled() is introduced in the 0002 patch,
so the call to it should be moved to the 0002 patch, I think.

2.
+#include "optimizer/cost.h"

Do we need this in the latest patch? I tried, and it looks like it could be
removed from src/backend/replication/logical/relation.c.

3.
+# now, create a unique index and set the replica
+$node_publisher->safe_psql('postgres',
+   "CREATE UNIQUE INDEX test_replica_id_full_unique ON 
test_replica_id_full(x);");
+$node_subscriber->safe_psql('postgres',
+   "CREATE UNIQUE INDEX test_replica_id_full_unique ON 
test_replica_id_full(x);");
+

Should the comment be "now, create a unique index and set the replica identity"?

4.
+$node_publisher->safe_psql('postgres',
+   "ALTER TABLE test_replica_id_full REPLICA IDENTITY USING INDEX 
test_replica_id_full_unique;");
+$node_subscriber->safe_psql('postgres',
+   "ALTER TABLE test_replica_id_full REPLICA IDENTITY USING INDEX 
test_replica_id_full_unique;");
+
+# wait for the synchronization to finish
+$node_subscriber->wait_for_subscription_sync;

There are no new tables that need to be synchronized here, so should we remove
the call to wait_for_subscription_sync?

Regards,
Shi Yu


Re: add PROCESS_MAIN to VACUUM

2023-03-06 Thread Michael Paquier
On Mon, Mar 06, 2023 at 04:59:49PM -0800, Nathan Bossart wrote:
> That did cross my mind, but I was worried that trying to explain all that
> here could cause confusion.
> 
>   If PROCESS_MAIN is set (the default), it's time to vacuum the main
>   relation.  Otherwise, we can skip this part.  If processing the TOAST
>   table is required (e.g., PROCESS_TOAST is set), we'll force
>   PROCESS_MAIN to be set when we recurse to the TOAST table so that it
>   gets processed here.
> 
> How does that sound?

Sounds clear to me, thanks!  Melanie, what do you think?
--
Michael


signature.asc
Description: PGP signature


Re: Record queryid when auto_explain.log_verbose is on

2023-03-06 Thread Imseih (AWS), Sami
> > It's a bit annoying that the info is missing since pg 14, but we
> > probably can't backpatch this as it might break log parser tools.


> What do you think?

That's a good point about log parsing tools, e.g. pgbadger.

Backpatching does not sound too appealing to me after
giving this a second thought.

However, I do feel it needs to be called out in the docs
that "Query Identifier" is not available in auto_explain
until version 16.

Regards,

--

Sami Imseih
Amazon Web Services (AWS)





Re: Allow logical replication to copy tables in binary format

2023-03-06 Thread Amit Kapila
On Wed, Mar 1, 2023 at 7:58 PM Melih Mutlu  wrote:
>
> Bharath Rupireddy , 1 Mar 2023 Çar, 
> 15:02 tarihinde şunu yazdı:
>>
>
> That was my intention in the beginning with this patch. Then the new option 
> also made some sense at some point, and I added copy_binary option according 
> to reviews.
> The earlier versions of the patch didn't have that. Without the new option, 
> this patch would also be smaller.
>
> But before changing back to the point where these are all tied to binary 
> option without a new option, I think we should decide if that's really the 
> ideal way to do it.
>

As per what I could read in this thread, most people prefer to use the
existing binary option rather than inventing a new option to enable
binary copy in the initial sync phase. Do you agree? If so, it is
better to update the patch accordingly as this is the last CF for this
release and we have a limited time left.

-- 
With Regards,
Amit Kapila.




Re: Moving forward with TDE

2023-03-06 Thread Chris Travers
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

I have decided to write a review here in terms of whether we want this feature, 
and perhaps the way we should look at encryption as a project down the road, 
since I think this is only the beginning.  I am hoping to run some full tests
of the feature sometime in the coming weeks.  Right now this review is limited
to the documentation and the documented feature.

From the documentation, the primary threat model of TDE is to prevent 
decryption of data from archived wal segments (and data files), for example on 
a backup system.  While there are other methods around this problem to date, I 
think that this feature is worth pursuing for that reason.  I want to address a 
couple of reasons for this and then go into some reservations I have about how 
some of this is documented.

There are current workarounds for ensuring encryption at rest, but these have a 
number of problems.  Encryption passphrases end up lying around the system in 
various places.  Key rotation is often difficult.  And one mistake can easily 
render all efforts ineffective.  TDE solves these problems.  The overall design 
from the internal docs looks solid.  This definitely is something I would 
recommend for many users.

I have a couple small caveats though.  Encryption of data is a large topic and 
there isn't a one-size-fits-all solution to industrial or state requirements.  
Having all this key management available in PostgreSQL is a very good thing.  
Long run it is likely to end up being extensible, and therefore both more 
powerful and offering a wider range of choices for solution architects.  
Implementing encryption is also something that is easy to mess up.  For this 
reason I think it would be great if we had a standardized format for discussing 
encryption options that we could use going forward.  I don't think that should 
be held against this patch but I think we need to start discussing it now 
because it will be a bigger problem later.

A second caveat I have is that key management is a topic where you really need 
a good overview of internals in order to implement effectively.  If you don't 
know how an SSL handshake works or what is in a certificate, you can easily 
make mistakes in setting up SSL.  I can see the same thing happening here.  For 
example, I don't think it would be safe to leave the KEK on an encrypted 
filesystem that is decrypted at runtime (or at least I wouldn't consider that 
safe -- your appetite for risk may vary).

My proposal would be to build a template for encryption options in the
documentation.  This could include topics like SSL as well.  In such a template 
we'd have sections like "Threat model,"  "How it works," "Implementation 
Requirements" and so forth.  Again I don't think this needs to be part of the 
current patch but I think it is something we need to start thinking about now.  
Maybe after this goes in, I can present a proposed documentation patch.

I will also note that I don't consider myself to be very qualified on topics 
like encryption.  I can reason about key management to some extent but some 
implementation details may be beyond me.  I would hope we could get some extra 
review on this patch set soon.

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-03-06 Thread Kyotaro Horiguchi
At Mon, 6 Mar 2023 15:21:14 -0500, Melanie Plageman  
wrote in 
> On Mon, Mar 6, 2023 at 2:34 PM Andres Freund  wrote:
> > That, but also because it's simply more reliable. autovacuum=off doesn't
> > protect against a anti-wraparound vacuum or such. Or a concurrent test 
> > somehow
> > triggering a read. Or ...
>
> Good point. Attached is what you suggested. I committed the transaction
> before the drop table so that the statistics would be visible when we
> queried pg_stat_io.

While I don't believe an anti-wraparound vacuum can occur during testing,
Melanie's solution (moving the commit up by a few lines) seems to work
(per manual testing).
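
Schematically, the ordering fix looks like this (the table and queries are
illustrative, not the actual regression test; pg_stat_force_next_flush() is
the stats-flush helper added in v15):

BEGIN;
CREATE TABLE test_io_stats (a int);
INSERT INTO test_io_stats SELECT generate_series(1, 100);
COMMIT;                                  -- commit first, so the I/O is final
SELECT pg_stat_force_next_flush();       -- make the backend report its stats
SELECT sum(reads) FROM pg_stat_io WHERE backend_type = 'client backend';
DROP TABLE test_io_stats;                -- drop only after querying pg_stat_io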

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-06 Thread Amit Kapila
On Tue, Mar 7, 2023 at 1:34 AM Andres Freund  wrote:
>
> On 2023-03-01 14:10:07 +0530, Amit Kapila wrote:
> > On Wed, Mar 1, 2023 at 12:09 AM Andres Freund  wrote:
> > >
> > > > I see this as a way to provide this feature for users but I would
> > > > prefer to proceed with this if we can get some more buy-in from senior
> > > > community members (at least one more committer) and some user(s) if
> > > > possible. So, I once again request others to chime in and share their
> > > > opinion.
> > >
> > > I'd prefer not having an option, because we figure out the cause of the
> > > performance regression (reducing it to be small enough to not care). After
> > > that an option defaulting to using indexes.
> > >
> >
> > Sure, if we can reduce regression to be small enough then we don't
> > need to keep the default as false, otherwise, also, we can consider it
> > to keep an option defaulting to using indexes depending on the
> > investigation for regression. Anyway, the main concern was whether it
> > is okay to have an option for this which I think we have an agreement
> > on, now I will continue my review.
>
> I think even as-is it's reasonable to just use it. The sequential scan
> approach is O(N^2), which, uh, is not good. And having an index over thousands
> of non-differing values will generally perform badly, not just in this
> context.
>

Yes, it is true that in general an index scan with a lot of
duplicates may not perform well, but during planning we do costing to
catch such cases and may prefer another index or a sequential scan. Then we
have the "enable_indexscan" GUC that the user can use if required. So, I
feel it is better to have a knob to disallow usage of such indexes, with
the default being to use an index, if available.

-- 
With Regards,
Amit Kapila.




Re: proposal: possibility to read dumped table's name from file

2023-03-06 Thread Julien Rouhaud
Hi,

On Mon, Mar 06, 2023 at 10:20:32PM +0100, Daniel Gustafsson wrote:
> > On 6 Mar 2023, at 21:45, Gregory Stark (as CFM)  wrote:
> >
> > So This patch has been through a lot of commitfests. And it really
> > doesn't seem that hard to resolve -- Pavel has seemingly been willing
> > to go along whichever way the wind has been blowing but honestly it
> > kind of seems like he's just gotten drive-by suggestions and he's put
> > a lot of work into trying to satisfy them.
>
> Agreed.

Indeed, I'm not sure I would have had that much patience.

> > He implemented --include-tables-from-file=... etc. Then he implemented
> > a hand-written parser for a DSL to select objects, then he implemented
> > a bison parser, then he went back to the hand-written parser.
>
> Well, kind of.  I was trying to take the patch to the finishing line but was
> uncomfortable with the hand written parser so I implemented a parser in Bison
> to replace it with.  Not that hand-written parsers are bad per se (or that my
> bison parser was perfect), but reading quoted identifiers across line
> boundaries tend to require a fair amount of handwritten code.  Pavel did not
> object to this version, but it was objected to by two other committers.
>
> At this point [0] I stepped down from trying to finish it as the approach I
> was comfortable with didn't gain traction (which is totally fine).
>
> Downthread from this the patch got a lot of reviews from Julien with the old
> parser back in place.

Yeah, and the current state seems quite good to me.

> > Can we get some consensus on whether the DSL looks right
>
> I would consider this pretty settled.

Agreed.




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-06 Thread Thomas Munro
On Tue, Mar 7, 2023 at 3:42 PM Thomas Munro  wrote:
> Apparently ye olde GCC 4.7 on "lapwing" doesn't like the way you
> initialised that struct.  I guess it wants {{0}} instead of {0}.
> Apparently old GCC was wrong about that warning[1], but that system
> doesn't have the back-patched fixes?  Not sure.

Oh, you already pushed a fix.  But now I'm wondering if it's useful to
have old buggy compilers set to run with -Werror.




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-06 Thread Thomas Munro
On Mon, Mar 6, 2023 at 5:30 PM Michael Paquier  wrote:
> On Mon, Feb 20, 2023 at 01:54:00PM +0530, Bharath Rupireddy wrote:
> > I ran some tests on my dev system [1] and I don't see much difference
> > between v3 and v4. So, +1 for v3 patch (+ argument order swap) from
> > Andres to keep the code simple and elegant.
>
> This thread has stalled for a couple of weeks, so I have gone back to
> it.  Testing on a tmpfs I am not seeing a difference in performance
> for any of the approaches discussed.  At the end, as I am the one
> behind the introduction of pg_pwrite_zeros(), I have applied v3 after
> switching the size and offset parameters to be the same way as in v4.

Apparently ye olde GCC 4.7 on "lapwing" doesn't like the way you
initialised that struct.  I guess it wants {{0}} instead of {0}.
Apparently old GCC was wrong about that warning[1], but that system
doesn't have the back-patched fixes?  Not sure.

[1] 
https://stackoverflow.com/questions/63355760/how-standard-is-the-0-initializer-in-c89
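
For anyone curious, a minimal reproduction of the initializer in question
(not the actual patched code) looks like this:

#include <sys/uio.h>

/*
 * Both forms are valid C, but GCC 4.7 with -Wmissing-braces (enabled by
 * -Wall) warns about the first, because the array elements are themselves
 * structs:
 */
struct iovec iov_warns[2] = {0};	/* old GCC: "missing braces around initializer" */
struct iovec iov_quiet[2] = {{0}};	/* quiet everywhere */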




Re: Record queryid when auto_explain.log_verbose is on

2023-03-06 Thread torikoshia

On 2023-03-07 08:50, Imseih (AWS), Sami wrote:

> I am wondering if this patch should be backpatched?
>
> The reason is that in the auto_explain documentation [1],
> there is a claim of equivalence of the auto_explain.log_verbose
> option and EXPLAIN(verbose):
>
> "... it's equivalent to the VERBOSE option of EXPLAIN."
>
> This can be quite confusing for users of the extension.
> The documentation should either be updated or a backpatch
> done all the way down to 14, which is the version in which the query
> identifier was moved to core. I am in favor of the latter.
>
> Any thoughts?

We discussed a bit whether to backpatch this, but agreed that it would
be better not to do so for the following reason:

  It's a bit annoying that the info is missing since pg 14, but we
  probably can't backpatch this as it might break log parser tools.

What do you think?

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION




Re: Add pg_walinspect function with block info columns

2023-03-06 Thread Kyotaro Horiguchi
At Tue, 7 Mar 2023 09:34:24 +0900, Michael Paquier  wrote 
in 
> On Mon, Mar 06, 2023 at 04:08:28PM +0100, Matthias van de Meent wrote:
> > On Mon, 6 Mar 2023 at 15:40, Bharath Rupireddy
> >> IMO, pg_get_wal_records_extended_info as proposed doesn't look good to
> >> me as it outputs most of the columns that are already given by
> >> pg_get_wal_records_info.  What I think is the best way at this point is
> >> to make it return the following:
> >> lsn pg_lsn
> >> block_id int8
> >> spcOid oid
> >> dbOid oid
> >> relNumber oid
> >> forkNames text
> >> fpi bytea
> >> fpi_info text
> 
> I would add the length of the block data (without the hole and
> compressed, as the FPI data should always be presented as
> uncompressed), and the block data if any (without the block data
> length as one can guess it based on the bytea data length).  Note that 
> a block can have both a FPI and some data assigned to it, as far as I
> recall.

+1

> > The basic idea is to create a single entrypoint to all relevant data
> > from DecodedXLogRecord in SQL, not multiple.
> 
> While I would agree with this principle on simplicity grounds in
> terms of minimizing the SQL interface and the pg_wal/ lookups, I
> disagree with it on usability grounds, because we can avoid extra SQL
> tweaks with more functions.  One recent example I have in mind is
> partitionfuncs.c, which can actually be achieved with a WITH RECURSIVE
> on the catalogs. There are of course various degrees of complexity,
> and perhaps unnest() cannot qualify as one, but having two functions
> returning normalized records (one for the record information, and a
> second for the block information), is a rather good balance between
> usability and interface complexity, in my experience.  If you have two
> functions, a JOIN is enough to cross-check the block data and the
> record data, while an unnest() heavily bloats the main function output
> (aka byteas of FPIs in a single array).

FWIW, my initial thought about the proposal was similar to Matthias's,
and I tried a function that would convert (for simplicity) the block_ref
string to a json object. Although this approach did work, I was not
satisfied with its limited usability and poor performance (though the
poor performance is mainly due to the text->json conversion).

Finally, I realized that the initial discomfort I experienced stemmed
from the name of the function, which suggests that it returns
information of "records". This discomfort would disappear if the
function were instead named pg_get_wal_blockref_info() or something
similar.

Thus I'm inclined to agree with Michael's suggestion of creating a new
normalized set-returning function that returns information of
"blocks".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2023-03-06 Thread Alexander Korotkov
On Tue, Mar 7, 2023 at 4:50 AM Andres Freund  wrote:
> On 2023-03-02 14:28:56 +0400, Pavel Borisov wrote:
> > 2. Heap updates with low tuple concurrency:
> > Prepare with pkeys (pgbench -d postgres -i -I dtGvp -s 300 
> > --unlogged-tables)
> > Update 3*10^7 rows, 50 conns (pgbench postgres -f
> > ./update-only-account.sql -s 300 -P10 -M prepared -T 600 -j 5 -c 50)
> >
> > Result: Both patches and master are the same within a tolerance of
> > less than 0.7%.
>
> What exactly does that mean? I would definitely not want to accept a 0.7%
> regression of the uncontended case to benefit the contended case here...

I don't know exactly what Pavel meant, but the average overall numbers for
low concurrency are:
master: 420401 (stddev of average 233)
patchset v11: 420111 (stddev of average 199)
The difference is less than 0.1%, which is safely within the margin of error.

--
Regards,
Alexander Korotkov




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-03-06 Thread torikoshia

On 2023-03-06 23:03, Daniel Gustafsson wrote:

On 28 Feb 2023, at 15:28, Damir Belyalov  wrote:


Tested patch on all cases: CIM_SINGLE, CIM_MULTI, CIM_MULTI_CONDITION. 
As expected it works.

Also added a description to copy.sgml and made a review on patch.

Thanks for your tests and improvements!

I added an 'ignored_errors' integer parameter that is output after
the operation finishes.
All errors were added to the system logfile with full detailed
context. Maybe it's better to log only the error message.

Certainly.

FWIW, Greenplum has a similar construct (but which also logs the errors in the
db) where data type errors are skipped as long as the number of errors doesn't
exceed a reject limit.  If the reject limit is reached then the COPY fails:


LOG ERRORS [ SEGMENT REJECT LIMIT count [ ROWS | PERCENT ]]

IIRC the gist of this was to catch when the user copies the wrong input data or
plainly has a broken file.  Rather than finding out after copying n rows which
are likely to be garbage, the process can be restarted.

This version of the patch has a compiler error in the error message:

copyfrom.c: In function ‘CopyFrom’:
copyfrom.c:1008:29: error: format ‘%ld’ expects argument of type ‘long int’,
but argument 2 has type ‘uint64’ {aka ‘long long unsigned int’} [-Werror=format=]
 1008 | ereport(WARNING, errmsg("Errors: %ld", cstate->ignored_errors));
      |                                  ^~~   ~~~~~~~~~~~~~~~~~~~~~~~
      |                                        |
      |                                        uint64 {aka long long unsigned int}


On that note though, it seems to me that this error message leaves a bit to be
desired with regards to the level of detail.

+1.
I felt just logging "Errors: %ld" would make people wonder what the %ld
means. Logging something like "%ld data type errors were found" might be
clearer.


--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION




Re: [Proposal] Add foreign-server health checks infrastructure

2023-03-06 Thread Katsuragi Yuta

Hi Kuroda-san,

Thank you for updating the patch!

I think we can update the status to ready for committer after
this fix, if there is no objection.


7. the document of postgres_fdw_verify_connection_states_all
NULL
+  is returned if the local session does not have connection caches, or this
+  function is not available on this platform.

I think there is a case where a connection cache exists but valid
connections do not exist and NULL is returned (disconnection case).
Almost the same document as the postgres_fdw_verify_connection_states
case (comment no.5) seems better.


Yes, but exactly the same statement cannot be used because no foreign
server is specified here. How about:
NULL is returned if there are no established connections or this
function ...


Yes, to align with the postgres_fdw_verify_connection_states()
case, how about saying that the connection is not valid, like the
following?
'NULL is returned if no valid connections are established or
this function...'


This is my comment for v35. Please check.
0002:
1. the comment of verify_cached_connections (I missed one minor point.)
+ * This function emits warnings if a disconnection is found. This returns true
+ * if disconnections cannot be found, otherwise returns false.

I think false is returned only if disconnections are found and
true is returned in all other cases. So, modifying the description
like the following seems better.
'This returns false if disconnections are found, otherwise
returns true.'


regards,

--
Katsuragi Yuta
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2023-03-06 Thread Andres Freund
Hi,

On 2023-03-02 14:28:56 +0400, Pavel Borisov wrote:
> 2. Heap updates with low tuple concurrency:
> Prepare with pkeys (pgbench -d postgres -i -I dtGvp -s 300 --unlogged-tables)
> Update 3*10^7 rows, 50 conns (pgbench postgres -f
> ./update-only-account.sql -s 300 -P10 -M prepared -T 600 -j 5 -c 50)
> 
> Result: Both patches and master are the same within a tolerance of
> less than 0.7%.

What exactly does that mean? I would definitely not want to accept a 0.7%
regression of the uncontended case to benefit the contended case here...

Greetings,

Andres Freund




Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2023-03-06 Thread Alexander Korotkov
On Thu, Mar 2, 2023 at 11:16 PM Alexander Korotkov  wrote:
> On Thu, Mar 2, 2023 at 9:17 PM Pavel Borisov  wrote:
> > On Thu, 2 Mar 2023 at 18:53, Alexander Korotkov  
> > wrote:
> > > On Thu, Mar 2, 2023 at 1:29 PM Pavel Borisov  
> > > wrote:
> > > > > Let's see the performance results for the patchset.  I'll properly
> > > > > revise the comments if results will be good.
> > > > >
> > > > > Pavel, could you please re-run your tests over revised patchset?
> > > >
> > > > Since last time I've improved the test to avoid significant series
> > > > differences due to AWS storage access variation that is seen in [1].
> > > > I.e. each series of tests is run on a tmpfs with newly inited pgbench
> > > > tables and vacuum. Also, I've added a test for low-concurrency updates
> > > > where the locking optimization isn't expected to improve performance,
> > > > just to make sure the patches don't make things worse.
> > > >
> > > > The tests are as follows:
> > > > 1. Heap updates with high tuple concurrency:
> > > > Prepare without pkeys (pgbench -d postgres -i -I dtGv -s 10 
> > > > --unlogged-tables)
> > > > Update tellers 100 rows, 50 conns ( pgbench postgres -f
> > > > ./update-only-tellers.sql -s 10 -P10 -M prepared -T 600 -j 5 -c 50 )
> > > >
> > > > Result: Average of 5 series with patches (0001+0002) is around 5%
> > > > faster than both master and patch 0001. Still, there are some
> > > > fluctuations between different series of the measurements of the same
> > > > patch, but much less than in [1]
> > >
> > > Thank you for running this that fast!
> > >
> > > So, it appears that 0001 patch has no effect.  So, we probably should
> > > consider to drop 0001 patch and consider just 0002 patch.
> > >
> > > The attached patch v12 contains v11 0002 patch extracted separately.
> > > Please, add it to the performance comparison.  Thanks.
> >
> > I've done a benchmarking on a full series of four variants: master vs
> > v11-0001 vs v11-0001+0002 vs v12 in the same configuration as in the
> > previous measurement. The results are as follows:
> >
> > 1. Heap updates with high tuple concurrency:
> > Average of 5 series v11-0001+0002 is around 7% faster than the master.
> > I need to note that while v11-0001+0002 shows consistent performance
> > improvement over the master, its value can not be determined more
> > precisely than a couple of percents even with averaging. So I'd
> > suppose we may not conclude from the results if a more subtle
> > difference between v11-0001+0002 vs v12 (and master vs v11-0001)
> > really exists.
> >
> > 2. Heap updates with high tuple concurrency:
> > All patches and master are still the same within a tolerance of
> > less than 0.7%.
> >
> > Overall, patch v11-0001+0002 doesn't show performance degradation, so I
> > don't see why we would apply only patch 0002 and skip 0001.
>
> Thank you, Pavel.  So, it seems that we have substantial benefit only
> with two patches.  So, I'll continue working on both of them.

The revised patchset is attached.  The patch removing extra
table_tuple_fetch_row_version() is back.  The second patch now
implements a concept of LazyTupleTableSlot, a slot which gets
allocated only when needed.  Also, there is more minor refactoring and
more comments.
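
To make the concept concrete, here is a rough sketch of a lazily-allocated
slot (struct layout and function names are mine, not necessarily the
patch's):

#include "postgres.h"
#include "executor/tuptable.h"
#include "nodes/execnodes.h"

typedef struct LazyTupleTableSlot
{
	TupleTableSlot *slot;		/* NULL until actually requested */
	/* everything needed to create the slot on first use */
	EState	   *estate;
	TupleDesc	tupdesc;
	const TupleTableSlotOps *ops;
} LazyTupleTableSlot;

/* Return the slot, allocating it only on first use. */
static TupleTableSlot *
lazy_slot_get(LazyTupleTableSlot *lazy)
{
	if (lazy->slot == NULL)
		lazy->slot = ExecAllocTableSlot(&lazy->estate->es_tupleTable,
										lazy->tupdesc, lazy->ops);
	return lazy->slot;
}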

--
Regards,
Alexander Korotkov


0001-Evade-extra-table_tuple_fetch_row_version-in-Exe-v13.patch
Description: Binary data


0002-Allow-locking-updated-tuples-in-tuple_update-and-v13.patch
Description: Binary data


Re: Inaccurate comment for pg_get_partkeydef

2023-03-06 Thread David Rowley
On Mon, 6 Mar 2023 at 18:54, Japin Li  wrote:
> PSA patch to fix an inaccurate comment.

Thanks. Pushed.

David




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-03-06 Thread Masahiko Sawada
On Tue, Mar 7, 2023 at 1:01 AM John Naylor  wrote:
>
> On Mon, Mar 6, 2023 at 1:28 PM Masahiko Sawada  wrote:
>
> > > Since the block-level measurement is likely overestimating quite a bit, I 
> > > propose to simply reverse the order of the actions here, effectively 
> > > reporting progress for the *last page* and not the current one: First 
> > > update progress with the current memory usage, then add tids for this 
> > > page. If this allocated a new block, only a small bit of that will be 
> > > written to. If this block pushes it over the limit, we will detect that 
> > > up at the top of the loop. It's kind of like our earlier attempts at a 
> > > "fudge factor", but simpler and less brittle. And, as far as OS pages we 
> > > have actually written to, I think it'll effectively respect the memory 
> > > limit, at least in the local mem case. And the numbers will make sense.
> > >
> > > Thoughts?
> >
> > It looks to work but it still doesn't work in a case where a shared
> > tidstore is created with a 64kB memory limit, right?
> > TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
> > from the beginning.
>
> I have two ideas:
>
> 1. Make it optional to track chunk memory space by a template parameter. It 
> might be tiny compared to everything else that vacuum does. That would allow 
> other users to avoid that overhead.
> 2. When context block usage exceeds the limit (rare), make the additional 
> effort to get the precise usage -- I'm not sure such a top-down facility 
> exists, and I'm not feeling well enough today to study this further.

I prefer option (1) as it's straightforward. I mentioned a similar
idea before[1]. RT_MEMORY_USAGE() is defined only when the macro is
defined. It might be worth checking if there is visible overhead of
tracking chunk memory space. IIRC we've not evaluated it yet.

[1] 
https://www.postgresql.org/message-id/CAD21AoDK3gbX-jVxT6Pfso1Na0Krzr8Q15498Aj6tmXgzMFksA%40mail.gmail.com
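
A self-contained illustration of option (1) -- the accounting is compiled in
only when the template user asks for it (macro and struct names here are
hypothetical, not the actual radix tree template):

#include <stddef.h>

typedef struct rt_tree
{
	size_t		nentries;
#ifdef RT_TRACK_MEMORY
	size_t		mem_used;		/* chunk-memory accounting */
#endif
} rt_tree;

static inline void
rt_account_alloc(rt_tree *tree, size_t chunk_size)
{
#ifdef RT_TRACK_MEMORY
	tree->mem_used += chunk_size;
#else
	(void) tree;
	(void) chunk_size;			/* zero overhead when not requested */
#endif
}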

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Combine pg_walinspect till_end_of_wal functions with others

2023-03-06 Thread Julien Rouhaud
On Mon, Mar 06, 2023 at 08:36:17PM +0530, Bharath Rupireddy wrote:
> On Mon, Mar 6, 2023 at 2:22 PM Julien Rouhaud  wrote:
>
> > Also:
> >
> > /*
> > - * Get info and data of all WAL records from start LSN till end of WAL.
> > + * NB: This function does nothing and stays here for backward 
> > compatibility.
> > + * Without it, the extension fails to install.
> >   *
> > - * This function emits an error if a future start i.e. WAL LSN the database
> > - * system doesn't know about is specified.
> > + * Try using pg_get_wal_records_info() for the same till_end_of_wal
> > + * functionaility.
> >
> > I don't like much this chunk (same for the other kept function).  Apart from
> > the obvious typo in "functionaility", I don't think that the comment is 
> > really
> > accurate.
>
> Can you be more specific what's inaccurate about the comment?

It's problematic to install the extension if we rely on upgrade scripts only.
We could also provide a pg_walinspect--1.2.sql file and it would just work, and
that may have been a good idea if there wasn't also the problem of people still
having the version 1.1 locally installed, as we don't want them to see random
failures like "could not find function ... in file ...", or keeping the ability
to install the former 1.1 version (with those functions bypassed).
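
Concretely, the installation constraint looks like this (a sketch; the file
and version names follow the discussion above):

-- pg_walinspect.control:  default_version = '1.1'
-- files shipped: pg_walinspect--1.0.sql, pg_walinspect--1.0--1.1.sql
CREATE EXTENSION pg_walinspect;
-- replays pg_walinspect--1.0.sql and then pg_walinspect--1.0--1.1.sql,
-- so every function created by 1.0.sql must still resolve in the library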




Re: add PROCESS_MAIN to VACUUM

2023-03-06 Thread Nathan Bossart
On Tue, Mar 07, 2023 at 09:20:12AM +0900, Michael Paquier wrote:
> -* Do the actual work --- either FULL or "lazy" vacuum
> +* If PROCESS_MAIN is set (the default), it's time to vacuum the main
> +* relation.  Otherwise, we can skip this part.  If required, we'll process
> +* the TOAST table later.
> 
> Should we mention that this part could be used for a toast table once
> we've already looped once through vacuum_rel() when toast_relid was
> set?  VACOPT_PROCESS_MAIN is enforced a few lines down the road,
> still..

That did cross my mind, but I was worried that trying to explain all that
here could cause confusion.

If PROCESS_MAIN is set (the default), it's time to vacuum the main
relation.  Otherwise, we can skip this part.  If processing the TOAST
table is required (e.g., PROCESS_TOAST is set), we'll force
PROCESS_MAIN to be set when we recurse to the TOAST table so that it
gets processed here.

How does that sound?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Evaluate arguments of correlated SubPlans in the referencing ExprState

2023-03-06 Thread Tom Lane
Andres Freund  writes:
> On 2023-03-03 15:09:18 -0500, Tom Lane wrote:
>> It'd be good to have a header comment for ExecInitExprRec documenting
>> the arguments, particularly that resv/resnull are where to put the
>> subplan's eventual result.

> Did you mean ExecInitSubPlanExpr()?

Right, copy-and-pasteo, sorry.

>> You could avoid having to assume ExprState's resvalue/resnull being
>> safe to use by instead using the target resv/resnull.  This would
>> require putting those into the EEOP_PARAM_SET step so that
>> ExecEvalParamSet knows where to fetch from, so maybe it's not an
>> improvement, but perhaps worth considering.

> I think that'd be a bit worse - we'd have more pointers that can't be handled
> in a generic way in JIT.

OK.

regards, tom lane




Re: Add pg_walinspect function with block info columns

2023-03-06 Thread Michael Paquier
On Mon, Mar 06, 2023 at 04:08:28PM +0100, Matthias van de Meent wrote:
> On Mon, 6 Mar 2023 at 15:40, Bharath Rupireddy
>> IMO, pg_get_wal_records_extended_info as proposed doesn't look good to
>> me as it outputs most of the columns that are already given by
>> pg_get_wal_records_info.  What I think is the best way at this point is
>> to make it return the following:
>> lsn pg_lsn
>> block_id int8
>> spcOid oid
>> dbOid oid
>> relNumber oid
>> forkNames text
>> fpi bytea
>> fpi_info text

I would add the length of the block data (without the hole and
compressed, as the FPI data should always be presented as
uncompressed), and the block data if any (without the block data
length as one can guess it based on the bytea data length).  Note that 
a block can have both a FPI and some data assigned to it, as far as I
recall.

> The basic idea is to create a single entrypoint to all relevant data
> from DecodedXLogRecord in SQL, not multiple.

While I would agree with this principle on simplicity grounds in
terms of minimizing the SQL interface and the pg_wal/ lookups, I
disagree with it on usability grounds, because we can avoid extra SQL
tweaks with more functions.  One recent example I have in mind is
partitionfuncs.c, which can actually be achieved with a WITH RECURSIVE
on the catalogs. There are of course various degrees of complexity,
and perhaps unnest() cannot qualify as one, but having two functions
returning normalized records (one for the record information, and a
second for the block information), is a rather good balance between
usability and interface complexity, in my experience.  If you have two
functions, a JOIN is enough to cross-check the block data and the
record data, while an unnest() heavily bloats the main function output
(aka byteas of FPIs in a single array).
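
As a sketch of that usage pattern -- pg_get_wal_block_info() and its columns
are the proposal under discussion, not an existing function, and the
records-info columns are from memory and worth double-checking:

SELECT r.start_lsn, r.resource_manager,
       b.block_id, b.relnumber, b.fpi_info
FROM pg_get_wal_records_info('0/1000000', '0/2000000') AS r
JOIN pg_get_wal_block_info('0/1000000', '0/2000000') AS b
  ON b.lsn = r.start_lsn;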
--
Michael


signature.asc
Description: PGP signature


Re: Evaluate arguments of correlated SubPlans in the referencing ExprState

2023-03-06 Thread Andres Freund
Hi,

On 2023-03-03 15:09:18 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-03-02 13:00:31 -0800, Andres Freund wrote:
> >> I'm not opposed to EXPR_PARAM_SET, to be clear. I'll send an updated
> >> version later. I was just thinking about the correctness in the current
> >> world.
>
> > Attached.
>
> I've looked through this, and it looks basically OK so I marked it RfC.

Thanks!


> I do have a few nitpicks that you might or might not choose to adopt:
>
> It'd be good to have a header comment for ExecInitExprRec documenting
> the arguments, particularly that resv/resnull are where to put the
> subplan's eventual result.

Did you mean ExecInitSubPlanExpr()?


> You could avoid having to assume ExprState's resvalue/resnull being
> safe to use by instead using the target resv/resnull.  This would
> require putting those into the EEOP_PARAM_SET step so that
> ExecEvalParamSet knows where to fetch from, so maybe it's not an
> improvement, but perhaps worth considering.

I think that'd be a bit worse - we'd have more pointers that can't be handled
in a generic way in JIT.


> I think that ExecEvalParamSet should either set prm->execPlan to NULL,
> or maybe better Assert that it is already NULL.

Agreed.

Greetings,

Andres Freund




Re: add PROCESS_MAIN to VACUUM

2023-03-06 Thread Michael Paquier
On Mon, Mar 06, 2023 at 06:12:36PM -0500, Melanie Plageman wrote:
> LGTM.

-* Do the actual work --- either FULL or "lazy" vacuum
+* If PROCESS_MAIN is set (the default), it's time to vacuum the main
+* relation.  Otherwise, we can skip this part.  If required, we'll process
+* the TOAST table later.

Should we mention that this part could be used for a toast table once
we've already looped once through vacuum_rel() when toast_relid was
set?  VACOPT_PROCESS_MAIN is enforced a few lines down the road,
still..
--
Michael


signature.asc
Description: PGP signature


Re: [PoC] Let libpq reject unexpected authentication requests

2023-03-06 Thread Jacob Champion
On Fri, Mar 3, 2023 at 6:35 PM Michael Paquier  wrote:
> I was refreshing my mind with 0001 yesterday, and except for the two
> parts where we need to worry about AUTH_REQ_OK being sent too early
> and the business with gssenc, this is rather straightforward.  It
> also looks like the participants of the thread are OK with the
> design you are proposing (list of keywords, potentially negative
> patterns).  I think that I can get this part merged for this CF, at
> least, not sure about the rest :p
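
(For reference, the keyword list with negative patterns described above would
be used roughly like this -- a sketch, with the option spelling taken from
the patch:)

# accept only SCRAM authentication:
psql "host=db.example.com dbname=postgres require_auth=scram-sha-256"

# accept anything except plaintext password (a negative pattern):
psql "host=db.example.com dbname=postgres require_auth=!password"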

Thanks! Is there anything that would make the sslcertmode patch more
palatable? Or any particular areas of concern?

--Jacob




Re: Normalization of utility queries in pg_stat_statements

2023-03-06 Thread Michael Paquier
On Mon, Mar 06, 2023 at 03:50:55PM +0300, Andrei Zubkov wrote:
> Those statements are not related to any WAL tests. It seems a little bit
> incorrect to me.

The intention is to have each file run in isolation, so this is
incorrect as it stands.  Thanks for the report!
--
Michael


signature.asc
Description: PGP signature


Re: a very minor bug and a couple of comment changes for basebackup.c

2023-03-06 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> Thanks for the review. I have committed the patches.

No objections to what was committed.

> On Thu, Mar 2, 2023 at 2:59 AM Michael Paquier  wrote:
> > There is more to it: the page LSN is checked before its checksum.
> > Hence, if the page's LSN is corrupted in a way that it is higher than
> > sink->bbs_state->startptr, the checksum verification is just skipped
> > while the page is broken but not reported as such.  (Spoiler: this has
> > been mentioned in the past, and maybe we'd better remove this stuff in
> > its entirety.)
> 
> Yep. It's pretty embarrassing that we have a checksum verification
> feature that has both known ways of producing false positives and
> known ways of producing false negatives and we have no plan to ever
> fix that, we're just going to keep shipping what we've got. I think
> it's pretty clear that the feature shouldn't have been committed like
> this; valid criticisms of the design were offered and simply not
> addressed, not even by updating the comments or documentation with
> disclaimers. I find the code in sendFile() pretty ugly, too. For all
> of that, I'm a bit uncertain whether ripping it out is the right thing
> to do. It might be (I'm not sure) that it tends to work decently well
> in practice. Like, yes, it could produce false checksum warnings, but
> does that actually happen to people? It's probably not too likely that
> a read() or write() of 8kB gets updated after doing only part of the
> I/O, so the concurrent read or write is fairly likely to be on-CPU, I
> would guess, and therefore this algorithm might kind of work OK in
practice despite being garbage on a theoretical level. Similarly, the
> problems with how the LSN is vetted make it likely that a page
> replaced with random garbage will go undetected, but a page where a
> few bytes get flipped in a random position within the page is likely
> to get caught, and maybe the latter is actually a bigger risk than the
> former. I don't really know. I'd be interested to hear from anyone
> with a lot of practical experience using the feature. A few anecdotes
> of the form "this routine fails to tell us about problems" or "this
> often complains about problems that are not real" or "this has really
> helped us out on a number of occasions and we have had no problems
> with it" would be really helpful.

The concern about the LSN is certainly a valid one and we ended up
ripping that check out of pgbackrest in favor of taking a different
approach- simply re-read the page and see if it changed.  If it changed,
then we punt and figure that it was a hot page that PG was actively
writing to and so it'll be in the WAL and we don't have to worry about
it.  We're generally concerned more about latent on-disk corruption being
missed than about some kind of in-memory corruption, and the page
changing in the filesystem cache without some other process writing to
it just seems awfully unlikely.

https://github.com/pgbackrest/pgbackrest/commit/9eec98c61302121134d2067326dbd2cd0f2f0b9c

From a practical perspective, while this has the aforementioned risk
regarding our loop happening twice fast enough that it re-reads the same
page without the i/o on the page progressing at all, I'm not aware of
even one report of that actually happening.  We absolutely have seen
cases where the first read picks up a torn page and that's not even
uncommon in busy environments.
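
In pseudocode, the re-read approach amounts to something like this (a
simplified sketch; the helper names are hypothetical, not pgbackrest's
actual code):

/* after a page fails checksum verification on the first read: */
read_page(fd, blkno, copy2);

if (memcmp(copy1, copy2, BLCKSZ) != 0)
{
	/*
	 * The page changed between the two reads: treat it as a hot page
	 * being written concurrently; a consistent copy will be replayed
	 * from WAL, so don't report it.
	 */
}
else
{
	/* Stable across both reads: report a real checksum failure. */
	report_checksum_failure(blkno);
}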

One of the ideas I've had around how to address all of this is to not
depend on inferring things from what's been written out but instead to
just ask PG and that's one of the (relatively few...) benefits that I
see to an archive_library- the ability to check if a given page is in
shared buffers and, if so, what its LSN is to see if it's past the start
of the backup.  Given that pg_basebackup is already working with a lot
of server-side code.. perhaps this could be a direction to go in for it.

Another idea we played around with was keeping track of the LSNs of the
pages with invalid checksums and checking that they fall within the
range of start LSN-end LSN of the backup; while that wouldn't
completely prevent corrupted LSNs from skipping detection using the LSN
approach, it'd sure make it a whole lot less likely.

Lastly, I've also wondered about trying to pull (clean) pages from
shared_buffers directly instead of reading them off of disk at all,
perhaps using a ring buffer in shared_buffers to read them in if
they're not already there, and letting all the usual checksum validation
and such happening with PG handling it.  Dirty pages would have to be
handled though, presumably by locking them and then reading the page
from disk as I don't think folks would be pleased if we decided to
forcibly write them out.  For an incremental backup, if you have
reliable knowledge that the page is in the WAL from PG, perhaps you
could even skip the page entirely.

Independently of verifying PG checksums, we've also 

Re: Record queryid when auto_explain.log_verbose is on

2023-03-06 Thread Imseih (AWS), Sami
I am wondering if this patch should be backpatched?

The reason is that in the auto_explain documentation [1],
there is a claim of equivalence of the auto_explain.log_verbose
option and EXPLAIN(verbose):

". it's equivalent to the VERBOSE option of EXPLAIN."

This can be quite confusing for users of the extension.
The documentation should either be updated or a backpatch
done all the way down to 14, which is the version in which the query
identifier was moved to core. I am in favor of the latter.

Any thoughts?


[1] https://www.postgresql.org/docs/14/auto-explain.html

Regards,

--
Sami Imseih
Amazon Web Services (AWS)




RE: Ability to reference other extensions by schema in extension scripts

2023-03-06 Thread Regina Obe
> It looks like this patch needs a quick rebase, there's a conflict in the
> meson.build.
> 
> I'll leave the state since presumably this would be easy to resolve but it 
> would
> be more likely to get attention if it's actually building cleanly.
> 
> http://cfbot.cputube.org/patch_42_4023.log
> --
> Gregory Stark
> As Commitfest Manager

Attached is the patch rebased against master.
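
(For readers following along, a sketch of what the patch enables inside an
extension script; the @extschema:name@ syntax is per the patch title, and
the extension and function names here are illustrative:)

-- my_ext.control:  requires = 'postgis'
-- my_ext--1.0.sql may then reference the required extension's schema:
CREATE FUNCTION my_ext_x(g @extschema:postgis@.geometry)
RETURNS float8
AS $$ SELECT @extschema:postgis@.ST_X($1) $$ LANGUAGE sql;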




0005-Allow-use-of-extschema-reqextname-to-reference.patch
Description: Binary data


Re: add PROCESS_MAIN to VACUUM

2023-03-06 Thread Melanie Plageman
On Mon, Mar 6, 2023 at 5:43 PM Nathan Bossart  wrote:
>
> On Mon, Mar 06, 2023 at 05:09:58PM -0500, Melanie Plageman wrote:
> > I would move this comment inside of the outer if statement since it is
> > distinguishing between the two branches of the inner if statement.
>
> Oops, done.
>
> > Also, I would still consider putting a comment above that reminds us that
> > VACOPT_PROCESS_MAIN is the default and will vacuum the main relation.
>
> I tried adding something along these lines.

LGTM.




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-06 Thread Andres Freund
Hi,

On 2023-03-06 13:29:50 +0900, Michael Paquier wrote:
> On Mon, Feb 20, 2023 at 01:54:00PM +0530, Bharath Rupireddy wrote:
> > I ran some tests on my dev system [1] and I don't see much difference
> > between v3 and v4. So, +1 for v3 patch (+ argument order swap) from
> > Andres to keep the code simple and elegant.
>
> This thread has stalled for a couple of weeks, so I have gone back to
> it.  Testing on a tmpfs I am not seeing a difference in performance
> for any of the approaches discussed.  At the end, as I am the one
> behind the introduction of pg_pwrite_zeros(), I have applied v3 after
> switching the size and offset parameters to be the same way as in v4.

Thanks.

Greetings,

Andres Freund




Re: On login trigger: take three

2023-03-06 Thread Daniel Gustafsson
> On 6 Mar 2023, at 23:10, Andres Freund  wrote:
> 
> Hi,
> 
> On 2023-03-06 15:55:01 -0500, Gregory Stark (as CFM) wrote:
>> It looks like Daniel Gustafsson, Andres, and Tom have all weighed in
>> on this patch with at least a neutral comment (+-0 from Andres :)
>> 
>> It looks like the main concern was breaking physical replicas and that
>> there was consensus that as long as single-user mode worked it
>> was ok?
> 
> I don't think it's OK with just single user mode as an escape hatch. We need
> the GUC that was discussed as part of the thread (and I think there's a patch
> for that too).

This is the patch which originated from this thread:

https://commitfest.postgresql.org/42/4013/

--
Daniel Gustafsson





Re: On login trigger: take three

2023-03-06 Thread Andres Freund
Hi,

On 2023-03-06 15:55:01 -0500, Gregory Stark (as CFM) wrote:
> It looks like Daniel Gustafsson, Andres, and Tom have all weighed in
> on this patch with at least a neutral comment (+-0 from Andres :)
> 
> It looks like the main concern was breaking physical replicas and that
> there was consensus that as long as single-user mode worked it
> was ok?

I don't think it's OK with just single user mode as an escape hatch. We need
the GUC that was discussed as part of the thread (and I think there's a patch
for that too).

Greetings,

Andres Freund




Re: add PROCESS_MAIN to VACUUM

2023-03-06 Thread Melanie Plageman
On Mon, Mar 06, 2023 at 01:13:37PM -0800, Nathan Bossart wrote:
> On Mon, Mar 06, 2023 at 03:48:28PM -0500, Melanie Plageman wrote:
> > On Mon, Mar 6, 2023 at 3:27 PM Nathan Bossart  
> > wrote:
> >> On Mon, Mar 06, 2023 at 02:40:09PM -0500, Melanie Plageman wrote:
> >> > I noticed in vacuum_rel() in vacuum.c where table_relation_vacuum() is
> >> > called, 4211fbd84 changes the else into an else if [1]. I understand
> >> > after reading the commit and re-reading the code why that is now, but I
> >> > was initially confused. I was thinking it might be nice to have a
> >> > comment mentioning why there is no else case here (i.e. that the main
> >> > table relation will be vacuumed on the else if branch).
> >>
> >> This was a hack to avoid another level of indentation for that whole block
> >> of code, but based on your comment, it might be better to just surround
> >> this entire section with an "if (params->options & VACOPT_PROCESS_MAIN)"
> >> check.  WDYT?
> > 
> > I think that would be clearer.
> 
> Here's a patch.  Thanks for reviewing.

> diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
> index 580f966499..fb1ef28fa9 100644
> --- a/src/backend/commands/vacuum.c
> +++ b/src/backend/commands/vacuum.c
> @@ -2060,23 +2060,25 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params, bool skip_privs)

I would move this comment inside of the outer if statement since it is
distinguishing between the two branches of the inner if statement.

Also, I would still consider putting a comment above that reminds us that
VACOPT_PROCESS_MAIN is the default and will vacuum the main relation.

> 	/*
> 	 * Do the actual work --- either FULL or "lazy" vacuum
> 	 */
> -	if ((params->options & VACOPT_FULL) &&
> -		(params->options & VACOPT_PROCESS_MAIN))
> +	if (params->options & VACOPT_PROCESS_MAIN)
> 	{
> -		ClusterParams cluster_params = {0};
> +		if (params->options & VACOPT_FULL)
> +		{
> +			ClusterParams cluster_params = {0};
> 
> -		/* close relation before vacuuming, but hold lock until commit */
> -		relation_close(rel, NoLock);
> -		rel = NULL;
> +			/* close relation before vacuuming, but hold lock until commit */
> +			relation_close(rel, NoLock);
> +			rel = NULL;
> 
> -		if ((params->options & VACOPT_VERBOSE) != 0)
> -			cluster_params.options |= CLUOPT_VERBOSE;
> +			if ((params->options & VACOPT_VERBOSE) != 0)
> +				cluster_params.options |= CLUOPT_VERBOSE;
> 
> -		/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
> -		cluster_rel(relid, InvalidOid, &cluster_params);
> +			/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
> +			cluster_rel(relid, InvalidOid, &cluster_params);
> +		}
> +		else
> +			table_relation_vacuum(rel, params, vac_strategy);
> 	}
> -	else if (params->options & VACOPT_PROCESS_MAIN)
> -		table_relation_vacuum(rel, params, vac_strategy);
> 
> 	/* Roll back any GUC changes executed by index functions */
> 	AtEOXact_GUC(false, save_nestlevel);


- Melanie




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-03-06 Thread Nathan Bossart
+void
+XLogReadFromBuffers(XLogRecPtr startptr,
+TimeLineID tli,
+Size count,
+char *buf,
+Size *read_bytes)

Since this function presently doesn't return anything, can we have it
return the number of bytes read instead of storing it in a pointer
variable?

+ptr = startptr;
+nbytes = count;
+dst = buf;

These variables seem superfluous.

+/*
+ * Requested WAL isn't available in WAL buffers, so return with
+ * what we have read so far.
+ */
+break;

nitpick: I'd move this to the top so that you can save a level of
indentation.

if (expectedEndPtr != endptr)
break;

... logic for when the data is found in the WAL buffers ...

+/*
+ * All the bytes are not in one page. Read available bytes on
+ * the current page, copy them over to output buffer and
+ * continue to read remaining bytes.
+ */

Is it possible to memcpy more than a page at a time?

+/*
+ * The fact that we acquire WALBufMappingLock while reading the WAL
+ * buffer page itself guarantees that no one else initializes it or
+ * makes it ready for next use in AdvanceXLInsertBuffer().
+ *
+ * However, we perform basic page header checks for ensuring that
+ * we are not reading a page that just got initialized. Callers
+ * will anyway perform extensive page-level and record-level
+ * checks.
+ */

Hm.  I wonder if we should make these assertions instead.

+elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+ *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);

I definitely don't think we should put an elog() in this code path.
Perhaps this should be guarded behind WAL_DEBUG.

+/*
+ * Check if we have read fully (hit), partially (partial hit) or
+ * nothing (miss) from WAL buffers. If we have read either partially or
+ * nothing, then continue to read the remaining bytes the usual way,
+ * that is, read from WAL file.
+ */
+if (count == read_bytes)
+{
+/* Buffer hit, so return. */
+return true;
+}
+else if (read_bytes > 0 && count > read_bytes)
+{
+/*
+ * Buffer partial hit, so reset the state to count the read bytes
+ * and continue.
+ */
+buf += read_bytes;
+startptr += read_bytes;
+count -= read_bytes;
+}
+
+/* Buffer miss i.e., read_bytes = 0, so continue */

I think we can simplify this.  We effectively take the same action any time
"count" doesn't equal "read_bytes", so there's no need for the "else if".

if (count == read_bytes)
return true;

buf += read_bytes;
startptr += read_bytes;
count -= read_bytes;

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: using memoize in in paralel query decreases performance

2023-03-06 Thread David Rowley
On Mon, 6 Mar 2023 at 21:55, Pavel Stehule  wrote:
> default https://explain.depesz.com/s/fnBe

It looks like the slowness is coming from the Bitmap Index scan and
Bitmap heap scan rather than Memoize.

   ->  Bitmap Heap Scan on public.item i  (cost=285.69..41952.12
rows=29021 width=16) (actual time=20.395..591.606 rows=20471
loops=784)
 Output: i.id, i.item_category_id
 Recheck Cond: (i.item_category_id = ictc.sub_category_id)
 Heap Blocks: exact=1590348
 Worker 0:  actual time=20.128..591.426 rows=20471 loops=112
 Worker 1:  actual time=20.243..591.627 rows=20471 loops=112
 Worker 2:  actual time=20.318..591.660 rows=20471 loops=112
 Worker 3:  actual time=21.180..591.644 rows=20471 loops=112
 Worker 4:  actual time=20.226..591.357 rows=20471 loops=112
 Worker 5:  actual time=20.597..591.418 rows=20471 loops=112
 ->  Bitmap Index Scan on ixfk_ite_itemcategoryid
(cost=0.00..278.43 rows=29021 width=0) (actual time=14.851..14.851
rows=25362 loops=784)
   Index Cond: (i.item_category_id = ictc.sub_category_id)
   Worker 0:  actual time=14.863..14.863 rows=25362 loops=112
   Worker 1:  actual time=14.854..14.854 rows=25362 loops=112
   Worker 2:  actual time=14.611..14.611 rows=25362 loops=112
   Worker 3:  actual time=15.245..15.245 rows=25362 loops=112
   Worker 4:  actual time=14.909..14.909 rows=25362 loops=112
   Worker 5:  actual time=14.841..14.841 rows=25362 loops=112

> disabled memoize https://explain.depesz.com/s/P2rP

->  Bitmap Heap Scan on public.item i  (cost=285.69..41952.12
rows=29021 width=16) (actual time=9.256..57.503 rows=20471 loops=784)
   Output: i.id, i.item_category_id
   Recheck Cond: (i.item_category_id = ictc.sub_category_id)
   Heap Blocks: exact=1590349
   Worker 0:  actual time=9.422..58.420 rows=20471 loops=112
   Worker 1:  actual time=9.449..57.539 rows=20471 loops=112
   Worker 2:  actual time=9.751..58.129 rows=20471 loops=112
   Worker 3:  actual time=9.620..57.484 rows=20471 loops=112
   Worker 4:  actual time=8.940..57.911 rows=20471 loops=112
   Worker 5:  actual time=9.454..57.488 rows=20471 loops=112
   ->  Bitmap Index Scan on ixfk_ite_itemcategoryid
(cost=0.00..278.43 rows=29021 width=0) (actual time=4.581..4.581
rows=25363 loops=784)
 Index Cond: (i.item_category_id = ictc.sub_category_id)
 Worker 0:  actual time=4.846..4.846 rows=25363 loops=112
 Worker 1:  actual time=4.734..4.734 rows=25363 loops=112
 Worker 2:  actual time=4.803..4.803 rows=25363 loops=112
 Worker 3:  actual time=4.959..4.959 rows=25363 loops=112
 Worker 4:  actual time=4.402..4.402 rows=25363 loops=112
 Worker 5:  actual time=4.778..4.778 rows=25363 loops=112

I wonder if the additional work_mem required for Memoize is just doing
something like causing kernel page cache evictions, leading to fewer
buffers of ixfk_ite_itemcategoryid and the item table being cached in
the kernel page cache.

Maybe you could get an idea of that if you SET track_io_timing = on;
and EXPLAIN (ANALYZE, BUFFERS) both queries.
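
That is, roughly (with the query from the top of the thread standing in for
the SELECT):

SET track_io_timing = on;
SET enable_memoize = on;    -- then repeat with enable_memoize = off
EXPLAIN (ANALYZE, BUFFERS) SELECT ... ;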

David




RE: Ability to reference other extensions by schema in extension scripts

2023-03-06 Thread Regina Obe
> It looks like this patch needs a quick rebase, there's a conflict in the
> meson.build.
> 
> I'll leave the state since presumably this would be easy to resolve but it 
> would
> be more likely to get attention if it's actually building cleanly.
> 
> http://cfbot.cputube.org/patch_42_4023.log
> 
> On Tue, 28 Feb 2023 at 18:44, Sandro Santilli  wrote:
> >
> > The following review has been posted through the commitfest application:
> > make installcheck-world:  tested, failed
> > Implements feature:   tested, passed
> > Spec compliant:   tested, passed
> > Documentation:tested, passed
> >
> > I've applied the patch attached to message
> > https://www.postgresql.org/message-
> id/000401d94bc8%2448dff700%24da9fe5
> > 00%24%40pcorp.us (md5sum a7d45a32c054919d94cd4a26d7d07c20) onto
> > current tip of the master branch being
> > 128dd9f9eca0b633b51ffcd5b0f798fbc48ec4c0
> >
> > The review written in https://www.postgresql.org/message-
> id/20230228224608.ak7br5shev4wic5a%40c19 all still applies.
> >
> > The `make installcheck-world`  test fails for me but the failures seem
> unrelated to the patch (many occurrences of "+ERROR:  function
> pg_input_error_info(unknown, unknown) does not exist" in the
> regression.diff).
> >
> > Documentation exists for the new feature
> >
> > The new status of this patch is: Ready for Committer
> 
> 
> 
> --
> Gregory Stark
> As Commitfest Manager

Just sent a note about the wildcard one.  Was this conflicting with the
wildcard one or some other patch?
I can rebase if it was conflicting with another one; if it was the wildcard
one, then maybe we should commit this one and we'll rebase the wildcard one.

We would like to submit the wildcard one too, but I think Tom had some 
reservations on that one.

Thanks,
Regina






RE: [PATCH] Support % wildcard in extension upgrade filenames

2023-03-06 Thread Regina Obe
> This patch too is conflicting on meson.build. Are these two patches
> interdependent?
> 
> This one looks a bit more controversial. I'm not sure if Tom has been
> convinced and I haven't seen anyone else speak up.
> 
> Perhaps this needs a bit more discussion of other options to solve this issue.
> Maybe it can just be solved with multiple one-line scripts that call to a 
> master
> script?
> 
> --
> Gregory Stark
> As Commitfest Manager

They edit the same file, yes, so conflicts are expected.
Tom was not convinced by the wildcard one, so I assume we'll need to
go a couple more rounds on it, and hopefully we can get something that will
not be so controversial.

I don't think the schema-qualification-of-required-extensions feature is
controversial though, so I think that one should be able to go in, and we'll
rebase our other patch after that.

Thanks,
Regina





Re: On login trigger: take three

2023-03-06 Thread Daniel Gustafsson
> On 6 Mar 2023, at 21:55, Gregory Stark (as CFM)  wrote:
> 
> It looks like Daniel Gustafsson, Andres, and Tom have all weighed in
> on this patch with at least a neutral comment (+-0 from Andres :)

I think the concept of a login event trigger has merits, even though it's kind
of a niche use case.

> It looks like the main concern was breaking physical replicas and that
> there was consensus that as long as single-user mode worked that it
> was ok?

The concerns were having a way to not rely on single-user mode and not
causing an unacceptable performance hit (which, judging by recent
benchmarks, might not be an issue?).  I still intend to revisit this and I
hope to get to it during this CF.
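For anyone catching up on the thread, the usage the patch proposes looks
roughly like this (syntax taken from my reading of the patch, so it may
still change before commit):

    CREATE FUNCTION on_login_proc() RETURNS event_trigger AS $$
    BEGIN
        RAISE NOTICE 'you are welcome!';
    END;
    $$ LANGUAGE plpgsql;

    CREATE EVENT TRIGGER on_login_trigger ON login
        EXECUTE FUNCTION on_login_proc();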

--
Daniel Gustafsson





Re: proposal: possibility to read dumped table's name from file

2023-03-06 Thread Daniel Gustafsson
> On 6 Mar 2023, at 21:45, Gregory Stark (as CFM)  wrote:
> 
> So this patch has been through a lot of commitfests. And it really
> doesn't seem that hard to resolve -- Pavel has seemingly been willing
> to go along whichever way the wind has been blowing but honestly it
> kind of seems like he's just gotten drive-by suggestions and he's put
> a lot of work into trying to satisfy them.

Agreed.

> He implemented --include-tables-from-file=... etc. Then he implemented
> a hand-written parser for a DSL to select objects, then he implemented
> a bison parser, then he went back to the hand-written parser.

Well, kind of.  I was trying to take the patch to the finishing line but was
uncomfortable with the hand-written parser, so I implemented a parser in
Bison to replace it with.  Not that hand-written parsers are bad per se (or
that my Bison parser was perfect), but reading quoted identifiers across line
boundaries tends to require a fair amount of handwritten code.  Pavel did not
object to this version, but it was objected to by two other committers.

At this point [0] I stepped down from trying to finish it as the approach I
was comfortable with didn't gain traction (which is totally fine).

Downthread from this the patch got a lot of reviews from Julien with the old
parser back in place.

> Can we get some consensus on whether the DSL looks right

I would consider this pretty settled.
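(For anyone new to the thread: the DSL is a line-based filter file passed to
pg_dump's new --filter option, roughly of this shape -- keywords per the
latest patch version as I read it, so treat it as a sketch:

    # dump these tables, but skip everything in the archive schema
    include table public.orders
    include table public.order_*
    exclude schema archive

invoked as, e.g., pg_dump --filter=objects.txt mydb.)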

> and whether the hand-written parser is sensible.

This is the part where a committer who wants to pursue the hand-written parser
need to step up. With the amount of review received it's hopefully pretty close.

--
Daniel Gustafsson

[0] 098531e1-fba9-4b7d-884e-0a4363eee...@yesql.se





Re: add PROCESS_MAIN to VACUUM

2023-03-06 Thread Nathan Bossart
On Mon, Mar 06, 2023 at 03:48:28PM -0500, Melanie Plageman wrote:
> On Mon, Mar 6, 2023 at 3:27 PM Nathan Bossart  
> wrote:
>> On Mon, Mar 06, 2023 at 02:40:09PM -0500, Melanie Plageman wrote:
>> > I noticed in vacuum_rel() in vacuum.c where table_relation_vacuum() is
>> > called, 4211fbd84 changes the else into an else if [1]. I understand
>> > after reading the commit and re-reading the code why that is now, but I
>> > was initially confused. I was thinking it might be nice to have a
>> > comment mentioning why there is no else case here (i.e. that the main
>> > table relation will be vacuumed on the else if branch).
>>
>> This was a hack to avoid another level of indentation for that whole block
>> of code, but based on your comment, it might be better to just surround
>> this entire section with an "if (params->options & VACOPT_PROCESS_MAIN)"
>> check.  WDYT?
> 
> I think that would be clearer.

Here's a patch.  Thanks for reviewing.
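(For readers following along: this is purely a control-flow restructuring in
vacuum_rel(); the option's behavior is unchanged, i.e. something like

    VACUUM (PROCESS_MAIN FALSE) some_table;  -- skip the main relation,
                                             -- still process its TOAST table

with some_table being any table name, works as before.)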

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 580f966499..fb1ef28fa9 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2060,23 +2060,25 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params, bool skip_privs)
 	/*
 	 * Do the actual work --- either FULL or "lazy" vacuum
 	 */
-	if ((params->options & VACOPT_FULL) &&
-		(params->options & VACOPT_PROCESS_MAIN))
+	if (params->options & VACOPT_PROCESS_MAIN)
 	{
-		ClusterParams cluster_params = {0};
+		if (params->options & VACOPT_FULL)
+		{
+			ClusterParams cluster_params = {0};
 
-		/* close relation before vacuuming, but hold lock until commit */
-		relation_close(rel, NoLock);
-		rel = NULL;
+			/* close relation before vacuuming, but hold lock until commit */
+			relation_close(rel, NoLock);
+			rel = NULL;
 
-		if ((params->options & VACOPT_VERBOSE) != 0)
-			cluster_params.options |= CLUOPT_VERBOSE;
+			if ((params->options & VACOPT_VERBOSE) != 0)
+				cluster_params.options |= CLUOPT_VERBOSE;
 
-		/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
-		cluster_rel(relid, InvalidOid, &cluster_params);
+			/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
+			cluster_rel(relid, InvalidOid, &cluster_params);
+		}
+		else
+			table_relation_vacuum(rel, params, vac_strategy);
 	}
-	else if (params->options & VACOPT_PROCESS_MAIN)
-		table_relation_vacuum(rel, params, vac_strategy);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);


Re: On login trigger: take three

2023-03-06 Thread Gregory Stark (as CFM)
It looks like Daniel Gustafsson, Andres, and Tom have all weighed in
on this patch with at least a neutral comment (+-0 from Andres :)

It looks like the main concern was breaking physical replicas and that
there was consensus that as long as single-user mode worked that it
was ok?

So maybe it's time after 2 1/2 years to get this one committed?

-- 
Gregory Stark
As Commitfest Manager




Re: add PROCESS_MAIN to VACUUM

2023-03-06 Thread Melanie Plageman
On Mon, Mar 6, 2023 at 3:27 PM Nathan Bossart  wrote:
>
> On Mon, Mar 06, 2023 at 02:40:09PM -0500, Melanie Plageman wrote:
> > I noticed in vacuum_rel() in vacuum.c where table_relation_vacuum() is
> > called, 4211fbd84 changes the else into an else if [1]. I understand
> > after reading the commit and re-reading the code why that is now, but I
> > was initially confused. I was thinking it might be nice to have a
> > comment mentioning why there is no else case here (i.e. that the main
> > table relation will be vacuumed on the else if branch).
>
> This was a hack to avoid another level of indentation for that whole block
> of code, but based on your comment, it might be better to just surround
> this entire section with an "if (params->options & VACOPT_PROCESS_MAIN)"
> check.  WDYT?

I think that would be clearer.

- Melanie




Re: proposal: possibility to read dumped table's name from file

2023-03-06 Thread Gregory Stark (as CFM)
So this patch has been through a lot of commitfests. And it really
doesn't seem that hard to resolve -- Pavel has seemingly been willing
to go along whichever way the wind has been blowing but honestly it
kind of seems like he's just gotten drive-by suggestions and he's put
a lot of work into trying to satisfy them.

He implemented --include-tables-from-file=... etc. Then he implemented
a hand-written parser for a DSL to select objects, then he implemented
a bison parser, then he went back to the hand-written parser.

Can we get some consensus on whether the DSL looks right and whether
the hand-written parser is sensible? And if so, can a committer
step up to actually review and commit the patch? The last review said it
might need a native English speaker to tweak some wording but
otherwise looked good.

-- 
Gregory Stark
As Commitfest Manager




Re: Support logical replication of DDLs

2023-03-06 Thread Zheng Li
On Mon, Mar 6, 2023 at 5:17 AM wangw.f...@fujitsu.com
 wrote:
>
> For v-75-0003* patch.
> 2. In the function deparse_CreateSeqStmt.
> It seems that we are not deparsing the "AS data_type" clause (CREATE SEQUENCE
> ... AS data_type). I think this causes all data_type to be default (bigint)
> after executing the parsed CreateSeq command.
>
> ~~~

Hi, thanks for the comments. We identified this issue as well during
testing; I've made a fix and will update the patch in a few days with
other fixes.
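(For example, given a sequence defined as

    CREATE SEQUENCE seq1 AS smallint;

dropping the "AS data_type" clause during deparsing would make the
replicated sequence silently come out as the default bigint.)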

Regards,
Zane




Re: add PROCESS_MAIN to VACUUM

2023-03-06 Thread Nathan Bossart
On Mon, Mar 06, 2023 at 02:40:09PM -0500, Melanie Plageman wrote:
> I noticed in vacuum_rel() in vacuum.c where table_relation_vacuum() is
> called, 4211fbd84 changes the else into an else if [1]. I understand
> after reading the commit and re-reading the code why that is now, but I
> was initially confused. I was thinking it might be nice to have a
> comment mentioning why there is no else case here (i.e. that the main
> table relation will be vacuumed on the else if branch).

This was a hack to avoid another level of indentation for that whole block
of code, but based on your comment, it might be better to just surround
this entire section with an "if (params->options & VACOPT_PROCESS_MAIN)"
check.  WDYT?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-03-06 Thread Melanie Plageman
On Mon, Mar 6, 2023 at 2:34 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-03-06 14:24:09 -0500, Melanie Plageman wrote:
> > On Mon, Mar 06, 2023 at 11:09:19AM -0800, Andres Freund wrote:
> > > On 2023-03-06 10:09:24 -0500, Melanie Plageman wrote:
> > > > On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi
> > > >  wrote:
> > > > >
> > > > > At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi 
> > > > >  wrote in
> > > > > > In any case, I think we need to avoid such concurrent 
> > > > > > autovacuum/analyze.
> > > > >
> > > > > If it is correct, I believe the attached fix works.
> > > >
> > > > Thanks for investigating this!
> > > >
> > > > Yes, this fix looks correct and makes sense to me.
> > >
> > > Wouldn't it be better to just perform the section from the ALTER TABLE 
> > > till
> > > the DROP TABLE in a transaction? Then there couldn't be any other 
> > > accesses in
> > > just that section. I'm not convinced it's good to disallow all concurrent
> > > activity in other parts of the test.
> >
> > You mean for test coverage reasons? Because the table in question only
> > exists for a few operations in this test file.
>
> That, but also because it's simply more reliable. autovacuum=off doesn't
> protect against an anti-wraparound vacuum or such. Or a concurrent test somehow
> triggering a read. Or ...

Good point. Attached is what you suggested. I committed the transaction
before the drop table so that the statistics would be visible when we
queried pg_stat_io.

- Melanie
From 78ca019624fbd7d6e2a4d94970b804fc834731b4 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 6 Mar 2023 15:16:03 -0500
Subject: [PATCH v1] Fix flakey pg_stat_io test

Wrap test of pg_stat_io's tracking of shared buffer reads in a
transaction to prevent concurrent accesses (e.g. by autovacuum) leading
to incorrect test failures.

Discussion: https://www.postgresql.org/message-id/20230306190919.ai6mxdq3sygyyths%40awork3.anarazel.de
---
 src/test/regress/expected/stats.out | 4 
 src/test/regress/sql/stats.sql  | 4 
 2 files changed, 8 insertions(+)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 937b2101b3..fb5adb0fd7 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1181,6 +1181,9 @@ SELECT current_setting('fsync') = 'off'
 -- from it to cause it to be read back into shared buffers.
 SELECT sum(reads) AS io_sum_shared_before_reads
   FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+-- Do this in a transaction to prevent any other concurrent access to our newly
+-- rewritten table, guaranteeing our test will pass.
+BEGIN;
 ALTER TABLE test_io_shared SET TABLESPACE regress_tblspace;
 -- SELECT from the table so that the data is read into shared buffers and
 -- io_context 'normal', io_object 'relation' reads are counted.
@@ -1190,6 +1193,7 @@ SELECT COUNT(*) FROM test_io_shared;
100
 (1 row)
 
+COMMIT;
 SELECT pg_stat_force_next_flush();
  pg_stat_force_next_flush 
 --
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 74e592aa8a..84604d8fa0 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -576,10 +576,14 @@ SELECT current_setting('fsync') = 'off'
 -- from it to cause it to be read back into shared buffers.
 SELECT sum(reads) AS io_sum_shared_before_reads
   FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+-- Do this in a transaction to prevent any other concurrent access to our newly
+-- rewritten table, guaranteeing our test will pass.
+BEGIN;
 ALTER TABLE test_io_shared SET TABLESPACE regress_tblspace;
 -- SELECT from the table so that the data is read into shared buffers and
 -- io_context 'normal', io_object 'relation' reads are counted.
 SELECT COUNT(*) FROM test_io_shared;
+COMMIT;
 SELECT pg_stat_force_next_flush();
 SELECT sum(reads) AS io_sum_shared_after_reads
   FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation'  \gset
-- 
2.37.2



Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2023-03-06 Thread Andrei Zubkov
Hi Gregory,

> patching file contrib/pg_stat_statements/sql/oldextversions.sql
> can't find file to patch at input line 1833
> 
> 
> and
> 
> > --- a/contrib/pg_stat_statements/sql/pg_stat_statements.sql
> > +++ b/contrib/pg_stat_statements/sql/pg_stat_statements.sql
> --
> No file to patch.  Skipping patch.
> 
Thank you for your attention.

Yes, it is due to parallel work on "Normalization of utility queries in
pg_stat_statements" patch
(https://postgr.es/m/Y/7Y9U/y/keaw...@paquier.xyz)

It seems I've found something strange in the new test files - I've
mentioned this in the thread of that patch. Once there is a solution
I'll do a rebase again as soon as possible.

-- 
Andrei Zubkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: Add a hook to allow modification of the ldapbindpasswd

2023-03-06 Thread Gregory Stark (as CFM)
The CFBot says this patch is failing but I find it hard to believe
this is related to this patch...

2023-03-05 20:56:58.705 UTC [33902][client backend]
[pg_regress/btree_index][18/750:0] STATEMENT:  ALTER INDEX
btree_part_idx ALTER COLUMN id SET (n_distinct=100);
2023-03-05 20:56:58.709 UTC [33902][client backend]
[pg_regress/btree_index][:0] LOG:  disconnection: session time:
0:00:02.287 user=postgres database=regression host=[local]
2023-03-05 20:56:58.710 UTC [33889][client backend]
[pg_regress/join][:0] LOG:  disconnection: session time: 0:00:02.289
user=postgres database=regression host=[local]
2023-03-05 20:56:58.749 UTC [33045][postmaster] LOG:  server process
(PID 33898) was terminated by signal 6: Abort trap
2023-03-05 20:56:58.749 UTC [33045][postmaster] DETAIL:  Failed
process was running: SELECT * FROM writetest;
2023-03-05 20:56:58.749 UTC [33045][postmaster] LOG:  terminating any
other active server processes





-- 
Gregory Stark
As Commitfest Manager




Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-06 Thread Andres Freund
Hi,

On 2023-03-06 17:02:38 +0530, Amit Kapila wrote:
> Andres, do you have any thoughts on this? We seem to have figured out
> the cause of regression in the case Shi-San has reported and others
> also agree with it. We can't think of doing anything better than what
> the patch currently is doing, so thought of going with an option to
> allow users to disable index scans. The current understanding is that
> the patch will be a win in much more cases than the cases where one
> can see regression but still having a knob could be useful in those
> few cases.

I think the case in which the patch regresses performance is irrelevant in
practice. It's good to not regress unnecessarily for easily avoidable
reasons, even in such cases, but it's not worth holding anything up due to it.

Greetings,

Andres Freund




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2023-03-06 Thread Gregory Stark (as CFM)
I'm sorry, It seems this is failing again? It's Makefile and
meson.build again :(

Though there are also

patching file contrib/pg_stat_statements/sql/oldextversions.sql
can't find file to patch at input line 1833


and

|--- a/contrib/pg_stat_statements/sql/pg_stat_statements.sql
|+++ b/contrib/pg_stat_statements/sql/pg_stat_statements.sql
--
No file to patch.  Skipping patch.




--
Gregory Stark
As Commitfest Manager




Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-06 Thread Andres Freund
Hi,

On 2023-03-01 14:10:07 +0530, Amit Kapila wrote:
> On Wed, Mar 1, 2023 at 12:09 AM Andres Freund  wrote:
> >
> > > I see this as a way to provide this feature for users but I would
> > > prefer to proceed with this if we can get some more buy-in from senior
> > > community members (at least one more committer) and some user(s) if
> > > possible. So, I once again request others to chime in and share their
> > > opinion.
> >
> > I'd prefer not having an option, because we figure out the cause of the
> > performance regression (reducing it to be small enough to not care). After
> > that an option defaulting to using indexes.
> >
> 
> Sure, if we can reduce regression to be small enough then we don't
> need to keep the default as false, otherwise, also, we can consider it
> to keep an option defaulting to using indexes depending on the
> investigation for regression. Anyway, the main concern was whether it
> is okay to have an option for this which I think we have an agreement
> on, now I will continue my review.

I think even as-is it's reasonable to just use it. The sequential scan
approach is O(N^2), which, uh, is not good. And having an index over thousands
of non-differing values will generally perform badly, not just in this
context.

Greetings,

Andres Freund




Re: Make mesage at end-of-recovery less scary.

2023-03-06 Thread Gregory Stark (as CFM)
It looks like this needs a rebase, and at a quick glance it looks like
more than a trivial conflict. I'll mark it Waiting on Author. Please
set the status back when it's rebased.




--
Gregory Stark
As Commitfest Manager




Re: [PATCH] Support % wildcard in extension upgrade filenames

2023-03-06 Thread Gregory Stark (as CFM)
This patch too is conflicting on meson.build. Are these two patches
interdependent?

This one looks a bit more controversial. I'm not sure if Tom has been
convinced and I haven't seen anyone else speak up.

Perhaps this needs a bit more discussion of other options to solve
this issue. Maybe it can just be solved with multiple one-line scripts
that call a master script?

-- 
Gregory Stark
As Commitfest Manager




Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2023-03-06 Thread Heikki Linnakangas

On 01/03/2023 12:21, Aleksander Alekseev wrote:
> Hi,
>
>> I'm surprised that these patches extend the page numbering to 64 bits,
>> but never actually use the high bits. The XID "epoch" is not used, and
>> pg_xact still wraps around and the segment names are still reused. I
>> thought we could stop doing that.
>
> To clarify, the idea is to let CLOG grow indefinitely and simply store
> FullTransactionId -> TransactionStatus (two bits). Correct?

Correct.

> I didn't investigate this in much detail but it may affect quite some
> amount of code since TransactionIdDidCommit() and
> TransactionIdDidAbort() currently both deal with TransactionId, not
> FullTransactionId. IMO, this would be a nice change however, assuming
> we are ready for it.

Yep, it's a lot of code churn..

> In the previous version of the patch there was an attempt to derive
> FullTransactionId from TransactionId but it was wrong for the reasons
> named above in the thread. Thus it was removed and the patch
> simplified.

Yeah, it's tricky to get it right. Clearly we need to do it at some
point though.


All in all, this is a big effort. I spent some more time reviewing this 
in the last few days, and thought a lot about what the path forward here 
could be. And I haven't looked at the actual 64-bit XIDs patch set yet, 
just this patch to use 64-bit addressing in SLRUs.


This patch is the first step, but we have a bit of a chicken and egg 
problem, because this patch on its own isn't very interesting, but on 
the other hand, we need it to work on the follow up items. Here's how I 
see the development path for this (and again, this is just for the 
64-bit SLRUs work, not the bigger 64-bit-XIDs-in-heapam effort):


1. Use 64 bit page numbers in SLRUs (this patch)

I would like to make one change here: I'd prefer to keep the old 4-digit 
segment names, until we actually start to use the wider address space. 
Let's allow each SLRU to specify how many digits to use in the 
filenames, so that we convert one SLRU at a time.


If we do that, and don't change any of the existing SLRUs to actually 
use the wider space of page and segment numbers yet, this patch becomes 
just refactoring with no on-disk format changes. No pg_upgrade needed.


The next patches will start to make use of the wider address space, one 
SLRU at a time.


2. Use the larger segment file names in async.c, to lift the current 8 
GB limit on the max number of pending notifications.


No one actually minds the limit, it's quite generous as it is. But there 
is some code and complexity in async.c to avoid the wraparound that 
could be made simpler if we used longer SLRU segment names and avoided 
the wraparound altogether.


I wonder if we should actually add an artificial limit, as a GUC. If 
there are gigabytes of notifications queued up, something's probably 
wrong with the system, and you're not going to be happy if we just 
remove the limit so it can grow to terabytes until you run out of disk 
space.


3. Extend pg_multixact so that pg_multixact/members is addressed by 
64-bit offsets.


Currently, multi-XIDs can wrap around, requiring anti-wraparound 
freezing, but independently of that, the pg_multixact/members SLRU can 
also wrap around. We track both, and trigger anti-wraparound if either 
SLRU is about to wrap around. If we redefine MultiXactOffset as a 64-bit 
integer, we can avoid the pg_multixact/members wraparound altogether. A 
downside is that pg_multixact/offsets will take twice as much space, but 
I think that's a good tradeoff. Or perhaps we can play tricks like store 
a single 64-bit offset on each pg_multixact/offsets page, and a 32-bit 
offset from that for each XID, to avoid making it so much larger.


This would reduce the need to do anti-wraparound VACUUMs on systems that 
use multixacts heavily. Needs pg_upgrade support.


4. Extend pg_subtrans to 64-bits.

This isn't all that interesting because the active region of pg_subtrans 
cannot be wider than 32 bits anyway, because you'll still reach the 
general 32-bit XID wraparound. But it might be less confusing in some 
places.


I actually started to write a patch to do this, to see how complicated 
it is. It quickly proliferates into expanding other XIDs to 64-bits, 
like TransactionXmin, frozenXid calculation in vacuum.c, known-assigned 
XID tracking in procarray.c. etc. It's going to be necessary to convert 
32-bit XIDs to FullTransactionIds at some boundaries, and I'm not sure 
where exactly that should happen. It's easier to do the conversions 
close to subtrans.c, but then I'm not sure how much it gets us in terms 
of reducing confusion. It's easy to get confused with the epochs during 
conversions, as you noted. On the other hand, if we change much more of 
the backend to use FullTransactionIds, the patch becomes much more invasive.


Nice thing with pg_subtrans, though, is that it doesn't require 
pg_upgrade support.


5. Extend pg_xact to 64-bits.

Similar to pg_subtrans, really, but needs pg_upgrade support.

6. (a 

Re: add PROCESS_MAIN to VACUUM

2023-03-06 Thread Melanie Plageman
On Mon, Mar 06, 2023 at 09:37:23AM -0800, Nathan Bossart wrote:
> On Mon, Mar 06, 2023 at 04:51:46PM +0900, Michael Paquier wrote:
> > That was mostly OK for me, so applied after tweaking a couple of
> > places in the tests (extra explanations, for one), the comments and
> > the code.

I noticed in vacuum_rel() in vacuum.c where table_relation_vacuum() is
called, 4211fbd84 changes the else into an else if [1]. I understand
after reading the commit and re-reading the code why that is now, but I
was initially confused. I was thinking it might be nice to have a
comment mentioning why there is no else case here (i.e. that the main
table relation will be vacuumed on the else if branch).

- Melanie

[1] 
https://github.com/postgres/postgres/blob/master/src/backend/commands/vacuum.c#L2078




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-03-06 Thread Andres Freund
Hi,

On 2023-03-06 14:24:09 -0500, Melanie Plageman wrote:
> On Mon, Mar 06, 2023 at 11:09:19AM -0800, Andres Freund wrote:
> > On 2023-03-06 10:09:24 -0500, Melanie Plageman wrote:
> > > On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi
> > >  wrote:
> > > >
> > > > At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi 
> > > >  wrote in
> > > > > In any case, I think we need to avoid such concurrent 
> > > > > autovacuum/analyze.
> > > >
> > > > If it is correct, I believe the attached fix works.
> > > 
> > > Thanks for investigating this!
> > > 
> > > Yes, this fix looks correct and makes sense to me.
> > 
> > Wouldn't it be better to just perform the section from the ALTER TABLE till
> > the DROP TABLE in a transaction? Then there couldn't be any other accesses 
> > in
> > just that section. I'm not convinced it's good to disallow all concurrent
> > activity in other parts of the test.
> 
> You mean for test coverage reasons? Because the table in question only
> exists for a few operations in this test file.

That, but also because it's simply more reliable. autovacuum=off doesn't
protect against an anti-wraparound vacuum or such. Or a concurrent test somehow
triggering a read. Or ...

Greetings,

Andres Freund




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-03-06 Thread Melanie Plageman
On Mon, Mar 06, 2023 at 11:09:19AM -0800, Andres Freund wrote:
> Hi,
> 
> On 2023-03-06 10:09:24 -0500, Melanie Plageman wrote:
> > On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi 
> > >  wrote in
> > > > In any case, I think we need to avoid such concurrent 
> > > > autovacuum/analyze.
> > >
> > > If it is correct, I believe the attached fix works.
> > 
> > Thanks for investigating this!
> > 
> > Yes, this fix looks correct and makes sense to me.
> 
> Wouldn't it be better to just perform the section from the ALTER TABLE till
> the DROP TABLE in a transaction? Then there couldn't be any other accesses in
> just that section. I'm not convinced it's good to disallow all concurrent
> activity in other parts of the test.

You mean for test coverage reasons? Because the table in question only
exists for a few operations in this test file.

- Melanie




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-03-06 Thread Andres Freund
Hi,

On 2023-03-06 10:09:24 -0500, Melanie Plageman wrote:
> On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi 
> >  wrote in
> > > In any case, I think we need to avoid such concurrent autovacuum/analyze.
> >
> > If it is correct, I believe the attached fix works.
> 
> Thanks for investigating this!
> 
> Yes, this fix looks correct and makes sense to me.

Wouldn't it be better to just perform the section from the ALTER TABLE till
the DROP TABLE in a transaction? Then there couldn't be any other accesses in
just that section. I'm not convinced it's good to disallow all concurrent
activity in other parts of the test.

Greetings,

Andres Freund




Re: Ability to reference other extensions by schema in extension scripts

2023-03-06 Thread Gregory Stark (as CFM)
It looks like this patch needs a quick rebase, there's a conflict in
the meson.build.

I'll leave the state since presumably this would be easy to resolve
but it would be more likely to get attention if it's actually building
cleanly.

http://cfbot.cputube.org/patch_42_4023.log

On Tue, 28 Feb 2023 at 18:44, Sandro Santilli  wrote:
>
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, failed
> Implements feature:   tested, passed
> Spec compliant:   tested, passed
> Documentation:tested, passed
>
> I've applied the patch attached to message 
> https://www.postgresql.org/message-id/000401d94bc8%2448dff700%24da9fe500%24%40pcorp.us
>  (md5sum a7d45a32c054919d94cd4a26d7d07c20) onto current tip of the master 
> branch being 128dd9f9eca0b633b51ffcd5b0f798fbc48ec4c0
>
> The review written in 
> https://www.postgresql.org/message-id/20230228224608.ak7br5shev4wic5a%40c19 
> all still applies.
>
> The `make installcheck-world`  test fails for me but the failures seem 
> unrelated to the patch (many occurrences of "+ERROR:  function 
> pg_input_error_info(unknown, unknown) does not exist" in the regression.diff).
>
> Documentation exists for the new feature
>
> The new status of this patch is: Ready for Committer



-- 
Gregory Stark
As Commitfest Manager




Re: About default inBufSize (connection read buffer size) in libpq

2023-03-06 Thread Andres Freund
Hi,

On 2023-03-05 05:42:06 +0300, Трофимов Иван wrote:
> I was running some benchmarks for PG driver built on top of libpq async
> functionality,
> and noticed that recv syscalls issued by the application are limited by 16Kb,
> which seems to
> be inBufSize coming from makeEmptyPGconn in interfaces/libpq/fe-connect.c.
>  
> Hacking that to higher values allowed my benchmarks to issue drastically less
> syscalls
> when running some heavy selects, both in local and cloud environments, which
> made them
> significantly faster.
>  
> I believe there is a reason for that value to be 16Kb, but i was wondering if
> it's safe to change
> this default to user-provided value, and if it is - could this functionality 
> be
> added into API?

I've observed the small buffer size hurting as well - not just client side,
also on the server.

But I don't think we necessarily need to make it configurable. From what I can
tell the pain mainly comes from using the read/send buffers when they don't
even help, because the message data we're processing is bigger than the buffer
size. When we need to receive / send data that we know is bigger than the
buffer, we should copy the portion that is still in the buffer, and then
send/receive directly from the data to be sent/received.

Greetings,

Andres Freund




Re: Commitfest 2023-03 starting tomorrow!

2023-03-06 Thread Gregory Stark (as CFM)
Sorry, I wasn't feeling very well since Friday. I'm still not 100% but
I'm going to try to do some triage this afternoon.

There are a few patches that need a rebase. And a few patches failing
Meson builds or autoconf stages -- I wonder if there's something
unrelated broken there?

But what I think is really needed is for committers to pick up patches
that are ready to commit. There are currently two
patches with macdice marked as committer and one with michael-kun
(i.e. you:)

But what can we do to get some more patches picked up now instead of
at the end of the commitfest? Would it help if I started asking on
existing threads if there's a committer willing to take it up?

On Wed, 1 Mar 2023 at 20:27, Michael Paquier  wrote:
>
> On Wed, Mar 01, 2023 at 03:47:17PM +0900, Michael Paquier wrote:
> > The CF would begin in more or less 5 hours as of the moment of this
> > message:
> > https://www.timeanddate.com/time/zones/aoe
>
> Note: I have switched this CF as "In Process" a few hours ago.
> --
> Michael



-- 
Gregory Stark
As Commitfest Manager




Re: pg_rewind: Skip log directory for file type check like pg_wal

2023-03-06 Thread Soumyadeep Chakraborty
On Mon, Mar 6, 2023 at 12:28 AM Alexander Kukushkin  wrote:
>
> Hello Soumyadeep,
>
> The problem indeed exists, but IMO the "log" directory case must be handled 
> differently:
> 1. We don't need or I would even say we don't want to sync log files from the 
> new primary, because it destroys the actual logs, which could be very 
> important to figure out what has happened with the old primary

Yes, this can be solved by adding "log" to excludeDirContents. We did
this for GPDB.

> 2. Unlike "pg_wal", the "log" directory is not necessarily located inside 
> PGDATA. The actual value is configured using "log_directory" GUC, which just 
> happened to be "log" by default. And in fact actual values on source and 
> target could be different.

I think we only care about files/dirs inside the datadir. Anything
outside is out of scope for
pg_rewind AFAIU. We can only address the common case here. As mentioned in this
comment:

 * XXX: There is no backend function to get a symbolic link's target in
 * general, so if the admin has put any custom symbolic links in the data
 * directory, they won't be copied correctly.

There is not much we can do about custom configurations.

Regards,
Soumyadeep (VMware)




Re: add PROCESS_MAIN to VACUUM

2023-03-06 Thread Nathan Bossart
On Mon, Mar 06, 2023 at 04:51:46PM +0900, Michael Paquier wrote:
> That was mostly OK for me, so applied after tweaking a couple of
> places in the tests (extra explanations, for one), the comments and
> the code.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: HOT chain validation in verify_heapam()

2023-03-06 Thread Robert Haas
On Thu, Feb 9, 2023 at 12:09 PM Himanshu Upadhyaya
 wrote:
> Initially while implementing logic to identify the root of the HOT chain
> I was getting crashes and regression failures; at that time I thought of
> having this check along with a few other changes that were required,
> but you are right, it's unnecessary to add data to the predecessor
> array (in this case) and is not required. I am removing this from the patch.

I finally found time to look at this today -- apologies for the long
delay -- and I don't think that it addresses my objections. When I
proposed lp_valid, I had a very simple idea in mind: it tells you
whether or not the line pointer is, at some basic level, valid. Like,
it contains numbers that could point to a tuple on the page, at least
hypothetically. But that is something that can be determined strictly
by inspecting the line pointer, and yet you have
check_tuple_visibility() changing the value based on the visibility
status of xmin. So it seems that we still don't have a patch where the
value of a variable called lp_valid corresponds to whether or not the
L.P. is valid.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: running logical replication as the subscription owner

2023-03-06 Thread Jelte Fennema
> Yeah. As Andres pointed out somewhere or other, that also means you're
> decoding the WAL once per user instead of just once. I'm surprised
> that hasn't been cost-prohibitive.

We'd definitely prefer to have one subscription and do the decoding
only once. But we haven't run into big perf issues with the current
setup so far. We use it for non-blocking copying of shards (regular PG
tables under the hood). Most of the time is usually spent in the
initial copy phase, not the catchup. And also in practice our users
often only have one table owning user (and more than 5 table owning
users is extremely rare).
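(Roughly, per table owner we do something like the following -- names are
illustrative, not our actual internal commands:

    CREATE USER temp_sub_owner_1 SUPERUSER;
    CREATE SUBSCRIPTION sub_owner_1
        CONNECTION 'host=source dbname=db'
        PUBLICATION pub_owner_1;
    ALTER SUBSCRIPTION sub_owner_1 OWNER TO temp_sub_owner_1;
    ALTER USER temp_sub_owner_1 NOSUPERUSER;

plus the hooks mentioned above so the temporary owner cannot ALTER/DROP its
own subscription.)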




Re: running logical replication as the subscription owner

2023-03-06 Thread Robert Haas
On Fri, Mar 3, 2023 at 6:57 PM Jelte Fennema  wrote:
> I'm definitely in favor of making it easier to use logical replication
> in a safe manner.

Cool.

> In Citus we need to logically replicate and we're
> currently using quite some nasty and undocumented hacks to do so:
> We're creating a subscription per table owner, where each subscription
> is owned by a temporary user that has the same permissions as the
> table owner. These temporary users were originally superusers, because
> otherwise we cannot make them subscription owners, but once assigning
> a subscription to them we take away the superuser permissions from
> them[1]. And we also need to hook into ALTER/DELETE subscription
> commands to make sure that these temporary owners cannot edit their
> own subscription[2].
>
> Getting this right was not easy. And even it has the serious downside
> that we need multiple subscriptions/replication slots which causes
> extra complexity in various ways and it eats much more aggressively
> into the replication slot limits than we'd like. Having one
> subscription that could apply into tables that were owned by multiple
> users in a safe way would make this sooo much easier.

Yeah. As Andres pointed out somewhere or other, that also means you're
decoding the WAL once per user instead of just once. I'm surprised
that hasn't been cost-prohibitive.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Timeline ID hexadecimal format

2023-03-06 Thread Peter Eisentraut

On 03.03.23 16:52, Sébastien Lardière wrote:
> On 02/03/2023 09:12, Peter Eisentraut wrote:
>> On 24.02.23 17:27, Sébastien Lardière wrote:
>>> diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
>>> index be05a33205..7e26b51031 100644
>>> --- a/doc/src/sgml/backup.sgml
>>> +++ b/doc/src/sgml/backup.sgml
>>> @@ -1332,7 +1332,8 @@ restore_command = 'cp /mnt/server/archivedir/%f %p'
>>>      you like, add comments to a history file to record your own notes about
>>>      how and why this particular timeline was created.  Such comments will be
>>>      especially valuable when you have a thicket of different timelines as
>>> -    a result of experimentation.
>>> +    a result of experimentation. In both WAL segment file names and history files,
>>> +    the timeline ID number is expressed in hexadecimal.
>>
>> I think here it would be more helpful to show actual examples.  Like,
>> here is a possible file name, this is what the different parts mean.
>
> So you mean explain the WAL filename and the history filename ? Is it
> the good place for it ?

Well, your patch says, by the way, the timeline ID in the file is
hexadecimal.  Then one might ask, what file, what is a timeline, what
are the other numbers in the file, etc.  It seems very specific in this
context.  I don't know if the format of these file names is actually
documented somewhere.

>>> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
>>> index e5c41cc6c6..3b5d041d92 100644
>>> --- a/doc/src/sgml/config.sgml
>>> +++ b/doc/src/sgml/config.sgml
>>> @@ -4110,7 +4110,9 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
>>>      current when the base backup was taken.  The
>>>      value latest recovers
>>>      to the latest timeline found in the archive, which is useful in
>>> -    a standby server. latest is the default.
>>> +    a standby server. A numerical value expressed in hexadecimal must be
>>> +    prefixed with 0x, for example 0x11.
>>> +    latest is the default.
>>
>> This applies to all configuration parameters, so it doesn't need to be
>> mentioned explicitly for individual ones.
>
> Probably, but is there another parameter with the same consequence ?
> worth it to document this point globally ?

It's ok to mention it again.  We do something similar for example at
unix_socket_permissions.  But maybe with more context, like "If you want
to specify a timeline ID hexadecimal (for example, if extracted from a
WAL file name), then prefix it with a 0x".

>>> diff --git a/doc/src/sgml/ref/pg_waldump.sgml b/doc/src/sgml/ref/pg_waldump.sgml
>>> index 343f0482a9..4ae8f2ebdd 100644
>>> --- a/doc/src/sgml/ref/pg_waldump.sgml
>>> +++ b/doc/src/sgml/ref/pg_waldump.sgml
>>> @@ -215,7 +215,8 @@ PostgreSQL documentation
>>>      Timeline from which to read WAL records. The default is to use the
>>>      value in startseg, if that is specified; otherwise, the
>>> -    default is 1.
>>> +    default is 1. The value must be expressed in decimal, contrary to the hexadecimal
>>> +    value given in WAL segment file names and history files.
>>
>> Maybe this could be fixed instead?
>
> Indeed, and strtoul is probably a better option than sscanf, don't you
> think ?

Yeah, the use of sscanf() is kind of weird here.  We have been moving
the option parsing to use option_parse_int().  Maybe hex support could
be added there.  Or just use strtoul().





Re: Track IO times in pg_stat_io

2023-03-06 Thread Melanie Plageman
Thanks for the review!

On Tue, Feb 28, 2023 at 4:49 AM Drouvot, Bertrand
 wrote:
> On 2/26/23 5:03 PM, Melanie Plageman wrote:
> > As suggested in [1], the attached patch adds IO times to pg_stat_io;
>
> Thanks for the patch!
>
> I started to have a look at it and figured out that a tiny rebase was needed 
> (due to
> 728560db7d and b9f0e54bc9), so please find the rebase (aka V2) attached.

Thanks for doing that!

> > The timings will only be non-zero when track_io_timing is on
>
> That could lead to incorrect interpretation if one wants to divide the timing 
> per operations, say:
>
> - track_io_timing is set to on while there is already operations
> - or set to off while it was on (and the number of operations keeps growing)
>
> Might be worth to warn/highlight in the "track_io_timing" doc?

This is a good point. I've added a note to the docs for pg_stat_io.

> +   if (track_io_timing)
> +   {
> +   INSTR_TIME_SET_CURRENT(io_time);
> +   INSTR_TIME_SUBTRACT(io_time, io_start);
> +   pgstat_count_io_time(io_object, io_context, 
> IOOP_EXTEND, io_time);
> +   }
> +
> +
>  pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
>
> vs
>
> @@ -1042,6 +1059,7 @@ ReadBuffer_common(SMgrRelation smgr, char 
> relpersistence, ForkNumber forkNum,
>  INSTR_TIME_SUBTRACT(io_time, io_start);
>  
> pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
>  INSTR_TIME_ADD(pgBufferUsage.blk_read_time, 
> io_time);
> +   pgstat_count_io_time(io_object, io_context, 
> IOOP_READ, io_time);
>  }
>
> That leads to pgstat_count_io_time() to be called before pgstat_count_io_op() 
> (for the IOOP_EXTEND case) and
> after pgstat_count_io_op() (for the IOOP_READ case).
>
> What about calling them in the same order and so that pgstat_count_io_time() 
> is called before pgstat_count_io_op()?
>
> If so, the ordering would also need to be changed in:
>
> - FlushRelationBuffers()
> - register_dirty_segment()

Yes, good point. I've updated the code to use this suggested ordering in
attached v3.

> > There is one minor question (in the code as a TODO) which is whether or
> > not it is worth cross-checking that IO counts and times are either both
> > zero or neither zero in the validation function
> > pgstat_bktype_io_stats_valid().
> >
>
> As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that 
> would be a good idea
> to also check that if counts are not Zero then times are not Zero.

Yes, I think adding some validation around the relationship between
counts and timing should help prevent developers from forgetting to call
pgstat_count_io_op() when calling pgstat_count_io_time() (as relevant).

However, I think that we cannot check that IO times are non-zero whenever
IO counts are non-zero, because the user may not have track_io_timing
enabled. We can check the other direction: if IO times are not zero, IO
counts are not zero. I've done this in the attached v3.
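(With the patch applied, the timings can be read back next to the counts
they belong to, e.g.:

    SELECT backend_type, reads, read_time, writes, write_time
      FROM pg_stat_io
     WHERE io_context = 'normal' AND io_object = 'relation';

which mirrors the validation rule above: a non-zero time implies a non-zero
count, but not the other way around when track_io_timing is off.)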

- Melanie
From 52d997001108a52c833b339f9b8dcb3d34ed3270 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 6 Mar 2023 10:41:51 -0500
Subject: [PATCH v3] Track IO times in pg_stat_io

Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.

Reviewed-by: Bertrand Drouvot 
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
---
 doc/src/sgml/monitoring.sgml   | 59 
 src/backend/catalog/system_views.sql   |  4 ++
 src/backend/storage/buffer/bufmgr.c| 40 +--
 src/backend/storage/buffer/localbuf.c  | 14 
 src/backend/storage/smgr/md.c  | 47 ++---
 src/backend/utils/activity/pgstat_io.c | 96 +-
 src/backend/utils/adt/pgstatfuncs.c| 40 +--
 src/include/catalog/pg_proc.dat|  6 +-
 src/include/pgstat.h   |  5 +-
 src/test/regress/expected/rules.out|  6 +-
 10 files changed, 275 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6249bb50d0..ad3667f258 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3814,6 +3814,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   
  
 
+ 
+  
+   
+read_time double precision
+   
+   
+Time spent in read operations in milliseconds (if  is enabled, otherwise zero)
+   
+  
+ 
+
  
   

@@ -3826,6 +3838,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   
  
 
+ 
+  
+   
+write_time double precision
+   
+   
+Time spent in write operations in milliseconds (if  is enabled, otherwise zero)
+   
+ 

Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-06 Thread Önder Kalacı
Hi all,

Thanks for working on this.


> I imagine that a typical use case would be to set min_send_delay to
> several hours to days. I'm concerned that it could not be an
> acceptable trade-off for many users that the system cannot collect any
> garbage during that.
>

I'm not too worried about the WAL recycling; that mostly looks like
a documentation issue to me, and it is not a problem PG users are
unfamiliar with. Also, even if creating/altering subscriptions is one
day relaxed to be allowed for regular users, one option could be to
require this setting to be changed by a superuser. That would alleviate
my concern regarding WAL recycling: a superuser should be able to
monitor the system and adjust the settings/hardware accordingly.

However, VACUUM being blocked by replication with a configuration
change on the subscription sounds more concerning to me. Blocking
VACUUM for hours could quickly escalate to performance problems.

On the other hand, we already have a similar problem with
recovery_min_apply_delay combined with hot_standby_feedback [1].
So, that probably is an acceptable trade-off for the pgsql-hackers.
If you use this feature, you should be even more careful.
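(For the archives, the usage under discussion is along these lines -- the
exact option spelling is per the proposed patch, which is not committed:

    CREATE SUBSCRIPTION delayed_sub
        CONNECTION '...' PUBLICATION pub
        WITH (min_send_delay = '4h');
)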


> I think we can have the apply process write the decoded changes
> somewhere on the disk (as not temporary files) and return the flush
> LSN so that the apply worker can apply them later and the publisher
> can advance slot's LSN. The feature would be more complex but from the
> user perspective it would be better.
>

Yes, this might well be one of the ideal solutions to the problem at
hand. But my current guess is that it'd be a non-trivial change with
many different concurrency/failure scenarios. So, I'm not sure if that
is going to be a realistic patch to pursue.


Thanks,
Onder KALACI



[1] PostgreSQL 15 Documentation, 20.6. Replication:
https://www.postgresql.org/docs/15/runtime-config-replication.html



Re: a very minor bug and a couple of comment changes for basebackup.c

2023-03-06 Thread Robert Haas
Thanks for the review. I have committed the patches.

On Thu, Mar 2, 2023 at 2:59 AM Michael Paquier  wrote:
> Seems right, I think that you should backpatch that as
> VERIFY_CHECKSUMS is the default.

Done.

> There is more to it: the page LSN is checked before its checksum.
> Hence, if the page's LSN is corrupted in a way that it is higher than
> sink->bbs_state->startptr, the checksum verification is just skipped
> while the page is broken but not reported as such.  (Spoiler: this has
> been mentioned in the past, and maybe we'd better remove this stuff in
> its entirety.)

Yep. It's pretty embarrassing that we have a checksum verification
feature that has both known ways of producing false positives and
known ways of producing false negatives and we have no plan to ever
fix that, we're just going to keep shipping what we've got. I think
it's pretty clear that the feature shouldn't have been committed like
this; valid criticisms of the design were offered and simply not
addressed, not even by updating the comments or documentation with
disclaimers. I find the code in sendFile() pretty ugly, too. For all
of that, I'm a bit uncertain whether ripping it out is the right thing
to do. It might be (I'm not sure) that it tends to work decently well
in practice. Like, yes, it could produce false checksum warnings, but
does that actually happen to people? It's probably not too likely that
a read() or write() of 8kB gets updated after doing only part of the
I/O, so the concurrent read or write is fairly likely to be on-CPU, I
would guess, and therefore this algorithm might kind of work OK in
practice despite being garbage on a theoretical level. Similarly, the
problems with how the LSN is vetted make it likely that a page
replaced with random garbage will go undetected, but a page where a
few bytes get flipped in a random position within the page is likely
to get caught, and maybe the latter is actually a bigger risk than the
former. I don't really know. I'd be interested to hear from anyone
with a lot of practical experience using the feature. A few anecdotes
of the form "this routine fails to tell us about problems" or "this
often complains about problems that are not real" or "this has really
helped us out on a number of occasions and we have had no problems
with it" would be really helpful.

On the other hand, we could just say that the code is nonsense and
therefore, regardless of practical experience, it ought to be removed.
I'm somewhat open to that idea, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Allow tests to pass in OpenSSL FIPS mode

2023-03-06 Thread Daniel Gustafsson
> On 6 Mar 2023, at 15:55, Tom Lane  wrote:
> Daniel Gustafsson  writes:

>> For readers without all context, wouldn't it be better to encode in the
>> function name why we're not just calling a hash like md5?  Something like
>> fips_allowed_hash() or similar?
> 
> I'd prefer shorter than that --- all these queries are laid out on the
> expectation of a very short function name.  Maybe "fipshash()"?
> 
> We could make the comment introducing the function declarations more
> elaborate, too.

fipshash() with explanatory comments sounds like a good idea.
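Something like the following, say -- a sketch of one way to do it,
truncating sha256 output to the same 32 hex digits md5 produces, not
necessarily what will be committed:

    create function fipshash(bytea)
        returns text strict immutable leakproof parallel safe
        return substr(encode(sha256($1), 'hex'), 1, 32);
    create function fipshash(text)
        returns text strict immutable leakproof parallel safe
        return fipshash(convert_to($1, 'UTF8'));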

--
Daniel Gustafsson




