Re: "ERROR: latch already owned" on gharial

2024-02-09 Thread Soumyadeep Chakraborty
Hey,

Deeply appreciate input from both of you!

On Thu, Feb 8, 2024 at 4:57 AM Heikki Linnakangas  wrote:
> Hmm, there is a pair of SpinLockAcquire() and SpinLockRelease() in
> ProcKill(), before step 3 can happen. Comment in spin.h about
> SpinLockAcquire/Release:
>
> >  *Load and store operations in calling code are guaranteed not to be
> >  *reordered with respect to these operations, because they include a
> >  *compiler barrier.  (Before PostgreSQL 9.5, callers needed to use a
> >  *volatile qualifier to access data protected by spinlocks.)
>
> That talks about a compiler barrier, though, not a memory barrier. But
> looking at the implementations in s_lock.h, I believe they do act as
> memory barrier, too.
>
> So you might indeed have that problem on 9.4, but AFAICS not on later
> versions.

Yes, 9.4 does not have 0709b7ee72e, which I assume is what you are referring to?

Reading src/backend/storage/lmgr/README.barrier: For x86, to avoid reordering
between a load and a store, we need something that prevents both CPU and
compiler reordering. pg_memory_barrier() fits the bill.

Here we can have pid X's read of latch->owner_pid=Y reordered to precede
pid Y's store of latch->owner_pid = 0. The compiler barrier in S_UNLOCK() will
prevent compiler reordering but not CPU reordering of the above.

#define S_UNLOCK(lock) \
do { __asm__ __volatile__("" : : : "memory");  *(lock) = 0; } while (0)
whose asm statement is equivalent to:
#define pg_compiler_barrier_impl() __asm__ __volatile__("" ::: "memory")

But maybe both compiler and CPU reordering will be prevented by the tas() in
S_LOCK(), which does a locked xchgb?

Is the above acting as BOTH a compiler and CPU barrier, like the lock; addl
stuff in pg_memory_barrier_impl()?
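
For reference, the x86-64 flavor of pg_memory_barrier_impl() (in
src/include/port/atomics/arch-x86.h; quoting from memory, so worth
double-checking) is a locked add to the stack pointer, which acts as both a
compiler and a CPU barrier:

#define pg_memory_barrier_impl()	\
	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory")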

If yes, then the picture would look like this:

Pid Y in DisownLatch(), Pid X in OwnLatch()

Y: LOAD latch->ownerPid
...
Y: STORE latch->ownerPid = 0
...
// returning PGPROC to freeList
Y: S_LOCK(ProcStructLock) <--- tas() prevents X's LOAD of latch->ownerPid
                               from preceding this
...
... <--- X's LOAD of latch->ownerPid can't get here anyway, as the spinlock is held
...
Y: S_UNLOCK(ProcStructLock)
...
X: S_LOCK(ProcStructLock) // to retrieve PGPROC from freeList
...
X: S_UNLOCK(ProcStructLock)
...
X: LOAD latch->ownerPid

And this issue is not caused by 9.4 missing 0709b7ee72e, which exclusively
changed S_UNLOCK.

If no, then we would need the patch that does an explicit pg_memory_barrier()
at the end of DisownLatch() for PG HEAD.

On Thu, Feb 8, 2024 at 1:41 PM Andres Freund  wrote:

> Right.  I wonder if the issue istead could be something similar to what was
> fixed in 8fb13dd6ab5b and more generally in 97550c0711972a. If two procs go
> through proc_exit() for the same process, you can get all kinds of weird
> mixed up resource ownership.  The bug fixed in 8fb13dd6ab5b wouldn't apply,
> but it's pretty easy to introduce similar bugs in other places, so it seems
> quite plausible that greenplum might have done so.  We also did have more
> proc_exit()s in signal handlers in older branches, so it might just be an
> issue that also was present before.

Hmm, the pids X and Y in the example provided upthread don't spawn off any
children (like by calling system()) - they are just regular backends. So it's
not possible for them to receive TERM and try to proc_exit() w/ the same
PGPROC. So that is not the issue, I guess?

Regards,
Soumyadeep (VMware)




Re: "ERROR: latch already owned" on gharial

2024-02-07 Thread Soumyadeep Chakraborty
Hey hackers,

I wanted to report that we have seen this issue (with the procLatch) a few
times very sporadically on Greenplum 6X (based on 9.4), with relatively newer
versions of GCC.

I realize that 9.4 is out of support, so this email is purely to add on to the
existing thread, in case the info can help fix/reveal something in supported
versions.

Unfortunately, we don't have a core to share as we don't have the benefit of
commit [1] in Greenplum 6X, but we do possess commit [2] which gives us an elog
ERROR as opposed to PANIC.

Instance 1:

Event 1: 2023-11-13 10:01:31.927168 CET..., pY,
..."LOG","0","disconnection: session time: ..."
Event 2: 2023-11-13 10:01:32.049135 CET..., pX,
"FATAL","XX000","latch already owned by pid Y (is_set: 0) (pg_latch.c:159)",,,0,,
"pg_latch.c",159,"Stack trace:
1 0xbde8b8 postgres errstart (elog.c:567)
2 0xbe0768 postgres elog_finish (discriminator 7)
3 0xa08924 postgres  (pg_latch.c:158) <-- OwnLatch
4 0xa7f179 postgres InitProcess (proc.c:523)
5 0xa94ac3 postgres PostgresMain (postgres.c:4874)
6 0xa1e2ed postgres  (postmaster.c:2860)
7 0xa1f295 postgres PostmasterMain (discriminator 5)
...
"LOG","0","server process (PID Y) exited with exit code
1",,,0,,"postmaster.c",3987,

Instance 2 (was reported with (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)):

Exactly the same as Instance 1 with identical log, ordering of events and stack
trace, except this time (is_set: 1) when the ERROR is logged.

A possible ordering of events:

(1) DisownLatch() is called by pid Y during ProcKill() and the write for
latch->owner_pid = 0 is NOT yet flushed to shmem.

(2) The PGPROC object for pid Y is returned to the free list.

(3) Pid X sees the same PGPROC object on the free list and grabs it.

(4) Pid X does the sanity check inside OwnLatch() during InitProcess() and
still sees the old value of latch->owner_pid = Y (and not 0), tripping the
ERROR.
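
For context, the sanity check in OwnLatch() that trips here looks roughly
like the following (the exact message and error level differ across
versions and with commit [2]):

	/* Sanity checks */
	Assert(latch->is_shared);
	if (latch->owner_pid != 0)
		elog(ERROR, "latch already owned by pid %d (is_set: %d)",
			 latch->owner_pid, (int) latch->is_set);
	latch->owner_pid = MyProcPid;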

The above sequence of operations should apply to PG HEAD as well.

Suggestion:

Should we do a pg_memory_barrier() at the end of DisownLatch(), like the one
introduced in ResetLatch() by [3]? This would ensure that the write
latch->owner_pid = 0 is flushed to shmem. The attached patch does this.
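
For reference, after [3], ResetLatch() ends with roughly:

	latch->is_set = false;

	/*
	 * Ensure that the write to is_set gets flushed to shared memory before
	 * we examine any flag variables.
	 */
	pg_memory_barrier();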

I'm not sure why we didn't introduce a memory barrier in DisownLatch() in [3].
I didn't find anything in the associated hackers thread [4] either. Was it the
performance impact, or was it just because SetLatch and ResetLatch were more
racy and this is way less likely to happen?

This is out of my wheelhouse, but would one additional barrier in a process'
lifecycle be that bad for performance?

Appendix:

Build details: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)

CFLAGS=-Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv
-fexcess-precision=standard -fno-aggressive-loop-optimizations
-Wno-unused-but-set-variable -Wno-address -Werror=implicit-fallthrough=3
-Wno-format-truncation -Wno-stringop-truncation -m64 -O3
-fargument-noalias-global -fno-omit-frame-pointer -g -std=gnu99
-Werror=uninitialized -Werror=implicit-function-declaration

Regards,
Soumyadeep (VMware)

[1] 
https://github.com/postgres/postgres/commit/12e28aac8e8eb76cab13a4e9b696e3dab17f1c99
[2] 
https://github.com/greenplum-db/gpdb/commit/81fdd6c5219af865e9dc41f4087e0405d6616050
[3] 
https://github.com/postgres/postgres/commit/14e8803f101a54d99600683543b0f893a2e3f529
[4] 
https://www.postgresql.org/message-id/flat/20150112154026.GB2092%40awork2.anarazel.de
From 9702054c37aa80cb60a16b2d80a475e9b27b087a Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Wed, 7 Feb 2024 18:04:43 -0800
Subject: [PATCH v1 1/1] Use memory barrier in DisownLatch

Failing to do so may cause another process undergoing startup to see our
pid associated with the latch during OwnLatch's sanity check.
---
 src/backend/storage/ipc/latch.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index d8a69990b3a..332c5e0b2cf 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -493,6 +493,12 @@ DisownLatch(Latch *latch)
 	Assert(latch->owner_pid == MyProcPid);
 
 	latch->owner_pid = 0;
+
+	/*
+	 * Ensure that another backend reusing this PGPROC during startup sees that
+	 * owner_pid = 0 and doesn't blow up in OwnLatch().
+	 */
+	pg_memory_barrier();
 }
 
 /*
-- 
2.34.1



Re: brininsert optimization opportunity

2023-12-13 Thread Soumyadeep Chakraborty
On Tue, Dec 12, 2023 at 3:25 AM Tomas Vondra
 wrote:


> The attached 0001 patch fixes this by adding the call to validate_index,
> which seems like the proper place as it's where the indexInfo is
> allocated and destroyed.

Yes, and by reading validate_index's header comment, there is a clear
expectation that we will be adding tuples to the index in the table AM call
table_index_validate_scan (albeit "validating" doesn't necessarily convey this
side effect). So, it makes perfect sense to call it here.


> But this makes me think - are there any other places that might call
> index_insert without then also doing the cleanup? I did grep for the
> index_insert() calls, and it seems OK except for this one.
>
> There's a call in toast_internals.c, but that seems OK because that only
> deals with btree indexes (and those don't need any cleanup). The same
> logic applies to unique_key_recheck(). The rest goes through
> execIndexing.c, which should do the cleanup in ExecCloseIndices().
>
> Note: We should probably call the cleanup even in the btree cases, even
> if only to make it clear it needs to be called after index_insert().

Agreed. Doesn't feel great, but yeah, all of this btree-specific code does call
through index_* functions, instead of calling btree functions directly. So,
ideally we should follow through with that pattern and call the cleanup
explicitly. But then we are introducing per-tuple overhead that is totally
wasted. Maybe we can add a comment instead like:

void
toast_close_indexes(Relation *toastidxs, int num_indexes, LOCKMODE lock)
{
	int			i;

	/*
	 * Save a few cycles by choosing not to call index_insert_cleanup(). Toast
	 * indexes are btree, which don't have an aminsertcleanup() anyway.
	 */

	/* Close relations and clean up things */
	...
}

And add something similar for unique_key_recheck()? That said, I am also not
opposed to just accepting these wasted cycles, if the commenting seems wonky.

> I propose we do a much simpler thing instead - allow the cache may be
> initialized / cleaned up repeatedly, and make sure it gets reset at
> convenient place (typically after index_insert calls that don't go
> through execIndexing). That'd mean the cleanup does not need to happen
> very far from the index_insert(), which makes the reasoning much easier.
> 0002 does this.

That kind of goes against the ethos of the ii_AmCache, which is meant to capture
state to be used across multiple index inserts. Also, quoting the current docs:

"If the index AM wishes to cache data across successive index insertions
within an SQL statement, it can allocate space
in indexInfo->ii_Context and store a pointer to the
data in indexInfo->ii_AmCache (which will be NULL
initially). After the index insertions complete, the memory will be freed
automatically. If additional cleanup is required (e.g. if the cache contains
buffers and tuple descriptors), the AM may define an optional function
indexinsertcleanup, called before the memory is released."

The memory will be freed automatically (as soon as ii_Context goes away). So,
why would we explicitly free it, like in the attached 0002 patch? And the whole
point of the cleanup function is to do something other than freeing memory, as
the doc note above highlights so well.

Also, the 0002 patch does introduce per-tuple function call overhead in
heapam_index_validate_scan().

Also, now we have a split brain: in certain situations we want
index_insert_cleanup() called per index insert, and in certain cases we would
like it called once for all inserts. Not very easy to understand, IMO.

Finally, I don't think that any index AM would have anything to clean up after
every insert.

I had tried (abused) an approach with MemoryContextCallbacks upthread and that
seems a no-go. And yes I agree, having a dual to makeIndexInfo() (like
destroyIndexInfo()) means that we lose the benefits of ii_Context. That could
be disruptive to existing AMs in-core and outside.

All said and done, v1 has my vote.

Regards,
Soumyadeep (VMware)




Re: brininsert optimization opportunity

2023-11-26 Thread Soumyadeep Chakraborty
On Sun, Nov 26, 2023 at 9:28 PM Richard Guo  wrote:
>
>
> On Sun, Nov 26, 2023 at 4:06 AM Tomas Vondra wrote:
>>
>> I've done a bit more cleanup on the last version of the patch (renamed
>> the fields to start with bis_ as agreed, rephrased the comments / docs /
>> commit message a bit) and pushed.
>
>
> It seems that we have an oversight in this commit.  If there is no tuple
> that has been inserted, we wouldn't have an available insert state in
> the clean up phase.  So the Assert in brininsertcleanup() is not always
> right.  For example:
>
> regression=# update brin_summarize set value = brin_summarize.value;
> server closed the connection unexpectedly
>

I wasn't able to repro the issue on 86b64bafc19c4c60136a4038d2a8d1e6eecc59f2
with UPDATE/INSERT:

postgres=# drop table a;
DROP TABLE
postgres=# create table a(i int);
CREATE TABLE
postgres=# create index on a using brin(i);
CREATE INDEX
postgres=# insert into a select 1 where 1!=1;
INSERT 0 0
postgres=# update a set i = 2 where i = 0;
UPDATE 0
postgres=# update a set i = a.i;
UPDATE 0

This could be because, since c5b7ba4e67aeb5d6f824b74f94114d99ed6e42b7, we
have moved ExecOpenIndices() from ExecInitModifyTable() to ExecInsert().
Since we never open the indices if nothing is inserted, we never attempt to
close them with ExecCloseIndices() while the ii_AmCache is NULL (which is
what causes this assertion failure).

However, it is possible to repro the issue with:
# create empty file
touch /tmp/a.csv
postgres=# create table a(i int);
CREATE TABLE
postgres=# create index on a using brin(i);
CREATE INDEX
postgres=# copy a from '/tmp/a.csv';
TRAP: failed Assert("bistate"), File: "brin.c", Line: 362, PID: 995511

> So I wonder if we should check 'bistate' and do the clean up only if
> there is an available one, something like below.

Yes, this is the right way to go. Thanks for reporting!
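
For the record, a guard along those lines (a sketch, not necessarily the
exact snippet proposed) could look like:

static void
brininsertcleanup(IndexInfo *indexInfo)
{
	BrinInsertState *bistate = (BrinInsertState *) indexInfo->ii_AmCache;

	/* Bail out if no BrinInsertState was ever initialized. */
	if (bistate == NULL)
		return;

	brinRevmapTerminate(bistate->bs_rmAccess);
	brin_free_desc(bistate->bs_desc);
}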

Regards,
Soumyadeep (VMware)




Re: brininsert optimization opportunity

2023-11-25 Thread Soumyadeep Chakraborty
Thanks a lot for reviewing and pushing! :)

Regards,
Soumyadeep (VMware)




Re: brininsert optimization opportunity

2023-11-04 Thread Soumyadeep Chakraborty
On Fri, 3 Nov 2023 at 19:37, Tomas Vondra  wrote:

> The one thing I'm not entirely sure about is adding new stuff to the
> IndexAmRoutine. I don't think we want to end up with too many callbacks
> that all AMs have to initialize etc. I can't think of a different/better
> way to do this, though.

Yes, there is not really an alternative. Also, aminsertcleanup() is very similar
to amvacuumcleanup(), so it is not awkward. Why should vacuum be an
exclusive VIP? :)
And there are other indexam callbacks that not every AM implements. So this
addition is not unprecedented in that sense.

> Barring objections, I'll try to push this early next week, after another
> round of cleanup.

Many thanks for resurrecting this patch!

On Fri, Nov 3, 2023 at 12:16 PM Matthias van de Meent
 wrote:

>
> I do think we should choose a better namespace than bs_* for the
> fields of BrinInsertState, as BrinBuildState already uses the bs_*
> namespace for its fields in the same file, but that's only cosmetic.
>

bis_* then.

Regards,
Soumyadeep (VMware)




Re: brininsert optimization opportunity

2023-09-04 Thread Soumyadeep Chakraborty
Rebased against latest HEAD.

Regards,
Soumyadeep (VMware)
From 94a8acd3125aa4a613c238e435ed78dba9f40625 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Sat, 29 Jul 2023 09:17:49 -0700
Subject: [PATCH v5 1/1] Reuse revmap and brin desc in brininsert

brininsert() used to have code that performed per-tuple initialization
of the revmap. That had some overhead.

To mitigate, we introduce a new AM callback to clean up state at the end
of all inserts, and perform the revmap cleanup there.
---
 contrib/bloom/blutils.c  |  1 +
 doc/src/sgml/indexam.sgml| 14 +-
 src/backend/access/brin/brin.c   | 74 +++-
 src/backend/access/gin/ginutil.c |  1 +
 src/backend/access/gist/gist.c   |  1 +
 src/backend/access/hash/hash.c   |  1 +
 src/backend/access/index/indexam.c   | 15 ++
 src/backend/access/nbtree/nbtree.c   |  1 +
 src/backend/access/spgist/spgutils.c |  1 +
 src/backend/executor/execIndexing.c  |  5 ++
 src/include/access/amapi.h   |  3 ++
 src/include/access/brin_internal.h   |  1 +
 src/include/access/genam.h   |  2 +
 13 files changed, 108 insertions(+), 12 deletions(-)

diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index f23fbb1d9e0..4830cb3fee6 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -131,6 +131,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = blbuild;
 	amroutine->ambuildempty = blbuildempty;
 	amroutine->aminsert = blinsert;
+	amroutine->aminsertcleanup = NULL;
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index e813e2b620a..17f7d6597f1 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -139,6 +139,7 @@ typedef struct IndexAmRoutine
 ambuild_function ambuild;
 ambuildempty_function ambuildempty;
 aminsert_function aminsert;
+aminsertcleanup_function aminsertcleanup;
 ambulkdelete_function ambulkdelete;
 amvacuumcleanup_function amvacuumcleanup;
 amcanreturn_function amcanreturn;   /* can be NULL */
@@ -355,11 +356,22 @@ aminsert (Relation indexRelation,
within an SQL statement, it can allocate space
in indexInfo-ii_Context and store a pointer to the
data in indexInfo-ii_AmCache (which will be NULL
-   initially).
+   initially). If the data cached contains structures that can be simply pfree'd,
+   they will implicitly be pfree'd. But, if it contains more complex data, such
+   as Buffers or TupleDescs, additional cleanup is necessary. This additional
+   cleanup can be performed in indexinsertcleanup.
   
 
   
 
+bool
+aminsertcleanup (IndexInfo *indexInfo);
+
+   Clean up state that was maintained across successive inserts in
+   indexInfo-ii_AmCache. This is useful if the data
+   contained is complex - like Buffers or TupleDescs which need additional
+   cleanup, unlike simple structures that will be implicitly pfree'd.
+
 IndexBulkDeleteResult *
 ambulkdelete (IndexVacuumInfo *info,
   IndexBulkDeleteResult *stats,
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d4fec654bb6..76927beb0ec 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -58,6 +58,17 @@ typedef struct BrinBuildState
 	BrinMemTuple *bs_dtuple;
 } BrinBuildState;
 
+/*
+ * We use a BrinInsertState to capture running state spanning multiple
+ * brininsert invocations, within the same command.
+ */
+typedef struct BrinInsertState
+{
+	BrinRevmap *bs_rmAccess;
+	BrinDesc   *bs_desc;
+	BlockNumber bs_pages_per_range;
+}			BrinInsertState;
+
 /*
  * Struct used as "opaque" during index scans
  */
@@ -72,6 +83,7 @@ typedef struct BrinOpaque
 
 static BrinBuildState *initialize_brin_buildstate(Relation idxRel,
   BrinRevmap *revmap, BlockNumber pagesPerRange);
+static BrinInsertState *initialize_brin_insertstate(Relation idxRel, IndexInfo *indexInfo);
 static void terminate_brin_buildstate(BrinBuildState *state);
 static void brinsummarize(Relation index, Relation heapRel, BlockNumber pageRange,
 		  bool include_partial, double *numSummarized, double *numExisting);
@@ -117,6 +129,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = brinbuild;
 	amroutine->ambuildempty = brinbuildempty;
 	amroutine->aminsert = brininsert;
+	amroutine->aminsertcleanup = brininsertcleanup;
 	amroutine->ambulkdelete = brinbulkdelete;
 	amroutine->amvacuumcleanup = brinvacuumcleanup;
 	amroutine->amcanreturn = NULL;
@@ -140,6 +153,28 @@ brinhandler(PG_FUNCTION_ARGS)
 	PG_RETURN_POINTER(amroutine);
 }
 
+/*
+ * Initialize a BrinInsertState to maintain state to be used across multiple
+ * tuple inserts, within the same command.
+ */
+static BrinInsertState *
+initialize_brin_insertstate(Relation idxRel, IndexInfo *indexInfo)
+{
+	BrinInse

Re: brininsert optimization opportunity

2023-08-05 Thread Soumyadeep Chakraborty
Created an entry for the Sep CF: https://commitfest.postgresql.org/44/4484/

Regards,
Soumyadeep (VMware)

On Sat, Jul 29, 2023 at 9:28 AM Soumyadeep Chakraborty
 wrote:
>
> Attached v4 of the patch, rebased against latest HEAD.
>
> Regards,
> Soumyadeep (VMware)




Re: brininsert optimization opportunity

2023-07-29 Thread Soumyadeep Chakraborty
Attached v4 of the patch, rebased against latest HEAD.

Regards,
Soumyadeep (VMware)
From b5fb12fbb8b0c1b2963a05a2877b5063bbc75f58 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Sat, 29 Jul 2023 09:17:49 -0700
Subject: [PATCH v4 1/1] Reuse revmap and brin desc in brininsert

brininsert() used to have code that performed per-tuple initialization
of the revmap. That had some overhead.

To mitigate, we introduce a new AM callback to clean up state at the end
of all inserts, and perform the revmap cleanup there.
---
 contrib/bloom/blutils.c  |  1 +
 doc/src/sgml/indexam.sgml| 14 +-
 src/backend/access/brin/brin.c   | 74 +++-
 src/backend/access/gin/ginutil.c |  1 +
 src/backend/access/gist/gist.c   |  1 +
 src/backend/access/hash/hash.c   |  1 +
 src/backend/access/index/indexam.c   | 15 ++
 src/backend/access/nbtree/nbtree.c   |  1 +
 src/backend/access/spgist/spgutils.c |  1 +
 src/backend/executor/execIndexing.c  |  5 ++
 src/include/access/amapi.h   |  3 ++
 src/include/access/brin_internal.h   |  1 +
 src/include/access/genam.h   |  2 +
 13 files changed, 108 insertions(+), 12 deletions(-)

diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index d935ed8fbdf..9720f69de09 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -131,6 +131,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = blbuild;
 	amroutine->ambuildempty = blbuildempty;
 	amroutine->aminsert = blinsert;
+	amroutine->aminsertcleanup = NULL;
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index e813e2b620a..17f7d6597f1 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -139,6 +139,7 @@ typedef struct IndexAmRoutine
 ambuild_function ambuild;
 ambuildempty_function ambuildempty;
 aminsert_function aminsert;
+aminsertcleanup_function aminsertcleanup;
 ambulkdelete_function ambulkdelete;
 amvacuumcleanup_function amvacuumcleanup;
 amcanreturn_function amcanreturn;   /* can be NULL */
@@ -355,11 +356,22 @@ aminsert (Relation indexRelation,
within an SQL statement, it can allocate space
in indexInfo-ii_Context and store a pointer to the
data in indexInfo-ii_AmCache (which will be NULL
-   initially).
+   initially). If the data cached contains structures that can be simply pfree'd,
+   they will implicitly be pfree'd. But, if it contains more complex data, such
+   as Buffers or TupleDescs, additional cleanup is necessary. This additional
+   cleanup can be performed in indexinsertcleanup.
   
 
   
 
+bool
+aminsertcleanup (IndexInfo *indexInfo);
+
+   Clean up state that was maintained across successive inserts in
+   indexInfo-ii_AmCache. This is useful if the data
+   contained is complex - like Buffers or TupleDescs which need additional
+   cleanup, unlike simple structures that will be implicitly pfree'd.
+
 IndexBulkDeleteResult *
 ambulkdelete (IndexVacuumInfo *info,
   IndexBulkDeleteResult *stats,
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3c6a956eaa3..2da59f2ab03 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -58,6 +58,17 @@ typedef struct BrinBuildState
 	BrinMemTuple *bs_dtuple;
 } BrinBuildState;
 
+/*
+ * We use a BrinInsertState to capture running state spanning multiple
+ * brininsert invocations, within the same command.
+ */
+typedef struct BrinInsertState
+{
+	BrinRevmap *bs_rmAccess;
+	BrinDesc   *bs_desc;
+	BlockNumber bs_pages_per_range;
+}			BrinInsertState;
+
 /*
  * Struct used as "opaque" during index scans
  */
@@ -72,6 +83,7 @@ typedef struct BrinOpaque
 
 static BrinBuildState *initialize_brin_buildstate(Relation idxRel,
   BrinRevmap *revmap, BlockNumber pagesPerRange);
+static BrinInsertState *initialize_brin_insertstate(Relation idxRel, IndexInfo *indexInfo);
 static void terminate_brin_buildstate(BrinBuildState *state);
 static void brinsummarize(Relation index, Relation heapRel, BlockNumber pageRange,
 		  bool include_partial, double *numSummarized, double *numExisting);
@@ -117,6 +129,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = brinbuild;
 	amroutine->ambuildempty = brinbuildempty;
 	amroutine->aminsert = brininsert;
+	amroutine->aminsertcleanup = brininsertcleanup;
 	amroutine->ambulkdelete = brinbulkdelete;
 	amroutine->amvacuumcleanup = brinvacuumcleanup;
 	amroutine->amcanreturn = NULL;
@@ -140,6 +153,28 @@ brinhandler(PG_FUNCTION_ARGS)
 	PG_RETURN_POINTER(amroutine);
 }
 
+/*
+ * Initialize a BrinInsertState to maintain state to be used across multiple
+ * tuple inserts, within the same command.
+ */
+static BrinInsertState *
+initialize_brin_insertstate(Relation idxRel, IndexInfo *indexInfo)
+

Re: brininsert optimization opportunity

2023-07-05 Thread Soumyadeep Chakraborty
On Wed, Jul 5, 2023 at 3:16 AM Tomas Vondra
 wrote:

> > I'll try this out and introduce a couple of new index AM callbacks. I
> > think it's best to do it before releasing the locks - otherwise it
> > might be weird to manipulate buffers of an index relation, without
> > having some sort of lock on it. I'll think about it some more.
> >
>
> I don't understand why would this need more than just a callback to
> release the cache.

We wouldn't. I thought that it would be slightly cleaner and slightly more
performant if we moved the if (!state) branches out of the XXXinsert()
functions. But I guess, let's minimize the changes here. One cleanup
callback is enough.
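
For clarity, the branch in question in brininsert() looks roughly like:

	BrinInsertState *bistate = (BrinInsertState *) indexInfo->ii_AmCache;

	if (bistate == NULL)
		bistate = initialize_brin_insertstate(idxRel, indexInfo);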

> > PS: It should be possible to make GIN and GiST use the new index AM APIs
> > as well.
> >
>
> Why should GIN/GiST use the new API? I think it's perfectly sensible to
> only require the "cleanup callback" when just pfree() is not enough.

Yeah no need.

Attached v3 of the patch w/ a single index AM callback.

Regards,
Soumyadeep (VMware)
From 563b905da5c2602113044689e37f8385cbfbde64 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Wed, 5 Jul 2023 10:59:11 -0700
Subject: [PATCH v3 1/1] Reuse revmap and brin desc in brininsert

brininsert() used to have code that performed per-tuple initialization
of the revmap. That had some overhead.

To mitigate, we introduce a new AM callback to clean up state at the end
of all inserts, and perform the revmap cleanup there.
---
 contrib/bloom/blutils.c  |  1 +
 doc/src/sgml/indexam.sgml| 14 +-
 src/backend/access/brin/brin.c   | 74 +++-
 src/backend/access/gin/ginutil.c |  1 +
 src/backend/access/gist/gist.c   |  1 +
 src/backend/access/hash/hash.c   |  1 +
 src/backend/access/index/indexam.c   | 15 ++
 src/backend/access/nbtree/nbtree.c   |  1 +
 src/backend/access/spgist/spgutils.c |  1 +
 src/backend/executor/execIndexing.c  |  5 ++
 src/include/access/amapi.h   |  3 ++
 src/include/access/brin_internal.h   |  1 +
 src/include/access/genam.h   |  2 +
 13 files changed, 108 insertions(+), 12 deletions(-)

diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index d935ed8fbdf..9720f69de09 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -131,6 +131,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = blbuild;
 	amroutine->ambuildempty = blbuildempty;
 	amroutine->aminsert = blinsert;
+	amroutine->aminsertcleanup = NULL;
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index e813e2b620a..17f7d6597f1 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -139,6 +139,7 @@ typedef struct IndexAmRoutine
 ambuild_function ambuild;
 ambuildempty_function ambuildempty;
 aminsert_function aminsert;
+aminsertcleanup_function aminsertcleanup;
 ambulkdelete_function ambulkdelete;
 amvacuumcleanup_function amvacuumcleanup;
 amcanreturn_function amcanreturn;   /* can be NULL */
@@ -355,11 +356,22 @@ aminsert (Relation indexRelation,
within an SQL statement, it can allocate space
in indexInfo-ii_Context and store a pointer to the
data in indexInfo-ii_AmCache (which will be NULL
-   initially).
+   initially). If the data cached contains structures that can be simply pfree'd,
+   they will implicitly be pfree'd. But, if it contains more complex data, such
+   as Buffers or TupleDescs, additional cleanup is necessary. This additional
+   cleanup can be performed in indexinsertcleanup.
   
 
   
 
+bool
+aminsertcleanup (IndexInfo *indexInfo);
+
+   Clean up state that was maintained across successive inserts in
+   indexInfo-ii_AmCache. This is useful if the data
+   contained is complex - like Buffers or TupleDescs which need additional
+   cleanup, unlike simple structures that will be implicitly pfree'd.
+
 IndexBulkDeleteResult *
 ambulkdelete (IndexVacuumInfo *info,
   IndexBulkDeleteResult *stats,
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3c6a956eaa3..2da59f2ab03 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -58,6 +58,17 @@ typedef struct BrinBuildState
 	BrinMemTuple *bs_dtuple;
 } BrinBuildState;
 
+/*
+ * We use a BrinInsertState to capture running state spanning multiple
+ * brininsert invocations, within the same command.
+ */
+typedef struct BrinInsertState
+{
+	BrinRevmap *bs_rmAccess;
+	BrinDesc   *bs_desc;
+	BlockNumber bs_pages_per_range;
+}			BrinInsertState;
+
 /*
  * Struct used as "opaque" during index scans
  */
@@ -72,6 +83,7 @@ typedef struct BrinOpaque
 
 static BrinBuildState *initialize_brin_buildstate(Relation idxRel,
   BrinRevmap *revmap,

Re: brininsert optimization opportunity

2023-07-04 Thread Soumyadeep Chakraborty
On Tue, Jul 4, 2023 at 2:54 PM Tomas Vondra
 wrote:

> I'm not sure if memory context callbacks are the right way to rely on
> for this purpose. The primary purpose of memory contexts is to track
> memory, so using them for this seems a bit weird.

Yeah, this just kept getting dirtier and dirtier.

> There are cases that do something similar, like expandendrecord.c where
> we track refcounted tuple slot, but IMHO there's a big difference
> between tracking one slot allocated right there, and unknown number of
> buffers allocated much later.

Yeah, the following code in ER_mc_callback is there, I think, to prevent
double-freeing the tupdesc, since it might be freed in
ResourceOwnerReleaseInternal() (the part about: /* Ditto for tupdesc
references */).

	if (tupdesc->tdrefcount > 0)
	{
		if (--tupdesc->tdrefcount == 0)
			FreeTupleDesc(tupdesc);
	}

Plus, the above code doesn't try anything with resource owner stuff, whereas
releasing a buffer means:

	ReleaseBuffer() -> UnpinBuffer() ->
		ResourceOwnerForgetBuffer(CurrentResourceOwner, b);

> The fact that even with the extra context is still doesn't handle query
> cancellations is another argument against that approach (I wonder how
> expandedrecord.c handles that, but I haven't checked).
>
> >
> > Maybe there is a better way of doing our cleanup? I'm not sure. Would
> > love your input!
> >
> > The other alternative for all this is to introduce new AM callbacks for
> > insert_begin and insert_end. That might be a tougher sell?
> >
>
> That's the approach I wanted to suggest, more or less - to do the
> cleanup from ExecCloseIndices() before index_close(). I wonder if it's
> even correct to do that later, once we release the locks etc.

I'll try this out and introduce a couple of new index AM callbacks. I think
it's best to do it before releasing the locks - otherwise it might be weird
to manipulate buffers of an index relation, without having some sort of lock
on it. I'll think about it some more.

> I don't think ii_AmCache was intended for stuff like this - GIN and GiST
> only use it to cache stuff that can be just pfree-d, but for buffers
> that's no enough. It's not surprising we need to improve this.

Hmmm, yes, although the docs state:
"If the index AM wishes to cache data across successive index insertions within
an SQL statement, it can allocate space in indexInfo->ii_Context and
store a pointer
to the data in indexInfo->ii_AmCache (which will be NULL initially)."
They don't mention anything about buffer usage. Well, we will fix it!

PS: It should be possible to make GIN and GiST use the new index AM APIs
as well.

> FWIW while debugging this (breakpoint on MemoryContextDelete) I was
> rather annoyed the COPY keeps dropping and recreating the two BRIN
> contexts - brininsert cxt / brin dtuple. I wonder if we could keep and
> reuse those too, but I don't know how much it'd help.
>

Interesting, I will investigate that.

> > Now, to finally answer your question about the speedup without
> > generate_series(). We do see an even higher speedup!
> >
> > seq 1 200000000 > /tmp/data.csv
> > \timing
> > DROP TABLE heap;
> > CREATE TABLE heap(i int);
> > CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
> > COPY heap FROM '/tmp/data.csv';
> >
> > -- 3 runs (master 29cf61ade3f245aa40f427a1d6345287ef77e622)
> > COPY 200000000
> > Time: 205072.444 ms (03:25.072)
> > Time: 215380.369 ms (03:35.380)
> > Time: 203492.347 ms (03:23.492)
> >
> > -- 3 runs (branch v2)
> >
> > COPY 200000000
> > Time: 135052.752 ms (02:15.053)
> > Time: 135093.131 ms (02:15.093)
> > Time: 138737.048 ms (02:18.737)
> >
>
> That's nice, but it still doesn't say how much of that is reading the
> data. If you do just copy into a table without any indexes, how long
> does it take?

So, I loaded the same heap table without any indexes and at the same
scale. I got:

postgres=# COPY heap FROM '/tmp/data.csv';
COPY 200000000
Time: 116161.545 ms (01:56.162)
Time: 114182.745 ms (01:54.183)
Time: 114975.368 ms (01:54.975)

perf diff also attached between the three: w/ no indexes (baseline),
master and v2.

Regards,
Soumyadeep (VMware)


perf_diff_3way.out
Description: Binary data


Re: brininsert optimization opportunity

2023-07-04 Thread Soumyadeep Chakraborty
Thank you both for reviewing!

On Tue, Jul 4, 2023 at 4:24 AM Alvaro Herrera  wrote:

> Hmm, yeah, I remember being bit bothered by this repeated
> initialization. Your patch looks reasonable to me. I would set
> bistate->bs_rmAccess to NULL in the cleanup callback, just to be sure.
> Also, please add comments atop these two new functions, to explain what
> they are.

Done. Set bistate->bs_desc = NULL; as well. Added comments.


On Tue, Jul 4, 2023 at 4:59 AM Tomas Vondra
 wrote:

> Yeah. I wonder how much of that runtime is the generate_series(),
> though. What's the speedup if that part is subtracted. It's guaranteed
> to be even more significant, but by how much?

When trying COPY, I got tripped up by the following:

We get a buffer leak WARNING for the meta page and a revmap page.

WARNING:  buffer refcount leak: [094] (rel=base/156912/206068,
blockNum=1, flags=0x8300, refcount=1 1)
WARNING:  buffer refcount leak: [093] (rel=base/156912/206068,
blockNum=0, flags=0x8300, refcount=1 1)

PrintBufferLeakWarning bufmgr.c:3240
ResourceOwnerReleaseInternal resowner.c:554
ResourceOwnerRelease resowner.c:494
PortalDrop portalmem.c:563
exec_simple_query postgres.c:1284

We release the buffer during this resowner release and then we crash
with:

TRAP: failed Assert("bufnum <= NBuffers"), File:
"../../../../src/include/storage/bufmgr.h", Line: 305, PID: 86833
postgres: pivotal test4 [local] COPY(ExceptionalCondition+0xbb)[0x5572b55bcc79]
postgres: pivotal test4 [local] COPY(+0x61ccfc)[0x5572b537dcfc]
postgres: pivotal test4 [local] COPY(ReleaseBuffer+0x19)[0x5572b5384db2]
postgres: pivotal test4 [local] COPY(brinRevmapTerminate+0x1e)[0x5572b4e3fd39]
postgres: pivotal test4 [local] COPY(+0xcfc44)[0x5572b4e30c44]
postgres: pivotal test4 [local] COPY(+0x89e7f2)[0x5572b55ff7f2]
postgres: pivotal test4 [local] COPY(MemoryContextDelete+0xd7)[0x5572b55ff683]
postgres: pivotal test4 [local] COPY(PortalDrop+0x374)[0x5572b5602dc7]

Unfortunately, when we do COPY, the MemoryContext where makeIndexInfo
gets called is PortalContext and that is what is set in ii_Context.
Furthermore, we clean up the resource owner stuff before we can clean
up the MemoryContexts in PortalDrop().

The CurrentMemoryContext when initialize_brin_insertstate() is called
depends. For CopyMultiInsertBufferFlush() -> ExecInsertIndexTuples()
it is PortalContext, and for CopyFrom() -> ExecInsertIndexTuples() it is
ExecutorState/ExprContext. We can't rely on it to register the callback
either.

What we can do is create a new MemoryContext for holding the
BrinInsertState, and we tie the callback to that so that cleanup is not
affected by all of these variables. See v2 patch attached. Passes make
installcheck-world and make installcheck -C src/test/modules/brin.

However, we do still have 1 issue with the v2 patch:
When we try to cancel (Ctrl-c) a running COPY command:
ERROR:  buffer 151 is not owned by resource owner TopTransaction

#4  0x559cbc54a934 in ResourceOwnerForgetBuffer
(owner=0x559cbd6fcf28, buffer=143) at resowner.c:997
#5  0x559cbc2c45e7 in UnpinBuffer (buf=0x7f8d4a8f3f80) at bufmgr.c:2390
#6  0x559cbc2c7e49 in ReleaseBuffer (buffer=143) at bufmgr.c:4488
#7  0x559cbbd82d53 in brinRevmapTerminate (revmap=0x559cbd7a03b8)
at brin_revmap.c:105
#8  0x559cbbd73c44 in brininsertCleanupCallback
(arg=0x559cbd7a5b68) at brin.c:168
#9  0x559cbc54280c in MemoryContextCallResetCallbacks
(context=0x559cbd7a5a50) at mcxt.c:506
#10 0x559cbc54269d in MemoryContextDelete (context=0x559cbd7a5a50)
at mcxt.c:421
#11 0x559cbc54273e in MemoryContextDeleteChildren
(context=0x559cbd69ae90) at mcxt.c:457
#12 0x559cbc54625c in AtAbort_Portals () at portalmem.c:850

Haven't found a way to fix this ^ yet.

Maybe there is a better way of doing our cleanup? I'm not sure. Would
love your input!

The other alternative for all this is to introduce new AM callbacks for
insert_begin and insert_end. That might be a tougher sell?

Now, to finally answer your question about the speedup without
generate_series(). We do see an even higher speedup!

seq 1 200000000 > /tmp/data.csv
\timing
DROP TABLE heap;
CREATE TABLE heap(i int);
CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
COPY heap FROM '/tmp/data.csv';

-- 3 runs (master 29cf61ade3f245aa40f427a1d6345287ef77e622)
COPY 200000000
Time: 205072.444 ms (03:25.072)
Time: 215380.369 ms (03:35.380)
Time: 203492.347 ms (03:23.492)

-- 3 runs (branch v2)

COPY 200000000
Time: 135052.752 ms (02:15.053)
Time: 135093.131 ms (02:15.093)
Time: 138737.048 ms (02:18.737)

Regards,
Soumyadeep (VMware)
From 54ba134f4afe9c4bac19e7d8fde31b9768dc23cd Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Tue, 4 Jul 2023 11:50:35 -0700
Subject: [PATCH v2 1/1] Reuse revmap and brin desc in brininsert

brininsert() used to have code that performed per-tuple initialization
of the revmap. That had some overhead.
---
 src/backend/acce

brininsert optimization opportunity

2023-07-03 Thread Soumyadeep Chakraborty
Hello hackers,

My colleague, Ashwin, pointed out to me that brininsert's per-tuple init
of the revmap access struct can have non-trivial overhead.

Turns out he is right. We are saving 24 bytes of memory per-call for
the access struct, and a bit on buffer/locking overhead, with the
attached patch.

The implementation ties the revmap cleanup as a MemoryContext callback
to the IndexInfo struct's MemoryContext, as there is no teardown
function provided by the index AM for end-of-insert-command.

Test setup (local Ubuntu workstation):

# Drop caches and restart between each run:
sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches;"
pg_ctl -D /usr/local/pgsql/data/ -l /tmp/logfile restart

\timing
DROP TABLE heap;
CREATE TABLE heap(i int);
CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
INSERT INTO heap SELECT 1 FROM generate_series(1, 200000000);

Results:
We see an improvement for 100M tuples and an even bigger improvement for
200M tuples.

Master (29cf61ade3f245aa40f427a1d6345287ef77e622):

test=# INSERT INTO heap SELECT 1 FROM generate_series(1, 100000000);
INSERT 0 100000000
Time: 222762.159 ms (03:42.762)

-- 3 runs
test=# INSERT INTO heap SELECT 1 FROM generate_series(1, 200000000);
INSERT 0 200000000
Time: 471168.181 ms (07:51.168)
Time: 457071.883 ms (07:37.072)
Time: 486969.205 ms (08:06.969)

Branch:

test2=# INSERT INTO heap SELECT 1 FROM generate_series(1, 100000000);
INSERT 0 100000000
Time: 200046.519 ms (03:20.047)

-- 3 runs
test2=# INSERT INTO heap SELECT 1 FROM generate_series(1, 200000000);
INSERT 0 200000000
Time: 369041.832 ms (06:09.042)
Time: 365483.382 ms (06:05.483)
Time: 375506.144 ms (06:15.506)

# Profiled backend running INSERT of 100000000 rows
sudo perf record -p 11951 --call-graph fp sleep 180

Please see attached perf diff between master and branch. We see that we
save on a bit of overhead from brinRevmapInitialize(),
brinRevmapTerminate() and lock routines.

Regards,
Soumyadeep (VMware)


perf_diff.out
Description: Binary data
From 5d11219e413fe6806e00dd738b051c3948dffcab Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Mon, 3 Jul 2023 10:47:48 -0700
Subject: [PATCH v1 1/1] Reuse revmap and brin desc in brininsert

brininsert() used to have code that performed per-tuple initialization
of the revmap. That had some overhead.
---
 src/backend/access/brin/brin.c | 70 --
 1 file changed, 59 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3c6a956eaa3..e27bee980f1 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -58,6 +58,17 @@ typedef struct BrinBuildState
 	BrinMemTuple *bs_dtuple;
 } BrinBuildState;
 
+/*
+ * We use a BrinInsertState to capture running state spanning multiple
+ * brininsert invocations, within the same command.
+ */
+typedef struct BrinInsertState
+{
+	BrinRevmap 	*bs_rmAccess;
+	BrinDesc   	*bs_desc;
+	BlockNumber bs_pages_per_range;
+} BrinInsertState;
+
 /*
  * Struct used as "opaque" during index scans
  */
@@ -72,6 +83,7 @@ typedef struct BrinOpaque
 
 static BrinBuildState *initialize_brin_buildstate(Relation idxRel,
   BrinRevmap *revmap, BlockNumber pagesPerRange);
+static BrinInsertState *initialize_brin_insertstate(Relation idxRel, IndexInfo *indexInfo);
 static void terminate_brin_buildstate(BrinBuildState *state);
 static void brinsummarize(Relation index, Relation heapRel, BlockNumber pageRange,
 		  bool include_partial, double *numSummarized, double *numExisting);
@@ -140,6 +152,42 @@ brinhandler(PG_FUNCTION_ARGS)
 	PG_RETURN_POINTER(amroutine);
 }
 
+static void
+brininsertCleanupCallback(void *arg)
+{
+	BrinInsertState *bistate = (BrinInsertState *) arg;
+	/*
+	 * Clean up the revmap. Note that the brinDesc has already been cleaned up
+	 * as part of its own memory context.
+	 */
+	brinRevmapTerminate(bistate->bs_rmAccess);
+}
+
+static BrinInsertState *
+initialize_brin_insertstate(Relation idxRel, IndexInfo *indexInfo)
+{
+	BrinInsertState 		*bistate;
+	MemoryContextCallback 	*cb;
+	MemoryContext 			oldcxt;
+
+	oldcxt = MemoryContextSwitchTo(indexInfo->ii_Context);
+	bistate = MemoryContextAllocZero(indexInfo->ii_Context,
+	 sizeof(BrinInsertState));
+
+	bistate->bs_desc = brin_build_desc(idxRel);
+	cb = palloc(sizeof(MemoryContextCallback));
+	cb->arg = bistate;
+	cb->func = brininsertCleanupCallback;
+	MemoryContextRegisterResetCallback(indexInfo->ii_Context, cb);
+	MemoryContextSwitchTo(oldcxt);
+
+	bistate->bs_rmAccess = brinRevmapInitialize(idxRel,
+												&bistate->bs_pages_per_range,
+												NULL);
+
+	return bistate;
+}
+
 /*
  * A tuple in the heap is being inserted.  To keep a brin index up to date,
  * we need to obtain the relevant index tuple and compare its stored values
@@ -162,14 +210,23 @@ brininsert(Relation idxRel, Datum *values, bool *nulls,
 	BlockNumber pagesPerRange;
 	BlockNumber origH

Re: pg_rewind: Skip log directory for file type check like pg_wal

2023-03-06 Thread Soumyadeep Chakraborty
On Mon, Mar 6, 2023 at 11:33 PM Alexander Kukushkin  wrote:
>
>
> Lets assume that on the source we have "pg_log" and on the target we have 
> "my_log" (they are configured using "log_directory" GUC).
> When doing rewind in this case we want neither to remove the content of 
> "my_log" on the target nor to copy content of "pg_log" from the source.
> It couldn't be achieved just by introducing a static string "log". The 
> "log_directory" GUC must be examined on both, source and target.

The trouble with doing that is when pg_rewind is run in non-libpq (offline)
mode. Then we would have to parse it out of the conf file(s)? Is there a
standard way of doing that?

Regards,
Soumyadeep (VMware)




Re: pg_rewind: Skip log directory for file type check like pg_wal

2023-03-06 Thread Soumyadeep Chakraborty
On Mon, Mar 6, 2023 at 12:28 AM Alexander Kukushkin  wrote:
>
> Hello Soumyadeep,
>
> The problem indeed exists, but IMO the "log" directory case must be handled 
> differently:
> 1. We don't need or I would even say we don't want to sync log files from the 
> new primary, because it destroys the actual logs, which could be very 
> important to figure out what has happened with the old primary

Yes, this can be solved by adding "log" to excludeDirContents. We did
this for GPDB.

> 2. Unlike "pg_wal", the "log" directory is not necessarily located inside 
> PGDATA. The actual value is configured using "log_directory" GUC, which just 
> happened to be "log" by default. And in fact actual values on source and 
> target could be different.

I think we only care about files/dirs inside the datadir. Anything outside
is out of scope for pg_rewind, AFAIU. We can only address the common case
here. As mentioned in this comment:

 * XXX: There is no backend function to get a symbolic link's target in
 * general, so if the admin has put any custom symbolic links in the data
 * directory, they won't be copied correctly.

There is not much we can do about custom configurations.

Regards,
Soumyadeep (VMware)




pg_rewind: Skip log directory for file type check like pg_wal

2023-03-05 Thread Soumyadeep Chakraborty
Hello hackers,

I think we should extend the "log" directory the same courtesy as was
done for pg_wal (pg_xlog) in 0e42397f42b.

Today, even if BOTH source and target servers have symlinked "log"
directories, pg_rewind fails with:

file "log" is of different type in source and target.

Attached is a repro patch using the 004_pg_xlog_symlink.pl test to
demonstrate the failure.
Running make check PROVE_TESTS='t/004_pg_xlog_symlink.pl'
in src/bin/pg_rewind should suffice after applying.

This is because when we use the libpq query to fetch the filemap from
the source server, we consider the log directory as a directory, even if
it is a symlink. This is because pg_stat_file() is used in that query in
libpq_traverse_files() and pg_stat_file() returns isdir=t for symlinks
to directories.
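
This is easy to demonstrate: assuming "log" is a symlink to a directory in
the source's data directory, something like the following is what the
filemap query sees:

postgres=# SELECT isdir FROM pg_stat_file('log');
 isdir
-------
 t
(1 row)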

This shortcoming is somewhat called out:

 * XXX: There is no backend function to get a symbolic link's target in
 * general, so if the admin has put any custom symbolic links in the data
 * directory, they won't be copied correctly.

We could fix the query and/or pg_stat_file(). However, we would also
like to support deployments where only one of the primaries and/or
standbys have the symlink. That is not hard to conceive, given primaries
and standbys can have drastically disparate log volume and/or log
collection requirements.

Attached is a patch that treats "log" like we treat "pg_wal".

Regards,
Soumyadeep (VMware)
From 697414d2b630efdad0a9137ea9cc93f8576a9792 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Sun, 5 Mar 2023 17:57:55 -0800
Subject: [PATCH v1 1/1] Fix pg_rewind when log is a symlink

The log directory can often be symlinked in the same way the pg_wal
directory is. Today, even if BOTH the source and target servers have
their log directories symlinked, pg_rewind will run into the error:

"log" is of different type in source and target

This is because when we use the libpq query to fetch the filemap from
the source server, we consider the log directory as a directory, even if
it is a symlink. This is because pg_stat_file() is used in that query in
libpq_traverse_files() and pg_stat_file() returns isdir=t for symlinks
to directories.

This shortcoming is somewhat called out:

 * XXX: There is no backend function to get a symbolic link's target in
 * general, so if the admin has put any custom symbolic links in the data
 * directory, they won't be copied correctly.

We could fix the query and/or pg_stat_file(). However, we would also
like to support deployments where only one of the primaries and/or
standbys have the symlink. That is not hard to conceive, given primaries
and standbys can have drastically disparate log volume and/or log
collection requirements.

So, we decide to extend the log directory the same courtesy as was done
for pg_wal (pg_xlog) in 0e42397f42b.
---
 src/bin/pg_rewind/filemap.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index bd5c598e200..a076bb33996 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -221,11 +221,11 @@ process_source_file(const char *path, file_type_t type, size_t size,
 	file_entry_t *entry;
 
 	/*
-	 * Pretend that pg_wal is a directory, even if it's really a symlink. We
+	 * Pretend that pg_wal/log is a directory, even if it's really a symlink. We
 	 * don't want to mess with the symlink itself, nor complain if it's a
 	 * symlink in source but not in target or vice versa.
 	 */
-	if (strcmp(path, "pg_wal") == 0 && type == FILE_TYPE_SYMLINK)
+	if (((strcmp(path, "pg_wal") == 0 || strcmp(path, "log") == 0)) && type == FILE_TYPE_SYMLINK)
 		type = FILE_TYPE_DIRECTORY;
 
 	/*
@@ -263,9 +263,9 @@ process_target_file(const char *path, file_type_t type, size_t size,
 	 */
 
 	/*
-	 * Like in process_source_file, pretend that pg_wal is always a directory.
+	 * Like in process_source_file, pretend that pg_wal/log is always a directory.
 	 */
-	if (strcmp(path, "pg_wal") == 0 && type == FILE_TYPE_SYMLINK)
+	if (((strcmp(path, "pg_wal") == 0 || strcmp(path, "log") == 0)) && type == FILE_TYPE_SYMLINK)
 		type = FILE_TYPE_DIRECTORY;
 
 	/* Remember this target file */
-- 
2.34.1

diff --git a/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl b/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl
index 5fb7fa9077c..f95ba1a1486 100644
--- a/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl
+++ b/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl
@@ -2,7 +2,7 @@
 # Copyright (c) 2021-2023, PostgreSQL Global Development Group
 
 #
-# Test pg_rewind when the target's pg_wal directory is a symlink.
+# Test pg_rewind when the target's log directory is a symlink.
 #
 use strict;
 use warnings;
@@ -27,11 +27,12 @@ sub run_test
 	RewindTest::setup_cluster($test_mode);
 
 	my $test_primary_datadir = $node_primary->data_dir;

Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2022-06-09 Thread Soumyadeep Chakraborty
On Wed, May 18, 2022 at 6:26 PM Zhihong Yu  wrote:

> +   accessMethodId = ((Form_pg_class) GETSTRUCT(tup))->relam;
>
> -   /* look up the access method, verify it is for a table */
> -   if (accessMethod != NULL)
> -   accessMethodId = get_table_am_oid(accessMethod, false);
> +   if (!HeapTupleIsValid(tup))
> +   elog(ERROR, "cache lookup failed for relation %u", relid);
>
> The validity check of tup should be done before fetching the value of the
> relam field.

Thanks. Fixed and rebased.

Regards,
Soumyadeep (VMware)
From a9fc146192430aa45224fb1e64bd95e0bb64fc00 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Thu, 9 Jun 2022 10:57:35 -0700
Subject: [PATCH v3 1/2] Allow ATSETAM on partition roots

Setting the access method on partition roots was disallowed. This commit
relaxes that restriction.

Summary of changes over original patch:

* Rebased
* Minor comment updates
* Add more test coverage

Authors: Justin Pryzby , Soumyadeep Chakraborty

---
 src/backend/catalog/heap.c  |   3 +-
 src/backend/commands/tablecmds.c| 103 +++-
 src/backend/utils/cache/relcache.c  |   4 +
 src/test/regress/expected/create_am.out |  50 +++-
 src/test/regress/sql/create_am.sql  |  23 +++---
 5 files changed, 129 insertions(+), 54 deletions(-)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 1803194db94..4561f3b8c98 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1449,7 +1449,8 @@ heap_create_with_catalog(const char *relname,
 		 * No need to add an explicit dependency for the toast table, as the
 		 * main table depends on it.
 		 */
-		if (RELKIND_HAS_TABLE_AM(relkind) && relkind != RELKIND_TOASTVALUE)
+		if ((RELKIND_HAS_TABLE_AM(relkind) && relkind != RELKIND_TOASTVALUE) ||
+			relkind == RELKIND_PARTITIONED_TABLE)
 		{
 			ObjectAddressSet(referenced, AccessMethodRelationId, accessmtd);
 			add_exact_object_address(&referenced, addrs);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 2de0ebacec3..9a5eabcac49 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -570,6 +570,7 @@ static ObjectAddress ATExecClusterOn(Relation rel, const char *indexName,
 	 LOCKMODE lockmode);
 static void ATExecDropCluster(Relation rel, LOCKMODE lockmode);
 static void ATPrepSetAccessMethod(AlteredTableInfo *tab, Relation rel, const char *amname);
+static void ATExecSetAccessMethodNoStorage(Relation rel, Oid newAccessMethod);
 static bool ATPrepChangePersistence(Relation rel, bool toLogged);
 static void ATPrepSetTableSpace(AlteredTableInfo *tab, Relation rel,
 const char *tablespacename, LOCKMODE lockmode);
@@ -676,7 +677,6 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	Oid			ofTypeId;
 	ObjectAddress address;
 	LOCKMODE	parentLockmode;
-	const char *accessMethod = NULL;
 	Oid			accessMethodId = InvalidOid;
 
 	/*
@@ -938,20 +938,32 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	 * a type of relation that needs one, use the default.
 	 */
 	if (stmt->accessMethod != NULL)
+		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
+	else if (stmt->partbound && relkind == RELKIND_RELATION)
 	{
-		accessMethod = stmt->accessMethod;
+		HeapTuple   tup;
+		Oidrelid;
 
-		if (partitioned)
-			ereport(ERROR,
-	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-	 errmsg("specifying a table access method is not supported on a partitioned table")));
-	}
-	else if (RELKIND_HAS_TABLE_AM(relkind))
-		accessMethod = default_table_access_method;
+		/*
+		 * For partitioned tables, when no access method is specified, we
+		 * default to the parent table's AM.
+		 */
+		Assert(list_length(inheritOids) == 1);
+		/* XXX: should implement get_rel_relam? */
+		relid = linitial_oid(inheritOids);
+		tup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
+		if (!HeapTupleIsValid(tup))
+			elog(ERROR, "cache lookup failed for relation %u", relid);
 
-	/* look up the access method, verify it is for a table */
-	if (accessMethod != NULL)
-		accessMethodId = get_table_am_oid(accessMethod, false);
+		accessMethodId = ((Form_pg_class) GETSTRUCT(tup))->relam;
+
+		ReleaseSysCache(tup);
+
+		if (!OidIsValid(accessMethodId))
+			accessMethodId = get_table_am_oid(default_table_access_method, false);
+	}
+	else if (RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE)
+		accessMethodId = get_table_am_oid(default_table_access_method, false);
 
 	/*
 	 * Create the relation.  Inherited defaults and constraints are passed in
@@ -4693,13 +4705,6 @@ ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
 			break;
 		case AT_SetAccessMethod:	/* SET ACCESS METHOD */
 			ATSimplePermissions(cmd->subtype, rel, ATT_TABLE | ATT_MATVIEW);
-
-			/* partitioned tables don't have an access me

Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2022-05-18 Thread Soumyadeep Chakraborty
On Wed, May 18, 2022 at 4:14 PM Justin Pryzby  wrote:

> I didn't look closely yet, but this comment is wrong:
>
> + * Since these have no storage the tablespace can be updated with a simple
> + * metadata only operation to update the tablespace.



Good catch. Fixed.

> It'd be convenient if AMs worked the same way (and a bit odd that they
> don't). Note that in v15, pg_dump/restore now allow --no-table-am, an
> exact parallel to --no-tablespace.

I agree that ATSET AM should behave in a similar fashion to ATSET
tablespaces. However, the way that ATSET tablespace currently behaves is not
consistent with the ONLY clause.

On a given partition root:
ALTER TABLE ONLY am_partitioned SET TABLESPACE ts;
has the same effect as:
ALTER TABLE am_partitioned SET TABLESPACE ts;

We are missing out on the feature to set the AM/tablespace throughout the
partition hierarchy, with one command.

Regards,
Soumyadeep (VMware)
From 3a4a78d46fc74bb3e9b7ac9aefc689d250e1ecf4 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Tue, 17 May 2022 15:22:48 -0700
Subject: [PATCH v2 2/2] Make ATSETAM recurse by default

ATSETAM now recurses to partition children by default. To prevent
recursion, and have the new am apply to future children only, users must
specify the ONLY clause.
---
 src/backend/commands/tablecmds.c|  2 ++
 src/test/regress/expected/create_am.out | 13 -
 src/test/regress/sql/create_am.sql  |  5 -
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 5febd834d5b..c5bbb71c66d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4711,6 +4711,8 @@ ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
 		(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 		 errmsg("cannot have multiple SET ACCESS METHOD subcommands")));
 
+			if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+				ATSimpleRecursion(wqueue, rel, cmd, recurse, lockmode, context);
 			ATPrepSetAccessMethod(tab, rel, cmd->name);
 			pass = AT_PASS_MISC;	/* does not matter; no work in Phase 2 */
 			break;
diff --git a/src/test/regress/expected/create_am.out b/src/test/regress/expected/create_am.out
index a48199df9a2..e99de1ac912 100644
--- a/src/test/regress/expected/create_am.out
+++ b/src/test/regress/expected/create_am.out
@@ -284,7 +284,7 @@ CREATE TABLE am_partitioned(x INT, y INT)
 CREATE TABLE am_partitioned_1 PARTITION OF am_partitioned FOR VALUES WITH (MODULUS 3,REMAINDER 0);
 CREATE TABLE am_partitioned_2 PARTITION OF am_partitioned FOR VALUES WITH (MODULUS 3,REMAINDER 1);
 ALTER TABLE am_partitioned_1 SET ACCESS METHOD heap2;
-ALTER TABLE am_partitioned SET ACCESS METHOD heap2;
+ALTER TABLE ONLY am_partitioned SET ACCESS METHOD heap2;
 CREATE TABLE am_partitioned_3 PARTITION OF am_partitioned FOR VALUES WITH (MODULUS 3,REMAINDER 2);
 SELECT relname, amname FROM pg_class c, pg_am am
   WHERE c.relam = am.oid AND c.relname LIKE 'am_partitioned%' ORDER BY 1;
@@ -296,6 +296,17 @@ SELECT relname, amname FROM pg_class c, pg_am am
  am_partitioned_3 | heap2
 (4 rows)
 
+ALTER TABLE am_partitioned SET ACCESS METHOD heap;
+SELECT relname, amname FROM pg_class c, pg_am am
+WHERE c.relam = am.oid AND c.relname LIKE 'am_partitioned%' ORDER BY 1;
+     relname      | amname 
+------------------+--------
+ am_partitioned   | heap
+ am_partitioned_1 | heap
+ am_partitioned_2 | heap
+ am_partitioned_3 | heap
+(4 rows)
+
 DROP TABLE am_partitioned;
 -- Second, create objects in the new AM by changing the default AM
 BEGIN;
diff --git a/src/test/regress/sql/create_am.sql b/src/test/regress/sql/create_am.sql
index 8f3db03bba2..f4e22684782 100644
--- a/src/test/regress/sql/create_am.sql
+++ b/src/test/regress/sql/create_am.sql
@@ -184,10 +184,13 @@ CREATE TABLE am_partitioned(x INT, y INT)
 CREATE TABLE am_partitioned_1 PARTITION OF am_partitioned FOR VALUES WITH (MODULUS 3,REMAINDER 0);
 CREATE TABLE am_partitioned_2 PARTITION OF am_partitioned FOR VALUES WITH (MODULUS 3,REMAINDER 1);
 ALTER TABLE am_partitioned_1 SET ACCESS METHOD heap2;
-ALTER TABLE am_partitioned SET ACCESS METHOD heap2;
+ALTER TABLE ONLY am_partitioned SET ACCESS METHOD heap2;
 CREATE TABLE am_partitioned_3 PARTITION OF am_partitioned FOR VALUES WITH (MODULUS 3,REMAINDER 2);
 SELECT relname, amname FROM pg_class c, pg_am am
   WHERE c.relam = am.oid AND c.relname LIKE 'am_partitioned%' ORDER BY 1;
+ALTER TABLE am_partitioned SET ACCESS METHOD heap;
+SELECT relname, amname FROM pg_class c, pg_am am
+WHERE c.relam = am.oid AND c.relname LIKE 'am_partitioned%' ORDER BY 1;
 DROP TABLE am_partitioned;
 
 -- Second, create objects in the new AM by changing the default AM
-- 
2.25.1

From 6edce14b5becb30ca7c12224acaabea62a4d999f Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Mon, 16 May 2022 14:46:20 -0700
Subject: [PATCH v2 1/2] Allow ATSETAM on partition roots

Setting the access 

ALTER TABLE SET ACCESS METHOD on partitioned tables

2022-05-17 Thread Soumyadeep Chakraborty
Hello,

This is a fresh thread to continue the discussion on ALTER TABLE SET
ACCESS METHOD when applied to partition roots, as requested.

Current behavior (HEAD):

CREATE TABLE am_partitioned(x INT, y INT)
   PARTITION BY hash (x);
ALTER TABLE am_partitioned SET ACCESS METHOD heap2;
ERROR:  cannot change access method of a partitioned table

Potential behavior options:

(A) Don't recurse to existing children and ensure that the new am gets
inherited by any new children. (ALTER TABLE SET TABLESPACE behavior)

(B) Recurse to existing children and modify their am. Also, ensure that
any new children inherit the new am.

A patch [1] was introduced earlier by Justin to implement
(A). v1-0001-Allow-ATSETAM-on-partition-roots.patch contains a rebase
of that patch against latest HEAD, with minor updates on comments and
some additional test coverage.

I think that (B) is necessary for partition hierarchies with a high
number of partitions. One typical use case in Greenplum, for
instance, is to convert heap tables containing cold data to append-only
storage at the root or subroot level of partition hierarchies consisting
of thousands of partitions. Asking users to ALTER individual partitions
is cumbersome and error-prone.

Furthermore, I believe that (B) should be the default and (A) can be
chosen by using the ONLY clause. This would give us the best of both
worlds and would make the use of ONLY consistent. The patch
v1-0002-Make-ATSETAM-recurse-by-default.patch achieves that.

Thoughts?

Regards,
Soumyadeep (VMware)


[1]
https://www.postgresql.org/message-id/20210308010707.GA29832%40telsasoft.com
From 440517ff8a5912d4c657a464b2c1de9c8b3a4f70 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Tue, 17 May 2022 15:22:48 -0700
Subject: [PATCH v1 2/2] Make ATSETAM recurse by default

ATSETAM now recurses to partition children by default. To prevent
recursion, and have the new am apply to future children only, users must
specify the ONLY clause.
---
 src/backend/commands/tablecmds.c|  2 ++
 src/test/regress/expected/create_am.out | 13 -
 src/test/regress/sql/create_am.sql  |  5 -
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index b3f411dd71f..58251c22ffe 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4711,6 +4711,8 @@ ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
 		(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 		 errmsg("cannot have multiple SET ACCESS METHOD subcommands")));
 
+			if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+				ATSimpleRecursion(wqueue, rel, cmd, recurse, lockmode, context);
 			ATPrepSetAccessMethod(tab, rel, cmd->name);
 			pass = AT_PASS_MISC;	/* does not matter; no work in Phase 2 */
 			break;
diff --git a/src/test/regress/expected/create_am.out b/src/test/regress/expected/create_am.out
index a48199df9a2..e99de1ac912 100644
--- a/src/test/regress/expected/create_am.out
+++ b/src/test/regress/expected/create_am.out
@@ -284,7 +284,7 @@ CREATE TABLE am_partitioned(x INT, y INT)
 CREATE TABLE am_partitioned_1 PARTITION OF am_partitioned FOR VALUES WITH (MODULUS 3,REMAINDER 0);
 CREATE TABLE am_partitioned_2 PARTITION OF am_partitioned FOR VALUES WITH (MODULUS 3,REMAINDER 1);
 ALTER TABLE am_partitioned_1 SET ACCESS METHOD heap2;
-ALTER TABLE am_partitioned SET ACCESS METHOD heap2;
+ALTER TABLE ONLY am_partitioned SET ACCESS METHOD heap2;
 CREATE TABLE am_partitioned_3 PARTITION OF am_partitioned FOR VALUES WITH (MODULUS 3,REMAINDER 2);
 SELECT relname, amname FROM pg_class c, pg_am am
   WHERE c.relam = am.oid AND c.relname LIKE 'am_partitioned%' ORDER BY 1;
@@ -296,6 +296,17 @@ SELECT relname, amname FROM pg_class c, pg_am am
  am_partitioned_3 | heap2
 (4 rows)
 
+ALTER TABLE am_partitioned SET ACCESS METHOD heap;
+SELECT relname, amname FROM pg_class c, pg_am am
+WHERE c.relam = am.oid AND c.relname LIKE 'am_partitioned%' ORDER BY 1;
+     relname      | amname 
+------------------+--------
+ am_partitioned   | heap
+ am_partitioned_1 | heap
+ am_partitioned_2 | heap
+ am_partitioned_3 | heap
+(4 rows)
+
 DROP TABLE am_partitioned;
 -- Second, create objects in the new AM by changing the default AM
 BEGIN;
diff --git a/src/test/regress/sql/create_am.sql b/src/test/regress/sql/create_am.sql
index 8f3db03bba2..f4e22684782 100644
--- a/src/test/regress/sql/create_am.sql
+++ b/src/test/regress/sql/create_am.sql
@@ -184,10 +184,13 @@ CREATE TABLE am_partitioned(x INT, y INT)
 CREATE TABLE am_partitioned_1 PARTITION OF am_partitioned FOR VALUES WITH (MODULUS 3,REMAINDER 0);
 CREATE TABLE am_partitioned_2 PARTITION OF am_partitioned FOR VALUES WITH (MODULUS 3,REMAINDER 1);
 ALTER TABLE am_partitioned_1 SET ACCESS METHOD heap2;
-ALTER TABLE am_partitioned SET ACCESS METHOD heap2;
+ALTER TABLE ONLY am_partitioned SET ACCESS METHOD heap2;
 CREATE TABLE am_partitioned_3 PARTITION OF am_par

Re: Unnecessary delay in streaming replication due to replay lag

2021-12-15 Thread Soumyadeep Chakraborty
Hi Bharath,

Thanks for the review!

On Sat, Nov 27, 2021 at 6:36 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> 1) A memory leak: add FreeDir(dir); before returning.
> + ereport(LOG,
> + (errmsg("Could not start streaming WAL eagerly"),
> + errdetail("There are timeline changes in the locally available WAL files."),
> + errhint("WAL streaming will begin once all local WAL and archives
> are exhausted.")));
> + return;
> + }
>

Thanks for catching that. Fixed.

>
>
> 2) Is there a guarantee that while we traverse the pg_wal directory to
> find startsegno and endsegno, the new wal files arrive from the
> primary or archive location or old wal files get removed/recycled by
> the standby? Especially when wal_receiver_start_condition=consistency?
> + startsegno = (startsegno == -1) ? logSegNo : Min(startsegno, logSegNo);
> + endsegno = (endsegno == -1) ? logSegNo : Max(endsegno, logSegNo);
> + }
>

Even if newer wal files arrive after the snapshot of the dir listing
taken by AllocateDir()/ReadDir(), we will in effect start from a
slightly older location, which should be fine. It shouldn't matter if
an older file is recycled. If the last valid WAL segment is recycled,
we will ERROR out in StartWALReceiverEagerlyIfPossible() and the eager
start can be retried by the startup process when
CheckRecoveryConsistency() is called again.

>
>
> 3) I think the errmsg text format isn't correct. Note that the errmsg
> text starts with lowercase and doesn't end with "." whereas errdetail
> or errhint starts with uppercase and ends with ".". Please check other
> messages for reference.
> The following should be changed.
> + errmsg("Requesting stream from beginning of: %s",
> + errmsg("Invalid WAL segment found while calculating stream start:
> %s. Skipping.",
> + (errmsg("Could not start streaming WAL eagerly"),

Fixed.

> 4) I think you also need to have wal files names in double quotes,
> something like below:
> errmsg("could not close file \"%s\": %m", xlogfname)));

Fixed.

>
> 5) It is ".stream start: \"%s\", skipping..",
> + errmsg("Invalid WAL segment found while calculating stream start:
> %s. Skipping.",

Fixed.

> 4) I think the patch can make the startup process significantly slow,
> especially when there are lots of wal files that exist in the standby
> pg_wal directory. This is because of the overhead
> StartWALReceiverEagerlyIfPossible adds i.e. doing two while loops to
> figure out the start position of the
> streaming in advance. This might end up the startup process doing the
> loop over in the directory rather than the important thing of doing
> crash recovery or standby recovery.

Well, 99% of the time we can expect that the second loop finishes after
1 or 2 iterations, as the last valid WAL segment would most likely be
the highest numbered WAL file or thereabouts. I don't think that the
overhead will be significant as we are just looking up a directory
listing and not reading any files.
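
For illustration only, here is a minimal standalone C sketch of that
directory scan (not the patch's code; the "pg_wal" path and the helper
name are assumptions). Within one timeline, the 24-hex-digit segment
names sort in LSN order, so a single pass over the listing is enough to
find the oldest and newest candidate segments without reading any file
contents; the validity check then walks down from the newest, which is
why it typically stops after one or two iterations:

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* A name qualifies if it is exactly 24 hex digits, like WAL segments. */
static int looks_like_wal_segment(const char *name)
{
	if (strlen(name) != 24)
		return 0;
	for (int i = 0; i < 24; i++)
		if (!isxdigit((unsigned char) name[i]))
			return 0;
	return 1;
}

int main(void)
{
	DIR		   *dir = opendir("pg_wal");	/* path is an assumption */
	struct dirent *de;
	char		lo[25] = "", hi[25] = "";

	if (dir == NULL)
		return 1;
	while ((de = readdir(dir)) != NULL)
	{
		if (!looks_like_wal_segment(de->d_name))
			continue;
		/* within a timeline, name order == LSN order */
		if (lo[0] == '\0' || strcmp(de->d_name, lo) < 0)
			strcpy(lo, de->d_name);
		if (hi[0] == '\0' || strcmp(de->d_name, hi) > 0)
			strcpy(hi, de->d_name);
	}
	closedir(dir);
	printf("oldest: %s, newest: %s\n", lo[0] ? lo : "none",
		   hi[0] ? hi : "none");
	return 0;
}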

> 5) What happens if this new GUC is enabled in case of a synchronous
standby?
> What happens if this new GUC is enabled in case of a crash recovery?
> What happens if this new GUC is enabled in case a restore command is
> set i.e. standby performing archive recovery?

The GUC would behave the same way for all of these cases. If we have
chosen 'startup'/'consistency', we would be starting the WAL receiver
eagerly. There might be certain race conditions when one combines this
GUC with archive recovery, as discussed upthread [1].

> 6) How about bgwriter/checkpointer which gets started even before the
> startup process (or a new bg worker? of course it's going to be an
> overkill) finding out the new start pos for the startup process and
> then we could get rid of startup behaviour of the
> patch? This avoids an extra burden on the startup process. Many times,
> users will be complaining about why recovery is taking more time now,
> after the GUC wal_receiver_start_condition=startup.

Hmm, then we would need additional synchronization. There would also
be an added dependency on checkpoint_timeout. I don't think that the
performance hit is significant enough to warrant this change.

> 8) Can we have a better GUC name than wal_receiver_start_condition?
> Something like wal_receiver_start_at or wal_receiver_start or some
> other?

Sure, that makes more sense. Fixed.

Regards,
Soumyadeep (VMware)

[1]
https://www.postgresql.org/message-id/CAE-ML%2B-8KnuJqXKHz0mrC7-qFMQJ3ArDC78X3-AjGKos7Ceocw%40mail.gmail.com
From e6ffb5400fd841e4285939133739692c9cb4ba17 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Fri, 19 Nov 2021 00:33:17 -0800
Subject: [PATCH v5 1/1] Intr

Re: Unnecessary delay in streaming replication due to replay lag

2021-11-22 Thread Soumyadeep Chakraborty
Hi Bharath,

Yes, that thread has been discussed here. Asim had x-posted the patch
to [1]. This thread was more recent when Ashwin and I picked up the
patch in Aug 2021, so we continued here. The patch has been
significantly updated by us, addressing Michael's long-outstanding
feedback.

Regards,
Soumyadeep (VMware)

[1]
https://www.postgresql.org/message-id/CANXE4TeinQdw%2BM2Or0kTR24eRgWCOg479N8%3DgRvj9Ouki-tZFg%40mail.gmail.com


Re: Unnecessary delay in streaming replication due to replay lag

2021-11-19 Thread Soumyadeep Chakraborty
Hi Daniel,

Thanks for checking in on this patch.
Attached rebased version.

Regards,
Soumyadeep (VMware)
From 51149e4f877dc2f8bf47a1356fc8b0ec2512ac6d Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Fri, 19 Nov 2021 00:33:17 -0800
Subject: [PATCH v4 1/1] Introduce feature to start WAL receiver eagerly

This commit introduces a new GUC wal_receiver_start_condition which can
enable the standby to start its WAL receiver at an earlier stage. The
GUC will default to starting the WAL receiver after WAL from archives
and pg_wal have been exhausted, designated by the value 'exhaust'.
The value of 'startup' indicates that the WAL receiver will be started
immediately on standby startup. Finally, the value of 'consistency'
indicates that the WAL receiver will start after the standby has
replayed up to the consistency point.

If 'startup' or 'consistency' is specified, the starting point for the
WAL receiver will always be the end of all locally available WAL in
pg_wal. The end is determined by finding the latest WAL segment in
pg_wal and then iterating to the earliest segment. The iteration is
terminated as soon as a valid WAL segment is found. Streaming can then
commence from the start of that segment.

Co-authors: Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion:
https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml  |  33 
 src/backend/access/transam/xlog.c | 170 +-
 src/backend/replication/walreceiver.c |   1 +
 src/backend/utils/misc/guc.c  |  18 ++
 src/backend/utils/misc/postgresql.conf.sample |   3 +-
 src/include/replication/walreceiver.h |  10 ++
 src/test/recovery/t/027_walreceiver_start.pl  |  96 ++
 7 files changed, 321 insertions(+), 10 deletions(-)
 create mode 100644 src/test/recovery/t/027_walreceiver_start.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3f806740d5d..b911615bf04 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4604,6 +4604,39 @@ ANY num_sync ( 
+  wal_receiver_start_condition (enum)
+  
+   wal_receiver_start_condition configuration parameter
+  
+  
+  
+   
+Specifies when the WAL receiver process will be started for a standby
+server.
+The allowed values of wal_receiver_start_condition
+are startup (start immediately when the standby starts),
+consistency (start only after reaching consistency), and
+exhaust (start only after all WAL from the archive and
+pg_wal has been replayed).
+The default setting is exhaust.
+   
+
+   
+Traditionally, the WAL receiver process is started only after the
+standby server has exhausted all WAL from the WAL archive and the local
+pg_wal directory. In some environments there can be a significant volume
+of local WAL left to replay, along with a large volume of yet to be
+streamed WAL. Such environments can benefit from setting
+wal_receiver_start_condition to
+startup or consistency. These
+values will lead to the WAL receiver starting much earlier, and from
+the end of locally available WAL. The network will be utilized to stream
+WAL concurrently with replay, improving performance significantly.
+   
+  
+ 
+
  
   wal_receiver_status_interval (integer)
   
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 16164483688..9c1243a2f85 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -983,6 +983,8 @@ static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
 static void checkXLogConsistency(XLogReaderState *record);
+static void StartWALReceiverEagerlyIfPossible(WalRcvStartCondition startPoint,
+			  TimeLineID currentTLI);
 
 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
@@ -1538,6 +1540,152 @@ checkXLogConsistency(XLogReaderState *record)
 	}
 }
 
+/*
+ * Start WAL receiver eagerly without waiting to play all WAL from the archive
+ * and pg_wal. First, find the last valid WAL segment in pg_wal and then request
+ * streaming to commence from it's beginning. startPoint signifies whether we
+ * are trying the eager start right at startup or once we have reached
+ * consistency.
+ */
+static void
+StartWALReceiverEagerlyIfPossible(WalRcvStartCondition startPoint,
+  TimeLineID currentTLI)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	XLogSegNo	startsegno = -1;
+	XLogSegNo	endsegno = -1;
+
+	/*
+	 * We should not be starting the walreceiver during bootstrap/init
+	 * processing.
+	 */
+	if (!IsNormalProcessingMode())
+		return;
+
+	/* Only the startup process

Re: Unnecessary delay in streaming replication due to replay lag

2021-11-09 Thread Soumyadeep Chakraborty
>   * Move to XLOG_FROM_STREAM state, and set to start a
> - * walreceiver if necessary.
> + * walreceiver if necessary. The WAL receiver may have
> + * already started (if it was configured to start
> + * eagerly).
>   */
>  currentSource = XLOG_FROM_STREAM;
> -startWalReceiver = true;
> +startWalReceiver = !WalRcvStreaming();
>  break;
>  case XLOG_FROM_ARCHIVE:
>  case XLOG_FROM_PG_WAL:
>
> -/*
> - * WAL receiver must not be running when reading WAL from
> - * archive or pg_wal.
> - */
> -Assert(!WalRcvStreaming());
>
> These parts should IMO not be changed.  They are strong assumptions we
> rely on in the startup process, and this comes down to the fact that
> it is not a good idea to mix a WAL receiver started while
> currentSource could be pointing at a WAL source completely different.
> That's going to bring a lot of racy conditions, I am afraid, as we
> rely on currentSource a lot during recovery, in combination that we
> expect the code to be able to retrieve WAL in a linear fashion from
> the LSN position that recovery is looking for.
>
> So, I think that deciding if a WAL receiver should be started blindly
> outside of the code path deciding if the startup process is waiting
> for some WAL is not a good idea, and the position we may begin to
> stream from may be something that we may have zero need for at the
> end (this is going to be tricky if we detect a TLI jump while
> replaying the local WAL, also?).  The issue is that I am not sure what
> a good design for that should be.  We have no idea when the startup
> process will need WAL from a different source until replay comes
> around, but what we want here is to anticipate this LSN :)

Can you elaborate on the race conditions that you are thinking about?
Do the race conditions manifest only when we mix archiving and streaming?
If yes, how do you feel about making the GUC a no-op with a WARNING
while we are in WAL archiving mode?

> I am wondering if there should be a way to work out something with the
> control file, though, but things can get very fancy with HA
> and base backup deployments and the various cases we support thanks to
> the current way recovery works, as well.  We could also go simpler and
> rework the priority order if both archiving and streaming are options
> wanted by the user.

Agreed, it would be much better to depend on the state in pg_wal,
namely the files that are available there.

Reworking the priority order seems like an appealing fix - if we can say
streaming > archiving in terms of priority, then the race that you are
referring to will not happen?

Also, what are some use cases where one would give priority to streaming
replication over archive recovery, if both sources have the same WAL
segments?

Regards,
Ashwin & Soumyadeep
From 7e301866e0468ed8def5b34df2d4570d6678baf3 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Fri, 30 Jul 2021 18:21:55 -0700
Subject: [PATCH v3 1/1] Introduce feature to start WAL receiver eagerly

This commit introduces a new GUC wal_receiver_start_condition which can
enable the standby to start its WAL receiver at an earlier stage. The
GUC will default to starting the WAL receiver after WAL from archives
and pg_wal have been exhausted, designated by the value 'exhaust'.
The value of 'startup' indicates that the WAL receiver will be started
immediately on standby startup. Finally, the value of 'consistency'
indicates that the WAL receiver will start after the standby has
replayed up to the consistency point.

If 'startup' or 'consistency' is specified, the starting point for the
WAL receiver will always be the end of all locally available WAL in
pg_wal. The end is determined by finding the latest WAL segment in
pg_wal and then iterating to the earliest segment. The iteration is
terminated as soon as a valid WAL segment is found. Streaming can then
commence from the start of that segment.

Co-authors: Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion:
https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml  |  33 
 src/backend/access/transam/xlog.c | 170 +-
 src/backend/replication/walreceiver.c |   1 +
 src/backend/utils/misc/guc.c  |  18 ++
 src/backend/utils/misc/postgresql.conf.sample |   3 +-
 src/include/replication/walreceiver.h |  10 ++
 src/test/recovery/t/027_walreceiver_start.pl  |  96 ++
 7 files changed, 321 insertions(+), 10 deletions(-)
 create mode 100644 src/test/recovery/t/027_walreceiver_start.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3f806740d5d..b911615bf04 100644
--- a/doc/src/sgml

Re: Unnecessary delay in streaming replication due to replay lag

2021-10-25 Thread Soumyadeep Chakraborty
Rebased and added a CF entry for Nov CF:
https://commitfest.postgresql.org/35/3376/.


Re: Unnecessary delay in streaming replication due to replay lag

2021-08-24 Thread Soumyadeep Chakraborty
Hello,

Ashwin and I recently got a chance to work on this and we addressed all
outstanding feedback and suggestions. PFA a significantly reworked patch.

On 20.11.2020 11:21, Michael Paquier wrote:

> This patch thinks that it is fine to request streaming even if
> PrimaryConnInfo is not set, but that's not fine.

We introduced a check to ensure that PrimaryConnInfo is set up before we
request the WAL stream eagerly.

> Anyway, I don't quite understand what you are trying to achieve here.
> "startpoint" is used to request the beginning of streaming.  It is
> roughly the consistency LSN + some alpha with some checks on WAL
> pages (those WAL page checks are not acceptable as they make
> maintenance harder).  What about the case where consistency is
> reached but there are many segments still ahead that need to be
> replayed?  Your patch would cause streaming to begin too early, and
> a manual copy of segments is not a rare thing as in some environments
> a bulk copy of segments can make the catchup of a standby faster than
> streaming.
>
> It seems to me that what you are looking for here is some kind of
> pre-processing before entering the redo loop to determine the LSN
> that could be reused for the fast streaming start, which should match
> the end of the WAL present locally.  In short, you would need a
> XLogReaderState that begins a scan of WAL from the redo point until it
> cannot find anything more, and use the last LSN found as a base to
> begin requesting streaming.  The question of timeline jumps can also
> be very tricky, but it could also be possible to not allow this option
> if a timeline jump happens while attempting to guess the end of WAL
> ahead of time.  Another thing: could it be useful to have an extra
> mode to begin streaming without waiting for consistency to finish?

1. When wal_receiver_start_condition='consistency', we feel that the
stream start point calculation should be done only when we reach
consistency. Imagine the situation where consistency is reached 2 hours
after start, and within that 2 hours a lot of WAL has been manually
copied over into the standby's pg_wal. If we pre-calculated the stream
start location before we entered the main redo apply loop, we would be
starting the stream from a much earlier location (minus the 2 hours
worth of WAL), leading to wasted work.

2. We have significantly changed the code to calculate the WAL stream
start location. We now traverse pg_wal, find the latest valid WAL
segment and start the stream from the segment's start. This is much
more performant than reading from the beginning of the locally available
WAL.
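
As an aside on the traversal just described: a 24-character segment
name decomposes into three 8-hex-digit fields (timeline, "log" number,
segment within log). The standalone sketch below shows the arithmetic,
assuming the default 16 MB segment size (256 segments per log number);
it is plain C for illustration, not the actual XLogFromFileName() code:

#include <stdio.h>

int main(void)
{
	const char *fname = "000000010000000200000041";	/* example name */
	unsigned int tli, log, seg;

	if (sscanf(fname, "%08X%08X%08X", &tli, &log, &seg) != 3)
		return 1;

	/* 0x100 segments of 16 MB per "log" number at the default size */
	unsigned long long segno = (unsigned long long) log * 0x100 + seg;

	printf("timeline %u, segment number %llu\n", tli, segno);
	return 0;
}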

3. To perform the validation check, we no longer have duplicate code,
as we can now rely on XLogReaderState, XLogReaderValidatePageHeader()
and friends.

4. We have an extra mode: wal_receiver_start_condition='startup', which
will start the WAL receiver before the startup process reaches
consistency. We don't fully understand the utility of having 'startup' over
'consistency' though.

5. During the traversal of pg_wal, if we find WAL segments on differing
timelines, we bail out and abandon attempting to start the WAL stream
eagerly.

6. To handle the cases where a lot of WAL is copied over after the
WAL receiver has started at consistency:
i) Don't recommend wal_receiver_start_condition='startup|consistency'.

ii) Copy over the WAL files and then start the standby, so that the WAL
stream starts from a fresher point.

iii) Have an LSN/segment# target to start the WAL receiver from?

7. We have significantly changed the test. It is much more simplified
and focused.

8. We did not test wal_receiver_start_condition='startup' in the test.
It's actually hard to assert that the walreceiver has started at
startup. recovery_min_apply_delay only kicks in once we reach
consistency, and thus there is no way I could think of to reliably halt
the startup process and check: "Has the wal receiver started even
though the standby hasn't reached consistency?" The only way we could think
of is to generate a large workload during the course of the backup so
that the standby has significant WAL to replay before it reaches
consistency. But that will make the test flaky as we will have no
absolutely precise wait condition. That said, we felt that checking
for 'consistency' is enough as it covers the majority of the added
code.

9. We added a documentation section describing the GUC.


Regards,
Ashwin and Soumyadeep (VMware)
From b3703fb16a352bd9166ed75de7b68599c735ac63 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Fri, 30 Jul 2021 18:21:55 -0700
Subject: [PATCH v1 1/1] Introduce feature to start WAL receiver eagerly

This commit introduces a new GUC wal_receiver_start_condition which can
enable the standby to start its WAL receiver at an earlier stage. The
GUC will default to starting the WAL receiver after WAL from archives
and pg_

Re: Changes to recovery_min_apply_delay are ignored while waiting for delay

2021-08-15 Thread Soumyadeep Chakraborty
On Sun, Aug 15, 2021 at 8:16 PM Michael Paquier  wrote:
>
> On Fri, Aug 13, 2021 at 05:59:21PM -0700, Soumyadeep Chakraborty wrote:
> > and passes with the code change, as expected. I can't explain why the
> > test doesn't freeze up in v3 in wait_for_catchup() at the end.
>
> It took me some some to understand why.  If I am right, that's because
> of the intermediate test block working on $standby_2 and the two
> INSERT queries of the primary.  In v1 and v4, we have no activity on
> the primary between the first set of tests and yours, meaning that
> $standby has nothing to do.  In v2 and v3, the two INSERT queries run
> on the primary for the purpose of the recovery pause make $standby_1
> wait for the default value of recovery_min_apply_delay, aka 3s, in
> parallel.  If the set of tests for $standby_2 is faster than that,
> we'd bump on the phase where the code still waited for 3s, not the 2
> hours set, visibly.

I see, thanks a lot for the explanation. Thanks to your investigation, I
can now kind of reuse some of the test mechanisms for the other patch that
I am working on [1]. There, we don't have multiple standbys getting in the
way, thankfully.

> After considering this stuff, the order dependency we'd introduce in
> this test makes the whole thing more brittle than it should.  And such
> an edge case does not seem worth spending extra cycles testing anyway,
> as if things break we'd finish with a test stuck for an unnecessary
> long time by relying on wait_for_catchup("replay").  We could use
> something else, say based on a lookup of pg_stat_activity but this
> still requires extra run time for the wait phases needed.  So at the
> end I have dropped the test, but backpatched the fix.
> --

Fair.

Regards,
Soumyadeep (VMware)

[1] 
https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com




Re: Changes to recovery_min_apply_delay are ignored while waiting for delay

2021-08-13 Thread Soumyadeep Chakraborty
Hey Michael,

Really appreciate the review!

On Wed, Aug 11, 2021 at 12:40 AM Michael Paquier  wrote:

> Agreed that the current behavior is confusing.  As you are using the
> commit timestamp for the comparison, this is right.  One small-ish
> comment I have about the code is that we should mention
> recovery_min_apply_delay for HandleStartupProcInterrupts(), and not
> only the trigger file location.

Cool, updated.

> +# First, set the delay for the next commit to some obscenely high value.
> It has no need to be obscenely high, just high enough to give the time
> to slow machines to catch the wait event lookup done.  So this could
> use better words to explain this choice.

Sounds good. Done.

> Anyway, it seems to me that something is incorrect in this new test
> (manual tests pass here, the TAP test is off).  One thing we need to
> make sure for any new test added is that it correctly breaks if the
> fix proposed is *not* in place.  And as far as I can see, the test
> passes even if the recalculation of delayUntil is done within the loop
> that reloads the configuration.  The thing may be a bit tricky to make
> reliable as the parameter reloading may cause wait_event to not be
> RecoveryApplyDelay, so we should have at least a check based on a scan
> of pg_stat_replication.replay_lsn on the primary.  Perhaps you have
> an other idea?

Hmm, please see attached v4 which just shifts the test to the middle,
like v1. When I run the test without the code change, the test hangs
as expected in:

# Now the standby should catch up.
$node_primary->wait_for_catchup('standby', 'replay');

and passes with the code change, as expected. I can't explain why the
test doesn't freeze up in v3 in wait_for_catchup() at the end.

As for checking on the wait event, since we only signal the standby
after performing the check, that should be fine. Nothing else would be
sending a SIGHUP before the check. Is that assumption correct?

> When using wait_for_catchup(), I would recommend to list "replay" for
> this test, even if that's the default mode, to make clear what the
> intention is.

Done.

Regards,
Soumyadeep (VMware)
From 824076a977af0e40da1c7eb9e4aeac9a8e81a7ee Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Mon, 2 Aug 2021 20:50:37 -0700
Subject: [PATCH v4 1/1] Refresh delayUntil in recovery delay loop

This ensures that the wait interval in the loop is correctly
recalculated with the updated value of recovery_min_apply_delay.

Now, if one changes the GUC while we are waiting, those changes will
take effect. Practical applications include instantly cancelling a long
wait time by setting recovery_min_apply_delay to 0. This can be useful
in tests.
---
 src/backend/access/transam/xlog.c   | 13 +---
 src/test/recovery/t/005_replay_delay.pl | 41 ++---
 2 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d0ec6a834be..74ad7ff905b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6234,12 +6234,11 @@ recoveryApplyDelay(XLogReaderState *record)
	if (!getRecordTimestamp(record, &xtime))
 		return false;
 
-	delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
-
 	/*
 	 * Exit without arming the latch if it's already past time to apply this
 	 * record
 	 */
+	delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
 	msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
 	if (msecs <= 0)
 		return false;
@@ -6248,14 +6247,20 @@ recoveryApplyDelay(XLogReaderState *record)
 	{
		ResetLatch(&XLogCtl->recoveryWakeupLatch);
 
-		/* might change the trigger file's location */
+		/* might change the trigger file's location and recovery_min_apply_delay */
 		HandleStartupProcInterrupts();
 
 		if (CheckForStandbyTrigger())
 			break;
 
 		/*
-		 * Wait for difference between GetCurrentTimestamp() and delayUntil
+		 * Recalculate delayUntil as recovery_min_apply_delay could have changed
+		 * since last time.
+		 */
+		delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
+
+		/*
+		 * Wait for difference between GetCurrentTimestamp() and delayUntil.
 		 */
 		msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
 delayUntil);
diff --git a/src/test/recovery/t/005_replay_delay.pl b/src/test/recovery/t/005_replay_delay.pl
index 0b56380e0a7..ad8093df41a 100644
--- a/src/test/recovery/t/005_replay_delay.pl
+++ b/src/test/recovery/t/005_replay_delay.pl
@@ -7,7 +7,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 3;
+use Test::More tests => 4;
 
 # Initialize primary node
 my $node_primary = PostgresNode->new('primary');
@@ -56,6 +56,39 @@ $node_standby->poll_query_until('postgres',
 ok(time() - $primary_insert_time >= $delay,
 	"standby applies WAL only af

Re: Changes to recovery_min_apply_delay are ignored while waiting for delay

2021-08-06 Thread Soumyadeep Chakraborty
Rebased. Also added a stronger check to see if the standby is stuck in
recovery_min_apply_delay:

$node_standby->poll_query_until('postgres', qq{
SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
WHERE backend_type='startup';
}) or die "Timed out checking if startup is in recovery_min_apply_delay";

Attached v3.

Regards,
Soumyadeep (VMware)
From 441716f45d0fbffde1135ba007c3a6c5b6b464de Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Mon, 2 Aug 2021 20:50:37 -0700
Subject: [PATCH v3 1/1] Refresh delayUntil in recovery delay loop

This ensures that the wait interval in the loop is correctly
recalculated with the updated value of recovery_min_apply_delay.

Now, if one changes the GUC while we are waiting, those changes will
take effect. Practical applications include instantly cancelling a long
wait time by setting recovery_min_apply_delay to 0. This can be useful
in tests.
---
 src/backend/access/transam/xlog.c   | 11 +---
 src/test/recovery/t/005_replay_delay.pl | 36 +++--
 2 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d0ec6a834be..3e05119212b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6234,12 +6234,11 @@ recoveryApplyDelay(XLogReaderState *record)
	if (!getRecordTimestamp(record, &xtime))
 		return false;
 
-	delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
-
 	/*
 	 * Exit without arming the latch if it's already past time to apply this
 	 * record
 	 */
+	delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
 	msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
 	if (msecs <= 0)
 		return false;
@@ -6255,7 +6254,13 @@ recoveryApplyDelay(XLogReaderState *record)
 			break;
 
 		/*
-		 * Wait for difference between GetCurrentTimestamp() and delayUntil
+		 * Recalculate delayUntil as recovery_min_apply_delay could have changed
+		 * while we were waiting in the loop.
+		 */
+		delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
+
+		/*
+		 * Wait for difference between GetCurrentTimestamp() and delayUntil.
 		 */
 		msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
 delayUntil);
diff --git a/src/test/recovery/t/005_replay_delay.pl b/src/test/recovery/t/005_replay_delay.pl
index 0b56380e0a7..4f409ad10d1 100644
--- a/src/test/recovery/t/005_replay_delay.pl
+++ b/src/test/recovery/t/005_replay_delay.pl
@@ -7,7 +7,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 3;
+use Test::More tests => 4;
 
 # Initialize primary node
 my $node_primary = PostgresNode->new('primary');
@@ -56,7 +56,6 @@ $node_standby->poll_query_until('postgres',
 ok(time() - $primary_insert_time >= $delay,
 	"standby applies WAL only after replication delay");
 
-
 # Check that recovery can be paused or resumed expectedly.
 my $node_standby2 = PostgresNode->new('standby2');
 $node_standby2->init_from_backup($node_primary, $backup_name,
@@ -110,3 +109,36 @@ $node_standby2->poll_query_until('postgres',
 $node_standby2->promote;
 $node_standby2->poll_query_until('postgres', "SELECT NOT pg_is_in_recovery()")
   or die "Timed out while waiting for promotion to finish";
+
+# Now test to see if updates to recovery_min_apply_delay apply when a standby is
+# waiting for a recovery delay to elapse.
+
+# First, set the delay for the next commit to some obscenely high value.
+$node_standby->safe_psql('postgres', "ALTER SYSTEM SET recovery_min_apply_delay TO '2h';");
+$node_standby->reload;
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO tab_int VALUES (generate_series(51, 60))");
+
+# The standby's flush LSN should be caught up.
+my $last_insert_lsn =
+	$node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+$node_primary->wait_for_catchup('standby', 'flush', $last_insert_lsn);
+
+# The standby's replay LSN will not be caught up as it should be waiting in
+# recovery_min_apply_delay.
+$node_standby->poll_query_until('postgres', qq{
+SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
+WHERE backend_type='startup';
+}) or die "Timed out checking if startup is in recovery_min_apply_delay";
+is( $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn() < '$last_insert_lsn'::pg_lsn;"),
+	't', 'standby stuck in recovery_min_apply_delay');
+
+# Clear the recovery_min_apply_delay timeout so that the wait is immediately
+# cancelled and replay can proceed.
+$node_standby->safe_psql('postgres', "ALTER SYSTEM SET recovery_min_apply_delay TO DEFAULT;");
+$node_standby->reload;
+
+# Now the standby should catch up.
+$node_primary->wait_for_catchup('standby');
-- 
2.25.1



Re: Changes to recovery_min_apply_delay are ignored while waiting for delay

2021-08-03 Thread Soumyadeep Chakraborty
Hi Kyotaro,

Thanks for the review!

On Mon, Aug 2, 2021 at 11:42 PM Kyotaro Horiguchi
 wrote:

> One comment from me.
>
> +$node_standby->safe_psql('postgres', "ALTER SYSTEM SET 
> recovery_min_apply_delay TO 0;");
>
> It might be better do "SET reco.. TO DEFAULT" instead.
>

Sure.

> And how about adding the new test at the end of existing tests. We can
> get rid of the extra lines in the diff.

No problem. See attached v2. I didn't do that initially as the test file
looks as though it is split into two halves: one for testing
recovery_min_apply_delay and the other for testing recovery pause. So while
making this change, I added a header comment for the newly added test case.
Please take a look.

Regards,
Soumyadeep (VMware)
From 8b4ea51e1a22548fdd3f6921fe374d3a9d987a77 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Mon, 2 Aug 2021 20:50:37 -0700
Subject: [PATCH v2 1/1] Refresh delayUntil in recovery delay loop

This ensures that the wait interval in the loop is correctly
recalculated with the updated value of recovery_min_apply_delay.

Now, if one changes the GUC while we are waiting, those changes will
take effect. Practical applications include instantly cancelling a long
wait time by setting recovery_min_apply_delay to 0. This can be useful
in tests.
---
 src/backend/access/transam/xlog.c   | 11 ++---
 src/test/recovery/t/005_replay_delay.pl | 31 +++--
 2 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f84c0bb01eb..89dc759586c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6234,12 +6234,11 @@ recoveryApplyDelay(XLogReaderState *record)
	if (!getRecordTimestamp(record, &xtime))
 		return false;
 
-	delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
-
 	/*
 	 * Exit without arming the latch if it's already past time to apply this
 	 * record
 	 */
+	delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
 	msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
 	if (msecs <= 0)
 		return false;
@@ -6255,7 +6254,13 @@ recoveryApplyDelay(XLogReaderState *record)
 			break;
 
 		/*
-		 * Wait for difference between GetCurrentTimestamp() and delayUntil
+		 * Recalculate delayUntil as recovery_min_apply_delay could have changed
+		 * while we were waiting in the loop.
+		 */
+		delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
+
+		/*
+		 * Wait for difference between GetCurrentTimestamp() and delayUntil.
 		 */
 		msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
 delayUntil);
diff --git a/src/test/recovery/t/005_replay_delay.pl b/src/test/recovery/t/005_replay_delay.pl
index 0b56380e0a7..d55f1fc257f 100644
--- a/src/test/recovery/t/005_replay_delay.pl
+++ b/src/test/recovery/t/005_replay_delay.pl
@@ -7,7 +7,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 3;
+use Test::More tests => 4;
 
 # Initialize primary node
 my $node_primary = PostgresNode->new('primary');
@@ -56,7 +56,6 @@ $node_standby->poll_query_until('postgres',
 ok(time() - $primary_insert_time >= $delay,
 	"standby applies WAL only after replication delay");
 
-
 # Check that recovery can be paused or resumed expectedly.
 my $node_standby2 = PostgresNode->new('standby2');
 $node_standby2->init_from_backup($node_primary, $backup_name,
@@ -110,3 +109,31 @@ $node_standby2->poll_query_until('postgres',
 $node_standby2->promote;
 $node_standby2->poll_query_until('postgres', "SELECT NOT pg_is_in_recovery()")
   or die "Timed out while waiting for promotion to finish";
+
+# Now test to see if updates to recovery_min_apply_delay apply when a standby is
+# waiting for a recovery delay to elapse.
+
+# First, set the delay for the next commit to some obscenely high value.
+$node_standby->safe_psql('postgres', "ALTER SYSTEM SET recovery_min_apply_delay TO '2h';");
+$node_standby->reload;
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO tab_int VALUES (generate_series(51, 60))");
+
+# We should not have replayed the LSN from the last insert on the standby yet,
+# even though it will be received and flushed eventually. In other words, we
+# should be stuck in recovery_min_apply_delay.
+my $last_insert_lsn =
+	$node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+$node_primary->wait_for_catchup('standby', 'flush', $last_insert_lsn);
+is( $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn() < '$last_insert_lsn'::pg_lsn;"),
+	't', 'standby stuck in recovery_min_apply_delay');
+
+# Clear the recovery_min_apply_delay timeout so that the wait is immediately
+# cancelled and replay can proceed.
+$node_standby->safe_psql('postgres', "ALTER SYSTEM SET recovery_min_apply_delay TO DEFAULT;");
+$node_standby->reload;
+
+# Now the standby should catch up.
+$node_primary->wait_for_catchup('standby');
-- 
2.25.1



Changes to recovery_min_apply_delay are ignored while waiting for delay

2021-08-02 Thread Soumyadeep Chakraborty
Hello,

I came across this issue while I was tweaking a TAP test with Ashwin for
this thread [1].

We noticed that while the startup process waits for a recovery delay, it does
not respect changes to the recovery_min_apply_delay GUC. So:

// Standby is waiting on recoveryWakeupLatch with a large
// recovery_min_apply_delay value
node_standby->safe_psql('postgres',
    'ALTER SYSTEM SET recovery_min_apply_delay TO 0;');
node_standby->reload;
// Standby is still waiting; it does not proceed until the old timeout
// has elapsed.

I attached a small patch to fix this. It makes it so that
recovery_min_apply_delay is reconsulted in the loop to redetermine the wait
interval. This makes it similar to the loop in CheckpointerMain, where
CheckPointTimeout is consulted after interrupts are handled:

if (elapsed_secs >= CheckPointTimeout)
continue; /* no sleep for us ... */

I have also added a test to 005_replay_delay.pl.
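
The shape of the fix, reduced to a self-contained C sketch (hypothetical
names; the real change is in the attached patch): recompute the deadline
from the current setting on every iteration, so that a reload taking
effect mid-wait shortens or cancels the remaining sleep:

#include <time.h>

static volatile long config_delay_ms = 3000;	/* reloaded elsewhere */

static long now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

static void wait_with_reloadable_delay(long record_time_ms)
{
	for (;;)
	{
		/*
		 * A real loop would handle SIGHUP here, possibly updating
		 * config_delay_ms (a hypothetical stand-in for the GUC).
		 */

		/* recomputed every iteration: the essence of the fix */
		long	delay_until = record_time_ms + config_delay_ms;
		long	msecs = delay_until - now_ms();

		if (msecs <= 0)
			break;				/* elapsed, or the delay was lowered */

		/* sleep at most msecs; a latch wait would wake earlier */
		struct timespec req = {msecs / 1000, (msecs % 1000) * 1000000L};

		nanosleep(&req, NULL);
	}
}

int main(void)
{
	wait_with_reloadable_delay(now_ms());	/* waits ~config_delay_ms */
	return 0;
}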

Regards,
Soumyadeep (VMware)

[1] 
https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
From e1b93d4b3874eb0923e18dca9a9eccd453d6cd56 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Mon, 2 Aug 2021 20:50:37 -0700
Subject: [PATCH v1 1/1] Refresh delayUntil in recovery delay loop

This ensures that the wait interval in the loop is correctly
recalculated with the updated value of recovery_min_apply_delay.

Now, if one changes the GUC while we are waiting, those changes will
take effect. Practical applications include instantly cancelling a long
wait time by setting recovery_min_apply_delay to 0. This can be useful
in tests.
---
 src/backend/access/transam/xlog.c   | 11 ++---
 src/test/recovery/t/005_replay_delay.pl | 32 +
 2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f84c0bb01eb..89dc759586c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6234,12 +6234,11 @@ recoveryApplyDelay(XLogReaderState *record)
	if (!getRecordTimestamp(record, &xtime))
 		return false;
 
-	delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
-
 	/*
 	 * Exit without arming the latch if it's already past time to apply this
 	 * record
 	 */
+	delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
 	msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
 	if (msecs <= 0)
 		return false;
@@ -6255,7 +6254,13 @@ recoveryApplyDelay(XLogReaderState *record)
 			break;
 
 		/*
-		 * Wait for difference between GetCurrentTimestamp() and delayUntil
+		 * Recalculate delayUntil as recovery_min_apply_delay could have changed
+		 * while we were waiting in the loop.
+		 */
+		delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
+
+		/*
+		 * Wait for difference between GetCurrentTimestamp() and delayUntil.
 		 */
 		msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
 delayUntil);
diff --git a/src/test/recovery/t/005_replay_delay.pl b/src/test/recovery/t/005_replay_delay.pl
index 0b56380e0a7..74f821b3248 100644
--- a/src/test/recovery/t/005_replay_delay.pl
+++ b/src/test/recovery/t/005_replay_delay.pl
@@ -7,7 +7,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 3;
+use Test::More tests => 4;
 
 # Initialize primary node
 my $node_primary = PostgresNode->new('primary');
@@ -56,6 +56,30 @@ $node_standby->poll_query_until('postgres',
 ok(time() - $primary_insert_time >= $delay,
 	"standby applies WAL only after replication delay");
 
+# Now set the delay for the next commit to some obscenely high value.
+$node_standby->safe_psql('postgres', "ALTER SYSTEM SET recovery_min_apply_delay TO '2h';");
+$node_standby->reload;
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO tab_int VALUES (generate_series(21, 30))");
+
+# We should not have replayed the LSN from the last insert on the standby yet,
+# even though it will be received and flushed eventually. In other words, we
+# should be stuck in recovery_min_apply_delay.
+my $last_insert_lsn =
+	$node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+$node_primary->wait_for_catchup('standby', 'flush', $last_insert_lsn);
+is( $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn() < '$last_insert_lsn'::pg_lsn;"),
+	't', 'standby stuck in recovery_min_apply_delay');
+
+# Clear the recovery_min_apply_delay timeout so that the wait is immediately
+# cancelled and replay can proceed.
+$node_standby->safe_psql('postgres', "ALTER SYSTEM SET recovery_min_apply_delay TO 0;");
+$node_standby->reload;
+
+# Now the standby should catch up.
+$node_primary->wait_for_catchup('standby');
 
 # Check that recovery can be paused or resumed expectedly.
 my $node_standby2 = PostgresNode->new('stand

Re: A micro-optimisation for ProcSendSignal()

2021-08-02 Thread Soumyadeep Chakraborty
Hey Thomas,

On Mon, Aug 2, 2021 at 6:45 PM Thomas Munro  wrote:
>
> Hi Soumyadeep,
>
> On Sat, Jul 24, 2021 at 5:26 PM Soumyadeep Chakraborty
>  wrote:
> > On Tue, Jul 20, 2021 at 10:40 PM Thomas Munro  
> > wrote:
> > > I wonder why we need this member anyway, when you can compute it from
> > > the address... #define GetPGProcNumber(p) ((p) - ProcGlobal->allProcs)
> > > or something like that?  Kinda wonder why we don't use
> > > GetPGProcByNumber() in more places instead of open-coding access to
> > > ProcGlobal->allProcs, too...
> >
> > I tried this out. See attached v4 of your patch with these changes.
>
> I like it.  I've moved these changes to a separate patch, 0002, for
> separate commit.  I tweaked a couple of comments (it's not a position
> in the "procarray", well it's a position stored in the procarray, but
> that's confusing; I also found a stray comment about leader->pgprocno
> that is obsoleted by this change).  Does anyone have objections to
> this?

Awesome, thanks! Looks good.

> I was going to commit the earlier change this morning, but then I read [1].
>
> New idea.  Instead of adding pgprocno to SERIALIZABLEXACT (which we
> should really be trying to shrink, not grow), let's look it up by
> vxid->backendId.  I didn't consider that before, because I was trying
> not to get tangled up with BackendIds for various reasons, not least
> that that's yet another lock + O(n) search.
>
> It seems likely that getting from vxid to latch will be less clumsy in
> the near future.  That refactoring and harmonising of backend
> identifiers seems like a great idea to me.  Here's a version that
> anticipates that, using vxid->backendId to wake a sleeping
> SERIALIZABLE READ ONLY DEFERRABLE backend, without having to add a new
> member to the struct.
>

Neat! A Vxid -> PGPROC lookup eventually becomes an O(1) operation with
the changes proposed in the final paragraph of [1].

[1] 
https://www.postgresql.org/message-id/20210802164124.ufo5buo4apl6yuvs%40alap3.anarazel.de

Regards,
Soumyadeep (VMware)




Re: A micro-optimisation for ProcSendSignal()

2021-07-23 Thread Soumyadeep Chakraborty
Hi Thomas,

On Tue, Jul 20, 2021 at 10:40 PM Thomas Munro  wrote:

> > Slightly tangential: we should add a comment to PGPROC.pgprocno, for more
> > immediate understandability:
> >
> > + int pgprocno; /* index of this PGPROC in ProcGlobal->allProcs */
>
> I wonder why we need this member anyway, when you can compute it from
> the address... #define GetPGProcNumber(p) ((p) - ProcGlobal->allProcs)
> or something like that?  Kinda wonder why we don't use
> GetPGProcByNumber() in more places instead of open-coding access to
> ProcGlobal->allProcs, too...

I tried this out. See attached v4 of your patch with these changes.
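
For anyone following along, the identity behind that macro is plain
pointer arithmetic: an element's index in an array is the difference
between its address and the array's base. A tiny standalone sketch,
using a stand-in struct rather than the real PGPROC:

#include <assert.h>
#include <stddef.h>

typedef struct Proc { int dummy; } Proc;

int main(void)
{
	Proc		all[16];
	Proc	   *p = &all[7];

	/* what ((p) - ProcGlobal->allProcs) computes for a real PGPROC */
	ptrdiff_t	procno = p - all;

	assert(procno == 7);
	return 0;
}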

> > Also, why don't we take the opportunity to get rid of 
> > SERIALIZABLEXACT->pid? We
> > took a stab. Attached is v2 of your patch with these changes.
>
> SERIALIZABLEXACT objects can live longer than the backends that
> created them.  They hang around to sabotage other transactions' plans,
> depending on what else they overlapped with before they committed.
> With that change, the pg_locks view might show the pid of some
> unrelated session that moves into the same PGPROC.

I see.

>
> It's only an "informational" pid, and pids are imperfect information
> anyway because (1) they are themselves recycled, and (2) they won't be
> interesting in a hypothetical multi-threaded future.  One solution
> would be to hide the pids from the view after the backend disconnects
> (somehow -- add a generation number?), but they're also still kinda
> useful, despite the weaknesses.  I wonder what the ideal way would be
> to refer to sessions, anyway, including those that are no longer
> active; perhaps we could invent a new "session ID" concept.

Updating the pg_locks view:

Yes, the pids may be valuable for future debugging/audit purposes. Also,
on systems where pid_max is high enough to avoid wraparound, pids are
not recycled. I would lean towards showing the pid even after the
backend has exited.

Perhaps we could overload the stored pid to be negated (i.e. a backend
with pid 2 will become -2) to indicate that the pid belongs to
a backend that has exited. Additionally, we could introduce a boolean
field in pg_locks "backendisalive", so that the end user doesn't have
to reason about negative pids.
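
To make the proposal concrete, a purely illustrative C sketch (nothing
like this exists in PostgreSQL today) of the negated-pid encoding and
the derived "backend is alive" flag:

#include <stdbool.h>
#include <stdio.h>

static int mark_backend_exited(int stored_pid)
{
	return stored_pid > 0 ? -stored_pid : stored_pid;
}

static bool backend_is_alive(int stored_pid)
{
	return stored_pid > 0;
}

static int display_pid(int stored_pid)
{
	return stored_pid > 0 ? stored_pid : -stored_pid;
}

int main(void)
{
	int			stored = 4242;	/* hypothetical pid at creation */

	stored = mark_backend_exited(stored);
	printf("pid=%d alive=%s\n", display_pid(stored),
		   backend_is_alive(stored) ? "t" : "f");
	return 0;
}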

Session ID:

Interesting, Greenplum uses the concept of session ID pervasively. Being
a distributed database, the session ID helps to tie individual backends
on each PG instance to the same session. The session ID of course comes
with its share of bookkeeping:

* These IDs are incrementally dished out with a counter
  (with pg_atomic_add_fetch_u32), in increments of 1, on the Greenplum
  coordinator PG instance in InitProcess.

* The counter is a part of ProcGlobal and itself is initialized to 0 in
  InitProcGlobal, which means that session IDs are reset on cluster
  restart.

* The sessionID tied to each process is maintained in PGPROC.

* The sessionID finds its way into PgBackendStatus, which is further
  reported with pg_stat_get_activity.

A session ID seems a bit heavy just to indicate whether a backend has
exited.
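
A self-contained C11 sketch of the first bookkeeping item above (the
names are illustrative; Greenplum itself uses pg_atomic_add_fetch_u32 on
a counter in ProcGlobal, in shared memory):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* would live in shared state; zero again after a cluster restart */
static atomic_uint session_counter;

static uint32_t assign_session_id(void)
{
	/* each new backend gets the next ID, with no lock needed */
	return atomic_fetch_add(&session_counter, 1) + 1;
}

int main(void)
{
	printf("first session:  %u\n", assign_session_id());
	printf("second session: %u\n", assign_session_id());
	return 0;
}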

Regards,
Soumyadeep
From c964a977ca41c057737ca34420e002dc85663eb0 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 11 Mar 2021 23:09:11 +1300
Subject: [PATCH v4 1/1] Optimize ProcSendSignal().

Instead of referring to target backends by pid, use pgprocno.  This
means that we don't have to scan the ProcArray, and we can drop some
special case code for dealing with the startup process.

Discussion: https://postgr.es/m/CA%2BhUKGLYRyDaneEwz5Uya_OgFLMx5BgJfkQSD%3Dq9HmwsfRRb-w%40mail.gmail.com
Reviewed-by: Soumyadeep Chakraborty 
Reviewed-by: Ashwin Agrawal 
---
 src/backend/access/transam/clog.c |  2 +-
 src/backend/access/transam/twophase.c |  3 +-
 src/backend/access/transam/xlog.c |  3 +-
 src/backend/postmaster/bgwriter.c |  2 +-
 src/backend/postmaster/pgarch.c   |  2 +-
 src/backend/storage/buffer/buf_init.c |  3 +-
 src/backend/storage/buffer/bufmgr.c   | 10 ++--
 src/backend/storage/ipc/procarray.c   |  6 +--
 src/backend/storage/lmgr/condition_variable.c | 12 ++---
 src/backend/storage/lmgr/lwlock.c |  6 +--
 src/backend/storage/lmgr/predicate.c  |  6 ++-
 src/backend/storage/lmgr/proc.c   | 50 ++-
 src/backend/utils/adt/lockfuncs.c |  1 +
 src/include/storage/buf_internals.h   |  2 +-
 src/include/storage/lock.h|  2 +-
 src/include/storage/predicate_internals.h |  1 +
 src/include/storage/proc.h| 12 ++---
 17 files changed, 42 insertions(+), 81 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 3ea16a270a8..d0557f6c555 100644
--- a/src/backend/ac

Re: A micro-optimisation for ProcSendSignal()

2021-07-17 Thread Soumyadeep Chakraborty
Hi Thomas,

You might have missed a spot to initialize SERIALIZABLEXACT->pgprocno in
InitPredicateLocks(), so:

+ PredXact->OldCommittedSxact->pgprocno = INVALID_PGPROCNO;

Slightly tangential: we should add a comment to PGPROC.pgprocno, for more
immediate understandability:

+ int pgprocno; /* index of this PGPROC in ProcGlobal->allProcs */

Also, why don't we take the opportunity to get rid of SERIALIZABLEXACT->pid? We
took a stab. Attached is v2 of your patch with these changes.

Regards,
Ashwin and Deep
From 08d5ab9f498140e1c3a8f6f00d1acd25b3fb8ba0 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 11 Mar 2021 23:09:11 +1300
Subject: [PATCH v2 1/1] Optimize ProcSendSignal().

Instead of referring to target processes by pid, use pgprocno so that we
don't have to scan the ProcArray and keep track of the startup process.

Discussion: https://postgr.es/m/CA%2BhUKGLYRyDaneEwz5Uya_OgFLMx5BgJfkQSD%3Dq9HmwsfRRb-w%40mail.gmail.com
---
 src/backend/access/transam/xlog.c |  1 -
 src/backend/storage/buffer/buf_init.c |  3 +-
 src/backend/storage/buffer/bufmgr.c   | 10 ++---
 src/backend/storage/lmgr/predicate.c  | 23 ++-
 src/backend/storage/lmgr/proc.c   | 49 ++-
 src/backend/utils/adt/lockfuncs.c |  8 +++-
 src/include/storage/buf_internals.h   |  2 +-
 src/include/storage/predicate_internals.h |  2 +-
 src/include/storage/proc.h|  8 +---
 9 files changed, 34 insertions(+), 72 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c7c928f50bd..2bbeaaf1b90 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7309,7 +7309,6 @@ StartupXLOG(void)
 		 */
 		if (ArchiveRecoveryRequested && IsUnderPostmaster)
 		{
-			PublishStartupProcessInformation();
 			EnableSyncRequestForwarding();
 			SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
 			bgwriterLaunched = true;
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index a299be10430..b9a83c0e9b9 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -16,6 +16,7 @@
 
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/proc.h"
 
 BufferDescPadded *BufferDescriptors;
 char	   *BufferBlocks;
@@ -118,7 +119,7 @@ InitBufferPool(void)
 			CLEAR_BUFFERTAG(buf->tag);
 
 			pg_atomic_init_u32(&buf->state, 0);
-			buf->wait_backend_pid = 0;
+			buf->wait_backend_pgprocno = INVALID_PGPROCNO;
 
 			buf->buf_id = i;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 86ef607ff38..26122418faf 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1890,11 +1890,11 @@ UnpinBuffer(BufferDesc *buf, bool fixOwner)
 BUF_STATE_GET_REFCOUNT(buf_state) == 1)
 			{
 /* we just released the last pin other than the waiter's */
-int			wait_backend_pid = buf->wait_backend_pid;
+int			wait_backend_pgprocno = buf->wait_backend_pgprocno;
 
 buf_state &= ~BM_PIN_COUNT_WAITER;
 UnlockBufHdr(buf, buf_state);
-ProcSendSignal(wait_backend_pid);
+ProcSendSignal(wait_backend_pgprocno);
 			}
 			else
 UnlockBufHdr(buf, buf_state);
@@ -3995,7 +3995,7 @@ UnlockBuffers(void)
 		 * got a cancel/die interrupt before getting the signal.
 		 */
 		if ((buf_state & BM_PIN_COUNT_WAITER) != 0 &&
-			buf->wait_backend_pid == MyProcPid)
+			buf->wait_backend_pgprocno == MyProc->pgprocno)
 			buf_state &= ~BM_PIN_COUNT_WAITER;
 
 		UnlockBufHdr(buf, buf_state);
@@ -4131,7 +4131,7 @@ LockBufferForCleanup(Buffer buffer)
 			LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			elog(ERROR, "multiple backends attempting to wait for pincount 1");
 		}
-		bufHdr->wait_backend_pid = MyProcPid;
+		bufHdr->wait_backend_pgprocno = MyProc->pgprocno;
 		PinCountWaitBuf = bufHdr;
 		buf_state |= BM_PIN_COUNT_WAITER;
 		UnlockBufHdr(bufHdr, buf_state);
@@ -4202,7 +4202,7 @@ LockBufferForCleanup(Buffer buffer)
 		 */
 		buf_state = LockBufHdr(bufHdr);
 		if ((buf_state & BM_PIN_COUNT_WAITER) != 0 &&
-			bufHdr->wait_backend_pid == MyProcPid)
+			bufHdr->wait_backend_pgprocno == MyProc->pgprocno)
 			buf_state &= ~BM_PIN_COUNT_WAITER;
 		UnlockBufHdr(bufHdr, buf_state);
 
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index d493aeef0fc..d76632bb56a 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1276,7 +1276,7 @@ InitPredicateLocks(void)
 		PredXact->OldCommittedSxact->finishedBefore = InvalidTransactionId;
 		PredXact->OldCommittedSxact->xmin = InvalidTransactionId;
 		PredXact->OldCommittedSxact->flags = SXACT_FLAG_COMMITTED;
-		PredXact->OldCommittedSxact->pid = 0;
+		PredXact->OldCommittedSxact->pgprocno = INVALID_PGPROCNO;
 	}
 	/* This never changes, so let's keep a local copy. */
 	OldCommittedSxact = PredXact->OldCommittedSxact;
@@ -1635,8 

Re: pg_stats and range statistics

2021-07-11 Thread Soumyadeep Chakraborty
Hello,

This should have been added with [1].

Excerpt from the documentation:
"pg_stats is also designed to present the information in a more readable
format than the underlying catalog — at the cost that its schema must
be extended whenever new slot types are defined for pg_statistic." [2]

So, I added a reminder in pg_statistic.h.

Attached is v2 of this patch with some cosmetic changes. Renamed the columns a
bit and updated the docs to be a bit more descriptive.
(range_length_empty_frac -> empty_range_frac, range_bounds_histogram ->
range_bounds_histograms)

One question:

We do have the option of representing the histogram of lower bounds separately
from the histogram of upper bounds, as two separate view columns. I don't know
if there is much utility in that though, and there is a fair bit of added
complexity: see below. Thoughts?

My attempts via SQL (unnest -> lower|upper -> array_agg) were futile given
unnest does not play nice with anyarray. For instance:

select unnest(stavalues1) from pg_statistic;
ERROR:  cannot determine element type of "anyarray" argument

Maybe the only option is to write a UDF pg_get_{lower|upper}_bounds_histogram
which can do something similar to what calc_hist_selectivity does:

/*
 * Convert histogram of ranges into histograms of its lower and upper
 * bounds.
 */
nhist = hslot.nvalues;
hist_lower = (RangeBound *) palloc(sizeof(RangeBound) * nhist);
hist_upper = (RangeBound *) palloc(sizeof(RangeBound) * nhist);
for (i = 0; i < nhist; i++)
{
bool empty;

range_deserialize(rng_typcache, DatumGetRangeTypeP(hslot.values[i]),
  &hist_lower[i], &hist_upper[i], &empty);
/* The histogram should not contain any empty ranges */
if (empty)
elog(ERROR, "bounds histogram contains an empty range");
}
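
The tail end of such a UDF could then look roughly like this, continuing
from the excerpt above (so i, nhist, hist_lower and rng_typcache are
assumed to be in scope; a sketch only, not existing code):

	TypeCacheEntry *elemtc = rng_typcache->rngelemtype;
	Datum	   *elems = (Datum *) palloc(sizeof(Datum) * nhist);
	ArrayType  *result;

	/* collect lower-bound datums; infinite bounds would need extra care */
	for (i = 0; i < nhist; i++)
		elems[i] = hist_lower[i].val;

	result = construct_array(elems, nhist, elemtc->type_id,
							 elemtc->typlen, elemtc->typbyval,
							 elemtc->typalign);
	PG_RETURN_ARRAYTYPE_P(result);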

Apart from that question, this is looking good and ready.

[1] 
https://github.com/postgres/postgres/commit/918eee0c497c88260a2e107318843c9b1947bc6f
[2] https://www.postgresql.org/docs/devel/view-pg-stats.html

Regards,
Soumyadeep (VMware)
From 1f2ed0911eb3f7a53b16047cefb499692daf5ef6 Mon Sep 17 00:00:00 2001
From: Egor Rogov 
Date: Sun, 11 Jul 2021 11:09:45 -0700
Subject: [PATCH v2 1/1] Display length and bounds histograms in pg_stats

Values corresponding to STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM and
STATISTIC_KIND_BOUNDS_HISTOGRAM were not exposed to pg_stats when these
slot kinds were introduced in 918eee0c49.

This commit adds the missing fields to pg_stats.

TODO: catalog version bump
---
 doc/src/sgml/catalogs.sgml   | 32 
 src/backend/catalog/system_views.sql | 23 +++-
 src/include/catalog/pg_statistic.h   |  3 +++
 src/test/regress/expected/rules.out  | 26 +-
 4 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index f517a7d4af..7168ff9a13 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -12877,6 +12877,38 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
non-null elements.  (Null for scalar types.)
   
  
+
+ <row>
+  <entry role="catalog_table_entry"><para role="column_definition">
+   <structfield>range_length_histogram</structfield> <type>anyarray</type>
+  </para>
+  <para>
+   A histogram of the lengths of non-empty and non-null range values of a
+   range type column. (Null for non-range types.)
+  </para></entry>
+ </row>
+
+ <row>
+  <entry role="catalog_table_entry"><para role="column_definition">
+   <structfield>empty_range_frac</structfield> <type>float4</type>
+  </para>
+  <para>
+   Fraction of column entries whose values are empty ranges.
+   (Null for non-range types.)
+  </para></entry>
+ </row>
+
+ <row>
+  <entry role="catalog_table_entry"><para role="column_definition">
+   <structfield>range_bounds_histograms</structfield> <type>anyarray</type>
+  </para>
+  <para>
+   Histograms of lower and upper bounds of non-empty, non-null ranges,
+   combined into a single array of range values. The lower and upper bounds
+   of each value correspond to the histograms of lower and upper bounds
+   respectively. (Null for non-range types.)
+  </para></entry>
+ </row>
 

   
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 999d984068..d8bc622ad5 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -243,7 +243,28 @@ CREATE VIEW pg_stats WITH (security_barrier) AS
 WHEN stakind3 = 5 THEN stanumbers3
 WHEN stakind4 = 5 THEN stanumbers4
 WHEN stakind5 = 5 THEN stanumbers5
-END AS elem_count_histogram
+END AS elem_count_histogram,
+CASE
+WHEN stakind1 = 6 THEN stavalues1
+WHEN stakind2 = 6 THEN stavalues2
+WHEN stakind3 = 6 THEN stavalues3
+WHEN stakind4 = 6 THEN stavalues4
+WHEN stakind5 = 6 THEN stavalues5
+END AS range_length_histogram,
+CASE
+WHEN stakind1 = 6 THEN stanumbers1[1]
+WHEN stakind2 = 6 THEN stanumbers2[1]
+WHEN stakind3 = 6 THEN stanumbers3[1]
+WHEN stakind4 = 6 THEN stanumbers4[1]
+WHEN stakind5 = 6 THEN stanumbers5[1]
+END AS empty_range_frac,
+CASE
+WHEN stakind1 = 7 THEN stavalues1
+WHEN stakind2 

Re: Unused function parameter in get_qual_from_partbound()

2021-07-10 Thread Soumyadeep Chakraborty
> Marking this as ready for committer. It can be committed when the branch
> is cut for 15.

I see that REL_14_STABLE is already cut. So this can go in now.




Re: Unused function parameter in get_qual_from_partbound()

2021-07-10 Thread Soumyadeep Chakraborty
Hello,

Googling around, I didn't find any extensions that would break from this
change. Even if there are any, this change will simplify the relevant
callsites. It also aligns the interface nicely with get_qual_for_hash,
get_qual_for_list and get_qual_for_range.

Marking this as ready for committer. It can be committed when the branch
is cut for 15.

Regards,
Soumyadeep (VMware)




Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-13 Thread Soumyadeep Chakraborty
Hello,

PFA version 2 of the TAP test. I removed the non-deterministic sleep
and introduced retries until the WAL segment is archived and promotion
is complete. Some additional tidying up too.

Regards,
Soumyadeep (VMware)
diff --git a/src/test/recovery/t/022_pitr_prepared_xact.pl b/src/test/recovery/t/022_pitr_prepared_xact.pl
new file mode 100644
index 000..8738fa39d4d
--- /dev/null
+++ b/src/test/recovery/t/022_pitr_prepared_xact.pl
@@ -0,0 +1,83 @@
+# Test for point-in-time-recovery (PITR) with prepared transactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use File::Compare;
+
+# Initialize and start primary node with WAL archiving
+my $node_primary = get_new_node('primary');
+$node_primary->init(has_archiving => 1);
+$node_primary->append_conf('postgresql.conf', "max_wal_senders = 10");
+$node_primary->append_conf('postgresql.conf', "wal_level = 'replica'");
+$node_primary->append_conf('postgresql.conf',
+	"max_prepared_transactions = 10");
+$node_primary->start;
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Initialize node for PITR targeting a specific restore point
+my $node_pitr = get_new_node('node_pitr');
+$node_pitr->init_from_backup(
+	$node_primary, $backup_name,
+	standby   => 0,
+	has_restoring => 1);
+$node_pitr->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_pitr->append_conf('postgresql.conf', "recovery_target_name = 'rp'");
+$node_pitr->append_conf('postgresql.conf',
+	"recovery_target_action = 'promote'");
+
+# Workload with a prepared transaction and the target restore point
+$node_primary->psql(
+	'postgres', qq{
+CREATE TABLE foo(i int);
+BEGIN;
+INSERT INTO foo VALUES(1);
+PREPARE TRANSACTION 'fooinsert';
+SELECT pg_create_restore_point('rp');
+INSERT INTO foo VALUES(2);
+});
+
+# Find next WAL segment to be archived
+my $walfile_to_be_archived = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn());");
+
+# Make WAL segment eligible for archival
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Wait until the WAL segment has been archived
+my $archive_wait_query =
+  "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;";
+$node_primary->poll_query_until('postgres', $archive_wait_query)
+  or die "Timed out while waiting for WAL segment to be archived";
+my $last_archived_wal_file = $walfile_to_be_archived;
+
+# Now start the PITR node
+$node_pitr->start;
+
+# Wait until the PITR node exits recovery
+$node_pitr->poll_query_until('postgres', "SELECT pg_is_in_recovery() = 'f';")
+  or die "Timed out while waiting for PITR promotion";
+
+# Ensure that we didn't write to the older timeline during PITR promotion by
+# checking that the last archived WAL segment was not overwritten during recovery
+my $archive_dir   = $node_primary->archive_dir;
+my $archive_wal_file_path = "$archive_dir/$last_archived_wal_file";
+my $node_pitr_data= $node_pitr->data_dir;
+my $local_wal_file_path   = "$node_pitr_data/pg_wal/$last_archived_wal_file";
+is(compare($archive_wal_file_path, $local_wal_file_path),
+	qq(0), "check if the last archived WAL file was overwritten");
+
+# Commit the prepared transaction in the latest timeline and check the result.
+# There should only be one row (from the prepared transaction). The row from
+# the INSERT after the restore point should not show up, since our recovery
+# target preceded said INSERT
+$node_pitr->psql(
+	'postgres', qq{
+COMMIT PREPARED 'fooinsert';
+});
+my $result = $node_pitr->safe_psql('postgres', "SELECT * FROM foo;");
+is($result, qq(1), "check table contents after COMMIT PREPARED");


Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-04 Thread Soumyadeep Chakraborty
Hey all,

I took a stab at a quick and dirty TAP test (my first ever). So it
can probably be improved a lot. Please take a look.

On Thu, Mar 04, 2021 at 10:28:31AM +0900, Kyotaro Horiguchi wrote:

> 2. Restore ThisTimeLineID after calling XLogReadRecord() in the
>   *caller* side.  This is what came up to me first but I don't like
>   this, too, but I don't find a better way.

+1 to this patch [1].
The above TAP test passes with this patch applied.

[1] 
https://www.postgresql.org/message-id/attachment/119972/dont_change_thistimelineid.patch

Regards,
Soumyadeep

Regards,
Soumyadeep
diff --git a/src/test/recovery/t/022_pitr_prepared_xact.pl b/src/test/recovery/t/022_pitr_prepared_xact.pl
new file mode 100644
index 000..b8d5146bb9d
--- /dev/null
+++ b/src/test/recovery/t/022_pitr_prepared_xact.pl
@@ -0,0 +1,69 @@
+# Test for point-in-time-recovery (PITR) with prepared transactions
+use strict;
+use warnings;
+use File::Path qw(rmtree);
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use File::Compare;
+use Time::HiRes qw(usleep);
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+$node_primary->init(has_archiving => 1);
+$node_primary->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_primary->append_conf('postgresql.conf', "max_wal_senders = 10");
+$node_primary->append_conf('postgresql.conf', "wal_level = 'replica'");
+$node_primary->start;
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+my $node_pitr = get_new_node('node_pitr');
+$node_pitr->init_from_backup($node_primary, $backup_name,
+standby => 0, has_restoring => 1);
+$node_pitr->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_pitr->append_conf('postgresql.conf', "recovery_target_name = 'rp'");
+$node_pitr->append_conf('postgresql.conf', "recovery_target_action = 'promote'");
+
+# Workload with prepared transaction
+$node_primary->psql(
+'postgres', qq{
+CREATE TABLE foo(i int);
+BEGIN;
+INSERT INTO foo VALUES(1);
+PREPARE TRANSACTION 'fooinsert';
+SELECT pg_create_restore_point('rp');
+INSERT INTO foo VALUES(2);
+SELECT pg_switch_wal();
+});
+
+# Sleep 5s to ensure that the WAL has been archived.
+# probably can be replaced by a wait
+usleep(5_000_000);
+
+$node_pitr->start;
+
+my $last_archived_wal_file_name = $node_primary->safe_psql('postgres',
+"SELECT last_archived_wal FROM pg_stat_archiver;");
+
+# Ensure that we don't write to the older timeline during PITR promotion by
+# ensuring that the last archived WAL file was not overwritten during recovery.
+my $archive_dir = $node_primary->archive_dir;
+my $archive_wal_file_path = "${archive_dir}/${last_archived_wal_file_name}";
+my $node_pitr_data = $node_pitr->data_dir;
+my $local_wal_file_path = "${node_pitr_data}/pg_wal/${last_archived_wal_file_name}";
+is(compare($archive_wal_file_path, $local_wal_file_path), qq(0), "Check if the last archived WAL file was overwritten");
+
+$node_pitr->psql(
+'postgres', qq{
+COMMIT PREPARED 'fooinsert';
+});
+
+my $foo_select_result = $node_pitr->safe_psql('postgres',
+"SELECT * FROM foo;");
+
+is($foo_select_result, qq(1), "Check foo select result");


Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-03 Thread Soumyadeep Chakraborty
On 2021/03/03 17:46, Heikki Linnakangas wrote:

> I think it should be reset even earlier, inside XlogReadTwoPhaseData()
> probably. With your patch, doesn't the LogStandbySnapshot() call just
> above where you're ressetting ThisTimeLineID also write a WAL record
> with incorrect timeline?

Agreed.

On Wed, Mar 3, 2021 at 1:04 AM Fujii Masao  wrote:

> > Even better, can we avoid setting ThisTimeLineID in XlogReadTwoPhaseData() 
> > in the first place?
>
>
>
> Or isn't it better to reset ThisTimeLineID in read_local_xlog_page(), i.e.,
> prevent read_local_xlog_page() from changing ThisTimeLineID? I'm not
> sure if that's possible, though.. In the future other functions that calls
> read_local_xlog_page() during the promotion may appear. Fixing the issue
> outside read_local_xlog_page() may cause those functions to get
> the same issue.

I agree. We should fix the issue in read_local_xlog_page(). I have
attached two different patches which do so:
saved_ThisTimeLineID.patch and pass_ThisTimeLineID.patch.

The former saves the value of the ThisTimeLineID before it gets changed
in read_local_xlog_page() and resets it after ThisTimeLineID has been
used later on in the code (by XLogReadDetermineTimeline()).
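
Roughly, the former amounts to this (a sketch of the idea, not the
attachment verbatim):

	TimeLineID	save_ThisTimeLineID = ThisTimeLineID;

	read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
	tli = ThisTimeLineID;

	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);

	/* undo the side effect of GetXLogReplayRecPtr() above */
	ThisTimeLineID = save_ThisTimeLineID;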

The latter removes occurrences of ThisTimeLineID from
XLogReadDetermineTimeline() and introduces an argument currTLI to
XLogReadDetermineTimeline() to be used in its stead.

Regards,
Soumyadeep
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index d17d660f460..a96fd236eeb 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -680,11 +680,12 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
  * The caller must also make sure it doesn't read past the current replay
  * position (using GetXLogReplayRecPtr) if executing in recovery, so it
  * doesn't fail to notice that the current timeline became historical. The
- * caller must also update ThisTimeLineID with the result of
+ * caller must also ensure that currTLI is updated with the result of
  * GetXLogReplayRecPtr and must check RecoveryInProgress().
  */
 void
-XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
+XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage,
+		  uint32 wantLength, TimeLineID currTLI)
 {
 	const XLogRecPtr lastReadPage = (state->seg.ws_segno *
 	 state->segcxt.ws_segsize + state->segoff);
@@ -712,12 +713,12 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
 	 * just carry on. (Seeking backwards requires a check to make sure the
 	 * older page isn't on a prior timeline).
 	 *
-	 * ThisTimeLineID might've become historical since we last looked, but the
+	 * currTLI might be historical, but the
 	 * caller is required not to read past the flush limit it saw at the time
 	 * it looked up the timeline. There's nothing we can do about it if
 	 * StartupXLOG() renames it to .partial concurrently.
 	 */
-	if (state->currTLI == ThisTimeLineID && wantPage >= lastReadPage)
+	if (state->currTLI == currTLI && wantPage >= lastReadPage)
 	{
 		Assert(state->currTLIValidUntil == InvalidXLogRecPtr);
 		return;
@@ -729,7 +730,7 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
 	 * the current segment we can just keep reading.
 	 */
 	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
-		state->currTLI != ThisTimeLineID &&
+		state->currTLI != currTLI &&
 		state->currTLI != 0 &&
 		((wantPage + wantLength) / state->segcxt.ws_segsize) <
 		(state->currTLIValidUntil / state->segcxt.ws_segsize))
@@ -752,7 +753,7 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
 		 * We need to re-read the timeline history in case it's been changed
 		 * by a promotion or replay from a cascaded replica.
 		 */
-		List	   *timelineHistory = readTimeLineHistory(ThisTimeLineID);
+		List	   *timelineHistory = readTimeLineHistory(currTLI);
 		XLogRecPtr	endOfSegment;
 
 		endOfSegment = ((wantPage / state->segcxt.ws_segsize) + 1) *
@@ -830,7 +831,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 {
 	XLogRecPtr	read_upto,
 loc;
-	TimeLineID	tli;
+	TimeLineID	tli = ThisTimeLineID;
 	int			count;
 	WALReadError errinfo;
 
@@ -850,8 +851,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		if (!RecoveryInProgress())
 			read_upto = GetFlushRecPtr();
 		else
-			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
-		tli = ThisTimeLineID;
+			read_upto = GetXLogReplayRecPtr(&tli);
 
 		/*
 		 * Check which timeline to get the record from.
@@ -877,9 +877,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		 * standby whose primary gets promoted while we're decoding, so a
 		 * one-off ERROR isn't too bad.
 		 */
-		XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+		XLogReadDetermineTimeline(state, targetPagePtr, reqLen, tli);
 
-		if 

PITR promote bug: Checkpointer writes to older timeline

2021-03-02 Thread Soumyadeep Chakraborty
Hello hackers,

We came across an issue where the checkpointer writes to the older
timeline while a promotion is ongoing after reaching the recovery point
in a PITR, when there are prepared transactions before the recovery
point. We came across this issue first in REL_12_STABLE and saw that it
also exists in devel.

Setup:
PFA a minimal reproduction script repro.txt.

After running the script, we notice that the checkpointer has written
the end-of-recovery shutdown checkpoint in the previous timeline (tli =
1), i.e. it wrote into the WAL segment 00010003 instead
of writing to the WAL segment 00020003, causing it to
overwrite WAL records past the recovery point (please see attached diff
output file waldiff.diff) in 00010003.

Also, performing a subsequent shutdown on the recovered server may cause
the next shutdown checkpoint record to be written, again, to the
previous timeline, i.e. to 00010003. A subsequent server
start will fail as the starup process will be unable to find the
checkpoint in the latest timeline (00020003) and we will
get:

...
LOG:  invalid record length at 0/3016FB8: wanted 24, got 0
LOG:  invalid primary checkpoint record
PANIC:  could not locate a valid checkpoint record
...

RCA:

When there are prepared transactions in an older timeline, in the
checkpointer, a call to CheckPointTwoPhase() and subsequently to
XlogReadTwoPhaseData() and subsequently to read_local_xlog_page() leads
to the following line:

read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);

GetXLogReplayRecPtr() will change ThisTimeLineID to 1, in order to read
the two phase WAL records in the older timeline. This variable will
remain unchanged and the checkpointer ends up writing the checkpoint
record into the older WAL segment (when XLogBeginInsert() is called
within CreateCheckPoint(), the value is still 1). The value is not
synchronized as even if RecoveryInProgress() is called,
xlogctl->SharedRecoveryState is not RECOVERY_STATE_DONE
(SharedRecoveryInProgress = true in older versions) as the startup
process waits for the checkpointer inside RequestCheckpoint() (since
recovery_target_action='promote' involves a non-fast promotion). Thus,
InitXLOGAccess() is not called and the value of ThisTimeLineID is not
updated before the checkpoint record write.

Since 1148e22a82e, GetXLogReplayRecPtr() is called with ThisTimeLineID
instead of a local variable, within read_local_xlog_page().

PFA a small patch that fixes the problem by explicitly calling
InitXLOGAccess() in CheckPointTwoPhase(), after the two phase state data
is read, in order to update ThisTimeLineID to the latest timeline. It is
okay to call InitXLOGAccess() as it is lightweight and would mostly be
a no-op.

Regards,
Soumyadeep, Kevin and Jimmy
VMWare
#! /bin/sh

# Primary setup

/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
echo "archive_mode = on
archive_command = 'cp %p /tmp/archive/%f'
max_prepared_transactions = 10" >> /usr/local/pgsql/data/postgresql.conf

/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l /tmp/logfile_p start

# Secondary setup

/usr/local/pgsql/bin/pg_basebackup -D /usr/local/pgsql/data_s -X stream && 
touch /usr/local/pgsql/data_s/recovery.signal
echo "port = 6432
restore_command = 'cp /tmp/archive/%f %p'
recovery_target_name = 'rp'
recovery_target_action = 'promote'
max_prepared_transactions = 20" >> /usr/local/pgsql/data_s/postgresql.conf

# Workload
echo "CREATE TABLE foo(i int);
BEGIN;
INSERT INTO foo VALUES(1);
PREPARE TRANSACTION 'fooinsert';
SELECT pg_create_restore_point('rp');
INSERT INTO foo VALUES(2);
SELECT pg_switch_wal();
" > /tmp/workload.sql
/usr/local/pgsql/bin/psql postgres -f /tmp/workload.sql

/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data_s -l /tmp/logfile_s start

sleep 5

/usr/local/pgsql/bin/pg_waldump /tmp/archive/00010003 > /tmp/archive_00010003
/usr/local/pgsql/bin/pg_waldump /usr/local/pgsql/data_s/pg_wal/00010003 > /tmp/standby_00010003

diff -u /tmp/archive_00010003 /tmp/standby_00010003 > /tmp/waldiff.diff
cat /tmp/waldiff.diff
From dbf5a9f6899bedf28b482fc03a4a2b0571e92e9b Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Tue, 2 Mar 2021 17:41:20 -0800
Subject: [PATCH 1/1] Prevent checkpointer from writing to older timeline

Co-authored-by: Kevin Yeap 
---
 src/backend/access/transam/twophase.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20d6cc..081aee6217e 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1740,6 +1740,16 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 	}
 	LWLockRelease(TwoPhaseStateLock);
 
+	if (serialized_xacts > 0)
+	{
+		/*
+		 * In order to ensure that we write the checkpoint record to the
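+		 * current timeline, update ThisTimeLineID by calling
+		 * InitXLOGAccess().  (The attachment is truncated here; this
+		 * completion is a sketch based on the description above, not
+		 * the verbatim patch.)
+		 */
+		InitXLOGAccess();
+	}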

Re: Table AM modifications to accept column projection lists

2020-12-31 Thread Soumyadeep Chakraborty
Hey Masahiko,

I added it to the Jan CF (https://commitfest.postgresql.org/31/2922/).

PFA a rebased version against latest head.

Regards,
Soumyadeep
From d033a0e3bceaf6e3f861e08363d4f170bc2a9fea Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 9 Nov 2020 16:36:10 -0800
Subject: [PATCH v2 1/1] tableam: accept column projection list

Co-authored-by: Soumyadeep Chakraborty 
Co-authored-by: Melanie Plageman 
Co-authored-by: Ashwin Agrawal 
Co-authored-by: Jacob Champion 
---
 src/backend/access/heap/heapam_handler.c |   5 +-
 src/backend/access/nbtree/nbtsort.c  |   3 +-
 src/backend/access/table/tableam.c   |   5 +-
 src/backend/commands/analyze.c   |   5 +-
 src/backend/commands/trigger.c   |  19 +++-
 src/backend/executor/execMain.c  |   3 +-
 src/backend/executor/execPartition.c |   2 +
 src/backend/executor/execReplication.c   |   6 +-
 src/backend/executor/execScan.c  | 108 +
 src/backend/executor/nodeIndexscan.c |  10 ++
 src/backend/executor/nodeLockRows.c  |   7 +-
 src/backend/executor/nodeModifyTable.c   |  38 ++--
 src/backend/executor/nodeSeqscan.c   |  43 -
 src/backend/executor/nodeTidscan.c   |  11 ++-
 src/backend/nodes/copyfuncs.c|   2 +
 src/backend/nodes/equalfuncs.c   |   2 +
 src/backend/nodes/outfuncs.c |   2 +
 src/backend/nodes/readfuncs.c|   2 +
 src/backend/optimizer/path/allpaths.c|  85 +++-
 src/backend/optimizer/plan/planner.c |  19 
 src/backend/optimizer/prep/preptlist.c   |  13 ++-
 src/backend/optimizer/util/inherit.c |  37 ++-
 src/backend/parser/analyze.c |  40 ++--
 src/backend/parser/parse_relation.c  |  18 
 src/backend/partitioning/partbounds.c|  14 ++-
 src/backend/rewrite/rewriteHandler.c |   8 ++
 src/include/access/tableam.h | 118 +--
 src/include/executor/executor.h  |   6 ++
 src/include/nodes/execnodes.h|   1 +
 src/include/nodes/parsenodes.h   |  14 +++
 src/include/utils/rel.h  |  14 +++
 31 files changed, 612 insertions(+), 48 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3eea215b85..c18641700b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -180,7 +180,8 @@ static bool
 heapam_fetch_row_version(Relation relation,
 		 ItemPointer tid,
 		 Snapshot snapshot,
-		 TupleTableSlot *slot)
+		 TupleTableSlot *slot,
+		 Bitmapset *project_cols)
 {
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 	Buffer		buffer;
@@ -348,7 +349,7 @@ static TM_Result
 heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
   TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
   LockWaitPolicy wait_policy, uint8 flags,
-  TM_FailureData *tmfd)
+  TM_FailureData *tmfd, Bitmapset *project_cols)
 {
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 	TM_Result	result;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8730de25ed..25ee10806b 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1967,7 +1967,8 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
 	scan = table_beginscan_parallel(btspool->heap,
-	ParallelTableScanFromBTShared(btshared));
+	ParallelTableScanFromBTShared(btshared),
+	NULL);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 	   true, progress, _bt_build_callback,
 	   (void *) , scan);
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 6438c45716..187e861b5d 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -172,7 +172,7 @@ table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
 }
 
 TableScanDesc
-table_beginscan_parallel(Relation relation, ParallelTableScanDesc parallel_scan)
+table_beginscan_parallel(Relation relation, ParallelTableScanDesc parallel_scan, Bitmapset *proj)
 {
 	Snapshot	snapshot;
 	uint32		flags = SO_TYPE_SEQSCAN |
@@ -194,6 +194,9 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc parallel_scan)
 		snapshot = SnapshotAny;
 	}
 
+	if (proj)
+		return relation->rd_tableam->scan_begin_with_column_projection(relation, snapshot, 0, NULL,
+			parallel_scan, flags, proj);
 	return relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
 			parallel_scan, flags);
 }
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..791451c765 100644
--- a/src/backend/commands/analyz

Re: Split copy.

2020-11-17 Thread Soumyadeep Chakraborty
On Tue, Nov 17, 2020 at 2:38 AM Heikki Linnakangas  wrote:
>
> Thanks for feedback, attached is a new patch version.
>
> On 11/11/2020 21:49, Soumyadeep Chakraborty wrote:
> > On Tue, Nov 3, 2020 at 1:30 AM Heikki Linnakangas  wrote:
> >> I also split/duplicated the CopyStateData struct into CopyFromStateData
> >> and CopyToStateData. Many of the fields were common, but many were not,
> >> and I think some duplication is nicer than a struct where you use some
> >> fields and others are unused. I put the common formatting options into a
> >> new CopyFormatOptions struct.
> >
> > Would we be better off if we sub-struct CopyState <- Copy{From,To}State?
> > Like this:
> > typedef struct Copy{From|To}StateData
> > {
> > CopyState cs;
> > // Fields specific to COPY FROM/TO follow..
> > }
>
> Hmm. I don't think that would be better. There isn't actually that much
> in common between CopyFromStateData and CopyToStateData, and a little
> bit of duplication seems better.
>

Fair.

> > 6.
> >
> >> /* create workspace for CopyReadAttributes results */
> >> if (!cstate->opts.binary)
> >
> > Can we replace this if with an else?
>
> Seems better as it is IMO, the if- and else-branches are not really
> related to each other, even though they both happen to be conditioned on
> cstate->opts.binary.

Fair.

> Fixed all the other things you listed, fixed a bug in setting
> 'file_encoding', and trimmed down the #includes.
>

Thanks! LGTM! Marking as Ready for Committer.

Regards,
Soumyadeep (VMware)




Table AM modifications to accept column projection lists

2020-11-13 Thread Soumyadeep Chakraborty
ction set
extraction logic with a table_scans_leverage_column_projection() call.
We wouldn't want a non-columnar AM to incur the overhead.

- Standardize the table AM API for passing columns.

- The optimization for DELETE RETURNING does not currently work for
views. We have to populate the list of columns for the base relation
beneath the view properly.

- Currently the benefit of passing in an empty projection set for ON
CONFLICT DO UPDATE (UPSERT) and ON CONFLICT DO NOTHING (see
ExecCheckTIDVisible()) is masked by a preceding call to
check_exclusion_or_unique_constraint() which has not yet been modified
to pass a column projection list to the index scan.

- Compute scanCols earlier than set_base_rel_sizes() and use that
information to produce better relation size estimates (relation size
will depend on the number of columns projected) in the planner.
Essentially, we need to absorb the work done by Pengzhou [2].

- Right now, we do not extract a set of columns for the call to
table_tuple_lock() within GetTupleForTrigger() as it may be hard to
determine the list of columns used in a trigger body [3].

- validateForeignKeyConstraint() should only need to fetch the
foreign key column.

- List of index scan callsites that will benefit from calling
index_fetch_set_column_projection():

  -- table_index_fetch_tuple_check() does not need to fetch any
  columns (we have to pass an empty column bitmap), fetching the tid
  should be enough.

  -- unique_key_recheck() performs a liveness check for which we do
  not need to fetch any columns (we have to pass an empty column
  bitmap)

  -- check_exclusion_or_unique_constraint() needs to only fetch the
  columns that are part of the exclusion or unique constraint.

  -- IndexNextWithReorder() needs to only fetch columns being
  projected along with columns in the index qual and columns in the
  ORDER BY clause.

  -- get_actual_variable_endpoint() only performs visibility checks,
  so we don't need to fetch any columns (we have to pass an empty
  column projection bitmap)

- BitmapHeapScans can benefit from a column projection list the same
way as an IndexScan and SeqScan can. We can possibly pass down scanCols
in ExecInitBitmapHeapScan(). We would have to modify the BitmapHeapScan
table AM calls to take a column projection bitmap.

- There may be more callsites where we can pass a column projection list.
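
Such callsites would look roughly like this under the patch's modified
table AM signatures (a sketch; the empty-bitmap convention and variable
names are assumptions, not committed API):

	Bitmapset  *no_cols = NULL;	/* empty projection: fetch no columns */
	bool		found;

	/* visibility-only check: the AM can skip all column data */
	found = table_tuple_fetch_row_version(rel, tid, snapshot, slot,
										  no_cols);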

Regards,

Soumyadeep & Jacob

[0] 
https://www.postgresql.org/message-id/CAE-ML%2B-HwY4X4uTzBesLhOotHF7rUvP2Ur-rvEpqz2PUgK4K3g%40mail.gmail.com
[1] 
https://www.postgresql.org/message-id/flat/CAAKRu_Yj%3DQ_ZxiGX%2BpgstNWMbUJApEJX-imvAEwryCk5SLUebg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAG4reAQc9vYdmQXh%3D1D789x8XJ%3DgEkV%2BE%2BfT9%2Bs9tOWDXX3L9Q%40mail.gmail.com
[3] https://www.postgresql.org/message-id/23194.1560618101%40sss.pgh.pa.us
From 9a77cbaa839f8b56cd9d93fdf1ddf418fdc8a763 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 9 Nov 2020 16:36:10 -0800
Subject: [PATCH] tableam: accept column projection list

Co-authored-by: Soumyadeep Chakraborty 
Co-authored-by: Melanie Plageman 
Co-authored-by: Ashwin Agrawal 
Co-authored-by: Jacob Champion 
---
 src/backend/access/heap/heapam_handler.c |   5 +-
 src/backend/access/nbtree/nbtsort.c  |   3 +-
 src/backend/access/table/tableam.c   |   5 +-
 src/backend/commands/analyze.c   |   5 +-
 src/backend/commands/copy.c  |  21 +++-
 src/backend/commands/trigger.c   |  19 +++-
 src/backend/executor/execMain.c  |   3 +-
 src/backend/executor/execPartition.c |   2 +
 src/backend/executor/execReplication.c   |   6 +-
 src/backend/executor/execScan.c  | 108 +
 src/backend/executor/nodeIndexscan.c |  10 ++
 src/backend/executor/nodeLockRows.c  |   7 +-
 src/backend/executor/nodeModifyTable.c   |  38 ++--
 src/backend/executor/nodeSeqscan.c   |  43 -
 src/backend/executor/nodeTidscan.c   |  11 ++-
 src/backend/nodes/copyfuncs.c|   2 +
 src/backend/nodes/equalfuncs.c   |   2 +
 src/backend/nodes/outfuncs.c |   2 +
 src/backend/nodes/readfuncs.c|   2 +
 src/backend/optimizer/path/allpaths.c|  85 +++-
 src/backend/optimizer/plan/planner.c |  19 
 src/backend/optimizer/prep/preptlist.c   |  13 ++-
 src/backend/optimizer/util/inherit.c |  37 ++-
 src/backend/parser/analyze.c |  40 ++--
 src/backend/parser/parse_relation.c  |  18 
 src/backend/partitioning/partbounds.c|  14 ++-
 src/backend/rewrite/rewriteHandler.c |   8 ++
 src/include/access/tableam.h | 118 +--
 src/include/executor/executor.h  |   6 ++
 src/include/nodes/execnodes.h|   1 +
 src/include/nodes/parsenodes.h   |  14 +++
 src/include/utils/rel.h  |  14 +++
 32 files changed, 632 insertions(+), 49 deletions(-)

diff --git a/src/backend/access/heap/heapam_handle

Re: Delay of standby shutdown

2020-11-12 Thread Soumyadeep Chakraborty
Thanks! Marking this as ready for committer.

Regards,
Soumyadeep (VMware)




Re: Split copy.c

2020-11-11 Thread Soumyadeep Chakraborty
Hey Heikki,

On Tue, Nov 3, 2020 at 1:30 AM Heikki Linnakangas  wrote:

Thanks for working on this refactor. LGTM! I had a few minor comments:

1. We should rename the CopyFormatOptions *cstate param in
ProcessCopyOptions() to CopyFormatOptions *options and List *options to
List *raw_options IMO, to make it more readable.

2. We need to update the header comment for Copy{From,To}StateData. It is
currently the old comment from CopyStateData.

3. Can we add docs for the missing fields in the header comment for
BeginCopyFrom()?

4.

> /*
>  * Working state for COPY TO/FROM
>  */
> MemoryContext copycontext; /* per-copy execution context */

Comment needs to be updated for the COPY operation.

5.

> I also split/duplicated the CopyStateData struct into CopyFromStateData
> and CopyToStateData. Many of the fields were common, but many were not,
> and I think some duplication is nicer than a struct where you use some
> fields and others are unused. I put the common formatting options into a
> new CopyFormatOptions struct.

Would we be better off if we sub-struct CopyState <- Copy{From,To}State?
Like this:
typedef struct Copy{From|To}StateData
{
CopyState cs;
// Fields specific to COPY FROM/TO follow..
}

6.

> /* create workspace for CopyReadAttributes results */
> if (!cstate->opts.binary)

Can we replace this if with an else?

Regards,
Soumyadeep (VMware)




Re: PATCH: Attempt to make dbsize a bit more consistent

2020-11-11 Thread Soumyadeep Chakraborty
Hey Georgios,

On Tue, Nov 10, 2020 at 6:20 AM  wrote:
>
> ‐‐‐ Original Message ‐‐‐
> On Monday, November 9, 2020 7:50 PM, Soumyadeep Chakraborty 
>  wrote:
>
> > Hey Georgios,
> >
> > Thanks for looking for more avenues to invoke tableAM APIS! Please find
> > my review below:
>
> A great review Soumyadeep, it is much appreciated.
> Please remember to add yourself as a reviewer in the commitfest
> [https://commitfest.postgresql.org/30/2701/]

Ah yes. Sorry, I forgot that!

> >
> > On Tue, Oct 13, 2020 at 6:28 AM gkokola...@pm.me wrote:
> >
> > 1.
> >
> > > /*
> > >
> > > -   -   heap size, including FSM and VM
> > >
> > > -   -   table size, including FSM and VM
> > > */
> > >
> >
> > We should not mention FSM and VM in dbsize.c at all as these are
> > heap AM specific. We can say:
> > table size, excluding TOAST relation
>
> Yeah, I was thinking that the notion that FSM and VM are still taken
> into account should be stated. We are iterating over ForkNumber
> after all.
>
> How about phrasing it as:
>
> + table size, including all implemented forks from the AM (e.g. FSM, VM)
> + excluding TOAST relations
>
> Thoughts?

Yes, I was thinking along the same lines. The concept of a "fork" forced
should not be forced down into the tableAM. But that is a discussion for
another day. We can probably say:

+ table size, including all implemented forks from the AM (e.g. FSM, VM
+ for the heap AM) excluding TOAST relations

> >
> > 2.
> >
> > > /*
> > >  * Size of toast relation
> > >  */
> > > if (OidIsValid(rel->rd_rel->reltoastrelid))
> > > - size += calculate_toast_table_size(rel->rd_rel->reltoastrelid);
> > > + {
> > > + Relation toastRel;
> > > +
> > > + toastRel = relation_open(rel->rd_rel->reltoastrelid, AccessShareLock);
> >
> > We can replace the OidIsValid check with relation_needs_toast_table()
> > and then have the OidIsValid() check in an Assert. Perhaps, that will
> > make things more readable.
>
> Please, allow me to kindly disagree.
>
> Relation is already open at this stage. Even create_toast_table(), the
> internal workhorse for creating toast relations, does check reltoastrelid
> to test if the relation is already toasted.
>
> Furthermore, we do call:
>
> + toastRel = relation_open(rel->rd_rel->reltoastrelid, AccessShareLock);
>
> and in order to avoid elog() errors underneath us, we ought to have
> verified the validity of reltoastrelid.
>
> In short, I think that the code in the proposal is not too unreadable
> nor that it breaks the coding patterns throughout the codebase.
>
> Am I too wrong?

No not at all. The code in the proposal is indeed adhering to the
codebase. What I was going for here was to increase the usage of
relation_needs_toast_table(). What I meant was:

if (table_relation_needs_toast_table(rel))
{
	if (!OidIsValid(rel->rd_rel->reltoastrelid))
	{
		elog(ERROR, "...");	/* expected TOAST table is missing */
	}
	/* size calculation here... */
}

We want to error out if the toast table can't be found and the relation
is expected to have one, which the existing code doesn't handle.


>
> >
> > 3.
> >
> > > + if (rel->rd_rel->relkind == RELKIND_RELATION ||
> > > + rel->rd_rel->relkind == RELKIND_TOASTVALUE ||
> > > + rel->rd_rel->relkind == RELKIND_MATVIEW)
> > > + size = calculate_table_size(rel);
> > > + else
> > > + {
> > > + relation_close(rel, AccessShareLock);
> > > + PG_RETURN_NULL();
> > > + }
> >
> > This leads to behavioral changes:
> >
> > I am talking about the case where one calls pg_table_size() on an index.
> > W/o your patch, it returns the size of the index. W/ your patch it
> > returns NULL. Judging by the documentation, this function should not
> > ideally apply to indexes but it does! I have a sinking feeling that lots
> > of users use it for this purpose, as there is no function to calculate
> > the size of a single specific index (except pg_relation_size()).
> > The same argument I have made above applies to sequences. Users may have
> > trial-and-errored their way into finding that pg_table_size() can tell
> > them the size of a specific index/sequence! I don't know how widespread
> > the use is in the user community, so IMO maybe we should be conservative
> > and not introduce this change? Alternatively, we could call out that
> > pg_table_size() is only for tab

Re: PATCH: Attempt to make dbsize a bit more consistent

2020-11-09 Thread Soumyadeep Chakraborty
Hey Georgios,

Thanks for looking for more avenues to invoke tableAM APIS! Please find
my review below:

On Tue, Oct 13, 2020 at 6:28 AM  wrote:

1.

>  /*
> - * heap size, including FSM and VM
> + * table size, including FSM and VM
>  */

We should not mention FSM and VM in dbsize.c at all as these are
heap AM specific. We can say:
table size, excluding TOAST relation

2.

>  /*
>  * Size of toast relation
>  */
>  if (OidIsValid(rel->rd_rel->reltoastrelid))
> - size += calculate_toast_table_size(rel->rd_rel->reltoastrelid);
> + {
> + Relation toastRel;
> +
> + toastRel = relation_open(rel->rd_rel->reltoastrelid, AccessShareLock);

We can replace the OidIsValid check with relation_needs_toast_table()
and then have the OidIsValid() check in an Assert. Perhaps, that will
make things more readable.

3.

> + if (rel->rd_rel->relkind == RELKIND_RELATION ||
> + rel->rd_rel->relkind == RELKIND_TOASTVALUE ||
> + rel->rd_rel->relkind == RELKIND_MATVIEW)
> + size = calculate_table_size(rel);
> + else
> + {
> + relation_close(rel, AccessShareLock);
> + PG_RETURN_NULL();
> + }

This leads to behavioral changes:

I am talking about the case where one calls pg_table_size() on an index.
W/o your patch, it returns the size of the index. W/ your patch it
returns NULL. Judging by the documentation, this function should not
ideally apply to indexes but it does! I have a sinking feeling that lots
of users use it for this purpose, as there is no function to calculate
the size of a single specific index (except pg_relation_size()).
The same argument I have made above applies to sequences. Users may have
trial-and-errored their way into finding that pg_table_size() can tell
them the size of a specific index/sequence! I don't know how widespread
the use is in the user community, so IMO maybe we should be conservative
and not introduce this change? Alternatively, we could call out that
pg_table_size() is only for tables by throwing an error if anything
other than a table is passed in.

If we decide to preserve the existing behavior of the pg_table_size():
It seems that for things not backed by the tableAM (indexes and
sequences), they should still go through calculate_relation_size().
We can call table_relation_size() based on if relkind is
RELKIND_RELATION, RELKIND_TOASTVALUE or RELKIND_MATVIEW. Perhaps it
might be worthwhile making a new macro RELKIND_HAS_TABLE_STORAGE to
capture these three cases (See RELKIND_HAS_STORAGE). This also ensures
that we return 0 for things that don't qualify as RELKIND_HAS_STORAGE,
such as a partitioned table (Currently w/ the patch applied, we return
NULL for those cases, which is another behavior change)
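
Something like this, mirroring RELKIND_HAS_STORAGE (a sketch of the
proposed macro, which does not exist yet):

#define RELKIND_HAS_TABLE_STORAGE(relkind) \
	((relkind) == RELKIND_RELATION || \
	 (relkind) == RELKIND_TOASTVALUE || \
	 (relkind) == RELKIND_MATVIEW)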

4.

> @@ -3776,10 +3776,24 @@ listTables(const char *tabtypes, const char *pattern, 
> bool verbose, bool showSys
>gettext_noop("Access Method"));
>
>  /*
> + * As of PostgreSQL 14, do not use pg_table_size() for indexes and
> + * sequences as it does not behave sanely for those.
> + *
>   * As of PostgreSQL 9.0, use pg_table_size() to show a more accurate
>   * size of a table, including FSM, VM and TOAST tables.
>   */
> -if (pset.sversion >= 9)
> +if (pset.sversion >= 14)
> +appendPQExpBuffer(&buf,
> +  ",\n CASE"
> +  " WHEN c.relkind in ("CppAsString2(RELKIND_INDEX)", "
> +  CppAsString2(RELKIND_PARTITIONED_INDEX)", "
> +  CppAsString2(RELKIND_SEQUENCE)") THEN"
> +  " pg_catalog.pg_size_pretty(pg_catalog.pg_relation_size(c.oid))"
> +  " ELSE"
> +  " pg_catalog.pg_size_pretty(pg_catalog.pg_table_size(c.oid))"
> +  " END as \"%s\"",
> +  gettext_noop("Size"));
> +else if (pset.sversion >= 9)
>  appendPQExpBuffer(&buf,
>",\n  pg_catalog.pg_size_pretty(pg_catalog.pg_table_size(c.oid)) as \"%s\"",
>gettext_noop("Size"));

Following on from point 3, if we decide to preserve the existing behavior,
we wouldn't need this change, as it would be internalized within
pg_table_size().


5.
> >
> > - return size;
> > + Assert(size < PG_INT64_MAX);
> > +
> > + return (int64)size;
> > I assume that this change, and the other ones like that, aim to handle 
> > int64
> > overflow? Using the extra legroom of uint64 can still lead to an 
> > overflow,
> > however theoretical it may be. Wouldn't it be better to check for 
> > overflow
> > before adding to size rather than after the fact? Something along the 
> > lines of
> > checking for headroom left:
> >
> > rel_size = table_relation_size(..);
> > if (rel_size > (PG_INT64_MAX - total_size))
> > < 

Re: Delay of standby shutdown

2020-11-03 Thread Soumyadeep Chakraborty
Hello Fujii,

On Thu, Sep 17, 2020 at 6:49 AM Fujii Masao  wrote:

As far as I understand after debugging, the problem is as follows:

1. After the primary is stopped, and the standby startup process is
waiting inside:

(void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT |
WL_EXIT_ON_PM_DEATH,
wait_time,
WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);

it receives SIGTERM on account of stopping the standby and it leads to
the WaitLatch call returning with WL_LATCH_SET.

2. WaitForWALToBecomeAvailable() then will return true after calling
XLogFileReadAnyTLI() and eventually, XLogReadRecord() will return NULL
since there is no new WAL to read, which means ReadRecord() will loop
back and perform another XLogReadRecord().

3. The additional XLogReadRecord() will lead to a 5s wait inside
WaitForWALToBecomeAvailable() - a different WaitLatch() call this time:

/*
 * Wait for more WAL to arrive. Time out after 5 seconds
 * to react to a trigger file promptly and to check if the
 * WAL receiver is still active.
 */
(void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT |
WL_EXIT_ON_PM_DEATH,
5000L, WAIT_EVENT_RECOVERY_WAL_STREAM);

4. And then eventually, the code will handle interrupts here inside
WaitForWALToBecomeAvailable() after the 5s wait:

/*
 * This possibly-long loop needs to handle interrupts of startup
 * process.
 */
HandleStartupProcInterrupts();

> To avoid this issue, I think that ReadRecord() should call
> HandleStartupProcInterrupts() whenever looping back to retry.

Alternatively, perhaps we can place it inside
WaitForWALToBecomeAvailable() (which already has a similar call),
since it is more semantically aligned to checking for interrupts, rather
than ReadRecord()? Like this:

diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index 52a67b117015..b05cf6c7c219 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12249,6 +12249,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,

 WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);
 ResetLatch(&XLogCtl->recoveryWakeupLatch);
 now = GetCurrentTimestamp();
+/*
+ * Check for interrupts
+ */
+HandleStartupProcInterrupts();
 }
 last_fail_time = now;
 currentSource = XLOG_FROM_ARCHIVE;

It also has the advantage of being in a slightly less "hot" code path
as compared to it being where you suggested (the location you suggested
is executed infinitely when a standby is not connected to a primary and
there is no more WAL to read)


Regards,

Soumyadeep and Georgios




Re: Add session statistics to pg_stat_database

2020-10-02 Thread Soumyadeep Chakraborty
On Tue, Sep 29, 2020 at 2:44 AM Laurenz Albe  wrote:


> > * Are we trying to capture ONLY client initiated disconnects in
> > m_aborted (we are not handling other disconnects by not accounting for
> > EOF..like if psql was killed)? If yes, why?
>
> I thought it was interesting to know how many database sessions are
> ended regularly as opposed to ones that get killed or end by unexpected
> client death.

It may very well be. It would also be interesting to find out how many
connections are still open on the database (something we could easily
glean if we had the number of all disconnects, client-initiated or
unnatural). Maybe we could have both?

m_sessions_disconnected;
m_sessions_killed;
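
With both, plus a hypothetical m_sessions counter of all session starts
(none of these are existing pgstat fields), the number of currently-open
sessions falls out as a difference, e.g.:

static int64
open_sessions(int64 m_sessions,
			  int64 m_sessions_disconnected,
			  int64 m_sessions_killed)
{
	/* started minus ended, whether the end was regular or not */
	return m_sessions - (m_sessions_disconnected + m_sessions_killed);
}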

>
>
> > *
> > > static TimestampTz pgStatActiveStart = DT_NOBEGIN;
> > > static PgStat_Counter pgStatActiveTime = 0;
> > > static TimestampTz pgStatTransactionIdleStart = DT_NOBEGIN;
> > > static PgStat_Counter pgStatTransactionIdleTime = 0;
> > > static bool pgStatSessionReported = false;
> > > bool pgStatSessionDisconnected = false;
> >
> > I think we can house all of these globals inside PgBackendStatus and can
> > follow the protocol for reading/writing fields in PgBackendStatus.
> > Refer: PGSTAT_{BEGIN|END}_WRITE_ACTIVITY
>
> Are you sure that is the right way to go?
>
> Correct me if I am wrong, but isn't PgBackendStatus for relevant status
> information that other processes can access?
> I'd assume that it is not the correct place to store backend-private data
> that are not relevant to others.
> Besides, if data is written to this structure more often, readers would
> have deal with more contention, which could affect performance.

You are absolutely right! PgBackendStatus is not the place for any of
these fields. We could place them in LocalPgBackendStatus perhaps. But
I don't feel too strongly about that now, having looked at similar fields
such as pgStatXactCommit, pgStatXactRollback etc. If we decide to stick
with the globals, let's isolate and decorate them with a comment such as
this example from the source:

/*
 * Updated by pgstat_count_buffer_*_time macros
 */
extern PgStat_Counter pgStatBlockReadTime;
extern PgStat_Counter pgStatBlockWriteTime;

> > pgStatSessionDisconnected is not required as it can be determined if a
> > session has been disconnected by looking at the force argument to
> > pgstat_report_stat() [unless we would want to distinguish between
> > client-initiated disconnects, which I am not sure why, as I have
> > brought up above].
>
> But wouldn't that mean that we count *every* end of a session as regular
> disconnection, even if the backend was killed?

See my comment above about client-initiated and unnatural disconnects.

>
> > * PGSTAT_FILE_FORMAT_ID needs to be updated when a stats collector data
> > structure changes and we had a change in PgStat_StatDBEntry.
>
> I think that should be left to the committer.

Fair.

> > * We would need to bump the catalog version since we have made
> > changes to system views. Refer: #define CATALOG_VERSION_NO
>
> Again, I think that's up to the committer.

Fair.


Regards,
Soumyadeep (VMware)




Re: avoid bitmapOR-ing indexes with scan condition inconsistent with partition constraint

2020-09-30 Thread Soumyadeep Chakraborty
Hi Justin,

Attached is a minimal patch that is rebased and removes the
dependency on Konstantin's earlier patch[1], making it enough to remove
the extraneous index scans as Justin brought up. Is this the direction we
want to head in?

Tagging Konstantin in the email to hear his input on his old patch.
Justin's attached patch [1] does not include the work that was done on
operator_predicate_proof(), and as discussed in [2], that work seems
important to see real benefits. Just wanted to check before
reviewing [1].

Regards,
Soumyadeep (VMware)

[1] 
https://www.postgresql.org/message-id/attachment/112074/0001-Secondary-index-access-optimizations.patch
[2] https://www.postgresql.org/message-id/5A006016.1010009%40postgrespro.ru
From c60d00a64c7b65d785cda1aad7e780388ab32e27 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 29 Sep 2020 17:09:45 -0700
Subject: [PATCH v3 1/1] Avoid index scan inconsistent with partition
 constraint

---
 src/backend/optimizer/path/allpaths.c  |  7 ++
 src/backend/optimizer/path/indxpath.c  |  5 +
 src/backend/optimizer/util/plancat.c   |  7 +-
 src/include/optimizer/plancat.h|  6 ++
 src/test/regress/expected/create_index.out | 25 ++
 src/test/regress/sql/create_index.sql  | 10 +
 6 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b399592ff815..a6a643b9091f 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1045,6 +1045,13 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 			continue;
 		}
 
+		/*
+		 * Ensure that rel->partition_qual is set, so we can use the information
+		 * to eliminate unnecessary index scans.
+		 */
+		if (childRTE->relid != InvalidOid)
+			get_relation_constraints(root, childRTE->relid, childrel, false, false, true);
+
 		/*
 		 * Constraint exclusion failed, so copy the parent's join quals and
 		 * targetlist to the child, with appropriate variable substitutions.
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index bcb1bc6097d0..0532b3ddd0d6 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -1305,6 +1305,11 @@ generate_bitmap_or_paths(PlannerInfo *root, RelOptInfo *rel,
 Assert(!restriction_is_or_clause(rinfo));
 orargs = list_make1(rinfo);
 
+/* Avoid scanning indexes using a scan condition which is
+ * inconsistent with the partition constraint */
+if (predicate_refuted_by(rel->partition_qual, orargs, false))
+	continue;
+
 indlist = build_paths_for_OR(root, rel,
 			 orargs,
 			 all_clauses);
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index f9d0d67aa75a..1e0bac471ff3 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -64,11 +64,6 @@ static void get_relation_foreign_keys(PlannerInfo *root, RelOptInfo *rel,
 	  Relation relation, bool inhparent);
 static bool infer_collation_opclass_match(InferenceElem *elem, Relation idxRel,
 		  List *idxExprs);
-static List *get_relation_constraints(PlannerInfo *root,
-	  Oid relationObjectId, RelOptInfo *rel,
-	  bool include_noinherit,
-	  bool include_notnull,
-	  bool include_partition);
 static List *build_index_tlist(PlannerInfo *root, IndexOptInfo *index,
 			   Relation heapRelation);
 static List *get_relation_statistics(RelOptInfo *rel, Relation relation);
@@ -1165,7 +1160,7 @@ get_relation_data_width(Oid relid, int32 *attr_widths)
  * run, and in many cases it won't be invoked at all, so there seems no
  * point in caching the data in RelOptInfo.
  */
-static List *
+List *
 get_relation_constraints(PlannerInfo *root,
 		 Oid relationObjectId, RelOptInfo *rel,
 		 bool include_noinherit,
diff --git a/src/include/optimizer/plancat.h b/src/include/optimizer/plancat.h
index c29a7091ec04..cadfee15a34a 100644
--- a/src/include/optimizer/plancat.h
+++ b/src/include/optimizer/plancat.h
@@ -39,6 +39,12 @@ extern int32 get_relation_data_width(Oid relid, int32 *attr_widths);
 extern bool relation_excluded_by_constraints(PlannerInfo *root,
 			 RelOptInfo *rel, RangeTblEntry *rte);
 
+extern List *get_relation_constraints(PlannerInfo *root,
+	  Oid relationObjectId, RelOptInfo *rel,
+	  bool include_noinherit,
+	  bool include_notnull,
+	  bool include_partition);
+
 extern List *build_physical_tlist(PlannerInfo *root, RelOptInfo *rel);
 
 extern bool has_unique_index(RelOptInfo *rel, AttrNumber attno);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6ace7662ee1f..3d67978232d3 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1843,6 

Re: Refactor pg_rewind code and make it work against a standby

2020-09-24 Thread Soumyadeep Chakraborty
On Thu, Sep 24, 2020 at 10:27 AM Heikki Linnakangas  wrote:
> >> /*
> >>   * If this is a relation file, copy the modified blocks.
> >>   *
> >>   * This is in addition to any other changes.
> >>   */
> >> iter = datapagemap_iterate(&entry->target_modified_pages);
> >> while (datapagemap_next(iter, &blkno))
> >> {
> >> offset = blkno * BLCKSZ;
> >>
> >> source->queue_fetch_range(source, entry->path, offset, BLCKSZ);
> >> }
> >> pg_free(iter);
> >
> > Can we put this hunk into a static function overwrite_pages()?
>
> Meh, it's only about 10 lines of code, and one caller.

Fair.

>
> > 7. Please address the FIXME for the symlink case:
> > /* FIXME: Check if it points to the same target? */
>
> It's not a new issue. Would be nice to fix, of course. I'm not sure what
> the right thing to do would be. If you have e.g. replaced
> postgresql.conf with a symlink that points outside the data directory,
> would it be appropriate to overwrite it? Or perhaps we should throw an
> error? We also throw an error if a file is a symlink in the source but a
> regular file in the target, or vice versa.
>

Hmm, I can imagine a use case for 2 different symlink targets on the
source and target clusters. For example the primary's pg_wal directory
can have a different symlink target as compared to a standby's
(different mount points on the same network maybe?). An end user might
not desire pg_rewind meddling with that setup or may desire pg_rewind to
treat the source as a source-of-truth with respect to this as well and
would want pg_rewind to overwrite the target's symlink. Maybe doing a
check and emitting a warning with hint/detail is prudent here while
taking no action.


> > 8.
> >
> > * it anyway. But there's no harm in copying it now.)
> >
> > and
> >
> > * copy them here. But we don't know which scenario we're
> > * dealing with, and there's no harm in copying the missing
> > * blocks now, so do it now.
> >
> > Could you add a line or two explaining why there is "no harm" in these
> > two hunks above?
>
> The previous sentences explain that there's a WAL record covering them.
> So they will be overwritten by WAL replay anyway. Does it need more
> explanation?

Yeah you are right, that is reason enough.

> > 14. queue_overwrite_range(), finish_overwrite() instead of
> > queue_fetch_range(), finish_fetch()? Similarly update
> > *_fetch_file_range() and *_finish_fetch()
> >
> >
> > 15. Let's have local_source.c and libpq_source.c instead of *_fetch.c
>
> Good idea! And fetch.h -> rewind_source.h.

+1. You might have missed the changes to rename "fetch" -> "overwrite"
as was mentioned in 14.

>
> I also moved the code in execute_file_actions() function to pg_rewind.c,
> into a new function: perform_rewind(). It felt a bit silly to have just
> execute_file_actions() in a file of its own. perform_rewind() now does
> all the modifications to the data directory, writing the backup file.
> Except for writing the recovery config: that also needs to be when
> there's no rewind to do, so it makes sense to keep it separate. What do
> you think?

I don't mind inlining that functionality into perform_rewind(). +1.
Heads up: The function declaration for execute_file_actions() is still
there in rewind_source.h.

> > 16.
> >
> >> conn = PQconnectdb(connstr_source);
> >>
> >> if (PQstatus(conn) == CONNECTION_BAD)
> >> pg_fatal("could not connect to server: %s",
> >> PQerrorMessage(conn));
> >>
> >> if (showprogress)
> >> pg_log_info("connected to server");
> >
> >
> > The above hunk should be part of init_libpq_source(). Consequently,
> > init_libpq_source() should take a connection string instead of a conn.
>
> The libpq connection is also needed by WriteRecoveryConfig(), that's why
> it's not fully encapsulated in libpq_source.

Ah. I find it pretty weird that we need to specify --source-server to
have write-recovery-conf work. From the code, we only need the conn
for calling PQserverVersion(), something we could easily get by slurping
pg_controldata on the source side. Maybe we can remove this limitation?

Regards,
Soumyadeep (VMware)




Re: Add session statistics to pg_stat_database

2020-09-24 Thread Soumyadeep Chakraborty
Hello Laurenz,

Thanks for submitting this! Please find my feedback below.

* Are we trying to capture ONLY client-initiated disconnects in
m_aborted (by not accounting for EOF, we are not handling other
disconnects, e.g. if psql was killed)? If yes, why?

* pgstat_send_connstats(): How about renaming the "force" argument to
"disconnected"?

*
> static TimestampTz pgStatActiveStart = DT_NOBEGIN;
> static PgStat_Counter pgStatActiveTime = 0;
> static TimestampTz pgStatTransactionIdleStart = DT_NOBEGIN;
> static PgStat_Counter pgStatTransactionIdleTime = 0;
> static bool pgStatSessionReported = false;
> bool pgStatSessionDisconnected = false;

I think we can house all of these globals inside PgBackendStatus and can
follow the protocol for reading/writing fields in PgBackendStatus.
Refer: PGSTAT_{BEGIN|END}_WRITE_ACTIVITY
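
For reference, writes under that protocol would look roughly like this
(a sketch modeled on existing callers of these macros; beentry, newstate
and now are illustrative variables):

PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
beentry->st_state = newstate;
beentry->st_state_start_timestamp = now;
PGSTAT_END_WRITE_ACTIVITY(beentry);

The macros bump st_changecount around the stores, so readers can detect
and retry torn reads without taking a lock.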

Also, some of these fields are not required:

I don't think we need pgStatActiveStart and pgStatTransactionIdleStart -
instead of these two we could use
PgBackendStatus.st_state_start_timestamp which marks the beginning TS of
the backend's current state (st_state). We can look at that field along
with the current and to-be-transitioned-to states inside
pgstat_report_activity() when there is a transition away from
STATE_RUNNING, STATE_IDLEINTRANSACTION or
STATE_IDLEINTRANSACTION_ABORTED, in order to update pgStatActiveTime and
pgStatTransactionIdleTime. We would also need to update those counters
on disconnect/PGSTAT_STAT_INTERVAL timeout if the backend's current
state was STATE_RUNNING, STATE_IDLEINTRANSACTION or
STATE_IDLEINTRANSACTION_ABORTED (in pgstat_send_connstats())

pgStatSessionDisconnected is not required, as whether a session has
been disconnected can be determined by looking at the force argument
to pgstat_report_stat() [unless we want to distinguish client-initiated
disconnects, which I am not sure we need, as I have brought up above].

pgStatSessionReported is not required. We can glean this information by
checking if the function local static last_report in
pgstat_report_stat() is 0 and passing this on as another param
"first_report" to pgstat_send_connstats().


* PGSTAT_FILE_FORMAT_ID needs to be updated when a stats collector data
structure changes and we had a change in PgStat_StatDBEntry.

* We can directly use PgBackendStatus.st_proc_start_timestamp for
calculating m_session_time. We can also choose to report session uptime
even when the report is for the not-disconnect case
(PGSTAT_STAT_INTERVAL elapsed). No reason why not. Then we would need to
pass the value of last_report to pgstat_send_connstats() and calculate
m_session_time as the number of time units from
PgBackendStatus.st_proc_start_timestamp for the first report, and as the
number of time units from last_report for all subsequent reports.

* We would need to bump the catalog version since we have made
changes to system views. Refer: #define CATALOG_VERSION_NO


Regards,
Soumyadeep (VMware)




Re: Refactor pg_rewind code and make it work against a standby

2020-09-20 Thread Soumyadeep Chakraborty
Hey Heikki,

Thanks for refactoring and making the code much easier to read!

Before getting into the code review for the patch, I wanted to know why
we don't use a Bitmapset for target_modified_pages?

Code review:


1. We need to update the comments for process_source_file and
process_target_file. We don't decide the action on the file until later.


2. Rename target_modified_pages to target_pages_to_overwrite?
target_modified_pages can lead to confusion as to whether it includes pages
that were modified on the target but not even present in the source and
the other exclusionary cases. target_pages_to_overwrite is much clearer.


3.

> /*
>  * If this is a relation file, copy the modified blocks.
>  *
>  * This is in addition to any other changes.
>  */
> iter = datapagemap_iterate(&entry->target_modified_pages);
> while (datapagemap_next(iter, &blkno))
> {
> offset = blkno * BLCKSZ;
>
> source->queue_fetch_range(source, entry->path, offset, BLCKSZ);
> }
> pg_free(iter);

Can we put this hunk into a static function overwrite_pages()?


4.

> * block that have changed in the target system.  It makes note of all the
> * changed blocks in the pagemap of the file.

Can we replace the above with:

> * block that has changed in the target system.  It decides if the given
blkno in the target relfile needs to be overwritten from the source.


5.

> /*
>  * Doesn't exist in either server. Why does it have an entry in the
>  * first place??
>  */
> return FILE_ACTION_NONE;

Can we delete the above hunk and add the following assert to the very
top of decide_file_action():

Assert(entry->target_exists || entry->source_exists);


6.

> pg_fatal("unexpected page modification for directory or symbolic link \"%s\"",
> entry->path);

Can we replace above with:

pg_fatal("unexpected page modification for non-regular file \"%s\"",
entry->path);

This way we can address the undefined file type.


7. Please address the FIXME for the symlink case:
/* FIXME: Check if it points to the same target? */


8.

* it anyway. But there's no harm in copying it now.)

and

* copy them here. But we don't know which scenario we're
* dealing with, and there's no harm in copying the missing
* blocks now, so do it now.

Could you add a line or two explaining why there is "no harm" in these
two hunks above?


9. Can we add pg_control, /pgsql_tmp/... and .../pgsql_tmp.* and PG_VERSION
files to check_file_excluded()?


10.

- * block that have changed in the target system. It makes note of all the
+ * block that have changed in the target system.  It makes note of all the

Whitespace typo


11.

> * If the block is beyond the EOF in the source system, or the file doesn't
> * doesn'exist

Typo: Two doesnt's


12.

> /*
>  * This represents the final decisions on what to do with each file.
>  * 'actions' array contains an entry for each file, sorted in the order
> * that their actions should be executed.
>  */
> typedef struct filemap_t
> {
> /* Summary information, filled by calculate_totals() */
> uint64 total_size; /* total size of the source cluster */
> uint64 fetch_size; /* number of bytes that needs to be copied */
>
> int nactions; /* size of 'actions' array */
> file_entry_t *actions[FLEXIBLE_ARRAY_MEMBER];
> } filemap_t;

Replace nactions/actions with nentries/entries; that is clearer in
intent, as it is easier to reconcile the modified pages stuff to an
entry rather than an action. It could be:

/*
 * This contains the final decisions on what to do with each file.
 * 'entries' array contains an entry for each file, sorted in the order
 * that their actions should be executed.
 */
typedef struct filemap_t
{
/* Summary information, filled by calculate_totals() */
uint64 total_size; /* total size of the source cluster */
uint64 fetch_size; /* number of bytes that needs to be copied */
int nentries; /* size of 'entries' array */
file_entry_t *entries[FLEXIBLE_ARRAY_MEMBER];
} filemap_t;


13.

> filehash = filehash_create(1000, NULL);

Use a constant (FILEMAP_INITIAL_SIZE) for the 1000:
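
i.e., something along these lines (sketch):

#define FILEMAP_INITIAL_SIZE	1000

filehash = filehash_create(FILEMAP_INITIAL_SIZE, NULL);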


14. queue_overwrite_range(), finish_overwrite() instead of
queue_fetch_range(), finish_fetch()? Similarly update
*_fetch_file_range() and *_finish_fetch()


15. Let's have local_source.c and libpq_source.c instead of *_fetch.c


16.

> conn = PQconnectdb(connstr_source);
>
> if (PQstatus(conn) == CONNECTION_BAD)
> pg_fatal("could not connect to server: %s",
> PQerrorMessage(conn));
>
> if (showprogress)
> pg_log_info("connected to server");


The above hunk should be part of init_libpq_source(). Consequently,
init_libpq_source() should take a connection string instead of a conn.


17.

> if (conn)
> {
> PQfinish(conn);
> conn = NULL;
> }

The hunk above should be part of libpq_destroy()


18.

> /*
>  * Files are fetched max CHUNK_SIZE bytes at a time, and with a
>  * maximum of MAX_CHUNKS_PER_QUERY chunks in a single query.
>  */
> #define CHUNK_SIZE (1024 * 1024)

Can we rename CHUNK_SIZE to MAX_CHUNK_SIZE and update the comment?


19.

> typedef struct
> {
> const char 

Re: Adding Support for Copy callback functionality on COPY TO api

2020-09-14 Thread Soumyadeep Chakraborty
Hi Bilva,

Thank you for registering this patch!

I had a few suggestions:

1. Please run pg_indent[1] on your code. Make sure you add
copy_data_destination_cb to src/tools/pgindent/typedefs.list. Please
run pg_indent on only the files you changed (it will take files as
command line args)

2. For features such as this, it is often helpful to find a use case
within backend/utility/extension code that demonstrates the callback and
to include the code to exercise it with the patch. Refer to how
copy_read_data() is used as copy_data_source_cb, to copy the data from
the query results from the WAL receiver (Refer: copy_table()). Finding
a similar use case in the source tree will make a stronger case
for this patch.

3. Wouldn't we want to return the number of bytes written from
copy_data_destination_cb? (Similar to copy_data_source_cb) We should
also think about how to represent failure. Looking at CopySendEndOfRow(),
we should error out like we do for the other copy_dests after checking the
return value for the callback invocation.
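
To sketch what I mean (the typedef mirrors the existing
copy_data_source_cb; the COPY_CALLBACK case and the exact shape of the
check are hypothetical, just to illustrate the return convention):

/* hypothetical: returns the number of bytes written, or -1 on error */
typedef int (*copy_data_destination_cb) (void *data, int len);

/* in CopySendEndOfRow(), alongside the other copy_dest cases (sketch) */
case COPY_CALLBACK:
	if (cstate->data_destination_cb(fe_msgbuf->data, fe_msgbuf->len) !=
		fe_msgbuf->len)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not write to COPY destination callback")));
	break;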

4.
> bool pipe = (cstate->filename == NULL) && (cstate->data_destination_cb == NULL);

I think a similar change should also be applied to BeginCopyFrom() and
CopyFrom(). Or even better, such code could be refactored to have a
separate destination type COPY_PIPE. This of course, will be a separate
patch. I think the above line is okay for this patch.

Regards,
Soumyadeep




Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-24 Thread Soumyadeep Chakraborty
On Fri, Jul 24, 2020 at 7:34 AM Robert Haas  wrote:

>
> On Thu, Jul 23, 2020 at 12:11 PM Soumyadeep Chakraborty
>  wrote:
> > In the read-only level I was suggesting, I wasn't suggesting that we
> > stop WAL flushes, in fact we should flush the WAL before we mark the
> > system as read-only. Once the system declares itself as read-only, it
> > will not perform any more on-disk changes; It may perform all the
> > flushes it needs as a part of the read-only request handling.
>
> I think that's already how the patch works, or at least how it should
> work. You stop new writes, flush any existing WAL, and then declare
> the system read-only. That can all be done quickly.
>

True, except for the fact that it allows dirty buffers to be flushed
after the ALTER command returns.

> > What I am saying is it doesn't have to be just the queries. I think we
> > can cater to all the other use cases simply by forcing a checkpoint
> > before marking the system as read-only.
>
> But that part can't, which means that if we did that, it would break
> the feature for the originally intended use case. I'm not on board
> with that.
>

Referring to the options you presented in [1]:
I am saying that we should allow for both: with a checkpoint (#2) (can
also be #3) and without a checkpoint (#1) before having the ALTER
command return, by having different levels of read-onlyness.

We should have syntax variants for these. The syntax should not be an
ALTER SYSTEM SET as you have pointed out before. Perhaps:

ALTER SYSTEM READ ONLY; -- #2 or #3
ALTER SYSTEM READ ONLY WAL; -- #1
ALTER SYSTEM READ WRITE;

or even:

ALTER SYSTEM FREEZE; -- #2 or #3
ALTER SYSTEM FREEZE WAL; -- #1
ALTER SYSTEM UNFREEZE;

Regards,
Soumyadeep (VMware)

[1] 
http://postgr.es/m/ca+tgmoz-c3dz9qwhwmm4bc36n4u0xz2oyenewmf+bwokbyd...@mail.gmail.com




Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-24 Thread Soumyadeep Chakraborty
On Fri, Jul 24, 2020 at 7:32 AM Robert Haas  wrote:
>
> On Wed, Jul 22, 2020 at 6:03 PM Soumyadeep Chakraborty
>  wrote:
> > So if we are not going to address those cases, we should change the
> > syntax and remove the notion of read-only. It could be:
> >
> > ALTER SYSTEM SET wal_writes TO off|on;
> > or
> > ALTER SYSTEM SET prohibit_wal TO off|on;
>
> This doesn't really work because of the considerations mentioned in
> http://postgr.es/m/ca+tgmoakctzozr0xeqalfimbcje2rgcbazf4eybpxjtnetp...@mail.gmail.com

Ah yes. We should then have ALTER SYSTEM WAL {PERMIT|PROHIBIT}. I don't
think we should say "READ ONLY" if we still allow on-disk file changes
after the ALTER SYSTEM command returns (courtesy of dirty buffer flushes)
because it does introduce confusion, especially to an audience not privy
to this thread. When people hear "read-only" they may think of static on-disk
files immediately.

> Contrary to what you write, I don't think either #2 or #3 is
> sufficient to enable checksums, at least not without some more
> engineering, because the server would cache the state from the control
> file, and a bunch of blocks from the database. I guess it would work
> if you did a server restart afterward, but I think there are better
> ways of supporting online checksum enabling that don't require
> shutting down the server, or even making it read-only; and there's
> been significant work done on those already.

Agreed. As you mentioned, if we did do #2 or #3, we would be able to do
pg_checksums on a server that was shut down or that had crashed while it
was in a read-only state, which is what Michael was asking for in [1]. I
think it's just cleaner if we allow for this.

I don't have enough context to enumerate the use cases (not covered by
any existing utilities) that would come with an assurance that the
cluster's files are frozen, but surely there are some? Like the
possibility of pg_upgrade on a running server while it entertains
read-only queries. Surely, that's a nice one!

Of course, some or all of these utilities would need to be taught about
read-only mode.

Regards,
Soumyadeep

[1] http://postgr.es/m/20200626095921.gf1...@paquier.xyz




Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-24 Thread Soumyadeep Chakraborty
On Thu, Jul 23, 2020 at 10:14 PM Amul Sul  wrote:
>
> On Fri, Jul 24, 2020 at 6:28 AM Soumyadeep Chakraborty 
>  wrote:
> > In case it is necessary, the patch set does not wait for the checkpoint to
> > complete before marking the system as read-write. Refer:
> >
> > /* Set final state by clearing in-progress flag bit */
> > if (SetWALProhibitState(wal_state &
> > ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
> > {
> >   if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
> > ereport(LOG, (errmsg("system is now read only")));
> >   else
> >   {
> > /* Request checkpoint */
> > RequestCheckpoint(CHECKPOINT_IMMEDIATE);
> > ereport(LOG, (errmsg("system is now read write")));
> >   }
> > }
> >
> > We should RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) before
> > we SetWALProhibitState() and do the ereport(), if we have a read-write
> > state change request.
> >
> +1, I too have the same question.
>
>
>
> FWIW, I don't think we can request CHECKPOINT_WAIT in this place;
> otherwise, I think it will be a deadlock case -- the checkpointer
> process waiting for itself.

We should really just call CreateCheckPoint() here instead of
RequestCheckpoint().

> > 3. Some of the functions that were added such as GetWALProhibitState(),
> > IsWALProhibited() etc could be declared static inline.
> >
> IsWALProhibited() can be static but not GetWALProhibitState(), since it
> needs to be accessible from other files.

If you place a static inline function in a header file, it will be
accessible from other files. E.g. pg_atomic_* functions.
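
For example, a minimal sketch of how these could live in a header
(assuming the state sits in shared memory behind a pointer; the holder
name is illustrative):

/* in a header, e.g. walprohibit.h (sketch) */
extern pg_atomic_uint32 *WALProhibitStatePtr;

static inline uint32
GetWALProhibitState(void)
{
	return pg_atomic_read_u32(WALProhibitStatePtr);
}

static inline bool
IsWALProhibited(void)
{
	uint32		state = GetWALProhibitState();

	return (state & WALPROHIBIT_STATE_READ_ONLY) != 0 &&
		(state & WALPROHIBIT_TRANSITION_IN_PROGRESS) == 0;
}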

Regards,
Soumyadeep




Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-23 Thread Soumyadeep Chakraborty
On Thu, Jun 18, 2020 at 7:54 AM Robert Haas  wrote:
> I think we'd want the FIRST write operation to be the end-of-recovery
> checkpoint, before the system is fully read-write. And then after that
> completes you could do other things.

I can't see why this is necessary from a correctness or performance
point of view. Maybe I'm missing something.

In case it is necessary, the patch set does not wait for the checkpoint to
complete before marking the system as read-write. Refer:

/* Set final state by clearing in-progress flag bit */
if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
{
  if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
ereport(LOG, (errmsg("system is now read only")));
  else
  {
/* Request checkpoint */
RequestCheckpoint(CHECKPOINT_IMMEDIATE);
ereport(LOG, (errmsg("system is now read write")));
  }
}

We should RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) before
we SetWALProhibitState() and do the ereport(), if we have a read-write
state change request.

Also, we currently request this checkpoint even if there was no startup
recovery and we don't set CHECKPOINT_END_OF_RECOVERY in the case where
the read-write request does follow a startup recovery.
So it should really be:
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
CHECKPOINT_END_OF_RECOVERY);
We would need to convey that an end-of-recovery-checkpoint is pending in
shmem somehow (and only if one such checkpoint is pending, should we do
it as a part of the read-write request handling).
Maybe we can set CHECKPOINT_END_OF_RECOVERY in ckpt_flags where we do:
/*
 * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
 */
and then check for that.

Some minor comments about the code (some of them probably don't
warrant immediate attention, but for the record...):

1. There are some places where we can use a local variable to store the
result of RelationNeedsWAL() to avoid repeated calls to it. E.g.
brin_doupdate()

2. Similarly, we can also capture the calls to GetWALProhibitState() in
a local variable where applicable. E.g. inside WALProhibitRequest().

3. Some of the functions that were added such as GetWALProhibitState(),
IsWALProhibited() etc could be declared static inline.

4. IsWALProhibited(): Shouldn't it really be:
bool
IsWALProhibited(void)
{
  uint32 walProhibitState = GetWALProhibitState();
  return (walProhibitState & WALPROHIBIT_STATE_READ_ONLY) != 0
&& (walProhibitState & WALPROHIBIT_TRANSITION_IN_PROGRESS) == 0;
}

5. I think the comments:
/* Must be performing an INSERT or UPDATE, so we'll have an XID */
and
/* Can reach here from VACUUM, so need not have an XID */
can be internalized in the function/macro comment header.

6. Typo: ConditionVariable readonly_cv; /* signaled when ckpt_started
advances */
We need to update the comment here.

Regards,
Soumyadeep (VMware)




Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-23 Thread Soumyadeep Chakraborty
On Thu, Jul 23, 2020 at 3:57 AM Amul Sul  wrote:

> Well, once we've initiated the change to a read-only state, we probably
> want to always either finish that change or go back to read-write, even
> if the process that initiated the change is interrupted. Leaving the
> system in a half-way-in-between state long term seems bad. Maybe we
> could have used a new background process, but we chose to put the
> checkpointer in charge of making the state change, to avoid a new
> background process and keep the first version of the patch simple.  The
> checkpointer isn't likely to get killed, but if it does, it will be
> relaunched and the new one can clean things up. On the other hand, I
> agree that making the checkpointer responsible for more than one thing
> might not be a good idea, but I don't think the postmaster should do
> the work that any background process can do.

+1 for doing it in a background process rather than in the backend
itself (as we can't risk doing it in a backend as it can crash and won't
restart and clean up as a background process would).

As my co-worker pointed out to me, doing the work in the postmaster is a
very bad idea as we don't want delays in serving connection requests on
account of the barrier that comes with this patch.

I would like to see this responsibility in a separate auxiliary process
but I guess having it in the checkpointer isn't the end of the world.

Regards,
Soumyadeep (VMware)




Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-23 Thread Soumyadeep Chakraborty
On Thu, Jul 23, 2020 at 3:42 AM Amul Sul  wrote:
> The aim of this feature is preventing new WAL records from being generated, 
> not
> preventing them from being flushed to disk, or streamed to standbys, or 
> anything
> else. The rest should happen as normal.
>
> If you can't flush WAL, then you might not be able to evict some number of
> buffers, which in the worst case could be large. That's because you can't 
> evict
> a dirty buffer until WAL has been flushed up to the buffer's LSN (otherwise,
> you wouldn't be following the WAL-before-data rule). And having a potentially
> large number of unevictable buffers around sounds terrible, not only for
> performance, but also for having the system keep working at all.

In the read-only level I was suggesting, I wasn't suggesting that we
stop WAL flushes, in fact we should flush the WAL before we mark the
system as read-only. Once the system declares itself as read-only, it
will not perform any more on-disk changes; It may perform all the
flushes it needs as a part of the read-only request handling.

WAL should still stream to the secondary of course, even after you mark
the primary as read-only.

> Read-only is for the queries.

What I am saying is it doesn't have to be just the queries. I think we
can cater to all the other use cases simply by forcing a checkpoint
before marking the system as read-only.

> The intention is to change the system to read-only ASAP; the checkpoint will
> make it much slower.

I agree - if one needs that speed, then they can do the equivalent of:
ALTER SYSTEM SET read_only to 'wal';
and the expensive checkpoint you mentioned can be avoided.

> I don't think we can skip control file updates that need to make read-only
> state persistent across the restart.

I was referring to control file updates post the read-only state change.
Any updates done as a part of the state change is totally cool.


Regards,
Soumyadeep (VMware)




Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-22 Thread Soumyadeep Chakraborty
Hi Amul,

On Tue, Jun 16, 2020 at 6:56 AM amul sul  wrote:
> The proposed feature is built atop the super barrier mechanism commit [1]
> to coordinate global state changes to all active backends.  Backends that
> execute the ALTER SYSTEM READ { ONLY | WRITE } command place a request to
> the checkpointer process to change the requested WAL read/write state,
> aka the WAL prohibited and WAL permitted state respectively.  When the
> checkpointer process sees the WAL prohibit state change request, it emits
> a global barrier and waits until all backends that participate in
> ProcSignal absorb it.

Why should the checkpointer have the responsibility of setting the state
of the system to read-only? Maybe this should be the postmaster's
responsibility - the checkpointer should just handle requests to
checkpoint. I think the backend requesting the read-only transition
should signal the postmaster, which in turn, will take on the aforesaid
responsibilities. The postmaster, could also additionally request a
checkpoint, using RequestCheckpoint() (if we want to support the
read-onlyness discussed in [1]). checkpointer.c should not be touched by
this feature.

Following on, any condition variable used by the backend to wait for the
ALTER SYSTEM command to finish (the patch uses
CheckpointerShmem->readonly_cv) could be housed in ProcGlobal.

Regards,
Soumyadeep (VMware)

[1] 
https://www.postgresql.org/message-id/CAE-ML%2B-zdWODAyWNs_Eu-siPxp_3PGbPkiSg%3DtoLeW9iS_eioA%40mail.gmail.com




Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-22 Thread Soumyadeep Chakraborty
Hello,

I think we should really term this feature, as it stands, as a means to
solely stop WAL writes from happening.

The feature doesn't truly make the system read-only (e.g. dirty buffer
flushes may succeed the system being put into a read-only state), which
does make it confusing to a degree.

Ideally, if we were to have a read-only system, we should be able to run
pg_checksums on it, or take file-system snapshots etc., without the need
to shut down the cluster. It would also enable an interesting use case:
we should be able to do a live upgrade of a running cluster while it
entertains read-only queries at the same time, given that all the
cluster's files would be immutable.

So if we are not going to address those cases, we should change the
syntax and remove the notion of read-only. It could be:

ALTER SYSTEM SET wal_writes TO off|on;
or
ALTER SYSTEM SET prohibit_wal TO off|on;

If we are going to try to make it truly read-only, and cater to the
other use cases, we have to:

Perform a checkpoint before declaring the system read-only (i.e. before
the command returns). This may be expensive of course, as Andres has
pointed out in this thread, but it is a price that has to be paid. If we
do this checkpoint, then we can avoid an additional shutdown checkpoint
and an end-of-recovery checkpoint (if we restart the primary after a
crash while in read-only mode). Also, we would have to prevent any
operation that touches control files, which I am not sure we do today in
the current patch.

Why not have the best of both worlds? Consider:

ALTER SYSTEM SET read_only to {off, on, wal};

-- on: wal writes off + no writes to disk
-- off: default
-- wal: only wal writes off

Of course, there can probably be better syntax for the above.

Regards,

Soumyadeep (VMware)




Re: Parallel Seq Scan vs kernel read ahead

2020-07-22 Thread Soumyadeep Chakraborty
On Tue, Jul 21, 2020 at 9:33 PM Thomas Munro  wrote:
>
> On Wed, Jul 22, 2020 at 3:57 PM Amit Kapila  wrote:
> > Yeah, that is true but every time before the test the same amount of
> > data should be present in shared buffers (or OS cache) if any which
> > will help in getting consistent results.  However, it is fine to
> > reboot the machine as well if that is a convenient way.
>
> We really should have an extension (pg_prewarm?) that knows how to
> kick stuff out of PostgreSQL's shared buffers and the page cache
> (POSIX_FADV_DONTNEED).
>
>
+1. Clearing the OS page cache on FreeBSD is non-trivial during testing.
You can't do this on FreeBSD:
sync; echo 3 > /proc/sys/vm/drop_caches

Also, it would be nice to evict only those pages from the OS page cache
that are Postgres pages instead of having to drop everything.
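
For context, the OS-level primitive such an extension would wrap is
posix_fadvise(); a minimal standalone sketch (error handling trimmed,
path handling illustrative):

#include <fcntl.h>
#include <unistd.h>

/* Drop the cached pages of one relation segment file. */
static int
evict_from_page_cache(const char *path)
{
	int			fd = open(path, O_RDONLY);
	int			rc;

	if (fd < 0)
		return -1;
	/* offset 0, len 0 means "the whole file" */
	rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	close(fd);
	return rc;
}

Since posix_fadvise() exists on both Linux and FreeBSD, an extension
built on it could evict just the relation files of interest rather than
dropping the whole cache.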

Regards,
Soumyadeep (VMware)




Re: Parallel Seq Scan vs kernel read ahead

2020-07-22 Thread Soumyadeep Chakraborty
Hi David,

Apologies for the delay, I had missed these emails.

On Tue, Jul 14, 2020 at 8:52 PM David Rowley  wrote:

> It would be good to know if the
> regression is repeatable or if it was affected by some other process.


These are the latest results on the same setup as [1].
(TL;DR: the FreeBSD VM with Google Persistent Disk is too unstable -
there is too much variability in performance to conclude that there is a
regression)


With 1 rows:

master (606c384598):

max_parallel_workers_per_gather    Time (seconds)
  0   20.09s
  19.77s
  29.92s
  69.55s


v2 patch (applied on 606c384598):

max_parallel_workers_per_gather    Time (seconds)
  0   18.34s
  19.68s
  29.15s
  69.11s


The above results were averaged across 3 runs with little or no
deviation between runs. The absolute values are very different from the
results reported in [1].

So, I tried to repro the regression as I had reported in [1] with
15000 rows:

master (449e14a561)

max_parallel_workers_per_gather    Time (seconds)
  0 42s, 42s
  1   395s, 393s
  2   404s, 403s
  6   403s, 403s

Thomas' patch (applied on 449e14a561):

max_parallel_workers_per_gather    Time (seconds)
  0  43s,43s
  1203s, 42s
  2 42s, 42s
  6 44s, 43s


v2 patch (applied on 449e14a561):

max_parallel_workers_per_gather    Time (seconds)
  0   274s, 403s
  1   419s, 419s
  2   448s, 448s
  6   137s, 419s


As you can see, I got wildly different results with 15000 rows (even
between runs of the same experiment).
I don't think that the environment is stable enough to tell if there is
any regression.

What I can say is that there are no processes apart from Postgres
running on the system. Also, the environment is pretty constrained -
just 1G of free hard drive space before the start of every run, when I
have 15000 rows, apart from the mere 32M of shared buffers and only
4G of RAM.

I don't know much about Google Persistent Disk to be very honest.
Basically, I just provisioned one when I provisioned a GCP VM for testing on
FreeBSD, as Thomas had mentioned that FreeBSD UFS is a bad case for
parallel seq scan.

> It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET
> track_io_timing = on; for each value of max_parallel_workers.

As for running EXPLAIN ANALYZE, running it on this system incurs a
staggering amount of timing overhead. This is the result of
pg_test_timing in the FreeBSD GCP VM:

$ /usr/local/pgsql/bin/pg_test_timing -d 50
Testing timing overhead for 50 seconds.
Per loop time including overhead: 4329.80 ns
Histogram of timing durations:
  < us   % of total      count
     1      0.00000          0
     2      0.00000          0
     4      3.08896     356710
     8     95.97096   11082616
    16      0.37748      43591
    32      0.55502      64093
    64      0.00638        737
   128      0.00118        136
   256      0.00002          2

As a point of comparison, on my local Ubuntu workstation:

$ /usr/local/pgsql/bin/pg_test_timing -d 50
Testing timing overhead for 50 seconds.
Per loop time including overhead: 22.65 ns
Histogram of timing durations:
  < us   % of total      count
     1     97.73691 2157634382
     2      2.26246   49945854
     4      0.00039       8711
     8      0.00016       3492
    16      0.00008       1689
    32      0.00000         63
    64      0.00000          1

This is why I opted to simply use \timing on.

Regards,

Soumyadeep (VMware)

[1] 
https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB=xyj192ezcnwgfcca_wj5ghvm7sv8oe...@mail.gmail.com




Re: Does TupleQueueReaderNext() really need to copy its result?

2020-07-11 Thread Soumyadeep Chakraborty
Hey Thomas,

On Fri, Jul 10, 2020 at 7:30 PM Thomas Munro  wrote:
>
> > I could reproduce the speed gain that you saw for a plan with a simple
> > parallel sequential scan. However, I got no gain at all for a parallel
> > hash join and parallel agg query.
>
> Right, it's not going to make a difference when you only send one
> tuple through the queue, like COUNT(*) does.

How silly of me! I should have paid more attention to the rows output
from each worker and that there was a select count(*) on the join query.
Anyway, these are a new set of results:

---

select pg_prewarm('lineitem');
select pg_prewarm('orders');
-- lineitem is 17G, orders is 4G. (TPCH scale = 20). shared_buffers = 30G

explain analyze select *
  from lineitem
  join orders on l_orderkey = o_orderkey
 where o_totalprice > 5.00;

[w/o any patch]   637s
[w/ first patch]  635s
[w/ last minimal tuple patch] 568s

---

We do indeed get the speedup.

> > As for gather merge, is it possible to have a situation where the slot
> > input to tqueueReceiveSlot() is a heap slot (as would be the case for a
> > simple select *)? If yes, in those scenarios, we would be incurring an
> > extra call to minimal_tuple_from_heap_tuple() because of the extra call
> > to ExecFetchSlotMinimalTuple() inside tqueueReceiveSlot() in your patch.
> > And since, in a gather merge, we can't avoid the copy on the leader side
> > (heap_copy_minimal_tuple() inside gm_readnext_tuple()), we would be
> > doing extra work in that scenario. I couldn't come up with a plan that
> > creates a scenario like this however.
>
> Hmm.  I wish we had a way to do an "in-place" copy-to-minimal-tuple
> where the caller supplies the memory, with some fast protocol to get
> the size right.  We could use that for copying tuples into shm queues,
> hash join tables etc without an extra palloc()/pfree() and double
> copy.

Do you mean that we should have an implementation for
get_minimal_tuple() for the heap AM and have it return a pointer to the
minimal tuple from the MINIMAL_TUPLE_OFFSET? And then a caller such as
tqueueReceiveSlot() will ensure that the heap tuple from which it wants
to extract the minimal tuple was allocated in the tuple queue in the
first place? If we consider that the node directly below a gather is a
SeqScan, then we could possibly, in ExecInitSeqScan() set-up the
ss_ScanTupleSlot to point to memory in the shared tuple queue?
Similarly, other ExecInit*() methods can do the same for other executor
nodes that involve parallelism? Of course, things would be slightly
different for the other use cases you mentioned (such as hash table
population).
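
To illustrate the kind of protocol being discussed, purely hypothetically
(none of these functions exist today; the names are made up):

/* everything below is hypothetical, for illustration only */
Size		needed = slot_minimal_tuple_size(slot); /* ask the slot for the size */
char	   *dst = shm_mq_reserve(mq, needed);		/* caller-supplied memory */

/* materialize straight into the queue: no palloc, no double copy */
slot_copy_minimal_tuple_into(slot, dst, needed);
shm_mq_commit(mq);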

All things considered, I think the patch in its current form should go
in. Having the in-place copy, could be done as a separate patch? Do you
agree?

Regards,
Soumyadeep (VMware)




Re: Does TupleQueueReaderNext() really need to copy its result?

2020-07-10 Thread Soumyadeep Chakraborty
Hi Thomas,

+1 to the idea! I ran some experiments on both of your patches.

I could reproduce the speed gain that you saw for a plan with a simple
parallel sequential scan. However, I got no gain at all for a parallel
hash join and parallel agg query.

---

select pg_prewarm('lineitem');
-- lineitem is 17G. (TPCH scale = 20). shared_buffers = 30G
explain analyze select * from lineitem;

[w/o any patch]   99s
[w/ first patch]  89s
[w/ last minimal tuple patch] 79s

---

select pg_prewarm('lineitem');
-- lineitem is 17G. (TPCH scale = 20). shared_buffers = 30G
explain analyze select count(*) from lineitem;

[w/o any patch]   10s
[w/ first patch]  10s
[w/ last minimal tuple patch] 10s

---

select pg_prewarm('lineitem');
select pg_prewarm('orders');
-- lineitem is 17G, orders is 4G. (TPCH scale = 20). shared_buffers = 30G

explain analyze select count(*)
  from lineitem
  join orders on l_orderkey = o_orderkey
 where o_totalprice > 5.00;

[w/o any patch]   54s
[w/ first patch]  53s
[w/ last minimal tuple patch] 56s

---

Maybe I'm missing something, since there should be improvements with
anything that has a gather?

As for gather merge, is it possible to have a situation where the slot
input to tqueueReceiveSlot() is a heap slot (as would be the case for a
simple select *)? If yes, in those scenarios, we would be incurring an
extra call to minimal_tuple_from_heap_tuple() because of the extra call
to ExecFetchSlotMinimalTuple() inside tqueueReceiveSlot() in your patch.
And since, in a gather merge, we can't avoid the copy on the leader side
(heap_copy_minimal_tuple() inside gm_readnext_tuple()), we would be
doing extra work in that scenario. I couldn't come up with a plan that
creates a scenario like this however.

Regards,
Soumyadeep (VMware)




Re: posgres 12 bug (partitioned table)

2020-07-09 Thread Soumyadeep Chakraborty
Hey Amit,

On Thu, Jul 9, 2020 at 12:16 AM Amit Langote  wrote:

> By the way, what happens today if you do INSERT INTO a_zedstore_table
> ... RETURNING xmin?  Do you get an error "xmin is unrecognized" or
> some such in slot_getsysattr() when trying to project the RETURNING
> list?
>

We get garbage values for xmin and cmin. If we request cmax/xmax, we get
an ERROR from slot_getsysattr()->tts_zedstore_getsysattr():
"zedstore tuple table slot does not have system attributes (except xmin
and cmin)"

A ZedstoreTupleTableSlot only stores xmin and cmin. Also,
zedstoream_insert(), which is the tuple_insert() implementation, does
not supply the xmin/cmin, thus making those values garbage.

For context, Zedstore has its own UNDO log implementation to act as
storage for transaction information. (which is intended to be replaced
with the upstream UNDO log in the future).

The above behavior is not just restricted to INSERT..RETURNING, right
now. If we do a select  from foo in Zedstore, the behavior is
the same.  The transaction information is never returned from Zedstore
in tableam calls that don't demand transactional information be
used/returned. If you ask it to do a tuple_satisfies_snapshot(), OTOH,
it will use the transactional information correctly. It will also
populate TM_FailureData, which contains xmax and cmax, in the APIs where
it is demanded.

I really wonder what other AMs are doing about these issues.

I think we should either:

1. Demand transactional information off of AMs for all APIs that involve
a projection of transactional information.

2. Have some other component of Postgres supply the transactional
information. This is what I think the upstream UNDO log can probably
provide.

3. (Least elegant) Transform tuple table slots into heap tuple table
slots (since it is the only kind of tuple storage that can supply
transactional info) and explicitly fill in the transactional values
depending on the context, whenever transactional information is
projected.

For this bug report, I am not sure what is right. Perhaps, to stop the
bleeding temporarily, we could use the pi_PartitionTupleSlot and assume
that the AM needs to provide the transactional info in the respective
insert AM API calls, as well as demand a heap slot for partition roots
and interior nodes. And then later on, we would need a larger effort
making all of these APIs not really demand transactional information.
Perhaps the UNDO framework will come to the rescue.

Regards,
Soumyadeep (VMware)




Re: posgres 12 bug (partitioned table)

2020-07-08 Thread Soumyadeep Chakraborty
Hey Amit,

On Tue, Jul 7, 2020 at 7:17 PM Amit Langote  wrote:
> Ah, I see.  You might've noticed that ExecInsert() only ever sees leaf
> partitions, because tuple routing would've switched the result
> relation to a leaf partition by the time we are in ExecInsert().  So,
> table_tuple_insert() always refers to a leaf partition's AM.   Not
> sure if you've also noticed but each leaf partition gets to own a slot
> (PartitionRoutingInfo.pi_PartitionTupleSlot), but currently it is only
> used if the leaf partition attribute numbers are not the same as the
> root partitioned table.  How about we also use it if the leaf
> partition AM's table_slot_callbacks() differs from the root
> partitioned table's slot's tts_ops?  That would be the case, for
> example, if the leaf partition is of Zedstore AM.  In the more common
> cases where all leaf partitions are of heap AM, this would mean the
> original slot would be used as is, that is, if we accept hard-coding
> table_slot_callbacks() to return a "heap" slot for partitioned tables
> as I suggest.

Even then, we will still need to fill in the system columns explicitly as
pi_PartitionTupleSlot will not be filled with system columns after
it comes back out of table_tuple_insert() if we have a non-heap AM.


Regards,
Soumyadeep (VMware)




Re: posgres 12 bug (partitioned table)

2020-07-07 Thread Soumyadeep Chakraborty
Hi Amit,

Thanks for your reply!

On Tue, Jul 7, 2020 at 7:18 AM Amit Langote  wrote:
>
> Hi Soumyadeep,
>
> Thanks for picking this up.
>
> On Tue, Jul 7, 2020 at 7:46 AM Soumyadeep Chakraborty
>  wrote:
> > Upon investigation, it seems that the problem is caused by the
> > following:
> >
> > The slot passed to the call to ExecProcessReturning() inside
> > ExecInsert() is often a virtual tuple table slot.
>
> Actually, not that often in practice.  The slot is not virtual, for
> example, when inserting into a regular non-partitioned table.

Indeed! I meant partitioned tables are a common use case. Sorry, I
should have elaborated.

> If I change this to return a "heap" slot for partitioned tables, just
> like for foreign tables, the problem goes away (see the attached).  In
> fact, even make check-world passes, so I don't know why it isn't that
> way to begin with.
>

This is what I had thought of initially but I had taken a step back for 2
reasons:

1. It is not mandatory for an AM to supply a heap tuple in the slot
passed to any AM routine. With your patch, the slot passed to
table_tuple_insert() inside ExecInsert() for instance is now expected to
supply a heap tuple for the subsequent call to ExecProcessReturning().
This can lead to garbage values for xmin, xmax, cmin and cmax. I tried
applying your patch on Zedstore [1], a columnar AM, and consequently, I
got incorrect values for xmin, xmax etc with the query reported in this
issue.

2. This is a secondary aspect but I will mention it here for
completeness. Not knowing enough about this code: will demanding heap
tuples for partitioned tables all throughout the code have a performance
impact? At a first glance it didn't seem to be the case.  However, I did
find lots of callsites for partitioning or otherwise where we kind of
expect a virtual tuple table slot (as evidenced with the calls to
ExecStoreVirtualTuple()). With your patch, we seem to be calling
ExecStoreVirtualTuple() on a heap tuple table slot, in various places:
such as inside execute_attr_map_slot(). It seems to be harmless to do so,
however, as far as my limited investigation goes.

All in all, I think we have to explicitly supply those system columns. I
heard from Daniel that one of the motivations for having table AMs
was to ensure that transaction meta-data storage is not demanded off any
AM.

[1] https://github.com/greenplum-db/postgres/tree/zedstore

Regards,
Soumyadeep




Re: posgres 12 bug (partitioned table)

2020-07-06 Thread Soumyadeep Chakraborty
Hello,

ccing pgsql-hack...@postgresql.org

Upon investigation, it seems that the problem is caused by the
following:

The slot passed to the call to ExecProcessReturning() inside
ExecInsert() is often a virtual tuple table slot. If there are system
columns other than ctid and tableOid referenced in the RETURNING clause
(not only xmin as in the bug report), it will lead to the ERROR as
mentioned in this thread as virtual tuple table slots don't really store
such columns. (ctid and tableOid are present directly in the
TupleTableSlot struct and can be satisfied from there; refer to
slot_getsysattr()).
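
For reference, that dispatch looks roughly like this (paraphrased from
slot_getsysattr() in src/include/executor/tuptable.h):

if (attnum == SelfItemPointerAttributeNumber)	/* ctid */
{
	*isnull = false;
	return PointerGetDatum(&slot->tts_tid);
}
else if (attnum == TableOidAttributeNumber)		/* tableoid */
{
	*isnull = false;
	return ObjectIdGetDatum(slot->tts_tableOid);
}
else
	/* everything else is left to the slot implementation */
	return slot->tts_ops->getsysattr(slot, attnum, isnull);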

I have attached two alternate patches to solve the problem. Both patches
use and share a mechanism to detect if there are any such system
columns. This is done inside ExecBuildProjectionInfo() and we store this
info inside the ProjectionInfo struct. Then based on this info, system
columns are populated in a suitable slot, which is then passed on to
ExecProcessReturning(). (If it is deemed that this operation should only
be done for RETURNING, we can just as easily do it in the callsite for
ExecBuildProjectionInfo() in ExecInitModifyTable() for RETURNING instead
of doing it inside ExecBuildProjectionInfo())

The first patch [1] explicitly creates a heap tuple table slot, fills in the
system column values as we would do during heap_prepare_insert() and
then passes that slot to ExecProcessReturning(). (We use a heap tuple table
slot as it is guaranteed to support these attributes).

The second patch [2] instead of relying on a heap tuple table slot,
relies on ExecGetReturningSlot() for the right slot and
table_tuple_fetch_row_version() to supply the system column values.  It
does make the assumption that the AM would supply a slot that will have
these system columns.

[1] v1-0001-Explicitly-supply-system-columns-for-INSERT.RETUR.patch
[2] v1-0001-Use-table_tuple_fetch_row_version-to-supply-INSER.patch

Regards,
Soumyadeep (VMware)
From baa48a89e571405f4cd802f065d0f82131599f53 Mon Sep 17 00:00:00 2001
From: soumyadeep2007 
Date: Sun, 5 Jul 2020 10:23:30 -0700
Subject: [PATCH v1 1/1] Explicitly supply system columns for INSERT..RETURNING
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If there are system columns other than ctid and tableOid in the
RETURNING clause for an INSERT, we must ensure that the slot we pass to
ExecProcessReturning() has those columns (ctid and tableOid are present
directly in the TupleTableSlot struct and can be satisfied from there;
refer to slot_getsysattr()).

This is in response to bug #16446 reported by Георгий Драк. In the bug
an INSERT..RETURNING statement fails with:

"ERROR: virtual tuple table slot does not have system
attributes"

This is caused due to the fact that ExecProcessReturning() was fed with
a virtual tuple table slot. Thus the resultant expression evaluation
and by extension, call to ExecEvalSysVar(), blows up with the
aforementioned error.

So, we fix this by:

1. Determining if there are system columns (other than tableOid and
ctid) referenced in ExecBuildProjectionInfo() and we store this info
inside the ProjectionInfo struct.

2. If there are such system columns in RETURNING, we explicitly create a
heap tuple table slot, fill in the system column values as we would do
during heap_prepare_insert() and then pass that slot to
ExecProcessReturning(). We use a heap tuple table slot as it is
guaranteed to support these attributes.
---
 src/backend/executor/execExpr.c| 19 ++-
 src/backend/executor/nodeModifyTable.c | 23 +++
 src/include/nodes/execnodes.h  |  2 ++
 3 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 236413f62a..8cd966d227 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -360,10 +360,12 @@ ExecBuildProjectionInfo(List *targetList,
 	ExprState  *state;
 	ExprEvalStep scratch = {0};
 	ListCell   *lc;
+	List	   *tlist_vars;
 
 	projInfo->pi_exprContext = econtext;
 	/* We embed ExprState into ProjectionInfo instead of doing extra palloc */
 	projInfo->pi_state.tag = T_ExprState;
+	projInfo->has_non_slot_system_cols = false;
	state = &projInfo->pi_state;
 	state->expr = (Expr *) targetList;
 	state->parent = parent;
@@ -455,7 +457,6 @@ ExecBuildProjectionInfo(List *targetList,
 			 */
 			ExecInitExprRec(tle->expr, state,
 			>resvalue, >resnull);
-
 			/*
 			 * Column might be referenced multiple times in upper nodes, so
 			 * force value to R/O - but only if it could be an expanded datum.
@@ -469,6 +470,22 @@ ExecBuildProjectionInfo(List *targetList,
 		}
 	}
 
+	/* Record system columns that are part of this projection */
+	tlist_vars = pull_var_clause((Node *) targetList,
+								 PVC_RECURSE_AGGREGATES |
+								 PVC_RECURSE_WINDOWFUNCS |
+								 PVC_INCLUDE_PLACEHOLDERS);
+	foreach(lc, tlist_vars)
+	{
+		Var *var = (Var *) lfirst(lc);
+		if 

Re: Extracting only the columns needed for a query

2020-06-30 Thread Soumyadeep Chakraborty
Hello,

Melanie and I will be posting a separate thread with the scanCols patch
listed here, a patch to capture the set of cols in RETURNING and a group
of patches to pass down the list of cols to various table AM functions
together as a patch set. This will take some time. Thus, we are
deregistering this patch for the July commitfest and will come back.

Regards,
Soumyadeep (VMware)




Re: Parallel Seq Scan vs kernel read ahead

2020-06-03 Thread Soumyadeep Chakraborty
On Wed, Jun 3, 2020 at 3:18 PM Soumyadeep Chakraborty
 wrote:
> Idk if that is a lesser evil than the workers
> being idle..probably not?

Apologies, I meant that the extra atomic fetches are probably a lesser
evil than the workers being idle.

Soumyadeep




Re: Parallel Seq Scan vs kernel read ahead

2020-06-03 Thread Soumyadeep Chakraborty
On Sat, May 23, 2020 at 12:00 AM Robert Haas
 wrote:
> I think there's a significant difference. The idea I remember being
> discussed at the time was to divide the relation into equal parts at
> the very start and give one part to each worker. I think that carries
> a lot of risk of some workers finishing much sooner than others.

Was the idea of work-stealing considered? Here is what I have been
thinking about (a rough sketch of the stealing step follows the
considerations below):

Each worker would be assigned a contiguous chunk of blocks at init time.
Then if a worker is finished with its work, it can inspect other
workers' remaining work and "steal" some of the blocks from the end of
the victim worker's allocation.

Considerations for such a scheme:

1. Victim selection: Who will be the victim worker? It can be selected at
random if nothing else.

2. How many blocks to steal? Stealing half of the victim's remaining
blocks seems to be fair.

3. Stealing threshold: We should disallow stealing if the victim worker
does not have enough remaining work.

4. Additional parallel state: Representing the chunk of "work". I guess
one variable for the current block and one for the last block in the
chunk allocated. The latter would have to be protected with atomic
fetches as it would be decremented by the stealing worker.

5. Point 4 implies that there might be more atomic fetch operations as
compared to this patch. Idk if that is a lesser evil than the workers
being idle..probably not? A way to salvage that a little would be to
forego atomic fetches when the amount of work remaining is less than the
threshold discussed in 3 as there is no possibility of work stealing then.
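
A rough, hypothetical sketch of the stealing step itself (the struct and
the constant are made up for illustration; a real version would also need
to make the reads of 'cur' safe, per points 4 and 5):

typedef struct ParallelChunk
{
	BlockNumber cur;			/* next block this worker will scan */
	pg_atomic_uint32 last;		/* last block in this worker's chunk */
} ParallelChunk;

#define STEAL_THRESHOLD 64		/* point 3: leave nearly-done victims alone */

/* Try to steal the upper half of the victim's remaining blocks. */
static bool
try_steal(ParallelChunk *victim, BlockNumber *lo, BlockNumber *hi)
{
	uint32		old_last = pg_atomic_read_u32(&victim->last);
	uint32		remaining;
	uint32		new_last;

	if (old_last <= victim->cur)
		return false;
	remaining = old_last - victim->cur; /* racy read; good enough to decide */
	if (remaining < STEAL_THRESHOLD)
		return false;
	new_last = old_last - remaining / 2;	/* point 2: take half */
	/* shrink the victim's chunk; bail out if we raced with someone else */
	if (!pg_atomic_compare_exchange_u32(&victim->last, &old_last, new_last))
		return false;
	*lo = new_last + 1;
	*hi = old_last;
	return true;
}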


Regards,

Soumyadeep




Re: Parallel Seq Scan vs kernel read ahead

2020-06-03 Thread Soumyadeep Chakraborty
> It doesn't look like it's using table_block_parallelscan_nextpage() as
> a block allocator so it's not affected by the patch.  It has its own
> thing zs_parallelscan_nextrange(), which does
> pg_atomic_fetch_add_u64(>pzs_allocatedtids,
> ZS_PARALLEL_CHUNK_SIZE), and that macro is 0x10.

My apologies, I was too hasty. Indeed, you are correct. Zedstore's
unit of work is chunks of the logical zstid space. There is a
correlation between the zstid and blocks: zstids near each other are
likely to lie in the same block or in neighboring blocks. It would be
interesting to try something like this patch for Zedstore.

Regards,
Soumyadeep




Re: Parallel Seq Scan vs kernel read ahead

2020-05-21 Thread Soumyadeep Chakraborty
Hi Thomas,

Some more data points:

create table t_heap as select generate_series(1, 1) i;

Query: select count(*) from t_heap;
shared_buffers=32MB (so that I don't have to clear buffers, OS page
cache)
OS: FreeBSD 12.1 with UFS on GCP
4 vCPUs, 4GB RAM Intel Skylake
22G Google PersistentDisk
Time is measured with \timing on.

Without your patch:

max_parallel_workers_per_gather    Time (seconds)
0                                  33.88s
1                                  57.62s
2                                  62.01s
6                                  222.94s

With your patch:

max_parallel_workers_per_gather    Time (seconds)
0                                  29.04s
1                                  29.17s
2                                  28.78s
6                                  291.27s

I checked with explain analyze to ensure that the number of workers
planned = max_parallel_workers_per_gather

Apart from the last result (max_parallel_workers_per_gather=6), all
the other results seem favorable.
Could the last result be down to the fact that the number of workers
planned exceeded the number of vCPUs?

I also wanted to evaluate Zedstore with your patch.
I used the same setup as above.
No discernible difference though, maybe I'm missing something:

Without your patch:

max_parallel_workers_per_gather    Time (seconds)
0                                  25.86s
1                                  15.70s
2                                  12.60s
6                                  12.41s


With your patch:

max_parallel_workers_per_gather    Time (seconds)
0                                  26.96s
1                                  15.73s
2                                  12.46s
6                                  12.10s
--
Soumyadeep


On Thu, May 21, 2020 at 3:28 PM Thomas Munro  wrote:

> On Fri, May 22, 2020 at 10:00 AM David Rowley 
> wrote:
> > On Thu, 21 May 2020 at 17:06, David Rowley  wrote:
> > > For the patch. I know you just put it together quickly, but I don't
> > > think you can do that ramp up the way you have. It looks like there's
> > > a risk of torn reads and torn writes and I'm unsure how much that
> > > could affect the test results here.
> >
> > Oops. On closer inspection, I see that memory is per worker, not
> > global to the scan.
>
> Right, I think it's safe.  I think you were probably right that
> ramp-up isn't actually useful though, it's only the end of the scan
> that requires special treatment so we don't get unfair allocation as
> the work runs out, due to coarse grain.  I suppose that even if you
> have a scheme that falls back to fine grained allocation for the final
> N pages, it's still possible that a highly distracted process (most
> likely the leader given its double duties) can finish up sitting on a
> large range of pages and eventually have to process them all at the
> end after the other workers have already knocked off and gone for a
> pint.
>
>
>


Re: WIP: expression evaluation improvements

2020-03-03 Thread Soumyadeep Chakraborty
Hello Andres,

Attached is a patch on top of
v2-0026-WIP-expression-eval-relative-pointer-suppport.patch that eliminates
the const pointer references to fmgrInfo in the generated code.

FmgrInfos are now allocated like the FunctionCallInfos are
(ExprBuilderAllocFunctionMgrInfo()) and are initialized with
expr_init_fmgri().

Unfortunately, inside expr_init_fmgri(), I had to emit const pointers to set
fn_addr, fn_extra, fn_mcxt and fn_expr.

fn_addr and fn_mcxt should always have the same const pointer value between
two identical calls. So this isn't too bad?

fn_extra is NULL most of the time. So not too bad?

fn_expr is very difficult to eliminate because it is allocated way earlier.
Is it something that will have a const pointer value between two identical
calls? (I don't know enough about plan caching... I ran the same query twice
and it seemed to have different pointer values.) Eliminating this pointer
poses a similar challenge to that of FunctionCallInfo->context. fn_expr is
allocated quite early on. I had tried writing ExprBuilderAllocNode() to
handle the context field. The trouble with writing something like
expr_init_node(), or something even more specific like expr_init_percall()
(for the per-call context for aggs), is that these structs have lots of
pointer references to further pointers and so on; eventually we would have
to emit some const pointers.
One naive way to handle this problem may be to emit a call to the _copy*()
functions inside expr_init_node(). It wouldn't be as performant though.

We could decide to live with the const pointers even if our cache key would
be the generated code. The caching layer could be made smart enough to
ignore such pointer references, OR we could feed the caching layer with
generated code that has been passed through a custom pass that normalizes
all const pointer values to some predetermined / sentinel value. To help
the custom pass, we could emit some metadata when we generate a const
pointer (that we know won't have the same const pointer value) to tell the
pass to ignore it.

Soumyadeep


v1-0001-jit-Eliminate-const-pointer-to-fmgrinfo.patch
Description: Binary data


Re: WIP: expression evaluation improvements

2020-02-19 Thread Soumyadeep Chakraborty
Hey Andres,

> Awesome! +1. Attached 2nd version of patch rebased on master.
> (v2-0001-Resolve-PL-handler-names-for-JITed-code-instead-o.patch)
>
>
>
> > Did you check whether there's any cases this fails in the tree with your
> > patch applied? The way I usually do that is by running the regression
> > tests like
> > PGOPTIONS='-cjit_above_cost=0' make -s -Otarget check-world
> >
> > (which will take a bit longer if use an optimized LLVM build, and a
> > *lot* longer if you use a debug llvm build)
>
>
>
> Great suggestion! I used:
> PGOPTIONS='-c jit_above_cost=0' gmake installcheck-world
> It all passed except a couple of logical decoding tests that never pass
> on my machine for any tree (t/006_logical_decoding.pl and
> t/010_logical_decoding_timelines.pl) and point (which seems to be failing
> even on master as of d80be6f2f). I have attached the regression.diffs,
> which captures the point failure.

I have attached the 3rd version of the patch rebased on master. I made one
slight modification to the previous patch. PL handlers, such as that of
plsh, can be in an external library. So I account for that in modname
(earlier, I naively set it to NULL). There are also some minor changes to
the comments, and I have rehashed the commit message.

Apart from running the regress tests as you suggested above, I installed
plsh
and forced JIT on the following:

CREATE FUNCTION query_plsh (x int) RETURNS text
LANGUAGE plsh
AS $$
#!/bin/sh
psql -At -c "select 1"
$$;

SELECT query_plsh(5);

and I also ran plsh's make installcheck with jit_above_cost = 0. Everything
looks good. I think this is ready for another round of review. Thanks!!

Soumyadeep


v3-0001-Resolve-PL-handler-names-for-JITed-code-instead-o.patch
Description: Binary data


Re: Zedstore - compressed in-core columnar storage

2020-02-14 Thread Soumyadeep Chakraborty
Hello,

We (David and I) recently observed that a Zedstore table can be considerably
bloated when we load data into it with concurrent copies. Also, we found
that concurrent insert performance was less than desirable. This is a
detailed analysis of the extent of the problem, the cause of the problem,
and how we fixed it. This has input from much of our team: Alex, Ashwin,
David, Heikki, Melanie, Taylor and myself.

An example of the bloat that we observed:
TPC-DS scale = 270:
Table        heap    zed(serial)    zed(16 parallel COPYs)
web_sales    39G     19G            39G

We found that it was caused by inefficient page splits resulting from
out-of-tid-order inserts into full or nearly full attribute tree leaf pages.
The degree of under-utilization was significant: attribute tree leaves from
a serial data load had 6-8x more datums than the attribute tree leaves
resulting from a parallel load of 16 sessions.

Consider the scenario below:

Assumptions:
1. Let us consider two concurrent copy commands executing (sessions S1 and
S2).
2. The table has only one (fixed-length for sake of argument) attribute 'a'.
3. For attribute 'a', a full attribute tree leaf page can accommodate 1500
datums.

TID allocations:
S1: 1-1000
S2: 1001-2000, 2001-3000

Order of operations:

1. S2 writes datums for tids 1001-2000, 2001-3000.
The resulting leaves are:
L1:
lokey    = 1       hikey   = 2500
firsttid = 1001    lasttid = 2500
L2:
lokey    = 2501    hikey   = MaxZSTid
firsttid = 2501    lasttid = 3000

2. S1 now writes datums for tids 1-1000.
We have to split L1 into L1' and L1''.
L1':
lokey    = 1       hikey   = 1500
firsttid = 1       lasttid = 1500
L1'': [under-utilized page]
lokey    = 1501    hikey   = 2000
firsttid = 1501    lasttid = 2000
L2:
lokey    = 2501    hikey   = MaxZSTid
firsttid = 2501    lasttid = 3000

Note: The lokeys/hikeys reflect ranges of what CAN be inserted, whereas
firsttid and lasttid reflect what actually has been inserted.

L1'' will be an under-utilized page that is not going to be filled again
because it inherits the tight hikey from L1. In this example, space wastage
in L1'' is 66%, but it could very easily be close to 100%, especially under
concurrent workloads that mix single and multi-inserts, or even unequally
sized multi-inserts.

Solution (kudos to Ashwin!):

For every multi-insert (and only multi-insert, not for singleton inserts),
allocate N times more tids. Each session will keep these extra tids in a
buffer. Subsequent calls to multi-insert would use these buffered tids. If
at any time a tid allocation request cannot be met by the remaining
buffered tids, a new batch of N times the number of tids requested will
again be allocated.
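
As a rough sketch (all names here are invented; zs_get_new_tids() stands in
for the existing lock-protected tid allocator), the per-backend buffering
could look like:

    #include <stdint.h>

    #define TID_ALLOC_MULTIPLIER 100    /* the "N" discussed above */

    /* Per-backend buffer of pre-allocated, contiguous tids. */
    static struct
    {
        uint64_t next;      /* first unused buffered tid */
        uint64_t end;       /* one past the last buffered tid */
    } zs_buffered_tids = {0, 0};

    /* Assumed existing allocator: hands out ntids contiguous tids under lock. */
    extern uint64_t zs_get_new_tids(void *rel, uint64_t ntids);

    /*
     * Tid allocation for multi-inserts only (singleton inserts keep using
     * the plain allocator): serve requests from the buffer, and
     * over-allocate by a factor of N whenever the buffer runs dry.
     */
    static uint64_t
    multi_insert_get_tids(void *rel, uint64_t ntids)
    {
        uint64_t first;

        if (zs_buffered_tids.end - zs_buffered_tids.next < ntids)
        {
            uint64_t batch = ntids * TID_ALLOC_MULTIPLIER;

            zs_buffered_tids.next = zs_get_new_tids(rel, batch);
            zs_buffered_tids.end = zs_buffered_tids.next + batch;
        }
        first = zs_buffered_tids.next;
        zs_buffered_tids.next += ntids;
        return first;
    }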

If we take the same example above and say we allocated N=5 times the number
of tids upon the first request for 1000 tids:

TID allocations:
S1: 1-5000
S2: 5001-10000

Order of operations:

1. S2 writes datums for tids 5001-6000, 6001-7000.
The resulting leaves are:
L1:
lokey    = 1       hikey   = 6500
firsttid = 5001    lasttid = 6500
L2:
lokey    = 6501    hikey   = MaxZSTid
firsttid = 6501    lasttid = 7000

2. S1 writes datums for tids 1-1000.
L1 will be split into L1' and L1''.

L1':
lokey    = 1       hikey   = 5500
firsttid = 1       lasttid = 1000
L1'' [under-utilized page]:
lokey    = 5501    hikey   = 6500
firsttid = 5501    lasttid = 6500
L2:
lokey    = 6501    hikey   = MaxZSTid
firsttid = 6501    lasttid = 7000

Subsequent inserts by S1 will land on L1' whose hikey isn't restrictive.

However, we do end up with the inefficient page L1''. With a high enough
value of N, we reduce the frequency of such pages. We could further reduce
this wastage by incorporating a special left split (since L1 was already
full, we don't change it at all -> we simply update its lokey -> L1 becomes
L1'' and we fork off a new leaf to its left: L1'). This would look like:

L1':
lokey    = 1       hikey   = 5000
firsttid = 1       lasttid = 1000

L1'':
lokey    = 5001    hikey   = 6500
firsttid = 5001    lasttid = 6500

We found that with a high enough value of N, we did not get significant
space benefits from the left split. Thus, we decided to only incorporate N.

Results: [TPC-DS scale = 50, 16 conc copies]

Table              zed     N=10    N=100    N=1000    heap    zed(serial)
catalog_sales      15G     9.1G    7.7G     7.5G      15G     8.0G
catalog_returns    1.5G    0.9G    0.7G     0.7G      1.2G    0.8G
store_returns      2.1G    1.2G    1.1G     1.1G      1.9G    1.2G
store_sales        17G     11G     10.1G    10.1G     21G     10G

Load time:
N=10      30min
N=100     10min
N=1000    7min
zed       100min
heap      8min

'zed' refers to the zedstore branch without our fix. We see that with N =
10, we get closer to what we get with serial inserts. For N = 100, we even 

Re: WIP: expression evaluation improvements

2020-02-09 Thread Soumyadeep Chakraborty
Hi Andres,

> Could you expand on what you mean here? Are you saying that you got
> significantly better optimization results by doing function optimization
> early on?  That'd be surprising imo?

Sorry for the ambiguity, I meant that I had observed differences in the
sizes
of the bitcode files dumped.

These are the size differences that I observed (for TPCH Q1):
Without my patch:
-rw---   1 pivotal  staff   278K Feb  9 11:59 1021.0.bc
-rw---   1 pivotal  staff   249K Feb  9 11:59 1374.0.bc
-rw---   1 pivotal  staff   249K Feb  9 11:59 1375.0.bc
With my patch:
-rw---   1 pivotal  staff   245K Feb  9 11:43 88514.0.bc
-rw---   1 pivotal  staff   245K Feb  9 11:43 88515.0.bc
-rw---   1 pivotal  staff   270K Feb  9 11:43 79323.0.bc

This means that the sizes of the module when execution encountered:

if (jit_dump_bitcode)
{
	char	   *filename;

	filename = psprintf("%u.%zu.bc",
						MyProcPid,
						context->module_generation);
	LLVMWriteBitcodeToFile(context->module, filename);
	pfree(filename);
}

were smaller with my patch applied. This means there is less memory
pressure between when the functions were built and when
llvm_compile_module() is called. I don't know if the difference is
practically significant.

Soumyadeep


Re: WIP: expression evaluation improvements

2020-02-09 Thread Soumyadeep Chakraborty
Hi Andres,
> I've comitted a (somewhat evolved) version of this patch. I think it
> really improves the code!
Awesome! Thanks for taking it forward!

> I do wonder about adding a variadic wrapper like the one introduced here
> more widely, seems like it could simplify a number of places. If we then
> redirected all function calls through a common wrapper, for LLVMBuildCall,
> that also validated parameter count (and perhaps types), I think it'd be
> easier to develop...
+1. I was wondering whether such validations should be Asserts instead of
ERRORs.

Regards,

Soumyadeep Chakraborty
Senior Software Engineer
Pivotal Greenplum
Palo Alto


On Thu, Feb 6, 2020 at 10:35 PM Andres Freund  wrote:

> Hi,
>
> On 2019-10-28 23:58:11 -0700, Soumyadeep Chakraborty wrote:
> > > Sorry for not replying to that earlier.  I'm not quite sure it's
> > > actually worthwhile doing so - did you try to measure any memory / cpu
> > > savings?
> >
> > No problem, thanks for the reply! Unfortunately, I did not do anything
> > significant in terms of mem/cpu measurements. However, I have noticed
> > non-trivial differences between optimized and unoptimized .bc files
> > that were dumped from time to time.
>
> Could you expand on what you mean here? Are you saying that you got
> significantly better optimization results by doing function optimization
> early on?  That'd be surprising imo?
>
> Greetings,
>
> Andres Freund
>


Re: Default JIT setting in V12

2020-01-29 Thread Soumyadeep Chakraborty
Hello,

Based on this thread, Alexandra and I decided to investigate if we could
borrow some passes from -O1 and add them on top of the default optimization
of -O0 and mem2reg. To determine which passes would make the most sense, we
ran ICW with jit_above_cost set to 0, dumped all the backends and then
analyzed them with 'opt'. The dumped stats showed that the instcombine and
sroa passes had the most scope for optimization. We have attached the stats
we dumped.

Then, we investigated whether mixing in sroa and instcombine gave us a
better run time. We used TPCH Q1 (TPCH repo we used:
https://github.com/dimitri/tpch-citus) at scales of 1, 5 and 50. We found
that there was no significant difference in query runtime over the default
of -O0 with mem2reg.

We also performed the same experiment with -O1 as the default
optimization level, as Andres had suggested on this thread. We found
that the results were much more promising (refer to the results for scale
= 5 and 50 below). At the lower scale of 1, we had to force optimization
because the query cost did not reach the optimization threshold. There was
no adverse impact from the increased query optimization time due to the
ramp up to -O1 at this lower scale.
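
As a sketch of how the cheap path could be bumped, in terms of the
LLVMPassManagerBuilder API that the module optimization code drives
(simplified, and not the actual patch; choose_opt_level() is an invented
helper):

    #include <llvm-c/Transforms/PassManagerBuilder.h>
    #include <stdbool.h>

    /*
     * Pick the optimization level for the module pass pipeline. Today the
     * cheap path is effectively -O0 plus mem2reg; the proposal is to make
     * it -O1, keeping -O3 for queries above jit_optimize_above_cost.
     */
    static void
    choose_opt_level(LLVMPassManagerBuilderRef pmb, bool expensive)
    {
        LLVMPassManagerBuilderSetOptLevel(pmb, expensive ? 3 : 1);
    }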


Results summary (eyeball-averaged over 5 runs, excluding first run after
restart. For each configuration we flushed the OS cache and restarted the
database):

settings: max_parallel_workers_per_gather = 0

scale = 50:
-O3                                : 77s
-O0 + mem2reg                      : 107s
-O0 + mem2reg + instcombine        : 107s
-O0 + mem2reg + sroa               : 107s
-O0 + mem2reg + sroa + instcombine : 107s
-O1                                : 84s

scale = 5:
-O3                                : 8s
-O0 + mem2reg                      : 10s
-O0 + mem2reg + instcombine        : 10s
-O0 + mem2reg + sroa               : 10s
-O0 + mem2reg + sroa + instcombine : 10s
-O1                                : 8s

scale = 1:
-O3                                : 1.7s
-O0 + mem2reg                      : 1.7s
-O0 + mem2reg + instcombine        : 1.7s
-O0 + mem2reg + sroa               : 1.7s
-O0 + mem2reg + sroa + instcombine : 1.7s
-O1                                : 1.7s

Based on the evidence above, maybe it is worth considering ramping up the
default optimization level to -O1.

Regards,

Soumyadeep and Alexandra


opt_dump.pdf
Description: Adobe PDF document


Re: WIP: expression evaluation improvements

2019-10-29 Thread Soumyadeep Chakraborty
Hi Andres,

> I think I'd probably try to apply this to master independent of the
> larger patchset, to avoid a large dependency.

Awesome! +1. Attached 2nd version of patch rebased on master.
(v2-0001-Resolve-PL-handler-names-for-JITed-code-instead-o.patch)

> Did you check whether there's any cases this fails in the tree with your
> patch applied? The way I usually do that is by running the regression
> tests like
> PGOPTIONS='-cjit_above_cost=0' make -s -Otarget check-world
>
> (which will take a bit longer if use an optimized LLVM build, and a
> *lot* longer if you use a debug llvm build)

Great suggestion! I used:
PGOPTIONS='-c jit_above_cost=0' gmake installcheck-world
It all passed except a couple of logical decoding tests that never pass
on my machine for any tree (t/006_logical_decoding.pl and
t/010_logical_decoding_timelines.pl) and point (which seems to be failing
even on master as of d80be6f2f). I have attached the regression.diffs,
which captures the point failure.

> Hm. Aren't you breaking things here? If fmgr_symbol returns a basename
> of NULL, as is the case for all internal functions, you're going to
> print a NULL pointer, no?

For internal functions, it is supposed to return modname = NULL, but basename
will be non-NULL, right? As things stand, fmgr_symbol can never return a
NULL basename. I have added an Assert to make that even more explicit.

> Cool! I'll probably merge that into my patch (with attribution of
> course).
>
> I wonder if it'd nicer to not have separate C variables for all of
> these, and instead look them up on-demand from the module loaded in
> llvm_create_types(). Not sure.

Great! It is much nicer indeed. Attached version 2 with your suggested
changes.
(v2-0001-Rely-on-llvmjit_types-for-building-EvalFunc-calls.patch)
Used the same testing method as above.

> Sorry for not replying to that earlier.  I'm not quite sure it's
> actually worthwhile doing so - did you try to measure any memory / cpu
> savings?

No problem, thanks for the reply! Unfortunately, I did not do anything
significant in terms of mem/cpu measurements. However, I have noticed
non-trivial differences between optimized and unoptimized .bc files that
were dumped from time to time.

> The magnitude of wins aside, I also have a local patch that I'm going to
> try to publish this or next week, that deduplicates functions more
> aggressively, mostly to avoid redundant optimizations. It's quite
> possible that we should run that before the function passes - or even
> give up entirely on the function pass optimizations. As the module pass
> manager does the same optimizations it's not that clear in which cases
> it'd be beneficial to run it, especially if it means we can't
> deduplicate before optimizations.

Agreed, excited to see the patch!

--
Soumyadeep


v2-0001-Rely-on-llvmjit_types-for-building-EvalFunc-calls.patch
Description: Binary data


v2-0001-Resolve-PL-handler-names-for-JITed-code-instead-o.patch
Description: Binary data


point_failure.diffs
Description: Binary data


Re: WIP: expression evaluation improvements

2019-10-28 Thread Soumyadeep Chakraborty
Hi Andres,

Apologies, I realize my understanding of symbol resolution and the
referenced_functions mechanism wasn't correct. Thank you for your very
helpful
explanations.

> There's also a related edge-case where we are unable to figure out a symbol
> name in llvm_function_reference(), and then resort to creating a global
> variable pointing to the function.

Indeed.

> If indeed the only case this is being hit is language PL handlers, it
> might be better to instead work out the symbol name for that handler -
> we should be able to get that via pg_language.lanplcallfoid.

I took a stab at this (on top of your patch set):
v1-0001-Resolve-PL-handler-names-for-JITed-code-instead-o.patch

> Which cases are you talking about here? Because I don't think there's
> any others where would know a symbol name to add to referenced_functions
> in the first place?

I had misunderstood the intent of referenced_functions.

> I do want to benefit from getting accurate signatures for patch
> [PATCH v2 26/32] WIP: expression eval: relative pointer suppport
> I had a number of cases where I passed the wrong parameters, and llvm
> couldn't tell me...

I took a stab:
v1-0001-Rely-on-llvmjit_types-for-building-EvalFunc-calls.patch


On a separate note, I had earlier submitted a patch to optimize functions
as soon as they are generated, in accordance with the code comment:
/*
 * Do function level optimization. This could be moved to the point where
 * functions are emitted, to reduce memory usage a bit.
 */
 LLVMInitializeFunctionPassManager(llvm_fpm);
Refer:
https://www.postgresql.org/message-id/flat/cae-ml+_oe4-shvn0aa_qakc5qkzvqvainxwb1ztuut67spm...@mail.gmail.com
I have rebased that patch on top of your patch set. Here it is:
v2-0001-Optimize-generated-functions-earlier-to-lower-mem.patch

--
Soumyadeep


v2-0001-Optimize-generated-functions-earlier-to-lower-mem.patch
Description: Binary data


v1-0001-Rely-on-llvmjit_types-for-building-EvalFunc-calls.patch
Description: Binary data


v1-0001-Resolve-PL-handler-names-for-JITed-code-instead-o.patch
Description: Binary data


Re: WIP: expression evaluation improvements

2019-10-24 Thread Soumyadeep Chakraborty
Hey Andres,

After looking at
v2-0006-jit-Reference-functions-by-name-in-IOCOERCE-steps.patch, I was
wondering about other places in the code where we have const pointers to
functions outside LLVM's purview: especially EEOP_FUNCEXPR* for any function
call expressions, EEOP_DISTINCT and EEOP_NULLIF (which involve
operator-specific comparison function call invocations), deserialization and
trans functions for aggregates, etc. All of the above cases involve, to some
degree, server functions that can be inlined/optimized.

If we do go down this road, the most immediate solution that comes to mind
would be to populate referenced_functions[] with these. Also, we can replace
all l_ptr_const() calls taking function addresses with calls to
llvm_function_reference() (this is safe as it falls back to an l_ptr_const()
call). We could do the l_ptr_const() -> llvm_function_reference() change
even if we don't go down this road.

One con of the approach above would be bloating of llvmjit_types.bc, but we
would be introducing @declares instead of @defines in the IR... so I think
that is fine.

Let me know your thoughts. I would like to submit a patch here in this
thread or
elsewhere.

--
Soumyadeep


JIT: Optimize generated functions earlier to lower memory usage

2019-10-06 Thread Soumyadeep Chakraborty
Hello,

This is to introduce a patch to lower the memory footprint of JITed
code by optimizing functions at the function level (i.e. with
function-level optimization passes) as soon as they are generated.
This addresses the code comment inside llvm_optimize_module():

/*
 * Do function level optimization. This could be moved to the point where
 * functions are emitted, to reduce memory usage a bit.
 */
 LLVMInitializeFunctionPassManager(llvm_fpm);
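
A sketch of what moving this to emission time could look like (hypothetical
helper; the patch may structure it differently):

    #include <llvm-c/Core.h>

    /*
     * Run the function-level passes on a single function right after it is
     * emitted, so its IR shrinks before llvm_compile_module() runs, instead
     * of optimizing all functions in one batch at the end. Assumes
     * LLVMInitializeFunctionPassManager() was already called on fpm.
     */
    static void
    optimize_emitted_function(LLVMPassManagerRef fpm, LLVMValueRef fn)
    {
        (void) LLVMRunFunctionPassManager(fpm, fn); /* true if fn changed */
    }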

 --
 Soumyadeep (Deep)


v1-0001-Optimize-generated-functions-earlier-to-lower-mem.patch
Description: Binary data


Re: Don't codegen deform code for virtual tuples in expr eval for scan fetch

2019-09-30 Thread Soumyadeep Chakraborty
Awesome! Thanks so much for all the review! :)

--
Soumyadeep


Re: Don't codegen deform code for virtual tuples in expr eval for scan fetch

2019-09-30 Thread Soumyadeep Chakraborty
Hi Andres,

I don't feel very strongly about the changes I proposed.

> > I completely agree, that was an important consideration.
> >
> > I had some purely cosmetic suggestions:
> > 1. Rename ExecComputeSlotInfo to eliminate the need for the asserts.
>
> How does renaming it do so? I feel like the asserts are a good idea
> independent of anything else?

I felt that encoding the restriction that the function is meant to be called
only in the context of fetch operations in the function name itself
ensured that we don't call it from a non-fetch operation - something the
asserts within ExecComputeSlotInfo() are guarding against.

>
> > 2. Extract return value to a bool variable for slightly better
> > readability.
>
> To me that seems clearly worse. The variable doesn't add anything, but
> needing to track more state.>

Agreed.

--
Soumyadeep


Re: Don't codegen deform code for virtual tuples in expr eval for scan fetch

2019-09-28 Thread Soumyadeep Chakraborty
Hey,

I completely agree, that was an important consideration.

I had some purely cosmetic suggestions:
1. Rename ExecComputeSlotInfo to eliminate the need for the asserts.
2. Extract return value to a bool variable for slightly better
readability.
3. Taking the opportunity to use TTS_IS_VIRTUAL.

v2 of patch set attached. The first two patches are unchanged, the cosmetic
changes are part of v2-0003-Some-minor-cosmetic-changes.patch.

--
Soumyadeep


v2-0001-Reduce-code-duplication-for-ExecJust-Var-operatio.patch
Description: Binary data


v2-0002-Don-t-generate-EEOP_-_FETCHSOME-operations-for-sl.patch
Description: Binary data


v2-0003-Some-minor-cosmetic-changes.patch
Description: Binary data


Re: Don't codegen deform code for virtual tuples in expr eval for scan fetch

2019-09-25 Thread Soumyadeep Chakraborty
Hi Andres,

Thank you for your insight and the link offered just the context I needed!

On Wed, Sep 25, 2019 at 1:06 PM Andres Freund  wrote:

> > I'm doubtful this is worth the complexity - and note that we already have
> plenty other places with zero length blocks.

Agreed.

On Wed, Sep 25, 2019 at 1:06 PM Andres Freund  wrote:

> > WRT jit_optimize_above_cost not triggering, I think we need to replace
> the "basic, unoptimized" codegen path with one that does cheap
> optimizations only. The reason we don't do the full optimizations all
> the time is that they're expensive, but there's enough optimizations
> that are pretty cheap.  At some point we'll probably need our own
> optimization pipeline, but I don't want to maintain that right now
> (i.e. if some other people want to help with this aspect, cool)...

I would absolutely love to work on this!

I was thinking the same. Perhaps not only replace the default on the
compile path, but also introduce additional levels. These levels could then
be configured at a granularity finer than -O0, -O1, ... In other words,
perhaps we could have levels with ad-hoc pass combinations. We could then
maybe have costs associated with each level. I think such a framework
would be ideal.

To summarize this into TODOs:
1. Tune default compilation path optimizations.
2. New costs per optimization level.
3. Capacity to define "levels" with ad-hoc pass combinations that are
cost-sensitive.

Is this what you meant by "optimization pipeline"?

--
Soumyadeep


Re: Don't codegen deform code for virtual tuples in expr eval for scan fetch

2019-09-25 Thread Soumyadeep Chakraborty
Hello Andres,

Thank you very much for reviewing my patch!

On Wed, Sep 25, 2019 at 1:02 PM Andres Freund  wrote:
> IOW, wherever ExecComputeSlotInfo() is called, we should only actually
> push the expression step, when ExecComputeSlotInfo does not determine
> that a) the slot is virtual, b) and fixed (i.e. guaranteed to always be
> the same type of slot).
>
> Does that make sense?

That is a great suggestion and I totally agree. I have attached a patch
that achieves this.

--
Soumyadeep


v1-0001-Skip-deform-expr-step-emission-for-virtual-tuples.patch
Description: Binary data


Re: Don't codegen deform code for virtual tuples in expr eval for scan fetch

2019-09-20 Thread Soumyadeep Chakraborty
Hello,

In my previous patch 0001, the resulting opblock consisted of a single
br instruction to its successor opblock. Such a block represents
unnecessary overhead. Even though such a block would be optimized
away, what if optimization is not performed (perhaps due to
jit_optimize_above_cost)? Perhaps we could be more aggressive. We
could maybe remove the opblock altogether. However, such a solution
is not without complexity.

Such unlinking is non-trivial, and is one of the things the
simplify-cfg pass takes care of. One solution could be to run this
pass ad-hoc for the evalexpr function. Or we could perform the
unlinking during codegen. Such a solution is presented below.

To unlink the current opblock from
the cfg we have to replace all of its uses. From the current state of
the code, these uses are one of:
1. Terminator instruction of opblocks[i - 1]
2. Terminator instruction of some sub-op-block of opblocks[i - 1]. By
sub-op-block, I mean some block that is directly linked from opblocks[i - 1]
but not opblocks[i].
3. Terminator instruction of the entry block.

We should replace all of these uses with opblocks[i + 1], and then and
only then can we delete opblocks[i]. Patch v2-0001 in my latest patch set
achieves this.
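
In llvm-c terms, the unlinking boils down to something like this sketch
(assuming, as the Asserts below check, that every use of the dead block is
a branch terminator):

    #include <llvm-c/Core.h>

    /*
     * Redirect every use of opblocks[i] (the branch terminators in the
     * entry block, opblocks[i - 1] and its sub-op-blocks) to
     * opblocks[i + 1], then delete the now-unreachable block.
     */
    static void
    unlink_opblock(LLVMBasicBlockRef *opblocks, int i)
    {
        LLVMReplaceAllUsesWith(LLVMBasicBlockAsValue(opblocks[i]),
                               LLVMBasicBlockAsValue(opblocks[i + 1]));
        LLVMDeleteBasicBlock(opblocks[i]);
    }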

I guard LLVMReplaceAllUsesWith() with Assert()s to ensure that we
don't accidentally replace non-terminator uses of opblocks[i], should
they be introduced in the future. If these asserts fail in the future,
replacing these non-terminator instructions with undefs constitutes a
commonly adopted solution. Otherwise, we can always fall back to making
opblocks[i] empty with just the unconditional br, as in my previous
patch 0001.

--
Soumyadeep

On Tue, Sep 17, 2019 at 11:54 PM Soumyadeep Chakraborty <
soumyadeep2...@gmail.com> wrote:

> Hello Hackers,
>
> This is to address a TODO I found in the JIT expression evaluation
> code (opcode =
> EEOP_INNER_FETCHSOME/EEOP_OUTER_FETCHSOME/EEOP_SCAN_FETCHSOME):
>
> * TODO: skip nvalid check if slot is fixed and known to
> * be a virtual slot.
>
> Not only should we skip the nvalid check if the tuple is virtual, the
> whole basic block should be a no-op. There is no deforming to be done,
> JIT (slot_compile_deform) or otherwise (slot_getsomeattrs) for virtual
> tuples.
>
> Attached is a patch 0001 that achieves exactly that by:
>
> 1. Performing the check at the very beginning of the case.
> 2. Emitting an unconditional branch to the next opblock if we have a
> virtual tuple. We end up with a block with just the single
> unconditional branch instruction in this case. This block is optimized
> away when we run llvm_optimize_module().
>
> Also enclosed is another patch 0002 that adds on some minor
> refactoring of conditionals around whether to JIT deform or not.
>
> ---
> Soumyadeep Chakraborty
>


v2-0001-Don-t-codegen-if-tuple-is-virtual-in-expr-eval-of.patch
Description: Binary data


v2-0002-Minor-refactor-JIT-deform-or-not.patch
Description: Binary data


Don't codegen deform code for virtual tuples in expr eval for scan fetch

2019-09-18 Thread Soumyadeep Chakraborty
Hello Hackers,

This is to address a TODO I found in the JIT expression evaluation
code (opcode =
EEOP_INNER_FETCHSOME/EEOP_OUTER_FETCHSOME/EEOP_SCAN_FETCHSOME):

* TODO: skip nvalid check if slot is fixed and known to
* be a virtual slot.

Not only should we skip the nvalid check if the tuple is virtual, the
whole basic block should be a no-op. There is no deforming to be done,
JIT (slot_compile_deform) or otherwise (slot_getsomeattrs) for virtual
tuples.

Attached is a patch 0001 that achieves exactly that by:

1. Performing the check at the very beginning of the case.
2. Emitting an unconditional branch to the next opblock if we have a
virtual tuple. We end up with a block with just the single
unconditional branch instruction in this case. This block is optimized
away when we run llvm_optimize_module().
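
A condensed sketch of the emission logic (loosely following llvmjit_expr.c;
the slot_is_fixed_virtual flag is an assumption standing in for the real
check on the step's slot):

    #include <llvm-c/Core.h>
    #include <stdbool.h>

    /*
     * For a *_FETCHSOME step on a slot that is fixed and known virtual,
     * there is nothing to deform: emit only a jump to the next opblock,
     * which llvm_optimize_module() later folds away entirely.
     */
    static void
    emit_fetchsome(LLVMBuilderRef b, LLVMBasicBlockRef *opblocks, int opno,
                   bool slot_is_fixed_virtual)
    {
        LLVMPositionBuilderAtEnd(b, opblocks[opno]);

        if (slot_is_fixed_virtual)
        {
            LLVMBuildBr(b, opblocks[opno + 1]);
            return;
        }

        /* ... otherwise emit the nvalid check and the deforming code ... */
    }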

Also enclosed is another patch 0002 that adds on some minor
refactoring of conditionals around whether to JIT deform or not.

---
Soumyadeep Chakraborty


0001-Don-t-codegen-if-tuple-is-virtual-in-expr-eval-of-sc.patch
Description: Binary data


0002-Minor-refactor-JIT-deform-or-not.patch
Description: Binary data


Infinite wait for SyncRep while handling USR1

2019-08-22 Thread Soumyadeep Chakraborty
Hello Hackers,

We found an edge case in 9_5_STABLE (it doesn't reproduce on 9_6+) in how
backends handle the TERM signal while handling a USR1 signal, which can
cause them to wait forever. A backend can end up waiting forever inside
SyncRepWaitForLSN() at:

rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
               -1, WAIT_EVENT_SYNC_REP);

Even pg_terminate_backend() would not be able to terminate it after it
reaches this point.

Stack trace:

#0  0x7f39418d73c8 in poll () from /lib64/libc.so.6
#1  0x00753d7b in WaitLatchOrSocket (latch=0x7f3940d371cc,
wakeEvents=17, sock=-1, timeout=-1)
at pg_latch.c:333
#2  0x00753b8c in WaitLatch (latch=0x7f3940d371cc, wakeEvents=17,
timeout=-1) at pg_latch.c:197
#3  0x007a0e02 in SyncRepWaitForLSN (XactCommitLSN=167868344) at
syncrep.c:231
#4  0x00518b8a in RecordTransactionCommit () at xact.c:1349
#5  0x0051966d in CommitTransaction () at xact.c:2057
#6  0x0051a1f9 in CommitTransactionCommand () at xact.c:2769
#7  0x0054ff5e in RemoveTempRelationsCallback (code=1, arg=0) at
namespace.c:3878
#8  0x007be3d1 in shmem_exit (code=1) at ipc.c:228
#9  0x007be2c6 in proc_exit_prepare (code=1) at ipc.c:185
#10 0x007be234 in proc_exit (code=1) at ipc.c:102
#11 0x0093152e in errfinish (dummy=0) at elog.c:535
#12 0x007ea492 in ProcessInterrupts () at postgres.c:2913
#13 0x007e9f93 in die (postgres_signal_arg=15) at postgres.c:2682
#14 <signal handler called>
#15 procsignal_sigusr1_handler (postgres_signal_arg=10) at procsignal.c:271
#16 <signal handler called>
#17 SocketBackend (inBuf=0x7ffc8dde0f80) at postgres.c:353
#18 0x007e70ca in ReadCommand (inBuf=0x7ffc8dde0f80) at
postgres.c:510
#19 0x007eb990 in PostgresMain (argc=1, argv=0x1708cd0,
dbname=0x1708b38 "gpadmin",
username=0x1708b18 "gpadmin") at postgres.c:4032
#20 0x00769922 in BackendRun (port=0x172e060) at postmaster.c:4309
#21 0x00769086 in BackendStartup (port=0x172e060) at
postmaster.c:3983
#22 0x007657dc in ServerLoop () at postmaster.c:1706
#23 0x00764e16 in PostmasterMain (argc=3, argv=0x1707ce0) at
postmaster.c:1314
#24 0x006bcdb3 in main (argc=3, argv=0x1707ce0) at main.c:228

Root cause:

The reason why the backend waits forever in WaitLatch is that it expects a
USR1 signal to be delivered in order to wake it up from the wait (see the
WaitLatchOrSocket implementation), using latch_sigusr1_handler. Now, since
it is already inside a USR1 handler, USR1 is blocked by default. Thus, it
never receives a USR1 and never wakes up.

Reproduction: [9_5_STABLE, commit: 5a32fcd]

We must align the stars as follows:

1. Create a psql session and create a temp table. [temp table forces a Tx
during backend termination to drop the table => SyncRepWaitForLSN will be
called]

2. Since we don't have a fault injection framework, we have to rely on the
debugger: Attach to the following processes and install these breakpoints:

i) Attach to the Backend:
b procsignal_sigusr1_handler
b postgres.c:353
at: SocketBackend: ereport(DEBUG1,
(errcode(ERRCODE_CONNECTION_DOES_NOT_EXIST),
errmsg("unexpected EOF on client connection")));

ii) Attach to the Wal Sender, no need for a specific breakpoint [We need to
make the Wal Sender fall behind in order for the condition lsn <=
WalSndCtl->lsn[mode] to be false inside SyncRepWaitForLSN()]

3. kill -9 the psql process [This will send an EOF to the backend's
connection
to the front end, which will cause DoingCommandRead = true and
whereToSendOutput = DestNone (in SocketBackend()). This will later ensure
that
ProcessInterrupts() is called from within die: the backend's TERM handler]

4. Continue the backend process in the debugger; it should reach the
breakpoint we set inside SocketBackend.

5. Send a USR1 to the backend. kill -SIGUSR1. Continue the backend. We will
hit
the breakpoint b procsignal_sigusr1_handler.

6. Send a TERM or perform pg_terminate_backend() on the backend.

7. Detach from all of the processes.

8. The backend will be waiting for replication forever inside
SyncRepWaitForLSN(). We can confirm this by pausing the backend and checking
the back trace, and also by looking at pg_stat_activity. Subsequent attempts
to terminate the backend would be futile.


Working fix:
At the start of SyncRepWaitForLSN(), check if we are already in a USR1
handler. If yes, we return immediately. We use sigprocmask() and
sigismember() to perform the detection.
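
A sketch of the detection (not necessarily the exact patch):

    #include <signal.h>
    #include <stdbool.h>

    /*
     * Return true if SIGUSR1 is currently blocked, i.e. we are most likely
     * running inside the USR1 handler and a latch wakeup can never arrive.
     */
    static bool
    in_usr1_handler(void)
    {
        sigset_t cur;

        /* With a NULL set, sigprocmask() just reports the current mask. */
        if (sigprocmask(SIG_BLOCK, NULL, &cur) != 0)
            return false;       /* be conservative on error */
        return sigismember(&cur, SIGUSR1) == 1;
    }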

-
Ashwin Agrawal, Bhuvnesh Chaudhary, Jesse Zhang and Soumyadeep Chakraborty