Re: New committers: Melanie Plageman, Richard Guo

2024-04-26 Thread Dilip Kumar
On Fri, Apr 26, 2024 at 5:24 PM Jonathan S. Katz  wrote:
>
> The Core Team would like to extend our congratulations to Melanie
> Plageman and Richard Guo, who have accepted invitations to become our
> newest PostgreSQL committers.
>
> Please join us in wishing them much success and few reverts!

Congratulations to both of you.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Why don't we support external input/output functions for the composite types

2024-04-24 Thread Dilip Kumar
On Thu, Apr 25, 2024 at 10:14 AM Tom Lane  wrote:
>
> Dilip Kumar  writes:
> > I'm curious about composite types in PostgreSQL. By default, when we
> > create a composite type, it utilizes the "record_in" and "record_out"
> > functions for input/output. Do you think it would be beneficial to
> > expand the syntax to allow users to specify custom input/output
> > functions when creating composite types?
>
> No.
>
> > I believe it would be beneficial because users creating a new type
> > might prefer to define specific input/output syntax rather than
> > conforming to what is accepted by the RECORD type.
>

Thanks for the quick response, Tom.

> The primary outcome would be to require a huge amount of new work
> to be done by a lot of software, much of it not under our control.

Yeah, I agree with that.

> And the impact wouldn't only be to software that would prefer not
> to know about this.  For example, how likely do you think it is
> that these hypothetical user-defined I/O functions would cope
> well with ALTER TABLE/ALTER TYPE commands that change those
> rowtypes?

That's a good point. I was primarily focused on altering the
representation of input and output values, rather than considering
changes to internal storage. However, offering this feature could
indeed allow users to influence how values are stored, and that could
potentially affect ALTER TYPE, because then we would not have control
over how those values are stored internally.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Why don't we support external input/output functions for the composite types

2024-04-24 Thread Dilip Kumar
Hi,

I'm curious about composite types in PostgreSQL. By default, when we
create a composite type, it utilizes the "record_in" and "record_out"
functions for input/output. Do you think it would be beneficial to
expand the syntax to allow users to specify custom input/output
functions when creating composite types? Has anyone attempted this
before, and are there any design challenges associated with it? Or is
it simply not implemented because it's not seen as a valuable
addition?

I believe it would be beneficial because users creating a new type
might prefer to define specific input/output syntax rather than
conforming to what is accepted by the RECORD type.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-04-03 Thread Dilip Kumar
On Wed, Apr 3, 2024 at 7:40 PM Alvaro Herrera  wrote:
>
> Hello,
>
> On 2024-Apr-03, Alexander Lakhin wrote:
>
> > I've managed to trigger an assert added by 53c2a97a9.
> > Please try the following script against a server compiled with
> > -DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the
> > define, it just simplifies reproducing...):
>
> Ah yes, absolutely, we're missing to trade the correct SLRU bank lock
> there.  This rewrite of that small piece should fix it.  Thanks for
> reporting this.
>

Yeah, we missed acquiring the correct bank lock for the intervening
pages; thanks for reporting.  Your fix looks correct to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-14 Thread Dilip Kumar
On Thu, Mar 14, 2024 at 4:07 PM Heikki Linnakangas  wrote:
>
> > Yeah, that's a very valid point. So I think now Heikki/Melanie might
> > have got an answer to their question, about the thought process behind
> > serializing the snapshot for each scan node.  And the same thing is
> > followed for BitmapHeapNode as well.
>
> I see. Thanks, understanding the thought process helps.
>
> So when a parallel table or index scan runs in the executor as part of a
> query, we could just use the active snapshot. But there are some other
> callers of parallel table scans that don't use the executor, namely
> parallel index builds. For those it makes sense to pass the snapshot for
> the scan independent of the active snapshot.

Right.

> A parallel bitmap heap scan isn't really a parallel scan as far as the
> table AM is concerned, though. It's more like an independent bitmap heap
> scan in each worker process, nodeBitmapHeapscan.c does all the
> coordination of which blocks to scan. So I think that
> table_parallelscan_initialize() was the wrong role model, and we should
> still remove the snapshot serialization code from nodeBitmapHeapscan.c.

I think that seems right.

> Digging deeper into the question of whether es_snapshot ==
> GetActiveSnapshot() is a valid assumption:
>
> 
>
> es_snapshot is copied from the QueryDesc in standard_ExecutorStart().
> Looking at the callers of ExecutorStart(), they all get the QueryDesc by
> calling CreateQueryDesc() with GetActiveSnapshot(). And I don't see any
> callers changing the active snapshot between the ExecutorStart() and
> ExecutorRun() calls either. In pquery.c, we explicitly
> PushActiveSnapshot(queryDesc->snapshot) before calling ExecutorRun(). So
> no live bug here AFAICS, es_snapshot == GetActiveSnapshot() holds.
>
> _SPI_execute_plan() has code to deal with the possibility that the
> active snapshot is not set. That seems fishy; do we really support SPI
> without any snapshot? I'm inclined to turn that into an error. I ran the
> regression tests with an "Assert(ActiveSnapshotSet())" there, and
> everything worked.

IMHO, we can call SPI_connect() and SPI_execute() from any C
extension, so I don't think we can guarantee there that a snapshot
must be set, can we?

> If es_snapshot was different from the active snapshot, things would get
> weird, even without parallel query. The scans would use es_snapshot for
> the visibility checks, but any functions you execute in quals would use
> the active snapshot.
>
> We could double down on that assumption, and remove es_snapshot
> altogether and use GetActiveSnapshot() instead. And perhaps add
> "PushActiveSnapshot(queryDesc->snapshot)" to ExecutorRun().
>
> 
>
> In summary, this es_snapshot stuff is a bit confusing and could use some
> cleanup. But for now, I'd like to just add some assertions and a
> comments about this, and remove the snapshot serialization from bitmap
> heap scan node, to make it consistent with other non-parallel scan nodes
> (it's not really a parallel scan as far as the table AM is concerned).
> See attached patch, which is the same as previous patch with some extra
> assertions.

Maybe for now we can just handle this specific case and remove the
snapshot serialization for BitmapHeapScan, as you are doing in the
patch.  After looking into the code, your theory seems correct: we
copy the active snapshot while building the query descriptor and from
there copy it into the EState, so logically there should not be any
reason for the two to be different.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-13 Thread Dilip Kumar
On Wed, Mar 13, 2024 at 9:25 PM Robert Haas  wrote:
>
> On Wed, Mar 13, 2024 at 11:39 AM Dilip Kumar  wrote:
> > > Andres already commented on the snapshot stuff on an earlier patch
> > > version, and that's much nicer with this version. However, I don't
> > > understand why a parallel bitmap heap scan needs to do anything at all
> > > with the snapshot, even before these patches. The parallel worker
> > > infrastructure already passes the active snapshot from the leader to the
> > > parallel worker. Why does bitmap heap scan code need to do that too?
> >
> > Yeah thinking on this now it seems you are right that the parallel
> > infrastructure is already passing the active snapshot so why do we
> > need it again.  Then I checked other low scan nodes like indexscan and
> > seqscan and it seems we are doing the same things there as well.
> > Check for SerializeSnapshot() in table_parallelscan_initialize() and
> > index_parallelscan_initialize() which are being called from
> > ExecSeqScanInitializeDSM() and ExecIndexScanInitializeDSM()
> > respectively.
>
> I remember thinking about this when I was writing very early parallel
> query code. It seemed to me that there must be some reason why the
> EState has a snapshot, as opposed to just using the active snapshot,
> and so I took care to propagate that snapshot, which is used for the
> leader's scans, to the worker scans also. Now, if the EState doesn't
> need to contain a snapshot, then all of that mechanism is unnecessary,
> but I don't see how it can be right for the leader to do
> table_beginscan() using estate->es_snapshot and the worker to use the
> active snapshot.

Yeah, that's a very valid point. So I think Heikki/Melanie now have an
answer to their question about the thought process behind serializing
the snapshot for each scan node.  The same thing is done for the
BitmapHeap node as well.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-13 Thread Dilip Kumar
On Wed, Mar 13, 2024 at 7:04 PM Heikki Linnakangas  wrote:
>
> (Adding Dilip, the original author of the parallel bitmap heap scan
> patch all those years ago, in case you remember anything about the
> snapshot stuff below.)
>
> On 27/02/2024 16:22, Melanie Plageman wrote:

> Andres already commented on the snapshot stuff on an earlier patch
> version, and that's much nicer with this version. However, I don't
> understand why a parallel bitmap heap scan needs to do anything at all
> with the snapshot, even before these patches. The parallel worker
> infrastructure already passes the active snapshot from the leader to the
> parallel worker. Why does bitmap heap scan code need to do that too?

Yeah, thinking about this now, it seems you are right that the parallel
infrastructure is already passing the active snapshot, so why do we
need it again?  Then I checked other low-level scan nodes like index
scan and seq scan, and it seems we are doing the same thing there as
well.  See SerializeSnapshot() in table_parallelscan_initialize() and
index_parallelscan_initialize(), which are called from
ExecSeqScanInitializeDSM() and ExecIndexScanInitializeDSM()
respectively.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Streaming I/O, vectored I/O (WIP)

2024-03-12 Thread Dilip Kumar
On Tue, Mar 12, 2024 at 12:10 PM Thomas Munro  wrote:
>
> I think you'd be right if StartReadBuffers() were capable of
> processing a sequence consisting of a hit followed by misses, but
> currently it always gives up after the first hit.  That is, it always
> processes some number of misses (0-16) and then at most one hit.  So
> for now the variable would always turn out to be the same as blockNum.
>
Okay, but then shouldn't this "if (found)" block immediately break the
loop, so that when we hit a block we just return that block?  What you
explained makes sense, but with the current code, if the first few hits
are followed by misses, then we will issue smgrprefetch() for the
initial hit blocks as well.

+ if (found)
+ {
+ /*
+ * Terminate the read as soon as we get a hit.  It could be a
+ * single buffer hit, or it could be a hit that follows a readable
+ * range.  We don't want to create more than one readable range,
+ * so we stop here.
+ */
+ actual_nblocks = operation->nblocks = *nblocks = i + 1;
  (Dilip: I think we should break after this?)
+ }
+ else
+ {
+ /* Extend the readable range to cover this block. */
+ operation->io_buffers_len++;
+ }
+ }
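
Just to be concrete about the control flow I am suggesting, here is a
stand-alone toy model (the names only mirror the patch; this is not
the real StartReadBuffers() code):

/*
 * Toy model: count the leading misses that form one readable range and
 * stop as soon as the first hit is seen, so no later blocks get folded
 * into this range (and hence into the prefetch call).
 */
#include <stdbool.h>
#include <stdio.h>

static int
build_readable_range(const bool *hit, int nblocks, int *misses_to_read)
{
	*misses_to_read = 0;

	for (int i = 0; i < nblocks; i++)
	{
		if (hit[i])
			return i + 1;		/* terminate the read as soon as we get a
								 * hit, i.e. the "break" I am asking about */

		(*misses_to_read)++;	/* extend the readable range (a miss) */
	}

	return nblocks;
}

int
main(void)
{
	bool		pattern[] = {true, false, false, true, false};
	int			misses;
	int			consumed = build_readable_range(pattern, 5, &misses);

	/* prints consumed=1 misses=0: nothing needs to be prefetched here */
	printf("consumed=%d misses=%d\n", consumed, misses);
	return 0;
}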

> The reason is that I wanted to allows "full sized" read system calls
> to form.  If you said "hey please read these 16 blocks" (I'm calling
> that "full sized", AKA MAX_BUFFERS_PER_TRANSFER), and it found 2 hits,
> then it could only form a read of 14 blocks, but there might be more
> blocks that could be read after those.  We would have some arbitrary
> shorter read system calls, when we wanted to make them all as big as
> possible.  So in the current patch you say "hey please read these 16
> blocks" and it returns saying "only read 1", you call again with 15
> and it says "only read 1", and you call again and says "read 16!"
> (assuming 2 more were readable after the original range we started
> with).  Then physical reads are maximised.  Maybe there is some nice
> way to solve that, but I thought this way was the simplest (and if
> there is some instruction-cache-locality/tight-loop/perf reason why we
> should work harder to find ranges of hits, it could be for later).
> Does that make sense?

Understood, I think this makes sense.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Streaming I/O, vectored I/O (WIP)

2024-03-12 Thread Dilip Kumar
On Sat, Mar 9, 2024 at 3:55 AM Thomas Munro  wrote:
>
Hi Thomas,

I am planning to review this patch set, so I started going through
0001.  I have a question about how we are issuing smgrprefetch() in
StartReadBuffers():

+ if (operation->io_buffers_len > 0)
+ {
+ if (flags & READ_BUFFERS_ISSUE_ADVICE)
  {
- if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- MemSet((char *) bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
+ /*
+ * In theory we should only do this if PrepareReadBuffers() had to
+ * allocate new buffers above.  That way, if two calls to
+ * StartReadBuffers() were made for the same blocks before
+ * WaitReadBuffers(), only the first would issue the advice.
+ * That'd be a better simulation of true asynchronous I/O, which
+ * would only start the I/O once, but isn't done here for
+ * simplicity.  Note also that the following call might actually
+ * issue two advice calls if we cross a segment boundary; in a
+ * true asynchronous version we might choose to process only one
+ * real I/O at a time in that case.
+ */
+ smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
  }

This always issues smgrprefetch() starting with the input blockNum;
shouldn't we pass the first blockNum that we did not find in the
buffer pool?  So basically, in the loop above this call, where we are
doing PrepareReadBuffer(), we should track the first blockNum for
which found is not true and pass that blockNum to smgrprefetch() as
the first block, right?
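
To make the question concrete, here is a stand-alone sketch of the
arithmetic I have in mind (the names and the hit/miss pattern are
invented for illustration only; the real logic lives in bufmgr.c):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int BlockNumber;

/* return the first block in [blockNum, blockNum + n) that was a miss */
static BlockNumber
first_missed_block(BlockNumber blockNum, const bool *found, int n)
{
	for (int i = 0; i < n; i++)
	{
		if (!found[i])
			return blockNum + i;	/* first block we actually have to read */
	}
	return blockNum;			/* all hits: nothing to prefetch */
}

int
main(void)
{
	bool		found[] = {true, true, false, false};

	/*
	 * With blockNum = 100 and two leading hits, the suggestion is to start
	 * the prefetch at block 102 rather than at 100, which is already cached.
	 */
	printf("prefetch from block %u\n", first_missed_block(100, found, 4));
	return 0;
}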

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: PostgreSQL Contributors Updates

2024-03-04 Thread Dilip Kumar
On Sun, Mar 3, 2024 at 9:28 PM Joe Conway  wrote:
>
> All,
>
> The PostgreSQL Contributor Page
> (https://www.postgresql.org/community/contributors/) includes people who
> have made substantial, long-term contributions of time and effort to the
> PostgreSQL project. The PostgreSQL Contributors Team recognizes the
> following people for their contributions.
>
> New PostgreSQL Contributors:
>
> * Bertrand Drouvot
> * Gabriele Bartolini
> * Richard Guo
>
> New PostgreSQL Major Contributors:
>
> * Alexander Lakhin
> * Daniel Gustafsson
> * Dean Rasheed
> * John Naylor
> * Melanie Plageman
> * Nathan Bossart
>
> Thank you and congratulations to all!
>

Congratulations to all!

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: pgsql: Improve performance of subsystems on top of SLRU

2024-03-03 Thread Dilip Kumar
On Mon, Mar 4, 2024 at 1:56 AM Alvaro Herrera  wrote:
>
> On 2024-Feb-28, Alvaro Herrera wrote:
>
> > Improve performance of subsystems on top of SLRU
>
> Coverity had the following complaint about this commit:
>
> 
> *** CID NNN:  Control flow issues  (DEADCODE)
> /srv/coverity/git/pgsql-git/postgresql/src/backend/access/transam/multixact.c:
>  1375 in GetMultiXactIdMembers()
> 1369 * and acquire the lock of the new bank.
> 1370 */
> 1371lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
> 1372if (lock != prevlock)
> 1373{
> 1374if (prevlock != NULL)
> >>> CID 1592913:  Control flow issues  (DEADCODE)
> >>> Execution cannot reach this statement: "LWLockRelease(prevlock);".
> 1375LWLockRelease(prevlock);
> 1376LWLockAcquire(lock, LW_EXCLUSIVE);
> 1377prevlock = lock;
> 1378}
> 1379
> 1380slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, 
> multi);
>
> And I think it's correct that this is somewhat bogus, or at least
> confusing: the only way to have control back here on line 1371 after
> having executed once is via the "goto retry" line below; and there we
> release "prevlock" and set it to NULL beforehand, so it's impossible for
> prevlock to be NULL.  Looking closer I think this code is all confused,
> so I suggest to rework it as shown in the attached patch.
>
> I'll have a look at the other places where we use this "prevlock" coding
> pattern tomorrow.


+ /* Acquire the bank lock for the page we need. */
  lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
- if (lock != prevlock)
- {
- if (prevlock != NULL)
- LWLockRelease(prevlock);
- LWLockAcquire(lock, LW_EXCLUSIVE);
- prevlock = lock;
- }
+ LWLockAcquire(lock, LW_EXCLUSIVE);

This part is definitely an improvement.

I am not sure about the other changes.  They make the code much
simpler, but now we are not releasing the 'MultiXactOffsetCtl'-related
bank lock, and later, in the following loop, we compare that lock
against the 'MultiXactMemberCtl'-related bank lock.  The
simplification is that in the loop we know we are always holding a
lock, but I do not like comparing the bank locks of two different
SLRUs, although there is no actual problem since they can never have a
common lock address.  Anyway, I do not have any strong objection to
what you have done here.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-27 Thread Dilip Kumar
On Tue, Feb 27, 2024 at 11:41 PM Alvaro Herrera 
wrote:

> On 2024-Feb-27, Alvaro Herrera wrote:
>
> > Here's the complete set, with these two names using the singular.
>
> BTW one thing I had not noticed is that before this patch we have
> minimum shmem size that's lower than the lowest you can go with the new
> code.
>
> This means Postgres may no longer start when extremely tight memory
> restrictions (and of course use more memory even when idle or with small
> databases).  I wonder to what extent should we make an effort to relax
> that.  For small, largely inactive servers, this is just memory we use
> for no good reason.  However, anything we do here will impact
> performance on the high end, because as Andrey says this will add
> calculations and jumps where there are none today.
>
>
I was just comparing the minimum memory required for the SLRUs when
the system is minimally configured; correct me if I am wrong.

SLRU                        unpatched   patched
commit_timestamp_buffers         4         16
subtransaction_buffers          32         16
transaction_buffers              4         16
multixact_offset_buffers         8         16
multixact_member_buffers        16         16
notify_buffers                   8         16
serializable_buffers            16         16
------------------------------------------------
total buffers                   88        112

So that is 24 extra buffers, i.e. less than 200kB of extra memory
(24 x 8kB = 192kB with the default block size) on a minimally
configured system; IMHO this should not matter.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-27 Thread Dilip Kumar
On Mon, Feb 26, 2024 at 9:46 PM Alvaro Herrera 
wrote:

> On 2024-Feb-23, Dilip Kumar wrote:
>
> +  
> +   For each SLRU area that's part of the core server,
> +   there is a configuration parameter that controls its size, with the
> suffix
> +   _buffers appended.  For historical
> +   reasons, the names are not exact matches, but Xact
> +   corresponds to transaction_buffers and the rest
> should
> +   be obvious.
> +   
> +  
>
> I think I would like to suggest renaming the GUCs to have the _slru_ bit
> in the middle:
>
> +# - SLRU Buffers (change requires restart) -
> +
> +#commit_timestamp_slru_buffers = 0  # memory for pg_commit_ts (0
> = auto)
> +#multixact_offsets_slru_buffers = 16# memory for
> pg_multixact/offsets
> +#multixact_members_slru_buffers = 32# memory for
> pg_multixact/members
> +#notify_slru_buffers = 16   # memory for pg_notify
> +#serializable_slru_buffers = 32 # memory for pg_serial
> +#subtransaction_slru_buffers = 0# memory for pg_subtrans (0 =
> auto)
> +#transaction_slru_buffers = 0   # memory for pg_xact (0 =
> auto)
>
> and the pgstat_internal.h table:
>
> static const char *const slru_names[] = {
> "commit_timestamp",
> "multixact_members",
> "multixact_offsets",
> "notify",
> "serializable",
> "subtransaction",
> "transaction",
> "other" /* has to be last
> */
> };
>
> This way they match perfectly.
>

Yeah, I think this looks fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-23 Thread Dilip Kumar
On Fri, Feb 23, 2024 at 1:06 PM Dilip Kumar  wrote:
>
> On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera  
> wrote:
> >
> > On 2024-Feb-07, Dilip Kumar wrote:
> >
> > > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera  
> > > wrote:
> >
> > > > Sure, but is that really what we want?
> > >
> > > So your question is do we want these buffers to be in multiple of
> > > SLRU_BANK_SIZE?  Maybe we can have the last bank to be partial, I
> > > don't think it should create any problem logically.  I mean we can
> > > look again in the patch to see if we have made any such assumptions
> > > but that should be fairly easy to fix, then maybe if we are going in
> > > this way we should get rid of the check_slru_buffers() function as
> > > well.
> >
> > Not really, I just don't think the macro should be in slru.h.
>
> Okay
>
> > Another thing I've been thinking is that perhaps it would be useful to
> > make the banks smaller, when the total number of buffers is small.  For
> > example, if you have 16 or 32 buffers, it's not really clear to me that
> > it makes sense to have just 1 bank or 2 banks.  It might be more
> > sensible to have 4 banks with 4 or 8 buffers instead.  That should make
> > the algorithm scale down as well as up ...
>
> It might be helpful to have small-size banks when SLRU buffers are set
> to a very low value and we are only accessing a couple of pages at a
> time (i.e. no buffer replacement) because in such cases most of the
> contention will be on SLRU Bank lock. Although I am not sure how
> practical such a use case would be, I mean if someone is using
> multi-xact very heavily or creating frequent subtransaction overflow
> then wouldn't they should set this buffer limit to some big enough
> value?  By doing this we would lose some simplicity of the patch I
> mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would
> need to compute this and store it in SlruShared. Maybe that's not that
> bad.
>
> >
> > I haven't done either of those things in the attached v19 version.  I
> > did go over the comments once again and rewrote the parts I was unhappy
> > with, including some existing ones.  I think it's OK now from that point
> > of view ... at some point I thought about creating a separate README,
> > but in the end I thought it not necessary.
>
> Thanks, I will review those changes.

A few other things I noticed while reading through the patch.  I
haven't read it completely yet, but this is what I have so far.

1.
+ * If no process is already in the list, we're the leader; our first step
+ * is to "close out the group" by resetting the list pointer from
+ * ProcGlobal->clogGroupFirst (this lets other processes set up other
+ * groups later); then we lock the SLRU bank corresponding to our group's
+ * page, do the SLRU updates, release the SLRU bank lock, and wake up the
+ * sleeping processes.

I think here we are saying that we "close out the group" before
acquiring the SLRU lock, but that's not true.  We keep the group open
until we get the lock, so that we can gather the maximum number of
members while we are waiting for the lock anyway.

2.
 static void
 TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
  RepOriginId nodeid, int slotno)
 {
- Assert(TransactionIdIsNormal(xid));
+ if (!TransactionIdIsNormal(xid))
+ return;
+
+ entryno = TransactionIdToCTsEntry(xid);

I do not understand why we need this change.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-22 Thread Dilip Kumar
On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera  wrote:
>
> On 2024-Feb-07, Dilip Kumar wrote:
>
> > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera  
> > wrote:
>
> > > Sure, but is that really what we want?
> >
> > So your question is do we want these buffers to be in multiple of
> > SLRU_BANK_SIZE?  Maybe we can have the last bank to be partial, I
> > don't think it should create any problem logically.  I mean we can
> > look again in the patch to see if we have made any such assumptions
> > but that should be fairly easy to fix, then maybe if we are going in
> > this way we should get rid of the check_slru_buffers() function as
> > well.
>
> Not really, I just don't think the macro should be in slru.h.

Okay

> Another thing I've been thinking is that perhaps it would be useful to
> make the banks smaller, when the total number of buffers is small.  For
> example, if you have 16 or 32 buffers, it's not really clear to me that
> it makes sense to have just 1 bank or 2 banks.  It might be more
> sensible to have 4 banks with 4 or 8 buffers instead.  That should make
> the algorithm scale down as well as up ...

It might be helpful to have small-sized banks when the SLRU buffers
are set to a very low value and we are only accessing a couple of
pages at a time (i.e. no buffer replacement), because in such cases
most of the contention will be on the SLRU bank lock.  Although I am
not sure how practical such a use case would be; I mean, if someone is
using multixact very heavily or creating frequent subtransaction
overflow, wouldn't they set this buffer limit to some big enough value
anyway?  Doing this would also cost some of the patch's simplicity:
instead of using the simple SLRU_BANK_SIZE macro we would need to
compute the bank size and store it in SlruShared.  Maybe that's not
that bad.

>
> I haven't done either of those things in the attached v19 version.  I
> did go over the comments once again and rewrote the parts I was unhappy
> with, including some existing ones.  I think it's OK now from that point
> of view ... at some point I thought about creating a separate README,
> but in the end I thought it not necessary.

Thanks, I will review those changes.

> I did add a bunch of Assert()s to make sure the locks that are supposed
> to be held are actually held.  This led me to testing the page status to
> be not EMPTY during SimpleLruWriteAll() before calling
> SlruInternalWritePage(), because the assert was firing.  The previous
> code is not really *buggy*, but to me it's weird to call WritePage() on
> a slot with no contents.

Okay.  I mean, internally SlruInternalWritePage() will flush only if
the status is SLRU_PAGE_VALID, but it is better the way you have done
it.

> Another change was in TransactionGroupUpdateXidStatus: the original code
> had the leader doing pg_atomic_read_u32(&ProcGlobal->clogGroupFirst) to
> know which bank to lock.  I changed it to simply be the page used by the
> leader process; this doesn't need an atomic read, and should be the same
> page anyway.  (If it isn't, it's no big deal).  But what's more: even if
> we do read ->clogGroupFirst at that point, there's no guarantee that
> this is going to be exactly for the same process that ends up being the
> first in the list, because since we have not set it to INVALID by the
> time we grab the bank lock, it is quite possible for more processes to
> add themselves to the list.

Yeah, this looks better.

> I realized all this while rewriting the comments in a way that would let
> me understand what was going on ... so IMO the effort was worthwhile.

+1

I will review and do some more testing early next week and share my feedback.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-02-21 Thread Dilip Kumar
On Wed, Feb 21, 2024 at 2:52 PM Robert Haas  wrote:
>
> On Wed, Feb 21, 2024 at 2:43 PM Dilip Kumar  wrote:
> > So the problem is that we might consider the transaction change as
> > non-transaction and mark this flag as true.
>
> But it's not "might" right? It's absolutely 100% certain that we will
> consider that transaction's changes as non-transactional ... because
> when we're in fast-forward mode, the table of new relfilenodes is not
> built, and so whenever we check whether any transaction made a new
> relfilenode for this sequence, the answer will be no.
>
> > But what would have
> > happened if we would have identified it correctly as transactional?
> > In such cases, we wouldn't have set this flag here but then we would
> > have set this while processing the DecodeAbort/DecodeCommit, so the
> > net effect would be the same no?  You may question what if the
> > Abort/Commit WAL never appears in the WAL, but this flag is
> > specifically for the upgrade case, and in that case we have to do a
> > clean shutdown so may not be an issue.  But in the future, if we try
> > to use 'ctx->processing_required' for something else where the clean
> > shutdown is not guaranteed then this flag can be set incorrectly.
> >
> > I am not arguing that this is a perfect design but I am just making a
> > point about why it would work.
>
> Even if this argument is correct (and I don't know if it is), the code
> and comments need some updating. We should not be testing a flag that
> is guaranteed false with comments that make it sound like the value of
> the flag is trustworthy when it isn't.

+1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-02-21 Thread Dilip Kumar
On Wed, Feb 21, 2024 at 1:24 PM Robert Haas  wrote:
>
> On Wed, Feb 21, 2024 at 1:06 PM Dilip Kumar  wrote:
> > > But I am wondering why this flag is always set to true in
> > > DecodeTXNNeedSkip() irrespective of the commit or abort. Because the
> > > aborted transactions are not supposed to be replayed?  So if my
> > > observation is correct that for the aborted transaction, this
> > > shouldn't be set to true then we have a problem with sequence where we
> > > are identifying the transactional changes as non-transaction changes
> > > because now for transactional changes this should depend upon commit
> > > status.
> >
> > I have checked this case with Amit Kapila.  So it seems in the cases
> > where we have sent the prepared transaction or streamed in-progress
> > transaction we would need to send the abort also, and for that reason,
> > we are setting 'ctx->processing_required' as true so that if these
> > WALs are not streamed we do not allow upgrade of such slots.
>
> I don't find this explanation clear enough for me to understand.


Explanation of why we set 'ctx->processing_required' to true from
DecodeCommit as well as DecodeAbort:
--
For upgrading logical replication slots, it's essential to ensure
these slots are completely synchronized with the subscriber.  To
identify that, we process all the pending WAL in 'fast_forward' mode
to find out whether there is any decodable WAL or not.  In short, any
WAL type that we would stream downstream in normal (non-fast_forward)
mode is considered decodable, and that includes the abort record.
That is why at the end of a transaction, commit or abort, we need to
set 'ctx->processing_required' to true, i.e. some decodable WAL
exists, so we cannot upgrade this slot.

Why is the below check safe?
> + if (ctx->fast_forward)
> + {
> + /*
> + * We need to set processing_required flag to notify the sequence
> + * change existence to the caller. Usually, the flag is set when
> + * either the COMMIT or ABORT records are decoded, but this must be
> + * turned on here because the non-transactional logical message is
> + * decoded without waiting for these records.
> + */
> + if (!transactional)
> + ctx->processing_required = true;
> +
> + return;
> + }

So the problem is that we might consider a transactional change as
non-transactional and mark this flag as true.  But what would have
happened if we had identified it correctly as transactional?  In that
case we wouldn't have set the flag here, but we would have set it
while processing DecodeAbort/DecodeCommit, so the net effect would be
the same, no?  You may ask what happens if the abort/commit record
never appears in the WAL, but this flag is specifically for the
upgrade case, and in that case we have to do a clean shutdown, so it
may not be an issue.  However, if in the future we try to use
'ctx->processing_required' for something else where a clean shutdown
is not guaranteed, then this flag could be set incorrectly.

I am not arguing that this is a perfect design; I am just making a
point about why it works.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-02-20 Thread Dilip Kumar
On Tue, Feb 20, 2024 at 4:32 PM Dilip Kumar  wrote:
>
> On Tue, Feb 20, 2024 at 3:38 PM Robert Haas  wrote:
>
> > Let's say fast_forward is true. Then smgr_decode() is going to skip
> > recording anything about the relfilenode, so we'll identify all
> > sequence changes as non-transactional. But look at how this case is
> > handled in seq_decode():
> >
> > + if (ctx->fast_forward)
> > + {
> > + /*
> > + * We need to set processing_required flag to notify the sequence
> > + * change existence to the caller. Usually, the flag is set when
> > + * either the COMMIT or ABORT records are decoded, but this must be
> > + * turned on here because the non-transactional logical message is
> > + * decoded without waiting for these records.
> > + */
> > + if (!transactional)
> > + ctx->processing_required = true;
> > +
> > + return;
> > + }
>
> It appears that the 'processing_required' flag was introduced as part
> of supporting upgrades for logical replication slots. Its purpose is
> to determine whether a slot is fully caught up, meaning that there are
> no pending decodable changes left before it can be upgraded.
>
> So now if some change was transactional but we have identified it as
> non-transaction then we will mark this flag  'ctx->processing_required
> = true;' so we temporarily set this flag incorrectly, but even if the
> flag would have been correctly identified initially, it would have
> been set again to true in the DecodeTXNNeedSkip() function regardless
> of whether the transaction is committed or aborted. As a result, the
> flag would eventually be set to 'true', and the behavior would align
> with the intended logic.
>
> But I am wondering why this flag is always set to true in
> DecodeTXNNeedSkip() irrespective of the commit or abort. Because the
> aborted transactions are not supposed to be replayed?  So if my
> observation is correct that for the aborted transaction, this
> shouldn't be set to true then we have a problem with sequence where we
> are identifying the transactional changes as non-transaction changes
> because now for transactional changes this should depend upon commit
> status.

I have checked this case with Amit Kapila.  It seems that in the cases
where we have sent a prepared transaction or streamed an in-progress
transaction we would also need to send the abort, and for that reason
we set 'ctx->processing_required' to true, so that if these WAL
records have not been streamed we do not allow the upgrade of such
slots.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-02-20 Thread Dilip Kumar
On Tue, Feb 20, 2024 at 3:38 PM Robert Haas  wrote:

> Let's say fast_forward is true. Then smgr_decode() is going to skip
> recording anything about the relfilenode, so we'll identify all
> sequence changes as non-transactional. But look at how this case is
> handled in seq_decode():
>
> + if (ctx->fast_forward)
> + {
> + /*
> + * We need to set processing_required flag to notify the sequence
> + * change existence to the caller. Usually, the flag is set when
> + * either the COMMIT or ABORT records are decoded, but this must be
> + * turned on here because the non-transactional logical message is
> + * decoded without waiting for these records.
> + */
> + if (!transactional)
> + ctx->processing_required = true;
> +
> + return;
> + }

It appears that the 'processing_required' flag was introduced as part
of supporting upgrades for logical replication slots. Its purpose is
to determine whether a slot is fully caught up, meaning that there are
no pending decodable changes left before it can be upgraded.

So now, if some change was transactional but we have identified it as
non-transactional, we will mark this flag with
'ctx->processing_required = true;'.  We therefore set the flag
incorrectly for a moment, but even if the change had been correctly
identified initially, the flag would have been set to true anyway in
the DecodeTXNNeedSkip() function, regardless of whether the
transaction is committed or aborted.  As a result, the flag eventually
ends up 'true', and the behavior aligns with the intended logic.

But I am wondering why this flag is always set to true in
DecodeTXNNeedSkip() irrespective of commit or abort, given that
aborted transactions are not supposed to be replayed.  If my
observation is correct that this shouldn't be set to true for an
aborted transaction, then we have a problem with sequences, where we
are identifying transactional changes as non-transactional changes,
because for transactional changes this should depend on the commit
status.

On another note, can there be a situation where we have wrongly
identified a change as non-transactional and set this flag, and the
commit/abort record never appeared in the WAL and so was never
decoded?  That could also lead to an incorrect decision during the
upgrade.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-02-20 Thread Dilip Kumar
On Tue, Feb 20, 2024 at 10:30 AM Robert Haas  wrote:
>
> Is the rule that changes are transactional if and only if the current
> transaction has assigned a new relfilenode to the sequence?

Yes, that's the rule.

> Why does the logic get confused if the state of the snapshot changes?

The rule doesn't change, but the way this identification is
implemented in the decoding gets confused and treats a transactional
change as non-transactional.  The identification of whether a sequence
change is transactional is based on what WAL we have decoded from that
particular transaction, and whether we decode a particular WAL record
depends on the snapshot state (it's about what we decode, not
necessarily what we send).  So if the snapshot state changed
mid-transaction, we may not have decoded the WAL record that created
the new relfilenode, yet we will decode the WAL record that operates
on the sequence.  In that case we assume the change is
non-transactional, whereas it was actually transactional, because we
did not decode some of the transaction's changes that we rely on for
identifying whether it is transactional or not.
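
To make that sequence of events concrete, here is a stand-alone toy
model (all names are invented for illustration; the real code in the
sequence-decoding patch is of course structured differently):

#include <stdbool.h>
#include <stdio.h>

/* stands in for the per-transaction relfilenode hash built while decoding */
static bool relfilenode_created_by_txn = false;

static void
decode_smgr_create(bool snapshot_lets_us_decode)
{
	/*
	 * If the snapshot state makes us skip this record, we never learn
	 * that the transaction created a new relfilenode.
	 */
	if (snapshot_lets_us_decode)
		relfilenode_created_by_txn = true;
}

static void
decode_sequence_change(void)
{
	/* the rule: transactional iff this txn assigned a new relfilenode */
	printf("classified as %s\n",
		   relfilenode_created_by_txn ? "transactional" : "non-transactional");
}

int
main(void)
{
	decode_smgr_create(false);	/* snapshot changed mid-txn: record skipped */
	decode_sequence_change();	/* => wrongly reported as non-transactional */
	return 0;
}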


> My naive reaction is that it kinda sounds like you're relying on two
> different mistakes cancelling each other out, and that might be a bad
> idea, because maybe there's some situation where they don't. But I
> don't understand the issue well enough to have an educated opinion at
> this point.

I would say the first one is a mistake in identifying a transactional
change as non-transactional during decoding, and that mistake happens
only when we decode the transaction partially.  But we never stream a
partially decoded transaction downstream, which means that even though
we have made a mistake in decoding it, we are not streaming it, so the
mistake does not turn into a real problem.  But again, I agree there
is a temporarily wrong decision, and if we try to do something else
based on that decision, it could be an issue.

You might be interested in [1], where I first reported this problem in
more detail, and [2], where we concluded why it does not create a real
problem.

[1] 
https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2024-02-07 Thread Dilip Kumar
> > So now if we have such a functionality then it would be even better to
> > extend it to selectively sync the slot.  For example, if there is some
> > issue in syncing all slots, maybe some bug or taking a long time to
> > sync because there are a lot of slots but if the user needs to quickly
> > failover and he/she is interested in only a couple of slots then such
> > a option could be helpful. no?
> >
>
> I see your point but not sure how useful it is in the field. I am fine
> if others also think such a parameter will be useful and anyway I
> think we can even extend it after v1 is done.
>

Okay, I am fine with that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-07 Thread Dilip Kumar
On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera  wrote:
>
> On 2024-Feb-07, Dilip Kumar wrote:
>
> > On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera  
> > wrote:
> > >
> > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers
> > > compute a number that's multiple of SLRU_BANK_SIZE.  But it's a crock,
> > > because we don't have that macro at that point, so I just used constant
> > > 16.  Obviously need a better solution for this.
> >
> > If we define SLRU_BANK_SIZE in slru.h then we can use it here right,
> > because these files are anyway include slru.h so.
>
> Sure, but is that really what we want?

So your question is: do we want these buffer counts to be a multiple
of SLRU_BANK_SIZE?  Maybe we can have the last bank be partial; I
don't think it should create any problem logically.  I mean, we can
look at the patch again to see whether we have made any such
assumptions, but that should be fairly easy to fix.  And if we go that
way, maybe we should get rid of the check_slru_buffers() function as
well.
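
Just to illustrate why a partial last bank looks logically harmless,
here is a stand-alone sketch (it assumes only the bank size of 16 used
by the patch and the plain slot-number division; nothing here is the
actual patch code):

#include <stdio.h>

#define SLRU_BANK_SIZE 16

int
main(void)
{
	int			nbuffers = 40;	/* deliberately not a multiple of 16 */
	int			nbanks = (nbuffers + SLRU_BANK_SIZE - 1) / SLRU_BANK_SIZE;

	printf("%d buffers -> %d banks (last bank holds %d slots)\n",
		   nbuffers, nbanks, nbuffers - (nbanks - 1) * SLRU_BANK_SIZE);

	/* every slot still maps to exactly one bank */
	for (int slotno = 0; slotno < nbuffers; slotno += 8)
		printf("slot %2d -> bank %d\n", slotno, slotno / SLRU_BANK_SIZE);
	return 0;
}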

> > > I've been wondering whether we should add a "slru" to the name of the
> > > GUCs:
> > >
> > > commit_timestamp_slru_buffers
> > > transaction_slru_buffers
> > > etc
> >
> > I am not sure we are exposing anything related to SLRU to the user,
>
> We do -- we have pg_stat_slru already.
>
> > I mean transaction_buffers should make sense for the user that it
> > stores transaction-related data in some buffers pool but whether that
> > buffer pool is called SLRU or not doesn't matter much to the user
> > IMHO.
>
> Yeah, that's exactly what my initial argument was for naming these this
> way.  But since the term slru already escaped into the wild via the
> pg_stat_slru view, perhaps it helps users make the connection between
> these things.  Alternatively, we can cross-reference each term from the
> other's documentation and call it a day.

Yeah, that's true, I forgot this point about pg_stat_slru.  From this
point of view, if the configuration parameters have "slru" in the
name, users will be able to make a better connection between the
configured value and the stats in this view, and based on that they
can decide whether they need to increase the size of these SLRU
buffers.

> Another painful point is that pg_stat_slru uses internal names in the
> data it outputs, which obviously do not match the new GUCs.

Yeah, that's true, but I think this could be explained somewhere; I'm
not sure what the right place for it is.


FYI, I have also repeated all the performance tests I performed in my
first email[1], and I am seeing a similar gain.

Some comments on v18 in my first pass of the review.

1.
@@ -665,7 +765,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
  lsnindex = GetLSNIndex(slotno, xid);
  *lsn = XactCtl->shared->group_lsn[lsnindex];

- LWLockRelease(XactSLRULock);
+ LWLockRelease(SimpleLruGetBankLock(XactCtl, pageno));

Maybe here we can add an assert before releasing the lock, as a safety check:

Assert(LWLockHeldByMe(SimpleLruGetBankLock(XactCtl, pageno)));

2.
+ *
+ * XXX could we make the LSNs to be bank-based?
  */
  XLogRecPtr *group_lsn;

IMHO, the flush still happens at the page level, so the LSN up to
which WAL must be flushed before flushing a particular clog page
should also be maintained at the page level.

[1] 
https://www.postgresql.org/message-id/CAFiTN-vzDvNz%3DExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-06 Thread Dilip Kumar
On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera  wrote:
>
> Here's the rest of it rebased on top of current master.  I think it
> makes sense to have this as one individual commit.
>
> I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers
> compute a number that's multiple of SLRU_BANK_SIZE.  But it's a crock,
> because we don't have that macro at that point, so I just used constant
> 16.  Obviously need a better solution for this.

If we define SLRU_BANK_SIZE in slru.h then we can use it here, right?
These files already include slru.h anyway.

>
> I also changed the location of bank_mask in SlruCtlData for better
> packing, as advised by pahole; and renamed SLRU_SLOTNO_GET_BANKLOCKNO()
> to SlotGetBankNumber().

Okay.

> Some very critical comments still need to be updated to the new design,
> particularly anything that mentions "control lock"; but also the overall
> model needs to be explained in some central location, rather than
> incongruently some pieces here and other pieces there.  I'll see about
> this later.  But at least this is code you should be able to play with.

Okay, I will review and test this.

> I've been wondering whether we should add a "slru" to the name of the
> GUCs:
>
> commit_timestamp_slru_buffers
> transaction_slru_buffers
> etc

I am not sure we are exposing anything related to SLRU to the user.
I mean, transaction_buffers should make sense to the user, in that it
stores transaction-related data in some buffer pool, but whether that
buffer pool is called SLRU or not doesn't matter much to the user,
IMHO.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-06 Thread Dilip Kumar
On Tue, Feb 6, 2024 at 4:23 PM Alvaro Herrera  wrote:
>
> > > (We also have SimpleLruTruncate, but I think it's not as critical to
> > > have a barrier there anyhow: accessing a slightly outdated page number
> > > could only be a problem if a bug elsewhere causes us to try to truncate
> > > in the current page.  I think we only have this code there because we
> > > did have such bugs in the past, but IIUC this shouldn't happen anymore.)
> >
> > +1, I agree with this theory in general.  But the below comment in
> > SimpleLruTrucate in your v3 patch doesn't seem correct, because here
> > we are checking if the latest_page_number is smaller than the cutoff
> > if so we log it as wraparound and skip the whole thing and that is
> > fine even if we are reading with atomic variable and slightly outdated
> > value should not be a problem but the comment claim that this safe
> > because we have the same bank lock as SimpleLruZeroPage(), but that's
> > not true here we will be acquiring different bank locks one by one
> > based on which slotno we are checking.  Am I missing something?
>
> I think you're correct.  I reworded this comment, so now it says this:
>
> /*
>  * An important safety check: the current endpoint page must not be
>  * eligible for removal.  This check is just a backstop against wraparound
>  * bugs elsewhere in SLRU handling, so we don't care if we read a slightly
>  * outdated value; therefore we don't add a memory barrier.
>  */
>
> Pushed with those changes.  Thank you!

Yeah, this looks perfect, thanks.

> Now I'll go rebase the rest of the patch on top.

Okay, I will review and test after that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2024-02-06 Thread Dilip Kumar
On Tue, Feb 6, 2024 at 3:41 PM Amit Kapila  wrote:
>
> On Tue, Feb 6, 2024 at 3:23 PM Dilip Kumar  wrote:
> >
> > On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Mon, Feb 5, 2024 at 7:56 PM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > ---
> > > > > Since Two processes (e.g. the slotsync worker and
> > > > > pg_sync_replication_slots()) concurrently fetch and update the slot
> > > > > information, there is a race condition where slot's
> > > > > confirmed_flush_lsn goes backward.
> > > > >
> > > >
> > > > Right, this is possible, though there shouldn't be a problem because
> > > > anyway, slotsync is an async process. Till we hold restart_lsn, the
> > > > required WAL won't be removed. Having said that, I can think of two
> > > > ways to avoid it: (a) We can have some flag in shared memory using
> > > > which we can detect whether any other process is doing slot
> > > > syncronization and then either error out at that time or simply wait
> > > > or may take nowait kind of parameter from user to decide what to do?
> > > > If this is feasible, we can simply error out for the first version and
> > > > extend it later if we see any use cases for the same (b) similar to
> > > > restart_lsn, if confirmed_flush_lsn is getting moved back, raise an
> > > > error, this is good for now but in future we may still have another
> > > > similar issue, so I would prefer (a) among these but I am fine if you
> > > > prefer (b) or have some other ideas like just note down in comments
> > > > that this is a harmless case and can happen only very rarely.
> > >
> > > Thank you for sharing the ideas. I would prefer (a). For (b), the same
> > > issue still happens for other fields.
> >
> > I agree that (a) looks better.  On a separate note, while looking at
> > this API pg_sync_replication_slots(PG_FUNCTION_ARGS) shouldn't there
> > be an optional parameter to give one slot or multiple slots or all
> > slots as default, that will give better control to the user no?
> >
>
> As of now, we want to give functionality similar to slotsync worker
> with a difference that users can use this new function for planned
> switchovers. So, syncing all failover slots by default. I think if
> there is a use case to selectively sync some of the failover slots
> then we can probably extend this function and slotsync worker as well.
> Normally, if the primary goes down due to whatever reason users would
> want to restart the replication for all the defined publications via
> existing failover slots. Why would anyone want to do it partially?

If we consider the usability of such a function (I mean as it is
implemented now, without any argument), one use case could be that the
slot sync worker is not keeping up, or at some point in time the user
doesn't want to wait for the worker to do this; instead, the user can
do it themselves.

So if we already have such functionality, then it would be even better
to extend it to selectively sync slots.  For example, if there is some
issue in syncing all slots, maybe a bug, or it takes a long time
because there are a lot of slots, but the user needs to fail over
quickly and is interested in only a couple of slots, then such an
option could be helpful, no?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Question on LWLockMode in dsa.c

2024-02-06 Thread Dilip Kumar
On Tue, Jan 30, 2024 at 6:24 AM Masahiko Sawada  wrote:
>
> Hi,
>
> While working on radix tree patch[1], John Naylor found that dsa.c
> doesn't already use shared locks even in dsa_dump(). dsa_dump() seems
> a pure read-only function so I thought we could use a shared lock mode
> there. Is there any reason to use exclusive mode even in dsa_dump()?
>
> Ultimately, since we're trying to add a new function
> dsa_get_total_size() that just returns
> dsa_area_control.total_segment_size and therefore would also be a
> read-only function, I'd like to find out the correct lock mode there.
>

Doesn't seem like there is any reason for this to be an exclusive lock.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2024-02-06 Thread Dilip Kumar
On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada  wrote:
>
> On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila  wrote:
> >
> > On Mon, Feb 5, 2024 at 7:56 PM Masahiko Sawada  
> > wrote:
> > >
> > > ---
> > > Since Two processes (e.g. the slotsync worker and
> > > pg_sync_replication_slots()) concurrently fetch and update the slot
> > > information, there is a race condition where slot's
> > > confirmed_flush_lsn goes backward.
> > >
> >
> > Right, this is possible, though there shouldn't be a problem because
> > anyway, slotsync is an async process. Till we hold restart_lsn, the
> > required WAL won't be removed. Having said that, I can think of two
> > ways to avoid it: (a) We can have some flag in shared memory using
> > which we can detect whether any other process is doing slot
> > syncronization and then either error out at that time or simply wait
> > or may take nowait kind of parameter from user to decide what to do?
> > If this is feasible, we can simply error out for the first version and
> > extend it later if we see any use cases for the same (b) similar to
> > restart_lsn, if confirmed_flush_lsn is getting moved back, raise an
> > error, this is good for now but in future we may still have another
> > similar issue, so I would prefer (a) among these but I am fine if you
> > prefer (b) or have some other ideas like just note down in comments
> > that this is a harmless case and can happen only very rarely.
>
> Thank you for sharing the ideas. I would prefer (a). For (b), the same
> issue still happens for other fields.

I agree that (a) looks better.  On a separate note, while looking at
this API, pg_sync_replication_slots(PG_FUNCTION_ARGS), shouldn't there
be an optional parameter to specify one slot, multiple slots, or all
slots as the default?  That would give better control to the user, no?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-02-06 Thread Dilip Kumar
On Thu, Jan 11, 2024 at 10:48 AM Bharath Rupireddy
 wrote:
>
> Hi,
>
> Replication slots in postgres will prevent removal of required
> resources when there is no connection using them (inactive). This
> consumes storage because neither required WAL nor required rows from
> the user tables/system catalogs can be removed by VACUUM as long as
> they are required by a replication slot. In extreme cases this could
> cause the transaction ID wraparound.
>
> Currently postgres has the ability to invalidate inactive replication
> slots based on the amount of WAL (set via max_slot_wal_keep_size GUC)
> that will be needed for the slots in case they become active. However,
> the wraparound issue isn't effectively covered by
> max_slot_wal_keep_size - one can't tell postgres to invalidate a
> replication slot if it is blocking VACUUM. Also, it is often tricky to
> choose a default value for max_slot_wal_keep_size, because the amount
> of WAL that gets generated and allocated storage for the database can
> vary.
>
> Therefore, it is often easy for developers to do the following:
> a) set an XID age (age of slot's xmin or catalog_xmin) of say 1 or 1.5
> billion, after which the slots get invalidated.
> b) set a timeout of say 1 or 2 or 3 days, after which the inactive
> slots get invalidated.
>
> To implement (a), postgres needs a new GUC called max_slot_xid_age.
> The checkpointer then invalidates all the slots whose xmin (the oldest
> transaction that this slot needs the database to retain) or
> catalog_xmin (the oldest transaction affecting the system catalogs
> that this slot needs the database to retain) has reached the age
> specified by this setting.
>
> To implement (b), first postgres needs to track the replication slot
> metrics like the time at which the slot became inactive (inactive_at
> timestamptz) and the total number of times the slot became inactive in
> its lifetime (inactive_count numeric) in ReplicationSlotPersistentData
> structure. And, then it needs a new timeout GUC called
> inactive_replication_slot_timeout. Whenever a slot becomes inactive,
> the current timestamp and inactive count are stored in
> ReplicationSlotPersistentData structure and persisted to disk. The
> checkpointer then invalidates all the slots that are lying inactive
> for about inactive_replication_slot_timeout duration starting from
> inactive_at.
>
> In addition to implementing (b), these two new metrics enable
> developers to improve their monitoring tools as the metrics are
> exposed via pg_replication_slots system view. For instance, one can
> build a monitoring tool that signals when replication slots are lying
> inactive for a day or so using inactive_at metric, and/or when a
> replication slot is becoming inactive too frequently using inactive_at
> metric.
>
> I’m attaching the v1 patch set as described below:
> 0001 - Tracks invalidation_reason in pg_replication_slots. This is
> needed because slots now have multiple reasons for slot invalidation.
> 0002 - Tracks inactive replication slot information inactive_at and
> inactive_timeout.
> 0003 - Adds inactive_timeout based replication slot invalidation.
> 0004 - Adds XID based replication slot invalidation.
>
> Thoughts?
>
+1 for the idea.  Here are some comments on 0002; I will review the
other patches soon and respond.

1.
+  
+   inactive_at timestamptz
+  
+  
+The time at which the slot became inactive.
+NULL if the slot is currently actively being
+used.
+  
+ 

Maybe we can change the field name to 'last_inactive_at'?  Or maybe
the comment can explain that this is the timestamp at which the slot
was last made inactive.  Since we are already maintaining
inactive_count, it is better to say explicitly that this is the time
of the last inactivation.

2.
+ /*
+ * XXX: Can inactive_count of type uint64 ever overflow? It takes
+ * about a half-billion years for inactive_count to overflow even
+ * if slot becomes inactive for every 1 millisecond. So, using
+ * pg_add_u64_overflow might be an overkill.
+ */

Correct, we don't need to use pg_add_u64_overflow for this counter.

3.

+
+ /* Convert to numeric. */
+ snprintf(buf, sizeof buf, UINT64_FORMAT, slot_contents.data.inactive_count);
+ values[i++] = DirectFunctionCall3(numeric_in,
+   CStringGetDatum(buf),
+   ObjectIdGetDatum(0),
+   Int32GetDatum(-1));

What is the purpose of doing this?  inactive_count is an 8-byte
integer, and you can declare the function's out parameter as 'int8',
which is also an 8-byte integer.  Then you don't need to convert the
integer to a string and then to numeric, as sketched below.
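
A minimal sketch of what I had in mind, assuming the view column is
declared as int8 (bigint) in the catalog definition (that declaration
is an assumption on my part, not something in the posted patch):

+ /*
+  * inactive_count is a uint64 in ReplicationSlotPersistentData; with an
+  * int8 output column it can be returned directly, without the
+  * string/numeric round-trip.
+  */
+ values[i++] = Int64GetDatum((int64) slot_contents.data.inactive_count);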


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-04 Thread Dilip Kumar
On Sun, Feb 4, 2024 at 7:10 PM Alvaro Herrera  wrote:
>
> On 2024-Feb-02, Dilip Kumar wrote:
>
> > I have checked the patch and it looks fine to me other than the above
> > question related to memory barrier usage one more question about the
> > same, basically below to instances 1 and 2 look similar but in 1 you
> > are not using the memory write_barrier whereas in 2 you are using the
> > write_barrier, why is it so?  I mean why the reordering can not happen
> > in 1 and it may happen in 2?
>
> What I was thinking is that there's a lwlock operation just below, which
> acts as a barrier.  But I realized something more important: there are
> only two places that matter, which are SlruSelectLRUPage and
> SimpleLruZeroPage.  The others are all initialization code that run at a
> point where there's no going to be any concurrency in SLRU access, so we
> don't need barriers anyway.  In SlruSelectLRUPage we definitely don't
> want to evict the page that SimpleLruZeroPage has initialized, starting
> from the point where it returns that new page to its caller.
> But if you consider the code of those two routines, you realize that the
> only time an equality between latest_page_number and "this_page_number"
> is going to occur, is when both pages are in the same bank ... and both
> routines are required to be holding the bank lock while they run, so in
> practice this is never a problem.

Right.  In fact, when I first converted 'latest_page_number' to an
atomic, the thinking was to protect it against concurrent writers in
SimpleLruZeroPage() and to ensure that a concurrent reader in
SlruSelectLRUPage() does not read a corrupted value.  All other usages
are during the initialization phase, where we do not need any
protection.

>
> We need the atomic write and atomic read so that multiple processes
> processing pages in different banks can update latest_page_number
> simultaneously.  But the equality condition that we're looking for?
> it can never happen concurrently.

Yeah, that's right, after you told I also realized that the case is
protected by the bank lock.  Earlier I didn't think about this case.

> In other words, these barriers are fully useless.
>
> (We also have SimpleLruTruncate, but I think it's not as critical to
> have a barrier there anyhow: accessing a slightly outdated page number
> could only be a problem if a bug elsewhere causes us to try to truncate
> in the current page.  I think we only have this code there because we
> did have such bugs in the past, but IIUC this shouldn't happen anymore.)

+1, I agree with this theory in general.  But the comment below in
SimpleLruTruncate in your v3 patch doesn't seem correct.  Here we
check whether latest_page_number precedes the cutoff; if so we log it
as wraparound and skip the whole thing.  That is fine even though we
read an atomic variable, and a slightly outdated value should not be a
problem.  But the comment claims this is safe because we hold the same
bank lock as SimpleLruZeroPage(), and that is not true here: we will
be acquiring different bank locks one by one, depending on which
slotno we are checking (see the sketch after the quoted hunk).  Am I
missing something?


+ * An important safety check: the current endpoint page must not be
+ * eligible for removal.  Like SlruSelectLRUPage, we don't need a
+ * memory barrier here because for the affected page to be relevant,
+ * we'd have to have the same bank lock as SimpleLruZeroPage.
  */
- if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage))
+ if (ctl->PagePrecedes(pg_atomic_read_u64(&shared->latest_page_number),
+   cutoffPage))
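
For reference, the scan I am referring to takes the bank locks one at
a time, roughly like this (a simplified sketch of the shape of the
loop, not the exact code in the patch):

	/* the truncate scan takes one bank lock at a time */
	for (int bankno = 0; bankno < shared->num_slots / SLRU_BANK_SIZE; bankno++)
	{
		LWLock	   *lock = &shared->bank_locks[bankno].lock;

		LWLockAcquire(lock, LW_EXCLUSIVE);
		/* examine the slots of this bank, zapping pages before cutoffPage */
		LWLockRelease(lock);
	}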


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-01 Thread Dilip Kumar
On Thu, Feb 1, 2024 at 4:34 PM Dilip Kumar  wrote:
>
> On Thu, Feb 1, 2024 at 4:12 PM Dilip Kumar  wrote:
> >
> > On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera  
> > wrote:
>
> > Okay.
> > >
> > > While I have your attention -- if you could give a look to the 0001
> > > patch I posted, I would appreciate it.
> > >
> >
> > I will look into it.  Thanks.
>
> Some quick observations,
>
> Do we need below two write barriers at the end of the function?
> because the next instruction is separated by the function boundary
>
> @@ -766,14 +766,11 @@ StartupCLOG(void)
>   ..
> - XactCtl->shared->latest_page_number = pageno;
> -
> - LWLockRelease(XactSLRULock);
> + pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno);
> + pg_write_barrier();
>  }
>
> /*
>   * Initialize member's idea of the latest page number.
>   */
>   pageno = MXOffsetToMemberPage(offset);
> - MultiXactMemberCtl->shared->latest_page_number = pageno;
> + pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number,
> +    pageno);
> +
> + pg_write_barrier();
>  }
>

I have checked the patch and it looks fine to me, apart from the above
question related to memory barrier usage.  One more question about the
same: the two instances below, 1 and 2, look similar, but in 1 you are
not using a write barrier whereas in 2 you are.  Why is that?  I mean,
why can the reordering not happen in 1 while it may happen in 2?

1.
+ pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number,
+ trunc->pageno);

  SimpleLruTruncate(CommitTsCtl, trunc->pageno);

vs
2.

  - shared->latest_page_number = pageno;
+ pg_atomic_write_u64(&shared->latest_page_number, pageno);
+ pg_write_barrier();

  /* update the stats counter of zeroed pages */
  pgstat_count_slru_page_zeroed(shared->slru_stats_idx);



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-01 Thread Dilip Kumar
On Thu, Feb 1, 2024 at 4:12 PM Dilip Kumar  wrote:
>
> On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera  wrote:

> Okay.
> >
> > While I have your attention -- if you could give a look to the 0001
> > patch I posted, I would appreciate it.
> >
>
> I will look into it.  Thanks.

Some quick observations,

Do we need the two write barriers below, at the end of the functions?
The next instruction is separated by the function boundary.

@@ -766,14 +766,11 @@ StartupCLOG(void)
  ..
- XactCtl->shared->latest_page_number = pageno;
-
- LWLockRelease(XactSLRULock);
+ pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno);
+ pg_write_barrier();
 }

/*
  * Initialize member's idea of the latest page number.
  */
  pageno = MXOffsetToMemberPage(offset);
- MultiXactMemberCtl->shared->latest_page_number = pageno;
+ pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number,
+    pageno);
+
+ pg_write_barrier();
 }

I am looking more into this from the concurrency point of view and
will update you soon.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-01 Thread Dilip Kumar
On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera  wrote:
>
> On 2024-Feb-01, Dilip Kumar wrote:
>
> > On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera  
> > wrote:
> > >
> > > postgres -c lc_messages=C -c shared_buffers=$((512*17))
> > >
> > > 2024-02-01 10:48:13.548 CET [1535379] FATAL:  invalid value for parameter 
> > > "transaction_buffers": 17
> > > 2024-02-01 10:48:13.548 CET [1535379] DETAIL:  "transaction_buffers" must 
> > > be a multiple of 16
> >
> > Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE
> > instead of giving an error?
>
> Since this is the auto-tuning feature, I think it should use the
> previous multiple rather than the next, but yeah, something like that.

Okay.
>
> While I have your attention -- if you could give a look to the 0001
> patch I posted, I would appreciate it.
>

I will look into it.  Thanks.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-01 Thread Dilip Kumar
On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera  wrote:
>
> Hah:
>
> postgres -c lc_messages=C -c shared_buffers=$((512*17))
>
> 2024-02-01 10:48:13.548 CET [1535379] FATAL:  invalid value for parameter 
> "transaction_buffers": 17
> 2024-02-01 10:48:13.548 CET [1535379] DETAIL:  "transaction_buffers" must be 
> a multiple of 16

Maybe we should round it up to the next multiple of SLRU_BANK_SIZE
instead of giving an error?  For example, something like the sketch
below.
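
A rough sketch of that, assuming a GUC check hook with the usual
signature (the hook name here is only illustrative):

	static bool
	check_transaction_buffers(int *newval, void **extra, GucSource source)
	{
		/* round up to the next multiple of SLRU_BANK_SIZE instead of failing */
		if (*newval % SLRU_BANK_SIZE != 0)
			*newval = ((*newval / SLRU_BANK_SIZE) + 1) * SLRU_BANK_SIZE;

		return true;
	}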

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-01-28 Thread Dilip Kumar
On Fri, Jan 26, 2024 at 11:08 PM Alvaro Herrera  wrote:
>
> I've continued to review this and decided that I don't like the mess
> this patch proposes in order to support pg_commit_ts's deletion of all
> files.  (Yes, I know that I was the one that proposed this idea. It's
> still a bad one).  I'd like to change that code by removing the limit
> that we can only have 128 bank locks in a SLRU.  To recap, the reason we
> did this is that commit_ts sometimes wants to delete all files while
> running (DeactivateCommitTs), and for this it needs to acquire all bank
> locks.  Since going above the MAX_SIMUL_LWLOCKS limit is disallowed, we
> added the SLRU limit making multiple banks share lwlocks.
>
> I propose two alternative solutions:
>
> 1. The easiest is to have DeactivateCommitTs continue to hold
> CommitTsLock until the end, including while SlruScanDirectory does its
> thing.  This sounds terrible, but considering that this code only runs
> when the module is being deactivated, I don't think it's really all that
> bad in practice.  I mean, if you deactivate the commit_ts module and
> then try to read commit timestamp data, you deserve to wait for a few
> seconds just as a punishment for your stupidity.

I think this idea looks reasonable.  I agree that if we are trying to
read commit_ts after deactivating it then it's fine to make it wait.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-01-28 Thread Dilip Kumar
On Thu, Jan 25, 2024 at 10:03 PM Alvaro Herrera  wrote:
>
> On 2024-Jan-25, Alvaro Herrera wrote:
>
> > Here's a touched-up version of this patch.
>
> > diff --git a/src/backend/storage/lmgr/lwlock.c 
> > b/src/backend/storage/lmgr/lwlock.c
> > index 98fa6035cc..4a5e05d5e4 100644
> > --- a/src/backend/storage/lmgr/lwlock.c
> > +++ b/src/backend/storage/lmgr/lwlock.c
> > @@ -163,6 +163,13 @@ static const char *const BuiltinTrancheNames[] = {
> >   [LWTRANCHE_LAUNCHER_HASH] = "LogicalRepLauncherHash",
> >   [LWTRANCHE_DSM_REGISTRY_DSA] = "DSMRegistryDSA",
> >   [LWTRANCHE_DSM_REGISTRY_HASH] = "DSMRegistryHash",
> > + [LWTRANCHE_COMMITTS_SLRU] = "CommitTSSLRU",
> > + [LWTRANCHE_MULTIXACTOFFSET_SLRU] = "MultixactOffsetSLRU",
> > + [LWTRANCHE_MULTIXACTMEMBER_SLRU] = "MultixactMemberSLRU",
> > + [LWTRANCHE_NOTIFY_SLRU] = "NotifySLRU",
> > + [LWTRANCHE_SERIAL_SLRU] = "SerialSLRU"
> > + [LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
> > + [LWTRANCHE_XACT_SLRU] = "XactSLRU",
> >  };
>
> Eeek.  Last minute changes ...  Fixed here.

Thank you for working on this.  There is one thing that I feel is
problematic.  We have restricted the allowed values for these GUCs to
multiples of SLRU_BANK_SIZE, i.e. 16, and that's the reason the
minimum values were changed to 16; but in this refactoring patch you
have changed the minimum back to 8 for some of the buffers, which I
think is not good.

+ {
+ {"multixact_offsets_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the size of the dedicated buffer pool used for
the MultiXact offset cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &multixact_offsets_buffers,
+ 16, 8, SLRU_MAX_ALLOWED_BUFFERS,
+ check_multixact_offsets_buffers, NULL, NULL
+ },
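
I.e. keep the minimum at 16 here; only the min field would change,
something like:

+ &multixact_offsets_buffers,
+ 16, 16, SLRU_MAX_ALLOWED_BUFFERS,
+ check_multixact_offsets_buffers, NULL, NULL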

Other than this, the patch looks good to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: index prefetching

2024-01-25 Thread Dilip Kumar
On Wed, Jan 24, 2024 at 11:43 PM Tomas Vondra
 wrote:

> On 1/22/24 07:35, Konstantin Knizhnik wrote:
> >
> > On 22/01/2024 1:47 am, Tomas Vondra wrote:
> >> h, right. Well, you're right in this case we perhaps could set just one
> >> of those flags, but the "purpose" of the two places is quite different.
> >>
> >> The "prefetch" flag is fully controlled by the prefetcher, and it's up
> >> to it to change it (e.g. I can easily imagine some new logic touching
> >> setting it to "false" for some reason).
> >>
> >> The "data" flag is fully controlled by the custom callbacks, so whatever
> >> the callback stores, will be there.
> >>
> >> I don't think it's worth simplifying this. In particular, I don't think
> >> the callback can assume it can rely on the "prefetch" flag.
> >>
> > Why not to add "all_visible" flag to IndexPrefetchEntry ? If will not
> > cause any extra space overhead (because of alignment), but allows to
> > avoid dynamic memory allocation (not sure if it is critical, but nice to
> > avoid if possible).
> >
>
While reading through the first patch I had some questions.  I haven't
read it completely yet, but this is what I have so far.

1.
+static bool
+IndexPrefetchBlockIsSequential(IndexPrefetch *prefetch, BlockNumber block)
+{
+ int idx;
...
+ if (prefetch->blockItems[idx] != (block - i))
+ return false;
+
+ /* Don't prefetch if the block happens to be the same. */
+ if (prefetch->blockItems[idx] == block)
+ return false;
+ }
+
+ /* not sequential, not recently prefetched */
+ return true;
+}

The above function is named ...BlockIsSequential, but at the end it
returns true for the "not sequential" case; that seems like a problem.
Also, the two checks right above the end of the function return false
if the block is the same or the pattern is sequential; I think those
are wrong too.


2.
I have noticed that the prefetch history is maintained at the backend
level, but what if multiple backends are trying to fetch the same heap
blocks, perhaps scanning the same index?  Should that history be kept
in some shared structure?  I haven't thought much more deeply about
this from the implementation POV, but should we think about it, or
does it not matter?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Opportunistically pruning page before update

2024-01-22 Thread Dilip Kumar
On Tue, Jan 23, 2024 at 7:18 AM James Coleman  wrote:
>
> On Mon, Jan 22, 2024 at 8:21 PM James Coleman  wrote:
> >
> > See rebased patch attached.
>
> I just realized I left a change in during the rebase that wasn't necessary.
>
> v4 attached.

I have noticed that you perform the opportunistic pruning after we
have decided that the updated tuple cannot fit in the current page,
and then the pruning is done on the new target page.  Why don't we
first perform the pruning on the existing page of the tuple itself,
for example as sketched below?  Or is that already being done before
this patch?  I could not find such existing pruning, hence the
question, because such pruning could convert many non-HOT updates into
HOT updates, right?
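
What I mean is something roughly like this inside the update path (a
sketch only; in practice this would need to happen at a point where we
do not yet hold the buffer content lock, so the exact placement within
heap_update() is an assumption on my part):

	/*
	 * Try to opportunistically prune the tuple's current page before
	 * concluding that the new version cannot fit there, since freeing
	 * space here may keep the update HOT.
	 */
	heap_page_prune_opt(relation, buffer);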

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-01-21 Thread Dilip Kumar
On Wed, Jan 17, 2024 at 1:46 PM Kartyshov Ivan
 wrote:
>
> Add some fixes and rebase.
>
While quickly looking into the patch, I understood the idea of what we
are trying to achieve here, and I feel that it is a useful feature.
But while looking at both the patches I could not quickly tell the
difference between the two approaches.  I believe that internally, at
the core, both implement similar wait logic but provide different
syntaxes.  So if we want to keep both approaches open for the sake of
discussion, it would be better to first create a patch that implements
the core approach, i.e. the waiting logic and the other common parts,
and then add top-up patches with the two different approaches; that
would be easier to review.  I also see that in v4 there is no
documentation for the syntax part, which makes it even harder to
understand.

I think this thread is implementing a useful feature, so my suggestion
is to add some documentation in v4 and also make it more readable
w.r.t. what the clear differences between these two approaches are;
maybe adding a commit message will also help.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2024-01-19 Thread Dilip Kumar
On Fri, Jan 19, 2024 at 5:24 PM Amit Kapila  wrote:
>
> On Wed, Jan 17, 2024 at 4:00 PM shveta malik  wrote:
> >
>
> I had some off-list discussions with Sawada-San, Hou-San, and Shveta
> on the topic of extending replication commands instead of using the
> current model where we fetch the required slot information via SQL
> using a database connection. I would like to summarize the discussion
> and would like to know the thoughts of others on this topic.
>
> In the current patch, we launch the slotsync worker on physical
> standby which connects to the specified database (currently we let
> users specify the required dbname in primary_conninfo) on the primary.
> It then fetches the required information for failover marked slots
> from the primary and also does some primitive checks on the upstream
> node via SQL (the additional checks are like whether the upstream node
> has a specified physical slot or whether the upstream node is a
> primary node or a standby node). To fetch the required information it
> uses a libpqwalreciever API which is mostly apt for this purpose as it
> supports SQL execution but for this patch, we don't need a replication
> connection, so we extend the libpqwalreciever connect API.

What sort of extension have we made to libpqwalreceiver?  Is it that,
by default, it supports replication connections, and we have extended
the connect API to provide an option for whether to create a
replication connection or a normal connection?

> Now, the concerns related to this could be that users would probably
> need to change existing mechanisms/tools to update priamry_conninfo
> and one of the alternatives proposed is to have an additional GUC like
> slot_sync_dbname. Users won't be able to drop the database this worker
> is connected to aka whatever is specified in slot_sync_dbname but as
> the user herself sets up the configuration it shouldn't be a big deal.

Yeah, for this purpose users may use template1 or similar, which they
generally don't plan to drop.  So if the user wants to drop that
database, they need to turn off the slot-syncing option first, and
then it can be done?

> Then we also discussed whether extending libpqwalreceiver's connect
> API is a good idea and whether we need to further extend it in the
> future. As far as I can see, slotsync worker's primary requirement is
> to execute SQL queries which the current API is sufficient, and don't
> see something that needs any drastic change in this API. Note that
> tablesync worker that executes SQL also uses these APIs, so we may
> need something in the future for either of those. Then finally we need
> a slotsync worker to also connect to a database to use SQL and fetch
> results.

While looking into patch v64-0002 I could not exactly point out what
sort of extensions are in libpqwalreceiver.c; I just saw one extra API
for fetching the dbname from the connection info?

> Now, let us consider if we extend the replication commands like
> READ_REPLICATION_SLOT and or introduce a new set of replication
> commands to fetch the required information then we don't need a DB
> connection with primary or a connection in slotsync worker. As per my
> current understanding, it is quite doable but I think we will slowly
> go in the direction of making replication commands something like SQL
> because today we need to extend it to fetch all slots info that have
> failover marked as true, the existence of a particular replication,
> etc. Then tomorrow, if we want to extend this work to have multiple
> slotsync workers say workers perdb then we have to extend the
> replication command to fetch per-database failover marked slots. To
> me, it sounds more like we are slowly adding SQL-like features to
> replication commands.
>
> Apart from this when we are reading per-db replication slots without
> connecting to a database, we probably need some additional protection
> mechanism so that the database won't get dropped.

Something like locking the database only while fetching the slots?

> Considering all this it seems that for now probably extending
> replication commands can simplify a few things like mentioned above
> but using SQL's with db-connection is more extendable.

Even I have similar thoughts.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

2024-01-18 Thread Dilip Kumar
On Wed, Nov 1, 2023 at 2:42 AM Matthias van de Meent
 wrote:
>
> Hi,
>
> Currently, nbtree code compares each and every column of an index
> tuple during the binary search on the index page. With large indexes
> that have many duplicate prefix column values (e.g. an index on (bool,
> bool, uuid) ) that means a lot of wasted time getting to the right
> column.
>
> The attached patch improves on that by doing per-page dynamic prefix
> truncation: If we know that on both the right and left side there are
> index tuples where the first two attributes are equal to the scan key,
> we skip comparing those attributes at the current index tuple and
> start with comparing attribute 3, saving two attribute compares. We
> gain performance whenever comparing prefixing attributes is expensive
> and when there are many tuples with a shared prefix - in unique
> indexes this doesn't gain much, but we also don't lose much in this
> case.
>
> This patch was originally suggested at [0], but it was mentioned that
> they could be pulled out into it's own thread. Earlier, the
> performance gains were not clearly there for just this patch, but
> after further benchmarking this patch stands on its own for
> performance: it sees no obvious degradation of performance, while
> gaining 0-5% across various normal indexes on the cc-complete sample
> dataset, with the current worst-case index shape getting a 60%+
> improved performance on INSERTs in the tests at [0].

+1 for the idea, I have some initial comments while reading through the patch.

1.
Commit message refers to a non-existing reference '(see [0])'.


2.
+When we do a binary search on a sorted set (such as a BTree), we know that a
+tuple will be smaller than its left neighbour, and larger than its right
+neighbour.

I think this should be 'larger than left neighbour and smaller than
right neighbour' instead of the other way around.

3.
+With the above optimizations, dynamic prefix truncation improves the worst
+case complexity of indexing from O(tree_height * natts * log(tups_per_page))
+to O(tree_height * (3*natts + log(tups_per_page)))

Where do the 3*natts come from?  Is it related to setting up the
dynamic prefix at each level?

4.
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *compareattr = ntupatts + 1;

Can you elaborate on this more? If all tuple attributes are equal to
the scan key then what do those 'later attributes' mean?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2024-01-15 Thread Dilip Kumar
On Tue, Jan 16, 2024 at 9:37 AM Amit Kapila  wrote:
>
> On Tue, Jan 16, 2024 at 9:03 AM shveta malik  wrote:
> >
> Agreed and as said earlier I think it is better to make it a
> PGC_SIGHUP. Also, not sure we can say it is a non-standard way as
> already autovacuum launcher is handled in the same way. One more minor
> thing is it will save us for having a new bgworker state
> BgWorkerStart_ConsistentState_HotStandby as introduced by this patch.

Yeah, it's not a non-standard way.  But the bgworker machinery
provides a lot of built-in infrastructure which we would otherwise
have to maintain ourselves if we opt for option 2.  I would have
preferred option 3 from the simplicity point of view, but I would
rather make this PGC_SIGHUP than keep that simplicity.  Anyway, if
there are issues in doing so then we can keep it PGC_POSTMASTER, but
it's worth trying this out.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2024-01-11 Thread Dilip Kumar
On Wed, Jan 10, 2024 at 5:53 PM Zhijie Hou (Fujitsu)
 wrote:
>

> > IIUC on the standby we just want to overwrite what we get from primary no? 
> > If
> > so why we are using those APIs that are meant for the actual decoding slots
> > where it needs to take certain logical decisions instead of mere 
> > overwriting?
>
> I think we don't have a strong reason to use these APIs, but it was 
> convenient to
> use these APIs as they can take care of updating the slots info and will call
> functions like, ReplicationSlotsComputeRequiredXmin,
> ReplicationSlotsComputeRequiredLSN internally. Or do you prefer directly 
> overwriting
> the fields and call these manually ?

I might be missing something, but do you want to call functions like
ReplicationSlotsComputeRequiredXmin() on the standby?  Those will
ultimately update the catalog xmin and replication xmin in the
procarray, which prevents VACUUM from cleaning up some of the required
xids.  But on the standby, those shared-memory parameters are not
used, IIUC.

In my opinion, on the standby we just need to overwrite the values in
the local slots with whatever we get from the remote slots, without
taking all the logical decisions in the hope that they will all fall
into a particular path (see the sketch after [1] below for what I have
in mind).  For example, if you look at LogicalIncreaseXminForSlot(),
it performs the steps shown in [1].  All of that makes sense when you
are doing candidate-based updating, where we first mark the candidates
and then update the candidate to the real value once we get the
confirmation for the LSN.  Following all this logic here looks
completely out of place, and unless the call can fall into a different
path, I feel it will also do some duplicate steps.  For example, in
local_slot_update() you first call LogicalConfirmReceivedLocation(),
which sets 'data.confirmed_flush', and then you call
LogicalIncreaseXminForSlot(), which sets 'updated_xmin = true;' and
calls LogicalConfirmReceivedLocation() again.  I don't think this is
the correct way of reusing these functions, unless you need to go
through those paths and I am missing something.

[1]
LogicalIncreaseXminForSlot()
{
    if (TransactionIdPrecedesOrEquals(xmin, slot->data.catalog_xmin))
    {
    }
    else if (current_lsn <= slot->data.confirmed_flush)
    {
    }
    else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)
    {
    }

    if (updated_xmin)
        LogicalConfirmReceivedLocation(slot->data.confirmed_flush);
}
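
To make the alternative concrete, directly overwriting the local slot
could look roughly like this (a sketch only, with locking and
persistence details simplified):

	/* overwrite the synced slot's positions with the remote values */
	SpinLockAcquire(&MyReplicationSlot->mutex);
	MyReplicationSlot->data.confirmed_flush = remote_slot->confirmed_lsn;
	MyReplicationSlot->data.restart_lsn = remote_slot->restart_lsn;
	MyReplicationSlot->data.catalog_xmin = remote_slot->catalog_xmin;
	SpinLockRelease(&MyReplicationSlot->mutex);

	ReplicationSlotMarkDirty();
	ReplicationSlotSave();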


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2024-01-09 Thread Dilip Kumar
On Tue, Jan 9, 2024 at 5:44 PM Zhijie Hou (Fujitsu)
 wrote:
>
comments on 0002

1.
+/* Worker's nap time in case of regular activity on the primary server */
+#define WORKER_DEFAULT_NAPTIME_MS   10L /* 10 ms */
+
+/* Worker's nap time in case of no-activity on the primary server */
+#define WORKER_INACTIVITY_NAPTIME_MS   10000L /* 10 sec */

Instead of switching directly from 10 ms to 10 s, shouldn't we
increase the nap time gradually?  It could even go beyond 10 s, but
rather than jumping straight from 10 ms to 10 s we could multiply the
nap time by some factor on each idle cycle and cap it at a maximum,
resetting it back to 10 ms as soon as we observe some activity.  A
rough sketch of what I mean is below.
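
A minimal sketch of that kind of backoff (the helper name and the
activity flag are hypothetical; only the two constants above are from
the patch):

	/*
	 * Hypothetical helper: double the nap time on every idle cycle up to
	 * the inactivity cap, and reset it as soon as activity is seen.
	 */
	static long
	next_naptime(long cur_naptime_ms, bool activity_detected)
	{
		if (activity_detected)
			return WORKER_DEFAULT_NAPTIME_MS;	/* back to 10 ms */

		return Min(cur_naptime_ms * 2, WORKER_INACTIVITY_NAPTIME_MS);
	}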

2.
Add SlotSyncWorkerCtxStruct to the typedefs.list file.

3.
+/*
+ * Update local slot metadata as per remote_slot's positions
+ */
+static void
+local_slot_update(RemoteSlot *remote_slot)
+{
+ Assert(MyReplicationSlot->data.invalidated == RS_INVAL_NONE);
+
+ LogicalConfirmReceivedLocation(remote_slot->confirmed_lsn);
+ LogicalIncreaseXminForSlot(remote_slot->confirmed_lsn,
+remote_slot->catalog_xmin);
+ LogicalIncreaseRestartDecodingForSlot(remote_slot->confirmed_lsn,
+   remote_slot->restart_lsn);
+}

IIUC, on the standby we just want to overwrite what we get from the
primary, no?  If so, why are we using these APIs, which are meant for
actual decoding slots and need to take certain logical decisions,
instead of merely overwriting the values?

4.
+/*
+ * Helper function for drop_obsolete_slots()
+ *
+ * Drops synced slot identified by the passed in name.
+ */
+static void
+drop_synced_slots_internal(const char *name, bool nowait)

Suggestion: add one line to the header comment explaining the 'nowait'
argument.

5.
+/*
+ * Helper function to check if local_slot is present in remote_slots list.
+ *
+ * It also checks if logical slot is locally invalidated i.e. invalidated on
+ * the standby but valid on the primary server. If found so, it sets
+ * locally_invalidated to true.
+ */

Instead of saying "but valid on the primary server" better to mention
it in the remote_slots list, because here this function is just
checking the remote_slots list regardless of whether the list came
from.  Mentioning primary
seems like it might fetch directly from the primary in this function
so this is a bit confusing.

6.
+/*
+ * Check that all necessary GUCs for slot synchronization are set
+ * appropriately. If not, raise an ERROR.
+ */
+static void
+validate_slotsync_parameters(char **dbname)


The function name just says 'validate_slotsync_parameters', but it
also fetches the dbname, so I think it is better to change the name
accordingly.  Also, instead of passing dbname as an out parameter,
just return it directly; there is no need to pass this extra parameter
and have the function return void.

7.
+ tupslot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple);
+ tuple_ok = tuplestore_gettupleslot(res->tuplestore, true, false, tupslot);
+ Assert(tuple_ok); /* It must return one tuple */

The comment says "It must return one tuple", but the assert only
checks that we got at least one tuple; shouldn't we enhance the
assertion so that it checks that we got exactly one tuple, something
like the sketch below?
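
For example, building on the quoted code (tuplestore_tuple_count()
returns the number of tuples stored in the tuplestore):

	/* the result must contain exactly one tuple */
	Assert(tuplestore_tuple_count(res->tuplestore) == 1);

	tuple_ok = tuplestore_gettupleslot(res->tuplestore, true, false, tupslot);
	Assert(tuple_ok);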

8.
/* No need to check further, return that we are cascading standby */
+ *am_cascading_standby = true;

We are not returning immediately; we are just setting
am_cascading_standby to true, so adjust the comment accordingly.

9.
+ /* No need to check further, return that we are cascading standby */
+ *am_cascading_standby = true;
+ }
+ else
+ {
+ /* We are a normal standby. */

The single-line comments do not follow a uniform pattern for the full
stop; either use a full stop for all single-line comments or for none,
or at least follow the same rule within a file or among nearby
comments.

10.
+ errmsg("exiting from slot synchronization due to bad configuration"),
+ errhint("%s must be defined.", "primary_slot_name"));

Why are we passing the constant string "primary_slot_name" as a format
argument in this error message?

11.
+ /*
+ * Hot_standby_feedback must be enabled to cooperate with the physical
+ * replication slot, which allows informing the primary about the xmin and
+ * catalog_xmin values on the standby.

I do not like capitalizing the first letter of 'hot_standby_feedback',
which is a GUC parameter.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-01-08 Thread Dilip Kumar
On Mon, Jan 8, 2024 at 4:55 PM Alvaro Herrera  wrote:
>
> The more I look at TransactionGroupUpdateXidStatus, the more I think
> it's broken, and while we do have some tests, I don't have confidence
> that they cover all possible cases.
>
> Or, at least, if this code is good, then it hasn't been sufficiently
> explained.

Do you have any thoughts about a case in which you think it might be
broken?  Even a vague idea of where you think it might not work as
expected would help, so that I can also think in that direction.  It
is possible that I am not thinking of some perspective that you have
in mind, and that the comments are lacking from that point of view.

> If we have multiple processes trying to write bits to clog, and they are
> using different banks, then the LWLockConditionalAcquire will be able to
> acquire the bank lock

Do you think there is a problem with multiple processes getting the
lock?  They are modifying different CLOG pages, so that can be done
concurrently, right?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2024-01-07 Thread Dilip Kumar
On Fri, Jan 5, 2024 at 5:45 PM Amit Kapila  wrote:
>
> On Fri, Jan 5, 2024 at 4:25 PM Dilip Kumar  wrote:
> >
> > On Fri, Jan 5, 2024 at 8:59 AM shveta malik  wrote:
> > >
> > I was going the the patch set again, I have a question.  The below
> > comments say that we keep the failover option as PENDING until we have
> > done the initial table sync which seems fine.  But what happens if we
> > add a new table to the publication and refresh the subscription? In
> > such a case does this go back to the PENDING state or something else?
> >
>
> At this stage, such an operation is prohibited. Users need to disable
> the failover option first, then perform the above operation, and after
> that failover option can be re-enabled.

Okay, that makes sense to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2024-01-05 Thread Dilip Kumar
On Fri, Jan 5, 2024 at 8:59 AM shveta malik  wrote:
>
I was going through the patch set again, and I have a question.  The
comments below say that we keep the failover option as PENDING until
we have done the initial table sync, which seems fine.  But what
happens if we add a new table to the publication and refresh the
subscription?  In such a case, does this go back to the PENDING state,
or something else?

+ * As a result, we enable the failover option for the main slot only after the
+ * initial sync is complete. The failover option is implemented as a tri-state
+ * with values DISABLED, PENDING, and ENABLED. The state transition process
+ * between these values is the same as the two_phase option (see TWO_PHASE
+ * TRANSACTIONS for details).

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-05 Thread Dilip Kumar
On Tue, Dec 12, 2023 at 4:15 PM Michael Paquier  wrote:
>
> On Tue, Dec 12, 2023 at 10:27:09AM +0530, Dilip Kumar wrote:
> > Oops, I only included the code changes where I am adding injection
> > points and some comments to verify that, but missed the actual test
> > file. Attaching it here.
>
> I see.  Interesting that this requires persistent connections to work.
> That's something I've found clunky to rely on when the scenarios a
> test needs to deal with are rather complex.  That's an area that could
> be made easier to use outside of this patch..  Something got proposed
> by Andrew Dunstan to make the libpq routines usable through a perl
> module, for example.
>
> > Note:  I think the latest patches are conflicting with the head, can you 
> > rebase?
>
> Indeed, as per the recent manipulations in ipci.c for the shmem
> initialization areas.  Here goes a v6.

Some comments on 0001, mostly cosmetic:

1.
+/* utilities to handle the local array cache */
+static void
+injection_point_cache_add(const char *name,
+   InjectionPointCallback callback)

I think the comment for this function should be more specific about
adding an entry to the local injection point cache.  And add comments
for the other functions as well, e.g. injection_point_cache_get().


2.
+typedef struct InjectionPointEntry
+{
+ char name[INJ_NAME_MAXLEN]; /* hash key */
+ char library[INJ_LIB_MAXLEN]; /* library */
+ char function[INJ_FUNC_MAXLEN]; /* function */
+} InjectionPointEntry;

Some comments would be good for the structure

3.

+static bool
+file_exists(const char *name)
+{
+ struct stat st;
+
+ Assert(name != NULL);
+ if (stat(name, &st) == 0)
+ return !S_ISDIR(st.st_mode);
+ else if (!(errno == ENOENT || errno == ENOTDIR))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access file \"%s\": %m", name)));
+ return false;
+}

dfmgr.c has a similar function so can't we reuse it by making that
function external?

4.
+ if (found)
+ {
+ LWLockRelease(InjectionPointLock);
+ elog(ERROR, "injection point \"%s\" already defined", name);
+ }
+
...
+#else
+ elog(ERROR, "Injection points are not supported by this build");

Better to use consistent formatting for the error output: "Injection"
vs "injection" (better not to capitalize the first letter, for
consistency).

5.
+ * Check first the shared hash table, and adapt the local cache
+ * depending on that as it could be possible that an entry to run
+ * has been removed.
+ */

What if the entry is removed after we have released
InjectionPointLock?  Or would this not cause any harm?


0004:

I think test_injection_points_wake() and test_injection_wait() can be
moved into 0002.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2024-01-03 Thread Dilip Kumar
On Wed, Jan 3, 2024 at 4:20 PM Amit Kapila  wrote:
>
> On Fri, Dec 29, 2023 at 12:32 PM Amit Kapila  wrote:

> > I see your point and agree that users need to be careful. I was trying
> > to compare it with other places like the conninfo used with a
> > subscription where no separate dbname needs to be provided. Now, here
> > the situation is not the same because the same conninfo is used for
> > different purposes (walreceiver doesn't require dbname (dbname is
> > ignored even if present) whereas slotsyncworker requires dbname). I
> > was just trying to see if we can avoid having a new GUC for this
> > purpose. Does anyone else have an opinion on this matter?
> >
>
> Bertrand, Dilip, and others involved in this thread or otherwise, see
> if you can share an opinion on the above point because it would be
> good to get some more opinions before we decide to add a new GUC (for
> dbname) for slotsync worker.

IMHO, for now we can just use primary_conninfo and let the user modify
it to add the dbname.  In the future, if this creates some discomfort
or we see complaints about the usage, then we can extend the behavior
by providing an additional GUC for the dbname.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-01-02 Thread Dilip Kumar
On Tue, Jan 2, 2024 at 7:53 PM Robert Haas  wrote:
>
> On Fri, Dec 22, 2023 at 8:14 AM Andrey M. Borodin  
> wrote:
> > Just a side node.
> > It seems like commit log is kind of antipattern of data contention. Even 
> > when we build a super-optimized SLRU. Nearby **bits** are written by 
> > different CPUs.
> > I think that banks and locks are good thing. But also we could reorganize 
> > log so that
> > status of transaction 0 is on a page 0 at bit offset 0
> > status of transaction 1 is on a page 1 at bit offset 0
> > status of transaction 2 is on a page 2 at bit offset 0
> > status of transaction 3 is on a page 3 at bit offset 0
> > status of transaction 4 is on a page 0 at bit offset 2
> > status of transaction 5 is on a page 1 at bit offset 2
> > status of transaction 6 is on a page 2 at bit offset 2
> > status of transaction 7 is on a page 3 at bit offset 2
> > etc...
>
> This is an interesting idea. A variant would be to stripe across
> cachelines within the same page rather than across pages. If we do
> stripe across pages as proposed here, we'd probably want to rethink
> the way the SLRU is extended -- page at a time wouldn't really make
> sense, but preallocating an entire file might.

Yeah, this is indeed an interesting idea.  If we are interested in
working in this direction, maybe it can be submitted in a separate
thread, IMHO.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: index prefetching

2023-12-20 Thread Dilip Kumar
On Wed, Dec 20, 2023 at 7:11 AM Tomas Vondra
 wrote:
>
I was going through the patch to understand the idea; a couple of
observations:

--
+ for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
+ {
+ entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i];
+
+ /* Is this the oldest prefetch request in this LRU? */
+ if (entry->request < oldestRequest)
+ {
+ oldestRequest = entry->request;
+ oldestIndex = i;
+ }
+
+ /*
+ * If the entry is unused (identified by request being set to 0),
+ * we're done. Notice the field is uint64, so empty entry is
+ * guaranteed to be the oldest one.
+ */
+ if (entry->request == 0)
+ continue;

If 'entry->request == 0' then we should break instead of continue, right?

---
/*
 * Used to detect sequential patterns (and disable prefetching).
 */
#define PREFETCH_QUEUE_HISTORY 8
#define PREFETCH_SEQ_PATTERN_BLOCKS 4

If for sequential patterns we check only 4 blocks, then why are we
maintaining a history of 8 blocks?

---

+ *
+ * XXX Perhaps this should be tied to effective_io_concurrency somehow?
+ *
+ * XXX Could it be harmful that we read the queue backwards? Maybe memory
+ * prefetching works better for the forward direction?
+ */
+ for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++)

Correct, I think if we fetch this forward it will have an advantage
with memory prefetching.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-12-18 Thread Dilip Kumar
On Mon, Dec 18, 2023 at 11:00 PM Robert Haas  wrote:
>
> On Mon, Dec 18, 2023 at 12:04 PM Robert Haas  wrote:
> > certain sense they are competing for the same job. However, they do
> > aim to alleviate different TYPES of contention: the group XID update
> > stuff should be most valuable when lots of processes are trying to
> > update the same page, and the banks should be most valuable when there
> > is simultaneous access to a bunch of different pages. So I'm not
> > convinced that this patch is a reason to remove the group XID update
> > mechanism, but someone might argue otherwise.
>
> Hmm, but, on the other hand:
>
> Currently all readers and writers are competing for the same LWLock.
> But with this change, the readers will (mostly) no longer be competing
> with the writers. So, in theory, that might reduce lock contention
> enough to make the group update mechanism pointless.

Thanks for your feedback.  I agree that with bank-wise locks we might
not need group updates for some of the use cases you mention, where
readers and writers contend on the centralized lock, because in most
cases the readers will be distributed across different banks.  OTOH,
there are use cases where the writers' commits are the bottleneck on
the SLRU lock, like pgbench simple-update or TPC-B, and there we will
still benefit from the group update.  During group-update testing we
saw benefits in such a scenario [1] at high client counts.  So, as I
understand it, by distributing the SLRU locks there are scenarios
where we will not need the group update anymore, but there are also
scenarios where we will still benefit from it.


[1] 
https://www.postgresql.org/message-id/CAFiTN-u-XEzhd%3DhNGW586fmQwdTy6Qy6_SXe09tNB%3DgBcVzZ_A%40mail.gmail.com



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-12-14 Thread Dilip Kumar
On Thu, Dec 14, 2023 at 4:36 PM Dilip Kumar  wrote:
>
> On Wed, Dec 13, 2023 at 5:49 PM Andrey M. Borodin  
> wrote:
>
> > > On 12 Dec 2023, at 18:28, Alvaro Herrera  wrote:
> > >
> > > Andrey, do you have any stress tests or anything else that you used to
> > > gain confidence in this code?
> >

I have done some more testing of the clog group update.  The attached
test file runs two concurrent scripts with pgbench: the first script
is a slow one that runs 10-second-long transactions, and the second
script is a very fast transaction, with ~10000 transactions per
second.  Along with that, I have also changed the bank size such that
each bank contains just 1 page, i.e. 32k transactions per bank.  I did
it this way so that we do not need to keep the long-running
transactions open for very long in order to get transactions from
different banks committed at the same time.  With this test I did get
that behavior, and the logs below show that multiple transaction
ranges that are in different SLRU banks (considering 32k transactions
per bank) are doing the group update at the same time.  E.g., in the
logs below, we can see that the xid ranges around 70600, 70548, and
70558, and the xid ranges around 755 and 752, are getting group
updates by different leaders at nearly the same time.

It is running fine when running for a long duration, but I am not sure
how to validate the sanity of this kind of test.

2023-12-14 14:43:31.813 GMT [3306] LOG:  group leader procno 606
updated status of procno 606 xid 70600
2023-12-14 14:43:31.816 GMT [3326] LOG:  procno 586 for xid 70548
added for group update
2023-12-14 14:43:31.816 GMT [3326] LOG:  procno 586 is group leader
and got the lock
2023-12-14 14:43:31.816 GMT [3326] LOG:  group leader procno 586
updated status of procno 586 xid 70548
2023-12-14 14:43:31.818 GMT [3327] LOG:  procno 585 for xid 70558
added for group update
2023-12-14 14:43:31.818 GMT [3327] LOG:  procno 585 is group leader
and got the lock
2023-12-14 14:43:31.818 GMT [3327] LOG:  group leader procno 585
updated status of procno 585 xid 70558
2023-12-14 14:43:31.829 GMT [3155] LOG:  procno 687 for xid 752 added
for group update
2023-12-14 14:43:31.829 GMT [3207] LOG:  procno 669 for xid 755 added
for group update
2023-12-14 14:43:31.829 GMT [3155] LOG:  procno 687 is group leader
and got the lock
2023-12-14 14:43:31.829 GMT [3155] LOG:  group leader procno 687
updated status of procno 669 xid 755
2023-12-14 14:43:31.829 GMT [3155] LOG:  group leader procno 687
updated status of procno 687 xid 752


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
# Goal of this script is to generate a scenario where some old, long-running,
# slow transactions get committed along with new transactions such that they
# fall in different SLRU banks


rm -rf pgdata
./initdb -D pgdata
echo "max_wal_size=20GB" >> pgdata/postgresql.conf
echo "shared_buffers=20GB" >> pgdata/postgresql.conf
echo  "checkpoint_timeout=40min" >> pgdata/postgresql.conf
echo "max_connections=700" >> pgdata/postgresql.conf
echo "maintenance_work_mem=1GB" >> pgdata/postgresql.conf
echo "subtrans_buffers=64" >> pgdata/postgresql.conf
echo "multixact_members_buffers=128" >> pgdata/postgresql.conf

#create slow_txn.sql script
cat > slow_txn.sql << EOF
BEGIN;
INSERT INTO test VALUES(1);
DELETE FROM test WHERE a=1;
select pg_sleep(10);
COMMIT;
EOF

#create fast_txn.sql script
cat > fast_txn.sql << EOF
BEGIN;
INSERT INTO test1 VALUES(1);
DELETE FROM test1 WHERE a=1;
COMMIT;
EOF

./pg_ctl -D pgdata -l logfile -c start
./psql -d postgres -c "create table test(a int)"
./psql -d postgres -c "create table test1(a int)"
./pgbench -i -s 1 postgres
./pgbench -f slow_txn.sql -c 28 -j 28 -P 1 -T 60 postgres &
./pgbench -f fast_txn.sql -c 100 -j 100 -P 1 -T 60 postgres
sleep 10
./pg_ctl -D pgdata -l logfile -c stop


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-12-14 Thread Dilip Kumar
On Wed, Dec 13, 2023 at 5:49 PM Andrey M. Borodin  wrote:

> > On 12 Dec 2023, at 18:28, Alvaro Herrera  wrote:
> >
> > Andrey, do you have any stress tests or anything else that you used to
> > gain confidence in this code?
>
> We are using only first two steps of the patchset, these steps do not touch 
> locking stuff.
>
> We’ve got some confidence after Yura Sokolov’s benchmarks [0]. Thanks!
>

I have run this test [1].  Instead of comparing against master, I have
compared the effect of patch-1 (0001+0002: SLRU buffer banks) vs
patch-2 (0001+0002+0003: SLRU buffer banks + bank-wise locks); the
results of benchmark-1 and benchmark-2 are below.  I have noticed a
very good improvement with the addition of patch 0003.

Machine information:
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):128
On-line CPU(s) list:   0-127

configurations:

max_wal_size=20GB
shared_buffers=20GB
checkpoint_timeout=40min
max_connections=700
maintenance_work_mem=1GB

subtrans_buffers=$variable
multixact_offsets_buffers=$variable
multixact_members_buffers=$variable

benchmark-1
version | subtrans | multixact | tps
        | buffers  | offs/memb | func+ballast
--------+----------+-----------+--------------
patch-1 |     64   |   64/128  |  87 + 58
patch-2 |     64   |   64/128  | 128 + 83
patch-1 |   1024   |  512/1024 |  96 + 64
patch-2 |   1024   |  512/1024 | 163 + 108

benchmark-2
version | subtrans | multixact | tps
        | buffers  | offs/memb | func
--------+----------+-----------+------
patch-1 |     64   |   64/128  | 10
patch-2 |     64   |   64/128  | 12
patch-1 |   1024   |  512/1024 | 44
patch-2 |   1024   |  512/1024 | 72


[1] 
https://www.postgresql.org/message-id/flat/e46cdea96979545b2d8a13b451d8b1ce61dc7238.camel%40postgrespro.ru#0ed2cad11470825d464093fe6b8ef6a3


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-12-13 Thread Dilip Kumar
On Thu, Dec 14, 2023 at 12:31 PM Ashutosh Bapat
 wrote:
>
> On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar  wrote:
>
> > > >
> > >
> > > It is correct that we can make a wrong decision about whether a change
> > > is transactional or non-transactional when sequence DDL happens before
> > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> > > after that state. However, one thing to note here is that we won't try
> > > to stream such a change because for non-transactional cases we don't
> > > proceed unless the snapshot is in a consistent state. Now, if the
> > > decision had been correct then we would probably have queued the
> > > sequence change and discarded at commit.
> > >
> > > One thing that we deviate here is that for non-sequence transactional
> > > cases (including logical messages), we immediately start queuing the
> > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
> > > SnapBuildProcessChange() returns true which is quite possible) and
> > > take final decision at commit/prepare/abort time. However, that won't
> > > be the case for sequences because of the dependency of determining
> > > transactional cases on one of the prior records. Now, I am not
> > > completely sure at this stage if such a deviation can cause any
> > > problem and or whether we are okay to have such a deviation for
> > > sequences.
> >
> > Okay, so this particular scenario that I raised is somehow saved, I
> > mean although we are considering transactional sequence operation as
> > non-transactional we also know that if some of the changes for a
> > transaction are skipped because the snapshot was not FULL that means
> > that transaction can not be streamed because that transaction has to
> > be committed before snapshot become CONSISTENT (based on the snapshot
> > state change machinery).  Ideally based on the same logic that the
> > snapshot is not consistent the non-transactional sequence changes are
> > also skipped.  But the only thing that makes me a bit uncomfortable is
> > that even though the result is not wrong we have made some wrong
> > intermediate decisions i.e. considered transactional change as
> > non-transactions.
> >
> > One solution to this issue is that, even if the snapshot state does
> > not reach FULL just add the sequence relids to the hash, I mean that
> > hash is only maintained for deciding whether the sequence is changed
> > in that transaction or not.  So no adding such relids to hash seems
> > like a root cause of the issue.  Honestly, I haven't analyzed this
> > idea in detail about how easy it would be to add only these changes to
> > the hash and what are the other dependencies, but this seems like a
> > worthwhile direction IMHO.
>
> I also thought about the same solution. I tried this solution as the
> attached patch on top of Hayato's diagnostic changes.

I think you forgot to attach the patch.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-12-13 Thread Dilip Kumar
On Wed, Dec 13, 2023 at 6:26 PM Amit Kapila  wrote:
>
> > > But can this even happen? Can we start decoding in the middle of a
> > > transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID,
> > > which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical
> > > messages, where we also call the output plugin in non-transactional cases.
> >
> > It's not a problem for logical messages because whether the message is
> > transaction or non-transactional is decided while WAL logs the message
> > itself.  But here our problem starts with deciding whether the change
> > is transactional vs non-transactional, because if we insert the
> > 'relfilenode' in hash then the subsequent sequence change in the same
> > transaction would be considered transactional otherwise
> > non-transactional.
> >
>
> It is correct that we can make a wrong decision about whether a change
> is transactional or non-transactional when sequence DDL happens before
> the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> after that state. However, one thing to note here is that we won't try
> to stream such a change because for non-transactional cases we don't
> proceed unless the snapshot is in a consistent state. Now, if the
> decision had been correct then we would probably have queued the
> sequence change and discarded at commit.
>
> One thing that we deviate here is that for non-sequence transactional
> cases (including logical messages), we immediately start queuing the
> changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
> SnapBuildProcessChange() returns true which is quite possible) and
> take final decision at commit/prepare/abort time. However, that won't
> be the case for sequences because of the dependency of determining
> transactional cases on one of the prior records. Now, I am not
> completely sure at this stage if such a deviation can cause any
> problem and or whether we are okay to have such a deviation for
> sequences.

Okay, so the particular scenario I raised is saved after all: although
we are considering a transactional sequence operation as
non-transactional, we also know that if some of the changes for a
transaction were skipped because the snapshot was not yet FULL, that
transaction cannot be streamed, because it has to be committed before
the snapshot becomes CONSISTENT (based on the snapshot state change
machinery).  By the same logic, while the snapshot is not consistent
the non-transactional sequence changes are also skipped.  The only
thing that makes me a bit uncomfortable is that even though the result
is not wrong, we have made a wrong intermediate decision, i.e. treated
a transactional change as non-transactional.

One solution to this issue is that, even if the snapshot state has not
reached FULL, we still add the sequence relids to the hash; that hash
is only maintained for deciding whether the sequence was changed in
that transaction or not.  So not adding such relids to the hash seems
like the root cause of the issue.  Honestly, I haven't analyzed in
detail how easy it would be to add only these changes to the hash and
what the other dependencies are, but this seems like a worthwhile
direction IMHO.
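
To make the idea concrete, here is a minimal sketch (purely
illustrative; the hash table, entry type, and function names are made
up and not from the actual patch) of recording the relfilenode
unconditionally, so that only the transactional/non-transactional
classification depends on it:

/*
 * Illustrative sketch only: track a sequence relfilenode created in
 * this transaction even before SNAPBUILD_FULL_SNAPSHOT is reached.
 * The hash is used solely to classify later changes as transactional,
 * so adding the entry early cannot produce wrong output.
 */
static void
track_sequence_relfilenode(ReorderBuffer *rb, TransactionId xid,
                           const RelFileLocator *rlocator)
{
    bool        found;
    SequenceHashEnt *ent;       /* hypothetical entry type */

    /* note: deliberately no SnapBuildCurrentState() check here */
    ent = hash_search(rb->sequences_hash, rlocator, HASH_ENTER, &found);
    if (!found)
        ent->xid = xid;
}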

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Improve eviction algorithm in ReorderBuffer

2023-12-12 Thread Dilip Kumar
On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada  wrote:
>
> > Thanks for working on this, I think it would be good to test other
> > scenarios as well where this might have some negative impact and see
> > where we stand.
>
> Agreed.
>
> > 1) A scenario where suppose you have one very large transaction that
> > is consuming ~40% of the memory and 5-6 comparatively smaller
> > transactions that are just above 10% of the memory limit.  And now for
> > coming under the memory limit instead of getting 1 large transaction
> > evicted out, we are evicting out multiple times.
>
> Given the large transaction list will have up to 10 transactions, I
> think it's cheap to pick the largest transaction among them. It's O(N)
> but N won't be large.

Yeah, that makes sense.

> > 2) Another scenario where all the transactions are under 10% of the
> > memory limit but let's say there are some transactions are consuming
> > around 8-9% of the memory limit each but those are not very old
> > transactions whereas there are certain old transactions which are
> > fairly small and consuming under 1% of memory limit and there are many
> > such transactions.  So how it would affect if we frequently select
> > many of these transactions to come under memory limit instead of
> > selecting a couple of large transactions which are consuming 8-9%?
>
> Yeah, probably we can do something for small transactions (i.e. small
> and on-memory transactions). One idea is to pick the largest
> transaction among them by iterating over all of them. Given that the
> more transactions are evicted, the less transactions the on-memory
> transaction list has, unlike the current algorithm, we would still
> win. Or we could even split it into several sub-lists in order to
> reduce the number of transactions to check. For example, splitting it
> into two lists: transactions consuming 5% < and 5% >=  of the memory
> limit, and checking the 5% >= list preferably. The cost for
> maintaining these lists could increase, though.
>
> Do you have any ideas?

Yeah, something like what you mention might be good: we maintain 3
lists, say for large, medium, and small transactions.  In the
large-transaction list, suppose we allow transactions that consume
more than 10% of the limit; there could be at most 10 such
transactions, so we can do a sequential search and spill the largest
of them all.  In the medium list, suppose we keep transactions ranging
from e.g. 3-10%; then it's fine to just pick from the head, because
the size differences between the largest and smallest transaction in
this list are not very significant.  The remaining transactions go to
the small-transaction list, and from that list we can choose to spill
multiple transactions at a time.
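
A rough sketch of that selection order (all names are hypothetical and
just illustrate the idea; the real ReorderBuffer code would differ):

/* Pick the next transaction to evict from three size-bucketed lists. */
static ReorderBufferTXN *
pick_txn_to_evict(dlist_head *large, dlist_head *medium, dlist_head *small)
{
    /* >10% of the limit: at most ~10 entries, so an O(N) scan is cheap */
    if (!dlist_is_empty(large))
        return find_largest_txn(large); /* hypothetical helper */

    /* 3-10%: sizes are similar, so just taking the head is fine */
    if (!dlist_is_empty(medium))
        return dlist_head_element(ReorderBufferTXN, node, medium);

    /* below 3%: the caller may evict several of these in one pass */
    if (!dlist_is_empty(small))
        return dlist_head_element(ReorderBufferTXN, node, small);

    return NULL;
}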

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-12-12 Thread Dilip Kumar
me group, and for handling that exceptional case we check the
respective bank lock for each page; if that exception occurs, we
release the old bank lock and acquire the new one.  This case might
not be as performant, because after getting the lock the leader might
need to wait again on another bank lock, but it is an extremely
exceptional case, so performance is not a concern here, and I do not
see any correctness issue either.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Adding facility for injection points (or probe points?) for more advanced tests

2023-12-11 Thread Dilip Kumar
On Mon, Dec 11, 2023 at 3:14 PM Michael Paquier  wrote:
>
> On Mon, Dec 11, 2023 at 11:09:45AM +0530, Dilip Kumar wrote:
> > I haven't specifically done a review or testing of this patch, but I
> > have used this for testing the CLOG group update code with my
> > SLRU-specific changes and I found it quite helpful to test some of the
> > concurrent areas where you need to stop processing somewhere in the
> > middle of the code and testing that area without this kind of
> > injection point framework is really difficult or may not be even
> > possible.  We wanted to test the case of clog group update where we
> > can get multiple processes added to a single group and get the xid
> > status updated by the group leader, you can refer to my test in that
> > thread[1] (the last patch test_group_commit.patch is using this
> > framework for testing).
>
> Could you be more specific?  test_group_commit.patch includes this
> line but there is nothing specific about this injection point getting
> used in a test or a callback assigned to it:
> ./test_group_commit.patch:+ INJECTION_POINT("ClogGroupCommit");

Oops, I only included the code changes where I am adding injection
points and some comments to verify that, but missed the actual test
file. Attaching it here.

Note: I think the latest patches conflict with HEAD; can you rebase?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
# Test clog group update

use strict;
use warnings;

use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

my $node = PostgreSQL::Test::Cluster->new('node');
$node->init(allows_streaming => 'logical');
$node->start;

$node->safe_psql('postgres', 'CREATE EXTENSION test_injection_points;');
$node->safe_psql('postgres', 'CREATE TABLE test(a int);');

# Consume multiple xids so that next xids get generated in new banks
$node->safe_psql(
	'postgres', q{
do $$
begin
  for i in 1..128001 loop
-- use an exception block so that each iteration eats an XID
begin
  insert into test values (i);
exception
  when division_by_zero then null;
end;
  end loop;
end$$;
});

# Attach to the injection point.
$node->safe_psql('postgres',
  "SELECT test_injection_points_attach('ClogGroupCommit', 'wait');");


# First session will get the slru lock and will wait on injection point
my $session1 = $node->background_psql('postgres');

$session1->query_until(
	qr/start/, q(
\echo start
INSERT INTO test VALUES(1);
));

# Create another 4 sessions; they will not get the lock because the first
# session is holding it, so they will all go for the group update.
my $session2 = $node->background_psql('postgres');

$session2->query_until(
	qr/start/, q(
\echo start
INSERT INTO test VALUES(2);
));

my $session3 = $node->background_psql('postgres');

$session3->query_until(
	qr/start/, q(
\echo start
INSERT INTO test VALUES(3);
));

my $session4 = $node->background_psql('postgres');

$session4->query_until(
	qr/start/, q(
\echo start
INSERT INTO test VALUES(4);
));

my $session5 = $node->background_psql('postgres');

$session5->query_until(
	qr/start/, q(
\echo start
INSERT INTO test VALUES(5);
));

# Now wake up the first session and let the next 4 sessions perform the group update
$node->safe_psql('postgres',
  "SELECT test_injection_points_wake();");
$node->safe_psql('postgres',
  "SELECT test_injection_points_detach('ClogGroupCommit');");

done_testing();


Re: Improve eviction algorithm in ReorderBuffer

2023-12-11 Thread Dilip Kumar
On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada  wrote:
>
> Hi all,
>
> As the comment of ReorderBufferLargestTXN() says, it's very slow with
> many subtransactions:
>
> /*
>  * Find the largest transaction (toplevel or subxact) to evict (spill to 
> disk).
>  *
>  * XXX With many subtransactions this might be quite slow, because we'll have
>  * to walk through all of them. There are some options how we could improve
>  * that: (a) maintain some secondary structure with transactions sorted by
>  * amount of changes, (b) not looking for the entirely largest transaction,
>  * but e.g. for transaction using at least some fraction of the memory limit,
>  * and (c) evicting multiple transactions at once, e.g. to free a given 
> portion
>  * of the memory limit (e.g. 50%).
>  */
>
> This is because the reorderbuffer has transaction entries for each
> top-level and sub transaction, and ReorderBufferLargestTXN() walks
> through all transaction entries to pick the transaction to evict.
> I've heard the report internally that replication lag became huge when
> decoding transactions each consisting of 500k sub transactions. Note
> that ReorderBufferLargetstTXN() is used only in non-streaming mode.
>
> Here is a test script for a many subtransactions scenario. In my
> environment, the logical decoding took over 2min to decode one top
> transaction having 100k subtransctions.
>
> -
> create table test (c int);
> create or replace function testfn (cnt int) returns void as $$
> begin
>   for i in 1..cnt loop
> begin
>   insert into test values (i);
> exception when division_by_zero then
>   raise notice 'caught error';
>   return;
> end;
>   end loop;
> end;
> $$
> language plpgsql;
> select testfn(10)
> set logical_decoding_work_mem to '4MB';
> select count(*) from pg_logical_slot_peek_changes('s', null, null)
> 
>
> To deal with this problem, I initially thought of the idea (a)
> mentioned in the comment; use a binary heap to maintain the
> transactions sorted by the amount of changes or the size. But it seems
> not a good idea to try maintaining all transactions by  its size since
> the size of each transaction could be changed frequently.
>
> The attached patch uses a different approach that consists of three
> strategies; (1) maintain the list of transactions whose size is larger
> than 10% of logical_decoding_work_mem, and preferentially evict a
> transaction from this list. If the list is empty, all transactions are
> small enough, (2) so we evict the oldest top-level transaction from
> rb->toplevel_by_lsn list. Evicting older transactions would help in
> freeing memory blocks in GenerationContext. Finally, if this is also
> empty, (3) we evict a transaction that size is > 0. Here, we need to
> note the fact that even if a transaction is evicted the
> ReorderBufferTXN entry is not removed from rb->by_txn but its size is
> 0. In the worst case where all (quite a few) transactions are smaller
> than 10% of the memory limit, we might end up checking many
> transactions to find non-zero size transaction entries to evict. So
> the patch adds a new list to maintain all transactions that have at
> least one change in memory.
>
> Summarizing the algorithm I've implemented in the patch,
>
> 1. pick a transaction from the list of large transactions (larger than
> 10% of memory limit).
> 2. pick a transaction from the top-level transaction list in LSN order.
> 3. pick a transaction from the list of transactions that have at least
> one change in memory.
>
> With the patch, the above test case completed within 3 seconds in my
> environment.

Thanks for working on this.  I think it would be good to test other
scenarios as well, where this might have some negative impact, and see
where we stand.  I mean:
1) A scenario where you have one very large transaction consuming ~40%
of the memory and 5-6 comparatively smaller transactions that are just
above 10% of the memory limit.  Now, to come under the memory limit,
instead of evicting the 1 large transaction once, we end up evicting
multiple times.
2) Another scenario where all the transactions are under 10% of the
memory limit, but say some transactions are consuming around 8-9% of
the memory limit each and are not very old, whereas there are many old
transactions that are fairly small, each consuming under 1% of the
memory limit.  How would it affect things if we frequently select many
of these small transactions to come under the memory limit instead of
selecting a couple of the large transactions consuming 8-9%?

>
> As a side note, the idea (c) mentioned in the comment, evicting
> multiple transactions at once to free a given portion of the memory,
> would also help in avoiding back and forth the memory threshold. It's
> also worth considering.

Yes, I think it is worth considering.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2023-12-11 Thread Dilip Kumar
On Mon, Dec 11, 2023 at 2:21 PM shveta malik  wrote:
>
> On Mon, Dec 11, 2023 at 1:47 PM Dilip Kumar  wrote:
> >
> > On Fri, Dec 8, 2023 at 2:36 PM Amit Kapila  wrote:
> > >
> > > On Wed, Dec 6, 2023 at 4:53 PM shveta malik  
> > > wrote:
> > > >
> > > > PFA v43, changes are:
> > > >
> > >
> > > I wanted to discuss 0003 patch about cascading standby's. It is not
> > > clear to me whether we want to allow physical standbys to further wait
> > > for cascading standby to sync their slots. If we allow such a feature
> > > one may expect even primary to wait for all the cascading standby's
> > > because otherwise still logical subscriber can be ahead of one of the
> > > cascading standby. I feel even if we want to allow such a behaviour we
> > > can do it later once the main feature is committed. I think it would
> > > be good to just allow logical walsenders on primary to wait for
> > > physical standbys represented by GUC 'standby_slot_names'. If we agree
> > > on that then it would be good to prohibit setting this GUC on standby
> > > or at least it should be a no-op even if this GUC should be set on
> > > physical standby.
> > >
> > > Thoughts?
> >
> > IMHO, why not keep the behavior consistent across primary and standby?
> >  I mean if it doesn't require a lot of new code/design addition then
> > it should be the user's responsibility.  I mean if the user has set
> > 'standby_slot_names' on standby then let standby also wait for
> > cascading standby to sync their slots?  Is there any issue with that
> > behavior?
> >
>
> Without waiting for cascading standby on primary, it won't be helpful
> to just wait on standby.
>
> Currently logical walsenders on primary waits for physical standbys to
> take changes before they update their own logical slots. But they wait
> only for their immediate standbys and not for cascading standbys.
> Although, on first standby, we do have logic where slot-sync workers
> wait for cascading standbys before they update their own slots (synced
> ones, see patch3). But this does not guarantee that logical
> subscribers on primary will never be ahead of the cascading standbys.
> Let us consider this timeline:
>
> t1: logical walsender on primary waiting for standby1 (first standby).
> t2: physical walsender on standby1 is stuck and thus there is delay in
> sending these changes to standby2 (cascading standby).
> t3: standby1 has taken changes and sends confirmation to primary.
> t4: logical walsender on primary receives confirmation from standby1
> and updates slot, logical subscribers of primary also receives the
> changes.
> t5: standby2 has not received changes yet as physical walsender on
> standby1 is still stuck, slotsync worker still waiting for standby2
> (cascading) before it updates its own slots (synced ones).
> t6: standby2 is promoted to become primary.
>
> Now we are in a state wherein primary, logical subscriber and first
> standby has some changes but cascading standby does not. And logical
> slots on primary were updated w/o confirming if cascading standby has
> taken changes or not. This is a problem and we do not have a simple
> solution for this yet.

Okay, I think that makes sense.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2023-12-11 Thread Dilip Kumar
On Fri, Dec 8, 2023 at 2:36 PM Amit Kapila  wrote:
>
> On Wed, Dec 6, 2023 at 4:53 PM shveta malik  wrote:
> >
> > PFA v43, changes are:
> >
>
> I wanted to discuss 0003 patch about cascading standby's. It is not
> clear to me whether we want to allow physical standbys to further wait
> for cascading standby to sync their slots. If we allow such a feature
> one may expect even primary to wait for all the cascading standby's
> because otherwise still logical subscriber can be ahead of one of the
> cascading standby. I feel even if we want to allow such a behaviour we
> can do it later once the main feature is committed. I think it would
> be good to just allow logical walsenders on primary to wait for
> physical standbys represented by GUC 'standby_slot_names'. If we agree
> on that then it would be good to prohibit setting this GUC on standby
> or at least it should be a no-op even if this GUC should be set on
> physical standby.
>
> Thoughts?

IMHO, why not keep the behavior consistent across primary and standby?
If it doesn't require a lot of new code or design additions, then it
should be the user's responsibility: if the user has set
'standby_slot_names' on a standby, let that standby also wait for its
cascading standbys to sync their slots.  Is there any issue with that
behavior?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-12-10 Thread Dilip Kumar
On Mon, Dec 11, 2023 at 11:44 AM Dilip Kumar  wrote:
>
> On Tue, Dec 5, 2023 at 11:40 PM Robert Haas  wrote:
> >
> > On Mon, Dec 4, 2023 at 3:58 PM Robert Haas  wrote:
> > > Considering all this, what I'm inclined to do is go and put
> > > UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
> > > accordingly. But first: does anybody see more problems here that I may
> > > have missed?
> >
> > OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a
> > long comment explaining why that's believed to be necessary and
> > sufficient. I committed 0001 and 0002 from the previous series also,
> > since it doesn't seem like anyone has further comments on those
> > renamings.
>
> I have done some testing on standby, but I am facing some issues,
> although things are working fine on the primary.  As shown below test
> [1]standby is reporting some errors that manifest require WAL from
> 0/6F8, but this backup starts at 0/628.  Then I tried to look
> into the manifest file of the full backup and it shows contents as
> below[0].  Actually from this WARNING and ERROR, I am not clear what
> is the problem, I understand that full backup ends at  "0/6F8" so
> for the next incremental backup we should be looking for a summary
> that has WAL starting at "0/6F8" and we do have those WALs.  In
> fact, the error message is saying "this backup starts at 0/628"
> which is before  "0/6F8" so whats the issue?
>
> [0]
> "WAL-Ranges": [
> { "Timeline": 1, "Start-LSN": "0/628", "End-LSN": "0/6F8" }
>
>
> [1]
> -- test on primary
> dilipkumar@dkmac bin % ./pg_basebackup -D d
> dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest
>
> -- cleanup the backup directory
> dilipkumar@dkmac bin % rm -rf d
> dilipkumar@dkmac bin % rm -rf d1
>
> --test on standby
> dilipkumar@dkmac bin % ./pg_basebackup -D d -p 5433
> dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest -p 5433
>
> WARNING:  aborting backup due to backend exiting before pg_backup_stop
> was called
> pg_basebackup: error: could not initiate base backup: ERROR:  manifest
> requires WAL from final timeline 1 ending at 0/6F8, but this
> backup starts at 0/6000028
> pg_basebackup: removing data directory "d1"

Jakub pinged me offlist and pointed me to the thread[1] where this is
already explained, so I think we can ignore it.

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoYuC27_ToGtTTNyHgpn_eJmdqrmhJ93bAbinkBtXsWHaA%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-12-10 Thread Dilip Kumar
On Tue, Dec 5, 2023 at 11:40 PM Robert Haas  wrote:
>
> On Mon, Dec 4, 2023 at 3:58 PM Robert Haas  wrote:
> > Considering all this, what I'm inclined to do is go and put
> > UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
> > accordingly. But first: does anybody see more problems here that I may
> > have missed?
>
> OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a
> long comment explaining why that's believed to be necessary and
> sufficient. I committed 0001 and 0002 from the previous series also,
> since it doesn't seem like anyone has further comments on those
> renamings.

I have done some testing on a standby, but I am facing some issues,
although things work fine on the primary.  As shown in test [1] below,
the standby reports an error that the manifest requires WAL from
0/6F8, but this backup starts at 0/628.  I then looked into the
manifest file of the full backup and it shows the contents in [0]
below.  From this WARNING and ERROR I am not clear what the problem
is: I understand that the full backup ends at "0/6F8", so for the next
incremental backup we should be looking for a summary that has WAL
starting at "0/6F8", and we do have those WALs.  In fact, the error
message says "this backup starts at 0/628", which is before "0/6F8",
so what's the issue?

[0]
"WAL-Ranges": [
{ "Timeline": 1, "Start-LSN": "0/628", "End-LSN": "0/6F8" }


[1]
-- test on primary
dilipkumar@dkmac bin % ./pg_basebackup -D d
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest

-- cleanup the backup directory
dilipkumar@dkmac bin % rm -rf d
dilipkumar@dkmac bin % rm -rf d1

--test on standby
dilipkumar@dkmac bin % ./pg_basebackup -D d -p 5433
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest -p 5433

WARNING:  aborting backup due to backend exiting before pg_backup_stop
was called
pg_basebackup: error: could not initiate base backup: ERROR:  manifest
requires WAL from final timeline 1 ending at 0/6F8, but this
backup starts at 0/628
pg_basebackup: removing data directory "d1"


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Adding facility for injection points (or probe points?) for more advanced tests

2023-12-10 Thread Dilip Kumar
On Tue, Nov 28, 2023 at 4:07 AM Michael Paquier  wrote:
>
> On Mon, Nov 27, 2023 at 12:14:05PM +0530, Ashutosh Bapat wrote:
> > Since you wroten "(still need to improve ...) I thought you are
> > working on v6. No problem. Sorry for the confusion.
>
> I see why my previous message could be confusing.  Sorry about that.

I haven't specifically done a review or testing of this patch, but I
have used it for testing the CLOG group update code with my
SLRU-specific changes, and I found it quite helpful for testing some
of the concurrent areas where you need to stop processing somewhere in
the middle of the code; testing such areas without this kind of
injection point framework is really difficult, or may not even be
possible.  We wanted to test the case of CLOG group update where
multiple processes get added to a single group and have their xid
status updated by the group leader; you can refer to my test in that
thread[1] (the last patch, test_group_commit.patch, uses this
framework for testing).  Overall I feel this framework is quite useful
and easy to use as well.

[1] 
https://www.postgresql.org/message-id/CAFiTN-udSTGG_t5n9Z3eBbb4_%3DzNoKU%2B8FP-S6zpv-r4Gm-Y%2BQ%40mail.gmail.com


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-12-06 Thread Dilip Kumar
On Wed, Dec 6, 2023 at 7:09 PM Tomas Vondra
 wrote:
>
> Yes, if something like this happens, that'd be a problem:
>
> 1) decoding starts, with
>
>SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT
>
> 2) transaction that creates a new refilenode gets decoded, but we skip
>it because we don't have the correct snapshot
>
> 3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT
>
> 4) we decode sequence change from nextval() for the sequence
>
> This would lead to us attempting to apply sequence change for a
> relfilenode that's not visible yet (and may even get aborted).
>
> But can this even happen? Can we start decoding in the middle of a
> transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID,
> which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical
> messages, where we also call the output plugin in non-transactional cases.

It's not a problem for logical messages, because whether a message is
transactional or non-transactional is decided when the message itself
is WAL-logged.  Here, however, our problem starts with deciding
whether the change is transactional or non-transactional: if we insert
the 'relfilenode' into the hash, then the subsequent sequence change
in the same transaction will be considered transactional, otherwise
non-transactional.  And XLOG_HEAP2_NEW_CID just changes
snapshot->curcid, which only affects the catalog visibility of the
upcoming operations in the same transaction; that is not an issue,
because if some of the changes of this transaction are seen while the
snapbuild state is < SNAPBUILD_FULL_SNAPSHOT, then this transaction
has to get committed before the state changes to
SNAPBUILD_CONSISTENT_SNAPSHOT, i.e. the commit LSN of this transaction
is going to be < start_decoding_at.
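
To illustrate the contrast: for logical messages the flag travels in
the WAL record itself, so the decoder never has to infer it (sketch
based on replication/message.h; treat the exact field layout as
approximate):

typedef struct xl_logical_message
{
    Oid         dbId;           /* database the message was emitted from */
    bool        transactional;  /* fixed at WAL-log time, not at decode time */
    Size        prefix_size;    /* length of the null-terminated prefix */
    Size        message_size;   /* size of the payload */
    char        message[FLEXIBLE_ARRAY_MEMBER];
} xl_logical_message;

/* the decode side can simply branch on message->transactional */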

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-12-06 Thread Dilip Kumar
On Wed, Dec 6, 2023 at 7:17 PM Tomas Vondra
 wrote:
>
> On 12/6/23 12:05, Dilip Kumar wrote:
> > On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila  wrote:
> >>
> >>> Why can't we use the same concept of
> >>> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the
> >>> non-transactional changes (have some base snapshot before the first
> >>> change), and whenever there is any catalog change, queue new snapshot
> >>> change also in the queue of the non-transactional sequence change so
> >>> that while sending it to downstream whenever it is necessary we will
> >>> change the historic snapshot?
> >>>
> >>
> >> Oh, do you mean maintain different historic snapshots and then switch
> >> based on the change we are processing? I guess the other thing we need
> >> to consider is the order of processing the changes if we maintain
> >> separate queues that need to be processed.
> >
> > I mean we will not specifically maintain the historic changes, but if
> > there is any catalog change where we are pushing the snapshot to all
> > the transaction's change queue, at the same time we will push this
> > snapshot in the non-transactional sequence queue as well.  I am not
> > sure what is the problem with the ordering? because we will be
> > queueing all non-transactional sequence changes in a separate queue in
> > the order they arrive and as soon as we process the next commit we
> > will process all the non-transactional changes at that time.  Do you
> > see issue with that?
> >
>
> Isn't this (in principle) the idea of queuing the non-transactional
> changes and then applying them on the next commit?

Yes, it is.

> Yes, I didn't get
> very far with that, but I got stuck exactly on tracking which snapshot
> to use, so if there's a way to do that, that'd fix my issue.

Thinking more about the snapshot issue: do we even need to bother
about changing the snapshot at all while streaming the
non-transactional sequence changes, or can we send all the
non-transactional changes with a single snapshot?  The snapshot
logically changes due to 2 events.  Case 1: a transaction that has
performed a catalog operation gets committed (this changes the global
snapshot).  Case 2: within a transaction, there is some catalog change
(this just updates the 'curcid' in the base snapshot of the
transaction).

Now, if we are streaming all the non-transactional sequence changes
right before the next commit, then case 1 does not concern us at all,
because all the changes we have queued so far are from before this
commit.  As for case 2, if we perform any catalog change on the
sequence itself, then the following changes on the same sequence will
be considered transactional; and if the changes are only on some other
catalog (not relevant to our sequence operation), then we need not
worry about the command_id change either, because the visibility of
the catalog lookup for our sequence is unaffected by it.

In short, I am trying to say that we can safely queue the
non-transactional sequence changes and stream them based on the
snapshot we got when we decoded the first change, and as long as we
plan to stream just before the next commit (or the next in-progress
stream), we never need to update the snapshot.
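
A minimal sketch of that queuing scheme (the names are hypothetical,
not from the patch), assuming we keep the snapshot captured at the
first queued change and replay the whole queue just before the next
commit:

typedef struct SeqChangeQueue
{
    Snapshot    snapshot;       /* snapshot at the first queued change */
    List       *changes;        /* queued non-transactional sequence changes */
} SeqChangeQueue;

static void
queue_nontransactional_change(SeqChangeQueue *q, Snapshot snap, void *change)
{
    if (q->changes == NIL)
        q->snapshot = snap;     /* only the first snapshot is needed */
    q->changes = lappend(q->changes, change);
}

/* called right before streaming the next commit */
static void
replay_queued_changes(SeqChangeQueue *q, void (*apply) (Snapshot, void *))
{
    ListCell   *lc;

    foreach(lc, q->changes)
        apply(q->snapshot, lfirst(lc));

    list_free(q->changes);
    q->changes = NIL;
}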

> Also, would this mean we don't need to track the relfilenodes, if we're
> able to query the catalog? Would we be able to check if the relfilenode
> was created by the current xact?

I think by querying the catalog and checking the xmin we should be
able to figure that out, but isn't that costlier than looking up the
relfilenode in the hash?  Just to identify whether a change is
transactional or non-transactional you would have to query the
catalog; that means for each change, before we decide whether to add
it to the transaction's change queue or to the non-transactional
change queue, we would have to query the catalog, i.e. start and stop
a transaction.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-12-06 Thread Dilip Kumar
On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila  wrote:
>
> > Why can't we use the same concept of
> > SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the
> > non-transactional changes (have some base snapshot before the first
> > change), and whenever there is any catalog change, queue new snapshot
> > change also in the queue of the non-transactional sequence change so
> > that while sending it to downstream whenever it is necessary we will
> > change the historic snapshot?
> >
>
> Oh, do you mean maintain different historic snapshots and then switch
> based on the change we are processing? I guess the other thing we need
> to consider is the order of processing the changes if we maintain
> separate queues that need to be processed.

I mean we would not specifically maintain separate historic snapshots;
rather, whenever there is a catalog change for which we push the
snapshot to all the transactions' change queues, we would at the same
time push this snapshot into the non-transactional sequence queue as
well.  I am not sure what the problem with the ordering is, because we
would queue all non-transactional sequence changes in a separate queue
in the order they arrive, and as soon as we process the next commit we
would process all the non-transactional changes at that time.  Do you
see an issue with that?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-12-06 Thread Dilip Kumar
On Wed, Dec 6, 2023 at 11:12 AM Dilip Kumar  wrote:
>
> On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra
>  wrote:
> >

I was also wondering what happens if the sequence changes are
transactional but the snapshot builder state changes to
SNAPBUILD_FULL_SNAPSHOT in between processing of smgr_decode() and
seq_decode(); in that case the RelFileLocator will not be added to the
hash table, and during seq_decode() we will consider the change as
non-transactional.  I haven't fully analyzed what the real problem is
in this case, but have we considered it?  What happens if a
transaction containing both ALTER SEQUENCE and nextval() gets aborted,
but the nextval() has been considered non-transactional because the
smgr_decode() changes were not processed, the snapshot builder state
not yet being SNAPBUILD_FULL_SNAPSHOT?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-12-05 Thread Dilip Kumar
On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra
 wrote:
>

> Some time ago I floated the idea of maybe "queuing" the sequence changes
> and only replay them on the next commit, somehow. But we did ran into
> problems with which snapshot to use, that I didn't know how to solve.
> Maybe we should try again. The idea is we'd queue the non-transactional
> changes somewhere (can't be in the transaction, because we must keep
> them even if it aborts), and then "inject" them into the next commit.
> That'd mean we wouldn't do the separate start/abort for each change.

Why can't we use the same concept as
SnapBuildDistributeNewCatalogSnapshot()?  I mean, we keep queuing the
non-transactional changes (with some base snapshot taken before the
first change), and whenever there is a catalog change, we also queue a
new snapshot change into the queue of non-transactional sequence
changes, so that while sending them downstream we can switch the
historic snapshot whenever necessary.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-29 Thread Dilip Kumar
On Wed, Nov 29, 2023 at 3:29 PM Alvaro Herrera  wrote:
>
> On 2023-Nov-29, tender wang wrote:
>
> > The v8-0001 patch failed to apply in my local repo as below:
> >
> > git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch
> > error: patch failed: src/backend/access/transam/multixact.c:1851
> > error: src/backend/access/transam/multixact.c: patch does not apply
> > error: patch failed: src/backend/access/transam/subtrans.c:184
> > error: src/backend/access/transam/subtrans.c: patch does not apply
> > error: patch failed: src/backend/commands/async.c:117
> > error: src/backend/commands/async.c: patch does not apply
> > error: patch failed: src/backend/storage/lmgr/predicate.c:808
> > error: src/backend/storage/lmgr/predicate.c: patch does not apply
> > error: patch failed: src/include/commands/async.h:15
> > error: src/include/commands/async.h: patch does not apply
>
> Yeah, this patch series conflicts with today's commit 4ed8f0913bfd.

I will send a rebased version by tomorrow.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-23 Thread Dilip Kumar
On Thu, Nov 23, 2023 at 11:34 AM Dilip Kumar  wrote:
>
> Note: With this testing, we have found a bug in the bank-wise
> approach, basically we are clearing a procglobal->clogGroupFirst, even
> before acquiring the bank lock that means in most of the cases there
> will be a single process in each group as a group leader

I realized that the bug fix I did is not proper, so I will send an
updated patch set with the proper fix soon.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-22 Thread Dilip Kumar
On Tue, Nov 21, 2023 at 2:03 PM Dilip Kumar  wrote:
>
> On Mon, Nov 20, 2023 at 4:42 PM Dilip Kumar  wrote:
> >
> > On Mon, Nov 20, 2023 at 2:37 PM Andrey M. Borodin  
> > wrote:
> >
> > > > On 20 Nov 2023, at 13:51, Dilip Kumar  wrote:
> > > >
> > > > 2) Do we really need one separate lwlock tranche for each SLRU?
> > > >
> > > > IMHO if we use the same lwlock tranche then the wait event will show
> > > > the same wait event name, right? And that would be confusing for the
> > > > user, whether we are waiting for Subtransaction or Multixact or
> > > > anything else.  Is my understanding no correct here?
> > >
> > > If we give to a user multiple GUCs to tweak, I think we should give a way 
> > > to understand which GUC to tweak when they observe wait times.
>
> PFA, updated patch set, I have worked on review comments by Alvaro and
> Andrey.  So the only open comments are about clog group commit
> testing, for that my question was as I sent in the previous email
> exactly what part we are worried about in the coverage report?
>
> The second point is, if we want to generate a group update we will
> have to create the injection point after we hold the control lock so
> that other processes go for group update and then for waking up the
> waiting process who is holding the SLRU control lock in the exclusive
> mode we would need to call a function ('test_injection_points_wake()')
> to wake that up and for calling the function we would need to again
> acquire the SLRU lock in read mode for visibility check in the catalog
> for fetching the procedure row and now this wake up session will block
> on control lock for the session which is waiting on injection point so
> now it will create a deadlock.   Maybe with bank-wise lock we can
> create a lot of transaction so that these 2 falls in different banks
> and then we can somehow test this, but then we will have to generate
> 16 * 4096 = 64k transaction so that the SLRU banks are different for
> the transaction which inserted procedure row in system table from the
> transaction in which we are trying to do the group commit

I have attached a POC patch for testing the group update using the
injection point framework.  This is just for testing the group update
part and is not yet a committable test.  I have added a bunch of logs
in the code so that we can see what's going on with the group update.
From the logs below, we can see that multiple processes get
accumulated for the group update and that the leader updates their xid
status.


Note: With this testing, we have found a bug in the bank-wise
approach: basically, we were clearing procglobal->clogGroupFirst even
before acquiring the bank lock, which means in most cases there will
be a single process in each group acting as the group leader (I think
this is what Alvaro was pointing at in his coverage report).  I have
added the fix in this POC just for testing purposes, but in my next
version I will add it to my proper patch set after a proper review and
a bit more testing.
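
For reference, the ordering the fix aims for looks roughly like this
(a simplified sketch, not the actual patch hunk; SimpleLruGetBankLock()
and XactCtl follow the naming used in the patch under discussion):

/* Leader path: take the bank lock *before* detaching the group list. */
lock = SimpleLruGetBankLock(XactCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);

/* Only now claim the whole group, so followers keep piling onto it. */
nextidx = pg_atomic_exchange_u32(&procglobal->clogGroupFirst,
                                 INVALID_PGPROCNO);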


here is the output after running the test
==
2023-11-23 05:55:29.399 UTC [93367] 003_clog_group_commit.pl LOG:
procno 6 got the lock
2023-11-23 05:55:29.399 UTC [93367] 003_clog_group_commit.pl
STATEMENT:  SELECT txid_current();
2023-11-23 05:55:29.406 UTC [93369] 003_clog_group_commit.pl LOG:
statement: SELECT test_injection_points_attach('ClogGroupCommit',
'wait');
2023-11-23 05:55:29.415 UTC [93371] 003_clog_group_commit.pl LOG:
statement: INSERT INTO test VALUES(1);
2023-11-23 05:55:29.416 UTC [93371] 003_clog_group_commit.pl LOG:
procno 4 got the lock
2023-11-23 05:55:29.416 UTC [93371] 003_clog_group_commit.pl
STATEMENT:  INSERT INTO test VALUES(1);
2023-11-23 05:55:29.424 UTC [93373] 003_clog_group_commit.pl LOG:
statement: INSERT INTO test VALUES(2);
2023-11-23 05:55:29.425 UTC [93373] 003_clog_group_commit.pl LOG:
procno 3 for xid 128742 added for group update
2023-11-23 05:55:29.425 UTC [93373] 003_clog_group_commit.pl
STATEMENT:  INSERT INTO test VALUES(2);
2023-11-23 05:55:29.431 UTC [93376] 003_clog_group_commit.pl LOG:
statement: INSERT INTO test VALUES(3);
2023-11-23 05:55:29.438 UTC [93378] 003_clog_group_commit.pl LOG:
statement: INSERT INTO test VALUES(4);
2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl LOG:
procno 2 for xid 128743 added for group update
2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl
STATEMENT:  INSERT INTO test VALUES(3);
2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl LOG:
procno 2 is follower and wait for group leader to update commit status
of xid 128743
2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl
STATEMENT:  INSERT INTO test VALUES(3);
2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl LOG:

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-20 Thread Dilip Kumar
On Mon, Nov 20, 2023 at 2:37 PM Andrey M. Borodin  wrote:

> > On 20 Nov 2023, at 13:51, Dilip Kumar  wrote:
> >
> > 2) Do we really need one separate lwlock tranche for each SLRU?
> >
> > IMHO if we use the same lwlock tranche then the wait event will show
> > the same wait event name, right? And that would be confusing for the
> > user, whether we are waiting for Subtransaction or Multixact or
> > anything else.  Is my understanding no correct here?
>
> If we give to a user multiple GUCs to tweak, I think we should give a way to 
> understand which GUC to tweak when they observe wait times.

+1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-20 Thread Dilip Kumar
On Fri, Nov 17, 2023 at 7:28 PM Alvaro Herrera  wrote:
>
> On 2023-Nov-17, Dilip Kumar wrote:

I think I need some more clarification for some of the review comments

> > On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera  
> > wrote:
> > >
> > > I just noticed that 0003 does some changes to
> > > TransactionGroupUpdateXidStatus() that haven't been adequately
> > > explained AFAICS.  How do you know that these changes are safe?
> >
> > IMHO this is safe as well as logical to do w.r.t. performance.  It's
> > safe because whenever we are updating any page in a group we are
> > acquiring the respective bank lock in exclusive mode and in extreme
> > cases if there are pages from different banks then we do switch the
> > lock as well before updating the pages from different groups.
>
> Looking at the coverage for this code,
> https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413
> it seems in our test suites we never hit the case where there is
> anything in the "nextidx" field for commit groups.

1)
I was looking into your coverage report, and I have attached a
screenshot from it.  It seems we do hit the block where nextidx is not
INVALID_PGPROCNO, which means there is some process other than the
group leader.  I have already started exploring the injection points,
but I just wanted to be sure what your main concern about the coverage
is, so I thought of checking that first.

 470             :     /*
 471             :      * If the list was not empty, the leader will update the status of our
 472             :      * XID. It is impossible to have followers without a leader because the
 473             :      * first process that has added itself to the list will always have
 474             :      * nextidx as INVALID_PGPROCNO.
 475             :      */
 476          98 :     if (nextidx != INVALID_PGPROCNO)
 477             :     {
 478           2 :         int         extraWaits = 0;
 479             :
 480             :         /* Sleep until the leader updates our XID status. */
 481           2 :         pgstat_report_wait_start(WAIT_EVENT_XACT_GROUP_UPDATE);
 482             :         for (;;)
 483             :         {
 484             :             /* acts as a read barrier */
 485           2 :             PGSemaphoreLock(proc->sem);
 486           2 :             if (!proc->clogGroupMember)
 487           2 :                 break;
 488           0 :             extraWaits++;
 489             :         }

2) Do we really need one separate lwlock tranche for each SLRU?

IMHO if we use the same lwlock tranche then the wait event will show
the same wait event name for all of them, right?  That would be
confusing for the user, who could not tell whether we are waiting for
Subtransaction or Multixact or anything else.  Is my understanding not
correct here?
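
A small sketch of what per-SLRU tranches buy us (illustrative only;
the array and variable names are made up, and in core the tranche IDs
would come from BuiltinTrancheIds rather than LWLockNewTrancheId()):

/* One tranche per SLRU, so pg_stat_activity shows distinct wait events. */
int     xact_tranche = LWLockNewTrancheId();
int     subtrans_tranche = LWLockNewTrancheId();

LWLockRegisterTranche(xact_tranche, "XactSLRU");
LWLockRegisterTranche(subtrans_tranche, "SubtransSLRU");

for (int i = 0; i < nbanks; i++)
{
    LWLockInitialize(&xact_bank_locks[i].lock, xact_tranche);
    LWLockInitialize(&subtrans_bank_locks[i].lock, subtrans_tranche);
}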

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-19 Thread Dilip Kumar
On Sun, Nov 19, 2023 at 12:39 PM Andrey M. Borodin  wrote:
>
> I’ve skimmed through the patch set. Here are some minor notes.

Thanks for the review
>
> 1. Cycles “for (slotno = bankstart; slotno < bankend; slotno++)” in 
> SlruSelectLRUPage() and SimpleLruReadPage_ReadOnly() now have identical 
> comments. I think a little of copy-paste is OK.
> But SimpleLruReadPage_ReadOnly() does pgstat_count_slru_page_hit(), while 
> SlruSelectLRUPage() does not. This is not related to the patch set, just a 
> code nearby.

Do you mean to say we need to modify the comments, or are you saying
that pgstat_count_slru_page_hit() is missing in SlruSelectLRUPage()?
If it is the latter, I can see that the caller of SlruSelectLRUPage()
is calling pgstat_count_slru_page_hit() as well as SlruRecentlyUsed().

> 2. Do we really want these functions doing all the same?
> extern bool check_multixact_offsets_buffers(int *newval, void 
> **extra,GucSource source);
> extern bool check_multixact_members_buffers(int *newval, void 
> **extra,GucSource source);
> extern bool check_subtrans_buffers(int *newval, void **extra,GucSource 
> source);
> extern bool check_notify_buffers(int *newval, void **extra, GucSource source);
> extern bool check_serial_buffers(int *newval, void **extra, GucSource source);
> extern bool check_xact_buffers(int *newval, void **extra, GucSource source);
> extern bool check_commit_ts_buffers(int *newval, void **extra,GucSource 
> source);

I tried to avoid duplicating these by doing all the work inside the
check_slru_buffers() function.  But I think it is hard to make them a
single function, because there is no way to pass the SLRU name to the
GUC check hook, and IMHO in the check hook we need to print the GUC
name.  Any suggestions on how we can avoid having so many functions?
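
For illustration, the thin-wrapper shape I ended up with looks roughly
like this (simplified; the divisor and the GUC names are just
examples, not the exact ones from the patch):

/* Common worker: validate one SLRU buffer GUC and report it by name. */
static bool
check_slru_buffers(const char *name, int *newval)
{
    /* assume buffer counts must be a multiple of the bank size, e.g. 16 */
    if (*newval % 16 == 0)
        return true;

    GUC_check_errdetail("\"%s\" must be a multiple of 16.", name);
    return false;
}

/* The per-GUC hooks exist only to supply the GUC name. */
bool
check_multixact_offsets_buffers(int *newval, void **extra, GucSource source)
{
    return check_slru_buffers("multixact_offsets_buffers", newval);
}

bool
check_subtrans_buffers(int *newval, void **extra, GucSource source)
{
    return check_slru_buffers("subtrans_buffers", newval);
}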

> 3. The name SimpleLruGetSLRUBankLock() contains meaning of SLRU twice. I’d 
> suggest truncating prefix of infix.
>
> I do not have hard opinion on any of this items.
>

I prefer SimpleLruGetBankLock() so that it is consistent with the
other external functions starting with "SimpleLruGet".  Are you fine
with this name?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-18 Thread Dilip Kumar
On Fri, Nov 17, 2023 at 6:16 PM Alvaro Herrera  wrote:

Thanks for the review.  All the comments look fine to me; I am
replying to those that need some clarification.

> I wonder what's the deal with false sharing in the new
> bank_cur_lru_count array.  Maybe instead of using LWLockPadded for
> bank_locks, we should have a new struct, with both the LWLock and the
> LRU counter; then pad *that* to the cacheline size.  This way, both the
> lwlock and the counter come to the CPU running this code together.

Actually, the array lengths of both LWLock and the LRU counter are
different so I don't think we can move them to a common structure.
The length of the *buffer_locks array is equal to the number of slots,
the length of the *bank_locks array is Min (number_of_banks, 128), and
the length of the *bank_cur_lru_count array is number_of_banks.

> Looking at the coverage for this code,
> https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413
> it seems in our test suites we never hit the case where there is
> anything in the "nextidx" field for commit groups.  To be honest, I
> don't understand this group stuff, and so I'm doubly hesitant to go
> without any testing here.  Maybe it'd be possible to use Michael
> Paquier's injection points somehow?

Sorry, but I am not aware of "Michael Paquier's injection points".
Is it something already in the repo?  Can you point me to some example
test cases if we already have them?  Then I will try it out.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-16 Thread Dilip Kumar
On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera  wrote:
>
> I just noticed that 0003 does some changes to
> TransactionGroupUpdateXidStatus() that haven't been adequately
> explained AFAICS.  How do you know that these changes are safe?

IMHO this is safe as well as logical to do w.r.t. performance.  It's
safe because whenever we update any page in a group we acquire the
respective bank lock in exclusive mode, and in the extreme case where
the group contains pages from different banks we switch the lock
before updating the pages from the other bank.  And we do not wake any
process in a group until we have done the status update for all the
processes, so there cannot be any race condition either.  It should
not affect performance adversely, and it does not remove the need for
group updates.  The main use case of the group update is to optimize
the situation where most of the processes are contending for status
updates on the same page; processes waiting for status updates on
different pages will go to different groups w.r.t. that page.  So, in
short, on a best-effort basis we try to put into one group the
processes that are waiting to update the same clog page, which means
logically all the processes in the group will be waiting on the same
bank lock.  In the extreme situation where processes in a group are
trying to update different pages, or even pages from different banks,
we handle it well by switching the lock.  Someone may raise the
concern that, when there are processes waiting for different bank
locks, we could wake those processes up after releasing one lock; I
think that is not required, because that is exactly the situation we
are trying to avoid (processes updating different pages ending up in
the same group), so there is no point in adding complexity to optimize
that case.
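
To make the lock-switching part concrete, here is a simplified sketch
of the leader's loop (not the actual patch; the page array and helper
names are illustrative, while SimpleLruGetBankLock() is the bank-lock
accessor from the patch):

/* Update the status of all pages collected for this group. */
LWLock     *curlock = NULL;

for (int i = 0; i < npages; i++)
{
    LWLock     *banklock = SimpleLruGetBankLock(XactCtl, pagenos[i]);

    /* Switch locks only when the next page lives in a different bank. */
    if (banklock != curlock)
    {
        if (curlock != NULL)
            LWLockRelease(curlock);
        LWLockAcquire(banklock, LW_EXCLUSIVE);
        curlock = banklock;
    }

    /* ... set the xid status on pagenos[i] while holding its bank lock ... */
}

if (curlock != NULL)
    LWLockRelease(curlock);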


> 0001 contains one typo in the docs, "cotents".
>
> I'm not a fan of the fact that some CLOG sizing macros moved to clog.h,
> leaving others in clog.c.  Maybe add commentary cross-linking both.
> Alternatively, perhaps allowing xact_buffers to grow beyond 65536 up to
> the slru.h-defined limit of 131072 is not that bad, even if it's more
> than could possibly be needed for xact_buffers; nobody is going to use
> 64k buffers, since useful values are below a couple thousand anyhow.

I agree that allowing xact_buffers to grow beyond 65536, up to the
slru.h-defined limit of 131072, is not that bad, so I will change that
in the next version.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-11-14 Thread Dilip Kumar
On Tue, Nov 14, 2023 at 2:10 AM Robert Haas  wrote:
>
> On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera  
> wrote:
> > Great stuff you got here.  I'm doing a first pass trying to grok the
> > whole thing for more substantive comments, but in the meantime here are
> > some cosmetic ones.
>
> Thanks, thanks, and thanks.
>
> I've fixed some things that you mentioned in the attached version.
> Other comments below.

Here are some more comments based on what I have read so far, mostly
cosmetics comments.

1.
+ * summary file yet, then stoppng doesn't make any sense, and we
+ * should wait until the next stop point instead.

Typo /stoppng/stopping

2.
+ /* Close temporary file and shut down xlogreader. */
+ FileClose(io.file);
+

We have already freed the xlogreader so the second part of the comment
is not valid.

3.
+ /*
+ * If a relation fork is truncated on disk, there is in point in
+ * tracking anything about block modifications beyond the truncation
+ * point.


Typo. /there is in point/ there is no point

4.
+/*
+ * Special handling for WAL recods with RM_XACT_ID.
+ */

/recods/records

5.

+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, );
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, [i],
+forknum, 0);
+ }
+ }

For SmgrCreate and Truncate I understand setting the 'limit block' but
why for commit/abort?  I think it would be better to add some comments
here.

6.
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)

The comment says "private_data->read_upto to the lowest LSN that is
not known to be safe", but is it really the lowest LSN?  I think it is
the highest LSN that is known to be safe for that TLI, no?

7.
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();

I was just wondering whether removing old summaries should be the job
of the WAL summarizer or of the checkpointer; I mean, while removing
the old WAL the checkpointer could also check for and remove the old
summaries.  Anyway, it's just a question, and I do not have a strong
opinion on this.

8.
+ /*
+ * Whether we we removed the file or not, we need not consider it
+ * again.
+ */

Typo /Whether we we removed/ Whether we removed

9.
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+   ForkNumber forknum, BlockNumber *limit_block)

If this function already returns a 'BlockRefTableEntry', then why
does it need to set an extra '*limit_block' out parameter, which it
actually reads from the entry itself?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-11-13 Thread Dilip Kumar
On Tue, Nov 14, 2023 at 12:52 AM Robert Haas  wrote:
>
> On Fri, Nov 10, 2023 at 6:27 AM Dilip Kumar  wrote:
> > - I think 0001 looks good improvement irrespective of the patch series.
>
> OK, perhaps that can be independently committed, then, if nobody objects.
>
> Thanks for the review; I've fixed a bunch of things that you
> mentioned. I'll just comment on the ones I haven't yet done anything
> about below.
>
> > 2.
> > +  > xreflabel="wal_summarize_keep_time">
> > +  wal_summarize_keep_time 
> > (boolean)
> > +  
> > +   wal_summarize_keep_time
> > configuration parameter
> > +  
> >
> > I feel the name of the guy should be either wal_summarizer_keep_time
> > or wal_summaries_keep_time, I mean either we should refer to the
> > summarizer process or to the way summaries files.
>
> How about wal_summary_keep_time?

Yes, that looks perfect to me.

> > 6.
> > + * If the whole range of LSNs is covered, returns true, otherwise false.
> > + * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
> > + * if there are no WAL summary files in the input list, or to the first LSN
> > + * in the range that is not covered by a WAL summary file in the input 
> > list.
> > + */
> > +bool
> > +WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
> >
> > I did not see the usage of this function, but I think if the whole
> > range is not covered why not keep the behavior uniform w.r.t. what we
> > set for '*missing_lsn',  I mean suppose there is no file then
> > missing_lsn is the start_lsn because a very first LSN is missing.
>
> It's used later in the patch series. I think the way that I have it
> makes for a more understandable error message.

Okay

> > 8.
> > +/*
> > + * Comparator to sort a List of WalSummaryFile objects by start_lsn.
> > + */
> > +static int
> > +ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
> > +{
>
> I'm not sure what needs fixing here.

I think I copy-pasted it by mistake, just ignore it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-11-10 Thread Dilip Kumar
On Tue, Nov 7, 2023 at 2:06 AM Robert Haas  wrote:
>
> On Mon, Oct 30, 2023 at 2:46 PM Andres Freund  wrote:
> > After playing with this for a while, I don't see a reason for 
> > wal_summarize_mb
> > from a memory usage POV at least.
>
> Here's v8. Changes:

Review comments, based on what I reviewed so far.

- I think 0001 looks good improvement irrespective of the patch series.

- review 0003
1.
+   be enabled either on a primary or on a standby. WAL summarization can
+   cannot be enabled when wal_level is set to
+   minimal.

Grammatical error
"WAL summarization can cannot" -> WAL summarization cannot

2.
+ 
+  wal_summarize_keep_time (boolean)
+  
+   wal_summarize_keep_time
configuration parameter
+  

I feel the name of the GUC should be either wal_summarizer_keep_time
or wal_summaries_keep_time; I mean, we should refer either to the
summarizer process or to the WAL summary files.

3.

+XLogGetOldestSegno(TimeLineID tli)
+{
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, _tli, _segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */

Some of the single-line comments end with a full stop whereas others
do not, so better to be consistent.

4.

+ * If start_lsn != InvalidXLogRecPtr, only summaries that end before the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)


Instead of "If start_lsn != InvalidXLogRecPtr, only summaries that end
before the" it should be "If start_lsn != InvalidXLogRecPtr, only
summaries that end after the" because only if the summary files are
Ending after the start_lsn then it will have some overlapping and we
need to return them if ending before start lsn then those files are
not overlapping at all, right?
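To be concrete, the overlap test I have in mind is roughly the below (just a
sketch, assuming the list entries are WalSummaryFile with start_lsn/end_lsn
fields; not the patch's actual code):

    /* keep ws only if it overlaps the requested [start_lsn, end_lsn) range */
    if ((XLogRecPtrIsInvalid(start_lsn) || ws->end_lsn > start_lsn) &&
        (XLogRecPtrIsInvalid(end_lsn) || ws->start_lsn < end_lsn))
        result = lappend(result, ws);

i.e. a file whose end_lsn is <= start_lsn cannot overlap the range, so the
header comment should talk about summaries ending after start_lsn.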

5.
In the FilterWalSummaries() header the comment is wrong in the same
way as for the GetWalSummaries() function.

6.
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,

I did not see the usage of this function, but if the whole range is
not covered, why not keep the behavior uniform w.r.t. what we set for
'*missing_lsn'?  I mean, suppose there is no file at all, then
missing_lsn should be start_lsn, because the very first LSN of the
range is missing.
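Something like the below, early in the function, is what I am suggesting
(just a sketch, not the patch's code):

    if (wslist == NIL)
    {
        /* no WAL summary file at all, so even the very first LSN is missing */
        *missing_lsn = start_lsn;
        return false;
    }

That way the caller can treat *missing_lsn uniformly whether the list is
empty or only partially covers the range.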

7.
+ nbytes = FileRead(io->file, data, length, io->filepos,
+   WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file;

/could not write file/ could not read file

8.
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-09 Thread Dilip Kumar
On Thu, Nov 9, 2023 at 4:55 PM Alvaro Herrera  wrote:
>
> IMO the whole area of SLRU buffering is in horrible shape and many users
> are struggling with overall PG performance because of it.  An
> improvement doesn't have to be perfect -- it just has to be much better
> than the current situation, which should be easy enough.  We can
> continue to improve later, using more scalable algorithms or ones that
> allow us to raise the limits higher.

I agree with this.

> The only point on which we do not have full consensus yet is the need to
> have one GUC per SLRU, and a lot of effort seems focused on trying to
> fix the problem without adding so many GUCs (for example, using shared
> buffers instead, or use a single "scaling" GUC).  I think that hinders
> progress.  Let's just add multiple GUCs, and users can leave most of
> them alone and only adjust the one with which they have a performance
> problems; it's not going to be the same one for everybody.

+1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-09 Thread Dilip Kumar
On Thu, Nov 9, 2023 at 9:39 PM Robert Haas  wrote:
>
> On Wed, Nov 8, 2023 at 6:41 AM Dilip Kumar  wrote:
> > Here is the updated version of the patch, here I have taken the
> > approach suggested by Andrey and I discussed the same with Alvaro
> > offlist and he also agrees with it.  So the idea is that we will keep
> > the bank size fixed which is 16 buffers per bank and the allowed GUC
> > value for each slru buffer must be in multiple of the bank size.  We
> > have removed the centralized lock but instead of one lock per bank, we
> > have kept the maximum limit on the number of bank locks which is 128.
> > We kept the max limit as 128 because, in one of the operations (i.e.
> > ActivateCommitTs), we need to acquire all the bank locks (but this is
> > not a performance path at all) and at a time we can acquire a max of
> > 200 LWlocks, so we think this limit of 128 is good.  So now if the
> > number of banks is <= 128 then we will be using one lock per bank
> > otherwise the one lock may protect access of buffer in multiple banks.
>
> Just so I understand, I guess this means that an SLRU is limited to
> 16*128 = 2k buffers = 16MB?

Not really, because 128 is the maximum limit on the number of bank
locks, not on the number of banks.  So for example, if you have 16*128
= 2k buffers then each lock will protect one bank, and likewise when
you have 16*512 = 8k buffers then each lock will protect 4 banks.
So in short, we can get the lock for each bank by a simple computation
(banklockno = bankno % 128).
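Spelling out the computation (a sketch only; the constant and variable names
here are just for illustration, not necessarily what the patch uses):

    #define SLRU_BANK_SIZE      16      /* buffers per bank, fixed */
    #define SLRU_MAX_BANKLOCKS  128     /* upper limit on the number of bank locks */

    int     bankno = pageno % nbanks;                   /* bank that can hold this page */
    int     banklockno = bankno % SLRU_MAX_BANKLOCKS;   /* lock protecting that bank */
    LWLock *banklock = &shared->bank_locks[banklockno].lock;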

> When we were talking about this earlier, I suggested fixing the number
> of banks and allowing the number of buffers per bank to scale
> depending on the setting. That seemed simpler than allowing both the
> number of banks and the number of buffers to vary, and it might allow
> the compiler to optimize some code better, by converting a calculation
> like page_no%number_of_banks into a masking operation like page_no&0xf
> or whatever. However, because it allows an individual bank to become
> arbitrarily large, it more or less requires us to use a buffer mapping
> table. Some of the performance problems mentioned could be alleviated
> by omitting the hash table when the number of buffers per bank is
> small, and we could also create the dynahash with a custom hash
> function that just does modular arithmetic on the page number rather
> than a real hashing operation. However, maybe we don't really need to
> do any of that. I agree that dynahash is clunky on a good day. I
> hadn't realized the impact would be so noticeable.

Yes.  One idea is to keep the number of banks fixed; with that, as you
pointed out, the bank size can be quite big when there is a large
number of buffers, and for that we would need a hash table.  OTOH,
what I am doing here is keeping the bank size fixed and small (16
buffers per bank), and with that we can have a large number of banks
when the buffer pool size is quite big.  But I feel having more banks
is not really a problem as long as we cap the number of locks, because
in some corner cases we need to acquire all the locks together and
there is a limit on how many we can hold.  So I like this idea of
sharing locks across the banks, because with that 1) we can have
enough locks that lock contention or cache invalidation due to a
common lock should not be a problem anymore, 2) we can keep a small
bank size so that the sequential search within a bank is quite fast,
so reads are fast, and 3) with a small bank size the victim buffer
search, which has to be sequential, is also quite fast.
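Just to illustrate 2) and 3): with a 16-buffer bank, both the read lookup and
the victim search are short loops confined to one bank, something like the
below (a simplified sketch, assuming page_lru_count[] stores the bank's usage
counter at the last access; this is not the actual slru.c code):

    int     bankstart = bankno * SLRU_BANK_SIZE;
    int     bestslot = bankstart;

    for (int slotno = bankstart; slotno < bankstart + SLRU_BANK_SIZE; slotno++)
    {
        if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
            return slotno;          /* free slot, nothing to evict */
        if (shared->page_lru_count[slotno] < shared->page_lru_count[bestslot])
            bestslot = slotno;      /* oldest access so far, best victim candidate */
    }
    return bestslot;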

> This proposal takes the opposite approach of fixing the number of
> buffers per bank, letting the number of banks vary. I think that's
> probably fine, although it does reduce the effective associativity of
> the cache. If there are more hot buffers in a bank than the bank size,
> the bank will be contended, even if other banks are cold. However,
> given the way SLRUs are accessed, it seems hard to imagine this being
> a real problem in practice. There aren't likely to be say 20 hot
> buffers that just so happen to all be separated from one another by a
> number of pages that is a multiple of the configured number of banks.
> And in the seemingly very unlikely event that you have a workload that
> behaves like that, you could always adjust the number of banks up or
> down by one, and the problem would go away. So this seems OK to me.

I agree with this

> I also agree with a couple of points that Alvaro made, specifically
> that (1) this doesn't have to be perfect, just better than now and (2)
> separate GUCs for each SLRU is fine. On the latter point, it's worth
> keeping in mind that the cost of a GUC that most people don't need to
> tune is fairly low. GUCs like work_mem and shared_buffers are
> "expensive" because every

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-07 Thread Dilip Kumar
On Wed, Nov 8, 2023 at 10:52 AM Amul Sul  wrote:

Thanks for the review, Amul.

> Here are some minor comments:
>
> + * By default, we'll use 1MB of for every 1GB of shared buffers, up to the
> + * maximum value that slru.c will allow, but always at least 16 buffers.
>   */
>  Size
>  CommitTsShmemBuffers(void)
>  {
> -   return Min(256, Max(4, NBuffers / 256));
> +   /* Use configured value if provided. */
> +   if (commit_ts_buffers > 0)
> +   return Max(16, commit_ts_buffers);
> +   return Min(256, Max(16, NBuffers / 256));
>
> Do you mean "4MB of for every 1GB"  in the comment?

You are right

> --
>
> diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
> index 5087cdce51..78d017ad85 100644
> --- a/src/include/access/commit_ts.h
> +++ b/src/include/access/commit_ts.h
> @@ -16,7 +16,6 @@
>  #include "replication/origin.h"
>  #include "storage/sync.h"
>
> -
>  extern PGDLLIMPORT bool track_commit_timestamp;
>
> A spurious change.

Will fix

> --
>
> @@ -168,10 +180,19 @@ SimpleLruShmemSize(int nslots, int nlsns)
>
> if (nlsns > 0)
> sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));/* 
> group_lsn[] */
> -
> return BUFFERALIGN(sz) + BLCKSZ * nslots;
>  }
>
> Another spurious change in 0002 patch.

Will fix

> --
>
> +/*
> + * The slru buffer mapping table is partitioned to reduce contention. To
> + * determine which partition lock a given pageno requires, compute the 
> pageno's
> + * hash code with SlruBufTableHashCode(), then apply SlruPartitionLock().
> + */
>
> I didn't see SlruBufTableHashCode() & SlruPartitionLock() functions anywhere 
> in
> your patches, is that outdated comment?

Yes, will fix it; actually, there are some major design changes to this.

> --
>
> -   sz += MAXALIGN(nslots * sizeof(LWLockPadded));  /* buffer_locks[] */
> -   sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded)); /* 
> part_locks[] */
> +   sz += MAXALIGN((nslots + SLRU_NUM_PARTITIONS) * sizeof(LWLockPadded));  
> /* locks[] */
>
> I am a bit uncomfortable with these changes, merging parts and buffer locks
> making it hard to understand the code. Not sure what we were getting out of
> this?

Yes, even I do not like this much because it is confusing.  But the
advantage is that we use a single pointer for the locks, which means
the next variable, the LRU counter, comes in the same cacheline, and
frequent updates of the LRU counter should benefit from that.
Although I don't have any numbers that prove this.  Currently, I want
to focus on all the base patches and keep this patch as an add-on, and
later, if we find it useful and want to pursue it, we will see how to
make it more readable.
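To make the cacheline point a bit more concrete, the idea is roughly the
layout below (only a sketch to show the intent; the field names are
illustrative and not the exact struct from the patch):

    typedef struct SlruSharedData
    {
        LWLockPadded *locks;            /* bank locks followed by buffer locks,
                                         * a single pointer instead of two */
        int           cur_lru_count;    /* hot LRU counter; being adjacent to the
                                         * pointer above, it is more likely to sit
                                         * in the same cacheline */
        /* ... remaining members unchanged ... */
    } SlruSharedData;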


>
> Subject: [PATCH v4 5/5] Ensure slru buffer slots are in multiple of numbe of
>  partitions
>
> I think the 0005 patch can be merged to 0001.

Yeah, in the next version it is done that way.  Planning to post by the end of the day.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-06 Thread Dilip Kumar
On Mon, Nov 6, 2023 at 1:05 PM Andrey M. Borodin  wrote:

> > On 6 Nov 2023, at 09:09, Dilip Kumar  wrote:
> >
> >
> >> Having hashtable to find SLRU page in the buffer IMV is too slow. Some 
> >> comments on this approach can be found here [0].
> >> I'm OK with having HTAB for that if we are sure performance does not 
> >> degrade significantly, but I really doubt this is the case.
> >> I even think SLRU buffers used HTAB in some ancient times, but I could not 
> >> find commit when it was changed to linear search.
> >
> > The main intention of having this buffer mapping hash is to find the
> > SLRU page faster than sequence search when banks are relatively bigger
> > in size, but if we find the cases where having hash creates more
> > overhead than providing gain then I am fine to remove the hash because
> > the whole purpose of adding hash here to make the lookup faster.  So
> > far in my test I did not find the slowness.  Do you or anyone else
> > have any test case based on the previous research on whether it
> > creates any slowness?
> PFA test benchmark_slru_page_readonly(). In this test we run 
> SimpleLruReadPage_ReadOnly() (essential part of TransactionIdGetStatus())
> before introducing HTAB for buffer mapping I get
> Time: 14837.851 ms (00:14.838)
> with buffer HTAB I get
> Time: 22723.243 ms (00:22.723)
>
> This hash table makes getting transaction status ~50% slower.
>
> Benchmark script I used:
> make -C $HOME/postgresMX -j 8 install && (pkill -9 postgres; rm -rf test; 
> ./initdb test && echo "shared_preload_libraries = 'test_slru'">> 
> test/postgresql.conf && ./pg_ctl -D test start && ./psql -c 'create extension 
> test_slru' postgres && ./pg_ctl -D test restart && ./psql -c "SELECT 
> count(test_slru_page_write(a, 'Test SLRU'))
>   FROM generate_series(12346, 12393, 1) as a;" -c '\timing' -c "SELECT 
> benchmark_slru_page_readonly(12377);" postgres)

With this test, I got below numbers,

nslots   no-hash   hash
8        10s       13s
16       10s       13s
32       15s       13s
64       17s       13s

Yeah, so we can see that with a small bank size (<= 16 slots) fetching
a page with the hash is ~30% slower than the sequential search, but
beyond 32 slots the sequential search becomes slower as you grow the
number of slots, whereas with the hash it stays constant as expected.
But now, as you said, if we keep the lock partition range different
from the bank size, then we might not have any problem with having a
larger number of banks, and with that we can keep the bank size small,
like 16.  Let me put some more thought into this and get back.
Any other opinions on this?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-05 Thread Dilip Kumar
On Sun, Nov 5, 2023 at 1:37 AM Andrey M. Borodin  wrote:

> On 30 Oct 2023, at 09:20, Dilip Kumar  wrote:
>
> changed the logic of SlruAdjustNSlots() in 0002, such that now it
> starts with the next power of 2 value of the configured slots and
> keeps doubling the number of banks until we reach the number of banks
> to the max SLRU_MAX_BANKS(128) and bank size is bigger than
> SLRU_MIN_BANK_SIZE (8).  By doing so, we will ensure we don't have too
> many banks
>
> There was nothing wrong with having too many banks. Until bank-wise locks and 
> counters were added in later patchsets.

I agree with that, but I feel that with bank-wise locks we are
removing major contention from the centralized control lock, and we
can see from my first email how much benefit we can get in one of the
simple test cases where we create subtransaction overflow.

> Having hashtable to find SLRU page in the buffer IMV is too slow. Some 
> comments on this approach can be found here [0].
> I'm OK with having HTAB for that if we are sure performance does not degrade 
> significantly, but I really doubt this is the case.
> I even think SLRU buffers used HTAB in some ancient times, but I could not 
> find commit when it was changed to linear search.

The main intention of having this buffer mapping hash is to find the
SLRU page faster than a sequential search when banks are relatively
big, but if we find cases where having the hash creates more overhead
than gain then I am fine with removing the hash, because the whole
purpose of adding the hash here is to make the lookup faster.  So far
in my tests I did not find any slowness.  Do you or anyone else have
any test case, based on the previous research, that shows whether it
creates any slowness?

> Maybe we could decouple locks and counters from SLRU banks? Banks were meant 
> to be small to exploit performance of local linear search. Lock partitions 
> have to be bigger for sure.

Yeah, that could also be an idea if we plan to drop the hash.  I mean,
a bank-wise counter is fine since we are finding a victim buffer
within a bank itself, but each lock could cover more slots than one
bank size, or in other words, it can protect multiple banks.  Let's
hear more opinions on this.

>
> On 30 Oct 2023, at 09:20, Dilip Kumar  wrote:
>
> I have taken 0001 and 0002 from [1], done some bug fixes in 0001
>
>
> BTW can you please describe in more detail what kind of bugs?

Yeah, actually that patch was using the same GUC
(multixact_offsets_buffers) in SimpleLruInit for MultiXactOffsetCtl as
well as for  MultiXactMemberCtl, see the below patch snippet from the
original patch.

@@ -1851,13 +1851,13 @@ MultiXactShmemInit(void)
  MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;

  SimpleLruInit(MultiXactOffsetCtl,
-   "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
+   "MultiXactOffset", multixact_offsets_buffers, 0,
MultiXactOffsetSLRULock, "pg_multixact/offsets",
LWTRANCHE_MULTIXACTOFFSET_BUFFER,
SYNC_HANDLER_MULTIXACT_OFFSET);
  SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
  SimpleLruInit(MultiXactMemberCtl,
-   "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
+   "MultiXactMember", multixact_offsets_buffers, 0,
MultiXactMemberSLRULock, "pg_multixact/members",
LWTRANCHE_MULTIXACTMEMBER_BUFFER,
SYNC_HANDLER_MULTIXACT_MEMBER);


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

2023-11-04 Thread Dilip Kumar
On Fri, Jan 20, 2023 at 2:04 PM David Geier  wrote:
>
> Hi hackers,
>
> EXPLAIN ANALYZE for parallel Bitmap Heap Scans currently only reports
> the number of heap blocks processed by the leader. It's missing the
> per-worker stats. The attached patch adds that functionality in the
> spirit of e.g. Sort or Memoize. Here is a simple test case and the
> EXPLAIN ANALYZE output with and without the patch:
>

> With the patch:
>
>   Gather (actual rows=99501 loops=1)
> Workers Planned: 2
> Workers Launched: 2
> ->  Parallel Bitmap Heap Scan on foo (actual rows=33167 loops=3)
>   Recheck Cond: ((col0 > 900) OR (col1 = 1))
>   Heap Blocks: exact=98
>   Worker 0:  Heap Blocks: exact=171 lossy=0
>   Worker 1:  Heap Blocks: exact=172 lossy=0


else
  {
+ if (planstate->stats.exact_pages > 0)
+appendStringInfo(es->str, " exact=%ld", planstate->stats.exact_pages);
+ if (planstate->stats.lossy_pages > 0)
+ appendStringInfo(es->str, " lossy=%ld", planstate->stats.lossy_pages);
  appendStringInfoChar(es->str, '\n');
  }
  }

+ for (int n = 0; n < planstate->shared_info->num_workers; n++)
+ {

+ "Heap Blocks: exact="UINT64_FORMAT" lossy=" INT64_FORMAT"\n", +
si->exact_pages, si->lossy_pages);

Shouldn't we use the same format for reporting exact and lossy pages
for the leader backend and the workers?  I mean, here for the backend
you are showing lossy pages only when the count is > 0, whereas for
the workers we are showing 0 lossy pages as well.
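I mean, for the per-worker lines we could do the same thing, something like
the below (a rough sketch of what I have in mind, not tested; I am assuming
si->exact_pages / si->lossy_pages are the worker's uint64 counters here):

    if (si->exact_pages > 0 || si->lossy_pages > 0)
    {
        appendStringInfoString(es->str, "Heap Blocks:");
        if (si->exact_pages > 0)
            appendStringInfo(es->str, " exact=" UINT64_FORMAT, si->exact_pages);
        if (si->lossy_pages > 0)
            appendStringInfo(es->str, " lossy=" UINT64_FORMAT, si->lossy_pages);
        appendStringInfoChar(es->str, '\n');
    }

so that the worker output matches the leader's output format.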

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-10-25 Thread Dilip Kumar
On Wed, Oct 25, 2023 at 5:58 PM Amit Kapila  wrote:
>
> On Fri, Oct 20, 2023 at 9:40 AM Dilip Kumar  wrote:
> >
> > On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila  wrote:
> > >
> > > This and other results shared by you look promising. Will there be any
> > > improvement in workloads related to clog buffer usage?
> >
> > I did not understand this question can you explain this a bit?
> >
>
> I meant to ask about the impact of this patch on accessing transaction
> status via TransactionIdGetStatus(). Shouldn't we expect some
> improvement in accessing CLOG buffers?

Yes, there should be, because 1) now there is no common lock, so
contention on a centralized control lock will be reduced when we are
accessing the transaction status from pages falling in different SLRU
banks, and 2) the buffer pool size is configurable, so if the workload
is accessing transaction status over a wide range then it would help
with frequent buffer eviction, though this might not be the most
common case.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-10-24 Thread Dilip Kumar
On Tue, Oct 24, 2023 at 9:34 PM Alvaro Herrera  wrote:
>
> On 2023-Oct-11, Dilip Kumar wrote:
>
> > In my last email, I forgot to give the link from where I have taken
> > the base path for dividing the buffer pool in banks so giving the same
> > here[1].  And looking at this again it seems that the idea of that
> > patch was from Andrey M. Borodin and the idea of the SLRU scale factor
> > were introduced by Yura Sokolov and Ivan Lazarev.  Apologies for
> > missing that in the first email.
>
> You mean [1].
> [1] https://postgr.es/m/452d01f7e331458f56ad79bef537c31b%40postgrespro.ru
> I don't like this idea very much, because of the magic numbers that act
> as ratios for numbers of buffers on each SLRU compared to other SLRUs.
> These values, which I took from the documentation part of the patch,
> appear to have been selected by throwing darts at the wall:
>
> NUM_CLOG_BUFFERS= Min(128 << slru_buffers_size_scale, 
> shared_buffers/256)
> NUM_COMMIT_TS_BUFFERS   = Min(128 << slru_buffers_size_scale, 
> shared_buffers/256)
> NUM_SUBTRANS_BUFFERS= Min(64 << slru_buffers_size_scale, 
> shared_buffers/256)
> NUM_NOTIFY_BUFFERS  = Min(32 << slru_buffers_size_scale, 
> shared_buffers/256)
> NUM_SERIAL_BUFFERS  = Min(32 << slru_buffers_size_scale, 
> shared_buffers/256)
> NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, 
> shared_buffers/256)
> NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, 
> shared_buffers/256)
>
> ... which look pretty random already, if similar enough to the current
> hardcoded values.  In reality, the code implements different values than
> what the documentation says.
>
> I don't see why would CLOG have the same number as COMMIT_TS, when the
> size for elements of the latter is like 32 times bigger -- however, the
> frequency of reads for COMMIT_TS is like 1000x smaller than for CLOG.
> SUBTRANS is half of CLOG, yet it is 16 times larger, and it covers the
> same range.  The MULTIXACT ones appear to keep the current ratio among
> them (8/16 gets changed to 32/64).
>
> ... and this whole mess is scaled exponentially without regard to the
> size that each SLRU requires.  This is just betting that enough memory
> can be wasted across all SLRUs up to the point where the one that is
> actually contended has sufficient memory.  This doesn't sound sensible
> to me.
>
> Like everybody else, I like having less GUCs to configure, but going
> this far to avoid them looks rather disastrous to me.  IMO we should
> just use Munro's older patches that gave one GUC per SLRU, and users
> only need to increase the one that shows up in pg_wait_event sampling.
> Someday we will get the (much more complicated) patches to move these
> buffers to steal memory from shared buffers, and that'll hopefully let
> use get rid of all this complexity.

Overall I agree with your comments; actually, I hadn't put that much
thought into the GUC part and how it scales the SLRU buffers w.r.t.
this single configurable parameter.  So yeah, I think it is better
that we take the older patch version as our base patch, where we have
a separate GUC per SLRU.

> I'm inclined to use Borodin's patch last posted here [2] instead of your
> proposed 0001.
> [2] https://postgr.es/m/93236d36-b91c-4dfa-af03-99c083840...@yandex-team.ru

I will rebase my patches on top of this.

> I did skim patches 0002 and 0003 without going into too much detail;
> they look reasonable ideas.  I have not tried to reproduce the claimed
> performance benefits.  I think measuring this patch set with the tests
> posted by Shawn Debnath in [3] is important, too.
> [3] https://postgr.es/m/yemddpmrsojfq...@f01898859afd.ant.amazon.com

Thanks for taking a look.

>
> On the other hand, here's a somewhat crazy idea.  What if, instead of
> stealing buffers from shared_buffers (which causes a lot of complexity),

Currently, we do not steal buffers from shared_buffers, though the
computation depends on NBuffers.  I mean, for each SLRU we are
computing separate memory, which is in addition to shared_buffers, no?
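I mean, the way the sizing works is roughly like this (a simplified sketch of
how the shared memory request adds up, not the exact code):

    Size        size = 0;

    size = add_size(size, BufferShmemSize());      /* shared_buffers itself */
    size = add_size(size, CLOGShmemSize());        /* sized from its own formula/GUC */
    size = add_size(size, CommitTsShmemSize());
    size = add_size(size, MultiXactShmemSize());
    /* ... and so on for the other SLRUs ... */

so the SLRU pools are allocated on top of shared_buffers rather than carved
out of it, even though some of the formulas scale with NBuffers.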

> we allocate a common pool for all SLRUs to use?  We provide a single
> knob -- say, non_relational_buffers=32MB as default -- and we use a LRU
> algorithm (or something) to distribute that memory across all the SLRUs.
> So the ratio to use for this SLRU or that one would depend on the nature
> of the workload: maybe more for multixact in this server here, but more
> for subtrans in that server there; it's just the total amount that the
> user would have to configure, side by side with shared_buffers (and
> perhaps scale with it like wal_buffers), and the LRU would handle the
> rest.  The "

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-10-19 Thread Dilip Kumar
On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila  wrote:
>
> This and other results shared by you look promising. Will there be any
> improvement in workloads related to clog buffer usage?

I did not understand this question; can you explain it a bit?  In
short, if it is about performance, then we will see improvement for
all the SLRUs, as the control lock is no longer centralized; instead
it is a bank-wise lock.

 BTW, I remember
> that there was also a discussion of moving SLRU into a regular buffer
> pool [1]. You have not provided any explanation as to whether that
> approach will have any merits after we do this or whether that
> approach is not worth pursuing at all.
>
> [1] - https://commitfest.postgresql.org/43/3514/

Yeah, I haven't read that thread in detail for the performance numbers
and all.  But both of these cannot coexist, because this patch
improves SLRU buffer pool access (configurable size) and also the lock
contention.  If we move the SLRUs into the main buffer pool then we
might not have a similar problem; instead there might be other
problems, like SLRU buffers getting evicted because of other relation
buffers.  OTOH, the advantage of that approach would be that we can
just use a bigger buffer pool and the SLRUs can also take advantage of
that.  But in my opinion, most of the time we have limited page access
in SLRUs, and the SLRU buffer access pattern is quite different from
the relation page access pattern, so keeping them in the same buffer
pool and competing against relation pages for victim buffer selection
might cause different problems.  But anyway, I would rather discuss
those points in that thread.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: pg_upgrade's interaction with pg_resetwal seems confusing

2023-10-13 Thread Dilip Kumar
On Fri, Oct 13, 2023 at 11:07 AM Amit Kapila  wrote:
>
> > But is this a problem? basically, we will set the
> > ShmemVariableCache->nextOid counter to the value that we want in the
> > IsBinaryUpgrade-specific function.  And then the shutdown checkpoint
> > will flush that value to the control file and that is what we want no?
> >
>
> I think that can work. Basically, we need to do something like what
> SetNextObjectId() does and then let the shutdown checkpoint update the
> actual value in the control file.

Right.

> >  I mean instead of resetwal directly modifying the control file we
> > will modify that value in the server using the binary_upgrade function
> > and then have that value flush to the disk by shutdown checkpoint.
> >
>
> True, the alternative to consider is to let pg_upgrade update the
> control file by itself with the required value of OID. The point I am
> slightly worried about doing via server-side function is that some
> online and or shutdown checkpoint can update other values in the
> control file as well whereas if we do via pg_upgrade, we can have
> better control over just updating the OID.

Yeah, that's a valid point.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: pg_upgrade's interaction with pg_resetwal seems confusing

2023-10-12 Thread Dilip Kumar
On Fri, Oct 13, 2023 at 9:29 AM Amit Kapila  wrote:
>
> On Fri, Oct 13, 2023 at 12:00 AM Robert Haas  wrote:
> >
> > On Thu, Oct 12, 2023 at 7:17 AM Amit Kapila  wrote:
> > > Now, as mentioned in the first paragraph, it seems we anyway don't
> > > need to reset the WAL at the end when setting the next OID for the new
> > > cluster with the -o option. If that is true, then I think even without
> > > slots work it will be helpful to have such an option in pg_resetwal.
> > >
> > > Thoughts?
> >
> > I wonder if we should instead provide a way to reset the OID counter
> > with a function call inside the database, gated by IsBinaryUpgrade.
> >
>
> I think the challenge in doing so would be that when the server is
> running, a concurrent checkpoint can also update the OID counter value
> in the control file. See below code:
>
> CreateCheckPoint()
> {
> ...
> LWLockAcquire(OidGenLock, LW_SHARED);
> checkPoint.nextOid = ShmemVariableCache->nextOid;
> if (!shutdown)
> checkPoint.nextOid += ShmemVariableCache->oidCount;
> LWLockRelease(OidGenLock);
> ...
> UpdateControlFile()
> ...
> }
>

But is this a problem?  Basically, we will set the
ShmemVariableCache->nextOid counter to the value that we want in the
IsBinaryUpgrade-specific function, and then the shutdown checkpoint
will flush that value to the control file, which is what we want, no?
I mean, instead of pg_resetwal directly modifying the control file, we
will modify that value in the server using the binary_upgrade function
and then have that value flushed to disk by the shutdown checkpoint.
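Something like the below is what I have in mind (just a sketch; the function
name is made up and error checks are omitted):

    Datum
    binary_upgrade_set_next_oid(PG_FUNCTION_ARGS)
    {
        Oid         nextoid = PG_GETARG_OID(0);

        CHECK_IS_BINARY_UPGRADE;

        LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
        ShmemVariableCache->nextOid = nextoid;
        ShmemVariableCache->oidCount = 0;   /* force the new value to be used as-is */
        LWLockRelease(OidGenLock);

        PG_RETURN_VOID();
    }

pg_upgrade would call this on the new cluster and then the shutdown
checkpoint writes the value into the control file.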


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-10-12 Thread Dilip Kumar
On Wed, Oct 11, 2023 at 5:57 PM Dilip Kumar  wrote:
>
> On Wed, Oct 11, 2023 at 4:34 PM Dilip Kumar  wrote:

> In my last email, I forgot to give the link from where I have taken
> the base path for dividing the buffer pool in banks so giving the same
> here[1].  And looking at this again it seems that the idea of that
> patch was from
> Andrey M. Borodin and the idea of the SLRU scale factor were
> introduced by Yura Sokolov and Ivan Lazarev.  Apologies for missing
> that in the first email.
>
> [1] https://commitfest.postgresql.org/43/2627/

In my last email I had just rebased the base patch, so now, while
reading through that patch, I realized that some refactoring was
needed and there were some unused functions, so I have removed those
and also added some comments.  I also did some refactoring of my
patches, so I am reposting the patch series.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v2-0002-bank-wise-slru-locks.patch
Description: Binary data


v2-0003-Introduce-bank-wise-LRU-counter.patch
Description: Binary data


v2-0001-Divide-SLRU-buffers-into-banks.patch
Description: Binary data


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-10-11 Thread Dilip Kumar
On Wed, Oct 11, 2023 at 4:34 PM Dilip Kumar  wrote:
>
> The small size of the SLRU buffer pools can sometimes become a
> performance problem because it’s not difficult to have a workload
> where the number of buffers actively in use is larger than the
> fixed-size buffer pool. However, just increasing the size of the
> buffer pool doesn’t necessarily help, because the linear search that
> we use for buffer replacement doesn’t scale, and also because
> contention on the single centralized lock limits scalability.
>
> There is a couple of patches proposed in the past to address the
> problem of increasing the buffer pool size, one of the patch [1] was
> proposed by Thomas Munro where we make the size of the buffer pool
> configurable.

In my last email, I forgot to give the link from where I have taken
the base patch for dividing the buffer pool into banks, so giving the
same here [1].  And looking at this again, it seems that the idea of
that patch was from Andrey M. Borodin and the idea of the SLRU scale
factor was introduced by Yura Sokolov and Ivan Lazarev.  Apologies for
missing that in the first email.

[1] https://commitfest.postgresql.org/43/2627/

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-10-11 Thread Dilip Kumar
ion on the
SLRU buffer pool.  I am not leaving the long-running transaction
running forever as that will start to show another problem with
respect to bloat and we will lose the purpose of what I am trying to
show here.

Note: the test configuration is the same as in Exp1; only the workload
is different, we are running the 2 scripts below, with the new config
parameter (added in v1-0001) slru_buffers_size_scale=4.  That means
NUM_MULTIXACTOFFSET_BUFFERS will be 64 (it is 16 in head) and
NUM_MULTIXACTMEMBER_BUFFERS will be 128 (32 in head).

./pgbench -c $ -j $ -T 600 -P5 -M prepared -f multixact.sql postgres
./pgbench -c 1 -j 1 -T 600 -f longrunning.sql postgres

cat > multixact.sql <<EOF
[multixact workload SQL, content not preserved in the archive]
EOF

cat > longrunning.sql <<EOF
BEGIN;
INSERT INTO pgbench_test VALUES(1);
select pg_sleep(10);
COMMIT;
EOF

Results:
Clients   Head    SlruBank   SlruBank+BankwiseLock
1         528     513        531
8         3870    4239       4157
32        13945   14470      14556
64        10086   19034      24482
128       6909    15627      18161

Here we can see good improvement with the SlruBank patch itself,
because of the increased SLRU buffer pool; in this workload there is a
lot of contention due to buffer replacement.  As shown below, there is
a lot of load on MultiXactOffsetSLRU as well as on
MultiXactOffsetBuffer, which shows there are frequent buffer evictions
in this workload.  Increasing the SLRU buffer pool size helps a lot,
and by further dividing the SLRU lock into bank-wise locks we see a
further gain.  So in total, we are seeing ~2.5x TPS at 64 and 128
threads compared to head.

   3401  LWLock  | MultiXactOffsetSLRU
   2031  LWLock  | MultiXactOffsetBuffer
    687          |
    427  LWLock  | BufferContent

Credits:
- The base patch v1-0001 is authored by Thomas Munro and I have just rebased it.
- 0002 and 0003 are new patches written by me, based on design ideas
from Robert and myself.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v1-0003-Introduce-bank-wise-LRU-counter.patch
Description: Binary data


v1-0001-Divide-SLRU-buffers-into-banks.patch
Description: Binary data


v1-0002-bank-wise-slru-locks.patch
Description: Binary data


Re: Opportunistically pruning page before update

2023-10-05 Thread Dilip Kumar
On Thu, Oct 5, 2023 at 2:35 AM James Coleman  wrote:
>
> I talked to Andres and Peter again today, and out of that conversation
> I have some observations and ideas for future improvements.
>
> 1. The most trivial case where this is useful is INSERT: we have a
> target page, and it may have dead tuples, so trying to prune may
> result in us being able to use the target page rather than getting a
> new page.
> 2. The next most trivial case is where UPDATE (potentially after
> failing to find space for a HOT tuple on the source tuple's page);
> much like the INSERT case our backend's target page may benefit from
> pruning.

Looking at the patch, I believe v2-0003 implements these 2 ideas.  So
my question is: are we planning to prune only the backend's current
target page, or, if we cannot find space in that one, are we also
targeting the other pages we get from the FSM?  Because in the patch
you have put the check in a loop, it will try to prune every page it
gets from the FSM, not just the backend's current target page.  Just
wanted to understand whether this is intentional.

In general, all 4 ideas look promising.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-05 Thread Dilip Kumar
On Thu, Oct 5, 2023 at 1:48 AM Amit Kapila  wrote:
>
> On Tue, Oct 3, 2023 at 9:58 AM Bharath Rupireddy
>  wrote:
> >
> > On Fri, Sep 29, 2023 at 5:27 PM Hayato Kuroda (Fujitsu)
> >  wrote:
> > >
> > > Yeah, the approach enforces developers to check the decodability.
> > > But the benefit seems smaller than required efforts for it because the 
> > > function
> > > would be used only by pg_upgrade. Could you tell me if you have another 
> > > use case
> > > in mind? We may able to adopt if we have...
> >
> > I'm attaching 0002 patch (on top of v45) which implements the new
> > decodable callback approach that I have in mind. IMO, this new
> > approach is extensible, better than the current approach (hard-coding
> > of certain WAL records that may be generated during pg_upgrade) taken
> > by the patch, and helps deal with the issue that custom WAL resource
> > managers can have with the current approach taken by the patch.
> >
>
> Today, I discussed this problem with Andres at PGConf NYC and he
> suggested as following. To verify, if there is any pending unexpected
> WAL after shutdown, we can have an API like
> pg_logical_replication_slot_advance() which will simply process
> records without actually sending anything downstream.

So I assume that in each lower-level decode function (e.g.
heap_decode()) we will add a check such that, if we are checking the
WAL for an upgrade, we return true or false from that level based on
whether the WAL is decodable or not.  Is my understanding correct?  At
first thought this approach looks better and more generic.
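i.e. something like this early in heap_decode() and friends (a very rough
sketch just to confirm my understanding; the field names are made up):

    /* in heap_decode(), before doing any real decoding work */
    if (ctx->check_decodability_only)
    {
        /*
         * We have reached a record that would actually need to be decoded and
         * sent downstream, so just report that and stop; records filtered out
         * above this point would never get here.
         */
        ctx->found_pending_decodable_wal = true;
        return;
    }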

 In this new API,
> we will start with each slot's restart_lsn location and try to process
> till the end of WAL, if we encounter any WAL that needs to be
> processed (like we need to send the decoded WAL downstream) we can
> return a false indicating that there is an unexpected WAL. The reason
> to start with restart_lsn is that it is the location that we use to
> start scanning the WAL anyway.

Yeah, that makes sense.

> Then, we should also try to create slots before invoking pg_resetwal.
> The idea is that we can write a new binary mode function that will do
> exactly what pg_resetwal does to compute the next segment and use that
> location as a new location (restart_lsn) to create the slots in a new
> node. Then, pass it pg_resetwal by using the existing option '-l
> walfile'. As we don't have any API that takes restart_lsn as input, we
> can write a new API probably for binary mode to create slots that do
> take restart_lsn as input. This will ensure that there is no new WAL
> inserted by background processes between resetwal and the creation of
> slots.

Yeah, that looks cleaner IMHO.

> The other potential problem Andres pointed out is that during shutdown
> if due to some reason, the walreceiver goes down, we won't be able to
> send the required WAL and users won't be able to ensure that because
> even after restart the same situation can happen. The ideal way is to
> have something that puts the system in READ ONLY state during shutdown
> and then we can probably allow walreceivers to reconnect and receive
> the required WALs. As we don't have such functionality available and
> it won't be easy to achieve the same, we can leave this for now.

+1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-02 Thread Dilip Kumar
On Tue, Oct 3, 2023 at 9:58 AM Bharath Rupireddy
 wrote:
>
> On Fri, Sep 29, 2023 at 5:27 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Yeah, the approach enforces developers to check the decodability.
> > But the benefit seems smaller than required efforts for it because the 
> > function
> > would be used only by pg_upgrade. Could you tell me if you have another use 
> > case
> > in mind? We may able to adopt if we have...
>
> I'm attaching 0002 patch (on top of v45) which implements the new
> decodable callback approach that I have in mind. IMO, this new
> approach is extensible, better than the current approach (hard-coding
> of certain WAL records that may be generated during pg_upgrade) taken
> by the patch, and helps deal with the issue that custom WAL resource
> managers can have with the current approach taken by the patch.

I did not see the patch, but I like this approach better.  I mean,
this approach does not check what record types are generated during
the upgrade; instead it directly targets what type of records
shouldn't be generated after the confirmed_flush_lsn.  So if the rmgr
says that no decodable record was generated after the
confirmed_flush_lsn, then we are safe to upgrade that slot.  So this
seems an extensible approach.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-25 Thread Dilip Kumar
On Mon, Sep 25, 2023 at 1:23 PM Bharath Rupireddy
 wrote:
>
> On Mon, Sep 25, 2023 at 12:32 PM Dilip Kumar  wrote:
> >
> > > > Is there anything else that stops this patch from supporting migration
> > > > of logical replication slots from PG versions < 17?
> > >
> > > IMHO one of the main change we are doing in PG 17 is that on shutdown
> > > checkpoint we are ensuring that if the confirmed flush lsn is updated
> > > since the last checkpoint and that is not yet synched to the disk then
> > > we are doing so.  I think this is the most important change otherwise
> > > many slots for which we have already streamed all the WAL might give
> > > an error assuming that there are pending WAL from the slots which are
> > > not yet confirmed.
> > >
> >
> > You might need to refer to [1] for the change I am talking about
> >
> > [1] 
> > https://www.postgresql.org/message-id/CAA4eK1%2BLtWDKXvxS7gnJ562VX%2Bs3C6%2B0uQWamqu%3DUuD8hMfORg%40mail.gmail.com
>
> I see. IIUC, without that commit e0b2eed [1], it may happen that the
> slot's on-disk confirmed_flush LSN value can be higher than the WAL
> LSN that's flushed to disk, no? If so, can't it be detected if the WAL
> at confirmed_flush LSN is valid or not when reading WAL with
> xlogreader machinery?

Actually, without this commit the slot's "confirmed_flush LSN" value
in memory can be higher than the one on disk, because, if you look at
LogicalConfirmReceivedLocation(), when we change only the confirmed
flush the slot is not marked dirty, which means on shutdown the slot
will not be persisted to disk.  But logically this will not cause any
issue, so we cannot treat it as a bug; it may cause us to process some
extra records after the restart, but that is not really a bug.
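To spell it out, LogicalConfirmReceivedLocation() roughly does the following
(a simplified paraphrase, not the exact code):

    if (updated_xmin || updated_restart)
    {
        ReplicationSlotMarkDirty();
        ReplicationSlotSave();      /* slot is persisted to disk here */
    }
    else
    {
        /*
         * Only confirmed_flush changed: the slot is not marked dirty, so a
         * later shutdown does not necessarily write the new value to disk.
         */
        SpinLockAcquire(&MyReplicationSlot->mutex);
        MyReplicationSlot->data.confirmed_flush = lsn;
        SpinLockRelease(&MyReplicationSlot->mutex);
    }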

> What if the commit e0b2eed [1] is treated to be fixing a bug with the
> reasoning [2] and backpatch? When done so, it's easy to support
> upgradation/migration of logical replication slots from PG versions <
> 17, no?

Maybe this could be backpatched in order to support this upgrade from
older versions, but not as a bug fix.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-25 Thread Dilip Kumar
On Mon, Sep 25, 2023 at 12:30 PM Dilip Kumar  wrote:
>
> On Mon, Sep 25, 2023 at 11:15 AM Bharath Rupireddy
>  wrote:
> >
> > On Fri, Sep 22, 2023 at 12:11 PM Amit Kapila  
> > wrote:
> > >
> > > Yeah, both by tests and manually verifying the WAL records. Basically,
> > > we need to care about records that could be generated by background
> > > processes like checkpointer/bgwriter or can be generated during system
> > > table scans. You may want to read my latest email for a summary on how
> > > we reached at this design choice [1].
> > >
> > > [1] - 
> > > https://www.postgresql.org/message-id/CAA4eK1JVKZGRHLOEotWi%2Be%2B09jucNedqpkkc-Do4dh5FTAU%2B5w%40mail.gmail.com
> >
> > +/* Logical slots can be migrated since PG17. */
> > +if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1600)
> > +{
> >
> > Why can't the patch allow migration of logical replication slots from
> > PG versions < 17 to say 17 or later? If done, it will be a main
> > advantage of the patch since it will enable seamless major version
> > upgrades of postgres database instances with logical replication
> > slots.
> >
> > I'm looking at the changes to the postgres backend that this patch
> > does - AFICS, it does 2 things 1) implements
> > binary_upgrade_validate_wal_logical_end function, 2) adds an assertion
> > that the logical slots won't get invalidated. For (1), pg_upgrade can
> > itself can read the WAL from the old cluster to determine the logical
> > WAL end (i.e. implement the functionality of
> > binary_upgrade_validate_wal_logical_end ) because the xlogreader is
> > available to FRONTEND tools. For (2), it's just an assertion and
> > logical WAL end determining logic will anyway determine whether or not
> > the slots are valid; if needed, the assertion can be backported.
> >
> > Is there anything else that stops this patch from supporting migration
> > of logical replication slots from PG versions < 17?
>
> IMHO one of the main change we are doing in PG 17 is that on shutdown
> checkpoint we are ensuring that if the confirmed flush lsn is updated
> since the last checkpoint and that is not yet synched to the disk then
> we are doing so.  I think this is the most important change otherwise
> many slots for which we have already streamed all the WAL might give
> an error assuming that there are pending WAL from the slots which are
> not yet confirmed.
>

You might need to refer to [1] for the change I am talking about

[1] 
https://www.postgresql.org/message-id/CAA4eK1%2BLtWDKXvxS7gnJ562VX%2Bs3C6%2B0uQWamqu%3DUuD8hMfORg%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-25 Thread Dilip Kumar
On Mon, Sep 25, 2023 at 11:15 AM Bharath Rupireddy
 wrote:
>
> On Fri, Sep 22, 2023 at 12:11 PM Amit Kapila  wrote:
> >
> > Yeah, both by tests and manually verifying the WAL records. Basically,
> > we need to care about records that could be generated by background
> > processes like checkpointer/bgwriter or can be generated during system
> > table scans. You may want to read my latest email for a summary on how
> > we reached at this design choice [1].
> >
> > [1] - 
> > https://www.postgresql.org/message-id/CAA4eK1JVKZGRHLOEotWi%2Be%2B09jucNedqpkkc-Do4dh5FTAU%2B5w%40mail.gmail.com
>
> +/* Logical slots can be migrated since PG17. */
> +if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1600)
> +{
>
> Why can't the patch allow migration of logical replication slots from
> PG versions < 17 to say 17 or later? If done, it will be a main
> advantage of the patch since it will enable seamless major version
> upgrades of postgres database instances with logical replication
> slots.
>
> I'm looking at the changes to the postgres backend that this patch
> does - AFICS, it does 2 things 1) implements
> binary_upgrade_validate_wal_logical_end function, 2) adds an assertion
> that the logical slots won't get invalidated. For (1), pg_upgrade can
> itself can read the WAL from the old cluster to determine the logical
> WAL end (i.e. implement the functionality of
> binary_upgrade_validate_wal_logical_end ) because the xlogreader is
> available to FRONTEND tools. For (2), it's just an assertion and
> logical WAL end determining logic will anyway determine whether or not
> the slots are valid; if needed, the assertion can be backported.
>
> Is there anything else that stops this patch from supporting migration
> of logical replication slots from PG versions < 17?

IMHO one of the main changes we are doing in PG 17 is that on the
shutdown checkpoint we ensure that if the confirmed flush LSN has been
updated since the last checkpoint and is not yet synced to disk, then
we sync it.  I think this is the most important change; otherwise,
many slots for which we have already streamed all the WAL might give
an error, on the assumption that there is pending WAL for those slots
which is not yet confirmed.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



