Re: POC: Cleaning up orphaned files using undo logs

2022-01-12 Thread Julien Rouhaud
Hi,

On Thu, Nov 25, 2021 at 10:00 PM Antonin Houska  wrote:
>
> So it should be ok if the temporary undo is managed and discarded by
> individual backends. Patch 0005 of the new series tries to do that.

The cfbot reports that at least the 0001 patch doesn't apply anymore:
http://cfbot.cputube.org/patch_36_3228.log
> === applying patch ./undo-20211125/0001-Add-SmgrId-to-smgropen-and-BufferTag.patch
> [...]
> patching file src/bin/pg_waldump/pg_waldump.c
> Hunk #1 succeeded at 480 (offset 17 lines).
> Hunk #2 FAILED at 500.
> Hunk #3 FAILED at 531.
> 2 out of 3 hunks FAILED -- saving rejects to file src/bin/pg_waldump/pg_waldump.c.rej

Could you send a rebased version?  In the meantime I'll switch the cf
entry to Waiting on Author.




Re: POC: Cleaning up orphaned files using undo logs

2021-09-29 Thread Amit Kapila
On Tue, Sep 28, 2021 at 7:36 PM Antonin Houska  wrote:
>
> Amit Kapila  wrote:
>
> > On Mon, Sep 20, 2021 at 10:55 AM Antonin Houska  wrote:
> > >
> > > Amit Kapila  wrote:
> > >
> > > > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska  wrote:
> > >
> > > > > The problem of the temporary undo log is that it's loaded into local buffers
> > > > > and that backend can exit w/o flushing local buffers to disk, and thus we are
> > > > > not guaranteed to find enough information when trying to discard the undo log
> > > > > the backend wrote. I'm thinking about the following solutions:
> > > > >
> > > > > 1. Let the backend manage temporary undo log on its own (even the slot
> > > > >    metadata would stay outside the shared memory, and in particular the
> > > > >    insertion pointer could start from 1 for each session) and remove the
> > > > >    segment files at the same moment the temporary relations are removed.
> > > > >
> > > > >    However, by moving the temporary undo slots away from the shared memory,
> > > > >    computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would
> > > > >    be affected. It might seem that a transaction which only writes undo log
> > > > >    for temporary relations does not need to affect oldestFullXidHavingUndo,
> > > > >    but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo
> > > > >    prevents transactions to be truncated from the CLOG too early, I wonder if
> > > > >    the following is possible (This scenario is only applicable to the zheap
> > > > >    storage engine [1], which is not included in this patch, but should already
> > > > >    be considered.):
> > > > >
> > > > >    A transaction creates a temporary table, does some (many) changes and then
> > > > >    gets rolled back. The undo records are being applied and it takes some
> > > > >    time. Since XID of the transaction did not affect oldestFullXidHavingUndo,
> > > > >    the XID can disappear from the CLOG due to truncation.
> > > > >
> > > >
> > > > By above do you mean to say that in zheap code, we don't consider XIDs
> > > > that operate on temp table/undo for oldestFullXidHavingUndo?
> > >
> > > I was referring to the code
> > >
> > > /* We can't process temporary undo logs. */
> > > if (log->meta.persistence == UNDO_TEMP)
> > > continue;
> > >
> > > in undodiscard.c:UndoDiscard().
> > >
> >
> > Here, I think it will just skip undo of temporary undo logs and
> > oldestFullXidHavingUndo should be advanced after skipping it.
>
> Right, it'll be advanced, but the transaction XID (if the transaction wrote
> only to temporary relations) might still be needed.
>
> > > >
> > > > > However zundo.c in
> > > > >[1] indicates that the transaction status *is* checked during undo
> > > > >execution, so we might have a problem.
> > > > >
> > > >
> > > > It would be easier to follow if you can tell which exact code are you
> > > > referring here?
> > >
> > > I meant the call of TransactionIdDidCommit() in
> > > zundo.c:zheap_exec_pending_rollback().
> > >
> >
> > IIRC, this should be called for temp tables after they have exited as
> > this is only to apply the pending undo actions if any, and in case of
> > temporary undo after session exit, we shouldn't need it.
>
> I see (I had to play with the debugger a bit). Currently this works because
> the temporary relations are dropped by AbortTransaction() ->
> smgrDoPendingDeletes(), before the undo execution starts. The situation will
> change as soon as the file removal is also handled by the undo subsystem;
> however, I'm still not sure how to hit the TransactionIdDidCommit() call for
> an XID already truncated from the CLOG.
>
> I'm starting to admit that there's no issue here: temporary undo is always
> applied immediately in the foreground, and thus the zheap_exec_pending_rollback()
> function never needs to examine an XID which no longer exists in the CLOG.
>
> > I am not able to understand what exact problem you are facing for temp
> > tables after the session exit. Can you please explain it a bit more?
>
> The problem is that the temporary undo is loaded into backend-local
> buffers. Thus there's no guarantee that we'll find consistent information in
> the undo file even if the backend exited cleanly (local buffers are not
> flushed at backend exit and there's no WAL for them). However, we need to read
> the undo file to find out whether (part of) it can be discarded.
>
> I'm trying to find out whether we can ignore the temporary undo when trying to
> advance oldestFullXidHavingUndo or not.
>

It seems this is the crucial point. In the code you pointed to, we
ignore the temporary undo while advancing oldestFullXidHavingUndo, but
if you find any case where that is not true then we need to 

Re: POC: Cleaning up orphaned files using undo logs

2021-09-29 Thread Antonin Houska
Thomas Munro  wrote:

> On Wed, Sep 29, 2021 at 8:18 AM Antonin Houska  wrote:
> > I'm just trying to use the existing infrastructure: the effect of DROP TABLE
> > also appears to be performed by the checkpointer. However I don't know why the
> > unlinks need to be performed by the checkpointer.
> 
> For DROP TABLE, we leave an empty file (I've been calling it a
> "tombstone file") so that GetNewRelFileNode() won't let you reuse the
> same relfilenode in the same checkpoint cycle.  One reason is that
> wal_level=minimal has a data-eating crash recovery failure mode if you
> reuse a relfilenode in a checkpoint cycle.

Interesting. Is the problem that REDO of the DROP TABLE command deletes the
relfilenode which already contains the new data, but the new data cannot be
recovered because (due to wal_level=minimal) it's not present in WAL? In this
case I suppose that the checkpoint just ensures that the DROP TABLE won't be
replayed during the next crash recovery.

BTW, does the attached comment fix make sense to you? The corresponding code
in InitSync() is:

/*
 * Create pending-operations hashtable if we need it.  Currently, we need
 * it if we are standalone (not under a postmaster) or if we are a
 * checkpointer auxiliary process.
 */
if (!IsUnderPostmaster || AmCheckpointerProcess())

I suspect this is also related to the commit 7ff23c6d27.

Thanks for your answer, I was considering adding you to CC :-)

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 1c78581354..ae6c5ff8e4 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -563,7 +563,7 @@ RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
 
 	if (pendingOps != NULL)
 	{
-		/* standalone backend or startup process: fsync state is local */
+		/* standalone backend or checkpointer process: fsync state is local */
 		RememberSyncRequest(ftag, type);
 		return true;
 	}
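
For reference, RegisterSyncRequest() essentially does the following (a simplified
paraphrase for illustration, not the literal source; details may differ):

/*
 * Simplified paraphrase of RegisterSyncRequest() in sync.c.  pendingOps is
 * created by InitSync() only in a standalone backend or in the checkpointer
 * itself (see the excerpt above), so every other backend has to forward its
 * fsync/unlink requests to the checkpointer.
 */
static bool
RegisterSyncRequestSketch(const FileTag *ftag, SyncRequestType type,
                          bool retryOnError)
{
    if (pendingOps != NULL)
    {
        /* standalone backend or checkpointer: fsync state is local */
        RememberSyncRequest(ftag, type);
        return true;
    }

    /* regular backend: hand the request over to the checkpointer */
    for (;;)
    {
        if (ForwardSyncRequest(ftag, type))
            return true;
        if (!retryOnError)
            return false;
        pg_usleep(10000L);      /* request queue full, wait and retry */
    }
}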


Re: POC: Cleaning up orphaned files using undo logs

2021-09-28 Thread Thomas Munro
On Wed, Sep 29, 2021 at 8:18 AM Antonin Houska  wrote:
> I'm just trying to use the existing infrastructure: the effect of DROP TABLE
> also appears to be performed by the checkpointer. However I don't know why the
> unlinks need to be performed by the checkpointer.

For DROP TABLE, we leave an empty file (I've been calling it a
"tombstone file") so that GetNewRelFileNode() won't let you reuse the
same relfilenode in the same checkpoint cycle.  One reason is that
wal_level=minimal has a data-eating crash recovery failure mode if you
reuse a relfilenode in a checkpoint cycle.
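
A self-contained toy illustrating the idea (not PostgreSQL code; the file name
and helpers are made up):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* "Drop" a relation: leave a zero-length tombstone instead of unlinking. */
static void
drop_relation(const char *path)
{
    int     fd = open(path, O_WRONLY | O_TRUNC);

    if (fd >= 0)
        close(fd);
}

/* GetNewRelFileNode()-style probe: a tombstone still blocks reuse. */
static int
relfilenode_is_free(const char *path)
{
    return access(path, F_OK) != 0;
}

int
main(void)
{
    const char *path = "16384";         /* pretend relfilenode file */
    FILE       *f = fopen(path, "w");   /* "create" the relation */

    if (f)
        fclose(f);

    drop_relation(path);
    printf("reusable right after DROP: %s\n",
           relfilenode_is_free(path) ? "yes" : "no");   /* no */

    unlink(path);                       /* the checkpointer's deferred unlink */
    printf("reusable after the checkpoint cycle: %s\n",
           relfilenode_is_free(path) ? "yes" : "no");   /* yes */
    return 0;
}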




Re: POC: Cleaning up orphaned files using undo logs

2021-09-28 Thread Antonin Houska
Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Tue, Sep 21, 2021 at 10:07:55AM +0200, Dmitry Dolgov wrote:
> > On Tue, 21 Sep 2021 09:00 Antonin Houska,  wrote:
> >
> > > Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > >
> > > > Yep, makes sense, thanks. I have few more questions:
> > > >
> > > > * The use case with orphaned files is working somewhat differently after
> > > >   the rebase on the latest master, do you observe it as well? The
> > > >   difference is ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up
> > > >   an orphaned relation file immediately (only later on checkpoint)
> > > >   because of empty pendingUnlinks. I haven't investigated more yet, but
> > > >   seems like after this commit:
> > > >
> > > > commit 7ff23c6d277d1d90478a51f0dd81414d343f3850
> > > > Author: Thomas Munro 
> > > > Date:   Mon Aug 2 17:32:20 2021 +1200
> > > >
> > > > Run checkpointer and bgwriter in crash recovery.
> > > >
> > > > Start up the checkpointer and bgwriter during crash recovery (except in
> > > > --single mode), as we do for replication.  This wasn't done back in
> > > > commit cdd46c76 out of caution.  Now it seems like a better idea to make
> > > > the environment as similar as possible in both cases.  There may also be
> > > > some performance advantages.
> > > >
> > > >   something has to be updated (pendingOps are empty right now, so no
> > > >   unlink request is remembered).
> > >
> > > I haven't been debugging that part recently, but yes, this commit is relevant,
> > > thanks for pointing that out! Attached is a patch that should fix it. I'll
> > > include it in the next version of the patch series, unless you tell me that
> > > something is still wrong.
> > >
> > >
> >
> > Sure, but  I can take a look only in a couple of days.
> 
> Thanks for the patch.
> 
> Hm, maybe there is some misunderstanding. My question above was about
> the changed behaviour, when orphaned files (e.g. created relation files
> after the backend was killed) are removed only by checkpointer when it
> kicks in. As far as I understand, the original intention was to do this
> job right away, that's why SyncPre/PostCheckpoint was invoked. But the
> recent changes around checkpointer make the current implementation
> insufficient.

> The patch you've proposed removes the invocation of SyncPre/PostCheckpoint,
> do I see it correctly? In this sense it doesn't change anything, except
> removing non-functioning code of course.

Yes, it sounds like a misunderstanding. I thought you were complaining about
code which is no longer needed.

The original intention was to make sure that the files get unlinked at all.
IIRC, before commit 7ff23c6d27 the SyncPre/PostCheckpoint calls were necessary
because the checkpointer wasn't running that early during startup. Without
these calls the startup process would exit without doing anything. Sorry, I
see now that the comment incorrectly says "... it seems simpler ...", but in
fact it was necessary.

> But the question, probably
> reformulated from a more design-oriented point of view, stays the same — when
> and by which process do such orphaned files have to be removed? I've assumed
> that by removing them right away the previous version was trying to avoid any
> kind of thundering effect from removing too many at once, but maybe I'm
> mistaken here.

I'm just trying to use the existing infrastructure: the effect of DROP TABLE
also appears to be performed by the checkpointer. However I don't know why the
unlinks need to be performed by the checkpointer.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: POC: Cleaning up orphaned files using undo logs

2021-09-28 Thread Dmitry Dolgov
> On Tue, Sep 21, 2021 at 10:07:55AM +0200, Dmitry Dolgov wrote:
> On Tue, 21 Sep 2021 09:00 Antonin Houska,  wrote:
>
> > Dmitry Dolgov <9erthali...@gmail.com> wrote:
> >
> > > Yep, makes sense, thanks. I have few more questions:
> > >
> > > * The use case with orphaned files is working somewhat differently after
> > >   the rebase on the latest master, do you observe it as well? The
> > >   difference is ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up
> > >   an orphaned relation file immediately (only later on checkpoint)
> > >   because of empty pendingUnlinks. I haven't investigated more yet, but
> > >   seems like after this commit:
> > >
> > > commit 7ff23c6d277d1d90478a51f0dd81414d343f3850
> > > Author: Thomas Munro 
> > > Date:   Mon Aug 2 17:32:20 2021 +1200
> > >
> > > Run checkpointer and bgwriter in crash recovery.
> > >
> > Start up the checkpointer and bgwriter during crash recovery (except in
> > --single mode), as we do for replication.  This wasn't done back in
> > commit cdd46c76 out of caution.  Now it seems like a better idea to make
> > the environment as similar as possible in both cases.  There may also be
> > some performance advantages.
> > >
> > >   something has to be updated (pendingOps are empty right now, so no
> > >   unlink request is remembered).
> >
> > I haven't been debugging that part recently, but yes, this commit is relevant,
> > thanks for pointing that out! Attached is a patch that should fix it. I'll
> > include it in the next version of the patch series, unless you tell me that
> > something is still wrong.
> >
>
> Sure, but  I can take a look only in a couple of days.

Thanks for the patch.

Hm, maybe there is some misunderstanding. My question above was about
the changed behaviour, when orphaned files (e.g. created relation files
after the backend was killed) are removed only by checkpointer when it
kicks in. As far as I understand, the original intention was to do this
job right away, that's why SyncPre/PostCheckpoint was invoked. But the
recent changes around checkpointer make the current implementation
insufficient.

The patch you've proposed removes the invocation of SyncPre/PostCheckpoint,
do I see it correctly? In this sense it doesn't change anything, except
removing non-functioning code of course. But the question, probably
reformulated from a more design-oriented point of view, stays the same — when
and by which process do such orphaned files have to be removed? I've assumed
that by removing them right away the previous version was trying to avoid any
kind of thundering effect from removing too many at once, but maybe I'm
mistaken here.




Re: POC: Cleaning up orphaned files using undo logs

2021-09-28 Thread Antonin Houska
Amit Kapila  wrote:

> On Mon, Sep 20, 2021 at 10:55 AM Antonin Houska  wrote:
> >
> > Amit Kapila  wrote:
> >
> > > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska  wrote:
> >
> > > > The problem of the temporary undo log is that it's loaded into local buffers
> > > > and that backend can exit w/o flushing local buffers to disk, and thus we are
> > > > not guaranteed to find enough information when trying to discard the undo log
> > > > the backend wrote. I'm thinking about the following solutions:
> > > >
> > > > 1. Let the backend manage temporary undo log on its own (even the slot
> > > >    metadata would stay outside the shared memory, and in particular the
> > > >    insertion pointer could start from 1 for each session) and remove the
> > > >    segment files at the same moment the temporary relations are removed.
> > > >
> > > >    However, by moving the temporary undo slots away from the shared memory,
> > > >    computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would
> > > >    be affected. It might seem that a transaction which only writes undo log
> > > >    for temporary relations does not need to affect oldestFullXidHavingUndo,
> > > >    but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo
> > > >    prevents transactions to be truncated from the CLOG too early, I wonder if
> > > >    the following is possible (This scenario is only applicable to the zheap
> > > >    storage engine [1], which is not included in this patch, but should already
> > > >    be considered.):
> > > >
> > > >    A transaction creates a temporary table, does some (many) changes and then
> > > >    gets rolled back. The undo records are being applied and it takes some
> > > >    time. Since XID of the transaction did not affect oldestFullXidHavingUndo,
> > > >    the XID can disappear from the CLOG due to truncation.
> > > >
> > >
> > > By above do you mean to say that in zheap code, we don't consider XIDs
> > > that operate on temp table/undo for oldestFullXidHavingUndo?
> >
> > I was referring to the code
> >
> > /* We can't process temporary undo logs. */
> > if (log->meta.persistence == UNDO_TEMP)
> > continue;
> >
> > in undodiscard.c:UndoDiscard().
> >
> 
> Here, I think it will just skip undo of temporary undo logs and
> oldestFullXidHavingUndo should be advanced after skipping it.

Right, it'll be advanced, but the transaction XID (if the transaction wrote
only to temporary relations) might still be needed.

> > >
> > > > However zundo.c in
> > > >[1] indicates that the transaction status *is* checked during undo
> > > >execution, so we might have a problem.
> > > >
> > >
> > > It would be easier to follow if you can tell which exact code are you
> > > referring here?
> >
> > I meant the call of TransactionIdDidCommit() in
> > zundo.c:zheap_exec_pending_rollback().
> >
> 
> IIRC, this should be called for temp tables after they have exited as
> this is only to apply the pending undo actions if any, and in case of
> temporary undo after session exit, we shouldn't need it.

I see (I had to play with the debugger a bit). Currently this works because
the temporary relations are dropped by AbortTransaction() ->
smgrDoPendingDeletes(), before the undo execution starts. The situation will
change as soon as the file removal is also handled by the undo subsystem;
however, I'm still not sure how to hit the TransactionIdDidCommit() call for
an XID already truncated from the CLOG.

I'm starting to admit that there's no issue here: temporary undo is always
applied immediately in the foreground, and thus the zheap_exec_pending_rollback()
function never needs to examine an XID which no longer exists in the CLOG.

> I am not able to understand what exact problem you are facing for temp
> tables after the session exit. Can you please explain it a bit more?

The problem is that the temporary undo is loaded into backend-local
buffers. Thus there's no guarantee that we'll find consistent information in
the undo file even if the backend exited cleanly (local buffers are not
flushed at backend exit and there's no WAL for them). However, we need to read
the undo file to find out whether (part of) it can be discarded.

I'm trying to find out whether we can ignore the temporary undo when trying to
advance oldestFullXidHavingUndo or not. If we can, then each backend can manage
its temporary undo on its own and - instead of checking which chunks can be
discarded - simply delete the undo files on exit as a whole, just like it
deletes temporary relations. Thus we wouldn't need to pay any special
attention to discarding. Also, if backends managed the temporary undo this
way, it wouldn't be necessary to track it via the shared memory
(UndoLogMetaData).
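
A minimal sketch of that approach (hypothetical, not patch code;
RemoveTempUndoFiles() is an assumed helper):

/*
 * Hypothetical sketch: if each backend owns its temporary undo privately,
 * it can simply unlink its per-session undo segment files at exit, the same
 * way temporary relation files are cleaned up, with no discard bookkeeping
 * and no shared-memory slot.
 */
static void
AtProcExit_TempUndo(int code, Datum arg)
{
    /* local buffers are never flushed anyway, so drop the files wholesale */
    RemoveTempUndoFiles(MyBackendId);   /* assumed helper */
}

/* ... during backend startup ... */
before_shmem_exit(AtProcExit_TempUndo, (Datum) 0);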

(With this approach, the undo record to delete the temporary relation must not

Re: POC: Cleaning up orphaned files using undo logs

2021-09-27 Thread Amit Kapila
On Mon, Sep 27, 2021 at 7:43 PM Antonin Houska  wrote:
>
> Amit Kapila  wrote:
>
> Although the postgres core probably does not raise FATAL errors too often (OOM
> conditions seem to be the typical cause), I'm still not enthusiastic about the
> idea that the undo feature turns such errors into PANIC.
>
> I wonder what the reason to avoid undoing a transaction on FATAL is. If it's
> about the possibly long duration of the undo execution, deletion of orphaned
> files (relations or whole databases) via undo shouldn't make things worse,
> because currently FATAL also triggers this sort of cleanup immediately; it's
> just implemented in a different way.
>

During FATAL, we don't want to perform more operations that could make
the situation worse. Say we are already short of memory (OOM): undo
execution may try to allocate even more memory, which won't do any good.
Depending on the implementation, undo execution might sometimes need to
perform WAL writes or data writes, which we don't want to do during FATAL
error processing.
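
For illustration, the promotion being discussed could look roughly like this
(hypothetical sketch, not patch code; XactHasUnappliedUndo() is an assumed
helper):

/*
 * Hypothetical sketch: without a background undo worker, a FATAL error in a
 * transaction that has written permanent undo could be promoted to PANIC so
 * that crash recovery, rather than the dying backend, applies the undo.
 */
if (elevel == FATAL && XactHasUnappliedUndo())
    elevel = PANIC;     /* let the startup process execute the undo */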

-- 
With Regards,
Amit Kapila.




Re: POC: Cleaning up orphaned files using undo logs

2021-09-27 Thread Antonin Houska
Amit Kapila  wrote:

> On Fri, Sep 24, 2021 at 4:44 PM Antonin Houska  wrote:
> >
> > Amit Kapila  wrote:
> >
> > > On Mon, Sep 20, 2021 at 10:24 AM Antonin Houska  wrote:
> > > >
> > > > Amit Kapila  wrote:
> > > >
> > > > > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > > > > >
> > > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote:
> > > > > >
> > > > > > * What happened with the idea of abandoning discard worker for the sake
> > > > > >   of simplicity? From what I see limiting everything to foreground undo
> > > > > >   could reduce the core of the patch series to the first four patches
> > > > > >   (forgetting about test and docs, but I guess it would be enough at
> > > > > >   least for the design review), which is already less overwhelming.
> >
> > > > What we can miss, at least for the cleanup of the orphaned files, is the *undo
> > > > worker*. In this patch series the cleanup is handled by the startup process.
> > > >
> > >
> > > Okay, I think various people have suggested that idea at different points in
> > > time. I think one thing we might need to consider is what to do in case of a
> > > FATAL error. In case of a FATAL error, it won't be advisable to execute undo
> > > immediately, so would we upgrade the error to PANIC in such cases? I vaguely
> > > remember that for the cleanup of orphaned files this can happen only rarely,
> > > and someone has suggested upgrading the error to PANIC in such a case, but I
> > > don't remember the exact details.
> >
> > Do you mean FATAL error during normal operation?
> >
> 
> Yes.
> 
> > As far as I understand, even
> > zheap does not rely on immediate UNDO execution (otherwise it'd never
> > introduce the undo worker), so FATAL only means that the undo needs to be
> > applied later so it can be discarded.
> >
> 
> Yeah, zheap either applies undo later via background worker or next
> time before dml operation if there is a need.
> 
> > As for the orphaned files cleanup feature with no undo worker, we might need
> > PANIC to ensure that the undo is applied during restart and that it can be
> > discarded, otherwise the unapplied undo log would stay there until the next
> > (regular) restart and it would block discarding. However upgrading FATAL to
> > PANIC just because the current transaction created a table seems quite
> > rude.
> >
> 
> True, I guess, but we can at some point look at all the scenarios in which it
> can generate FATAL during that operation.

By "that operation" you mean "CREATE TABLE"?

It's not about FATAL during CREATE TABLE, rather it's about FATAL anytime
during a transaction. Whichever operation caused the FATAL error, we'd need to
upgrade it to PANIC as long as the transaction has some undo.

Although the postgres core probably does not raise FATAL errors too often (OOM
conditions seem to be the typical cause), I'm still not enthusiastic about the
idea that the undo feature turns such errors into PANIC.

I wonder what the reason to avoid undoing a transaction on FATAL is. If it's
about the possibly long duration of the undo execution, deletion of orphaned
files (relations or whole databases) via undo shouldn't make things worse,
because currently FATAL also triggers this sort of cleanup immediately; it's
just implemented in a different way.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: POC: Cleaning up orphaned files using undo logs

2021-09-25 Thread Amit Kapila
On Fri, Sep 24, 2021 at 4:44 PM Antonin Houska  wrote:
>
> Amit Kapila  wrote:
>
> > On Mon, Sep 20, 2021 at 10:24 AM Antonin Houska  wrote:
> > >
> > > Amit Kapila  wrote:
> > >
> > > > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > > > >
> > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote:
> > > > >
> > > > > * What happened with the idea of abandoning discard worker for the sake
> > > > >   of simplicity? From what I see limiting everything to foreground undo
> > > > >   could reduce the core of the patch series to the first four patches
> > > > >   (forgetting about test and docs, but I guess it would be enough at
> > > > >   least for the design review), which is already less overwhelming.
>
> > > What we can miss, at least for the cleanup of the orphaned files, is the *undo
> > > worker*. In this patch series the cleanup is handled by the startup process.
> > >
> >
> > Okay, I think various people have suggested that idea at different points in
> > time. I think one thing we might need to consider is what to do in case of a
> > FATAL error. In case of a FATAL error, it won't be advisable to execute undo
> > immediately, so would we upgrade the error to PANIC in such cases? I vaguely
> > remember that for the cleanup of orphaned files this can happen only rarely,
> > and someone has suggested upgrading the error to PANIC in such a case, but I
> > don't remember the exact details.
>
> Do you mean FATAL error during normal operation?
>

Yes.

> As far as I understand, even
> zheap does not rely on immediate UNDO execution (otherwise it'd never
> introduce the undo worker), so FATAL only means that the undo needs to be
> applied later so it can be discarded.
>

Yeah, zheap either applies undo later via background worker or next
time before dml operation if there is a need.

> As for the orphaned files cleanup feature with no undo worker, we might need
> PANIC to ensure that the undo is applied during restart and that it can be
> discarded, otherwise the unapplied undo log would stay there until the next
> (regular) restart and it would block discarding. However upgrading FATAL to
> PANIC just because the current transaction created a table seems quite
> rude.
>

True, I guess, but we can at some point look at all the scenarios in which it
can generate FATAL during that operation.

> So the undo worker might be needed even for this patch?
>

I think we can keep undo worker as a separate patch and for base patch
keep the idea of promoting FATAL to PANIC. This will at the very least
make the review easier.

> Or do you mean FATAL error when executing the UNDO?
>

No.

-- 
With Regards,
Amit Kapila.




Re: POC: Cleaning up orphaned files using undo logs

2021-09-24 Thread Antonin Houska
Amit Kapila  wrote:

> On Mon, Sep 20, 2021 at 10:24 AM Antonin Houska  wrote:
> >
> > Amit Kapila  wrote:
> >
> > > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > > >
> > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote:
> > > >
> > > > * What happened with the idea of abandoning discard worker for the sake
> > > >   of simplicity? From what I see limiting everything to foreground undo
> > > >   could reduce the core of the patch series to the first four patches
> > > >   (forgetting about test and docs, but I guess it would be enough at
> > > >   least for the design review), which is already less overwhelming.

> > What we can miss, at least for the cleanup of the orphaned files, is the *undo
> > worker*. In this patch series the cleanup is handled by the startup process.
> >
> 
> Okay, I think various people have suggested that idea at different points in
> time. I think one thing we might need to consider is what to do in case of a
> FATAL error. In case of a FATAL error, it won't be advisable to execute undo
> immediately, so would we upgrade the error to PANIC in such cases? I vaguely
> remember that for the cleanup of orphaned files this can happen only rarely,
> and someone has suggested upgrading the error to PANIC in such a case, but I
> don't remember the exact details.

Do you mean FATAL error during normal operation? As far as I understand, even
zheap does not rely on immediate UNDO execution (otherwise it'd never
introduce the undo worker), so FATAL only means that the undo needs to be
applied later so it can be discarded.

As for the orphaned files cleanup feature with no undo worker, we might need
PANIC to ensure that the undo is applied during restart and that it can be
discarded, otherwise the unapplied undo log would stay there until the next
(regular) restart and it would block discarding. However upgrading FATAL to
PANIC just because the current transaction created a table seems quite
rude. So the undo worker might be needed even for this patch?

Or do you mean FATAL error when executing the UNDO?

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: POC: Cleaning up orphaned files using undo logs

2021-09-22 Thread Amit Kapila
On Mon, Sep 20, 2021 at 10:55 AM Antonin Houska  wrote:
>
> Amit Kapila  wrote:
>
> > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska  wrote:
>
> > > The problem of the temporary undo log is that it's loaded into local buffers
> > > and that backend can exit w/o flushing local buffers to disk, and thus we are
> > > not guaranteed to find enough information when trying to discard the undo log
> > > the backend wrote. I'm thinking about the following solutions:
> > >
> > > 1. Let the backend manage temporary undo log on its own (even the slot
> > >    metadata would stay outside the shared memory, and in particular the
> > >    insertion pointer could start from 1 for each session) and remove the
> > >    segment files at the same moment the temporary relations are removed.
> > >
> > >    However, by moving the temporary undo slots away from the shared memory,
> > >    computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would
> > >    be affected. It might seem that a transaction which only writes undo log
> > >    for temporary relations does not need to affect oldestFullXidHavingUndo,
> > >    but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo
> > >    prevents transactions to be truncated from the CLOG too early, I wonder if
> > >    the following is possible (This scenario is only applicable to the zheap
> > >    storage engine [1], which is not included in this patch, but should already
> > >    be considered.):
> > >
> > >    A transaction creates a temporary table, does some (many) changes and then
> > >    gets rolled back. The undo records are being applied and it takes some
> > >    time. Since XID of the transaction did not affect oldestFullXidHavingUndo,
> > >    the XID can disappear from the CLOG due to truncation.
> > >
> >
> > By above do you mean to say that in zheap code, we don't consider XIDs
> > that operate on temp table/undo for oldestFullXidHavingUndo?
>
> I was referring to the code
>
> /* We can't process temporary undo logs. */
> if (log->meta.persistence == UNDO_TEMP)
> continue;
>
> in undodiscard.c:UndoDiscard().
>

Here, I think it will just skip undo of temporary undo logs and
oldestFullXidHavingUndo should be advanced after skipping it.

> >
> > > However zundo.c in
> > >[1] indicates that the transaction status *is* checked during undo
> > >execution, so we might have a problem.
> > >
> >
> > It would be easier to follow if you can tell which exact code are you
> > referring here?
>
> I meant the call of TransactionIdDidCommit() in
> zundo.c:zheap_exec_pending_rollback().
>

IIRC, this should be called for temp tables after they have exited as
this is only to apply the pending undo actions if any, and in case of
temporary undo after session exit, we shouldn't need it.

I am not able to understand what exact problem you are facing for temp
tables after the session exit. Can you please explain it a bit more?

-- 
With Regards,
Amit Kapila.




Re: POC: Cleaning up orphaned files using undo logs

2021-09-21 Thread Amit Kapila
On Mon, Sep 20, 2021 at 10:24 AM Antonin Houska  wrote:
>
> Amit Kapila  wrote:
>
> > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > >
> > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote:
> > >
> > > * What happened with the idea of abandoning discard worker for the sake
> > >   of simplicity? From what I see limiting everything to foreground undo
> > >   could reduce the core of the patch series to the first four patches
> > >   (forgetting about test and docs, but I guess it would be enough at
> > >   least for the design review), which is already less overwhelming.
> > >
> >
> > I think the discard worker would be required even if we decide to
> > apply all the undo in the foreground. We need to forget/remove the
> > undo of committed transactions as well which we can't remove
> > immediately after the commit.
>
> I think I proposed foreground discarding at some point, but you reminded me
> that the undo may still be needed for some time even after transaction
> commit. Thus the discard worker is indispensable.
>

Right.

> What we can miss, at least for the cleanup of the orphaned files, is the *undo
> worker*. In this patch series the cleanup is handled by the startup process.
>

Okay, I think various people have suggested that idea at different points in
time. I think one thing we might need to consider is what to do in case of a
FATAL error. In case of a FATAL error, it won't be advisable to execute undo
immediately, so would we upgrade the error to PANIC in such cases? I vaguely
remember that for the cleanup of orphaned files this can happen only rarely,
and someone has suggested upgrading the error to PANIC in such a case, but I
don't remember the exact details.

-- 
With Regards,
Amit Kapila.




Re: POC: Cleaning up orphaned files using undo logs

2021-09-21 Thread Dmitry Dolgov
On Tue, 21 Sep 2021 09:00 Antonin Houska,  wrote:

> Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > Yep, makes sense, thanks. I have few more questions:
> >
> > * The use case with orphaned files is working somewhat differently after
> >   the rebase on the latest master, do you observe it as well? The
> >   difference is ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up
> >   an orphaned relation file immediately (only later on checkpoint)
> >   because of empty pendingUnlinks. I haven't investigated more yet, but
> >   seems like after this commit:
> >
> > commit 7ff23c6d277d1d90478a51f0dd81414d343f3850
> > Author: Thomas Munro 
> > Date:   Mon Aug 2 17:32:20 2021 +1200
> >
> > Run checkpointer and bgwriter in crash recovery.
> >
> > Start up the checkpointer and bgwriter during crash recovery (except in
> > --single mode), as we do for replication.  This wasn't done back in
> > commit cdd46c76 out of caution.  Now it seems like a better idea to make
> > the environment as similar as possible in both cases.  There may also be
> > some performance advantages.
> >
> >   something has to be updated (pendingOps are empty right now, so no
> >   unlink request is remembered).
>
> I haven't been debugging that part recently, but yes, this commit is relevant,
> thanks for pointing that out! Attached is a patch that should fix it. I'll
> include it in the next version of the patch series, unless you tell me that
> something is still wrong.
>

Sure, but  I can take a look only in a couple of days.

>


Re: POC: Cleaning up orphaned files using undo logs

2021-09-21 Thread Antonin Houska
Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote:
> > Antonin Houska  wrote:
> >
> > > Dmitry Dolgov <9erthali...@gmail.com> wrote:
> >
> > > > * By throwing at the patchset `make installcheck` I'm getting from time to time
> > > >   an error on the restart:
> > > >
> > > > TRAP: FailedAssertion("BufferIsValid(buffers[nbuffers].buffer)",
> > > > File: "undorecordset.c", Line: 1098, PID: 6055)
> > > >
> > > >   From what I see XLogReadBufferForRedoExtended finds an invalid buffer and
> > > >   returns BLK_NOTFOUND. The commentary says:
> > > >
> > > >  If the block was not found, then it must be discarded later in
> > > >  the WAL.
> > > >
> > > >   and continues with skip = false, but fails to get a page from an invalid
> > > >   buffer a few lines later. It seems that the skip flag is supposed to be used
> > > >   in this situation, should it also guard the BufferGetPage part?
> > >
> > > I could see this sometimes too, but can't reproduce it now. It's also not clear
> > > to me how XLogReadBufferForRedoExtended() can return BLK_NOTFOUND, as the
> > > whole undo log segment is created at once, even if only part of it is needed -
> > > see allocate_empty_undo_segment().
> >
> > I could eventually reproduce the problem. The root cause was that WAL records
> > were created even for temporary / unlogged undo, and thus only empty pages
> > could be found during replay. I've fixed that and also set up a regular test for
> > the BLK_NOTFOUND value. That required a few more fixes to UndoReplay().
> >
> > Attached is a new version.
> 
> Yep, makes sense, thanks. I have few more questions:
> 
> * The use case with orphaned files is working somewhat differently after
>   the rebase on the latest master, do you observe it as well? The
>   difference is ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up
>   an orphaned relation file immediately (only later on checkpoint)
>   because of empty pendingUnlinks. I haven't investigated more yet, but
>   seems like after this commit:
> 
> commit 7ff23c6d277d1d90478a51f0dd81414d343f3850
> Author: Thomas Munro 
> Date:   Mon Aug 2 17:32:20 2021 +1200
> 
> Run checkpointer and bgwriter in crash recovery.
> 
> Start up the checkpointer and bgwriter during crash recovery (except in
> --single mode), as we do for replication.  This wasn't done back in
> commit cdd46c76 out of caution.  Now it seems like a better idea to make
> the environment as similar as possible in both cases.  There may also be
> some performance advantages.
> 
>   something has to be updated (pendingOps are empty right now, so no
>   unlink request is remembered).

I haven't been debugging that part recently, but yes, this commit is relevant,
thanks for pointing that out! Attached is a patch that should fix it. I'll
include it in the next version of the patch series, unless you tell me that
something is still wrong.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/access/undo/undorecordset.c b/src/backend/access/undo/undorecordset.c
index 59eba7dfb6..9d05824141 100644
--- a/src/backend/access/undo/undorecordset.c
+++ b/src/backend/access/undo/undorecordset.c
@@ -2622,14 +2622,6 @@ ApplyPendingUndo(void)
 		}
 	}
 
-	/*
-	 * Some undo actions may unlink files. Since the checkpointer is not
-	 * guaranteed to be up, it seems simpler to process the undo request
-	 * ourselves in the way the checkpointer would do.
-	 */
-	SyncPreCheckpoint();
-	SyncPostCheckpoint();
-
 	/* Cleanup. */
 	chunktable_destroy(sets);
 }
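
For illustration, with the checkpointer now running during crash recovery, an
undo action that removes a relation file could instead queue an unlink request,
roughly the way md.c's register_unlink_segment() does for DROP TABLE
(hypothetical sketch, not the actual patch code):

/*
 * Hypothetical sketch: queue the unlink with the checkpointer instead of
 * calling SyncPreCheckpoint()/SyncPostCheckpoint() directly as the removed
 * code above did.
 */
FileTag     tag;
RelFileNode rnode;              /* filled in from the undo record (assumed) */

memset(&tag, 0, sizeof(tag));
tag.handler = SYNC_HANDLER_MD;
tag.rnode = rnode;              /* relation file to remove */
tag.forknum = MAIN_FORKNUM;
tag.segno = 0;

/* true = retry if the checkpointer's request queue is full */
(void) RegisterSyncRequest(&tag, SYNC_UNLINK_REQUEST, true);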


Re: POC: Cleaning up orphaned files using undo logs

2021-09-19 Thread Antonin Houska
Amit Kapila  wrote:

> On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska  wrote:

> > The problem of the temporary undo log is that it's loaded into local buffers
> > and that backend can exit w/o flushing local buffers to disk, and thus we are
> > not guaranteed to find enough information when trying to discard the undo log
> > the backend wrote. I'm thinking about the following solutions:
> >
> > 1. Let the backend manage temporary undo log on its own (even the slot
> >    metadata would stay outside the shared memory, and in particular the
> >    insertion pointer could start from 1 for each session) and remove the
> >    segment files at the same moment the temporary relations are removed.
> >
> >    However, by moving the temporary undo slots away from the shared memory,
> >    computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would
> >    be affected. It might seem that a transaction which only writes undo log
> >    for temporary relations does not need to affect oldestFullXidHavingUndo,
> >    but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo
> >    prevents transactions to be truncated from the CLOG too early, I wonder if
> >    the following is possible (This scenario is only applicable to the zheap
> >    storage engine [1], which is not included in this patch, but should already
> >    be considered.):
> >
> >    A transaction creates a temporary table, does some (many) changes and then
> >    gets rolled back. The undo records are being applied and it takes some
> >    time. Since XID of the transaction did not affect oldestFullXidHavingUndo,
> >    the XID can disappear from the CLOG due to truncation.
> >
> 
> By above do you mean to say that in zheap code, we don't consider XIDs
> that operate on temp table/undo for oldestFullXidHavingUndo?

I was referring to the code

/* We can't process temporary undo logs. */
if (log->meta.persistence == UNDO_TEMP)
continue;

in undodiscard.c:UndoDiscard().
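
To make the concern concrete, a hypothetical model of that loop (simplified
data structures, not the patch's actual code) would advance the horizon past
a transaction that wrote only temporary undo:

/*
 * Hypothetical model only: temporary undo logs are skipped, so an XID that
 * wrote nothing but temporary undo never holds back the computed
 * oldestFullXidHavingUndo, and its CLOG status may be truncated away while
 * its undo is still being applied.
 */
FullTransactionId oldest = newest_full_xid;     /* start from the newest XID */

for (int i = 0; i < nlogs; i++)
{
    /* We can't process temporary undo logs. */
    if (logs[i].meta.persistence == UNDO_TEMP)
        continue;

    if (logs[i].has_unapplied_undo &&
        FullTransactionIdPrecedes(logs[i].oldest_xid_with_undo, oldest))
        oldest = logs[i].oldest_xid_with_undo;  /* permanent undo holds it back */
}
oldestFullXidHavingUndo = oldest;               /* temp-only XIDs are ignored */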

> 
> > However zundo.c in
> >[1] indicates that the transaction status *is* checked during undo
> >execution, so we might have a problem.
> >
> 
> It would be easier to follow if you can tell which exact code are you
> referring here?

I meant the call of TransactionIdDidCommit() in
zundo.c:zheap_exec_pending_rollback().

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: POC: Cleaning up orphaned files using undo logs

2021-09-19 Thread Antonin Houska
Amit Kapila  wrote:

> On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> >
> > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote:
> >
> > * What happened with the idea of abandoning discard worker for the sake
> >   of simplicity? From what I see limiting everything to foreground undo
> >   could reduce the core of the patch series to the first four patches
> >   (forgetting about test and docs, but I guess it would be enough at
> >   least for the design review), which is already less overwhelming.
> >
> 
> I think the discard worker would be required even if we decide to
> apply all the undo in the foreground. We need to forget/remove the
> undo of committed transactions as well which we can't remove
> immediately after the commit.

I think I proposed foreground discarding at some point, but you reminded me
that the undo may still be needed for some time even after transaction
commit. Thus the discard worker is indispensable.

What we can miss, at least for the cleanup of the orphaned files, is the *undo
worker*. In this patch series the cleanup is handled by the startup process.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: POC: Cleaning up orphaned files using undo logs

2021-09-18 Thread Amit Kapila
On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote:
>
> * What happened with the idea of abandoning discard worker for the sake
>   of simplicity? From what I see limiting everything to foreground undo
>   could reduce the core of the patch series to the first four patches
>   (forgetting about test and docs, but I guess it would be enough at
>   least for the design review), which is already less overwhelming.
>

I think the discard worker would be required even if we decide to
apply all the undo in the foreground. We need to forget/remove the
undo of committed transactions as well which we can't remove
immediately after the commit.

-- 
With Regards,
Amit Kapila.




Re: POC: Cleaning up orphaned files using undo logs

2021-09-17 Thread Dmitry Dolgov
> On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote:
> Antonin Houska  wrote:
>
> > Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > > * By throwing at the patchset `make installcheck` I'm getting from time to time
> > >   an error on the restart:
> > >
> > > TRAP: FailedAssertion("BufferIsValid(buffers[nbuffers].buffer)",
> > > File: "undorecordset.c", Line: 1098, PID: 6055)
> > >
> > >   From what I see XLogReadBufferForRedoExtended finds an invalid buffer and
> > >   returns BLK_NOTFOUND. The commentary says:
> > >
> > >  If the block was not found, then it must be discarded later in
> > >  the WAL.
> > >
> > >   and continues with skip = false, but fails to get a page from an invalid
> > >   buffer a few lines later. It seems that the skip flag is supposed to be used
> > >   in this situation, should it also guard the BufferGetPage part?
> >
> > I could see this sometimes too, but can't reproduce it now. It's also not clear
> > to me how XLogReadBufferForRedoExtended() can return BLK_NOTFOUND, as the
> > whole undo log segment is created at once, even if only part of it is needed -
> > see allocate_empty_undo_segment().
>
> I could eventually reproduce the problem. The root cause was that WAL records
> were created even for temporary / unlogged undo, and thus only empty pages
> could be found during replay. I've fixed that and also setup regular test for
> the BLK_NOTFOUND value. That required a few more fixes to UndoReplay().
>
> Attached is a new version.

Yep, makes sense, thanks. I have few more questions:

* The use case with orphaned files is working somewhat differently after
  the rebase on the latest master, do you observe it as well? The
  difference is ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up
  an orphaned relation file immediately (only later on checkpoint)
  because of empty pendingUnlinks. I haven't investigated more yet, but
  seems like after this commit:

commit 7ff23c6d277d1d90478a51f0dd81414d343f3850
Author: Thomas Munro 
Date:   Mon Aug 2 17:32:20 2021 +1200

Run checkpointer and bgwriter in crash recovery.

Start up the checkpointer and bgwriter during crash recovery (except in
--single mode), as we do for replication.  This wasn't done back in
commit cdd46c76 out of caution.  Now it seems like a better idea to make
the environment as similar as possible in both cases.  There may also be
some performance advantages.

  something has to be updated (pendingOps are empty right now, so no
  unlink request is remembered).

* What happened with the idea of abandoning discard worker for the sake
  of simplicity? From what I see limiting everything to foreground undo
  could reduce the core of the patch series to the first four patches
  (forgetting about test and docs, but I guess it would be enough at
  least for the design review), which is already less overwhelming.




Re: POC: Cleaning up orphaned files using undo logs

2021-09-17 Thread Amit Kapila
On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska  wrote:
>
> The cfbot complained that the patch series no longer applies, so I've rebased
> it and also tried to make sure that the other flags become green.
>
> One particular problem was that pg_upgrade complained that "live undo data"
> remains in the old cluster. I found out that the temporary undo log causes the
> problem, so I've adjusted the query in check_for_undo_data() accordingly until
> the problem gets fixed properly.
>
> The problem of the temporary undo log is that it's loaded into local buffers
> and that backend can exit w/o flushing local buffers to disk, and thus we are
> not guaranteed to find enough information when trying to discard the undo log
> the backend wrote. I'm thinking about the following solutions:
>
> 1. Let the backend manage temporary undo log on its own (even the slot
>metadata would stay outside the shared memory, and in particular the
>insertion pointer could start from 1 for each session) and remove the
>segment files at the same moment the temporary relations are removed.
>
>However, by moving the temporary undo slots away from the shared memory,
>computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would
>be affected. It might seem that a transaction which only writes undo log
>for temporary relations does not need to affect oldestFullXidHavingUndo,
>but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo
>prevents transactions to be truncated from the CLOG too early, I wonder if
>the following is possible (This scenario is only applicable to the zheap
>storage engine [1], which is not included in this patch, but should already
>be considered.):
>
>A transaction creates a temporary table, does some (many) changes and then
>gets rolled back. The undo records are being applied and it takes some
>time. Since XID of the transaction did not affect oldestFullXidHavingUndo,
>the XID can disappear from the CLOG due to truncation.
>

By above do you mean to say that in zheap code, we don't consider XIDs
that operate on temp table/undo for oldestFullXidHavingUndo?

> However zundo.c in
>[1] indicates that the transaction status *is* checked during undo
>execution, so we might have a problem.
>

It would be easier to follow if you can tell which exact code you are
referring to here.


-- 
With Regards,
Amit Kapila.




Re: POC: Cleaning up orphaned files using undo logs

2021-08-16 Thread Dmitry Dolgov
> On Wed, Jun 30, 2021 at 07:41:16PM +0200, Antonin Houska wrote:
> Antonin Houska  wrote:
>
> > tsunakawa.ta...@fujitsu.com  wrote:
> >
> > > I'm crawling like a snail to read the patch set.  Below are my first set 
> > > of review comments, which are all minor.
> >
> > Thanks.
>
> I've added the patch to the upcoming CF [1], so it possibly gets more review
> and makes some progress. I've marked myself as the author so it's clear who
> will try to respond to the reviews. It's clear that other people did much more
> work on the feature than I did so far - they are welcome to add themselves to
> the author list.
>
> [1] https://commitfest.postgresql.org/33/3228/

Hi,

I'm crawling through the patch set like an even slower creature than a snail,
sorry for the long absence. I'm reading the latest version posted here and,
although it's hard to give any high-level design comments on it yet, I thought
it could be useful to post a few findings and questions in the meantime.

* One question about the functionality:

  > On Fri, Jan 29, 2021 at 06:30:15PM +0100, Antonin Houska wrote:
  > Attached is the next version. Changes done:
  >
  >   * Removed the progress tracking and implemented undo discarding in a simpler
  > way. Now, instead of maintaining the pointer to the last record applied,
  > only a boolean field in the chunk header is set when ROLLBACK is
  > done. This helps to determine whether the undo of a non-committed
  > transaction can be discarded.

  Just to clarify, the whole feature was removed for the sake of
  simplicity, right?

* By throwing at the patchset `make installcheck` I'm getting from time to time
  an error on the restart:

TRAP: FailedAssertion("BufferIsValid(buffers[nbuffers].buffer)",
File: "undorecordset.c", Line: 1098, PID: 6055)

  From what I see XLogReadBufferForRedoExtended finds an invalid buffer and
  returns BLK_NOTFOUND. The commentary says:

 If the block was not found, then it must be discarded later in
 the WAL.

  and continues with skip = false, but fails to get a page from an invalid
  buffer a few lines later. It seems that the skip flag is supposed to be used
  in this situation, should it also guard the BufferGetPage part? (A sketch of
  such a guard follows at the end of this message.)

* Another interesting issue I've found happened inside
  DropUndoLogsInTablespace, when the process got SIGTERM. It seems processing
  got stuck on:

slist_foreach_modify(iter, >shared_free_lists[i])

  iterating on the same element over and over. My guess is that
  clear_private_free_lists was called and caused such an unexpected outcome;
  should the access to shared_free_lists be somehow protected?

* I also wonder about the segments in base/undo, the commentary in pg_undodump
  says:

 Since the UNDO log is a continuous stream of changes, any hole
 terminates processing.

  It looks like it's relatively easy to end up with such holes, and pg_undodump
  ends up with a message ("found" is added by me and contains a found offset
  which does not match the expected value):

pg_undodump: error: segment 00 missing in log 2, found 10

  This doesn't seem to cause any real issues, but it's not clear to me whether
  such a situation with gaps is fine or a problem.

Other than that, one more time thank you for this tremendous work; I find the
topic to be of extreme importance.
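
Regarding the assertion failure above, the kind of guard suggested there might
look like this (a sketch only, assuming the surrounding redo code; not the
actual undorecordset.c source):

/*
 * Sketch of letting the skip flag also guard BufferGetPage(): if the block
 * is gone (discarded later in the WAL), never touch the invalid buffer.
 */
XLogRedoAction action;
Buffer      buffer;
Page        page = NULL;

action = XLogReadBufferForRedoExtended(record, block_id, RBM_NORMAL,
                                       false, &buffer);
if (action == BLK_NOTFOUND)
    skip = true;            /* the block will be discarded later in the WAL */

if (!skip)
    page = BufferGetPage(buffer);   /* only dereference a valid buffer */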




Re: POC: Cleaning up orphaned files using undo logs

2021-07-15 Thread vignesh C
On Wed, Jun 30, 2021 at 11:10 PM Antonin Houska  wrote:
>
> Antonin Houska  wrote:
>
> > tsunakawa.ta...@fujitsu.com  wrote:
> >
> > > I'm crawling like a snail to read the patch set.  Below are my first set 
> > > of review comments, which are all minor.
> >
> > Thanks.
>
> I've added the patch to the upcoming CF [1], so it possibly gets more review
> and makes some progress. I've marked myself as the author so it's clear who
> will try to respond to the reviews. It's clear that other people did much more
> work on the feature than I did so far - they are welcome to add themselves to
> the author list.
>
> [1] https://commitfest.postgresql.org/33/3228/
>

The patch does not apply on HEAD anymore; could you rebase and post an
updated patch? I'm changing the status to "Waiting for Author".

Regards,
Vignesh




Re: POC: Cleaning up orphaned files using undo logs

2021-06-30 Thread Antonin Houska
Antonin Houska  wrote:

> tsunakawa.ta...@fujitsu.com  wrote:
> 
> > I'm crawling like a snail to read the patch set.  Below are my first set of 
> > review comments, which are all minor.
> 
> Thanks.

I've added the patch to the upcoming CF [1], so it possibly gets more review
and makes some progress. I've marked myself as the author so it's clear who
will try to respond to the reviews. It's clear that other people did much more
work on the feature than I did so far - they are welcome to add themselves to
the author list.

[1] https://commitfest.postgresql.org/33/3228/

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: POC: Cleaning up orphaned files using undo logs

2021-03-10 Thread Antonin Houska
tsunakawa.ta...@fujitsu.com  wrote:

> I'm crawling like a snail to read the patch set.  Below are my first set of 
> review comments, which are all minor.

Thanks. I will check your comments when preparing the next version of
the patch.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




RE: POC: Cleaning up orphaned files using undo logs

2021-03-09 Thread tsunakawa.ta...@fujitsu.com
I'm crawling like a snail to read the patch set.  Below are my first set of 
review comments, which are all minor.


(1)
+ 
tablespacetemporary

temporary -> undo


(2)
  undo_tablespaces (string)
+
...
+The value is a list of names of tablespaces.  When there is more than
+one name in the list, PostgreSQL chooses an
+arbitrary one.  If the name doesn't correspond to an existing
+tablespace, the next name is tried, and so on until all names have
+been tried.  If no valid tablespace is specified, an error is raised.
+The validation of the name doesn't happen until the first attempt to
+write undo data.

CREATE privilege needs to be mentioned like temp_tablespaces.


(3)
+The variable can only be changed before the first statement is
+executed in a transaction.

Does it include any first statement that doesn't emit undo?

(4)
+  One row for each undo log, showing current pointers,
+   transactions and backends.
+   See  for details.

I think this can just be written like "showing usage information about the undo 
log" just like other statistics views.  That way, we can avoid having to modify 
this sentence when we want to change the content of the view later.


(5)
+ discard
+ text
+ Location of the oldest data in this undo log.

The name does not match the description intuitively.  Maybe "oldest"?

BTW, how does this information help users?  (I don't mean to say we shouldn't 
output information that users cannot interpret; other DBMSs output such 
information probably for technical support staff.)


(6)
+ insert
+ text
+ Location where the next data will be written in this undo
+  log.
...
+ end
+ text
+ Location one byte past the end of the allocated physical storage
+  backing this undo log.

Again, how can these be used?  If they are useful to calculate the amount of 
used space, shouldn't they be bigint?


(7)
@@ -65,7 +65,7 @@
smgrid integer
 
 
- Block storage manager ID.  0 for regular relation data.
+ Block storage manager ID.  0 for regular relation data.
 
  

I guess this change is mistakenly included?


(8)
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
@@ -216,6 +216,7 @@ Complete list of usable sgml source files in this directory.
 
 
 
+
 

@@ -286,6 +286,7 @@



+   


It looks like this list needs to be ordered alphabetically.  So, the new line 
is better placed between pg_test_timing and pg_upgrade?


(9)
I don't want to be disliked because of being picky, but maybe pg_undo_dump 
should be pg_undodump.  Existing commands don't use '_' to separate words after 
pg_, except for pg_test_fsync and pg_test_timing.


(10)
+   This utility can only be run by the user who installed the server, because
+   it requires read-only access to the data directory.

I guess you copied this from pg_waldump or pg_resetwal, but I'm afraid this 
should be as follows, which is an excerpt from pg_controldata's page.  (The 
pages for pg_waldump and pg_resetwal should be fixed in a separate thread.)

This utility can only be run by the user who initialized the cluster because it 
requires read access to the data directory.


(11)
+The -m option cannot be used if
+either -c or -l is used.

-l -> -r
Or, why don't we align option characters with pg_waldump?  pg_waldump uses -r 
to filter by rmgr.  pg_undodump can output record contents by default like 
pg_waldump.  Considering pg_dump and pg_dumpall also output all data by 
default, that seems to be how PostgreSQL commands behave.


(12)
+   startsegendseg

startseg and endseg are not described.


(13)
+Undo files backing undo logs in the default tablespace are stored under
...
+Undo log files contain standard page headers as described in the next section,

The terminology fluctuates between "undo file" and "undo log file".  I think 
the following "undo data file" fits best.  What do you think?

+  UndoFileRead
+  Waiting for a read from an undo data file.


(14)
+Undo data exists in a 64-bit address space divided into 2^34 undo
+logs, each with a theoretical capacity of 1TB.  The first time a
+backend writes undo, it attaches to an existing undo log whose
+capacity is not yet exhausted and which is not currently being used by
+any other backend; or if no suitable undo log already exists, it
+creates a new one.  To avoid wasting space, each undo log is further
+divided into 1MB segment files, so that segments which are no longer
+needed can be removed (possibly recycling the underlying file by
+renaming it) and segments which are not yet needed do not need to be
+physically created on disk.  An undo segment file has a name like
+04.000120, where
+04 is the undo log number and
+000120 is the offset of the first byte
+held in the file.

The number of undo logs is not 2^34 but 2^24 (2^64 / 2^40, where 2^40 bytes = 1 TB).
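
For illustration, a minimal standalone sketch of that addressing scheme; the
names and the printf format are made up, only the 24/40-bit split and the 1 MB
segment size come from the text above:

#include <stdint.h>
#include <stdio.h>

#define UNDO_OFFSET_BITS   40                          /* 1 TB per undo log */
#define UNDO_OFFSET_MASK   ((UINT64_C(1) << UNDO_OFFSET_BITS) - 1)
#define UNDO_SEGMENT_SIZE  (UINT64_C(1) << 20)         /* 1 MB segment files */

typedef uint64_t UndoRecPtrSketch;

static uint64_t undo_log_number(UndoRecPtrSketch p)  { return p >> UNDO_OFFSET_BITS; }
static uint64_t undo_byte_offset(UndoRecPtrSketch p) { return p & UNDO_OFFSET_MASK; }

int main(void)
{
    UndoRecPtrSketch p = (UINT64_C(4) << UNDO_OFFSET_BITS) | UINT64_C(0x123456);
    uint64_t    seg_start = undo_byte_offset(p) - undo_byte_offset(p) % UNDO_SEGMENT_SIZE;

    /* 2^24 possible log numbers, each addressing up to 2^40 bytes */
    printf("log %llu, byte offset %llu, segment starting at byte %llu\n",
           (unsigned long long) undo_log_number(p),
           (unsigned long long) undo_byte_offset(p),
           (unsigned long long) seg_start);
    return 0;
}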


(15) 

Re: POC: Cleaning up orphaned files using undo logs

2021-02-03 Thread Amit Kapila
On Wed, Feb 3, 2021 at 2:45 PM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Antonin Houska 
> > not really an "ordinary patch".
> >
> > [1]
> > https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhR
> > q-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com
>
> I'm a bit interested in zheap-related topics.  I'm reading this discussion to 
> see what I can do.  (But this thread is too long... there are still 13,000 
> lines out of 45,000 lines.)
>
> What's the latest patch set to look at to achieve the undo infrastructure and 
> its would-be first user, orphan file cleanup?  As far as I've read, multiple 
> people posted multiple patch sets, and I don't see how they are related.
>

I feel it is good to start with the latest patch-set posted by Antonin [1].

[1] - https://www.postgresql.org/message-id/87363.1611941415%40antos

-- 
With Regards,
Amit Kapila.




RE: POC: Cleaning up orphaned files using undo logs

2021-02-03 Thread tsunakawa.ta...@fujitsu.com
From: Antonin Houska 
> not really an "ordinary patch".
> 
> [1]
> https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhR
> q-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com

I'm a bit interested in zheap-related topics.  I'm reading this discussion to 
see what I can do.  (But this thread is too long... there are still 13,000 
lines out of 45,000 lines.)

What's the latest patch set to look at to achieve the undo infrastructure and 
its would-be first user, orphan file cleanup?  As far as I've read, multiple 
people posted multiple patch sets, and I don't see how they are related.


Regards
Takayuki Tsunakawa






Re: POC: Cleaning up orphaned files using undo logs

2021-02-03 Thread Antonin Houska
Bruce Momjian  wrote:

> On Fri, Jan 29, 2021 at 06:30:15PM +0100, Antonin Houska wrote:
> > Antonin Houska  wrote:
> > > Well, on repeated run of the test I could also hit the first one. I could 
> > > fix
> > > it and will post a new version of the patch (along with some other small
> > > changes) this week.
> > 
> > Attached is the next version. Changes done:
> 
> Yikes, this patch is 23k lines, and most of it looks like added lines of
> code.  Is this size expected?

Yes, earlier versions of this patch, e.g. [1], were of comparable size. It's
not really an "ordinary patch".

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhRq-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: POC: Cleaning up orphaned files using undo logs

2021-02-01 Thread Bruce Momjian
On Fri, Jan 29, 2021 at 06:30:15PM +0100, Antonin Houska wrote:
> Antonin Houska  wrote:
> > Well, on repeated run of the test I could also hit the first one. I could 
> > fix
> > it and will post a new version of the patch (along with some other small
> > changes) this week.
> 
> Attached is the next version. Changes done:

Yikes, this patch is 23k lines, and most of it looks like added lines of
code.  Is this size expected?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: POC: Cleaning up orphaned files using undo logs

2021-01-18 Thread Antonin Houska
Dmitry Dolgov <9erthali...@gmail.com> wrote:

> Thanks for the updated patch. As I've mentioned off the list I'm slowly
> looking through it with the intent to concentrate on undo progress
> tracking. But before I will post anything I want to mention couple of
> strange issues I see, otherwise I will forget for sure. Maybe it's
> already known, but running several times 'make installcheck' against a
> freshly build postgres with the patch applied from time to time I
> observe various errors.
> 
> This one happens on a crash recovery, seems like
> UndoRecordSetXLogBufData has usr_type = USRT_INVALID and is involved in
> the replay process:
> 
> TRAP: FailedAssertion("page_offset + this_page_bytes <= 
> uph->ud_insertion_point", File: "undopage.c", Line: 300)
> postgres: startup recovering 
> 00010012(ExceptionalCondition+0xa1)[0x558b38b8a350]
> postgres: startup recovering 
> 00010012(UndoPageSkipOverwrite+0x0)[0x558b38761b7e]
> postgres: startup recovering 
> 00010012(UndoReplay+0xa1d)[0x558b38766f32]
> postgres: startup recovering 
> 00010012(XactUndoReplay+0x77)[0x558b38769281]
> postgres: startup recovering 
> 00010012(smgr_redo+0x1af)[0x558b387aa7bd]
> 
> This one is somewhat similar:
> 
> TRAP: FailedAssertion("page_offset >= SizeOfUndoPageHeaderData", File: 
> "undopage.c", Line: 287)
> postgres: undo worker for database 36893 
> (ExceptionalCondition+0xa1)[0x5559c90f1350]
> postgres: undo worker for database 36893 
> (UndoPageOverwrite+0xa6)[0x5559c8cc8ae3]
> postgres: undo worker for database 36893 
> (UpdateLastAppliedRecord+0xbe)[0x5559c8ccd008]
> postgres: undo worker for database 36893 (smgr_undo+0xa6)[0x5559c8d11989]

Well, on repeated run of the test I could also hit the first one. I could fix
it and will post a new version of the patch (along with some other small
changes) this week.

> There are also here and there messages about not found undo files:
> 
> ERROR:  cannot open undo segment file 'base/undo/08.02': No 
> such file or directory
> WARNING:  failed to undo transaction

I don't see this one in the log so far, will try again.

Thanks for the report!

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: POC: Cleaning up orphaned files using undo logs

2021-01-17 Thread Dmitry Dolgov
> On Fri, Dec 04, 2020 at 10:22:42AM +0100, Antonin Houska wrote:
> Amit Kapila  wrote:
>
> > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska  wrote:
> > >
> > > Amit Kapila  wrote:
> > >
> > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska  wrote:
> > > >
> > > > If you want to track at undo record level, then won't it lead to
> > > > performance overhead and probably additional WAL overhead considering
> > > > this action needs to be WAL-logged. I think recording at page-level
> > > > might be a better idea.
> > >
> > > I'm not worried about WAL because the undo execution needs to be 
> > > WAL-logged
> > > anyway - see smgr_undo() in the 0005- part of the patch set. What needs 
> > > to be
> > > evaluated regarding performance is the (exclusive) locking of the page 
> > > that
> > > carries the progress information.
> > >
> >
> > That is just for one kind of smgr, think how you will do it for
> > something like zheap. Their idea is to collect all the undo records
> > (unless the undo for a transaction is very large) for one zheap-page
> > and apply them together, so maintaining the status at each undo record
> > level will surely lead to a large amount of additional WAL. See below
> > how and why we have decided to do it differently.
> >
> > > I'm still not sure whether this info should
> > > be on every page or only in the chunk header. In either case, we have a
> > > problem if there are two or more chunks created by different transactions 
> > > on
> > > the same page, and if more than one of these transactions needs to perform
> > > undo. I tend to believe that this should happen rarely though.
> > >
> >
> > I think we need to maintain this information at the transaction level
> > and need to update it after processing a few blocks, at least that is
> > what was decided and implemented earlier. We also need to update it
> > when the log is switched or all the actions of the transaction were
> > applied. The reasoning is that for short transactions it won't matter
> > and for larger transactions, it is good to update it after a few pages
> > to avoid WAL and locking overhead. Also, it is better if we collect
> > the undo in bulk, this is proved to be beneficial for large
> > transactions.
>
> Attached is what I originally did not include in the patch series, see the
> part 0012. I have no better idea so far. The progress information is stored in
> the chunk header.
>
> To avoid too frequent locking, maybe the UpdateLastAppliedRecord() function
> can be modified so it recognizes when it's necessary to update the progress
> info. Also the user (zheap) should think when it should call the function.
> Since I've included 0012 now as a prerequisite for discarding (0013),
> currently it's only necessary to update the progress at undo log chunk
> boundary.
>
> In this version of the patch series I wanted to publish the remaining ideas I
> haven't published yet.

Thanks for the updated patch. As I've mentioned off the list I'm slowly
looking through it with the intent to concentrate on undo progress
tracking. But before I will post anything I want to mention couple of
strange issues I see, otherwise I will forget for sure. Maybe it's
already known, but running several times 'make installcheck' against a
freshly build postgres with the patch applied from time to time I
observe various errors.

This one happens on a crash recovery, seems like
UndoRecordSetXLogBufData has usr_type = USRT_INVALID and is involved in
the replay process:

TRAP: FailedAssertion("page_offset + this_page_bytes <= 
uph->ud_insertion_point", File: "undopage.c", Line: 300)
postgres: startup recovering 
00010012(ExceptionalCondition+0xa1)[0x558b38b8a350]
postgres: startup recovering 
00010012(UndoPageSkipOverwrite+0x0)[0x558b38761b7e]
postgres: startup recovering 
00010012(UndoReplay+0xa1d)[0x558b38766f32]
postgres: startup recovering 
00010012(XactUndoReplay+0x77)[0x558b38769281]
postgres: startup recovering 
00010012(smgr_redo+0x1af)[0x558b387aa7bd]

This one is somewhat similar:

TRAP: FailedAssertion("page_offset >= SizeOfUndoPageHeaderData", File: 
"undopage.c", Line: 287)
postgres: undo worker for database 36893 
(ExceptionalCondition+0xa1)[0x5559c90f1350]
postgres: undo worker for database 36893 
(UndoPageOverwrite+0xa6)[0x5559c8cc8ae3]
postgres: undo worker for database 36893 
(UpdateLastAppliedRecord+0xbe)[0x5559c8ccd008]
postgres: undo worker for database 36893 (smgr_undo+0xa6)[0x5559c8d11989]

There are also here and there messages about not found undo files:

ERROR:  cannot open undo segment file 'base/undo/08.02': No 
such file or directory
WARNING:  failed to undo transaction

I haven't found the trigger yet, but I got the impression that it
happens after the create_table tests.




Re: POC: Cleaning up orphaned files using undo logs

2020-12-04 Thread Amit Kapila
On Fri, Dec 4, 2020 at 1:50 PM Antonin Houska  wrote:
>
> Amit Kapila  wrote:
>
>
> > The earlier version of the patch having all these ideas
> > implemented is attached
> > (Infrastructure-to-execute-pending-undo-actions and
> > Provide-interfaces-to-store-and-fetch-undo-records). The second one
> > has some APIs used by the first one but the main concepts were
> > implemented in the first one
> > (Infrastructure-to-execute-pending-undo-actions). I see that in the
> > current version these can't be used as it is but still it can give us
> > a good start point and we might be able to either re-use some code and
> > or ideas from these patches.
>
> Is there a branch with these patches applied? They reference some functions
> that I don't see in [1]. I'd like to examine if / how my approach can be
> aligned with the current zheap design.
>

Can you check the patch-set attached in the email [1]?

[1] - 
https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhRq-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com
-- 
With Regards,
Amit Kapila.




Re: POC: Cleaning up orphaned files using undo logs

2020-11-26 Thread Antonin Houska
Amit Kapila  wrote:

> On Wed, Nov 25, 2020 at 8:00 PM Antonin Houska  wrote:

> > I meant that AbortOutOfAnyTransaction should PANIC itself if it sees that
> > there is unapplied undo, so nothing changes for its callers. Do I still miss
> > something?
> >
> 
> Adding PANIC in some generic code-path sounds scary. Why can't we
> simply try to execute undo?

Indeed it should try. I imagined it this way but probably got distracted by
some other thought when writing the email :-)

> > Actually I hit the problem of missing connection when playing with the
> > "undoxacttest" module. Those tests use table_open() / table_close() 
> > functions,
> > but it might not be necessary for the real RMGRs.
> >
> 
> How can we apply the action on a page without opening the relation?

If the undo record contains RelFileNode, ReadBufferWithoutRelcache() can be
used, just like it happens with WAL. Not sure how much it would affect zheap.
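
For what it's worth, a minimal sketch of that idea against the PG 13/14-era
bufmgr API; the helper itself and the omitted rmgr-specific page change (and
its WAL logging) are assumptions, only ReadBufferWithoutRelcache() and the
buffer locking calls are the real API:

#include "postgres.h"

#include "storage/bufmgr.h"

/* hypothetical helper: apply one undo action to a block identified only by
 * RelFileNode, without going through the relcache */
static void
apply_undo_to_block(RelFileNode rnode, ForkNumber forknum, BlockNumber blkno)
{
    Buffer      buf;
    Page        page;

    buf = ReadBufferWithoutRelcache(rnode, forknum, blkno, RBM_NORMAL, NULL);
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

    page = BufferGetPage(buf);

    /* ... the rmgr-specific undo action would modify the page here, and the
     * change would have to be WAL-logged as well ... */
    (void) page;

    MarkBufferDirty(buf);
    UnlockReleaseBuffer(buf);
}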

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: POC: Cleaning up orphaned files using undo logs

2020-11-26 Thread Antonin Houska
Amit Kapila  wrote:

> On Wed, Nov 25, 2020 at 7:47 PM Antonin Houska  wrote:
> >
> > Antonin Houska  wrote:
> >
> > > Amit Kapila  wrote:
> >
> > > > I think we also need to maintain oldestXidHavingUndo for CLOG 
> > > > truncation and
> > > > transaction-wraparound. We can't allow CLOG truncation for the 
> > > > transaction
> > > > whose undo is not discarded as that could be required by some other
> > > > transaction.
> > >
> > > Good point. Even the discard worker might need to check the transaction 
> > > status
> > > when deciding whether undo log of that transaction should be discarded.
> >
> > In the zheap code [1] I see that DiscardWorkerMain() discards undo log up to
> > OldestXmin:
> >
> >
> > OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_AUTOVACUUM |
> >
> > PROCARRAY_FLAGS_VACUUM);
> >
> > oldestXidHavingUndo = 
> > GetXidFromEpochXid(pg_atomic_read_u64(>oldestXidWithEpochHavingUndo));
> >
> > /*
> >  * Call the discard routine if there oldestXidHavingUndo is lagging
> >  * behind OldestXmin.
> >  */
> > if (OldestXmin != InvalidTransactionId &&
> > TransactionIdPrecedes(oldestXidHavingUndo, OldestXmin))
> > {
> > UndoDiscard(OldestXmin, );
> >
> > and that UndoDiscard() eventually advances oldestXidHavingUndo in the shared
> > memory.
> >
> > I'm not sure this is correct because, IMO, OldestXmin can advance as soon as
> > AbortTransaction() has cleared both xid and xmin fields of the transaction's
> > PGXACT (by calling ProcArrayEndTransactionInternal). However the 
> > corresponding
> > undo log may still be waiting for processing. Am I wrong?

> The UndoDiscard->UndoDiscardOneLog ensures that we don't discard the
> undo if there is a pending abort.

ok, I should have dug deeper than just reading the header comment of
UndoDiscard(). I've checked now and I think I understand why no information is lost.

Nevertheless, I see in the zheap code that the discard worker may need to scan
a lot of undo log each time. While the oldest_xid and oldest_data fields of
UndoLogControl help to skip parts of the log, I'm not sure such information
fits into the undo-record-set (URS) approach. For now I tend to implement the
"exhaustive" scan for the URS too; later we can teach the discard worker to
store some metadata so that the processing becomes incremental.

> > I think that oldestXidHavingUndo should be advanced at the time transaction
> > commits or when the undo log of an aborted transaction has been applied.
> >
> 
> We can't advance oldestXidHavingUndo just on commit because later we
> need to rely on it for visibility, basically any transaction older
> than oldestXidHavingUndo should be all-visible.

ok

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: POC: Cleaning up orphaned files using undo logs

2020-11-25 Thread Amit Kapila
On Wed, Nov 25, 2020 at 7:47 PM Antonin Houska  wrote:
>
> Antonin Houska  wrote:
>
> > Amit Kapila  wrote:
>
> > > I think we also need to maintain oldestXidHavingUndo for CLOG truncation 
> > > and
> > > transaction-wraparound. We can't allow CLOG truncation for the transaction
> > > whose undo is not discarded as that could be required by some other
> > > transaction.
> >
> > Good point. Even the discard worker might need to check the transaction 
> > status
> > when deciding whether undo log of that transaction should be discarded.
>
> In the zheap code [1] I see that DiscardWorkerMain() discards undo log up to
> OldestXmin:
>
>
> OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_AUTOVACUUM |
>
> PROCARRAY_FLAGS_VACUUM);
>
> oldestXidHavingUndo = 
> GetXidFromEpochXid(pg_atomic_read_u64(>oldestXidWithEpochHavingUndo));
>
> /*
>  * Call the discard routine if there oldestXidHavingUndo is lagging
>  * behind OldestXmin.
>  */
> if (OldestXmin != InvalidTransactionId &&
> TransactionIdPrecedes(oldestXidHavingUndo, OldestXmin))
> {
> UndoDiscard(OldestXmin, );
>
> and that UndoDiscard() eventually advances oldestXidHavingUndo in the shared
> memory.
>
> I'm not sure this is correct because, IMO, OldestXmin can advance as soon as
> AbortTransaction() has cleared both xid and xmin fields of the transaction's
> PGXACT (by calling ProcArrayEndTransactionInternal). However the corresponding
> undo log may still be waiting for processing. Am I wrong?
>

The UndoDiscard->UndoDiscardOneLog ensures that we don't discard the
undo if there is a pending abort.

> I think that oldestXidHavingUndo should be advanced at the time transaction
> commits or when the undo log of an aborted transaction has been applied.
>

We can't advance oldestXidHavingUndo just on commit because later we
need to rely on it for visibility, basically any transaction older
than oldestXidHavingUndo should be all-visible.
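
For illustration, the invariant stated above as a zheap-style shortcut; the
PROC_HDR field name follows this thread and the helper itself is hypothetical:

#include "postgres.h"

#include "access/transam.h"
#include "port/atomics.h"
#include "storage/proc.h"

static bool
xid_all_visible_by_undo_horizon(FullTransactionId fxid)
{
    FullTransactionId horizon;

    /* assumed patch-specific field in PROC_HDR */
    horizon.value =
        pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo);

    /* anything older than the horizon has no undo left: committed data is
     * visible and aborted data has already been rolled back */
    return FullTransactionIdPrecedes(fxid, horizon);
}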

> Then
> the discard worker would simply discard the undo log up to
> oldestXidHavingUndo. However, as the transactions whose undo is still not
> applied may no longer be registered in the shared memory (proc array), I don't
> know how to determine the next value of oldestXidHavingUndo.
>
> Also I wonder if FullTransactionId is needed for oldestXidHavingUndo in the
> shared memory rather than plain TransactionId (see
> oldestXidWithEpochHavingUndo in PROC_HDR). I think that the value cannot lag
> behind nextFullXid by more than 2 billions transactions anyway because in that
> case it would cause XID wraparound.
>

You are right but still, it is better to keep it as FullTransactionId
because (a) zheap uses FullTransactionId and we need to compare it
with oldestXidWithEpochHavingUndo for visibility purposes, and (b) in the
future, we want to get rid of this limitation for undo as well.

-- 
With Regards,
Amit Kapila.




Re: POC: Cleaning up orphaned files using undo logs

2020-11-25 Thread Amit Kapila
On Wed, Nov 25, 2020 at 8:00 PM Antonin Houska  wrote:
>
> Amit Kapila  wrote:
>
> > On Wed, Nov 18, 2020 at 4:03 PM Antonin Houska  wrote:
> > >
> > > Amit Kapila  wrote:
> > >
> > > > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska  wrote:
> > > > >
> > > > > Amit Kapila  wrote:
> > > > >
> > > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska  
> > > > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > > No background undo
> > > > > > > --
> > > > > > >
> > > > > > > Reduced complexity of the patch seems to be the priority at the 
> > > > > > > moment. Amit
> > > > > > > suggested that cleanup of an orphaned relation file is simple 
> > > > > > > enough to be
> > > > > > > done on foreground and I agree.
> > > > > > >
> > > > > >
> > > > > > Yeah, I think we should try and see if we can make it work but I
> > > > > > noticed that there are few places like AbortOutOfAnyTransaction 
> > > > > > where
> > > > > > we have the assumption that undo will be executed in the background.
> > > > > > We need to deal with it.
> > > > >
> > > > > I think this is o.k. if we always check for unapplied undo during 
> > > > > startup.
> > > > >
> > > >
> > > > Hmm, how it is ok to leave undo (and rely on startup) unless it is a
> > > > PANIC error. IIRC, this path is invoked in non-panic errors as well.
> > > > Basically, we won't be able to discard such an undo which doesn't seem
> > > > like a good idea.
> > >
> > > Since failure to apply leaves inconsistent data, I assume it should always
> > > cause PANIC, shouldn't it?
> > >
> >
> > But how can we ensure that AbortOutOfAnyTransaction will be called
> > only in that scenario?
>
> I meant that AbortOutOfAnyTransaction should PANIC itself if it sees that
> there is unapplied undo, so nothing changes for its callers. Do I still miss
> something?
>

Adding PANIC in some generic code-path sounds scary. Why can't we
simply try to execute undo?

> > > > Another thing is that it seems we need to connect to the database to 
> > > > perform
> > > > it which might appear a bit odd that we don't allow users to connect to 
> > > > the
> > > > database but internally we are connecting it.
> > >
> > > I think the implementation will need to follow the outcome of the part of 
> > > the
> > > discussion that starts at [2], but I see your concern. I'm thinking why
> > > database connection is not needed to apply WAL but is needed for UNDO. I 
> > > think
> > > locks make the difference.
> > >
> >
> > Yeah, it would be probably a good idea to see if we can make undo
> > apply work without db-connection especially if we want to do before
> > allowing connections. The other possibility could be to let discard
> > worker do this work lazily after allowing connections.
>
> Actually I hit the problem of missing connection when playing with the
> "undoxacttest" module. Those tests use table_open() / table_close() functions,
> but it might not be necessary for the real RMGRs.
>

How can we apply the action on a page without opening the relation?

-- 
With Regards,
Amit Kapila.




Re: POC: Cleaning up orphaned files using undo logs

2020-11-25 Thread Antonin Houska
Amit Kapila  wrote:

> On Wed, Nov 18, 2020 at 4:03 PM Antonin Houska  wrote:
> >
> > Amit Kapila  wrote:
> >
> > > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska  wrote:
> > > >
> > > > Amit Kapila  wrote:
> > > >
> > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska  
> > > > > wrote:
> > > > > >
> > > > > >
> > > > > > No background undo
> > > > > > --
> > > > > >
> > > > > > Reduced complexity of the patch seems to be the priority at the 
> > > > > > moment. Amit
> > > > > > suggested that cleanup of an orphaned relation file is simple 
> > > > > > enough to be
> > > > > > done on foreground and I agree.
> > > > > >
> > > > >
> > > > > Yeah, I think we should try and see if we can make it work but I
> > > > > noticed that there are few places like AbortOutOfAnyTransaction where
> > > > > we have the assumption that undo will be executed in the background.
> > > > > We need to deal with it.
> > > >
> > > > I think this is o.k. if we always check for unapplied undo during 
> > > > startup.
> > > >
> > >
> > > Hmm, how it is ok to leave undo (and rely on startup) unless it is a
> > > PANIC error. IIRC, this path is invoked in non-panic errors as well.
> > > Basically, we won't be able to discard such an undo which doesn't seem
> > > like a good idea.
> >
> > Since failure to apply leaves inconsistent data, I assume it should always
> > cause PANIC, shouldn't it?
> >
> 
> But how can we ensure that AbortOutOfAnyTransaction will be called
> only in that scenario?

I meant that AbortOutOfAnyTransaction should PANIC itself if it sees that
there is unapplied undo, so nothing changes for its callers. Do I still miss
something?

> > > Another thing is that it seems we need to connect to the database to 
> > > perform
> > > it which might appear a bit odd that we don't allow users to connect to 
> > > the
> > > database but internally we are connecting it.
> >
> > I think the implementation will need to follow the outcome of the part of 
> > the
> > discussion that starts at [2], but I see your concern. I'm thinking why
> > database connection is not needed to apply WAL but is needed for UNDO. I 
> > think
> > locks make the difference.
> >
> 
> Yeah, it would be probably a good idea to see if we can make undo
> apply work without db-connection especially if we want to do before
> allowing connections. The other possibility could be to let discard
> worker do this work lazily after allowing connections.

Actually I hit the problem of missing connection when playing with the
"undoxacttest" module. Those tests use table_open() / table_close() functions,
but it might not be necessary for the real RMGRs.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: POC: Cleaning up orphaned files using undo logs

2020-11-25 Thread Antonin Houska
Antonin Houska  wrote:

> Amit Kapila  wrote:

> > I think we also need to maintain oldestXidHavingUndo for CLOG truncation and
> > transaction-wraparound. We can't allow CLOG truncation for the transaction
> > whose undo is not discarded as that could be required by some other
> > transaction.
> 
> Good point. Even the discard worker might need to check the transaction status
> when deciding whether undo log of that transaction should be discarded.

In the zheap code [1] I see that DiscardWorkerMain() discards undo log up to
OldestXmin:


OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_AUTOVACUUM |
   PROCARRAY_FLAGS_VACUUM);

oldestXidHavingUndo = 
GetXidFromEpochXid(pg_atomic_read_u64(>oldestXidWithEpochHavingUndo));

/*
 * Call the discard routine if there oldestXidHavingUndo is lagging
 * behind OldestXmin.
 */
if (OldestXmin != InvalidTransactionId &&
TransactionIdPrecedes(oldestXidHavingUndo, OldestXmin))
{
UndoDiscard(OldestXmin, );

and that UndoDiscard() eventually advances oldestXidHavingUndo in the shared
memory.

I'm not sure this is correct because, IMO, OldestXmin can advance as soon as
AbortTransaction() has cleared both xid and xmin fields of the transaction's
PGXACT (by calling ProcArrayEndTransactionInternal). However the corresponding
undo log may still be waiting for processing. Am I wrong?

I think that oldestXidHavingUndo should be advanced at the time a transaction
commits or when the undo log of an aborted transaction has been applied. Then
the discard worker would simply discard the undo log up to
oldestXidHavingUndo. However, as the transactions whose undo is still not
applied may no longer be registered in the shared memory (proc array), I don't
know how to determine the next value of oldestXidHavingUndo.
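
Just to make the problem concrete, a sketch (all names hypothetical) of how
the next value could be derived if every undo log slot kept track of the
oldest transaction that still owns undiscarded undo in it:

#include "postgres.h"

#include "access/transam.h"

typedef struct UndoLogSlotSketch
{
    bool        in_use;
    FullTransactionId oldest_xid_with_undo; /* invalid if the log is empty */
} UndoLogSlotSketch;

static FullTransactionId
compute_oldest_fxid_having_undo(UndoLogSlotSketch *slots, int nslots,
                                FullTransactionId next_fxid)
{
    /* start from the upper bound: no transaction at all has undo */
    FullTransactionId result = next_fxid;

    for (int i = 0; i < nslots; i++)
    {
        if (!slots[i].in_use ||
            !FullTransactionIdIsValid(slots[i].oldest_xid_with_undo))
            continue;

        if (FullTransactionIdPrecedes(slots[i].oldest_xid_with_undo, result))
            result = slots[i].oldest_xid_with_undo;
    }

    return result;
}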

Also I wonder if FullTransactionId is needed for oldestXidHavingUndo in the
shared memory rather than plain TransactionId (see
oldestXidWithEpochHavingUndo in PROC_HDR). I think that the value cannot lag
behind nextFullXid by more than 2 billions transactions anyway because in that
case it would cause XID wraparound. (That in turn makes me think that VACUUM
FREEZE should be able to discard undo log too.)

[1] https://github.com/EnterpriseDB/zheap/tree/master

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: POC: Cleaning up orphaned files using undo logs

2020-11-18 Thread Amit Kapila
On Wed, Nov 18, 2020 at 4:03 PM Antonin Houska  wrote:
>
> Amit Kapila  wrote:
>
> > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska  wrote:
> > >
> > > Amit Kapila  wrote:
> > >
> > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska  wrote:
> > > > >
> > > > >
> > > > > No background undo
> > > > > --
> > > > >
> > > > > Reduced complexity of the patch seems to be the priority at the 
> > > > > moment. Amit
> > > > > suggested that cleanup of an orphaned relation file is simple enough 
> > > > > to be
> > > > > done on foreground and I agree.
> > > > >
> > > >
> > > > Yeah, I think we should try and see if we can make it work but I
> > > > noticed that there are few places like AbortOutOfAnyTransaction where
> > > > we have the assumption that undo will be executed in the background.
> > > > We need to deal with it.
> > >
> > > I think this is o.k. if we always check for unapplied undo during startup.
> > >
> >
> > Hmm, how it is ok to leave undo (and rely on startup) unless it is a
> > PANIC error. IIRC, this path is invoked in non-panic errors as well.
> > Basically, we won't be able to discard such an undo which doesn't seem
> > like a good idea.
>
> Since failure to apply leaves inconsistent data, I assume it should always
> cause PANIC, shouldn't it?
>

But how can we ensure that AbortOutOfAnyTransaction will be called
only in that scenario?

> (Thomas seems to assume the same in [1].)
>
> > > > > "undo worker" is still there, but it only processes undo requests 
> > > > > after server
> > > > > restart because relation data can only be changed in a transaction - 
> > > > > it seems
> > > > > cleaner to launch a background worker for this than to hack the 
> > > > > startup
> > > > > process.
> > > > >
> > > >
> > > > But, I see there are still multiple undoworkers that are getting
> > > > launched and I am not sure if that works correctly because a
> > > > particular undoworker is connected to a database and then it starts
> > > > processing all the pending undo.
> > >
> > > Each undo worker applies only transactions for its own database, see
> > > ProcessExistingUndoRequests():
> > >
> > > /* We can only process undo of the database we are connected to. 
> > > */
> > > if (xact_hdr.dboid != MyDatabaseId)
> > > continue;
> > >
> > > Nevertheless, as I've just mentioned in my response to Thomas, I admit 
> > > that we
> > > should try to live w/o the undo worker altogether.
> > >
> >
> > Okay, but keep in mind that there could be a large amount of undo
> > (unlike redo which has some limit as we can replay it from the last
> > checkpoint) which needs to be processed but it might be okay to live
> > with that for now.
>
> Yes, the information to remove relation file does not take much space in the
> undo log.
>
> > Another thing is that it seems we need to connect to the database to perform
> > it which might appear a bit odd that we don't allow users to connect to the
> > database but internally we are connecting it.
>
> I think the implementation will need to follow the outcome of the part of the
> discussion that starts at [2], but I see your concern. I'm thinking why
> database connection is not needed to apply WAL but is needed for UNDO. I think
> locks make the difference.
>

Yeah, it would be probably a good idea to see if we can make undo
apply work without db-connection especially if we want to do before
allowing connections. The other possibility could be to let discard
worker do this work lazily after allowing connections.

-- 
With Regards,
Amit Kapila.




Re: POC: Cleaning up orphaned files using undo logs

2020-11-18 Thread Antonin Houska
Amit Kapila  wrote:

> On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska  wrote:
> >
> > Amit Kapila  wrote:
> >
> > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska  wrote:
> > > >
> > > >
> > > > No background undo
> > > > --
> > > >
> > > > Reduced complexity of the patch seems to be the priority at the moment. 
> > > > Amit
> > > > suggested that cleanup of an orphaned relation file is simple enough to 
> > > > be
> > > > done on foreground and I agree.
> > > >
> > >
> > > Yeah, I think we should try and see if we can make it work but I
> > > noticed that there are few places like AbortOutOfAnyTransaction where
> > > we have the assumption that undo will be executed in the background.
> > > We need to deal with it.
> >
> > I think this is o.k. if we always check for unapplied undo during startup.
> >
> 
> Hmm, how it is ok to leave undo (and rely on startup) unless it is a
> PANIC error. IIRC, this path is invoked in non-panic errors as well.
> Basically, we won't be able to discard such an undo which doesn't seem
> like a good idea.

Since failure to apply leaves inconsistent data, I assume it should always
cause PANIC, shouldn't it? (Thomas seems to assume the same in [1].)

> > > > "undo worker" is still there, but it only processes undo requests after 
> > > > server
> > > > restart because relation data can only be changed in a transaction - it 
> > > > seems
> > > > cleaner to launch a background worker for this than to hack the startup
> > > > process.
> > > >
> > >
> > > But, I see there are still multiple undoworkers that are getting
> > > launched and I am not sure if that works correctly because a
> > > particular undoworker is connected to a database and then it starts
> > > processing all the pending undo.
> >
> > Each undo worker applies only transactions for its own database, see
> > ProcessExistingUndoRequests():
> >
> > /* We can only process undo of the database we are connected to. */
> > if (xact_hdr.dboid != MyDatabaseId)
> > continue;
> >
> > Nevertheless, as I've just mentioned in my response to Thomas, I admit that 
> > we
> > should try to live w/o the undo worker altogether.
> >
> 
> Okay, but keep in mind that there could be a large amount of undo
> (unlike redo which has some limit as we can replay it from the last
> checkpoint) which needs to be processed but it might be okay to live
> with that for now.

Yes, the information needed to remove a relation file does not take much space in the
undo log.

> Another thing is that it seems we need to connect to the database to perform
> it which might appear a bit odd that we don't allow users to connect to the
> database but internally we are connecting it.

I think the implementation will need to follow the outcome of the part of the
discussion that starts at [2], but I see your concern. I'm thinking why
database connection is not needed to apply WAL but is needed for UNDO. I think
locks make the difference. So maybe we can make the RMGR specific callbacks
(rm_undo) aware of the fact that the cluster is still in the startup state, so
the relations should be opened in NoLock mode?
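
For illustration, a hypothetical shape of such a callback (none of these names
come from the patch); the only point is that the callback could receive a flag
saying whether the cluster is still in startup and choose the lock mode
accordingly:

#include "postgres.h"

#include "access/table.h"
#include "storage/lockdefs.h"

typedef struct UndoApplyContextSketch
{
    Oid         reloid;         /* relation whose undo is being applied */
    bool        in_startup;     /* true when undo is applied before
                                 * connections are allowed */
} UndoApplyContextSketch;

static void
example_rm_undo(UndoApplyContextSketch *ctx)
{
    LOCKMODE    lockmode = ctx->in_startup ? NoLock : AccessExclusiveLock;
    Relation    rel = table_open(ctx->reloid, lockmode);

    /* ... apply the rmgr-specific undo actions against rel ... */

    table_close(rel, lockmode);
}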

> > > > Discarding
> > > > --
> > > >
> > > > There's no discard worker for the URS infrastructure yet. I thought 
> > > > about
> > > > discarding the undo log during checkpoint, but checkpoint should 
> > > > probably do
> > > > more straightforward tasks than the calculation of a new discard 
> > > > pointer for
> > > > each undo log, so a background worker is needed. A few notes on that:
> > > >
> > > >   * until the zheap AM gets added, only the transaction that creates 
> > > > the undo
> > > > records needs to access them. This assumption should make the 
> > > > discarding
> > > > algorithm a bit simpler. Note that with zheap, the other 
> > > > transactions need
> > > > to look for old versions of tuples, so the concept of 
> > > > oldestXidHavingUndo
> > > > variable is needed there.
> > > >
> > > >   * it's rather simple to pass pointer the URS pointer to the discard 
> > > > worker
> > > > when transaction either committed or the undo has been executed.
> > > >
> > >
> > > Why can't we have a separate discard worker which keeps on scanning
> > > the undorecords and discard accordingly? Giving the onus of foreground
> > > process might be tricky because say discard worker is not up to speed
> > > and we ran out of space to pass such information for each commit/abort
> > > request.
> >
> > Sure, there should be a discard worker. The question is how to make its work
> > efficient. The initial run after restart probably needs to scan everything
> > between 'discard' and 'insert' pointers,
> >
> 
> Yeah, such an initial scan would be helpful to identify pending aborts
> and allow them to be processed.
> 
> > but then it should process only the
> > parts created by individual transactions.
> >
> 
> Yeah, it needs to process transaction-by-transaction to see which all
> we can 

Re: POC: Cleaning up orphaned files using undo logs

2020-11-14 Thread Amit Kapila
On Sun, Nov 15, 2020 at 11:24 AM Amit Kapila  wrote:
>
> On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska  wrote:
> >
> > Amit Kapila  wrote:
> >
> > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska  wrote:
> > > >
> > > >
> > > > No background undo
> > > > --
> > > >
> > > > Reduced complexity of the patch seems to be the priority at the moment. 
> > > > Amit
> > > > suggested that cleanup of an orphaned relation file is simple enough to 
> > > > be
> > > > done on foreground and I agree.
> > > >
> > >
> > > Yeah, I think we should try and see if we can make it work but I
> > > noticed that there are few places like AbortOutOfAnyTransaction where
> > > we have the assumption that undo will be executed in the background.
> > > We need to deal with it.
> >
> > I think this is o.k. if we always check for unapplied undo during startup.
> >
>
> Hmm, how it is ok to leave undo (and rely on startup) unless it is a
> PANIC error. IIRC, this path is invoked in non-panic errors as well.
> Basically, we won't be able to discard such an undo which doesn't seem
> like a good idea.
>
> > > > "undo worker" is still there, but it only processes undo requests after 
> > > > server
> > > > restart because relation data can only be changed in a transaction - it 
> > > > seems
> > > > cleaner to launch a background worker for this than to hack the startup
> > > > process.
> > > >
> > >
> > > But, I see there are still multiple undoworkers that are getting
> > > launched and I am not sure if that works correctly because a
> > > particular undoworker is connected to a database and then it starts
> > > processing all the pending undo.
> >
> > Each undo worker applies only transactions for its own database, see
> > ProcessExistingUndoRequests():
> >
> > /* We can only process undo of the database we are connected to. */
> > if (xact_hdr.dboid != MyDatabaseId)
> > continue;
> >
> > Nevertheless, as I've just mentioned in my response to Thomas, I admit that 
> > we
> > should try to live w/o the undo worker altogether.
> >
>
> Okay, but keep in mind that there could be a large amount of undo
> (unlike redo which has some limit as we can replay it from the last
> checkpoint) which needs to be processed but it might be okay to live
> with that for now. Another thing is that it seems we need to connect
> to the database to perform it which might appear a bit odd that we
> don't allow users to connect to the database but internally we are
> connecting it. These are just some points to consider while finalizing
> the solution to this.
>
> > > > Discarding
> > > > --
> > > >
> > > > There's no discard worker for the URS infrastructure yet. I thought 
> > > > about
> > > > discarding the undo log during checkpoint, but checkpoint should 
> > > > probably do
> > > > more straightforward tasks than the calculation of a new discard 
> > > > pointer for
> > > > each undo log, so a background worker is needed. A few notes on that:
> > > >
> > > >   * until the zheap AM gets added, only the transaction that creates 
> > > > the undo
> > > > records needs to access them. This assumption should make the 
> > > > discarding
> > > > algorithm a bit simpler. Note that with zheap, the other 
> > > > transactions need
> > > > to look for old versions of tuples, so the concept of 
> > > > oldestXidHavingUndo
> > > > variable is needed there.
> > > >
> > > >   * it's rather simple to pass pointer the URS pointer to the discard 
> > > > worker
> > > > when transaction either committed or the undo has been executed.
> > > >
> > >
> > > Why can't we have a separate discard worker which keeps on scanning
> > > the undorecords and discard accordingly? Giving the onus of foreground
> > > process might be tricky because say discard worker is not up to speed
> > > and we ran out of space to pass such information for each commit/abort
> > > request.
> >
> > Sure, there should be a discard worker. The question is how to make its work
> > efficient. The initial run after restart probably needs to scan everything
> > between 'discard' and 'insert' pointers,
> >
>
> Yeah, such an initial scan would be helpful to identify pending aborts
> and allow them to be processed.
>
> > but then it should process only the
> > parts created by individual transactions.
> >
>
> Yeah, it needs to process transaction-by-transaction to see which all
> we can discard. Also, note that in Single-User mode we need to discard
> undo after commit. I think we also need to maintain
> oldestXidHavingUndo for CLOG truncation and transaction-wraparound. We
> can't allow CLOG truncation for the transaction whose undo is not
> discarded as that could be required by some other transaction. For
> similar reasons, we can't allow transaction-wraparound and we need to
> integrate this into the existing xid-allocation mechanism. I have
> found one of the old patch
> (Allow-execution-and-discard-of-undo-by-background-wo) 

Re: POC: Cleaning up orphaned files using undo logs

2020-11-14 Thread Amit Kapila
On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska  wrote:
>
> Amit Kapila  wrote:
>
> > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska  wrote:
> > >
> > >
> > > No background undo
> > > --
> > >
> > > Reduced complexity of the patch seems to be the priority at the moment. 
> > > Amit
> > > suggested that cleanup of an orphaned relation file is simple enough to be
> > > done on foreground and I agree.
> > >
> >
> > Yeah, I think we should try and see if we can make it work but I
> > noticed that there are few places like AbortOutOfAnyTransaction where
> > we have the assumption that undo will be executed in the background.
> > We need to deal with it.
>
> I think this is o.k. if we always check for unapplied undo during startup.
>

Hmm, how is it ok to leave undo (and rely on startup) unless it is a
PANIC error? IIRC, this path is invoked in non-panic errors as well.
Basically, we won't be able to discard such an undo which doesn't seem
like a good idea.

> > > "undo worker" is still there, but it only processes undo requests after 
> > > server
> > > restart because relation data can only be changed in a transaction - it 
> > > seems
> > > cleaner to launch a background worker for this than to hack the startup
> > > process.
> > >
> >
> > But, I see there are still multiple undoworkers that are getting
> > launched and I am not sure if that works correctly because a
> > particular undoworker is connected to a database and then it starts
> > processing all the pending undo.
>
> Each undo worker applies only transactions for its own database, see
> ProcessExistingUndoRequests():
>
> /* We can only process undo of the database we are connected to. */
> if (xact_hdr.dboid != MyDatabaseId)
> continue;
>
> Nevertheless, as I've just mentioned in my response to Thomas, I admit that we
> should try to live w/o the undo worker altogether.
>

Okay, but keep in mind that there could be a large amount of undo
(unlike redo which has some limit as we can replay it from the last
checkpoint) which needs to be processed but it might be okay to live
with that for now. Another thing is that it seems we need to connect
to the database to perform it which might appear a bit odd that we
don't allow users to connect to the database but internally we are
connecting it. These are just some points to consider while finalizing
the solution to this.

> > > Discarding
> > > --
> > >
> > > There's no discard worker for the URS infrastructure yet. I thought about
> > > discarding the undo log during checkpoint, but checkpoint should probably 
> > > do
> > > more straightforward tasks than the calculation of a new discard pointer 
> > > for
> > > each undo log, so a background worker is needed. A few notes on that:
> > >
> > >   * until the zheap AM gets added, only the transaction that creates the 
> > > undo
> > > records needs to access them. This assumption should make the 
> > > discarding
> > > algorithm a bit simpler. Note that with zheap, the other transactions 
> > > need
> > > to look for old versions of tuples, so the concept of 
> > > oldestXidHavingUndo
> > > variable is needed there.
> > >
> > >   * it's rather simple to pass pointer the URS pointer to the discard 
> > > worker
> > > when transaction either committed or the undo has been executed.
> > >
> >
> > Why can't we have a separate discard worker which keeps on scanning
> > the undorecords and discard accordingly? Giving the onus of foreground
> > process might be tricky because say discard worker is not up to speed
> > and we ran out of space to pass such information for each commit/abort
> > request.
>
> Sure, there should be a discard worker. The question is how to make its work
> efficient. The initial run after restart probably needs to scan everything
> between 'discard' and 'insert' pointers,
>

Yeah, such an initial scan would be helpful to identify pending aborts
and allow them to be processed.

> but then it should process only the
> parts created by individual transactions.
>

Yeah, it needs to process transaction-by-transaction to see which all
we can discard. Also, note that in Single-User mode we need to discard
undo after commit. I think we also need to maintain
oldestXidHavingUndo for CLOG truncation and transaction-wraparound. We
can't allow CLOG truncation for the transaction whose undo is not
discarded as that could be required by some other transaction. For
similar reasons, we can't allow transaction-wraparound and we need to
integrate this into the existing xid-allocation mechanism. I have
found one of the old patch
(Allow-execution-and-discard-of-undo-by-background-wo) attached where
all these concepts were implemented. Unless you have a reason why we
don't need these things, you might want to refer to the attached patch to
either re-use or refer to these ideas. There are a few other things
like undorequest and some undoworker mechanism which you can ignore.
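
For reference, a sketch of the CLOG side of that constraint; the PROC_HDR
field name follows this thread and the helper itself is made up. Before
computing the truncation cutoff, it is clamped by the oldest XID that still
has undiscarded undo:

#include "postgres.h"

#include "access/transam.h"
#include "port/atomics.h"
#include "storage/proc.h"

static TransactionId
clamp_clog_cutoff_by_undo(TransactionId frozenXID)
{
    FullTransactionId oldest_fxid;
    TransactionId oldest_xid;

    /* assumed patch-specific field in PROC_HDR */
    oldest_fxid.value =
        pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo);
    oldest_xid = XidFromFullTransactionId(oldest_fxid);

    if (TransactionIdIsValid(oldest_xid) &&
        TransactionIdPrecedes(oldest_xid, frozenXID))
        frozenXID = oldest_xid;

    return frozenXID;
}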

-- 
With Regards,

Re: POC: Cleaning up orphaned files using undo logs

2020-11-13 Thread Antonin Houska
Amit Kapila  wrote:

> On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska  wrote:
> >
> >
> > No background undo
> > --
> >
> > Reduced complexity of the patch seems to be the priority at the moment. Amit
> > suggested that cleanup of an orphaned relation file is simple enough to be
> > done on foreground and I agree.
> >
> 
> Yeah, I think we should try and see if we can make it work but I
> noticed that there are few places like AbortOutOfAnyTransaction where
> we have the assumption that undo will be executed in the background.
> We need to deal with it.

I think this is o.k. if we always check for unapplied undo during startup.

> > "undo worker" is still there, but it only processes undo requests after 
> > server
> > restart because relation data can only be changed in a transaction - it 
> > seems
> > cleaner to launch a background worker for this than to hack the startup
> > process.
> >
> 
> But, I see there are still multiple undoworkers that are getting
> launched and I am not sure if that works correctly because a
> particular undoworker is connected to a database and then it starts
> processing all the pending undo.

Each undo worker applies only transactions for its own database, see
ProcessExistingUndoRequests():

/* We can only process undo of the database we are connected to. */
if (xact_hdr.dboid != MyDatabaseId)
continue;

Nevertheless, as I've just mentioned in my response to Thomas, I admit that we
should try to live w/o the undo worker altogether.

> > Discarding
> > --
> >
> > There's no discard worker for the URS infrastructure yet. I thought about
> > discarding the undo log during checkpoint, but checkpoint should probably do
> > more straightforward tasks than the calculation of a new discard pointer for
> > each undo log, so a background worker is needed. A few notes on that:
> >
> >   * until the zheap AM gets added, only the transaction that creates the 
> > undo
> > records needs to access them. This assumption should make the discarding
> > algorithm a bit simpler. Note that with zheap, the other transactions 
> > need
> > to look for old versions of tuples, so the concept of 
> > oldestXidHavingUndo
> > variable is needed there.
> >
> >   * it's rather simple to pass pointer the URS pointer to the discard worker
> > when transaction either committed or the undo has been executed.
> >
> 
> Why can't we have a separate discard worker which keeps on scanning
> the undorecords and discard accordingly? Giving the onus of foreground
> process might be tricky because say discard worker is not up to speed
> and we ran out of space to pass such information for each commit/abort
> request.

Sure, there should be a discard worker. The question is how to make its work
efficient. The initial run after restart probably needs to scan everything
between 'discard' and 'insert' pointers, but then it should process only the
parts created by individual transactions.

> >
> > Do not execute the same undo record multiple times
> > --
> >
> > Although I've noticed in the zheap code that it checks whether particular 
> > undo
> > action was already undone, I think this functionality fits better in the URS
> > layer.
> >
> 
> If you want to track at undo record level, then won't it lead to
> performance overhead and probably additional WAL overhead considering
> this action needs to be WAL-logged. I think recording at page-level
> might be a better idea.

I'm not worried about WAL because the undo execution needs to be WAL-logged
anyway - see smgr_undo() in the 0005- part of the patch set. What needs to be
evaluated regarding performance is the (exclusive) locking of the page that
carries the progress information. I'm still not sure whether this info should
be on every page or only in the chunk header. In either case, we have a
problem if there are two or more chunks created by different transactions on
the same page, and if more than one of these transactions needs to perform
undo. I tend to believe that this should happen rarely though.
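
To make that concrete, a sketch (names invented, not the patch's) of per-chunk
progress: the chunk header remembers the location of the last record already
applied, and since undo is executed from the newest record backwards, anything
at or beyond that location can be skipped on a retry:

#include "postgres.h"

#include "access/transam.h"

typedef uint64 UndoRecPtrSketch;

typedef struct UndoChunkHeaderSketch
{
    FullTransactionId fxid;             /* transaction that wrote the chunk */
    UndoRecPtrSketch chunk_start;
    UndoRecPtrSketch last_applied;      /* 0 until undo execution starts */
} UndoChunkHeaderSketch;

static bool
undo_record_already_applied(const UndoChunkHeaderSketch *hdr,
                            UndoRecPtrSketch rec)
{
    /* records are undone from the end of the chunk towards chunk_start */
    return hdr->last_applied != 0 && rec >= hdr->last_applied;
}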

> > I can spend more time on this project, but need a hint which part I should
> > focus on.
> >
> 
> I can easily imagine that this needs a lot of work and I can try to
> help with this as much as possible from my side. I feel at this stage
> we should try to focus on undo-related work (to start with you can
> look at finishing the undoprocessing work for which I have shared some
> thoughts) and then probably at some point in time we need to rebase
> zheap over this.

I agree, thanks!

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: POC: Cleaning up orphaned files using undo logs

2020-11-13 Thread Antonin Houska
Thomas Munro  wrote:

> On Thu, Nov 12, 2020 at 10:15 PM Antonin Houska  wrote:

> I saw that -- great news! -- and have been meaning to write for a
> while.  I think I am nearly ready to talk about it again.

I'm looking forward to it :-)

> 100% that it's worth trying to do something much simpler than a new
> access manager, and this was the simplest useful feature solving a
> real-world-problem-that-people-actually-have we could come up with
> (based on an idea from Robert).  I think it needs a convincing
> explanation for why there is no scenario where the relfilenode is
> recycled for a new unlucky table before the rollback is executed,
> which might depend on details that you might be working on/changing
> (scenarios where you execute undo twice because you forgot you already
> did it).

Oh, I haven't thought about this problem yet. That might be another reason for
the undo log infrastructure to record the progress somehow.

> > No background undo
> > --
> >
> > Reduced complexity of the patch seems to be the priority at the moment. Amit
> > suggested that cleanup of an orphaned relation file is simple enough to be
> > done on foreground and I agree.
> >
> > "undo worker" is still there, but it only processes undo requests after 
> > server
> > restart because relation data can only be changed in a transaction - it 
> > seems
> > cleaner to launch a background worker for this than to hack the startup
> > process.
> 
> I suppose the simplest useful system would be one that does the work at
> startup before allowing connections, and also in regular backends, and
> panics if a backend ever exits while it has pending undo (panic =
> "goto crash recovery").  Then you don't have to deal with undo workers
> running at the same time as regular sessions which might run into
> trouble reacquiring locks (for an AM I mean), or due to OIDs being
> recycled with multiple checkpoints, or undo work that gets deferred
> until the next restart of the server.

I think that zheap can recognize that a page has unapplied undo, so we don't
need to reacquire any page lock on restart. However I agree that the
background undo might introduce other concurrency issues. At least for now
it's worth trying to move the cleanup into the startup process. We can
reconsider this when implementing more expensive undo actions, especially the
zheap rollback.
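
A minimal sketch of the "panic on exit with pending undo" part, assuming a
hypothetical HaveUnappliedUndo() check; before_shmem_exit() is the real
backend hook, the rest is illustrative:

#include "postgres.h"

#include "storage/ipc.h"

extern bool HaveUnappliedUndo(void);    /* hypothetical */

static void
undo_exit_check(int code, Datum arg)
{
    /* force crash recovery, which will (re)apply the pending undo */
    if (HaveUnappliedUndo())
        elog(PANIC, "backend exiting with unapplied undo");
}

static void
register_undo_exit_check(void)
{
    before_shmem_exit(undo_exit_check, (Datum) 0);
}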

> > Since the concept of undo requests is closely related to the undo worker, I
> > removed undorequest.c too. The new (much simpler) undo worker gets the
> > information on incomplete / aborted transactions from the undo log as
> > mentioned above.
> >
> > SMGR enhancement
> > 
> >
> > I used the 0001 patch from [3] rather than [4], although it's more invasive
> > because I noticed somewhere in the discussion that there should be no 
> > reserved
> > database OID for the undo log. (InvalidOid cannot be used because it's 
> > already
> > in use for shared catalogs.)
> 
> I give up thinking about the colour of the BufferTag shed and went
> back to magic database 9, mainly because there seemed to be more
> pressing matters.  I don't even think it's that crazy to store this
> type of system-wide data in pseudo databases, and I know of other
> systems that do similar sorts of things without blinking...

ok

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: POC: Cleaning up orphaned files using undo logs

2020-11-12 Thread Amit Kapila
On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska  wrote:
>
>
> No background undo
> --
>
> Reduced complexity of the patch seems to be the priority at the moment. Amit
> suggested that cleanup of an orphaned relation file is simple enough to be
> done on foreground and I agree.
>

Yeah, I think we should try and see if we can make it work, but I
noticed that there are a few places like AbortOutOfAnyTransaction where
we have the assumption that undo will be executed in the background.
We need to deal with it.

> "undo worker" is still there, but it only processes undo requests after server
> restart because relation data can only be changed in a transaction - it seems
> cleaner to launch a background worker for this than to hack the startup
> process.
>

But I see there are still multiple undo workers getting launched, and I
am not sure whether that works correctly, because a particular undo
worker is connected to a database and then it starts processing all the
pending undo.

> Since the concept of undo requests is closely related to the undo worker, I
> removed undorequest.c too. The new (much simpler) undo worker gets the
> information on incomplete / aborted transactions from the undo log as
> mentioned above.
>
> SMGR enhancement
> 
>
> I used the 0001 patch from [3] rather than [4], although it's more invasive
> because I noticed somewhere in the discussion that there should be no reserved
> database OID for the undo log. (InvalidOid cannot be used because it's already
> in use for shared catalogs.)
>
> Components added
> 
>
> pg_undo_dump utility and test framework for undoread.c. BTW, undoread.c seems
> to need some refactoring.
>
>
> Following are a few areas which are not implemented yet because more
> discussion is needed there:
>
> Discarding
> --
>
> There's no discard worker for the URS infrastructure yet. I thought about
> discarding the undo log during checkpoint, but checkpoint should probably do
> more straightforward tasks than the calculation of a new discard pointer for
> each undo log, so a background worker is needed. A few notes on that:
>
>   * until the zheap AM gets added, only the transaction that creates the undo
> records needs to access them. This assumption should make the discarding
> algorithm a bit simpler. Note that with zheap, the other transactions need
> to look for old versions of tuples, so the concept of oldestXidHavingUndo
> variable is needed there.
>
> >   * it's rather simple to pass the URS pointer to the discard worker
> > when the transaction has either committed or the undo has been executed.
>

Why can't we have a separate discard worker which keeps on scanning
the undo records and discards accordingly? Putting the onus on the
foreground process might be tricky because, say, the discard worker may
not be up to speed and we could run out of space to pass such
information for each commit/abort request.

>
> Do not execute the same undo record multiple times
> --
>
> Although I've noticed in the zheap code that it checks whether particular undo
> action was already undone, I think this functionality fits better in the URS
> layer.
>

If you want to track this at the undo record level, then won't it lead
to performance overhead and probably additional WAL overhead,
considering this action needs to be WAL-logged? I think recording it at
the page level might be a better idea.

>
> I can spend more time on this project, but need a hint which part I should
> focus on.
>

I can easily imagine that this needs a lot of work and I can try to
help with this as much as possible from my side. I feel at this stage
we should try to focus on undo-related work (to start with you can
look at finishing the undoprocessing work for which I have shared some
thoughts) and then probably at some point in time we need to rebase
zheap over this.

-- 
With Regards,
Amit Kapila.




Re: POC: Cleaning up orphaned files using undo logs

2020-11-12 Thread Thomas Munro
On Thu, Nov 12, 2020 at 10:15 PM Antonin Houska  wrote:
> Thomas Munro  wrote:
> > Thanks.  We decided to redesign a couple of aspects of the undo
> > storage and record layers that this patch was intended to demonstrate,
> > and work on that is underway.  More on that soon.
>
> As my boss expressed in his recent blog post, we'd like to contribute to the
> zheap development, and a couple of developers from other companies are
> interested in this as well. Amit Kapila suggested that the "cleanup of
> orphaned files" feature is a good start point in getting the code into PG
> core, so I've spent some time on it and tried to rebase the patch set.

Hi Antonin,

I saw that -- great news! -- and have been meaning to write for a
while.  I think I am nearly ready to talk about it again.  I agree
100% that it's worth trying to do something much simpler than a new
access manager, and this was the simplest useful feature solving a
real-world-problem-that-people-actually-have we could come up with
(based on an idea from Robert).  I think it needs a convincing
explanation for why there is no scenario where the relfilenode is
recycled for a new unlucky table before the rollback is executed,
which might depend on details that you might be working on/changing
(scenarios where you execute undo twice because you forgot you already
did it).

> In fact what I did is not mere rebasing against the current master branch -
> I've also (besides various bug fixes) done some design changes.
>
> Incorporated the new Undo Record Set (URS) infrastructure
> -
>
> This is also pointed out in [0].
>
> I started from [1] and tried to implement some missing parts (e.g. proper
> closing of the URSs after crash), introduced UNDO_DEBUG preprocessor macro
> which makes the undo log segments very small and fixed some bugs that the
> small segments exposed.

Cool!  Getting up to speed on all these made up concepts like URS, and
getting all these pieces assembled and rebased and up and running is
already quite something, let alone adding missing parts and debugging.

> The most significant change I've done was removal of the undo requests from
> checkpoint. I could not find any particular bug / race conditions related to
> including the requests into the checkpoint, but I concluded that it's easier
> to think about consistency and checkpoint timings if we scan the undo log on
> restart (after recovery has finished) and create the requests from scratch.

Interesting.  I guess that would be closer to textbook three-phase ARIES.

> [2] shows where I ended up before I started to rebase this patchset.
>
> No background undo
> --
>
> Reduced complexity of the patch seems to be the priority at the moment. Amit
> suggested that cleanup of an orphaned relation file is simple enough to be
> done on foreground and I agree.
>
> "undo worker" is still there, but it only processes undo requests after server
> restart because relation data can only be changed in a transaction - it seems
> cleaner to launch a background worker for this than to hack the startup
> process.

I suppose the simplest useful system would be one that does the work at
startup before allowing connections, and also in regular backends, and
panics if a backend ever exits while it has pending undo (panic =
"goto crash recovery").  Then you don't have to deal with undo workers
running at the same time as regular sessions which might run into
trouble reacquiring locks (for an AM I mean), or due to OIDs being
recycled with multiple checkpoints, or undo work that gets deferred
until the next restart of the server.

> Since the concept of undo requests is closely related to the undo worker, I
> removed undorequest.c too. The new (much simpler) undo worker gets the
> information on incomplete / aborted transactions from the undo log as
> mentioned above.
>
> SMGR enhancement
> 
>
> I used the 0001 patch from [3] rather than [4], although it's more invasive
> because I noticed somewhere in the discussion that there should be no reserved
> database OID for the undo log. (InvalidOid cannot be used because it's already
> in use for shared catalogs.)

I give up thinking about the colour of the BufferTag shed and went
back to magic database 9, mainly because there seemed to be more
pressing matters.  I don't even think it's that crazy to store this
type of system-wide data in pseudo databases, and I know of other
systems that do similar sorts of things without blinking...

> Following are a few areas which are not implemented yet because more
> discussion is needed there:

Hmm.  I'm thinking about these questions.




Re: POC: Cleaning up orphaned files using undo logs

2019-11-27 Thread Thomas Munro
On Thu, Nov 28, 2019 at 3:45 PM Michael Paquier  wrote:
> On Tue, Sep 17, 2019 at 10:03:20AM +1200, Thomas Munro wrote:
> > Oops, right.  So it should just be added to the if condition.  Will do.
>
> It's been a couple of months and the discussion has gone stale.  It seems
> also that the patch was waiting for an update.  So I am marking it as
> RwF for now.  Please feel free to update it if you feel that's not
> adapted.

Thanks.  We decided to redesign a couple of aspects of the undo
storage and record layers that this patch was intended to demonstrate,
and work on that is underway.  More on that soon.




Re: POC: Cleaning up orphaned files using undo logs

2019-11-27 Thread Michael Paquier
On Tue, Sep 17, 2019 at 10:03:20AM +1200, Thomas Munro wrote:
> Oops, right.  So it should just be added to the if condition.  Will do.

It's been a couple of months and the discussion has gone stale.  It seems
also that the patch was waiting for an update.  So I am marking it as
RwF for now.  Please feel free to update it if you feel that's not
appropriate.
--
Michael




Re: POC: Cleaning up orphaned files using undo logs

2019-09-18 Thread Amit Kapila
On Mon, Sep 16, 2019 at 10:37 PM Robert Haas  wrote:
>
> It seems to me that zheap went wrong in ending up with separate undo
> types for in-place and non-in-place updates. Why not just have ONE
> kind of undo record that describes an update, and allow that update to
> have either one TID or two TIDs depending on which kind of update it
> is? There may be a reason, but I don't know what it is, unless it's
> just that the UnpackedUndoRecord idea that I invented wasn't flexible
> enough and nobody thought of generalizing it. Curious to hear your
> thoughts on this.
>

I think we need not only the TIDs but also two uur_prevundo (previous
undo of the block) pointers.  This is required both when we have to
perform page-wise undo and for chain traversal during visibility checks.
So, we can keep a combination of TID and prevundo.

The other thing is that during rollback, when we collect the undo for
each page, applying the action for this undo needs some thought.  For
example, we can't apply the undo to roll back both the Insert and the
non-inplace-update together, as the two are on different pages.  The
reason is that the page where the non-inplace-update happened might have
more undo records that need to be applied before this one.  We can
somehow make this undo available to apply while collecting undo for both
heap pages.  I think there is also a need to identify which TID is for
the Insert and which is for the non-inplace-update part of the
operation, because we won't know that while applying undo unless we
check the state of the tuple on the page.

So, with this idea, we will make one undo record part of multiple
chains which might need some consideration at different places like
above.
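
To make the idea concrete, here is a minimal standalone C sketch
(hypothetical names and simplified types, not the actual zheap or undo
structures): a single non-inplace-update undo record carrying both TIDs
and a per-block prevundo pointer for each, so the same record can
participate in the undo chains of both pages.

#include <stdint.h>
#include <stdbool.h>

typedef uint64_t UndoRecPtr;            /* stand-in for the real type */

typedef struct TidLite                  /* simplified tuple identifier */
{
    uint32_t    block;
    uint16_t    offset;
} TidLite;

typedef struct UpdateUndoRecordSketch
{
    TidLite     old_tid;                /* version being replaced        */
    TidLite     new_tid;                /* newly inserted version        */
    UndoRecPtr  old_blk_prevundo;       /* chain link for the old page   */
    UndoRecPtr  new_blk_prevundo;       /* chain link for the new page   */
    bool        inplace;                /* if true, new_* fields unused  */
} UpdateUndoRecordSketch;

/* During page-wise undo, pick the chain link that belongs to the page
 * we are currently processing. */
static UndoRecPtr
prevundo_for_block(const UpdateUndoRecordSketch *rec, uint32_t block)
{
    return (block == rec->old_tid.block) ? rec->old_blk_prevundo
                                         : rec->new_blk_prevundo;
}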



--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-09-16 Thread Thomas Munro
On Tue, Sep 17, 2019 at 3:09 AM Kuntal Ghosh  wrote:
> On Mon, Sep 16, 2019 at 11:23 AM Thomas Munro  wrote:
> > Agreed.  I added a line to break out of that loop if !block->in_use.
> >
> I think we should skip the block if !block->in_use. Because, the undo
> buffer can be registered in a subsequent block as well. For different
> operations, we can use different block_id to register the undo buffer
> in the redo record.

Oops, right.  So it should just be added to the if condition.  Will do.

-- 
Thomas Munro
https://enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-09-16 Thread Robert Haas
On Mon, Sep 16, 2019 at 11:09 AM Kuntal Ghosh
 wrote:
> Okay. In that case, we need to rethink the cases for multi-inserts and
> non-inplace updates, both of which currently insert multiple undo
> records corresponding to a single WAL record. For multi-inserts, it can
> be solved easily by moving all the offset information in the payload.
> But, for non-inplace updates, we insert one undo record for the update
> and one for the insert. Wondering whether we've to insert two WAL
> records - one for update and one for the new insert.

No, I think the solution is to put the information about both halves
of the non-in-place update in the same undo record.  I think the only
reason why that's tricky is because we've got two block numbers and
two offsets, and the only reason that's a problem is because
UnpackedUndoRecord only has one field for each of those things, and
that goes right back to Heikki's comments about the format not being
flexible enough. If you see some other problem, it would be
interesting to know what it is.

One thing I've been thinking about is: suppose that you're following
the undo chain for a tuple and you come to a non-in-place update
record. Can you get confused? I don't think so, because you can
compare the TID for which you're following the chain to the new TID
and the old TID in the record and it should match one or the other but
not both. But I don't think you even really need to do that much: if
you started with a deleted item, the first thing in the undo chain has
to be a delete or non-in-place update that got rid of it. And if you
started with a non-deleted item, then the beginning of the undo chain,
if it hasn't been discarded yet, will be the insert or non-in-place
update that created it. There's nowhere else that you can hit a
non-in-place update, and no room (that I can see) for any ambiguity.
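
As an illustration of that disambiguation step, a small standalone C
sketch (made-up types and names, not the actual code) might look like
the following: given the TID whose chain we are following and the two
TIDs stored in a non-in-place update record, exactly one of them should
match, and that tells us which half of the record applies.

#include <stdint.h>
#include <stdbool.h>

typedef struct TidLite { uint32_t block; uint16_t offset; } TidLite;

typedef enum UndoSide { UNDO_SIDE_OLD, UNDO_SIDE_NEW, UNDO_SIDE_BROKEN } UndoSide;

static bool
tid_equal(TidLite a, TidLite b)
{
    return a.block == b.block && a.offset == b.offset;
}

/* Decide which half of a non-in-place update record the followed TID
 * refers to; matching both or neither would indicate a broken chain. */
static UndoSide
classify_update_record(TidLite followed, TidLite old_tid, TidLite new_tid)
{
    bool matches_old = tid_equal(followed, old_tid);
    bool matches_new = tid_equal(followed, new_tid);

    if (matches_old == matches_new)
        return UNDO_SIDE_BROKEN;
    return matches_old ? UNDO_SIDE_OLD : UNDO_SIDE_NEW;
}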

It seems to me that zheap went wrong in ending up with separate undo
types for in-place and non-in-place updates. Why not just have ONE
kind of undo record that describes an update, and allow that update to
have either one TID or two TIDs depending on which kind of update it
is? There may be a reason, but I don't know what it is, unless it's
just that the UnpackedUndoRecord idea that I invented wasn't flexible
enough and nobody thought of generalizing it. Curious to hear your
thoughts on this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-09-16 Thread Kuntal Ghosh
Hello Thomas,

On Mon, Sep 16, 2019 at 11:23 AM Thomas Munro  wrote:
>
> > 1. In UndoLogAllocateInRecovery, when we find the current log number
> > from the list of registered blocks, we don't check whether the
> > block->in_use flag is true or not. In XLogResetInsertion, we just
> > reset in_use flag without reseting the blocks[]->rnode information.
> > So, if we don't check the in_use flag, it's possible that we'll
> > consult some block information from the previous WAL record. IMHO,
> > just adding an in_use check in UndoLogAllocateInRecovery will solve
> > the problem.
>
> Agreed.  I added a line to break out of that loop if !block->in_use.
>
I think we should skip the block if !block->in_use, because the undo
buffer can be registered in a subsequent block as well. For different
operations, we can use a different block_id to register the undo buffer
in the redo record.

> BTW I am planning to simplify that code considerably, based on a plan
> to introduce a new rule: there can be only one undo record and
> therefore only one undo allocation per WAL record.
>
Okay. In that case, we need to rethink the cases for multi-inserts and
non-inplace updates, both of which currently insert multiple undo
records corresponding to a single WAL record. For multi-inserts, it can
be solved easily by moving all the offset information into the payload.
But, for non-inplace updates, we insert one undo record for the update
and one for the insert. Wondering whether we have to insert two WAL
records - one for the update and one for the new insert.

> > 2. A transaction, inserts one undo record and generated a WAL record
> > for the same, say at WAL location 0/2000A000. Next, the undo record
> > gets discarded and WAL is generated to update the meta.discard pointer
> > at location 0/2000B000  At the same time, an ongoing checkpoint with
> > checkpoint.redo at 0/2000 flushes the latest meta.discard pointer.
> > Now, the system crashes.
> > Now, the recovery starts from the location 0/2000. When the
> > recovery of 0/2000A000 happens, it sees the undo record that it's
> > about to insert, is already discarded as per meta.discard (flushed by
> > checkpoint). In this case, should we just skip inserting the undo
> > record?
>
> I see two options:
>
> 1.  We make it so that if you're allocating in recovery and discard >
> insert, we'll just set discard = insert so you can proceed.  The code
> in undofile_get_segment_file() already copes with missing files during
> recovery.
>
Interesting. This should work.

>
> > 3. Currently, we create a backup image of the unlogged part of the
> > undo log's metadata only when some backend allocates some space from
> > the undo log (in UndoLogAllocate). This helps us restore the unlogged
> > meta part after a checkpoint.
> > When we perform an undo action, we also update the undo action
> > progress and emit an WAL record. The same operation can performed by
> > the undo worker which doesn't allocate any space from the undo log.
> > So, if an undo worker emits an WAL record to update undo action
> > progress after a checkpoint, it'll not be able to WAL log the backup
> > image of the meta unlogged part. IMHO, this breaks the recovery logic
> > of unlogged part of undo meta.
>
> I thought that was OK because those undo data updates don't depend on
> the insert pointer.  But I see what you mean: the next modification of
> the page that DOES depend on the insert pointer might not log the
> meta-data if it's not the first WAL record to touch it after a
> checkpoint.  Rats.  I'll have to think about that some more.
Cool.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-09-15 Thread Thomas Munro
On Mon, Sep 16, 2019 at 5:27 AM Kuntal Ghosh  wrote:
> While testing zheap over undo apis, we've found the following
> issues/scenarios that might need some fixes/discussions:

Thanks!

> 1. In UndoLogAllocateInRecovery, when we find the current log number
> from the list of registered blocks, we don't check whether the
> block->in_use flag is true or not. In XLogResetInsertion, we just
> reset in_use flag without reseting the blocks[]->rnode information.
> So, if we don't check the in_use flag, it's possible that we'll
> consult some block information from the previous WAL record. IMHO,
> just adding an in_use check in UndoLogAllocateInRecovery will solve
> the problem.

Agreed.  I added a line to break out of that loop if !block->in_use.

BTW I am planning to simplify that code considerably, based on a plan
to introduce a new rule: there can be only one undo record and
therefore only one undo allocation per WAL record.

> 2. A transaction, inserts one undo record and generated a WAL record
> for the same, say at WAL location 0/2000A000. Next, the undo record
> gets discarded and WAL is generated to update the meta.discard pointer
> at location 0/2000B000  At the same time, an ongoing checkpoint with
> checkpoint.redo at 0/2000 flushes the latest meta.discard pointer.
> Now, the system crashes.
> Now, the recovery starts from the location 0/2000. When the
> recovery of 0/2000A000 happens, it sees the undo record that it's
> about to insert, is already discarded as per meta.discard (flushed by
> checkpoint). In this case, should we just skip inserting the undo
> record?

I see two options:

1.  We make it so that if you're allocating in recovery and discard >
insert, we'll just set discard = insert so you can proceed.  The code
in undofile_get_segment_file() already copes with missing files during
recovery.

2.  We skip the insert as you said.

I think option 1 is probably best, otherwise you have to cope with
failure to insert by skipping, as you said.
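
A minimal sketch of option 1, with hypothetical names rather than the
real undolog.c structures: during recovery, if the replayed discard
pointer has run ahead of the insert location we are about to reuse, we
clamp discard back to insert and let the caller proceed.

#include <stdint.h>

typedef uint64_t UndoRecPtr;

typedef struct UndoLogMetaSketch
{
    UndoRecPtr  insert;                 /* next byte to be written      */
    UndoRecPtr  discard;                /* oldest byte anyone may need  */
} UndoLogMetaSketch;

/* Called only during recovery: a checkpoint may have flushed a discard
 * pointer that is ahead of the record being re-inserted by REDO, so we
 * treat that data as already discarded instead of failing. */
static void
clamp_discard_during_recovery(UndoLogMetaSketch *meta)
{
    if (meta->discard > meta->insert)
        meta->discard = meta->insert;
}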

> 3. Currently, we create a backup image of the unlogged part of the
> undo log's metadata only when some backend allocates some space from
> the undo log (in UndoLogAllocate). This helps us restore the unlogged
> meta part after a checkpoint.
> When we perform an undo action, we also update the undo action
> progress and emit an WAL record. The same operation can performed by
> the undo worker which doesn't allocate any space from the undo log.
> So, if an undo worker emits an WAL record to update undo action
> progress after a checkpoint, it'll not be able to WAL log the backup
> image of the meta unlogged part. IMHO, this breaks the recovery logic
> of unlogged part of undo meta.

I thought that was OK because those undo data updates don't depend on
the insert pointer.  But I see what you mean: the next modification of
the page that DOES depend on the insert pointer might not log the
meta-data if it's not the first WAL record to touch it after a
checkpoint.  Rats.  I'll have to think about that some more.

-- 
Thomas Munro
https://enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-09-15 Thread Kuntal Ghosh
Hello Thomas,

While testing zheap over undo apis, we've found the following
issues/scenarios that might need some fixes/discussions:

1. In UndoLogAllocateInRecovery, when we find the current log number
from the list of registered blocks, we don't check whether the
block->in_use flag is true or not. In XLogResetInsertion, we just
reset the in_use flag without resetting the blocks[]->rnode information.
So, if we don't check the in_use flag, it's possible that we'll
consult some block information from the previous WAL record. IMHO,
just adding an in_use check in UndoLogAllocateInRecovery will solve
the problem.

2. A transaction inserts one undo record and generates a WAL record
for the same, say at WAL location 0/2000A000. Next, the undo record
gets discarded and WAL is generated to update the meta.discard pointer
at location 0/2000B000.  At the same time, an ongoing checkpoint with
checkpoint.redo at 0/2000 flushes the latest meta.discard pointer.
Now, the system crashes.
Now, the recovery starts from the location 0/2000. When the
recovery of 0/2000A000 happens, it sees that the undo record it's
about to insert is already discarded as per meta.discard (flushed by
the checkpoint). In this case, should we just skip inserting the undo
record?

3. Currently, we create a backup image of the unlogged part of the
undo log's metadata only when some backend allocates some space from
the undo log (in UndoLogAllocate). This helps us restore the unlogged
meta part after a checkpoint.
When we perform an undo action, we also update the undo action
progress and emit a WAL record. The same operation can be performed by
the undo worker, which doesn't allocate any space from the undo log.
So, if an undo worker emits a WAL record to update the undo action
progress after a checkpoint, it'll not be able to WAL-log the backup
image of the unlogged meta part. IMHO, this breaks the recovery logic
of the unlogged part of the undo metadata.

Thoughts?

On Mon, Sep 2, 2019 at 9:47 AM Thomas Munro  wrote:
>
> On Fri, Aug 30, 2019 at 8:27 PM Kuntal Ghosh  
> wrote:
> > I'm getting the following assert failure while performing the recovery
> > with the same.
> > "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL",
> > File: "undolog.c", Line: 997)"
> >
> > I found that we don't emit an WAL record when we update the
> > slot->meta.status as UNDO_LOG_STATUS_FULL. If we don't that, after
> > crash recovery, some new transaction may use that undo log which is
> > wrong, IMHO. Am I missing something?
>
> Thanks, right, that status logging is wrong, will fix in next version.
>
> --
> Thomas Munro
> https://enterprisedb.com



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-09-11 Thread Alvaro Herrera
On 2019-Sep-06, vignesh C wrote:

> Hi Thomas,
> 
> While testing one of the recovery scenarios I found one issue:
> FailedAssertion("!(logno == context->recovery_logno)

I marked this patch Waiting on Author.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: POC: Cleaning up orphaned files using undo logs

2019-09-05 Thread vignesh C
Hi Thomas,

While testing one of the recovery scenarios I found one issue:
FailedAssertion("!(logno == context->recovery_logno)

The details are as follows:
The context's try_location was not updated in
UndoLogAllocateInRecovery; in PrepareUndoInsert the try_location was
only advanced by the undo record size.
In the subsequent UndoLogAllocateInRecovery, because the value of
try_location was never initialized but only advanced by that size, the
logno will never match when the recovery_logno is non-zero, and the
assert fails.

Fixed by setting the try_location in UndoLogAllocateInRecovery,
similar to try_location setting in UndoLogAllocate.
Patch for the same is attached.

Please have a look and add the changes in one of the upcoming versions.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


On Mon, Sep 2, 2019 at 9:53 AM Thomas Munro  wrote:
>
> On Fri, Aug 30, 2019 at 8:27 PM Kuntal Ghosh  
> wrote:
> > I'm getting the following assert failure while performing the recovery
> > with the same.
> > "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL",
> > File: "undolog.c", Line: 997)"
> >
> > I found that we don't emit an WAL record when we update the
> > slot->meta.status as UNDO_LOG_STATUS_FULL. If we don't that, after
> > crash recovery, some new transaction may use that undo log which is
> > wrong, IMHO. Am I missing something?
>
> Thanks, right, that status logging is wrong, will fix in next version.
>
> --
> Thomas Munro
> https://enterprisedb.com
>
>
From e69a9eea71562434cc0512871a9b6d7424ce93ec Mon Sep 17 00:00:00 2001
From: Vigneshwaran c 
Date: Fri, 30 Aug 2019 11:45:03 +0530
Subject: [PATCH] FailedAssertion("!(logno == context->recovery_logno) fix

try_location was not updated in UndoLogAllocateInRecovery, in
PrepareUndoInsert the try_location was updated with the undo record size.
In the subsequent UndoLogAllocateInRecovery as the value for try_location
was not initialized but only updated with the size the logno will always
not match if the recovery_logno is non zero and the assert fails. Fixed
by setting the try_location in UndoLogAllocateInRecovery, similar to
try_location setting in UndoLogAllocate.

Patch by Vigneshwaran C, reviewed by Dilip Kumar.
---
 src/backend/access/undo/undolog.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/src/backend/access/undo/undolog.c b/src/backend/access/undo/undolog.c
index 073221c..9acd570 100644
--- a/src/backend/access/undo/undolog.c
+++ b/src/backend/access/undo/undolog.c
@@ -960,6 +960,14 @@ UndoLogAllocateInRecovery(UndoLogAllocContext *context,
 			*need_xact_header =
 context->try_location == InvalidUndoRecPtr &&
 slot->meta.unlogged.insert == slot->meta.unlogged.this_xact_start;
+
+			/*
+			 * If no try_location was passed in, or if we switched logs, then we'll
+			 * return the current insertion point.
+			 */
+			if (context->try_location == InvalidUndoRecPtr)
+context->try_location = MakeUndoRecPtr(slot->logno, slot->meta.unlogged.insert);
+
 			*last_xact_start = slot->meta.unlogged.last_xact_start;
 			context->recovery_logno = slot->logno;
 
-- 
1.8.3.1



Re: POC: Cleaning up orphaned files using undo logs

2019-09-01 Thread Thomas Munro
On Fri, Aug 30, 2019 at 8:27 PM Kuntal Ghosh  wrote:
> I'm getting the following assert failure while performing the recovery
> with the same.
> "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL",
> File: "undolog.c", Line: 997)"
>
> I found that we don't emit an WAL record when we update the
> slot->meta.status as UNDO_LOG_STATUS_FULL. If we don't that, after
> crash recovery, some new transaction may use that undo log which is
> wrong, IMHO. Am I missing something?

Thanks, right, that status logging is wrong, will fix in next version.

-- 
Thomas Munro
https://enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-30 Thread Kuntal Ghosh
Hello Thomas,

I was doing some testing for the scenario where the undo written by a
transaction overflows to multiple undo logs. For that I've modified
the following macro:
#define UndoLogMaxSize (1024 * 1024) /* 1MB undo log size */
(I should have used the provided pg_force_switch_undo though..)

I'm getting the following assert failure while performing the recovery
with the same.
"TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL",
File: "undolog.c", Line: 997)"

I found that we don't emit a WAL record when we update
slot->meta.status to UNDO_LOG_STATUS_FULL. If we don't do that, after
crash recovery, some new transaction may use that undo log, which is
wrong, IMHO. Am I missing something?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-24 Thread Robert Haas
On Wed, Aug 21, 2019 at 4:34 PM Robert Haas  wrote:
> ReleaseResourcesAndProcessUndo() is only supposed to be called after
> AbortTransaction(), and the additional steps it performs --
> AtCleanup_Portals() and AtEOXact_Snapshot() or alternatively
> AtSubCleanup_Portals -- are taken from Cleanup(Sub)Transaction.
> That's not crazy; the other steps in Cleanup(Sub)Transaction() look
> like stuff that's intended to be performed when we're totally done
> with this TransactionState stack entry, whereas these things are
> slightly higher-level cleanups that might even block undo (e.g.
> undropped portal prevents orphaned file cleanup).  Granted, there are
> no comments explaining why those particular cleanup steps are
> performed here, and it's possible some other approach is better, but I
> think perhaps it's not quite as flagrantly broken as you think.

Andres smacked me with the clue-bat off-list and now I understand why
this is broken: there's no guarantee that running the various
AtEOXact/AtCleanup functions actually puts the transaction back into a
good state.  They *might* return the system to the state that it was
in immediately following StartTransaction(), but they also might not.
Moreover, ProcessUndoRequestForEachLogCat uses PG_TRY/PG_CATCH and
then discards the error without performing *any cleanup at all* but
then goes on and tries to do undo for other undo log categories
anyway.  That is totally unsafe.

I think that there should only be one chance to perform undo actions,
and as I said or at least alluded to before, if that throws an error,
it shouldn't be caught by a TRY/CATCH block but should be handled by
the state machine in xact.c.  If we're not going to make the state
machine handle these conditions, the addition of
TRANS_UNDO/TBLOCK_UNDO/TBLOCK_SUBUNDO is really missing the boat.  I'm
still not quite sure of the exact sequence of steps: we clearly need
AtCleanup_Portals() and a bunch of the other stuff that happens during
CleanupTransaction(), ideally including the freeing of memory, to
happen before we try undo. But I don't have a great feeling for how to
make that happen, and it seems more desirable for undo to begin as
soon as the transaction fails rather than waiting until
Cleanup(Sub)Transaction() time. I think some more research is needed
here.

> I am also not convinced that semi-critical sections are a bad idea,

Regarding this, after further thought and discussion with Andres,
there are two cases here that call for somewhat different handling:
temporary undo, and subtransaction abort.

In the case of a subtransaction abort, we can't proceed with the
toplevel transaction unless we succeed in applying the
subtransaction's undo, but that does not require killing off the
backend.  It might be a better idea to just fail the containing
subtransaction with the error that occurred during undo apply; if
there are multiple levels of subtransactions present then we might
fail in the same way several times, but eventually we'll fail at the
top level, forcibly kick the undo into the background, and the session
can continue.  The background workers will, hopefully, eventually
recover the situation.  Even if they can't, because, say, the failure
is due to a bug or whatever, killing off the session doesn't really
help.

In the case of temporary undo, killing the session is a much more
appealing approach. If we don't, how will that undo ever get
processed?  We could retry at some point (like every time we return to
the toplevel command prompt?) or just ignore the fact that we didn't
manage to perform those undo actions and leave that undo there like an
albatross, accumulating more and more undo behind it until the session
exits or the disk fills up.  The latter strikes me as a terrible user
experience, especially because for wire protocol reasons we'd have to
swallow the errors or at best convert them to warnings, but YMMV.

Anyway, probably these cases should not be handled exactly the same
way, but exactly what to do probably depends on the previous question:
how exactly does the integration into xact.c's state machine work,
anyway?

Meanwhile, I've been working up a prototype of how the undorequest.c
stuff I sent previously could be integrated with xact.c.  In working
on that, I've realized that there seem to be two different tasks.  One
is tracking the information that we'll need to have available to
perform undo actions.  The other is the actual transaction state
manipulation: when and how do we abort transactions, cleanup
transactions, start new transactions specifically for undo?  How are
transactions performing undo specially marked, if at all?  The
attached patch includes a new module, undostate.c/h, which tries to
handle the first of those things; this is just a prototype, and is
missing some pieces marked with XXX, but I think it's probably the
right general direction.  It will still need to be plugged into a
framework for launching undo apply background workers (which might
require 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-24 Thread Robert Haas
On Fri, Aug 23, 2019 at 2:04 AM Thomas Munro  wrote:
> 2.  Strict self-update-only:  We could update it as part of
> transaction cleanup.  That is, when you commit or abort, probably at
> some time when your transaction is still advertised as running, you go
> and update your own transaction header with the size.  If you
> never reach that stage, I think you can fix it during crash recovery,
> during the startup scan that feeds the rollback request queues.  That
> is, if you encounter a transaction header with length == 0, it must be
> the final one and its length is therefore known and can be updated,
> before you allow new transactions to begin.  There are some
> complications like backends that exit without crashing, which I
> haven't thought about.  As Amit just pointed out to me, that means
> that the update is not folded into the same buffer access as the next
> transaction, but perhaps you can mitigate that by not updating your
> header if the next header will be on the same page -- the next
> transaction can do it safely then (this page with the insert pointer
> on it can't be discarded).  As Dilip just pointed out to me, it means
> that you always do an update that you might never need to do if
> the transaction is discarded, to which I have no answer.  Bleugh.

Andres and I have spent a lot of time on the phone over the last
couple of days and I think we both kind of like this option.  I don't
think that the costs are likely to be very significant: you're talking
about pinning, locking, dirtying, unlocking, and unpinning one buffer
at commit time, or maybe two if your transaction touched both logged
and unlogged tables.  If the transaction is short enough for that
overhead to matter, that buffer is probably already in shared_buffers,
is probably already dirty, and is probably already in your CPU's
cache. So I think the overhead will turn out to be low.

Moreover, I doubt that we want to separately discard every transaction
anyway.  If you have very light-weight transactions, you don't want to
add an extra WAL record per transaction anyway.  Increasing the number
of separate WAL records per transaction from say 5 to 6 would be a
significant additional cost.  You probably want to perform a discard,
say, every 5 seconds or sooner if you can discard at least 64kB of
undo, or something of that sort.  So we're not going to save the
overhead of updating the previous transaction header often enough to
make much difference unless we're discarding so aggressively that we
incur a much larger overhead elsewhere.  I think.
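
A sketch of that batching policy in standalone C, with invented names
and the thresholds taken straight from the paragraph above (5 seconds,
64kB); a real policy would presumably make these tunable.

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define DISCARD_INTERVAL_SECS   5
#define DISCARD_MIN_BYTES       (64 * 1024)

/* Discard when enough time has passed since the last discard, or sooner
 * if enough discardable undo has already accumulated. */
static bool
should_discard_now(time_t last_discard_time, uint64_t discardable_bytes)
{
    time_t  now = time(NULL);

    return (now - last_discard_time >= DISCARD_INTERVAL_SECS) ||
           (discardable_bytes >= DISCARD_MIN_BYTES);
}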

I am a little concerned about the backends that exit without crashing.
Andres seems to want to treat that case as a bug to be fixed, but I
doubt whether that's going to be practical.   We're really only
talking about extreme corner cases here, because
before_shmem_exit(ShutdownPostgres, 0) means we'll
AbortOutOfAnyTransaction() which should RecordTransactionAbort(). Only
if we fail in the AbortTransaction() prior to reaching
RecordTransactionAbort() will we manage to reach the later cleanup
stages without having written an abort record.  I haven't scrutinized
that code lately to see exactly how things can go wrong there, but
there shouldn't be a whole lot. However, there's probably a few
things, like maybe a really poorly-timed malloc() failure.

A zero-order solution would be to install a deadman switch. At
on_shmem_exit time, you must detach from any undo log to which you are
connected, so that somebody else can attach to it later.  We can stick
in a cross-check there that you haven't written any undo bytes to that
log and PANIC if you have.  Then the system must be water-tight.
Perhaps it's possible to do better: if we could identify the cases in
which such logic gets reached, we could try to guarantee that WAL is
written and the undo log safely detached before we get there.  But at
the various least we can promote ERROR/FATAL to PANIC in the relevant
case.

A better solution would be to detect the problem and make sure we
recover from it before reusing the undo log.  Suppose each undo log
has three states: (1) nobody's attached, (2) somebody's attached, and
(3) nobody's attached but the last record might need a fixup.  When we
start up, all undo logs are in state 3, and the discard worker runs
around and puts them into state 1.  Subsequently, they alternate
between states 1 and 2 for as long as the system remains up.  But if
as an exceptional case we reach on_shmem_exit without having detached
the undo log, because of cascading failures, then we put the undo log
in state 3.  The discard worker already knows how to move undo logs
from state 3 to state 1, and it can do the same thing here.  Until it
does nobody else can reuse that undo log.

I might be missing something, but I think that would nail this down
pretty tightly.
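
A sketch of the three-state idea in C, with hypothetical names: an undo
log is free, attached, or in need of fixup; only the discard worker
moves a log from the fixup state back to free, after repairing a
possibly-incomplete trailing transaction header.

#include <stdbool.h>

typedef enum UndoLogAttachState
{
    UNDO_LOG_FREE,                      /* state 1: nobody attached             */
    UNDO_LOG_ATTACHED,                  /* state 2: a backend is writing to it  */
    UNDO_LOG_NEEDS_FIXUP                /* state 3: last record may need repair */
} UndoLogAttachState;

/* At startup every log starts in NEEDS_FIXUP and the discard worker
 * moves it to FREE after fixing any zero-length trailing transaction
 * header.  A backend that reaches on_shmem_exit without detaching
 * cleanly leaves its log in NEEDS_FIXUP rather than FREE. */
static UndoLogAttachState
state_after_backend_exit(bool detached_cleanly)
{
    return detached_cleanly ? UNDO_LOG_FREE : UNDO_LOG_NEEDS_FIXUP;
}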

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-23 Thread Thomas Munro
On Wed, Aug 21, 2019 at 4:44 AM Andres Freund  wrote:
> On 2019-08-20 21:02:18 +1200, Thomas Munro wrote:
> > 1.  Anyone is allowed to try to read or write data at any UndoRecPtr
> > that has been allocated, through the buffer pool (though you'd usually
> > want to check it with UndoRecPtrIsDiscarded() first, and only rely on
> > the system I'm describing to deal with races).
> >
> > 2.  ReadBuffer() might return InvalidBuffer.  This can happen for a
> > cache miss, if the smgrread implementation wants to indicate that the
> > buffer has been discarded/truncated and that is expected (md.c won't
> > ever do that, but undofile.c can).
>
> Hm. This gives me a bit of a stomach ache. It somehow feels like a weird
> form of signalling.  Can't quite put my finger on why it makes me feel
> queasy.

Well, if we're going to allow concurrent access and discarding, there
has to be some part of the system where you can discover that the
thing you wanted is gone.  What's wrong with here?

Stepping back a bit... why do we have a new concept here?  The reason
we don't have anything like this for relations currently is that we
don't have live references to blocks that are expected to be
concurrently truncated, due to heavyweight locking.  But the whole
purpose of the undo system is to allow cheap truncation/discarding of
data that you *do* have live references to, and furthermore the
discarding is expected to be frequent.  The present experiment is
about trying to do so without throwing our hands up and using a
pessimistic lock.

At one point Robert and I discussed some kind of scheme where you'd
register your interest in a range of the log before you begin reading
(some kind of range locks or hazard pointers), so that you would block
discarding in that range, but the system would still allow someone to
read in the middle of the log while the discard worker concurrently
discards non-overlapping data at the end.  But I kept returning to the
idea that the buffer pool already has block-level range locking of
various kinds.  You can register your interest in a range by pinning
the buffers.  That's when you'll find out if the range is already
gone.  We could add an extra layer of range locking around that, but
it wouldn't be any better, it'd just thrash your bus a bit more, and
require more complexity in the discard worker (it has to defer
discarding a bit, and either block or go away and come back later).

> > 3.  UndoLogDiscard() uses DiscardBuffer() to invalidate any currently
> > unpinned buffers, and marks as BM_DISCARDED any that happen to be
> > pinned right now, so they can't be immediately invalidated.  Such
> > buffers are never written back and are eligible for reuse on the next
> > clock sweep, even if they're written into by a backend that managed to
> > do that when we were trying to discard.
>
> Hm. When is it legitimate for a backend to write into such a buffer? I
> guess that's about updating the previous transaction's next pointer? Or
> progress info?

Yes, previous transaction header's next pointer, and progress counter
during rollback.  We're mostly interested in the next pointer here,
because the progress counter update would normally not be updated at a
time when the page might be concurrently discarded.  The exception to
that is a superuser running CALL pg_force_discard_undo() (a
data-eating operation designed to escape a system that can't
successfully roll back and gets stuck, blowing away
not-yet-rolled-back undo records).

Here are some other ideas about how to avoid conflicts between
discarding and transaction header update:

1.  Lossy self-update-only: You could say that transactions are only
allowed to write to their own transaction header, and then have them
periodically update their own length in their own transaction header,
and then teach the discard worker that the length information is only
a starting point for a linear search for the next transaction based on
page header information.  That removes complexity for writers, but
adds complexity and IO and CPU to the discard worker.  Bleugh.

2.  Strict self-update-only:  We could update it as part of
transaction cleanup.  That is, when you commit or abort, probably at
some time when your transaction is still advertised as running, you go
and update your own transaction header with the size.  If you
never reach that stage, I think you can fix it during crash recovery,
during the startup scan that feeds the rollback request queues.  That
is, if you encounter a transaction header with length == 0, it must be
the final one and its length is therefore known and can be updated,
before you allow new transactions to begin.  There are some
complications like backends that exit without crashing, which I
haven't thought about.  As Amit just pointed out to me, that means
that the update is not folded into the same buffer access as the next
transaction, but perhaps you can mitigate that by not updating your
header if the next header will be on the same 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 9:55 PM Andres Freund  wrote:
>
> Hi
>
> On August 22, 2019 9:14:10 AM PDT, Dilip Kumar  wrote:
> > But, those requests will
> >ultimately be used for collecting the record by the bulk fetch.  So if
> >we are planning to change the bulk fetch to read forward then maybe we
> >don't need the valid last undo record pointer because that we will
> >anyway get while processing forward.  So just knowing the end of the
> >transaction is sufficient for us to know where to stop.  I am not sure
> >if this solution has any problem.  Probably  I should think again in
> >the morning when my mind is well-rested.
>
> I don't think we can easily do so for bulk apply without incurring 
> significant overhead. It's pretty cheap to read in forward order and then 
> process backwards on a page level - but for an entire transaction's undo the 
> story is different. We can't necessarily keep all of it in memory, so we'd 
> have to read the undo twice to find the end. Right?
>
I was not talking about the entire transaction; I was also talking
about the page level, as you suggested.  I was just saying that we may
not need the start position of the last undo record of the transaction
for registering the rollback request (which we currently do).
However, we need to know the end of the transaction to know the last
page from which we need to start reading forward.

Let me explain with an example

Transaction1
first, undo start at 10
first, undo end at 100
second, undo start at 101
second, undo end at 200
..
last, undo start at 1000
last, undo end at  1100

Transaction2
first, undo start at 1101
first, undo end at 1200
second, undo start at 1201
second, undo end at 1300

Suppose we want to register the request for Transaction1.  Then
currently we need to know the start undo record pointer (10 as per the
above example) and the last undo record pointer (1000).  But, we only
know the start undo record pointer (10) and the start of the next
transaction (1101).  So for calculating the start of the last record,
we use 1101 - 101 (the length of the last record, stored in the 2 bytes
before 1101).

So, now I am saying that maybe we don't need to compute the start of
the last undo record (1000) because it's enough to know the end of the
last undo record (1100): whichever page the last undo record ends on,
we can start from that page and read forward on that page.

* All numbers I used in the above example can be considered as undo
record pointers.
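
The arithmetic in the example can be spelled out in a few lines of
standalone C (hypothetical helper, undo record pointers treated as
plain integers): the two bytes immediately before the next
transaction's start hold the length of our transaction's last record,
so its start is simply that start minus the length, i.e. 1101 - 101 =
1000 above.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t UndoRecPtr;

/* Start of the previous transaction's last record, given where the next
 * transaction starts and the length stored in the 2 bytes before it. */
static UndoRecPtr
last_record_start(UndoRecPtr next_xact_start, uint16_t last_record_len)
{
    return next_xact_start - last_record_len;
}

int
main(void)
{
    /* numbers from the example above; prints 1000 */
    printf("%llu\n", (unsigned long long) last_record_start(1101, 101));
    return 0;
}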

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 7:34 PM Robert Haas  wrote:
>
> On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar  wrote:
> > Yeah, we can handle the bulk fetch as you suggested and it will make
> > it a lot easier.  But, currently while registering the undo request
> > (especially during the first pass) we need to compute the from_urecptr
> > and the to_urecptr. And,  for computing the from_urecptr,  we have the
> > end location of the transaction because we have the uur_next in the
> > transaction header and that will tell us the end of our transaction
> > but we still don't know the undo record pointer of the last record of
> > the transaction.  As of now, we read the previous 2 bytes from the end of
> > the transaction to know the length of the last record and from there
> > we can compute the undo record pointer of the last record and that is
> > our from_urecptr.
>
> I don't understand this.  If we're registering an undo request at "do"
> time, we don't need to compute the starting location; we can just
> remember the UndoRecPtr of the first record we inserted.  If we're
> reregistering an undo request after a restart, we can (and, I think,
> should) work forward from the discard location rather than backward
> from the insert location.

Right, we work forward from the discard location.  So after the
discard location, while traversing the undo log, when we encounter an
aborted transaction we need to register its rollback request.  And,
for doing that we need 1) the start location of the first undo record and
2) the start location of the last undo record (the last undo record pointer).

We already have 1).  But we have to compute 2).  For doing that, if we
unpack the first undo record we will know the start of the next
transaction.  From there, if we read the last two bytes, they will give
us the length of the last undo record of our transaction.  So we can
compute 2) with the formula below:

start of the last undo record = start of the next transaction - length
of our transaction's last record.

Am I making sense here?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Robert Haas
On Thu, Aug 22, 2019 at 12:54 AM Andres Freund  wrote:
> But why? It makes a *lot* more sense to have it in the beginning. I
> don't think bulk-fetch really requires it to be in the end - we can
> still process records forward on a page-by-page basis.

There are two separate needs here: to be able to go forward, and to be
able to go backward.  We have the length at the end of each record not
because we're stupid, but so that we can back up.  If we have another
way of backing up, then the thing to do is not to move that to
beginning of the record but to remove it entirely as unnecessary
wastage.  We can also think about how to improve forward traversal.

Considering each problem separately:

For forward traversal, we could simplify things somewhat by having
only 3 decoding stages instead of N decoding stages.  We really only
need (1) a stage for accumulating bytes until we have uur_info, and
then (2) a stage for accumulating bytes until we know the payload and
tuple lengths, and then (3) a stage for accumulating bytes until we
have the whole record.  We have a lot more stages than that right now
but I don't think we really need them for anything. Originally we had
them so that we could do incremental decoding to find the transaction
header in the record, but now that the transaction header is at a
fixed offset, I think the multiplicity of stages is just baggage.

We could simplify things more by deciding that the first two bytes of
the record are going to contain the record size. That would increase
the size of the record by 2 bytes, but we could (mostly) claw those
bytes back by not storing the size of both uur_payload and uur_tuple.
The size of the other one would be computed by subtraction: take the
total record size, subtract the size of whichever of those two things
we store, subtract the mandatory and optional headers that are
present, and the rest must be the other value. That would still add 2
bytes for records that contain neither a payload nor a tuple, but that
would probably be OK given that (a) a lot of records wouldn't be
affected, (b) the code would get simpler, and (c) something like this
seems necessary anyway given that we want to make the record format
more generic. With this approach instead of 3 stages we only need 2:
(1) accumulating bytes until we have the 2-byte length word, and (2)
accumulating bytes until we have the whole record.
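
A sketch of the length-by-subtraction idea with invented field names:
if the record begins with its total size and only the payload length is
stored explicitly, the tuple length is whatever remains once the
headers and payload are accounted for.

#include <stddef.h>
#include <stdint.h>

typedef struct UndoRecordSizesSketch
{
    uint16_t    total_len;              /* the 2-byte length word at the front */
    size_t      header_len;             /* mandatory plus optional headers     */
    uint16_t    payload_len;            /* the one length stored explicitly    */
} UndoRecordSizesSketch;

/* The tuple length is implied rather than stored. */
static size_t
tuple_len_by_subtraction(const UndoRecordSizesSketch *s)
{
    return (size_t) s->total_len - s->header_len - (size_t) s->payload_len;
}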

For backward traversal, as I see it, there are basically two options.
One is to do what we're doing right now, and store the record length
at the end of the record. (That might mean that a record both begins
and ends with its own length, which is not a crazy design.) The other
is to do what I think you are proposing here: locate the beginning of
the first record on the page, presumably based on some information
stored in the page header, and then work forward through the page to
figure out where all the records start.  Then process them in reverse
order. That saves 2 bytes per record.  It's a little more expensive in
terms of CPU cycles, especially if you only need some of the records
on the page but not all of them, but that's probably not too bad.
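
A sketch of that second approach (hypothetical page layout,
caller-supplied length function): walk forward from the first record
offset recorded in the page header, remember where each record starts,
and then visit those offsets in reverse.

#include <stddef.h>

#define MAX_RECS_PER_PAGE 512

/* caller-supplied: length of the record that begins at offset 'off' */
typedef size_t (*record_len_fn) (const char *page, size_t off);

/* Collect the start offsets of all records on the page in forward
 * order; the caller then processes offsets[n-1] down to offsets[0]. */
static int
collect_record_offsets(const char *page, size_t page_size,
                       size_t first_record_off, record_len_fn reclen,
                       size_t offsets[MAX_RECS_PER_PAGE])
{
    int     n = 0;
    size_t  off = first_record_off;

    while (off < page_size && n < MAX_RECS_PER_PAGE)
    {
        offsets[n++] = off;
        off += reclen(page, off);
    }
    return n;
}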

I think I'm basically agreeing with what you are proposing but I think
it's important to spell out the underlying concerns, because otherwise
I'm afraid we might think we have a meeting of the minds when we don't
really.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Andres Freund
Hi

On August 22, 2019 9:14:10 AM PDT, Dilip Kumar  wrote:
> But, those requests will
>ultimately be used for collecting the record by the bulk fetch.  So if
>we are planning to change the bulk fetch to read forward then maybe we
>don't need the valid last undo record pointer because that we will
>anyway get while processing forward.  So just knowing the end of the
>transaction is sufficient for us to know where to stop.  I am not sure
>if this solution has any problem.  Probably  I should think again in
>the morning when my mind is well-rested.

I don't think we can easily do so for bulk apply without incurring significant 
overhead. It's pretty cheap to read in forward order and then process backwards 
on a page level - but for an entire transaction's undo the story is different. 
We can't necessarily keep all of it in memory, so we'd have to read the undo 
twice to find the end. Right?

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 9:21 PM Dilip Kumar  wrote:
>
> On Thu, Aug 22, 2019 at 7:34 PM Robert Haas  wrote:
> >
> > On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar  wrote:
> > > Yeah, we can handle the bulk fetch as you suggested and it will make
> > > it a lot easier.  But, currently while registering the undo request
> > > (especially during the first pass) we need to compute the from_urecptr
> > > and the to_urecptr. And,  for computing the from_urecptr,  we have the
> > > end location of the transaction because we have the uur_next in the
> > > transaction header and that will tell us the end of our transaction
> > > but we still don't know the undo record pointer of the last record of
> > > the transaction.  As of now, we read the previous 2 bytes from the end of
> > > the transaction to know the length of the last record and from there
> > > we can compute the undo record pointer of the last record and that is
> > > our from_urecptr.
> >
> > I don't understand this.  If we're registering an undo request at "do"
> > time, we don't need to compute the starting location; we can just
> > remember the UndoRecPtr of the first record we inserted.  If we're
> > reregistering an undo request after a restart, we can (and, I think,
> > should) work forward from the discard location rather than backward
> > from the insert location.
>
> Right, we work forward from the discard location.  So after the
> discard location,  while traversing the undo log when we encounter an
> aborted transaction we need to register its rollback request.  And,
> for doing that we need 1) start location of the first undo record . 2)
> start location of the last undo record (last undo record pointer).
>
> We already have 1).  But we have to compute 2).   For doing that if we
> unpack the first undo record we will know the start of the next
> transaction.  From there if we read the last two bytes then that will
> have the length of the last undo record of our transaction.  So we can
> compute 2) with below formula
>
> start of the last undo record = start of the next transaction - length
> of our transaction's last record.

Maybe I am saying that because I am just thinking about how the requests
are registered as per the current code.  But those requests will
ultimately be used for collecting the records by the bulk fetch.  So if
we are planning to change the bulk fetch to read forward, then maybe we
don't need the valid last undo record pointer, because we will get that
anyway while processing forward.  So just knowing the end of the
transaction is sufficient for us to know where to stop.  I am not sure
if this solution has any problem.  Probably I should think about it
again in the morning when my mind is well-rested.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Robert Haas
On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar  wrote:
> Yeah, we can handle the bulk fetch as you suggested and it will make
> it a lot easier.  But, currently while registering the undo request
> (especially during the first pass) we need to compute the from_urecptr
> and the to_urecptr. And,  for computing the from_urecptr,  we have the
> end location of the transaction because we have the uur_next in the
> transaction header and that will tell us the end of our transaction
> but we still don't know the undo record pointer of the last record of
> the transaction.  As of now, we read previous 2 bytes from the end of
> the transaction to know the length of the last record and from there
> we can compute the undo record pointer of the last record and that is
> our from_urecptr.

I don't understand this.  If we're registering an undo request at "do"
time, we don't need to compute the starting location; we can just
remember the UndoRecPtr of the first record we inserted.  If we're
reregistering an undo request after a restart, we can (and, I think,
should) work forward from the discard location rather than backward
from the insert location.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Amit Kapila
On Thu, Aug 22, 2019 at 11:04 AM Dilip Kumar  wrote:
>
> On Thu, Aug 22, 2019 at 10:24 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2019-08-22 10:19:04 +0530, Dilip Kumar wrote:
> > > On Thu, Aug 22, 2019 at 9:58 AM Andres Freund  wrote:
> > > >
> > > > Hi,
> > > >
> > > > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote:
> > > > > We can not know the complete size of the record even by reading the
> > > > > header because we have a payload that is variable part and payload
> > > > > length are stored in the payload header which again can be at random
> > > > > offset.
> > > >
> > > > Wait, but that's just purely self inflicted damage, no? The initial
> > > > length just needs to include the payload. And all this is not an issue
> > > > anymore?
> > > >
> > > Actually, we store the undo length only at the end of the record and
> > > that is for traversing the transaction's undo record chain during bulk
> > > fetch.  As such in the beginning of the record we don't have the undo
> > > length.  We do have uur_info but that just tell us which all optional
> > > header are included in the record.
> >
> > But why? It makes a *lot* more sense to have it in the beginning. I
> > don't think bulk-fetch really requires it to be in the end - we can
> > still process records forward on a page-by-page basis.
>
> Yeah, we can handle the bulk fetch as you suggested and it will make
> it a lot easier.  But, currently while registering the undo request
> (especially during the first pass) we need to compute the from_urecptr
> and the to_urecptr. And,  for computing the from_urecptr,  we have the
> end location of the transaction because we have the uur_next in the
> transaction header and that will tell us the end of our transaction
> but we still don't know the undo record pointer of the last record of
> the transaction.  As of now, we read previous 2 bytes from the end of
> the transaction to know the length of the last record and from there
> we can compute the undo record pointer of the last record and that is
> our from_urecptr.
>

How about if we store the location of the last record of the
transaction instead of the location of the next transaction in the
transaction header?  I think if we do that then the discard worker might
need to do some additional work in some cases, since it needs to tell
the location up to which discard is possible; however, many other cases
might get simplified.  With this also, when the log is switched while
writing records for the same transaction, the transaction header in
the first log will store the start location of the same transaction's
records in the next log.
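
To make the proposal concrete, the difference is roughly the following
(hypothetical field names only; the patch's real transaction header
contains more than this):

/* Today: the transaction header points at the start of the *next*
 * transaction's undo, and the last record must be found via the
 * trailing 2-byte length. */
typedef struct UndoTxnHeaderTodaySketch
{
    uint64_t    xid;        /* stand-in for FullTransactionId */
    uint64_t    urec_next;  /* start of the next transaction's undo */
} UndoTxnHeaderTodaySketch;

/* Proposal: point directly at this transaction's last undo record. */
typedef struct UndoTxnHeaderProposedSketch
{
    uint64_t    xid;
    uint64_t    urec_last;  /* start of this transaction's last record */
} UndoTxnHeaderProposedSketch;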

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 10:24 AM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-22 10:19:04 +0530, Dilip Kumar wrote:
> > On Thu, Aug 22, 2019 at 9:58 AM Andres Freund  wrote:
> > >
> > > Hi,
> > >
> > > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote:
> > > > We can not know the complete size of the record even by reading the
> > > > header because we have a payload that is variable part and payload
> > > > length are stored in the payload header which again can be at random
> > > > offset.
> > >
> > > Wait, but that's just purely self inflicted damage, no? The initial
> > > length just needs to include the payload. And all this is not an issue
> > > anymore?
> > >
> > Actually, we store the undo length only at the end of the record and
> > that is for traversing the transaction's undo record chain during bulk
> > fetch.  As such in the beginning of the record we don't have the undo
> > length.  We do have uur_info but that just tell us which all optional
> > header are included in the record.
>
> But why? It makes a *lot* more sense to have it in the beginning. I
> don't think bulk-fetch really requires it to be in the end - we can
> still process records forward on a page-by-page basis.

Yeah, we can handle the bulk fetch as you suggested and it will make
it a lot easier.  But, currently while registering the undo request
(especially during the first pass) we need to compute the from_urecptr
and the to_urecptr. And,  for computing the from_urecptr,  we have the
end location of the transaction because we have the uur_next in the
transaction header and that will tell us the end of our transaction
but we still don't know the undo record pointer of the last record of
the transaction.  As of now, we read previous 2 bytes from the end of
the transaction to know the length of the last record and from there
we can compute the undo record pointer of the last record and that is
our from_urecptr.

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 9:58 AM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote:
> > We can not know the complete size of the record even by reading the
> > header because we have a payload that is variable part and payload
> > length are stored in the payload header which again can be at random
> > offset.
>
> Wait, but that's just purely self inflicted damage, no? The initial
> length just needs to include the payload. And all this is not an issue
> anymore?
>
Actually, we store the undo length only at the end of the record and
that is for traversing the transaction's undo record chain during bulk
fetch.  As such, in the beginning of the record we don't have the undo
length.  We do have uur_info, but that just tells us which optional
headers are included in the record.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Andres Freund
Hi,

On 2019-08-22 10:19:04 +0530, Dilip Kumar wrote:
> On Thu, Aug 22, 2019 at 9:58 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote:
> > > We can not know the complete size of the record even by reading the
> > > header because we have a payload that is variable part and payload
> > > length are stored in the payload header which again can be at random
> > > offset.
> >
> > Wait, but that's just purely self inflicted damage, no? The initial
> > length just needs to include the payload. And all this is not an issue
> > anymore?
> >
> Actually, we store the undo length only at the end of the record and
> that is for traversing the transaction's undo record chain during bulk
> fetch.  As such in the beginning of the record we don't have the undo
> length.  We do have uur_info but that just tell us which all optional
> header are included in the record.

But why? It makes a *lot* more sense to have it in the beginning. I
don't think bulk-fetch really requires it to be in the end - we can
still process records forward on a page-by-page basis.

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Andres Freund
Hi,

On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote:
> We can not know the complete size of the record even by reading the
> header because we have a payload that is variable part and payload
> length are stored in the payload header which again can be at random
> offset.

Wait, but that's just purely self inflicted damage, no? The initial
length just needs to include the payload. And all this is not an issue
anymore?

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Dilip Kumar
On Wed, Aug 21, 2019 at 9:04 PM Robert Haas  wrote:
>
> On Wed, Aug 21, 2019 at 3:55 AM Dilip Kumar  wrote:
> > I have already attempted that part and I feel it is not making code
> > any simpler than what we have today.  For packing, it's fine because I
> > can process all the member once and directly pack it into one memory
> > chunk and I can insert that to the buffer by one call of
> > InsertUndoBytes and that will make the code simpler.
>
> OK...

>
> The bigger issue here is that we don't seem to be making very much
> progress toward improving the overall design.  Several significant
> improvements have been suggested:
>
> 1. Thomas suggested quite some time ago that we should make sure that
> the transaction header is the first optional header.

I think this was already done at least 2-3 versions ago.  So now we
are updating the transaction header directly by writing at that offset,
and we don't need staging for this.
>  If we do that,
> then I'm not clear on why we even need this incremental unpacking
> stuff any more. The only reason for all of this machinery was so that
> we could find the transaction header at an unknown offset inside a
> complex record format; there is, if I understand correctly, no other
> case in which we want to incrementally decode a record.
> But if the
> transaction header is at a fixed offset, then there seems to be no
> need to even have incremental decoding at all.  Or it can be very
> simple, with three stages: (1) we don't yet have enough bytes to
> figure out how big the record is; (2) we have enough bytes to figure
> out how big the record is and we have figured that out but we don't
> yet have all of those bytes; and (3) we have the whole record, we can
> decode the whole thing and we're done.

We can not know the complete size of the record even by reading the
header, because the payload is a variable-length part and the payload
lengths are stored in the payload header, which again can be at a random
offset.  But maybe we can still follow this idea, which will make
unpacking far simpler. I have a few ideas:
a) Store payload header right after the transaction header so that we
can easily know the complete record size.
b) Once we decode the first header by uur_info we can compute an exact
offset of the payload header and from there we can know the record
size.
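
A rough standalone sketch, loosely illustrating idea (a) together with
the three stages described above: if the payload and tuple lengths live
at a fixed offset near the front, a fixed-size prefix is enough to learn
the full record size.  All names and the layout here are hypothetical,
not the patch's format.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum
{
    UNDO_DECODE_NEED_FIXED_PART, /* too few bytes to know the record size */
    UNDO_DECODE_NEED_REST,       /* size known, waiting for the remainder */
    UNDO_DECODE_DONE             /* whole record is available */
} UndoDecodeStageSketch;

/* Hypothetical fixed-size prefix: uur_info plus the two variable lengths. */
typedef struct
{
    uint16_t    uur_info;
    uint16_t    payload_len;
    uint16_t    tuple_len;
} UndoFixedPartSketch;

/*
 * Look at 'avail' readable bytes and report which stage we are in;
 * '*total_len' is filled in once the fixed part has been seen.
 */
static UndoDecodeStageSketch
undo_decode_step(const char *buf, size_t avail, size_t *total_len)
{
    UndoFixedPartSketch fixed;

    if (avail < sizeof(fixed))
        return UNDO_DECODE_NEED_FIXED_PART;

    memcpy(&fixed, buf, sizeof(fixed));
    *total_len = sizeof(fixed) + fixed.payload_len + fixed.tuple_len;

    return (avail >= *total_len) ? UNDO_DECODE_DONE : UNDO_DECODE_NEED_REST;
}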

>
> 2. Based on a discussion with Thomas, I suggested the GHOB stuff,
> which gets rid of the idea of transaction headers inside individual
> records altogether; instead, there is one undo blob per transaction
> (or maybe several if we overflow to another undo log) which begins
> with a sentinel byte that identifies it as owned by a transaction, and
> then the transaction header immediately follows that without being
> part of any record, and the records follow that data.  As with the
> previous idea, this gets rid of the need for incremental decoding
> because it gets rid of the need to find the transaction header inside
> of a bigger record. As Thomas put it to me off-list, it puts the
> records inside of a larger chunk of space owned by the transaction
> instead of putting the transaction header inside of some particular
> record; that seems more correct than the way we have it now.
>
> 3. Heikki suggested that the format should be much more generic and
> that more should be left up to the AM.  While neither Andres nor I are
> convinced that we should go as far in that direction as Heikki is
> proposing, the idea clearly has some merit, and we should probably be
> moving that way to some degree. For instance, the idea that we should
> store a block number and TID is a bit sketchy when you consider that a
> multi-insert operation really wants to store a TID list. The zheap
> tree has a ridiculous kludge to work around that problem; clearly we
> need something better.  We've also mentioned that, in the future, we
> might want to support TIDs that are not 6 bytes, and that even just
> looking at what's currently under development, zedstore wants to treat
> TIDs as 48-bit opaque quantities, not a 4-byte block number and a
> 2-byte item pointer offset.  So, there is clearly a need to go through
> the whole way we're doing this and rethink which parts are generic and
> which parts are AM-specific.
>
> 4. A related problem, which has been mentioned or at least alluded to
> by both Heikki and by me, is that we need a better way of handling the
> AM-specific data. Right now, the zheap code packs fixed-size things
> into the payload data and then finds them by knowing the offset where
> any particular data is within that field, but that's an unmaintainable
> mess.  The zheap code could be improved by at least defining those
> offsets as constants someplace and adding some comments explaining the
> payload formats of various undo records, but even if we do that, it's
> not going to generalize very well to anything more complicated than a
> few fixed-size bits of data.  I suggested using the pqformat stuff to
> try to 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Robert Haas
On Wed, Aug 14, 2019 at 2:57 AM Andres Freund  wrote:
> - My reading of the current xact.c integration is that it's not workable
>   as is. Undo is executed outside of a valid transaction state,
>   exceptions aren't properly undone, logic would need to be duplicated
>   to a significant degree, new kind of critical section.

Regarding this particular point:

ReleaseResourcesAndProcessUndo() is only supposed to be called after
AbortTransaction(), and the additional steps it performs --
AtCleanup_Portals() and AtEOXact_Snapshot() or alternatively
AtSubCleanup_Portals -- are taken from Cleanup(Sub)Transaction.
That's not crazy; the other steps in Cleanup(Sub)Transaction() look
like stuff that's intended to be performed when we're totally done
with this TransactionState stack entry, whereas these things are
slightly higher-level cleanups that might even block undo (e.g.
undropped portal prevents orphaned file cleanup).  Granted, there are
no comments explaining why those particular cleanup steps are
performed here, and it's possible some other approach is better, but I
think perhaps it's not quite as flagrantly broken as you think.

I am also not convinced that semi-critical sections are a bad idea,
although the if (SemiCritSectionCount > 0) test at the start of
ReleaseResourcesAndProcessUndo() looks wrong.  To roll back a
subtransaction, we must perform undo in the foreground, and if that
fails, the toplevel transaction can't be allowed to commit, full stop.
Since we expect this to be a (very) rare scenario, I don't know why
escalating to FATAL is a catastrophe.  The only other option is to do
something along the lines of SxactIsDoomed(), where we force all
subsequent commits (and sub-commits?) within the toplevel xact to
fail. You can argue that the latter is a better user experience, and
for SSI I certainly agree, but this case isn't quite the same: there's
a good chance we're dealing with a corrupted page or system
administrator intervention to try to kill off a long-running undo
task, and continuing in such cases seems a lot more dubious than after
a serializability failure, where retrying is the expected recovery
mode. The other case is where toplevel undo for a temporary table
fails.  It is unclear to me what, other than FATAL, could suffice
there.  I guess you could just let the session continue and leave the
transaction undone, letting whatever MVCC machinery the table AM may
have look through it, but that sounds inferior to me. Rip the bandaid
off.

Some general complaints from my side about the xact.c changes:

1. The code structure doesn't seem quite right.  For example:

1a. ProcessUndoRequestForEachLogCat has a try/catch block, but it
seems to me that the job of a try/catch block is to provide structured
error-handling for code for resources for which there's no direct
handling in xact.c or resowner.c.  Here, we're inside of xact.c, so
why are we adding a try/catch block?
1b. ReleaseResourcesAndProcessUndo does part of the work of cleaning
up a failed transaction but not all of it, the rest being done by
AbortTransaction, which is called before entering it, plus it also
kicks off the actual undo work.  I would expect a cleaner division of
responsibility.
1c. Having an undo request per UndoLogCategory rather than one per
transaction doesn't seem right to me; hopefully that will get cleaned
up when the undorequest.c stuff I sent before is integrated.
1d. The code at the end of FinishPreparedTransaction() seems to expect
that the called code will catch any error, but that clearing the error
state might need to happen here, and also that we should fire up a new
transaction; I suspect, but am not entirely sure, that that is not the
way it should work.  The code added earlier in that function also
looks suspicious, because it's filling up what is basically a
high-level control function with a bunch of low-level undo-specific
details.  In both places, the undo-specific concerns probably need to
be better-isolated.

2. Signaling is done using some odd-looking mechanisms.  For instance:

2a. The SemiCritSectionCount > 0 test at the top of
ReleaseResourcesAndProcessUndo that I complained about earlier looks
like a guard against reentrancy, but that must be the wrong way to get
there; it makes it impossible to reuse what is ostensibly a
general-purpose facility for any non-undo related purpose without
maybe breaking something.
2b. ResetUndoActionsInfo() is called from a bunch of place, but only 2
of those places have a comment explaining why, and the function
comment is pretty unilluminating. This looks like some kind of
signaling machinery, but it's not very clear to me what it's actually
trying to do.
2c. ResourceOwnerReleaseInternal() is directly calling
NeedToPerformUndoActions(), which feels like a layering violation.
2d. I'm not really sure that TRANS_UNDO is serving any useful purpose;
I think we need TBLOCK_UNDO and TBLOCK_SUBUNDO, but I'm not really
sure TRANS_UNDO is doing anything useful; the 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Andres Freund



On August 21, 2019 8:36:34 AM PDT, Robert Haas  wrote:
> We treat LWLockAcquire() as a no-fail operation in many
>places; in my opinion, that elog(ERROR) that we have for too many
>LWLocks should be changed to elog(PANIC) precisely because we do treat
>LWLockAcquire() as no-fail in lots of places in the code, but I think
>I suggested that once and got slapped down, and I haven't had the
>energy to fight about it.

Fwiw, that proposal has my vote.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Robert Haas
On Wed, Aug 21, 2019 at 3:55 AM Dilip Kumar  wrote:
> I have already attempted that part and I feel it is not making code
> any simpler than what we have today.  For packing, it's fine because I
> can process all the member once and directly pack it into one memory
> chunk and I can insert that to the buffer by one call of
> InsertUndoBytes and that will make the code simpler.

OK...

> But, while unpacking if I directly unpack to the UnpackUndoRecord then
> there are a few complexities.  I am not saying those are difficult to
> implement but code may not look better.
>
> a) First, we need to add extra stages for unpacking as we need to do
> field by field.
>
> b) Some of the members like uur_payload and uur_tuple are not the same
> type in the UnpackUndoRecord compared to how it is stored in the page.
> In UnpackUndoRecord those are StringInfoData whereas on the page we
> store it as UndoRecordPayload header followed by the actual data.  I
> am not saying we can not unpack this directly we can do it like,
> first read the payload length from the page in uur_payload.len then
> read tuple length in uur_tuple.len then read both the data.  And, for
> that, we will have to add extra stages.

I don't think that this is true; or at least, I think it's avoidable.
The idea of an unpacking stage is that you refuse to advance to the
next stage until you've got a certain number of bytes of data; and
then you unpack everything that pertains to that stage.  If you have 2
4-byte fields that you want to unpack together, you can just wait
until you've got 8 bytes of data and then unpack both.  You don't really
need 2 separate stages.  (Similarly, your concern about fields being
in a different order seems like it should be resolved by agreeing on
one ordering and having everything use it; I don't know why there
should be one order that is better in memory and another order that is
better on disk.)
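
For illustration, a trivial standalone example of that point
(hypothetical names, not the patch's unpack code): a single stage waits
until 8 bytes are available and then unpacks both 4-byte fields together.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct
{
    uint32_t    field_a;
    uint32_t    field_b;
} TwoFieldsSketch;

/*
 * Returns true once both fields have been unpacked; until then the caller
 * stays in this stage and keeps feeding more bytes ('avail' is how many
 * input bytes are currently readable).
 */
static bool
unpack_two_fields(const char *buf, size_t avail, TwoFieldsSketch *out)
{
    if (avail < 2 * sizeof(uint32_t))
        return false;           /* wait for more data, same stage */

    memcpy(&out->field_a, buf, sizeof(uint32_t));
    memcpy(&out->field_b, buf + sizeof(uint32_t), sizeof(uint32_t));
    return true;                /* both fields unpacked in one stage */
}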

The bigger issue here is that we don't seem to be making very much
progress toward improving the overall design.  Several significant
improvements have been suggested:

1. Thomas suggested quite some time ago that we should make sure that
the transaction header is the first optional header.  If we do that,
then I'm not clear on why we even need this incremental unpacking
stuff any more. The only reason for all of this machinery was so that
we could find the transaction header at an unknown offset inside a
complex record format; there is, if I understand correctly, no other
case in which we want to incrementally decode a record. But if the
transaction header is at a fixed offset, then there seems to be no
need to even have incremental decoding at all.  Or it can be very
simple, with three stages: (1) we don't yet have enough bytes to
figure out how big the record is; (2) we have enough bytes to figure
out how big the record is and we have figured that out but we don't
yet have all of those bytes; and (3) we have the whole record, we can
decode the whole thing and we're done.

2. Based on a discussion with Thomas, I suggested the GHOB stuff,
which gets rid of the idea of transaction headers inside individual
records altogether; instead, there is one undo blob per transaction
(or maybe several if we overflow to another undo log) which begins
with a sentinel byte that identifies it as owned by a transaction, and
then the transaction header immediately follows that without being
part of any record, and the records follow that data.  As with the
previous idea, this gets rid of the need for incremental decoding
because it gets rid of the need to find the transaction header inside
of a bigger record. As Thomas put it to me off-list, it puts the
records inside of a larger chunk of space owned by the transaction
instead of putting the transaction header inside of some particular
record; that seems more correct than the way we have it now.

3. Heikki suggested that the format should be much more generic and
that more should be left up to the AM.  While neither Andres nor I are
convinced that we should go as far in that direction as Heikki is
proposing, the idea clearly has some merit, and we should probably be
moving that way to some degree. For instance, the idea that we should
store a block number and TID is a bit sketchy when you consider that a
multi-insert operation really wants to store a TID list. The zheap
tree has a ridiculous kludge to work around that problem; clearly we
need something better.  We've also mentioned that, in the future, we
might want to support TIDs that are not 6 bytes, and that even just
looking at what's currently under development, zedstore wants to treat
TIDs as 48-bit opaque quantities, not a 4-byte block number and a
2-byte item pointer offset.  So, there is clearly a need to go through
the whole way we're doing this and rethink which parts are generic and
which parts are AM-specific.

4. A related problem, which has been mentioned or at least alluded to
by both Heikki and by me, is that we need a better way of 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Robert Haas
On Wed, Aug 21, 2019 at 6:38 AM Amit Kapila  wrote:
> > FWIW, although I also thought of doing what you are describing here, I
> > think Andres's proposal is probably preferable, because it's simpler.
> > There's not really any reason why we can't take the buffer locks from
> > within the critical section, and that way callers don't have to deal
> > with the extra step.
>
> IIRC, the reason this was done before starting critical section was
> because of coding convention mentioned in src/access/transam/README
> (Section: Write-Ahead Log Coding).   It says first pin and exclusive
> lock the shared buffers and then start critical section.  It might be
> that we can bypass that convention here, but I guess it is mainly to
> avoid any error in the critical section.  I have checked the
> LWLockAcquire path and there doesn't seem to be any reason that it
> will throw error except when the caller has acquired many locks at the
> same time which is not the case here.

Yeah, I think it's fine to deviate from that convention in this
respect.  We treat LWLockAcquire() as a no-fail operation in many
places; in my opinion, that elog(ERROR) that we have for too many
LWLocks should be changed to elog(PANIC) precisely because we do treat
LWLockAcquire() as no-fail in lots of places in the code, but I think
I suggested that once and got slapped down, and I haven't had the
energy to fight about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Amit Kapila
On Tue, Aug 20, 2019 at 8:10 PM Robert Haas  wrote:
>
> On Tue, Aug 20, 2019 at 2:42 AM Amit Kapila  wrote:
> > > Well, my main point, which so far has largely been ignored, was that we
> > > may not acquire page locks when we still need to search for victim
> > > buffers later. If we don't need to lock the pages up-front, but only do
> > > so once we're actually copying the records into the undo pages, then we
> > > don't need a separate phase to acquire the locks. We can still hold all of
> > > the page locks at the same time, as long as we just acquire them at the
> > > later stage.
> >
> > Okay, IIUC, this means that we should have a separate phase where we
> > call LockUndoBuffers (or something like that) before
> > InsertPreparedUndo and after PrepareUndoInsert.  The LockUndoBuffers
> > will lock all the buffers pinned during PrepareUndoInsert.  We can
> > probably call LockUndoBuffers before entering the critical section to
> > avoid any kind of failure in critical section.  If so, that sounds
> > reasonable to me.
>
> I'm kind of scratching my head here, because this is clearly different
> than what Andres said in the quoted text to which you were replying.
> He clearly implied that we should acquire the buffer locks within the
> critical section during InsertPreparedUndo, and you responded by
> proposing to do it outside the critical section in a separate step.
> Regardless of which way is actually better, when somebody says "hey,
> let's do A!" and you respond by saying "sounds good, I'll go implement
> B!" that's not really helping us to get toward a solution.
>

I got confused by the statement "We can still hold all of the page
locks at the same time, as long as we just acquire them at the later
stage."

> FWIW, although I also thought of doing what you are describing here, I
> think Andres's proposal is probably preferable, because it's simpler.
> There's not really any reason why we can't take the buffer locks from
> within the critical section, and that way callers don't have to deal
> with the extra step.
>

IIRC, the reason this was done before starting critical section was
because of coding convention mentioned in src/access/transam/README
(Section: Write-Ahead Log Coding).   It says first pin and exclusive
lock the shared buffers and then start critical section.  It might be
that we can bypass that convention here, but I guess it is mainly to
avoid any error in the critical section.  I have checked the
LWLockAcquire path and there doesn't seem to be any reason that it
will throw error except when the caller has acquired many locks at the
same time which is not the case here.
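
For reference, the convention being discussed looks roughly like this --
a generic sketch of the recipe in src/access/transam/README, not the
undo-specific code; RM_FOO_ID, XLOG_FOO_OP and apply_change_to_page()
are placeholders:

/* 1. Pin and exclusive-lock the buffer(s) that will be modified. */
buf = ReadBuffer(rel, blkno);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

/* 2. Only then enter the critical section. */
START_CRIT_SECTION();

/* 3. Apply the change and mark the buffer dirty. */
apply_change_to_page(BufferGetPage(buf));   /* placeholder */
MarkBufferDirty(buf);

/* 4. WAL-log the change and stamp the page with the record's LSN. */
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
recptr = XLogInsert(RM_FOO_ID, XLOG_FOO_OP);    /* placeholders */
PageSetLSN(BufferGetPage(buf), recptr);

/* 5. Leave the critical section, then unlock and unpin. */
END_CRIT_SECTION();
UnlockReleaseBuffer(buf);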


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Dilip Kumar
On Tue, Aug 20, 2019 at 7:57 PM Robert Haas  wrote:
>
> On Mon, Aug 19, 2019 at 2:04 AM Dilip Kumar  wrote:
> > Currently, In UnpackedUndoRecord we store all members directly which
> > are set by the caller.  We store pointers to some header which are
> > allocated internally by the undo layer and the caller need not worry
> > about setting them.  So now you are suggesting to put other headers
> > also as structures in UnpackedUndoRecord.  I as such don't have much
> > problem in doing that but I think initially Robert designed
> > UnpackedUndoRecord structure this way so it will be good if Robert
> > provides his opinion on this.
>
> I don't believe that's what is being suggested.  It seems to me that
> the thing Andres is complaining about here has roots in the original
> sketch that I did for this code.  The oldest version I can find is
> here:
>
> https://github.com/EnterpriseDB/zheap/commit/7d194824a18f0c5e85c92451beab4bc6f044254c
>
> In this version, and I think still in the current version, there is a
> two-stage marshaling strategy.  First, the individual fields from the
> UnpackedUndoRecord get copied into global variables (yes, that was my
> fault, too, at least in part!) which are structures. Then, the
> structures get copied into the target buffer. The idea of that design
> was to keep the code simple, but it didn't really work out, because
> things got a lot more complicated between the time I wrote those 3244
> lines of code and the >3000 lines of code that live in this patch
> today. One thing that changed was that we moved more and more in the
> direction of considering individual fields as separate objects to be
> separately included or excluded, whereas when I wrote that code I
> thought we were going to have groups of related fields that stood or
> fell together. That idea turned out to be wrong. (There is the
> even-larger question here of whether we ought to take Heikki's
> suggestion and make this whole thing a lot more generic, but let's
> start by discussing how the design that we have today could be
> better-implemented.)
>
> If I understand Andres correctly, he's arguing that we ought to get
> rid of the two-stage marshaling strategy.  During decoding, he wants
> data to go directly from the buffer that contains it to the
> UnpackedUndoRecord without ever being stored in the UnpackUndoContext.
> During insertion, he wants data to go directly from the
> UnpackedUndoRecord to the buffer that contains it.  Or, at least, if
> there has to be an intermediate format, he wants it to be just a chunk
> of raw bytes, rather than a bunch of individual fields like we have in
> UndoPackContext currently.  I think that's a reasonable goal.  I'm not
> as concerned about it as he is from a performance point of view, but I
> think it would make the code look nicer, and that would be good.  If
> we save CPU cycles along the way, that is also good.
>
> In broad outline, what this means is:
>
> 1. Any field in the UndoPackContext that starts with urec_ goes away.
> 2. Instead of something like InsertUndoBytes((char *)
> &(ucontext->urec_fxid), ...) we'd write InsertUndoBytes((char *)
> &uur->uur_fxid, ...).
> 3. Similarly instead of ReadUndoBytes((char *) &ucontext->urec_fxid,
> ...) we'd write ReadUndoBytes((char *) &uur->uur_fxid, ...).
> 4. It seems slightly trickier to handle the cases where we've got a
> structure instead of individual fields, like urec_hd.  But those could
> be broken down into field-by-field reads and writes, e.g. in this case
> one call for urec_type and a second for urec_info.
> 5. For uur_group and uur_logswitch, the code would need to allocate
> those subsidiary structures before copying into them.
>
> To me, that seems like it ought to be a pretty straightforward change
> that just makes things simpler.  We'd probably need to pass the
> UnpackedUndoRecord to BeginUnpackUndo instead of FinishUnpackUndo, and
> keep a pointer to it in the UnpackUndoContext, but that seems fine.
> FinishUnpackUndo would end up just about empty, maybe entirely empty.
>
> Is that a reasonable idea?
>
I have already attempted that part and I feel it is not making code
any simpler than what we have today.  For packing, it's fine because I
can process all the member once and directly pack it into one memory
chunk and I can insert that to the buffer by one call of
InsertUndoBytes and that will make the code simpler.

But, while unpacking if I directly unpack to the UnpackUndoRecord then
there are a few complexities.  I am not saying those are difficult to
implement but code may not look better.

a) First, we need to add extra stages for unpacking as we need to do
field by field.

b) Some of the members like uur_payload and uur_tuple are not the same
type in the UnpackUndoRecord compared to how it is stored in the page.
In UnpackUndoRecord those are StringInfoData whereas on the page we
store it as UndoRecordPayload header followed by the actual data.  I
am not saying we can not unpack this directly we can do it like,

Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Tue, Aug 13, 2019 at 8:11 AM Robert Haas  wrote:
> > We can probably check the fxid queue and error queue to get that
> > value.  However, I am not sure if that is sufficient because incase we
> > perform the request in the foreground, it won't be present in queues.
>
> Oh, I forgot about that requirement.  I think I can fix it so it does
> that fairly easily, but it will require a little bit of redesign which
> I won't have time to do this week.

Here's a version with a quick (possibly buggy) prototype of the
oldest-FXID support.  It also includes a bunch of comment changes,
pgindent, and a few other tweaks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


v2-0001-New-undo-request-manager-now-with-UndoRequestMana.patch
Description: Binary data


Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Mon, Aug 19, 2019 at 2:04 AM Dilip Kumar  wrote:
> Currently, In UnpackedUndoRecord we store all members directly which
> are set by the caller.  We store pointers to some header which are
> allocated internally by the undo layer and the caller need not worry
> about setting them.  So now you are suggesting to put other headers
> also as structures in UnpackedUndoRecord.  I as such don't have much
> problem in doing that but I think initially Robert designed
> UnpackedUndoRecord structure this way so it will be good if Robert
> provides his opinion on this.

I don't believe that's what is being suggested.  It seems to me that
the thing Andres is complaining about here has roots in the original
sketch that I did for this code.  The oldest version I can find is
here:

https://github.com/EnterpriseDB/zheap/commit/7d194824a18f0c5e85c92451beab4bc6f044254c

In this version, and I think still in the current version, there is a
two-stage marshaling strategy.  First, the individual fields from the
UnpackedUndoRecord get copied into global variables (yes, that was my
fault, too, at least in part!) which are structures. Then, the
structures get copied into the target buffer. The idea of that design
was to keep the code simple, but it didn't really work out, because
things got a lot more complicated between the time I wrote those 3244
lines of code and the >3000 lines of code that live in this patch
today. One thing that changed was that we moved more and more in the
direction of considering individual fields as separate objects to be
separately included or excluded, whereas when I wrote that code I
thought we were going to have groups of related fields that stood or
fell together. That idea turned out to be wrong. (There is the
even-larger question here of whether we ought to take Heikki's
suggestion and make this whole thing a lot more generic, but let's
start by discussing how the design that we have today could be
better-implemented.)

If I understand Andres correctly, he's arguing that we ought to get
rid of the two-stage marshaling strategy.  During decoding, he wants
data to go directly from the buffer that contains it to the
UnpackedUndoRecord without ever being stored in the UnpackUndoContext.
During insertion, he wants data to go directly from the
UnpackedUndoRecord to the buffer that contains it.  Or, at least, if
there has to be an intermediate format, he wants it to be just a chunk
of raw bytes, rather than a bunch of individual fields like we have in
UndoPackContext currently.  I think that's a reasonable goal.  I'm not
as concerned about it as he is from a performance point of view, but I
think it would make the code look nicer, and that would be good.  If
we save CPU cycles along the way, that is also good.

In broad outline, what this means is:

1. Any field in the UndoPackContext that starts with urec_ goes away.
2. Instead of something like InsertUndoBytes((char *)
&(ucontext->urec_fxid), ...) we'd write InsertUndoBytes((char *)
&uur->uur_fxid, ...).
3. Similarly instead of ReadUndoBytes((char *) &ucontext->urec_fxid,
...) we'd write ReadUndoBytes((char *) &uur->uur_fxid, ...).
4. It seems slightly trickier to handle the cases where we've got a
structure instead of individual fields, like urec_hd.  But those could
be broken down into field-by-field reads and writes, e.g. in this case
one call for urec_type and a second for urec_info.
5. For uur_group and uur_logswitch, the code would need to allocate
those subsidiary structures before copying into them.

To me, that seems like it ought to be a pretty straightforward change
that just makes things simpler.  We'd probably need to pass the
UnpackedUndoRecord to BeginUnpackUndo instead of FinishUnpackUndo, and
keep a pointer to it in the UnpackUndoContext, but that seems fine.
FinishUnpackUndo would end up just about empty, maybe entirely empty.

Is that a reasonable idea?
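
(For concreteness, a self-contained sketch of the general shape being
proposed -- decoding straight into the caller-supplied record instead of
into intermediate urec_* context fields.  The struct and helper below are
illustrative stand-ins, not the patch's actual API.)

#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct
{
    uint16_t    type;
    uint16_t    info;
    uint64_t    fxid;       /* stand-in for FullTransactionId */
} UnpackedRecordSketch;

/* Copy 'len' bytes out of the source image and advance the read cursor. */
static void
read_undo_bytes(void *dest, size_t len, const char **src)
{
    memcpy(dest, *src, len);
    *src += len;
}

/* Decode field by field, directly into the caller's record. */
static void
unpack_record(UnpackedRecordSketch *uur, const char *src)
{
    read_undo_bytes(&uur->type, sizeof(uur->type), &src);
    read_undo_bytes(&uur->info, sizeof(uur->info), &src);
    read_undo_bytes(&uur->fxid, sizeof(uur->fxid), &src);
}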

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Tue, Aug 20, 2019 at 5:02 AM Thomas Munro  wrote:
> 3.  UndoLogDiscard() uses DiscardBuffer() to invalidate any currently
> unpinned buffers, and marks as BM_DISCARDED any that happen to be
> pinned right now, so they can't be immediately invalidated.  Such
> buffers are never written back and are eligible for reuse on the next
> clock sweep, even if they're written into by a backend that managed to
> do that when we were trying to discard.

This is definitely more concurrent, but it might be *too much*
concurrency. Suppose that backend #1 is inserting a row and updating
the transaction header for the previous transaction; meanwhile,
backend #2 is discarding the previous transaction. It could happen
that backend #1 locks the transaction header for the previous
transaction and is all set to log the insertion ... but then gets
context-switched out.  Now backend #2 swoops in and logs the discard.
Backend #1 now wakes up and finishes logging a change to a page that,
according to the logic of the WAL stream, no longer exists.

It's probably possible to make this work by ignoring WAL references to
discarded pages during replay, but that seems a bit dangerous. At
least, it loses some sanity checking that you might like to have.

It seems to me that you can avoid this if you require that a backend
that wants to set BM_DISCARDED to acquire at least a shared content
lock before doing so.  If you do that, then once a backend acquires
content lock(s) on the page(s) containing the transaction header for
the purposes of updating it, it can notice that the BM_DISCARDED flag
is set and choose not to update those pages after all.  I think that
would be a smart design choice.
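
Sketched out, that check would look roughly like this (BM_DISCARDED is
the flag proposed in Thomas's patchset, and BufferIsDiscarded() is a
hypothetical helper for testing it; neither exists in core PostgreSQL):

/* Updating the previous transaction's header, racing with discard: */
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (BufferIsDiscarded(buf))     /* hypothetical test of BM_DISCARDED */
{
    /*
     * The discard has already been (or is being) logged; skip the update
     * rather than WAL-log a change to a page that no longer exists.
     */
    UnlockReleaseBuffer(buf);
}
else
{
    /*
     * Safe: the discarder must take at least a shared content lock to set
     * BM_DISCARDED, and we hold the exclusive lock, so the flag cannot
     * appear underneath us.
     */
    update_prev_transaction_header(buf);    /* placeholder */
    UnlockReleaseBuffer(buf);
}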

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Andres Freund
On 2019-08-20 09:44:23 -0700, Andres Freund wrote:
> On 2019-08-20 21:02:18 +1200, Thomas Munro wrote:
> > Aside from code changes based on review (and I have more to come of
> > those), the attached experimental patchset (also at
> > https://github.com/EnterpriseDB/zheap/tree/undo) has a new protocol
> > that, I hope, allows for better concurrency, reliability and
> > readability, and removes a bunch of TODO notes about questionable
> > interlocking.  However, I'm not quite done figuring out if the bufmgr
> > interaction is right and will be manageable on the undoaccess side, so
> > I'm hoping to get some feedback, not asking for anyone to rebase on
> > top of it yet.
> > 
> > Previously, there were two LWLocks used to make sure that all
> > discarding was prevented while anyone was reading or writing data in
> > any part of an undo log, and (probably more problematically) vice
> > versa.  Here's a new approach that removes that blocking:

Oh, one more point I forgot to add: Cool!




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Andres Freund
Hi,

On 2019-08-20 17:11:38 +0530, Dilip Kumar wrote:
> On Wed, Aug 14, 2019 at 10:35 PM Andres Freund  wrote:
> > On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote:
> > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:
> 
> > I don't think we can normally pin the undo buffers properly at that
> > stage. Without knowing the correct contents of the table page - which we
> > can't know without holding some form of lock preventing modifications -
> > we can't know how big our undo records are going to be. And we can't
> > just have buffers that don't exist on disk in shared memory, and we
> > don't want to allocate undo that we then don't need. So I think what
> > we'd have to do at that stage, is to "pre-allocate" buffers for the
> > maximum amount of UNDO needed, but mark the associated bufferdesc as not
> > yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID
> > would not be set.
> >
> > So at the start of a function that will need to insert undo we'd need to
> > pre-reserve the maximum number of buffers we could potentially
> > need. That reservation stage would
> >
> > a) pin the page with the current end of the undo
> > b) if needed pin the page of older undo that we need to update (e.g. to
> >update the next pointer)
> > c) perform clock sweep etc to acquire (find or create) enough clean buffers to
> >hold the maximum amount of undo needed. These buffers would be marked
> >as !BM_TAG_VALID | BUF_REFCOUNT_ONE.
> >
> > I assume that we'd make a) cheap by keeping it pinned for undo logs that
> > a backend is actively attached to. b) should only be needed once in a
> > transaction, so it's not too bad. c) we'd probably need to amortize
> > across multiple undo insertions, by keeping the unused buffers pinned
> > until the end of the transaction.
> >
> > I assume that having the infrastructure c) might also make some code
> > already in postgres easier. There are obviously some issues around
> > guaranteeing that the maximum number of such buffers isn't high.
> 
> 
> I have analyzed this further, I think there is a problem if the
> record/s will not fit into the current undo log and we will have to
> switch the log.  Because before knowing the actual record length we
> are not sure whether the undo log will switch or not and which undo
> log we will get.  And, without knowing the logno (rnode) how we are
> going to pin the buffers?  Am I missing something?

That's precisely why I was suggesting (at the start of the quoted block
above) to not associate the buffers with pages at that point. Instead
just have clean, pinned, *unassociated* buffers. Which can be
re-associated without any IO.


> Apart from this while analyzing the other code I have noticed that in
> the current PG code we have a few occurrences where we try to read a
> buffer with the buffer lock held.

> 1.
> In gistplacetopage
> {
> ...
> for (; ptr; ptr = ptr->next)
> {
> /* Allocate new page */
> ptr->buffer = gistNewBuffer(rel);
> GISTInitBuffer(ptr->buffer, (is_leaf) ? F_LEAF : 0);
> ptr->page = BufferGetPage(ptr->buffer);
> ptr->block.blkno = BufferGetBlockNumber(ptr->buffer);
> }
> 2. During page split we find new buffer while holding the lock on the
> current buffer.
> 
> That doesn't mean that we can't do better but I am just referring to
> the existing code where we already have such issues.

Those are pretty clearly edge-cases, whereas the undo case at hand is a
very common path. Note again that heapam.c goes to considerable trouble
to never do this for common cases.

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Andres Freund
Hi,

On 2019-08-20 21:02:18 +1200, Thomas Munro wrote:
> Aside from code changes based on review (and I have more to come of
> those), the attached experimental patchset (also at
> https://github.com/EnterpriseDB/zheap/tree/undo) has a new protocol
> that, I hope, allows for better concurrency, reliability and
> readability, and removes a bunch of TODO notes about questionable
> interlocking.  However, I'm not quite done figuring out if the bufmgr
> interaction is right and will be manageable on the undoaccess side, so
> I'm hoping to get some feedback, not asking for anyone to rebase on
> top of it yet.
> 
> Previously, there were two LWLocks used to make sure that all
> discarding was prevented while anyone was reading or writing data in
> any part of an undo log, and (probably more problematically) vice
> versa.  Here's a new approach that removes that blocking:
> 
> 1.  Anyone is allowed to try to read or write data at any UndoRecPtr
> that has been allocated, through the buffer pool (though you'd usually
> want to check it with UndoRecPtrIsDiscarded() first, and only rely on
> the system I'm describing to deal with races).
> 
> 2.  ReadBuffer() might return InvalidBuffer.  This can happen for a
> cache miss, if the smgrread implementation wants to indicate that the
> buffer has been discarded/truncated and that is expected (md.c won't
> ever do that, but undofile.c can).

Hm. This gives me a bit of a stomach ache. It somehow feels like a weird
form of signalling.  Can't quite put my finger on why it makes me feel
queasy.


> 3.  UndoLogDiscard() uses DiscardBuffer() to invalidate any currently
> unpinned buffers, and marks as BM_DISCARDED any that happen to be
> pinned right now, so they can't be immediately invalidated.  Such
> buffers are never written back and are eligible for reuse on the next
> clock sweep, even if they're written into by a backend that managed to
> do that when we were trying to discard.

Hm. When is it legitimate for a backend to write into such a buffer? I
guess that's about updating the previous transaction's next pointer? Or
progress info?


> 5.  Separating begin from discard allows the WAL logging for
> UndoLogDiscard() to do filesystem actions before logging, and other
> effects after logging, which have several nice properties if you work
> through the various crash scenarios.

Hm. ISTM we always need to log before doing some filesystem operation
(see also my recent complaint Robert and I are discussing at the bottom
of [1]). It's just that we can have a separate stage afterwards?

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoZc5JVYORsGYs8YnkSxUC%3DcLQF1Z%2BfcpH2TTKvqkS7MFg%40mail.gmail.com



> So now I'd like to get feedback on the sanity of this scheme.  I'm not
> saying it doesn't have bugs right now -- I've been trying to figure
> out good ways to test it and I'm not quite there yet -- but the
> concept.  One observation I have is that there were already code paths
> in undoaccess.c that can tolerate InvalidBuffer in recovery, due to
> the potentially different discard timing for DO vs REDO.  I think
> that's a point in favour of this scheme, but I can see that it's
> inconvenient to have to deal with InvalidBuffer whenever you read.

FWIW, I'm far from convinced that those are currently quite right. See
discussion pointed to above.



> I pulled in the latest code from undoprocessing as of today, and I
> might be a bit confused about "Defect and enhancement in multi-log
> support" some of which I have squashed into the  make undolog patch.
> BTW undoprocessing builds with initialized variable warnings in xact.c
> on clang today.

I've complained about the existence of that commit multiple times
now, so far without any comments.

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Andres Freund
Hi,

On 2019-08-20 09:08:29 -0400, Robert Haas wrote:
> On Sat, Aug 17, 2019 at 1:28 PM Andres Freund  wrote:
> > The primary one in the context here is that if we do *not* have to lock
> > the buffers all ahead of time, we can simplify the interface. We
> > certainly can't lock the buffers over IO (due to buffer reclaim) as
> > we're doing right now, so we'd need another phase, called by the "user"
> > during undo insertion. But if we do not need to lock the buffers before
> > the insertion over all starts, the inserting location doesn't have to
> > care.
> >
> > Secondarily, all the reasoning for needing to lock all buffers ahead of
> > time was imo fairly unconvincing. Following the "recipe" for WAL
> > insertions is a good idea when writing a new run-of-the-mill WAL
> > inserting location - but when writing a new fundamental facility, that
> > already needs to modify how WAL works, then I find that much less
> > convincing.
> 
> So, you seem to be talking about something here which is different
> than what I thought we were talking about.  One question is whether we
> need to lock all of the buffers "ahead of time," and I think the
> answer to that question is probably "no." Since nobody else can be
> writing to those buffers, and probably also nobody can be reading them
> except maybe for some debugging tool, it should be fine if we enter
> the critical section and then lock them at the point when we write the
> bytes. I mean, there shouldn't be any contention, and I don't see any
> other problems.

Right. As long as we are as restrictive about the number of buffers per
undo record, and number of records per WAL insertions, I don't see any
need to go further than that.


> > Well, in the version of code that I was reviewing here, I don't there is
> > such a limit (there is a limit for buffers per undo record, but no limit
> > on the number of records inserted together). I think Dilip added a limit
> > since.  And we have the issue of a lot of IO happening while holding
> > content locks on several pages.  So I don't think it's a straw man at
> > all.
> 
> Hmm, what do you mean by "a lot of IO happening while holding content
> locks on several pages"? We might XLogInsert() but there shouldn't be
> any buffer I/O going on at that point.

That's my primary complaint with how the code is structured right
now. Right now we potentially perform IO while holding exclusive content
locks, often multiple ones even. When acquiring target pages for undo,
we currently already hold the table page exclusively locked, and if
there's more than one buffer for undo, we'll also hold the previous
buffers locked. And acquiring a buffer will often have to write out a
dirty buffer to the OS, and a lot of times that will then also require
the kernel to flush data out.  That's imo an absolute no-go for the
general case.


> If there is, I think that should be redesigned.  We should collect
> buffer pins first, without locking.  Then lock.  Then write.  Or maybe
> lock-and-write, but only after everything's pinned.

Right. It's easy enough to do that for the locks on undo pages
themselves. The harder part is the content lock on the "table page" - we
don't accurately know how many undo buffers we will need, without
holding the table lock (or preventing modifications in some other
manner).

I tried to outline the problem and potential solutions in more detail
in:
https://www.postgresql.org/message-id/20190814065745.2faw3hirvfhbrdwe%40alap3.anarazel.de


> The fact of calling XLogInsert() while holding buffer locks is not
> great, but I don't think it's any worse here than in any other part of
> the system, because the undo buffers aren't going to be suffering
> concurrent access from any other backend, and because there shouldn't
> be more than a few of them.

Yea. That's obviously undesirable, but also fundamentally required at
least in the general case. And it's not at all specific to undo.

> [ WAL logging protocol ]

> I didn't intend to suggest that you don't want this to be documented.
> What I intended to suggest was that you seem to want to deviate from
> the documented rules, and it seems to me that we shouldn't do that
> unless we change the rules first, and I don't know what you think the
> rules should be or why those rules are safe.

IDK. We have at least five different places that at the very least bend
the rules - but with a comment explaining why it's safe in the specific
case. Personally I don't really think the generic guideline needs to
list every potential edge-case.


> > I think what primarily makes me concerned is that it's not clear to me
> > what guarantees that discard is the only reason for the block to
> > potentially be missing. I contrast to most other similar cases where WAL
> > replay simply re-creates the objects when trying to replay an action
> > affecting such an object, here we simply skip over the WAL logged
> > operation. So if e.g. the entire underlying UNDO file got lost, we
> > neither 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Mon, Aug 19, 2019 at 8:22 AM Amit Kapila  wrote:
> One point to remember in this regard is that we do need to modify the
> LSN in undo pages after writing WAL, so all the undo pages need to be
> locked by that time or we again need to take the lock on them.

Uh, but a big part of the point of setting the LSN on the pages is to
keep them from being written out before the corresponding WAL is
flushed to disk. If you released and reacquired the lock, the page
could be written out during the window in the middle.
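
(In simplified form, that is the interlock the buffer manager enforces
when it writes out a dirty shared buffer:)

recptr = BufferGetLSN(buf); /* the LSN stamped by PageSetLSN() under the lock */
XLogFlush(recptr);          /* WAL up to that LSN must reach disk first... */
/* ...and only then is the data page itself handed to smgrwrite(). */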

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Tue, Aug 20, 2019 at 2:42 AM Amit Kapila  wrote:
> > Well, my main point, which so far has largely been ignored, was that we
> > may not acquire page locks when we still need to search for victim
> > buffers later. If we don't need to lock the pages up-front, but only do
> > so once we're actually copying the records into the undo pages, then we
> > don't need a separate phase to acquire the locks. We can still hold all of
> > the page locks at the same time, as long as we just acquire them at the
> > later stage.
>
> Okay, IIUC, this means that we should have a separate phase where we
> call LockUndoBuffers (or something like that) before
> InsertPreparedUndo and after PrepareUndoInsert.  The LockUndoBuffers
> will lock all the buffers pinned during PrepareUndoInsert.  We can
> probably call LockUndoBuffers before entering the critical section to
> avoid any kind of failure in critical section.  If so, that sounds
> reasonable to me.

I'm kind of scratching my head here, because this is clearly different
than what Andres said in the quoted text to which you were replying.
He clearly implied that we should acquire the buffer locks within the
critical section during InsertPreparedUndo, and you responded by
proposing to do it outside the critical section in a separate step.
Regardless of which way is actually better, when somebody says "hey,
let's do A!" and you respond by saying "sounds good, I'll go implement
B!" that's not really helping us to get toward a solution.

FWIW, although I also thought of doing what you are describing here, I
think Andres's proposal is probably preferable, because it's simpler.
There's not really any reason why we can't take the buffer locks from
within the critical section, and that way callers don't have to deal
with the extra step.

> >  My secondary point was that *none* of this actually is
> > documented, even if it's entirely unobvious to the reader that the
> > relevant code can only run during WAL insertion, due to being pretty far
> > removed from that.
>
> I think this can be clearly mentioned in README or someplace else.

It also needs to be adequately commented in the files and functions involved.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Mon, Aug 19, 2019 at 5:16 PM Andres Freund  wrote:
> Well, my main point, which so far has largely been ignored, was that we
> may not acquire page locks when we still need to search for victim
> buffers later. If we don't need to lock the pages up-front, but only do
> so once we're actually copying the records into the undo pages, then we
> don't need a separate phase to acquire the locks. We can still hold all of
> the page locks at the same time, as long as we just acquire them at the
> later stage.

+1 for that approach.  I am in complete agreement.

> My secondary point was that *none* of this actually is
> documented, even if it's entirely unobvious to the reader that the
> relevant code can only run during WAL insertion, due to being pretty far
> removed from that.

+1 also for properly documenting stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Sat, Aug 17, 2019 at 1:28 PM Andres Freund  wrote:
> The primary one in the context here is that if we do *not* have to lock
> the buffers all ahead of time, we can simplify the interface. We
> certainly can't lock the buffers over IO (due to buffer reclaim) as
> we're doing right now, so we'd need another phase, called by the "user"
> during undo insertion. But if we do not need to lock the buffers before
> the insertion over all starts, the inserting location doesn't have to
> care.
>
> Secondarily, all the reasoning for needing to lock all buffers ahead of
> time was imo fairly unconvincing. Following the "recipe" for WAL
> insertions is a good idea when writing a new run-of-the-mill WAL
> inserting location - but when writing a new fundamental facility, that
> already needs to modify how WAL works, then I find that much less
> convincing.

So, you seem to be talking about something here which is different
than what I thought we were talking about.  One question is whether we
need to lock all of the buffers "ahead of time," and I think the
answer to that question is probably "no." Since nobody else can be
writing to those buffers, and probably also nobody can be reading them
except maybe for some debugging tool, it should be fine if we enter
the critical section and then lock them at the point when we write the
bytes. I mean, there shouldn't be any contention, and I don't see any
other problems.

The other question is whether we need to hold all of the buffer locks at
the same time, and that seems a lot more problematic to me.  It's hard
to say exactly whether this is unsafe, because it depends on exactly what
you think we're doing here, and I don't see that you've really spelled
that out.  The normal thing to do is call PageSetLSN() on every page
before releasing the buffer lock, and that means holding all the
buffer locks until after we've called XLogInsert(). Now, you could
argue that we should skip setting the page LSN because the data ahead
of the insertion pointer is effectively invisible anyway, but I bet
that causes problems with checksums, at least, since they rely on the
page LSN being accurate to know whether to emit WAL when a buffer is
written. You could argue that we could do the XLogInsert() first and
only after that lock and dirty the pages one by one, but I think that
might break checkpoint interlocking, since it would then be possible
for the checkpoint scan to pass over a buffer that does not appear to
need writing for the current checkpoint but later gets dirtied and
stamped with an LSN that would have caused it to be written had it
been there at the time the checkpoint scan reached it. I really can't
rule out the possibility that there's some way to make something in
this area work, but I don't know what it is, and I think it's a fairly
risky area to go tinkering.

> Well, in the version of code that I was reviewing here, I don't think there is
> such a limit (there is a limit for buffers per undo record, but no limit
> on the number of records inserted together). I think Dilip added a limit
> since.  And we have the issue of a lot of IO happening while holding
> content locks on several pages.  So I don't think it's a straw man at
> all.

Hmm, what do you mean by "a lot of IO happening while holding content
locks on several pages"? We might XLogInsert() but there shouldn't be
any buffer I/O going on at that point.  If there is, I think that
should be redesigned.  We should collect buffer pins first, without
locking.  Then lock.  Then write.  Or maybe lock-and-write, but only
after everything's pinned.  The fact of calling XLogInsert() while
holding buffer locks is not great, but I don't think it's any worse
here than in any other part of the system, because the undo buffers
aren't going to be suffering concurrent access from any other backend,
and because there shouldn't be more than a few of them.
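
In other words, something like this (just a sketch; UndoGetBufferSlot()
is a stand-in name for however the undo layer pins its pages):

/* Phase 1: collect pins only; this is where any I/O and buffer
 * eviction can happen. */
for (i = 0; i < nbufs; i++)
    bufs[i] = UndoGetBufferSlot(i);          /* stand-in: pin, no lock */

/* Phase 2: take the content locks; no I/O from here on. */
for (i = 0; i < nbufs; i++)
    LockBuffer(bufs[i], BUFFER_LOCK_EXCLUSIVE);

/* Phase 3: copy the undo bytes, WAL-log, set page LSNs, then unlock. */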

> > 2. The write-ahead logging protocol says that you're supposed to lock
> > all the buffers at once.  See src/backend/access/transam/README.  If
> > you want to go patch that file, then this patch can follow whatever
> > the locking rules in the patched version are.  But until then, the
> > patch should follow *the actual rules* not some other protocol based
> > on a hand-wavy explanation in an email someplace. Otherwise, you've
> > got the same sort of undocumented disaster-waiting-to-happen that you
> > keep complaining about in other parts of this patch.  We need fewer of
> > those, not more!
>
> But that's not what I'm asking for? I don't even know where you take
> from that I don't want this to be documented. I'm mainly asking for a
> comment explaining why the current behaviour is what it is. Because I
> don't think an *implicit* "normal WAL logging rules" is sufficient
> explanation, because all the locking here happens one or two layers away
> from the WAL logging site - so it's absolutely *NOT* obvious that that's
> the explanation. And I don't think any of the locking sites actually 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Dilip Kumar
On Wed, Aug 14, 2019 at 10:35 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote:
> > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:

> I don't think we can normally pin the undo buffers properly at that
> stage. Without knowing the correct contents of the table page - which we
> can't know without holding some form of lock preventing modifications -
> we can't know how big our undo records are going to be. And we can't
> just have buffers that don't exist on disk in shared memory, and we
> don't want to allocate undo that we then don't need. So I think what
> we'd have to do at that stage, is to "pre-allocate" buffers for the
> maximum amount of UNDO needed, but mark the associated bufferdesc as not
> yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID
> would not be set.
>
> So at the start of a function that will need to insert undo we'd need to
> pre-reserve the maximum number of buffers we could potentially
> need. That reservation stage would
>
> a) pin the page with the current end of the undo
> b) if needed pin the page of older undo that we need to update (e.g. to
>    update the next pointer)
> c) perform clock sweep etc to acquire (find or create) enough clean
>    buffers to hold the maximum amount of undo needed. These buffers
>    would be marked as !BM_TAG_VALID | BUF_REFCOUNT_ONE.
>
> I assume that we'd make a) cheap by keeping it pinned for undo logs that
> a backend is actively attached to. b) should only be needed once in a
> transaction, so it's not too bad. c) we'd probably need to amortize
> across multiple undo insertions, by keeping the unused buffers pinned
> until the end of the transaction.
>
> I assume that having the infrastructure c) might also make some code
> already in postgres easier. There are obviously some issues around
> guaranteeing that the maximum number of such buffers isn't high.


I have analyzed this further, and I think there is a problem if the
record(s) will not fit into the current undo log and we have to switch
to a new log.  Before knowing the actual record length we cannot be
sure whether the undo log will switch, nor which undo log we will get.
And without knowing the logno (rnode), how are we going to pin the
buffers?  Am I missing something?

Thomas, do you think we can get around this problem?

Apart from this, while analyzing the other code I have noticed that the
current PG code has a few occurrences where we try to read a buffer
while already holding a buffer lock.
1.
In gistplacetopage
{
...
for (; ptr; ptr = ptr->next)
{
/* Allocate new page */
ptr->buffer = gistNewBuffer(rel);
GISTInitBuffer(ptr->buffer, (is_leaf) ? F_LEAF : 0);
ptr->page = BufferGetPage(ptr->buffer);
ptr->block.blkno = BufferGetBlockNumber(ptr->buffer);
}
2. During a page split we find a new buffer while holding the lock on
the current buffer.

That doesn't mean that we can't do better but I am just referring to
the existing code where we already have such issues.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Amit Kapila
On Tue, Aug 20, 2019 at 2:46 AM Andres Freund  wrote:
>
> On 2019-08-19 17:52:24 +0530, Amit Kapila wrote:
> > On Sat, Aug 17, 2019 at 10:58 PM Andres Freund  wrote:
> > >
> > > > Well, I don't understand why you're on about this.  We've discussed it
> > > > a number of times but I'm still confused.
> > >
> > > There's two reasons here:
> > >
> > > The primary one in the context here is that if we do *not* have to lock
> > > the buffers all ahead of time, we can simplify the interface. We
> > > certainly can't lock the buffers over IO (due to buffer reclaim) as
> > > we're doing right now, so we'd need another phase, called by the "user"
> > > during undo insertion. But if we do not need to lock the buffers before
> > > the insertion over all starts, the inserting location doesn't have to
> > > care.
> > >
> > > Secondarily, all the reasoning for needing to lock all buffers ahead of
> > > time was imo fairly unconvincing. Following the "recipe" for WAL
> > > insertions is a good idea when writing a new run-of-the-mill WAL
> > > inserting location - but when writing a new fundamental facility, that
> > > already needs to modify how WAL works, then I find that much less
> > > convincing.
> > >
> >
> > One point to remember in this regard is that we do need to modify the
> > LSN in undo pages after writing WAL, so all the undo pages need to be
> > locked by that time or we again need to take the lock on them.
>
> Well, my main point, which so far has largely been ignored, was that we
> may not acquire page locks when we still need to search for victim
> buffers later. If we don't need to lock the pages up-front, but only do
> so once we're actually copying the records into the undo pages, then we
> don't need a separate phase to acquire the locks. We can still hold all of
> the page locks at the same time, as long as we just acquire them at the
> later stage.
>

Okay, IIUC, this means that we should have a separate phase where we
call LockUndoBuffers (or something like that) before
InsertPreparedUndo and after PrepareUndoInsert.  The LockUndoBuffers
will lock all the buffers pinned during PrepareUndoInsert.  We can
probably call LockUndoBuffers before entering the critical section to
avoid any kind of failure in critical section.  If so, that sounds
reasonable to me.
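
Roughly, the sequence I have in mind would look like this
(LockUndoBuffers and UnlockReleaseUndoBuffers are names I am making up
here; the rest are the existing steps):

PrepareUndoInsert(...);         /* pins the undo buffers we will need */
/* ... build the WAL record data ... */
LockUndoBuffers();              /* proposed step: lock everything pinned above */
START_CRIT_SECTION();
InsertPreparedUndo();           /* copy the records into the locked pages */
recptr = XLogInsert(...);
/* stamp the page LSNs on the undo buffers, locks still held */
END_CRIT_SECTION();
UnlockReleaseUndoBuffers();     /* or whatever the cleanup step ends up being */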

>  My secondary point was that *none* of this actually is
> documented, even if it's entirely unobvious to the reader that the
> relevant code can only run during WAL insertion, due to being pretty far
> removed from that.
>

I think this can be clearly mentioned in README or someplace else.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-19 Thread Andres Freund
On 2019-08-19 17:52:24 +0530, Amit Kapila wrote:
> On Sat, Aug 17, 2019 at 10:58 PM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2019-08-17 12:05:21 -0400, Robert Haas wrote:
> > > On Wed, Aug 14, 2019 at 12:39 PM Andres Freund  wrote:
> > > > > > Again, I think it's not ok to just assume you can lock an essentially
> > > > > > unbounded number of buffers. This seems almost guaranteed to result in
> > > > > > deadlocks. And there's limits on how many lwlocks one can hold etc.
> > > > >
> > > > > I think for controlling that we need to put a limit on max prepared
> > > > > undo?  I am not sure any other way of limiting the number of buffers
> > > > > because we must lock all the buffer in which we are going to insert
> > > > > the undo record under one WAL logged operation.
> > > >
> > > > I heard that a number of times. But I still don't know why that'd
> > > > actually be true. Why would it not be sufficient to just lock the buffer
> > > > currently being written to, rather than all buffers? It'd require a bit
> > > > of care updating the official current "logical end" of a log, but
> > > > otherwise ought to not be particularly hard? Only one backend can extend
> > > > the log after all, and until the log is externally visibly extended,
> > > > nobody can read or write those buffers, no?
> > >
> > > Well, I don't understand why you're on about this.  We've discussed it
> > > a number of times but I'm still confused.
> >
> > There's two reasons here:
> >
> > The primary one in the context here is that if we do *not* have to lock
> > the buffers all ahead of time, we can simplify the interface. We
> > certainly can't lock the buffers over IO (due to buffer reclaim) as
> > we're doing right now, so we'd need another phase, called by the "user"
> > during undo insertion. But if we do not need to lock the buffers before
> > the insertion over all starts, the inserting location doesn't have to
> > care.
> >
> > Secondarily, all the reasoning for needing to lock all buffers ahead of
> > time was imo fairly unconvincing. Following the "recipe" for WAL
> > insertions is a good idea when writing a new run-of-the-mill WAL
> > inserting location - but when writing a new fundamental facility, that
> > already needs to modify how WAL works, then I find that much less
> > convincing.
> >
> 
> One point to remember in this regard is that we do need to modify the
> LSN in undo pages after writing WAL, so all the undo pages need to be
> locked by that time or we again need to take the lock on them.

Well, my main point, which so far has largely been ignored, was that we
may not acquire page locks when we still need to search for victim
buffers later. If we don't need to lock the pages up-front, but only do
so once we're actually copying the records into the undo pages, then we
don't need a separate phase to acquire the locks. We can still hold all of
the page locks at the same time, as long as we just acquire them at the
later stage.  My secondary point was that *none* of this actually is
documented, even if it's entirely unobvious to the reader that the
relevant code can only run during WAL insertion, due to being pretty far
removed from that.
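
Concretely, something like this (sketch only):

/*
 * Lock each undo page only when we are about to copy bytes into it;
 * the locks are then all held together until the WAL record has been
 * inserted and the page LSNs have been set.
 */
for (i = 0; i < npages; i++)
{
    LockBuffer(undobuf[i], BUFFER_LOCK_EXCLUSIVE);  /* acquired at copy time */
    /* copy the next chunk of the packed record into this page */
}
/* all locks still held: XLogInsert(), PageSetLSN() on each page, unlock */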

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-08-19 Thread Amit Kapila
On Sat, Aug 17, 2019 at 10:58 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-17 12:05:21 -0400, Robert Haas wrote:
> > On Wed, Aug 14, 2019 at 12:39 PM Andres Freund  wrote:
> > > > > Again, I think it's not ok to just assume you can lock an essentially
> > > > > unbounded number of buffers. This seems almost guaranteed to result in
> > > > > deadlocks. And there's limits on how many lwlocks one can hold etc.
> > > >
> > > > I think for controlling that we need to put a limit on max prepared
> > > > undo?  I am not sure any other way of limiting the number of buffers
> > > > because we must lock all the buffer in which we are going to insert
> > > > the undo record under one WAL logged operation.
> > >
> > > I heard that a number of times. But I still don't know why that'd
> > > actually be true. Why would it not be sufficient to just lock the buffer
> > > currently being written to, rather than all buffers? It'd require a bit
> > > of care updating the official current "logical end" of a log, but
> > > otherwise ought to not be particularly hard? Only one backend can extend
> > > the log after all, and until the log is externally visibly extended,
> > > nobody can read or write those buffers, no?
> >
> > Well, I don't understand why you're on about this.  We've discussed it
> > a number of times but I'm still confused.
>
> There's two reasons here:
>
> The primary one in the context here is that if we do *not* have to lock
> the buffers all ahead of time, we can simplify the interface. We
> certainly can't lock the buffers over IO (due to buffer reclaim) as
> we're doing right now, so we'd need another phase, called by the "user"
> during undo insertion. But if we do not need to lock the buffers before
> the insertion over all starts, the inserting location doesn't have to
> care.
>
> Secondarily, all the reasoning for needing to lock all buffers ahead of
> time was imo fairly unconvincing. Following the "recipe" for WAL
> insertions is a good idea when writing a new run-of-the-mill WAL
> inserting location - but when writing a new fundamental facility, that
> already needs to modify how WAL works, then I find that much less
> convincing.
>

One point to remember in this regard is that we do need to modify the
LSN in undo pages after writing WAL, so all the undo pages need to be
locked by that time or we again need to take the lock on them.

>
> > 1. It's absolutely fine to just put a limit on this, because the
> > higher-level facilities that use this shouldn't be doing a single
> > WAL-logged operation that touches a zillion buffers.

Right, by default a WAL record can only cover 4 buffers.  If we need to
touch more buffers, then the caller needs to call
XLogEnsureRecordSpace.  So I agree with the point that, generally, it
should be a few buffers (2 or 3) of undo that need to be touched in a
single operation, and if there are more, either the callers need to
change or at the very least they need to be careful about it.
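
For reference, the shape of such a caller would be roughly the
following (RM_UNDOLOG_ID, the info flag and the buffer array are
placeholders on my side, not the actual patch code):

/*
 * XLogEnsureRecordSpace() takes the highest block ID that will be
 * registered, so nbuffers - 1 here, and must be called before
 * XLogBeginInsert().
 */
XLogEnsureRecordSpace(nbuffers - 1, 0);
XLogBeginInsert();
for (i = 0; i < nbuffers; i++)
    XLogRegisterBuffer(i, undobuf[i], REGBUF_STANDARD);
recptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDO_INSERT);   /* placeholders */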

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-19 Thread Dilip Kumar
On Wed, Aug 14, 2019 at 10:35 PM Andres Freund  wrote:
>

>
> > >   - When reading an undo record, the whole stage of UnpackUndoData()
> > > reading data into the UndoPackContext is omitted, reading directly
> > > into the UnpackedUndoRecord. That removes one further copy of the
> > > record format.
> > So we will read member by member into UnpackedUndoRecord?  Because in
> > the context we have at least a few headers packed, and we can memcpy one
> > header at a time, like UndoRecordHeader, UndoRecordBlock.
>
> Well, right now you then copy them again later, so not much is gained by
> that (although that later copy can happen without the content lock
> held). As I think I suggested before, I suspect that the best way would
> be to just memcpy() the data from the page(s) into an appropriately
> sized buffer with the content lock held, and then perform unpacking
> directly into UnpackedUndoRecord. Especially with the bulk API that will
> avoid having to do much work with locks held, and reduce memory usage by
> only unpacking the record(s) in a batch that are currently being looked
> at.
>
>
> > But that's just a few of them, so if we copy field by field into the
> > UnpackedUndoRecord then we can get rid of copying into the context and
> > then copying back to the UnpackedUndoRecord.  Is this what you have in
> > mind, or do you want to store these structures (UndoRecordHeader,
> > UndoRecordBlock) directly in UnpackedUndoRecord?
>
> I at the moment see no reason not to?

Currently, in UnpackedUndoRecord we directly store all the members that
are set by the caller.  We store pointers to some headers that are
allocated internally by the undo layer, and the caller need not worry
about setting them.  So now you are suggesting putting the other
headers into UnpackedUndoRecord as structures as well.  I don't have
much of a problem doing that, but I think Robert originally designed
the UnpackedUndoRecord structure this way, so it would be good if
Robert gives his opinion on this.
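
Just so we are talking about the same layout, I assume the suggestion
is something like the following (the field names are illustrative, not
the actual structure from the patch; UndoRecordHeader and
UndoRecordBlock are the existing header types):

typedef struct UnpackedUndoRecord
{
    UndoRecordHeader uur_hdr;    /* embedded value, filled in by the undo layer */
    UndoRecordBlock  uur_block;  /* embedded instead of a separately palloc'd pointer */
    /* ... the caller-supplied members stay as they are today ... */
} UnpackedUndoRecord;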

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-18 Thread Dilip Kumar
On Sat, Aug 17, 2019 at 9:35 PM Robert Haas  wrote:
>
> On Wed, Aug 14, 2019 at 12:39 PM Andres Freund  wrote:
> > > > Again, I think it's not ok to just assume you can lock an essentially
> > > > unbounded number of buffers. This seems almost guaranteed to result in
> > > > deadlocks. And there's limits on how many lwlocks one can hold etc.
> > >
> > > I think for controlling that we need to put a limit on max prepared
> > > undo?  I am not sure any other way of limiting the number of buffers
> > > because we must lock all the buffer in which we are going to insert
> > > the undo record under one WAL logged operation.
> >
> > I heard that a number of times. But I still don't know why that'd
> > actually be true. Why would it not be sufficient to just lock the buffer
> > currently being written to, rather than all buffers? It'd require a bit
> > of care updating the official current "logical end" of a log, but
> > otherwise ought to not be particularly hard? Only one backend can extend
> > the log after all, and until the log is externally visibly extended,
> > nobody can read or write those buffers, no?
>
> Well, I don't understand why you're on about this.  We've discussed it
> a number of times but I'm still confused.  I'll repeat my previous
> arguments on-list:
>
> 1. It's absolutely fine to just put a limit on this, because the
> higher-level facilities that use this shouldn't be doing a single
> WAL-logged operation that touches a zillion buffers.  We have been
> careful to avoid having WAL-logged operations touch an unbounded
> number of buffers in plenty of other places, like the btree code, and
> we are going to have to be similarly careful here for multiple
> reasons, deadlock avoidance being one.  So, saying, "hey, you're going
> to lock an unlimited number of buffers" is a straw man.  We aren't.
> We can't.

Right.  So basically, we need to put a limit on how many undo records
can be prepared under a single WAL-logged operation, and that will
internally limit the number of undo buffers.  Suppose we limit
max_prepared_undo to 2; then we need to lock at most 5 undo buffers.
We need to somehow deal with the multi-insert code in zheap, because
there, when inserting into a single page, we write one undo record per
range if the tuples we are inserting on a single page are interleaved.
But maybe we can handle that by just inserting one undo record which
can have multiple ranges.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-17 Thread Andres Freund
Hi,

On 2019-08-17 12:05:21 -0400, Robert Haas wrote:
> On Wed, Aug 14, 2019 at 12:39 PM Andres Freund  wrote:
> > > > Again, I think it's not ok to just assume you can lock an essentially
> > > > unbounded number of buffers. This seems almost guaranteed to result in
> > > > deadlocks. And there's limits on how many lwlocks one can hold etc.
> > >
> > > I think for controlling that we need to put a limit on max prepared
> > > undo?  I am not sure any other way of limiting the number of buffers
> > > because we must lock all the buffer in which we are going to insert
> > > the undo record under one WAL logged operation.
> >
> > I heard that a number of times. But I still don't know why that'd
> > actually be true. Why would it not be sufficient to just lock the buffer
> > currently being written to, rather than all buffers? It'd require a bit
> > of care updating the official current "logical end" of a log, but
> > otherwise ought to not be particularly hard? Only one backend can extend
> > the log after all, and until the log is externally visibly extended,
> > nobody can read or write those buffers, no?
>
> Well, I don't understand why you're on about this.  We've discussed it
> a number of times but I'm still confused.

There's two reasons here:

The primary one in the context here is that if we do *not* have to lock
the buffers all ahead of time, we can simplify the interface. We
certainly can't lock the buffers over IO (due to buffer reclaim) as
we're doing right now, so we'd need another phase, called by the "user"
during undo insertion. But if we do not need to lock the buffers before
the insertion over all starts, the inserting location doesn't have to
care.

Secondarily, all the reasoning for needing to lock all buffers ahead of
time was imo fairly unconvincing. Following the "recipe" for WAL
insertions is a good idea when writing a new run-of-the-mill WAL
inserting location - but when writing a new fundamental facility, that
already needs to modify how WAL works, then I find that much less
convincing.


> 1. It's absolutely fine to just put a limit on this, because the
> higher-level facilities that use this shouldn't be doing a single
> WAL-logged operation that touches a zillion buffers.  We have been
> careful to avoid having WAL-logged operations touch an unbounded
> number of buffers in plenty of other places, like the btree code, and
> we are going to have to be similarly careful here for multiple
> reasons, deadlock avoidance being one.  So, saying, "hey, you're going
> to lock an unlimited number of buffers" is a straw man.  We aren't.
> We can't.

Well, in the version of code that I was reviewing here, I don't think there is
such a limit (there is a limit for buffers per undo record, but no limit
on the number of records inserted together). I think Dilip added a limit
since.  And we have the issue of a lot of IO happening while holding
content locks on several pages.  So I don't think it's a straw man at
all.


> 2. The write-ahead logging protocol says that you're supposed to lock
> all the buffers at once.  See src/backend/access/transam/README.  If
> you want to go patch that file, then this patch can follow whatever
> the locking rules in the patched version are.  But until then, the
> patch should follow *the actual rules* not some other protocol based
> on a hand-wavy explanation in an email someplace. Otherwise, you've
> got the same sort of undocumented disaster-waiting-to-happen that you
> keep complaining about in other parts of this patch.  We need fewer of
> those, not more!

But that's not what I'm asking for? I don't even know where you take
from that I don't want this to be documented. I'm mainly asking for a
comment explaining why the current behaviour is what it is. Because I
don't think an *implicit* "normal WAL logging rules" is sufficient
explanation, because all the locking here happens one or two layers away
from the WAL logging site - so it's absolutely *NOT* obvious that that's
the explanation. And I don't think any of the locking sites actually has
comments explaining why the locks are acquired at that time (in fact,
IIRC until the review some even only mentioned pinning, not locking).


> > > Suppose you insert one record for the transaction which split in
> > > block1 and 2.  Now, before this block is actually going to the disk
> > > the transaction committed and become all visible the undo logs are
> > > discarded.  It's possible that block 1 is completely discarded but
> > > block 2 is not because it might have undo for the next transaction.
> > > Now, during recovery (FPW is off) if block 1 is missing but block 2 is
> > there so we need to skip inserting undo for block 1 as it does not
> > > exist.
> >
> > Hm. I'm quite doubtful this is a good idea. How will this not force us
> > to emit a lot more expensive durable operations while writing undo?
> > And doesn't this reduce error detection quite remarkably?
> >
> > Thomas, Robert?
> 
> I think you're going to 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-17 Thread Robert Haas
On Wed, Aug 14, 2019 at 12:39 PM Andres Freund  wrote:
> > > Again, I think it's not ok to just assume you can lock an essentially
> > > unbounded number of buffers. This seems almost guaranteed to result in
> > > deadlocks. And there's limits on how many lwlocks one can hold etc.
> >
> > I think for controlling that we need to put a limit on max prepared
> > undo?  I am not sure any other way of limiting the number of buffers
> > because we must lock all the buffer in which we are going to insert
> > the undo record under one WAL logged operation.
>
> I heard that a number of times. But I still don't know why that'd
> actually be true. Why would it not be sufficient to just lock the buffer
> currently being written to, rather than all buffers? It'd require a bit
> of care updating the official current "logical end" of a log, but
> otherwise ought to not be particularly hard? Only one backend can extend
> the log after all, and until the log is externally visibly extended,
> nobody can read or write those buffers, no?

Well, I don't understand why you're on about this.  We've discussed it
a number of times but I'm still confused.  I'll repeat my previous
arguments on-list:

1. It's absolutely fine to just put a limit on this, because the
higher-level facilities that use this shouldn't be doing a single
WAL-logged operation that touches a zillion buffers.  We have been
careful to avoid having WAL-logged operations touch an unbounded
number of buffers in plenty of other places, like the btree code, and
we are going to have to be similarly careful here for multiple
reasons, deadlock avoidance being one.  So, saying, "hey, you're going
to lock an unlimited number of buffers" is a straw man.  We aren't.
We can't.

2. The write-ahead logging protocol says that you're supposed to lock
all the buffers at once.  See src/backend/access/transam/README.  If
you want to go patch that file, then this patch can follow whatever
the locking rules in the patched version are.  But until then, the
patch should follow *the actual rules* not some other protocol based
on a hand-wavy explanation in an email someplace. Otherwise, you've
got the same sort of undocumented disaster-waiting-to-happen that you
keep complaining about in other parts of this patch.  We need fewer of
those, not more!

3. There is no reason to care about all of the buffers being locked at
once, because they are not unlimited in number (see point #1) and
nobody else is looking at them anyway (see the final sentence of what
I quoted above).

I think we are, or ought to be, talking about locking 2 (or maybe in
rare cases 3 or 4) undo buffers in connection with a single WAL
record.  If we're talking about more than that, then I think the
higher-level code needs to be changed.  If we're talking about that
many, then we don't need to be clever.  We can just do the standard
thing that the rest of the system does, and it will be fine just like
it is everywhere else.

> > Suppose you insert one record for the transaction which split in
> > block1 and 2.  Now, before this block is actually going to the disk
> > the transaction committed and become all visible the undo logs are
> > discarded.  It's possible that block 1 is completely discarded but
> > block 2 is not because it might have undo for the next transaction.
> > Now, during recovery (FPW is off) if block 1 is missing but block 2 is
> there so we need to skip inserting undo for block 1 as it does not
> > exist.
>
> Hm. I'm quite doubtful this is a good idea. How will this not force us
> to emit a lot more expensive durable operations while writing undo?
> And doesn't this reduce error detection quite remarkably?
>
> Thomas, Robert?

I think you're going to need to spell out your assumptions in order
for me to be able to comment intelligently.  This is another thing
that seems pretty normal to me.  Generally, WAL replay might need to
recreate objects whose creation is not separately WAL-logged, and it
might need to skip operations on objects that have been dropped later
in the WAL stream and thus don't exist any more. This seems like an
instance of the latter pattern.  There's no reason to try to put valid
data into pages that we know have been discarded, and both inserting
and discarding undo data need to be logged anyway.
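
In redo terms, the pattern I would expect is something like this (the
names are illustrative, not the actual patch code):

/* In the undo-insert redo routine: if a discard logged later in the
 * WAL stream already covers this location, there is nothing left to
 * replay into. */
if (UndoRecPtrIsDiscarded(xlrec->urecptr))      /* illustrative check */
    return;
/* otherwise read the undo buffer and re-apply the insert as usual */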

As a general point, I think the hope is that undo generated by
short-running transactions that commit and become all-visible quickly
will be cheap.  We should be able to dirty shared buffers but then
discard the data without ever writing it out to disk if we've logged a
discard of that data. Obviously, if you've got long-running
transactions that are either generating undo or holding old snapshots,
you're going to have to really flush the data, but we want to avoid
that when we can.  And the same is true on the standby: even if we
write the dirty data into shared buffers instead of skipping the write
altogether, we hope to be able to forget about those buffers when we
encounter a 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-16 Thread Dilip Kumar
On Fri, Aug 16, 2019 at 10:56 AM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-16 09:44:25 +0530, Dilip Kumar wrote:
> > On Wed, Aug 14, 2019 at 2:48 PM Dilip Kumar  wrote:
> > >
> > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:
> >
> > > >   I think that batch reading should just copy the underlying data into a
> > > >   char* buffer. Only the records that currently are being used by
> > > >   higher layers should get exploded into an unpacked record. That will
> > > >   reduce memory usage quite noticably (and I suspect it also drastically
> > > >   reduce the overhead due to a large context with a lot of small
> > > >   allocations that then get individually freed).
> > >
> > > Ok, I got your idea.  I will analyze it further and work on this if
> > > there is no problem.
> >
> > I think there is one problem: currently, while unpacking the undo
> > record, if the record is compressed (i.e. some of the fields do not
> > exist in the record) then we read those fields from the first record
> > on the page.  But if we just memcpy the undo pages into the buffers and
> > delay the unpacking until it's needed, it seems that we would need to
> > know the page boundary, and also the offset of the
> > first complete record on the page from where we can get that
> > information (which is currently in the undo page header).
>
> I don't understand why that's a problem?

Okay, I was assuming that we would only be copying the data part, not
the complete page including the page header.  If we copy the page
header as well, we might be able to unpack the compressed record too.

>
>
> > As of now, even if we leave this issue aside, I am not very clear what
> > benefit you are seeing in the approach you are describing compared to
> > the way I am doing it now?
> >
> > a) Is it the multiple palloc? If so then we can allocate memory at
> > once and flatten the undo records in that.  Earlier, I was doing that
> > but we need to align each unpacked undo record so that we can access
> > them directly and based on Robert's suggestion I have modified it to
> > multiple palloc.
>
> Part of it.
>
> > b) Is it the memory size problem that the unpacked undo record will take
> > more memory compared to the packed record?
>
> Part of it.
>
> > c) Do you think that we will not need to unpack all the records?  But,
> > I think eventually, at the higher level we will have to unpack all the
> > undo records ( I understand that it will be one at a time)
>
> Part of it. There's a *huge* difference between having a few hundred to a
> thousand unpacked records, each consisting of several independent
> allocations, in memory and having one large block containing all
> packed records in a batch, and a few allocations for the few unpacked
> records that need to exist.
>
> There's also d) we don't need separate tiny memory copies while holding
> buffer locks etc.

Yeah, that too.  Yet another problem is how we are going to process
those records.  For that we need to know all the undo record pointers
between start_urecptr and end_urecptr, right?  We just have the big
memory chunk, and we have no idea how many undo records are in it or
what their undo record pointers are.  And without knowing that
information, I don't see how we are going to sort them by block number.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-15 Thread Andres Freund
Hi,

On 2019-08-16 09:44:25 +0530, Dilip Kumar wrote:
> On Wed, Aug 14, 2019 at 2:48 PM Dilip Kumar  wrote:
> >
> > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:
> 
> > >   I think that batch reading should just copy the underlying data into a
> > >   char* buffer. Only the records that currently are being used by
> > >   higher layers should get exploded into an unpacked record. That will
> > >   reduce memory usage quite noticably (and I suspect it also drastically
> > >   reduce the overhead due to a large context with a lot of small
> > >   allocations that then get individually freed).
> >
> > Ok, I got your idea.  I will analyze it further and work on this if
> > there is no problem.
> 
> I think there is one problem: currently, while unpacking the undo
> record, if the record is compressed (i.e. some of the fields do not
> exist in the record) then we read those fields from the first record
> on the page.  But if we just memcpy the undo pages into the buffers and
> delay the unpacking until it's needed, it seems that we would need to
> know the page boundary, and also the offset of the
> first complete record on the page from where we can get that
> information (which is currently in the undo page header).

I don't understand why that's a problem?


> As of now, even if we leave this issue aside, I am not very clear what
> benefit you are seeing in the approach you are describing compared to
> the way I am doing it now?
> 
> a) Is it the multiple palloc? If so then we can allocate memory at
> once and flatten the undo records in that.  Earlier, I was doing that
> but we need to align each unpacked undo record so that we can access
> them directly and based on Robert's suggestion I have modified it to
> multiple palloc.

Part of it.

> b) Is it the memory size problem that the unpacked undo record will take
> more memory compared to the packed record?

Part of it.

> c) Do you think that we will not need to unpack all the records?  But,
> I think eventually, at the higher level we will have to unpack all the
> undo records ( I understand that it will be one at a time)

Part of it. There's a *huge* difference between having a few hundred to a
thousand unpacked records, each consisting of several independent
allocations, in memory and having one large block containing all
packed records in a batch, and a few allocations for the few unpacked
records that need to exist.

There's also d) we don't need separate tiny memory copies while holding
buffer locks etc.
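
To sketch the shape I mean (UndoRecordBatch is a name I am inventing
here just for illustration; UnpackedUndoRecord is the existing struct
from the patch):

/* One batch = one big memcpy of packed bytes taken under the content
 * lock, plus a single scratch record that is unpacked on demand for
 * whichever record the caller is currently looking at. */
typedef struct UndoRecordBatch
{
    char       *packed;         /* raw packed records, one allocation */
    Size        packed_len;
    int         nrecords;
    UnpackedUndoRecord scratch; /* reused for the record being examined */
} UndoRecordBatch;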

Greetings,

Andres Freund



