Re: POC: Cleaning up orphaned files using undo logs

2019-09-18 Thread Amit Kapila
On Mon, Sep 16, 2019 at 10:37 PM Robert Haas  wrote:
>
> It seems to me that zheap went wrong in ending up with separate undo
> types for in-place and non-in-place updates. Why not just have ONE
> kind of undo record that describes an update, and allow that update to
> have either one TID or two TIDs depending on which kind of update it
> is? There may be a reason, but I don't know what it is, unless it's
> just that the UnpackedUndoRecord idea that I invented wasn't flexible
> enough and nobody thought of generalizing it. Curious to hear your
> thoughts on this.
>

I think not only the TIDs, but we also need two uur_prevundo (previous
undo of the block) pointers.  This is required both when we have to
perform page-wise undo and when we traverse the chain during visibility
checks.  So, we can keep a combination of TID and prevundo.

The other thing is that during rollback, when we collect the undo for
each page, applying the action for this undo needs some thought.  For
example, we can't apply the undo to roll back both the Insert and the
non-inplace-update together, as the two are on different pages.  The
reason is that the page where the non-inplace-update happened might
have more undo records that need to be applied before this one.  We can
somehow make this undo available to apply while collecting undo for
both the heap pages.  I think there is also a need to identify which
TID is for the Insert and which is for the non-inplace-update part of
the operation, because we won't know that while applying undo unless we
check the state of the tuple on the page.

So, with this idea, we will make one undo record part of multiple
chains, which might need some consideration at different places, like
the one above.
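
For illustration only, here is a rough sketch of what such a combined
record might carry (field names are hypothetical, not the actual
zheap/UnpackedUndoRecord definitions):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t UndoRecPtr;		/* illustrative stand-in */

typedef struct ItemPointerData		/* simplified TID: block + offset */
{
	uint32_t	block;
	uint16_t	offset;
} ItemPointerData;

/*
 * Hypothetical layout for a single "update" undo record covering both
 * in-place and non-in-place updates.  A non-in-place update fills in both
 * TIDs and both block-level prevundo pointers, so the same record can be a
 * member of the undo chains of both affected pages; an in-place update
 * leaves the "new" fields unset.
 */
typedef struct UndoRecUpdate
{
	ItemPointerData old_tid;			/* tuple being replaced */
	ItemPointerData new_tid;			/* new location (non-in-place only) */
	UndoRecPtr		old_blk_prevundo;	/* previous undo for the old page */
	UndoRecPtr		new_blk_prevundo;	/* previous undo for the new page */
	bool			is_inplace;			/* true => the new_* fields are unused */
} UndoRecUpdate;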



--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-09-16 Thread Thomas Munro
On Tue, Sep 17, 2019 at 3:09 AM Kuntal Ghosh  wrote:
> On Mon, Sep 16, 2019 at 11:23 AM Thomas Munro  wrote:
> > Agreed.  I added a line to break out of that loop if !block->in_use.
> >
> I think we should skip the block if !block->in_use. Because, the undo
> buffer can be registered in a subsequent block as well. For different
> operations, we can use different block_id to register the undo buffer
> in the redo record.

Oops, right.  So it should just be added to the if condition.  Will do.

-- 
Thomas Munro
https://enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-09-16 Thread Robert Haas
On Mon, Sep 16, 2019 at 11:09 AM Kuntal Ghosh
 wrote:
> Okay. In that case, we need to rethink the cases for multi-inserts and
> non-inlace updates both of which currently inserts multiple undo
> record corresponding to a single WAL record. For multi-inserts, it can
> be solved easily by moving all the offset information in the payload.
> But, for non-inlace updates, we insert one undo record for the update
> and one for the insert. Wondering whether we've to insert two WAL
> records - one for update and one for the new insert.

No, I think the solution is to put the information about both halves
of the non-in-place update in the same undo record.  I think the only
reason why that's tricky is because we've got two block numbers and
two offsets, and the only reason that's a problem is because
UnpackedUndoRecord only has one field for each of those things, and
that goes right back to Heikki's comments about the format not being
flexible enough. If you see some other problem, it would be
interesting to know what it is.

One thing I've been thinking about is: suppose that you're following
the undo chain for a tuple and you come to a non-in-place update
record. Can you get confused? I don't think so, because you can
compare the TID for which you're following the chain to the new TID
and the old TID in the record and it should match one or the other but
not both. But I don't think you even really need to do that much: if
you started with a deleted item, the first thing in the undo chain has
to be a delete or non-in-place update that got rid of it. And if you
started with a non-deleted item, then the beginning of the undo chain,
if it hasn't been discarded yet, will be the insert or non-in-place
update that created it. There's nowhere else that you can hit a
non-in-place update, and no room (that I can see) for any ambiguity.
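
For what it's worth, that disambiguation check would be cheap anyway; a
toy sketch with invented types:

#include <stdbool.h>
#include <stdint.h>

typedef struct ItemPointerData		/* simplified TID: block + offset */
{
	uint32_t	block;
	uint16_t	offset;
} ItemPointerData;

static bool
tid_equal(ItemPointerData a, ItemPointerData b)
{
	return a.block == b.block && a.offset == b.offset;
}

/*
 * While following the undo chain for "chased", a non-in-place update
 * record should match exactly one of its two TIDs: the new TID if we
 * started from the tuple the update created, or the old TID if we started
 * from the tuple it replaced.
 */
static bool
chased_tuple_is_new_version(ItemPointerData chased,
							ItemPointerData rec_old_tid,
							ItemPointerData rec_new_tid)
{
	return tid_equal(chased, rec_new_tid) && !tid_equal(chased, rec_old_tid);
}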

It seems to me that zheap went wrong in ending up with separate undo
types for in-place and non-in-place updates. Why not just have ONE
kind of undo record that describes an update, and allow that update to
have either one TID or two TIDs depending on which kind of update it
is? There may be a reason, but I don't know what it is, unless it's
just that the UnpackedUndoRecord idea that I invented wasn't flexible
enough and nobody thought of generalizing it. Curious to hear your
thoughts on this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-09-16 Thread Kuntal Ghosh
Hello Thomas,

On Mon, Sep 16, 2019 at 11:23 AM Thomas Munro  wrote:
>
> > 1. In UndoLogAllocateInRecovery, when we find the current log number
> > from the list of registered blocks, we don't check whether the
> > block->in_use flag is true or not. In XLogResetInsertion, we just
> > reset in_use flag without reseting the blocks[]->rnode information.
> > So, if we don't check the in_use flag, it's possible that we'll
> > consult some block information from the previous WAL record. IMHO,
> > just adding an in_use check in UndoLogAllocateInRecovery will solve
> > the problem.
>
> Agreed.  I added a line to break out of that loop if !block->in_use.
>
I think we should skip the block if !block->in_use, because the undo
buffer can be registered in a subsequent block as well. For different
operations, we can use different block_ids to register the undo buffer
in the redo record.
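
To make the distinction concrete, here is a simplified sketch of the
loop shape (this is not the actual UndoLogAllocateInRecovery code or its
registered-blocks array, just an illustration of why we skip rather than
stop):

#include <stdbool.h>
#include <stdint.h>

typedef struct RegisteredBlock
{
	bool		in_use;		/* cleared by XLogResetInsertion() */
	bool		is_undo;	/* hypothetical: marks undo-log blocks */
	uint64_t	logno;		/* undo log number recorded for this block */
} RegisteredBlock;

/*
 * Blocks whose in_use flag has been cleared may still carry stale
 * rnode/logno data from the previous WAL record, so they must be skipped
 * rather than treated as the end of the array: the undo buffer may be
 * registered at a later block_id.
 */
static bool
find_undo_logno(const RegisteredBlock *blocks, int nblocks, uint64_t *logno)
{
	for (int i = 0; i < nblocks; i++)
	{
		if (!blocks[i].in_use)
			continue;			/* stale entry: skip it, don't stop here */
		if (blocks[i].is_undo)
		{
			*logno = blocks[i].logno;
			return true;
		}
	}
	return false;
}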

> BTW I am planning to simplify that code considerably, based on a plan
> to introduce a new rule: there can be only one undo record and
> therefore only one undo allocation per WAL record.
>
Okay. In that case, we need to rethink the cases for multi-inserts and
non-inplace updates, both of which currently insert multiple undo
records corresponding to a single WAL record. For multi-inserts, it can
be solved easily by moving all the offset information into the payload.
But, for non-inplace updates, we insert one undo record for the update
and one for the insert. Wondering whether we have to insert two WAL
records - one for the update and one for the new insert.

> > 2. A transaction, inserts one undo record and generated a WAL record
> > for the same, say at WAL location 0/2000A000. Next, the undo record
> > gets discarded and WAL is generated to update the meta.discard pointer
> > at location 0/2000B000  At the same time, an ongoing checkpoint with
> > checkpoint.redo at 0/2000 flushes the latest meta.discard pointer.
> > Now, the system crashes.
> > Now, the recovery starts from the location 0/2000. When the
> > recovery of 0/2000A000 happens, it sees the undo record that it's
> > about to insert, is already discarded as per meta.discard (flushed by
> > checkpoint). In this case, should we just skip inserting the undo
> > record?
>
> I see two options:
>
> 1.  We make it so that if you're allocating in recovery and discard >
> insert, we'll just set discard = insert so you can proceed.  The code
> in undofile_get_segment_file() already copes with missing files during
> recovery.
>
Interesting. This should work.

>
> > 3. Currently, we create a backup image of the unlogged part of the
> > undo log's metadata only when some backend allocates some space from
> > the undo log (in UndoLogAllocate). This helps us restore the unlogged
> > meta part after a checkpoint.
> > When we perform an undo action, we also update the undo action
> > progress and emit an WAL record. The same operation can performed by
> > the undo worker which doesn't allocate any space from the undo log.
> > So, if an undo worker emits an WAL record to update undo action
> > progress after a checkpoint, it'll not be able to WAL log the backup
> > image of the meta unlogged part. IMHO, this breaks the recovery logic
> > of unlogged part of undo meta.
>
> I thought that was OK because those undo data updates don't depend on
> the insert pointer.  But I see what you mean: the next modification of
> the page that DOES depend on the insert pointer might not log the
> meta-data if it's not the first WAL record to touch it after a
> checkpoint.  Rats.  I'll have to think about that some more.
Cool.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-09-15 Thread Thomas Munro
On Mon, Sep 16, 2019 at 5:27 AM Kuntal Ghosh  wrote:
> While testing zheap over undo apis, we've found the following
> issues/scenarios that might need some fixes/discussions:

Thanks!

> 1. In UndoLogAllocateInRecovery, when we find the current log number
> from the list of registered blocks, we don't check whether the
> block->in_use flag is true or not. In XLogResetInsertion, we just
> reset in_use flag without reseting the blocks[]->rnode information.
> So, if we don't check the in_use flag, it's possible that we'll
> consult some block information from the previous WAL record. IMHO,
> just adding an in_use check in UndoLogAllocateInRecovery will solve
> the problem.

Agreed.  I added a line to break out of that loop if !block->in_use.

BTW I am planning to simplify that code considerably, based on a plan
to introduce a new rule: there can be only one undo record and
therefore only one undo allocation per WAL record.

> 2. A transaction, inserts one undo record and generated a WAL record
> for the same, say at WAL location 0/2000A000. Next, the undo record
> gets discarded and WAL is generated to update the meta.discard pointer
> at location 0/2000B000  At the same time, an ongoing checkpoint with
> checkpoint.redo at 0/2000 flushes the latest meta.discard pointer.
> Now, the system crashes.
> Now, the recovery starts from the location 0/2000. When the
> recovery of 0/2000A000 happens, it sees the undo record that it's
> about to insert, is already discarded as per meta.discard (flushed by
> checkpoint). In this case, should we just skip inserting the undo
> record?

I see two options:

1.  We make it so that if you're allocating in recovery and discard >
insert, we'll just set discard = insert so you can proceed.  The code
in undofile_get_segment_file() already copes with missing files during
recovery.

2.  We skip the insert as you said.

I think option 1 is probably best, otherwise you have to cope with
failure to insert by skipping, as you said.
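
A minimal sketch of option 1, purely for illustration (these are not the
real undo log meta fields or function names):

#include <stdint.h>

typedef uint64_t UndoRecPtr;	/* illustrative stand-in */

/*
 * During recovery, a checkpoint may already have flushed a discard pointer
 * that is ahead of the location we are about to re-insert at.  Option 1:
 * pull discard back to the insert location so the allocation can proceed;
 * the data will be discarded again when the later discard WAL record is
 * replayed, and missing segment files are already tolerated in recovery.
 */
static void
undo_adjust_discard_in_recovery(UndoRecPtr *discard, UndoRecPtr insert)
{
	if (*discard > insert)
		*discard = insert;
}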

> 3. Currently, we create a backup image of the unlogged part of the
> undo log's metadata only when some backend allocates some space from
> the undo log (in UndoLogAllocate). This helps us restore the unlogged
> meta part after a checkpoint.
> When we perform an undo action, we also update the undo action
> progress and emit an WAL record. The same operation can performed by
> the undo worker which doesn't allocate any space from the undo log.
> So, if an undo worker emits an WAL record to update undo action
> progress after a checkpoint, it'll not be able to WAL log the backup
> image of the meta unlogged part. IMHO, this breaks the recovery logic
> of unlogged part of undo meta.

I thought that was OK because those undo data updates don't depend on
the insert pointer.  But I see what you mean: the next modification of
the page that DOES depend on the insert pointer might not log the
meta-data if it's not the first WAL record to touch it after a
checkpoint.  Rats.  I'll have to think about that some more.

-- 
Thomas Munro
https://enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-09-15 Thread Kuntal Ghosh
Hello Thomas,

While testing zheap over undo apis, we've found the following
issues/scenarios that might need some fixes/discussions:

1. In UndoLogAllocateInRecovery, when we find the current log number
from the list of registered blocks, we don't check whether the
block->in_use flag is true or not. In XLogResetInsertion, we just
reset the in_use flag without resetting the blocks[]->rnode information.
So, if we don't check the in_use flag, it's possible that we'll
consult some block information from the previous WAL record. IMHO,
just adding an in_use check in UndoLogAllocateInRecovery will solve
the problem.

2. A transaction inserts one undo record and generates a WAL record
for the same, say at WAL location 0/2000A000. Next, the undo record
gets discarded and WAL is generated to update the meta.discard pointer
at location 0/2000B000.  At the same time, an ongoing checkpoint with
checkpoint.redo at 0/2000 flushes the latest meta.discard pointer.
Now, the system crashes.
The recovery then starts from location 0/2000. When the
recovery of 0/2000A000 happens, it sees that the undo record it's
about to insert is already discarded as per meta.discard (flushed by
the checkpoint). In this case, should we just skip inserting the undo
record?

3. Currently, we create a backup image of the unlogged part of the
undo log's metadata only when some backend allocates some space from
the undo log (in UndoLogAllocate). This helps us restore the unlogged
meta part after a checkpoint.
When we perform an undo action, we also update the undo action
progress and emit a WAL record. The same operation can be performed by
the undo worker, which doesn't allocate any space from the undo log.
So, if an undo worker emits a WAL record to update the undo action
progress after a checkpoint, it won't be able to WAL-log the backup
image of the unlogged part of the metadata. IMHO, this breaks the
recovery logic for the unlogged part of the undo metadata.

Thoughts?

On Mon, Sep 2, 2019 at 9:47 AM Thomas Munro  wrote:
>
> On Fri, Aug 30, 2019 at 8:27 PM Kuntal Ghosh  
> wrote:
> > I'm getting the following assert failure while performing the recovery
> > with the same.
> > "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL",
> > File: "undolog.c", Line: 997)"
> >
> > I found that we don't emit an WAL record when we update the
> > slot->meta.status as UNDO_LOG_STATUS_FULL. If we don't that, after
> > crash recovery, some new transaction may use that undo log which is
> > wrong, IMHO. Am I missing something?
>
> Thanks, right, that status logging is wrong, will fix in next version.
>
> --
> Thomas Munro
> https://enterprisedb.com



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-09-11 Thread Alvaro Herrera
On 2019-Sep-06, vignesh C wrote:

> Hi Thomas,
> 
> While testing one of the recovery scenarios I found one issue:
> FailedAssertion("!(logno == context->recovery_logno)

I marked this patch Waiting on Author.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: POC: Cleaning up orphaned files using undo logs

2019-09-05 Thread vignesh C
Hi Thomas,

While testing one of the recovery scenarios I found one issue:
FailedAssertion("!(logno == context->recovery_logno)

The details are mentioned below:
The context's try_location was not updated in
UndoLogAllocateInRecovery; in PrepareUndoInsert the try_location was
only advanced by the undo record size.
In the subsequent UndoLogAllocateInRecovery, because try_location was
never initialized but only advanced by the size, the logno will not
match whenever recovery_logno is non-zero, and the assert fails.

Fixed by setting try_location in UndoLogAllocateInRecovery, similar to
the try_location setting in UndoLogAllocate.
A patch for the same is attached.

Please have a look and add the changes in one of the upcoming versions.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


On Mon, Sep 2, 2019 at 9:53 AM Thomas Munro  wrote:
>
> On Fri, Aug 30, 2019 at 8:27 PM Kuntal Ghosh  
> wrote:
> > I'm getting the following assert failure while performing the recovery
> > with the same.
> > "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL",
> > File: "undolog.c", Line: 997)"
> >
> > I found that we don't emit an WAL record when we update the
> > slot->meta.status as UNDO_LOG_STATUS_FULL. If we don't that, after
> > crash recovery, some new transaction may use that undo log which is
> > wrong, IMHO. Am I missing something?
>
> Thanks, right, that status logging is wrong, will fix in next version.
>
> --
> Thomas Munro
> https://enterprisedb.com
>
>
From e69a9eea71562434cc0512871a9b6d7424ce93ec Mon Sep 17 00:00:00 2001
From: Vigneshwaran c 
Date: Fri, 30 Aug 2019 11:45:03 +0530
Subject: [PATCH] FailedAssertion("!(logno == context->recovery_logno) fix

try_location was not updated in UndoLogAllocateInRecovery, in
PrepareUndoInsert the try_location was updated with the undo record size.
In the subsequent UndoLogAllocateInRecovery as the value for try_location
was not initialized but only updated with the size the logno will always
not match if the recovery_logno is non zero and the assert fails. Fixed
by setting the try_location in UndoLogAllocateInRecovery, similar to
try_location setting in UndoLogAllocate.

Patch by Vigneshwaran C, reviewed by Dilip Kumar.
---
 src/backend/access/undo/undolog.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/src/backend/access/undo/undolog.c b/src/backend/access/undo/undolog.c
index 073221c..9acd570 100644
--- a/src/backend/access/undo/undolog.c
+++ b/src/backend/access/undo/undolog.c
@@ -960,6 +960,14 @@ UndoLogAllocateInRecovery(UndoLogAllocContext *context,
 			*need_xact_header =
 				context->try_location == InvalidUndoRecPtr &&
 				slot->meta.unlogged.insert == slot->meta.unlogged.this_xact_start;
+
+			/*
+			 * If no try_location was passed in, or if we switched logs, then we'll
+			 * return the current insertion point.
+			 */
+			if (context->try_location == InvalidUndoRecPtr)
+				context->try_location = MakeUndoRecPtr(slot->logno, slot->meta.unlogged.insert);
+
 			*last_xact_start = slot->meta.unlogged.last_xact_start;
 			context->recovery_logno = slot->logno;
 
-- 
1.8.3.1



Re: POC: Cleaning up orphaned files using undo logs

2019-09-01 Thread Thomas Munro
On Fri, Aug 30, 2019 at 8:27 PM Kuntal Ghosh  wrote:
> I'm getting the following assert failure while performing the recovery
> with the same.
> "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL",
> File: "undolog.c", Line: 997)"
>
> I found that we don't emit an WAL record when we update the
> slot->meta.status as UNDO_LOG_STATUS_FULL. If we don't that, after
> crash recovery, some new transaction may use that undo log which is
> wrong, IMHO. Am I missing something?

Thanks, right, that status logging is wrong, will fix in next version.

-- 
Thomas Munro
https://enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-30 Thread Kuntal Ghosh
Hello Thomas,

I was doing some testing for the scenario where the undo written by a
transaction overflows to multiple undo logs. For that I've modified
the following macro:
#define UndoLogMaxSize (1024 * 1024) /* 1MB undo log size */
(I should have used the provided pg_force_switch_undo though..)

I'm getting the following assert failure while performing the recovery
with the same.
"TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL",
File: "undolog.c", Line: 997)"

I found that we don't emit a WAL record when we update
slot->meta.status to UNDO_LOG_STATUS_FULL. If we don't do that, after
crash recovery, some new transaction may use that undo log, which is
wrong, IMHO. Am I missing something?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-24 Thread Robert Haas
On Wed, Aug 21, 2019 at 4:34 PM Robert Haas  wrote:
> ReleaseResourcesAndProcessUndo() is only supposed to be called after
> AbortTransaction(), and the additional steps it performs --
> AtCleanup_Portals() and AtEOXact_Snapshot() or alternatively
> AtSubCleanup_Portals -- are taken from Cleanup(Sub)Transaction.
> That's not crazy; the other steps in Cleanup(Sub)Transaction() look
> like stuff that's intended to be performed when we're totally done
> with this TransactionState stack entry, whereas these things are
> slightly higher-level cleanups that might even block undo (e.g.
> undropped portal prevents orphaned file cleanup).  Granted, there are
> no comments explaining why those particular cleanup steps are
> performed here, and it's possible some other approach is better, but I
> think perhaps it's not quite as flagrantly broken as you think.

Andres smacked me with the clue-bat off-list and now I understand why
this is broken: there's no guarantee that running the various
AtEOXact/AtCleanup functions actually puts the transaction back into a
good state.  They *might* return the system to the state that it was
in immediately following StartTransaction(), but they also might not.
Moreover, ProcessUndoRequestForEachLogCat uses PG_TRY/PG_CATCH and
then discards the error without performing *any cleanup at all* but
then goes on and tries to do undo for other undo log categories
anyway.  That is totally unsafe.

I think that there should only be one chance to perform undo actions,
and as I said or at least alluded to before, if that throws an error,
it shouldn't be caught by a TRY/CATCH block but should be handled by
the state machine in xact.c.  If we're not going to make the state
machine handle these conditions, the addition of
TRANS_UNDO/TBLOCK_UNDO/TBLOCK_SUBUNDO is really missing the boat.  I'm
still not quite sure of the exact sequence of steps: we clearly need
AtCleanup_Portals() and a bunch of the other stuff that happens during
CleanupTransaction(), ideally including the freeing of memory, to
happen before we try undo. But I don't have a great feeling for how to
make that happen, and it seems more desirable for undo to begin as
soon as the transaction fails rather than waiting until
Cleanup(Sub)Transaction() time. I think some more research is needed
here.

> I am also not convinced that semi-critical sections are a bad idea,

Regarding this, after further thought and discussion with Andres,
there are two cases here that call for somewhat different handling:
temporary undo, and subtransaction abort.

In the case of a subtransaction abort, we can't proceed with the
toplevel transaction unless we succeed in applying the
subtransaction's undo, but that does not require killing off the
backend.  It might be a better idea to just fail the containing
subtransaction with the error that occurred during undo apply; if
there are multiple levels of subtransactions present then we might
fail in the same way several times, but eventually we'll fail at the
top level, forcibly kick the undo into the background, and the session
can continue.  The background workers will, hopefully, eventually
recover the situation.  Even if they can't, because, say, the failure
is due to a bug or whatever, killing off the session doesn't really
help.

In the case of temporary undo, killing the session is a much more
appealing approach. If we don't, how will that undo ever get
processed?  We could retry at some point (like every time we return to
the toplevel command prompt?) or just ignore the fact that we didn't
manage to perform those undo actions and leave that undo there like an
albatross, accumulating more and more undo behind it until the session
exits or the disk fills up.  The latter strikes me as a terrible user
experience, especially because for wire protocol reasons we'd have to
swallow the errors or at best convert them to warnings, but YMMV.

Anyway, probably these cases should not be handled exactly the same
way, but exactly what to do probably depends on the previous question:
how exactly does the integration into xact.c's state machine work,
anyway?

Meanwhile, I've been working up a prototype of how the undorequest.c
stuff I sent previously could be integrated with xact.c.  In working
on that, I've realized that there seem to be two different tasks.  One
is tracking the information that we'll need to have available to
perform undo actions.  The other is the actual transaction state
manipulation: when and how do we abort transactions, cleanup
transactions, start new transactions specifically for undo?  How are
transactions performing undo specially marked, if at all?  The
attached patch includes a new module, undostate.c/h, which tries to
handle the first of those things; this is just a prototype, and is
missing some pieces marked with XXX, but I think it's probably the
right general direction.  It will still need to be plugged into a
framework for launching undo apply background workers (which might
require 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-24 Thread Robert Haas
On Fri, Aug 23, 2019 at 2:04 AM Thomas Munro  wrote:
> 2.  Strict self-update-only:  We could update it as part of
> transaction cleanup.  That is, when you commit or abort, probably at
> some time when your transaction is still advertised as running, you go
> and update your own transaction header with your the size.  If you
> never reach that stage, I think you can fix it during crash recovery,
> during the startup scan that feeds the rollback request queues.  That
> is, if you encounter a transaction header with length == 0, it must be
> the final one and its length is therefore known and can be updated,
> before you allow new transactions to begin.  There are some
> complications like backends that exit without crashing, which I
> haven't thought about.  As Amit just pointed out to me, that means
> that the update is not folded into the same buffer access as the next
> transaction, but perhaps you can mitigate that by not updating your
> header if the next header will be on the same page -- the next
> transaction can do it safely then (this page with the insert pointer
> on it can't be discarded).  As Dilip just pointed out to me, it means
> that you always do an update that you might not never need to do if
> the transaction is discarded, to which I have no answer.  Bleugh.

Andres and I have spent a lot of time on the phone over the last
couple of days and I think we both kind of like this option.  I don't
think that the costs are likely to be very significant: you're talking
about pinning, locking, dirtying, unlocking, and unpinning one buffer
at commit time, or maybe two if your transaction touched both logged
and unlogged tables.  If the transaction is short enough for that
overhead to matter, that buffer is probably already in shared_buffers,
is probably already dirty, and is probably already in your CPU's
cache. So I think the overhead will turn out to be low.

Moreover, I doubt that we want to separately discard every transaction
anyway.  If you have very light-weight transactions, you don't want to
add an extra WAL record per transaction.  Increasing the number
of separate WAL records per transaction from say 5 to 6 would be a
significant additional cost.  You probably want to perform a discard,
say, every 5 seconds or sooner if you can discard at least 64kB of
undo, or something of that sort.  So we're not going to save the
overhead of updating the previous transaction header often enough to
make much difference unless we're discarding so aggressively that we
incur a much larger overhead elsewhere.  I think.

I am a little concerned about the backends that exit without crashing.
Andres seems to want to treat that case as a bug to be fixed, but I
doubt whether that's going to be practical.   We're really only
talking about extreme corner cases here, because
before_shmem_exit(ShutdownPostgres, 0) means we'll
AbortOutOfAnyTransaction() which should RecordTransactionAbort(). Only
if we fail in the AbortTransaction() prior to reaching
RecordTransactionAbort() will we manage to reach the later cleanup
stages without having written an abort record.  I haven't scrutinized
that code lately to see exactly how things can go wrong there, but
there shouldn't be a whole lot. However, there are probably a few
things, like maybe a really poorly-timed malloc() failure.

A zero-order solution would be to install a deadman switch. At
on_shmem_exit time, you must detach from any undo log to which you are
connected, so that somebody else can attach to it later.  We can stick
in a cross-check there that you haven't written any undo bytes to that
log and PANIC if you have.  Then the system must be water-tight.
Perhaps it's possible to do better: if we could identify the cases in
which such logic gets reached, we could try to guarantee that WAL is
written and the undo log safely detached before we get there.  But at
the very least we can promote ERROR/FATAL to PANIC in the relevant
case.

A better solution would be to detect the problem and make sure we
recover from it before reusing the undo log.  Suppose each undo log
has three states: (1) nobody's attached, (2) somebody's attached, and
(3) nobody's attached but the last record might need a fixup.  When we
start up, all undo logs are in state 3, and the discard worker runs
around and puts them into state 1.  Subsequently, they alternate
between states 1 and 2 for as long as the system remains up.  But if
as an exceptional case we reach on_shmem_exit without having detached
the undo log, because of cascading failures, then we put the undo log
in state 3.  The discard worker already knows how to move undo logs
from state 3 to state 1, and it can do the same thing here.  Until it
does nobody else can reuse that undo log.
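
For illustration, the lifecycle could look something like this (names
invented for the sketch, not the actual undo log status values):

#include <stdbool.h>

/*
 * Hypothetical per-undo-log attachment state.  At startup every log is in
 * NEEDS_FIXUP; the discard worker repairs the last record if necessary and
 * moves the log to FREE.  Normal operation then alternates FREE <->
 * ATTACHED.  If a backend reaches on_shmem_exit without having detached
 * cleanly, the log is parked in NEEDS_FIXUP instead of FREE, so nobody can
 * reuse it until the discard worker has run the same fixup it runs after a
 * crash.
 */
typedef enum UndoLogAttachState
{
	UNDO_LOG_FREE,				/* (1) nobody attached, safe to reuse */
	UNDO_LOG_ATTACHED,			/* (2) a backend is attached */
	UNDO_LOG_NEEDS_FIXUP		/* (3) nobody attached, last record may need fixup */
} UndoLogAttachState;

/* Only NEEDS_FIXUP logs require repair before they can be handed out again. */
static bool
undo_log_needs_repair(UndoLogAttachState state)
{
	return state == UNDO_LOG_NEEDS_FIXUP;
}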

I might be missing something, but I think that would nail this down
pretty tightly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-23 Thread Thomas Munro
On Wed, Aug 21, 2019 at 4:44 AM Andres Freund  wrote:
> On 2019-08-20 21:02:18 +1200, Thomas Munro wrote:
> > 1.  Anyone is allowed to try to read or write data at any UndoRecPtr
> > that has been allocated, through the buffer pool (though you'd usually
> > want to check it with UndoRecPtrIsDiscarded() first, and only rely on
> > the system I'm describing to deal with races).
> >
> > 2.  ReadBuffer() might return InvalidBuffer.  This can happen for a
> > cache miss, if the smgrread implementation wants to indicate that the
> > buffer has been discarded/truncated and that is expected (md.c won't
> > ever do that, but undofile.c can).
>
> Hm. This gives me a bit of a stomach ache. It somehow feels like a weird
> form of signalling.  Can't quite put my finger on why it makes me feel
> queasy.

Well, if we're going to allow concurrent access and discarding, there
has to be some part of the system where you can discover that the
thing you wanted is gone.  What's wrong with here?

Stepping back a bit... why do we have a new concept here?  The reason
we don't have anything like this for relations currently is that we
don't have live references to blocks that are expected to be
concurrently truncated, due to heavyweight locking.  But the whole
purpose of the undo system is to allow cheap truncation/discarding of
data that you *do* have live references to, and furthermore the
discarding is expected to be frequent.  The present experiment is
about trying to do so without throwing our hands up and using a
pessimistic lock.

At one point Robert and I discussed some kind of scheme where you'd
register your interest in a range of the log before you begin reading
(some kind of range locks or hazard pointers), so that you would block
discarding in that range, but the system would still allow someone to
read in the middle of the log while the discard worker concurrently
discards non-overlapping data at the end.  But I kept returning to the
idea that the buffer pool already has block-level range locking of
various kinds.  You can register your interest in a range by pinning
the buffers.  That's when you'll find out if the range is already
gone.  We could add an extra layer of range locking around that, but
it wouldn't be any better, it'd just thrash your bus a bit more, and
require more complexity in the discard worker (it has to defer
discarding a bit, and either block or go away and come back later).

> > 3.  UndoLogDiscard() uses DiscardBuffer() to invalidate any currently
> > unpinned buffers, and marks as BM_DISCARDED any that happen to be
> > pinned right now, so they can't be immediately invalidated.  Such
> > buffers are never written back and are eligible for reuse on the next
> > clock sweep, even if they're written into by a backend that managed to
> > do that when we were trying to discard.
>
> Hm. When is it legitimate for a backend to write into such a buffer? I
> guess that's about updating the previous transaction's next pointer? Or
> progress info?

Yes, previous transaction header's next pointer, and progress counter
during rollback.  We're mostly interested in the next pointer here,
because the progress counter would normally not be updated at a
time when the page might be concurrently discarded.  The exception to
that is a superuser running CALL pg_force_discard_undo() (a
data-eating operation designed to escape a system that can't
successfully roll back and gets stuck, blowing away
not-yet-rolled-back undo records).

Here are some other ideas about how to avoid conflicts between
discarding and transaction header update:

1.  Lossy self-update-only: You could say that transactions are only
allowed to write to their own transaction header, and then have them
periodically update their own length in their own transaction header,
and then teach the discard worker that the length information is only
a starting point for a linear search for the next transaction based on
page header information.  That removes complexity for writers, but
adds complexity and IO and CPU to the discard worker.  Bleugh.

2.  Strict self-update-only:  We could update it as part of
transaction cleanup.  That is, when you commit or abort, probably at
some time when your transaction is still advertised as running, you go
and update your own transaction header with the size.  If you
never reach that stage, I think you can fix it during crash recovery,
during the startup scan that feeds the rollback request queues.  That
is, if you encounter a transaction header with length == 0, it must be
the final one and its length is therefore known and can be updated,
before you allow new transactions to begin.  There are some
complications like backends that exit without crashing, which I
haven't thought about.  As Amit just pointed out to me, that means
that the update is not folded into the same buffer access as the next
transaction, but perhaps you can mitigate that by not updating your
header if the next header will be on the same 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 9:55 PM Andres Freund  wrote:
>
> Hi
>
> On August 22, 2019 9:14:10 AM PDT, Dilip Kumar  wrote:
> > But, those requests will
> >ultimately be used for collecting the record by the bulk fetch.  So if
> >we are planning to change the bulk fetch to read forward then maybe we
> >don't need the valid last undo record pointer because that we will
> >anyway get while processing forward.  So just knowing the end of the
> >transaction is sufficient for us to know where to stop.  I am not sure
> >if this solution has any problem.  Probably  I should think again in
> >the morning when my mind is well-rested.
>
> I don't think we can easily do so for bulk apply without incurring 
> significant overhead. It's pretty cheap to read in forward order and then 
> process backwards on a page level - but for an entire transactions undo the 
> story is different. We can't necessarily keep all of it in memory, so we'd 
> have to read the undo twice to find the end. Right?
>
I was not talking about the entire transaction; I was also talking
about the page level, as you suggested.  I was just saying that we may
not need the start position of the last undo record of the transaction
for registering the rollback request (which we currently do).
However, we need to know the end of the transaction to know the last
page from which we need to start reading forward.

Let me explain with an example

Transaction1
first, undo start at 10
first, undo end at 100
second, undo start at 101
second, undo end at 200
..
last, undo start at 1000
last, undo end at  1100

Transaction2
first, undo start at 1101
first, undo end at 1200
second, undo start at 1201
second, undo end at 1300

Suppose we want to register the request for Transaction1.  Then
currently we need to know the start undo record pointer (10 as per
the above example) and the last undo record pointer (1000).  But we
only know the start undo record pointer (10) and the start of the next
transaction (1101).  So for calculating the start of the last record,
we compute 1101 - 101 (the length of the last record is stored in the
2 bytes before 1101).

So, now I am saying that maybe we don't need to compute the start of
the last undo record (1000), because it's enough to know the end of the
last undo record (1100): whichever page the last undo record ends on,
we can start from that page and read forward on that page.

* All numbers I used in the above example can be considered as undo
record pointers.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 7:34 PM Robert Haas  wrote:
>
> On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar  wrote:
> > Yeah, we can handle the bulk fetch as you suggested and it will make
> > it a lot easier.  But, currently while registering the undo request
> > (especially during the first pass) we need to compute the from_urecptr
> > and the to_urecptr. And,  for computing the from_urecptr,  we have the
> > end location of the transaction because we have the uur_next in the
> > transaction header and that will tell us the end of our transaction
> > but we still don't know the undo record pointer of the last record of
> > the transaction.  As of know, we read previous 2 bytes from the end of
> > the transaction to know the length of the last record and from there
> > we can compute the undo record pointer of the last record and that is
> > our from_urecptr.
>
> I don't understand this.  If we're registering an undo request at "do"
> time, we don't need to compute the starting location; we can just
> remember the UndoRecPtr of the first record we inserted.  If we're
> reregistering an undo request after a restart, we can (and, I think,
> should) work forward from the discard location rather than backward
> from the insert location.

Right, we work forward from the discard location.  So, starting from
the discard location, while traversing the undo log, whenever we
encounter an aborted transaction we need to register its rollback
request.  And for doing that we need 1) the start location of the first
undo record and 2) the start location of the last undo record (the last
undo record pointer).

We already have 1), but we have to compute 2).  For doing that, if we
unpack the first undo record we will know the start of the next
transaction.  From there, if we read the last two bytes, they will hold
the length of the last undo record of our transaction.  So we can
compute 2) with the formula below:

start of the last undo record = start of the next transaction - length
of our transaction's last record.
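
In code form, the computation is trivial; a toy sketch treating undo
record pointers as plain integers (ignoring log and page boundaries):

#include <stdint.h>
#include <stdio.h>

typedef uint64_t UndoRecPtr;	/* treated as a plain integer here */

/*
 * next_xact_start: start of the next transaction's first undo record,
 *                  known from uur_next in our transaction header.
 * last_rec_len:    length of our transaction's last record, read from the
 *                  two bytes immediately preceding next_xact_start.
 */
static UndoRecPtr
last_record_start(UndoRecPtr next_xact_start, uint16_t last_rec_len)
{
	return next_xact_start - last_rec_len;
}

int
main(void)
{
	/* e.g. next transaction at 1101, last record length 101 => 1000 */
	printf("%llu\n", (unsigned long long) last_record_start(1101, 101));
	return 0;
}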

Am I making sense here?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Robert Haas
On Thu, Aug 22, 2019 at 12:54 AM Andres Freund  wrote:
> But why? It makes a *lot* more sense to have it in the beginning. I
> don't think bulk-fetch really requires it to be in the end - we can
> still process records forward on a page-by-page basis.

There are two separate needs here: to be able to go forward, and to be
able to go backward.  We have the length at the end of each record not
because we're stupid, but so that we can back up.  If we have another
way of backing up, then the thing to do is not to move that to
beginning of the record but to remove it entirely as unnecessary
wastage.  We can also think about how to improve forward traversal.

Considering each problem separately:

For forward traversal, we could simplify things somewhat by having
only 3 decoding stages instead of N decoding stages.  We really only
need (1) a stage for accumulating bytes until we have uur_info, and
then (2) a stage for accumulating bytes until we know the payload and
tuple lengths, and then (3) a stage for accumulating bytes until we
have the whole record.  We have a lot more stages than that right now
but I don't think we really need them for anything. Originally we had
them so that we could do incremental decoding to find the transaction
header in the record, but now that the transaction header is at a
fixed offset, I think the multiplicity of stages is just baggage.

We could simplify things more by deciding that the first two bytes of
the record are going to contain the record size. That would increase
the size of the record by 2 bytes, but we could (mostly) claw those
bytes back by not storing the size of both uur_payload and uur_tuple.
The size of the other one would be computed by subtraction: take the
total record size, subtract the size of whichever of those two things
we store, subtract the mandatory and optional headers that are
present, and the rest must be the other value. That would still add 2
bytes for records that contain neither a payload nor a tuple, but that
would probably be OK given that (a) a lot of records wouldn't be
affected, (b) the code would get simpler, and (c) something like this
seems necessary anyway given that we want to make the record format
more generic. With this approach instead of 3 stages we only need 2:
(1) accumulating bytes until we have the 2-byte length word, and (2)
accumulating bytes until we have the whole record.
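
A sketch of that arithmetic, with invented field names (this is not the
current UnpackedUndoRecord layout):

#include <stdint.h>

/*
 * Hypothetical layout: the first two bytes of each record hold the total
 * record size, and only one of the payload/tuple lengths is stored
 * explicitly; the other is recovered by subtraction once we know which
 * mandatory and optional headers are present.
 */
typedef struct UndoRecSizes
{
	uint16_t	total_len;			/* the leading 2-byte length word */
	uint16_t	headers_len;		/* mandatory + optional headers present */
	uint16_t	payload_len;		/* the one length we store explicitly */
} UndoRecSizes;

static uint16_t
undo_tuple_len(const UndoRecSizes *r)
{
	/* whatever is left over must be the tuple */
	return r->total_len - r->headers_len - r->payload_len;
}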

For backward traversal, as I see it, there are basically two options.
One is to do what we're doing right now, and store the record length
at the end of the record. (That might mean that a record both begins
and ends with its own length, which is not a crazy design.) The other
is to do what I think you are proposing here: locate the beginning of
the first record on the page, presumably based on some information
stored in the page header, and then work forward through the page to
figure out where all the records start.  Then process them in reverse
order. That saves 2 bytes per record.  It's a little more expensive in
terms of CPU cycles, especially if you only need some of the records
on the page but not all of them, but that's probably not too bad.
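
Here is a sketch of that second option at the page level, assuming
(hypothetically) that the page header can tell us where the first record
on the page begins; the accessor functions are placeholders, not real
APIs:

#include <stdint.h>

#define MAX_RECS_PER_PAGE 512

/* Placeholder accessors -- the real undo page format would provide these. */
extern uint16_t undo_page_first_record(const char *page);
extern uint16_t undo_record_length(const char *page, uint16_t off);
extern void process_record(const char *page, uint16_t off);

/*
 * Walk forward from the first record that begins on this page, collecting
 * record start offsets, then process them in reverse order.  This removes
 * the need to store each record's length at its end just to be able to
 * back up, at the cost of a little extra CPU when only some records on
 * the page are needed.
 */
static void
process_page_backwards(const char *page, uint16_t page_used)
{
	uint16_t	offsets[MAX_RECS_PER_PAGE];
	int			n = 0;
	uint16_t	off = undo_page_first_record(page);

	while (off < page_used && n < MAX_RECS_PER_PAGE)
	{
		offsets[n++] = off;
		off += undo_record_length(page, off);
	}

	for (int i = n - 1; i >= 0; i--)
		process_record(page, offsets[i]);
}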

I think I'm basically agreeing with what you are proposing but I think
it's important to spell out the underlying concerns, because otherwise
I'm afraid we might think we have a meeting of the minds when we don't
really.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Andres Freund
Hi

On August 22, 2019 9:14:10 AM PDT, Dilip Kumar  wrote:
> But, those requests will
>ultimately be used for collecting the record by the bulk fetch.  So if
>we are planning to change the bulk fetch to read forward then maybe we
>don't need the valid last undo record pointer because that we will
>anyway get while processing forward.  So just knowing the end of the
>transaction is sufficient for us to know where to stop.  I am not sure
>if this solution has any problem.  Probably  I should think again in
>the morning when my mind is well-rested.

I don't think we can easily do so for bulk apply without incurring significant 
overhead. It's pretty cheap to read in forward order and then process backwards 
on a page level - but for an entire transaction's undo the story is different. 
We can't necessarily keep all of it in memory, so we'd have to read the undo 
twice to find the end. Right?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 9:21 PM Dilip Kumar  wrote:
>
> On Thu, Aug 22, 2019 at 7:34 PM Robert Haas  wrote:
> >
> > On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar  wrote:
> > > Yeah, we can handle the bulk fetch as you suggested and it will make
> > > it a lot easier.  But, currently while registering the undo request
> > > (especially during the first pass) we need to compute the from_urecptr
> > > and the to_urecptr. And,  for computing the from_urecptr,  we have the
> > > end location of the transaction because we have the uur_next in the
> > > transaction header and that will tell us the end of our transaction
> > > but we still don't know the undo record pointer of the last record of
> > > the transaction.  As of know, we read previous 2 bytes from the end of
> > > the transaction to know the length of the last record and from there
> > > we can compute the undo record pointer of the last record and that is
> > > our from_urecptr.
> >
> > I don't understand this.  If we're registering an undo request at "do"
> > time, we don't need to compute the starting location; we can just
> > remember the UndoRecPtr of the first record we inserted.  If we're
> > reregistering an undo request after a restart, we can (and, I think,
> > should) work forward from the discard location rather than backward
> > from the insert location.
>
> Right, we work froward from the discard location.  So after the
> discard location,  while traversing the undo log when we encounter an
> aborted transaction we need to register its rollback request.  And,
> for doing that we need 1) start location of the first undo record . 2)
> start location of the last undo record (last undo record pointer).
>
> We already have 1).  But we have to compute 2).   For doing that if we
> unpack the first undo record we will know the start of the next
> transaction.  From there if we read the last two bytes then that will
> have the length of the last undo record of our transaction.  So we can
> compute 2) with below formula
>
> start of the last undo record = start of the next transaction - length
> of our transaction's last record.

Maybe I am saying that because I am just thinking about how the
requests are registered as per the current code.  But those requests
will ultimately be used for collecting the records by the bulk fetch.
So if we are planning to change the bulk fetch to read forward, then
maybe we don't need a valid last undo record pointer, because we will
get that anyway while processing forward.  So just knowing the end of
the transaction is sufficient for us to know where to stop.  I am not
sure if this solution has any problem.  Probably I should think about
it again in the morning when my mind is well-rested.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Robert Haas
On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar  wrote:
> Yeah, we can handle the bulk fetch as you suggested and it will make
> it a lot easier.  But, currently while registering the undo request
> (especially during the first pass) we need to compute the from_urecptr
> and the to_urecptr. And,  for computing the from_urecptr,  we have the
> end location of the transaction because we have the uur_next in the
> transaction header and that will tell us the end of our transaction
> but we still don't know the undo record pointer of the last record of
> the transaction.  As of know, we read previous 2 bytes from the end of
> the transaction to know the length of the last record and from there
> we can compute the undo record pointer of the last record and that is
> our from_urecptr.

I don't understand this.  If we're registering an undo request at "do"
time, we don't need to compute the starting location; we can just
remember the UndoRecPtr of the first record we inserted.  If we're
reregistering an undo request after a restart, we can (and, I think,
should) work forward from the discard location rather than backward
from the insert location.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Amit Kapila
On Thu, Aug 22, 2019 at 11:04 AM Dilip Kumar  wrote:
>
> On Thu, Aug 22, 2019 at 10:24 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2019-08-22 10:19:04 +0530, Dilip Kumar wrote:
> > > On Thu, Aug 22, 2019 at 9:58 AM Andres Freund  wrote:
> > > >
> > > > Hi,
> > > >
> > > > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote:
> > > > > We can not know the complete size of the record even by reading the
> > > > > header because we have a payload that is variable part and payload
> > > > > length are stored in the payload header which again can be at random
> > > > > offset.
> > > >
> > > > Wait, but that's just purely self inflicted damage, no? The initial
> > > > length just needs to include the payload. And all this is not an issue
> > > > anymore?
> > > >
> > > Actually, we store the undo length only at the end of the record and
> > > that is for traversing the transaction's undo record chain during bulk
> > > fetch.  Ac such in the beginning of the record we don't have the undo
> > > length.  We do have uur_info but that just tell us which all optional
> > > header are included in the record.
> >
> > But why? It makes a *lot* more sense to have it in the beginning. I
> > don't think bulk-fetch really requires it to be in the end - we can
> > still process records forward on a page-by-page basis.
>
> Yeah, we can handle the bulk fetch as you suggested and it will make
> it a lot easier.  But, currently while registering the undo request
> (especially during the first pass) we need to compute the from_urecptr
> and the to_urecptr. And,  for computing the from_urecptr,  we have the
> end location of the transaction because we have the uur_next in the
> transaction header and that will tell us the end of our transaction
> but we still don't know the undo record pointer of the last record of
> the transaction.  As of know, we read previous 2 bytes from the end of
> the transaction to know the length of the last record and from there
> we can compute the undo record pointer of the last record and that is
> our from_urecptr.
>

How about if we store the location of the last record of the
transaction in the transaction header, instead of the location of the
next transaction?  I think if we do that then the discard worker might
need to do some additional work in some cases, as it needs to tell the
location up to which discard is possible; however, many other cases
might get simplified.  With this also, when the log is switched while
writing records for the same transaction, the transaction header in
the first log will store the start location of the same transaction's
records in the next log.
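
To make the comparison concrete, here is a hypothetical sketch of the
two transaction-header variants (field names invented, not the actual
zheap/undo definitions):

#include <stdint.h>

typedef uint64_t UndoRecPtr;	/* illustrative stand-in */

/* Current scheme: the transaction header points at the next transaction. */
typedef struct UndoXactHeaderNext
{
	UndoRecPtr	uur_next;			/* start of the next transaction's undo */
} UndoXactHeaderNext;

/*
 * Proposed scheme: point at this transaction's own last record instead.
 * The from_urecptr for a rollback request is then available directly, at
 * the cost of the discard worker doing a little extra work to determine
 * the location up to which it can discard.  When the transaction's undo
 * spills into another log, the header in the first log would store the
 * start of the same transaction's records in the next log.
 */
typedef struct UndoXactHeaderLast
{
	UndoRecPtr	uur_last_record;	/* start of this transaction's last record */
} UndoXactHeaderLast;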

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 10:24 AM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-22 10:19:04 +0530, Dilip Kumar wrote:
> > On Thu, Aug 22, 2019 at 9:58 AM Andres Freund  wrote:
> > >
> > > Hi,
> > >
> > > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote:
> > > > We can not know the complete size of the record even by reading the
> > > > header because we have a payload that is variable part and payload
> > > > length are stored in the payload header which again can be at random
> > > > offset.
> > >
> > > Wait, but that's just purely self inflicted damage, no? The initial
> > > length just needs to include the payload. And all this is not an issue
> > > anymore?
> > >
> > Actually, we store the undo length only at the end of the record and
> > that is for traversing the transaction's undo record chain during bulk
> > fetch.  Ac such in the beginning of the record we don't have the undo
> > length.  We do have uur_info but that just tell us which all optional
> > header are included in the record.
>
> But why? It makes a *lot* more sense to have it in the beginning. I
> don't think bulk-fetch really requires it to be in the end - we can
> still process records forward on a page-by-page basis.

Yeah, we can handle the bulk fetch as you suggested and it will make
it a lot easier.  But currently, while registering the undo request
(especially during the first pass), we need to compute the from_urecptr
and the to_urecptr.  And for computing the from_urecptr, we have the
end location of the transaction (because we have the uur_next in the
transaction header and that will tell us the end of our transaction),
but we still don't know the undo record pointer of the last record of
the transaction.  As of now, we read the previous 2 bytes from the end
of the transaction to know the length of the last record, and from
there we can compute the undo record pointer of the last record; that
is our from_urecptr.

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 9:58 AM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote:
> > We can not know the complete size of the record even by reading the
> > header because we have a payload that is variable part and payload
> > length are stored in the payload header which again can be at random
> > offset.
>
> Wait, but that's just purely self inflicted damage, no? The initial
> length just needs to include the payload. And all this is not an issue
> anymore?
>
Actually, we store the undo length only at the end of the record, and
that is for traversing the transaction's undo record chain during bulk
fetch.  As such, at the beginning of the record we don't have the undo
length.  We do have uur_info, but that just tells us which optional
headers are included in the record.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Andres Freund
Hi,

On 2019-08-22 10:19:04 +0530, Dilip Kumar wrote:
> On Thu, Aug 22, 2019 at 9:58 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote:
> > > We can not know the complete size of the record even by reading the
> > > header because we have a payload that is variable part and payload
> > > length are stored in the payload header which again can be at random
> > > offset.
> >
> > Wait, but that's just purely self inflicted damage, no? The initial
> > length just needs to include the payload. And all this is not an issue
> > anymore?
> >
> Actually, we store the undo length only at the end of the record and
> that is for traversing the transaction's undo record chain during bulk
> fetch.  Ac such in the beginning of the record we don't have the undo
> length.  We do have uur_info but that just tell us which all optional
> header are included in the record.

But why? It makes a *lot* more sense to have it in the beginning. I
don't think bulk-fetch really requires it to be in the end - we can
still process records forward on a page-by-page basis.

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Andres Freund
Hi,

On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote:
> We can not know the complete size of the record even by reading the
> header because we have a payload that is variable part and payload
> length are stored in the payload header which again can be at random
> offset.

Wait, but that's just purely self inflicted damage, no? The initial
length just needs to include the payload. And all this is not an issue
anymore?

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Dilip Kumar
On Wed, Aug 21, 2019 at 9:04 PM Robert Haas  wrote:
>
> On Wed, Aug 21, 2019 at 3:55 AM Dilip Kumar  wrote:
> > I have already attempted that part and I feel it is not making code
> > any simpler than what we have today.  For packing, it's fine because I
> > can process all the member once and directly pack it into one memory
> > chunk and I can insert that to the buffer by one call of
> > InsertUndoBytes and that will make the code simpler.
>
> OK...

>
> The bigger issue here is that we don't seem to be making very much
> progress toward improving the overall design.  Several significant
> improvements have been suggested:
>
> 1. Thomas suggested quite some time ago that we should make sure that
> the transaction header is the first optional header.

I think this was already done at least 2-3 versions ago.  So now we
are updating the transaction header directly by writing at that offset,
and we don't need staging for this.
>  If we do that,
> then I'm not clear on why we even need this incremental unpacking
> stuff any more. The only reason for all of this machinery was so that
> we could find the transaction header at an unknown offset inside a
> complex record format; there is, if I understand correctly, no other
> case in which we want to incrementally decode a record.
> But if the
> transaction header is at a fixed offset, then there seems to be no
> need to even have incremental decoding at all.  Or it can be very
> simple, with three stages: (1) we don't yet have enough bytes to
> figure out how big the record is; (2) we have enough bytes to figure
> out how big the record is and we have figured that out but we don't
> yet have all of those bytes; and (3) we have the whole record, we can
> decode the whole thing and we're done.

We cannot know the complete size of the record even by reading the
header, because the payload is the variable part and the payload length
is stored in the payload header, which again can be at a random offset.
But maybe we can still follow this idea, which would make unpacking far
simpler.  I have a few ideas:
a) Store the payload header right after the transaction header so that
we can easily know the complete record size.
b) Once we decode the first header via uur_info, we can compute the
exact offset of the payload header and from there know the record size.
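
To make (b) concrete, something along these lines (all header sizes, flag values, and names below are invented stand-ins, not the patch's actual definitions):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* illustrative stand-ins for the real header sizes and uur_info bits */
#define SKETCH_COMMON_HDR_SIZE  8
#define SKETCH_XACT_HDR_SIZE    16
#define SKETCH_BLOCK_HDR_SIZE   10
#define SKETCH_INFO_XACT        0x01
#define SKETCH_INFO_BLOCK       0x02
#define SKETCH_INFO_PAYLOAD     0x04

/*
 * Compute the total record size from the first few bytes: skip the
 * optional headers announced in uur_info, then read the two length
 * words of the payload header.
 */
static size_t
undo_record_size_sketch(const char *rec)
{
    uint8_t     info = (uint8_t) rec[1];    /* assume uur_info is byte 1 */
    size_t      off = SKETCH_COMMON_HDR_SIZE;
    uint16_t    payload_len = 0;
    uint16_t    tuple_len = 0;

    if (info & SKETCH_INFO_XACT)
        off += SKETCH_XACT_HDR_SIZE;
    if (info & SKETCH_INFO_BLOCK)
        off += SKETCH_BLOCK_HDR_SIZE;

    if (info & SKETCH_INFO_PAYLOAD)
    {
        memcpy(&payload_len, rec + off, sizeof(uint16_t));
        memcpy(&tuple_len, rec + off + sizeof(uint16_t), sizeof(uint16_t));
        off += 2 * sizeof(uint16_t);
    }

    return off + payload_len + tuple_len;
}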

>
> 2. Based on a discussion with Thomas, I suggested the GHOB stuff,
> which gets rid of the idea of transaction headers inside individual
> records altogether; instead, there is one undo blob per transaction
> (or maybe several if we overflow to another undo log) which begins
> with a sentinel byte that identifies it as owned by a transaction, and
> then the transaction header immediately follows that without being
> part of any record, and the records follow that data.  As with the
> previous idea, this gets rid of the need for incremental decoding
> because it gets rid of the need to find the transaction header inside
> of a bigger record. As Thomas put it to me off-list, it puts the
> records inside of a larger chunk of space owned by the transaction
> instead of putting the transaction header inside of some particular
> record; that seems more correct than the way we have it now.
>
> 3. Heikki suggested that the format should be much more generic and
> that more should be left up to the AM.  While neither Andres nor I are
> convinced that we should go as far in that direction as Heikki is
> proposing, the idea clearly has some merit, and we should probably be
> moving that way to some degree. For instance, the idea that we should
> store a block number and TID is a bit sketchy when you consider that a
> multi-insert operation really wants to store a TID list. The zheap
> tree has a ridiculous kludge to work around that problem; clearly we
> need something better.  We've also mentioned that, in the future, we
> might want to support TIDs that are not 6 bytes, and that even just
> looking at what's currently under development, zedstore wants to treat
> TIDs as 48-bit opaque quantities, not a 4-byte block number and a
> 2-byte item pointer offset.  So, there is clearly a need to go through
> the whole way we're doing this and rethink which parts are generic and
> which parts are AM-specific.
>
> 4. A related problem, which has been mentioned or at least alluded to
> by both Heikki and by me, is that we need a better way of handling the
> AM-specific data. Right now, the zheap code packs fixed-size things
> into the payload data and then finds them by knowing the offset where
> any particular data is within that field, but that's an unmaintainable
> mess.  The zheap code could be improved by at least defining those
> offsets as constants someplace and adding some comments explaining the
> payload formats of various undo records, but even if we do that, it's
> not going to generalize very well to anything more complicated than a
> few fixed-size bits of data.  I suggested using the pqformat stuff to
> try to 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Robert Haas
On Wed, Aug 14, 2019 at 2:57 AM Andres Freund  wrote:
> - My reading of the current xact.c integration is that it's not workable
>   as is. Undo is executed outside of a valid transaction state,
>   exceptions aren't properly undone, logic would need to be duplicated
>   to a significant degree, new kind of critical section.

Regarding this particular point:

ReleaseResourcesAndProcessUndo() is only supposed to be called after
AbortTransaction(), and the additional steps it performs --
AtCleanup_Portals() and AtEOXact_Snapshot() or alternatively
AtSubCleanup_Portals -- are taken from Cleanup(Sub)Transaction.
That's not crazy; the other steps in Cleanup(Sub)Transaction() look
like stuff that's intended to be performed when we're totally done
with this TransactionState stack entry, whereas these things are
slightly higher-level cleanups that might even block undo (e.g.
undropped portal prevents orphaned file cleanup).  Granted, there are
no comments explaining why those particular cleanup steps are
performed here, and it's possible some other approach is better, but I
think perhaps it's not quite as flagrantly broken as you think.

I am also not convinced that semi-critical sections are a bad idea,
although the if (SemiCritSectionCount > 0) test at the start of
ReleaseResourcesAndProcessUndo() looks wrong.  To roll back a
subtransaction, we must perform undo in the foreground, and if that
fails, the toplevel transaction can't be allowed to commit, full stop.
Since we expect this to be a (very) rare scenario, I don't know why
escalating to FATAL is a catastrophe.  The only other option is to do
something along the lines of SxactIsDoomed(), where we force all
subsequent commits (and sub-commits?) within the toplevel xact to
fail. You can argue that the latter is a better user experience, and
for SSI I certainly agree, but this case isn't quite the same: there's
a good chance we're dealing with a corrupted page or system
administrator intervention to try to kill off a long-running undo
task, and continuing in such cases seems a lot more dubious than after
a serializability failure, where retrying is the expected recovery
mode. The other case is where toplevel undo for a temporary table
fails.  It is unclear to me what, other than FATAL, could suffice
there.  I guess you could just let the session continue and leave the
transaction undone, leaving whatever MVCC machinery the table AM may
have to look through it, but that sounds inferior to me. Rip the bandaid
off.

Some general complaints from my side about the xact.c changes:

1. The code structure doesn't seem quite right.  For example:

1a. ProcessUndoRequestForEachLogCat has a try/catch block, but it
seems to me that the job of a try/catch block is to provide structured
error-handling for resources for which there's no direct handling in
xact.c or resowner.c.  Here, we're inside of xact.c, so why are we
adding a try/catch block?
1b. ReleaseResourcesAndProcessUndo does part of the work of cleaning
up a failed transaction but not all of it, the rest being done by
AbortTransaction, which is called before entering it, plus it also
kicks off the actual undo work.  I would expect a cleaner division of
responsibility.
1c. Having an undo request per UndoLogCategory rather than one per
transaction doesn't seem right to me; hopefully that will get cleaned
up when the undorequest.c stuff I sent before is integrated.
1d. The code at the end of FinishPreparedTransaction() seems to expect
that the called code will catch any error, but that clearing the error
state might need to happen here, and also that we should fire up a new
transaction; I suspect, but am not entirely sure, that that is not the
way it should work.  The code added earlier in that function also
looks suspicious, because it's filling up what is basically a
high-level control function with a bunch of low-level undo-specific
details.  In both places, the undo-specific concerns probably need to
be better-isolated.

2. Signaling is done using some odd-looking mechanisms.  For instance:

2a. The SemiCritSectionCount > 0 test at the top of
ReleaseResourcesAndProcessUndo that I complained about earlier looks
like a guard against reentrancy, but that must be the wrong way to get
there; it makes it impossible to reuse what is ostensibly a
general-purpose facility for any non-undo related purpose without
maybe breaking something.
2b. ResetUndoActionsInfo() is called from a bunch of places, but only 2
of those places have a comment explaining why, and the function
comment is pretty unilluminating. This looks like some kind of
signaling machinery, but it's not very clear to me what it's actually
trying to do.
2c. ResourceOwnerReleaseInternal() is directly calling
NeedToPerformUndoActions(), which feels like a layering violation.
2d. I'm not really sure that TRANS_UNDO is serving any useful purpose;
I think we need TBLOCK_UNDO and TBLOCK_SUBUNDO, but I'm not really
sure TRANS_UNDO is doing anything useful; the 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Andres Freund



On August 21, 2019 8:36:34 AM PDT, Robert Haas  wrote:
> We treat LWLockAcquire() as a no-fail operation in many
>places; in my opinion, that elog(ERROR) that we have for too many
>LWLocks should be changed to elog(PANIC) precisely because we do treat
>LWLockAcquire() as no-fail in lots of places in the code, but I think
>I suggested that once and got slapped down, and I haven't had the
>energy to fight about it.

Fwiw, that proposal has my vote.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Robert Haas
On Wed, Aug 21, 2019 at 3:55 AM Dilip Kumar  wrote:
> I have already attempted that part and I feel it is not making code
> any simpler than what we have today.  For packing, it's fine because I
> can process all the member once and directly pack it into one memory
> chunk and I can insert that to the buffer by one call of
> InsertUndoBytes and that will make the code simpler.

OK...

> But, while unpacking if I directly unpack to the UnpackUndoRecord then
> there are few complexities.  I am not saying those are difficult to
> implement but code may not look better.
>
> a) First, we need to add extra stages for unpacking as we need to do
> field by field.
>
> b) Some of the members like uur_payload and uur_tuple are not the same
> type in the UnpackUndoRecord compared to how it is stored in the page.
> In UnpackUndoRecord those are StringInfoData whereas on the page we
> store it as UndoRecordPayload header followed by the actual data.  I
> am not saying we can not unpack this directly we can do it like,
> first read the payload length from the page in uur_payload.len then
> read tuple length in uur_tuple.len then read both the data.  And, for
> that, we will have to add extra stages.

I don't think that this is true; or at least, I think it's avoidable.
The idea of an unpacking stage is that you refuse to advance to the
next stage until you've got a certain number of bytes of data; and
then you unpack everything that pertains to that stage.  If you have 2
4-byte fields that you want to unpack together, you can just wait
until you've 8 bytes of data and then unpack both.  You don't really
need 2 separate stages.  (Similarly, your concern about fields being
in a different order seems like it should be resolved by agreeing on
one ordering and having everything use it; I don't know why there
should be one order that is better in memory and another order that is
better on disk.)

The bigger issue here is that we don't seem to be making very much
progress toward improving the overall design.  Several significant
improvements have been suggested:

1. Thomas suggested quite some time ago that we should make sure that
the transaction header is the first optional header.  If we do that,
then I'm not clear on why we even need this incremental unpacking
stuff any more. The only reason for all of this machinery was so that
we could find the transaction header at an unknown offset inside a
complex record format; there is, if I understand correctly, no other
case in which we want to incrementally decode a record. But if the
transaction header is at a fixed offset, then there seems to be no
need to even have incremental decoding at all.  Or it can be very
simple, with three stages: (1) we don't yet have enough bytes to
figure out how big the record is; (2) we have enough bytes to figure
out how big the record is and we have figured that out but we don't
yet have all of those bytes; and (3) we have the whole record, we can
decode the whole thing and we're done.
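
Such a decoder could be as small as this sketch (the stage names, the 8-byte size prefix, and the helper are assumptions made here for illustration):

#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    UNDO_DECODE_NEED_SIZE,      /* (1) not enough bytes to know the size */
    UNDO_DECODE_NEED_BODY,      /* (2) size known, still collecting bytes */
    UNDO_DECODE_DONE            /* (3) whole record available and decoded */
} UndoDecodeStageSketch;

typedef struct
{
    UndoDecodeStageSketch stage;
    size_t      total;          /* record size, once known */
} UndoDecodeStateSketch;

/* returns true once the whole record has been seen and can be decoded */
static bool
undo_decode_step(UndoDecodeStateSketch *st, const char *buf, size_t avail,
                 size_t (*size_from_prefix)(const char *))
{
    if (st->stage == UNDO_DECODE_NEED_SIZE && avail >= 8)
    {
        st->total = size_from_prefix(buf);
        st->stage = UNDO_DECODE_NEED_BODY;
    }
    if (st->stage == UNDO_DECODE_NEED_BODY && avail >= st->total)
    {
        /* decode the whole record from buf here */
        st->stage = UNDO_DECODE_DONE;
    }
    return st->stage == UNDO_DECODE_DONE;
}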

2. Based on a discussion with Thomas, I suggested the GHOB stuff,
which gets rid of the idea of transaction headers inside individual
records altogether; instead, there is one undo blob per transaction
(or maybe several if we overflow to another undo log) which begins
with a sentinel byte that identifies it as owned by a transaction, and
then the transaction header immediately follows that without being
part of any record, and the records follow that data.  As with the
previous idea, this gets rid of the need for incremental decoding
because it gets rid of the need to find the transaction header inside
of a bigger record. As Thomas put it to me off-list, it puts the
records inside of a larger chunk of space owned by the transaction
instead of putting the transaction header inside of some particular
record; that seems more correct than the way we have it now.
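
Roughly, the layout being described is this (the sentinel value and all field names below are invented for illustration):

#include <stdint.h>

#define SKETCH_CHUNK_OWNED_BY_XACT  0x01    /* assumed sentinel value */

/*
 * Per-transaction undo blob: a sentinel byte and a transaction header
 * that belong to the chunk itself rather than to any record; the undo
 * records for the transaction follow immediately afterwards.
 */
typedef struct UndoXactChunkHeaderSketch
{
    uint8_t     sentinel;       /* SKETCH_CHUNK_OWNED_BY_XACT */
    uint64_t    fxid;           /* owning FullTransactionId */
    uint64_t    progress;       /* rollback progress bookkeeping */
    uint64_t    next_chunk;     /* continuation if we overflow to another log */
    /* undo records start here */
} UndoXactChunkHeaderSketch;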

3. Heikki suggested that the format should be much more generic and
that more should be left up to the AM.  While neither Andres nor I are
convinced that we should go as far in that direction as Heikki is
proposing, the idea clearly has some merit, and we should probably be
moving that way to some degree. For instance, the idea that we should
store a block number and TID is a bit sketchy when you consider that a
multi-insert operation really wants to store a TID list. The zheap
tree has a ridiculous kludge to work around that problem; clearly we
need something better.  We've also mentioned that, in the future, we
might want to support TIDs that are not 6 bytes, and that even just
looking at what's currently under development, zedstore wants to treat
TIDs as 48-bit opaque quantities, not a 4-byte block number and a
2-byte item pointer offset.  So, there is clearly a need to go through
the whole way we're doing this and rethink which parts are generic and
which parts are AM-specific.

4. A related problem, which has been mentioned or at least alluded to
by both Heikki and by me, is that we need a better way of 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Robert Haas
On Wed, Aug 21, 2019 at 6:38 AM Amit Kapila  wrote:
> > FWIW, although I also thought of doing what you are describing here, I
> > think Andres's proposal is probably preferable, because it's simpler.
> > There's not really any reason why we can't take the buffer locks from
> > within the critical section, and that way callers don't have to deal
> > with the extra step.
>
> IIRC, the reason this was done before starting the critical section was
> because of the coding convention mentioned in src/access/transam/README
> (Section: Write-Ahead Log Coding).  It says to first pin and exclusively
> lock the shared buffers and then start the critical section.  It might be
> that we can bypass that convention here, but I guess it is mainly to
> avoid any error in the critical section.  I have checked the
> LWLockAcquire path and there doesn't seem to be any reason that it
> will throw an error except when the caller has acquired many locks at the
> same time, which is not the case here.

Yeah, I think it's fine to deviate from that convention in this
respect.  We treat LWLockAcquire() as a no-fail operation in many
places; in my opinion, that elog(ERROR) that we have for too many
LWLocks should be changed to elog(PANIC) precisely because we do treat
LWLockAcquire() as no-fail in lots of places in the code, but I think
I suggested that once and got slapped down, and I haven't had the
energy to fight about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Amit Kapila
On Tue, Aug 20, 2019 at 8:10 PM Robert Haas  wrote:
>
> On Tue, Aug 20, 2019 at 2:42 AM Amit Kapila  wrote:
> > > Well, my main point, which so far has largely been ignored, was that we
> > > may not acquire page locks when we still need to search for victim
> > > buffers later. If we don't need to lock the pages up-front, but only do
> > > so once we're actually copying the records into the undo pages, then we
> > > don't need a separate phase to acquire the locks. We can still hold all of
> > > the page locks at the same time, as long as we just acquire them at the
> > > later stage.
> >
> > Okay, IIUC, this means that we should have a separate phase where we
> > call LockUndoBuffers (or something like that) before
> > InsertPreparedUndo and after PrepareUndoInsert.  The LockUndoBuffers
> > will lock all the buffers pinned during PrepareUndoInsert.  We can
> > probably call LockUndoBuffers before entering the critical section to
> > avoid any kind of failure in critical section.  If so, that sounds
> > reasonable to me.
>
> I'm kind of scratching my head here, because this is clearly different
> than what Andres said in the quoted text to which you were replying.
> He clearly implied that we should acquire the buffer locks within the
> critical section during InsertPreparedUndo, and you responded by
> proposing to do it outside the critical section in a separate step.
> Regardless of which way is actually better, when somebody says "hey,
> let's do A!" and you respond by saying "sounds good, I'll go implement
> B!" that's not really helping us to get toward a solution.
>

I got confused by the statement "We can still hold all of the page
locks at the same time, as long as we just acquire them at the later
stage."

> FWIW, although I also thought of doing what you are describing here, I
> think Andres's proposal is probably preferable, because it's simpler.
> There's not really any reason why we can't take the buffer locks from
> within the critical section, and that way callers don't have to deal
> with the extra step.
>

IIRC, the reason this was done before starting the critical section was
because of the coding convention mentioned in src/access/transam/README
(Section: Write-Ahead Log Coding).  It says to first pin and exclusively
lock the shared buffers and then start the critical section.  It might be
that we can bypass that convention here, but I guess it is mainly to
avoid any error in the critical section.  I have checked the
LWLockAcquire path and there doesn't seem to be any reason that it
will throw an error except when the caller has acquired many locks at the
same time, which is not the case here.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-21 Thread Dilip Kumar
On Tue, Aug 20, 2019 at 7:57 PM Robert Haas  wrote:
>
> On Mon, Aug 19, 2019 at 2:04 AM Dilip Kumar  wrote:
> > Currently, In UnpackedUndoRecord we store all members directly which
> > are set by the caller.  We store pointers to some header which are
> > allocated internally by the undo layer and the caller need not worry
> > about setting them.  So now you are suggesting to put other headers
> > also as structures in UnpackedUndoRecord.  I as such don't have much
> > problem in doing that but I think initially Robert designed
> > UnpackedUndoRecord structure this way so it will be good if Robert
> > provides his opinion on this.
>
> I don't believe that's what is being suggested.  It seems to me that
> the thing Andres is complaining about here has roots in the original
> sketch that I did for this code.  The oldest version I can find is
> here:
>
> https://github.com/EnterpriseDB/zheap/commit/7d194824a18f0c5e85c92451beab4bc6f044254c
>
> In this version, and I think still in the current version, there is a
> two-stage marshaling strategy.  First, the individual fields from the
> UnpackedUndoRecord get copied into global variables (yes, that was my
> fault, too, at least in part!) which are structures. Then, the
> structures get copied into the target buffer. The idea of that design
> was to keep the code simple, but it didn't really work out, because
> things got a lot more complicated between the time I wrote those 3244
> lines of code and the >3000 lines of code that live in this patch
> today. One thing that change was that we moved more and more in the
> direction of considering individual fields as separate objects to be
> separately included or excluded, whereas when I wrote that code I
> thought we were going to have groups of related fields that stood or
> fell together. That idea turned out to be wrong. (There is the
> even-larger question here of whether we ought to take Heikki's
> suggestion and make this whole thing a lot more generic, but let's
> start by discussing how the design that we have today could be
> better-implemented.)
>
> If I understand Andres correctly, he's arguing that we ought to get
> rid of the two-stage marshaling strategy.  During decoding, he wants
> data to go directly from the buffer that contains it to the
> UnpackedUndoRecord without ever being stored in the UnpackUndoContext.
> During insertion, he wants data to go directly from the
> UnpackedUndoRecord to the buffer that contains it.  Or, at least, if
> there has to be an intermediate format, he wants it to be just a chunk
> of raw bytes, rather than a bunch of individual fields like we have in
> UndoPackContext currently.  I think that's a reasonable goal.  I'm not
> as concerned about it as he is from a performance point of view, but I
> think it would make the code look nicer, and that would be good.  If
> we save CPU cycles along the way, that is also good.
>
> In broad outline, what this means is:
>
> 1. Any field in the UndoPackContext that starts with urec_ goes away.
> 2. Instead of something like InsertUndoBytes((char *)
> &(ucontext->urec_fxid), ...) we'd write InsertUndoBytes((char *)
> &uur->uur_fxid, ...).
> 3. Similarly, instead of ReadUndoBytes((char *) &(ucontext->urec_fxid),
> ...) we'd write ReadUndoBytes((char *) &uur->uur_fxid, ...).
> 4. It seems slightly trickier to handle the cases where we've got a
> structure instead of individual fields, like urec_hd.  But those could
> be broken down into field-by-field reads and writes, e.g. in this case
> one call for urec_type and a second for urec_info.
> 5. For uur_group and uur_logswitch, the code would need to allocate
> those subsidiary structures before copying into them.
>
> To me, that seems like it ought to be a pretty straightforward change
> that just makes things simpler.  We'd probably need to pass the
> UnpackedUndoRecord to BeginUnpackUndo instead of FinishUnpackUndo, and
> keep a pointer to it in the UnpackUndoContext, but that seems fine.
> FinishUnpackUndo would end up just about empty, maybe entirely empty.
>
> Is that a reasonable idea?
>
I have already attempted that part and I feel it is not making code
any simpler than what we have today.  For packing, it's fine because I
can process all the member once and directly pack it into one memory
chunk and I can insert that to the buffer by one call of
InsertUndoBytes and that will make the code simpler.

But, while unpacking if I directly unpack to the UnpackUndoRecord then
there are few complexities.  I am not saying those are difficult to
implement but code may not look better.

a) First, we need to add extra stages for unpacking as we need to do
field by field.

b) Some of the members like uur_payload and uur_tuple are not the same
type in the UnpackUndoRecord compared to how it is stored in the page.
In UnpackUndoRecord those are StringInfoData whereas on the page we
store it as UndoRecordPayload header followed by the actual data.  I
am not saying we can not unpack this directly we can do it like,

Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Tue, Aug 13, 2019 at 8:11 AM Robert Haas  wrote:
> > We can probably check the fxid queue and error queue to get that
> > value.  However, I am not sure if that is sufficient because in case we
> > perform the request in the foreground, it won't be present in the queues.
>
> Oh, I forgot about that requirement.  I think I can fix it so it does
> that fairly easily, but it will require a little bit of redesign which
> I won't have time to do this week.

Here's a version with a quick (possibly buggy) prototype of the
oldest-FXID support.  It also includes a bunch of comment changes,
pgindent, and a few other tweaks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


v2-0001-New-undo-request-manager-now-with-UndoRequestMana.patch
Description: Binary data


Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Mon, Aug 19, 2019 at 2:04 AM Dilip Kumar  wrote:
> Currently, In UnpackedUndoRecord we store all members directly which
> are set by the caller.  We store pointers to some header which are
> allocated internally by the undo layer and the caller need not worry
> about setting them.  So now you are suggesting to put other headers
> also as structures in UnpackedUndoRecord.  I as such don't have much
> problem in doing that but I think initially Robert designed
> UnpackedUndoRecord structure this way so it will be good if Robert
> provides his opinion on this.

I don't believe that's what is being suggested.  It seems to me that
the thing Andres is complaining about here has roots in the original
sketch that I did for this code.  The oldest version I can find is
here:

https://github.com/EnterpriseDB/zheap/commit/7d194824a18f0c5e85c92451beab4bc6f044254c

In this version, and I think still in the current version, there is a
two-stage marshaling strategy.  First, the individual fields from the
UnpackedUndoRecord get copied into global variables (yes, that was my
fault, too, at least in part!) which are structures. Then, the
structures get copied into the target buffer. The idea of that design
was to keep the code simple, but it didn't really work out, because
things got a lot more complicated between the time I wrote those 3244
lines of code and the >3000 lines of code that live in this patch
today. One thing that changed was that we moved more and more in the
direction of considering individual fields as separate objects to be
separately included or excluded, whereas when I wrote that code I
thought we were going to have groups of related fields that stood or
fell together. That idea turned out to be wrong. (There is the
even-larger question here of whether we ought to take Heikki's
suggestion and make this whole thing a lot more generic, but let's
start by discussing how the design that we have today could be
better-implemented.)

If I understand Andres correctly, he's arguing that we ought to get
rid of the two-stage marshaling strategy.  During decoding, he wants
data to go directly from the buffer that contains it to the
UnpackedUndoRecord without ever being stored in the UnpackUndoContext.
During insertion, he wants data to go directly from the
UnpackedUndoRecord to the buffer that contains it.  Or, at least, if
there has to be an intermediate format, he wants it to be just a chunk
of raw bytes, rather than a bunch of individual fields like we have in
UndoPackContext currently.  I think that's a reasonable goal.  I'm not
as concerned about it as he is from a performance point of view, but I
think it would make the code look nicer, and that would be good.  If
we save CPU cycles along the way, that is also good.

In broad outline, what this means is:

1. Any field in the UndoPackContext that starts with urec_ goes away.
2. Instead of something like InsertUndoBytes((char *)
&(ucontext->urec_fxid), ...) we'd write InsertUndoBytes((char *)
&uur->uur_fxid, ...).
3. Similarly, instead of ReadUndoBytes((char *) &(ucontext->urec_fxid),
...) we'd write ReadUndoBytes((char *) &uur->uur_fxid, ...).
4. It seems slightly trickier to handle the cases where we've got a
structure instead of individual fields, like urec_hd.  But those could
be broken down into field-by-field reads and writes, e.g. in this case
one call for urec_type and a second for urec_info.
5. For uur_group and uur_logswitch, the code would need to allocate
those subsidiary structures before copying into them.

To me, that seems like it ought to be a pretty straightforward change
that just makes things simpler.  We'd probably need to pass the
UnpackedUndoRecord to BeginUnpackUndo instead of FinishUnpackUndo, and
keep a pointer to it in the UnpackUndoContext, but that seems fine.
FinishUnpackUndo would end up just about empty, maybe entirely empty.
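
In miniature, the packing side would look something like this (the context, record struct, and helper below are stand-ins with made-up fields, not the patch's actual types):

#include <stdint.h>
#include <string.h>

typedef struct { char *insert_ptr; } UndoPackContextSketch;
typedef struct
{
    uint8_t     uur_type;
    uint8_t     uur_info;
    uint64_t    uur_fxid;
} UnpackedUndoRecordSketch;

static void
insert_undo_bytes_sketch(UndoPackContextSketch *ctx, const void *src, size_t len)
{
    /* the real code copies into (possibly several) locked undo buffers */
    memcpy(ctx->insert_ptr, src, len);
    ctx->insert_ptr += len;
}

/* fields go straight from the record to the buffer, no urec_* staging */
static void
pack_fixed_part_sketch(UndoPackContextSketch *ctx, UnpackedUndoRecordSketch *uur)
{
    insert_undo_bytes_sketch(ctx, &uur->uur_type, sizeof(uur->uur_type));
    insert_undo_bytes_sketch(ctx, &uur->uur_info, sizeof(uur->uur_info));
    insert_undo_bytes_sketch(ctx, &uur->uur_fxid, sizeof(uur->uur_fxid));
}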

Is that a reasonable idea?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Tue, Aug 20, 2019 at 5:02 AM Thomas Munro  wrote:
> 3.  UndoLogDiscard() uses DiscardBuffer() to invalidate any currently
> unpinned buffers, and marks as BM_DISCARDED any that happen to be
> pinned right now, so they can't be immediately invalidated.  Such
> buffers are never written back and are eligible for reuse on the next
> clock sweep, even if they're written into by a backend that managed to
> do that when we were trying to discard.

This is definitely more concurrent, but it might be *too much*
concurrency. Suppose that backend #1 is inserting a row and updating
the transaction header for the previous transaction; meanwhile,
backend #2 is discarding the previous transaction. It could happen
that backend #1 locks the transaction header for the previous
transaction and is all set to log the insertion ... but then gets
context-switched out.  Now backend #2 swoops in and logs the discard.
Backend #1 now wakes up and finishes logging a change to a page that,
according to the logic of the WAL stream, no longer exists.

It's probably possible to make this work by ignoring WAL references to
discarded pages during replay, but that seems a bit dangerous. At
least, it loses some sanity checking that you might like to have.

It seems to me that you can avoid this if you require that a backend
that wants to set BM_DISCARDED acquire at least a shared content
lock before doing so.  If you do that, then once a backend acquires
content lock(s) on the page(s) containing the transaction header for
the purposes of updating it, it can notice that the BM_DISCARDED flag
is set and choose not to update those pages after all.  I think that
would be a smart design choice.
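
On the updating side that would amount to something like this (BufferIsDiscardedSketch() is a stand-in for an assumed accessor testing the proposed BM_DISCARDED flag; it is not an existing bufmgr API):

#include "postgres.h"
#include "storage/bufmgr.h"

static bool
BufferIsDiscardedSketch(Buffer buf)
{
    /* stand-in: the real check would test BM_DISCARDED in the buffer header */
    (void) buf;
    return false;
}

/*
 * Lock the buffer holding the previous transaction's undo header; back
 * off if a discard has already claimed it.  Returns true if the caller
 * may go on to update (and WAL-log) the header.
 */
static bool
lock_xact_header_if_not_discarded(Buffer buf)
{
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    if (BufferIsDiscardedSketch(buf))
    {
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        return false;           /* the discarder got there first */
    }
    return true;
}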

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Andres Freund
On 2019-08-20 09:44:23 -0700, Andres Freund wrote:
> On 2019-08-20 21:02:18 +1200, Thomas Munro wrote:
> > Aside from code changes based on review (and I have more to come of
> > those), the attached experimental patchset (also at
> > https://github.com/EnterpriseDB/zheap/tree/undo) has a new protocol
> > that, I hope, allows for better concurrency, reliability and
> > readability, and removes a bunch of TODO notes about questionable
> > interlocking.  However, I'm not quite done figuring out if the bufmgr
> > interaction is right and will be manageable on the undoaccess side, so
> > I'm hoping to get some feedback, not asking for anyone to rebase on
> > top of it yet.
> > 
> > Previously, there were two LWLocks used to make sure that all
> > discarding was prevented while anyone was reading or writing data in
> > any part of an undo log, and (probably more problematically) vice
> > versa.  Here's a new approach that removes that blocking:

Oh,  more point I forgot to add: Cool!




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Andres Freund
Hi,

On 2019-08-20 17:11:38 +0530, Dilip Kumar wrote:
> On Wed, Aug 14, 2019 at 10:35 PM Andres Freund  wrote:
> > On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote:
> > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:
> 
> > I don't think we can normally pin the undo buffers properly at that
> > stage. Without knowing the correct contents of the table page - which we
> > can't know without holding some form of lock preventing modifications -
> > we can't know how big our undo records are going to be. And we can't
> > just have buffers that don't exist on disk in shared memory, and we
> > don't want to allocate undo that we then don't need. So I think what
> > we'd have to do at that stage, is to "pre-allocate" buffers for the
> > maximum amount of UNDO needed, but mark the associated bufferdesc as not
> > yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID
> > would not be set.
> >
> > So at the start of a function that will need to insert undo we'd need to
> > pre-reserve the maximum number of buffers we could potentially
> > need. That reservation stage would
> >
> > a) pin the page with the current end of the undo
> > b) if needed pin the page of older undo that we need to update (e.g. to
> >    update the next pointer)
> > c) perform clock sweep etc to acquire (find or create) enough clean
> >    buffers to hold the maximum amount of undo needed. These buffers would
> >    be marked as !BM_TAG_VALID | BUF_REFCOUNT_ONE.
> >
> > I assume that we'd make a) cheap by keeping it pinned for undo logs that
> > a backend is actively attached to. b) should only be needed once in a
> > transaction, so it's not too bad. c) we'd probably need to amortize
> > across multiple undo insertions, by keeping the unused buffers pinned
> > until the end of the transaction.
> >
> > I assume that having the infrastructure for c) might also make some
> > code already in postgres easier. There are obviously some issues around
> > guaranteeing that the maximum number of such buffers isn't high.
> 
> 
> I have analyzed this further, and I think there is a problem if the
> record(s) will not fit into the current undo log and we have to switch
> logs.  Before knowing the actual record length we are not sure whether
> the undo log will switch or not, nor which undo log we will get.  And
> without knowing the logno (rnode), how are we going to pin the buffers?
> Am I missing something?

That's precisely why I was suggesting (at the start of the quoted block
above) to not associate the buffers with pages at that point. Instead
just have clean, pinned, *unassociated* buffers. Which can be
re-associated without any IO.


> Apart from this, while analyzing the other code I have noticed that in
> the current PG code we have a few occurrences where we try to read a
> buffer while already holding a buffer lock.

> 1.
> In gistplacetopage
> {
> ...
> for (; ptr; ptr = ptr->next)
> {
> /* Allocate new page */
> ptr->buffer = gistNewBuffer(rel);
> GISTInitBuffer(ptr->buffer, (is_leaf) ? F_LEAF : 0);
> ptr->page = BufferGetPage(ptr->buffer);
> ptr->block.blkno = BufferGetBlockNumber(ptr->buffer);
> }
> 2. During page split we find a new buffer while holding the lock on the
> current buffer.
> 
> That doesn't mean that we can't do better but I am just referring to
> the existing code where we already have such issues.

Those are pretty clearly edge-cases, whereas the undo case at hand is a
very common path. Note again that heapam.c goes to considerable trouble
to never do this for common cases.

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Andres Freund
Hi,

On 2019-08-20 21:02:18 +1200, Thomas Munro wrote:
> Aside from code changes based on review (and I have more to come of
> those), the attached experimental patchset (also at
> https://github.com/EnterpriseDB/zheap/tree/undo) has a new protocol
> that, I hope, allows for better concurrency, reliability and
> readability, and removes a bunch of TODO notes about questionable
> interlocking.  However, I'm not quite done figuring out if the bufmgr
> interaction is right and will be manageable on the undoaccess side, so
> I'm hoping to get some feedback, not asking for anyone to rebase on
> top of it yet.
> 
> Previously, there were two LWLocks used to make sure that all
> discarding was prevented while anyone was reading or writing data in
> any part of an undo log, and (probably more problematically) vice
> versa.  Here's a new approach that removes that blocking:
> 
> 1.  Anyone is allowed to try to read or write data at any UndoRecPtr
> that has been allocated, through the buffer pool (though you'd usually
> want to check it with UndoRecPtrIsDiscarded() first, and only rely on
> the system I'm describing to deal with races).
> 
> 2.  ReadBuffer() might return InvalidBuffer.  This can happen for a
> cache miss, if the smgrread implementation wants to indicate that the
> buffer has been discarded/truncated and that is expected (md.c won't
> ever do that, but undofile.c can).

Hm. This gives me a bit of a stomach ache. It somehow feels like a weird
form of signalling.  Can't quite put my finger on why it makes me feel
queasy.


> 3.  UndoLogDiscard() uses DiscardBuffer() to invalidate any currently
> unpinned buffers, and marks as BM_DISCARDED any that happen to be
> pinned right now, so they can't be immediately invalidated.  Such
> buffers are never written back and are eligible for reuse on the next
> clock sweep, even if they're written into by a backend that managed to
> do that when we were trying to discard.

Hm. When is it legitimate for a backend to write into such a buffer? I
guess that's about updating the previous transaction's next pointer? Or
progress info?


> 5.  Separating begin from discard allows the WAL logging for
> UndoLogDiscard() to do filesystem actions before logging, and other
> effects after logging, which have several nice properties if you work
> through the various crash scenarios.

Hm. ISTM we always need to log before doing some filesystem operation
(see also my recent complaint Robert and I are discussing at the bottom
of [1]). It's just that we can have a separate stage afterwards?

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoZc5JVYORsGYs8YnkSxUC%3DcLQF1Z%2BfcpH2TTKvqkS7MFg%40mail.gmail.com



> So now I'd like to get feedback on the sanity of this scheme.  I'm not
> saying it doesn't have bugs right now -- I've been trying to figure
> out good ways to test it and I'm not quite there yet -- but the
> concept.  One observation I have is that there were already code paths
> in undoaccess.c that can tolerate InvalidBuffer in recovery, due to
> the potentially different discard timing for DO vs REDO.  I think
> that's a point in favour of this scheme, but I can see that it's
> inconvenient to have to deal with InvalidBuffer whenever you read.

FWIW, I'm far from convinced that those are currently quite right. See
discussion pointed to above.



> I pulled in the latest code from undoprocessing as of today, and I
> might be a bit confused about "Defect and enhancement in multi-log
> support" some of which I have squashed into the  make undolog patch.
> BTW undoprocessing builds with uninitialized-variable warnings in xact.c
> on clang today.

I've complained about the existence of that commit multiple times
now. So far without any comments.

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Andres Freund
Hi,

On 2019-08-20 09:08:29 -0400, Robert Haas wrote:
> On Sat, Aug 17, 2019 at 1:28 PM Andres Freund  wrote:
> > The primary one in the context here is that if we do *not* have to lock
> > the buffers all ahead of time, we can simplify the interface. We
> > certainly can't lock the buffers over IO (due to buffer reclaim) as
> > we're doing right now, so we'd need another phase, called by the "user"
> > during undo insertion. But if we do not need to lock the buffers before
> > the insertion over all starts, the inserting location doesn't have to
> > care.
> >
> > Secondarily, all the reasoning for needing to lock all buffers ahead of
> > time was imo fairly unconvincing. Following the "recipe" for WAL
> > insertions is a good idea when writing a new run-of-the-mill WAL
> > inserting location - but when writing a new fundamental facility, that
> > already needs to modify how WAL works, then I find that much less
> > convincing.
> 
> So, you seem to be talking about something here which is different
> than what I thought we were talking about.  One question is whether we
> need to lock all of the buffers "ahead of time," and I think the
> answer to that question is probably "no." Since nobody else can be
> writing to those buffers, and probably also nobody can be reading them
> except maybe for some debugging tool, it should be fine if we enter
> the critical section and then lock them at the point when we write the
> bytes. I mean, there shouldn't be any contention, and I don't see any
> other problems.

Right. As long as we are as restrictive about the number of buffers per
undo record, and number of records per WAL insertions, I don't see any
need to go further than that.


> > Well, in the version of code that I was reviewing here, I don't think there is
> > such a limit (there is a limit for buffers per undo record, but no limit
> > on the number of records inserted together). I think Dilip added a limit
> > since.  And we have the issue of a lot of IO happening while holding
> > content locks on several pages.  So I don't think it's a straw man at
> > all.
> 
> Hmm, what do you mean by "a lot of IO happening while holding content
> locks on several pages"? We might XLogInsert() but there shouldn't be
> any buffer I/O going on at that point.

That's my primary complaint with how the code is structured right
now. Right now we potentially perform IO while holding exclusive content
locks, often multiple ones even. When acquiring target pages for undo, we
currently always hold the table page exclusively locked already, and if
there's more than one buffer for undo, we'll also hold the previous
buffers locked. And acquiring a buffer will often have to write out a
dirty buffer to the OS, and a lot of times that will then also require
the kernel to flush data out.  That's imo an absolute no-go for the
general case.


> If there is, I think that should be redesigned.  We should collect
> buffer pins first, without locking.  Then lock.  Then write.  Or maybe
> lock-and-write, but only after everything's pinned.

Right. It's easy enough to do that for the locks on undo pages
themselves. The harder part is the content lock on the "table page" - we
don't accurately know how many undo buffers we will need, without
holding the table lock (or preventing modifications in some other
manner).

I tried to outline the problem and potential solutions in more detail
in:
https://www.postgresql.org/message-id/20190814065745.2faw3hirvfhbrdwe%40alap3.anarazel.de


> The fact of calling XLogInsert() while holding buffer locks is not
> great, but I don't think it's any worse here than in any other part of
> the system, because the undo buffers aren't going to be suffering
> concurrent access from any other backend, and because there shouldn't
> be more than a few of them.

Yea. That's obviously undesirable, but also fundamentally required at
least in the general case. And it's not at all specific to undo.

> [ WAL logging protocol ]

> I didn't intend to suggest that you don't want this to be documented.
> What I intended to suggest was that you seem to want to deviate from
> the documented rules, and it seems to me that we shouldn't do that
> unless we change the rules first, and I don't know what you think the
> rules should be or why those rules are safe.

IDK. We have at least five different places that at the very least bend
the rules - but with a comment explaining why it's safe in the specific
case. Personally I don't really think the generic guideline needs to
list every potential edge-case.


> > I think what primarily makes me concerned is that it's not clear to me
> > what guarantees that discard is the only reason for the block to
> potentially be missing. In contrast to most other similar cases where WAL
> > replay simply re-creates the objects when trying to replay an action
> > affecting such an object, here we simply skip over the WAL logged
> > operation. So if e.g. the entire underlying UNDO file got lost, we
> > neither 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Mon, Aug 19, 2019 at 8:22 AM Amit Kapila  wrote:
> One point to remember in this regard is that we do need to modify the
> LSN in undo pages after writing WAL, so all the undo pages need to be
> locked by that time or we again need to take the lock on them.

Uh, but a big part of the point of setting the LSN on the pages is to
keep them from being written out before the corresponding WAL is
flushed to disk. If you released and reacquired the lock, the page
could be written out during the window in the middle.
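
For reference, the ordering that guarantee relies on is the usual one; a generic sketch, not undo-specific, with rmid/info standing in for whatever record the caller emits:

#include "postgres.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"

/*
 * The buffer stays exclusively locked from the modification through
 * PageSetLSN(), so it cannot be written out while its WAL record is
 * still unflushed.
 */
static void
wal_logged_page_update_sketch(Buffer buf, RmgrId rmid, uint8 info)
{
    Page        page = BufferGetPage(buf);
    XLogRecPtr  lsn;

    /* caller already holds the exclusive content lock on buf */
    START_CRIT_SECTION();

    /* ... modify the page contents here ... */
    MarkBufferDirty(buf);

    XLogBeginInsert();
    XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
    lsn = XLogInsert(rmid, info);

    PageSetLSN(page, lsn);

    END_CRIT_SECTION();

    /* only now is it safe to release the content lock */
}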

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Tue, Aug 20, 2019 at 2:42 AM Amit Kapila  wrote:
> > Well, my main point, which so far has largely been ignored, was that we
> > may not acquire page locks when we still need to search for victim
> > buffers later. If we don't need to lock the pages up-front, but only do
> > so once we're actually copying the records into the undo pages, then we
> > don't need a separate phase to acquire the locks. We can still hold all of
> > the page locks at the same time, as long as we just acquire them at the
> > later stage.
>
> Okay, IIUC, this means that we should have a separate phase where we
> call LockUndoBuffers (or something like that) before
> InsertPreparedUndo and after PrepareUndoInsert.  The LockUndoBuffers
> will lock all the buffers pinned during PrepareUndoInsert.  We can
> probably call LockUndoBuffers before entering the critical section to
> avoid any kind of failure in critical section.  If so, that sounds
> reasonable to me.

I'm kind of scratching my head here, because this is clearly different
than what Andres said in the quoted text to which you were replying.
He clearly implied that we should acquire the buffer locks within the
critical section during InsertPreparedUndo, and you responded by
proposing to do it outside the critical section in a separate step.
Regardless of which way is actually better, when somebody says "hey,
let's do A!" and you respond by saying "sounds good, I'll go implement
B!" that's not really helping us to get toward a solution.

FWIW, although I also thought of doing what you are describing here, I
think Andres's proposal is probably preferable, because it's simpler.
There's not really any reason why we can't take the buffer locks from
within the critical section, and that way callers don't have to deal
with the extra step.

> >  My secondary point was that *none* of this actually is
> > documented, even if it's entirely unobvious to the reader that the
> > relevant code can only run during WAL insertion, due to being pretty far
> > removed from that.
>
> I think this can be clearly mentioned in README or someplace else.

It also needs to be adequately commented in the files and functions involved.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Mon, Aug 19, 2019 at 5:16 PM Andres Freund  wrote:
> Well, my main point, which so far has largely been ignored, was that we
> may not acquire page locks when we still need to search for victim
> buffers later. If we don't need to lock the pages up-front, but only do
> so once we're actually copying the records into the undo pages, then we
> don't a separate phase to acquire the locks. We can still hold all of
> the page locks at the same time, as long as we just acquire them at the
> later stage.

+1 for that approach.  I am in complete agreement.

> My secondary point was that *none* of this actually is
> documented, even if it's entirely unobvious to the reader that the
> relevant code can only run during WAL insertion, due to being pretty far
> removed from that.

+1 also for properly documenting stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Sat, Aug 17, 2019 at 1:28 PM Andres Freund  wrote:
> The primary one in the context here is that if we do *not* have to lock
> the buffers all ahead of time, we can simplify the interface. We
> certainly can't lock the buffers over IO (due to buffer reclaim) as
> we're doing right now, so we'd need another phase, called by the "user"
> during undo insertion. But if we do not need to lock the buffers before
> the insertion over all starts, the inserting location doesn't have to
> care.
>
> Secondarily, all the reasoning for needing to lock all buffers ahead of
> time was imo fairly unconvincing. Following the "recipe" for WAL
> insertions is a good idea when writing a new run-of-the-mill WAL
> inserting location - but when writing a new fundamental facility, that
> already needs to modify how WAL works, then I find that much less
> convincing.

So, you seem to be talking about something here which is different
than what I thought we were talking about.  One question is whether we
need to lock all of the buffers "ahead of time," and I think the
answer to that question is probably "no." Since nobody else can be
writing to those buffers, and probably also nobody can be reading them
except maybe for some debugging tool, it should be fine if we enter
the critical section and then lock them at the point when we write the
bytes. I mean, there shouldn't be any contention, and I don't see any
other problems.

The other question is whether need to hold all of the buffer locks at
the same time, and that seems a lot more problematic to me.  It's hard
to say exactly whether this unsafe, because it depends on exactly what
you think we're doing here, and I don't see that you've really spelled
that out.  The normal thing to do is call PageSetLSN() on every page
before releasing the buffer lock, and that means holding all the
buffer locks until after we've called XLogInsert(). Now, you could
argue that we should skip setting the page LSN because the data ahead
of the insertion pointer is effectively invisible anyway, but I bet
that causes problems with checksums, at least, since they rely on the
page LSN being accurate to know whether to emit WAL when a buffer is
written. You could argue that we could do the XLogInsert() first and
only after that lock and dirty the pages one by one, but I think that
might break checkpoint interlocking, since it would then be possible
for the checkpoint scan to pass over a buffer that does not appear to
need writing for the current checkpoint but later gets dirtied and
stamped with an LSN that would have caused it to be written had it
been there at the time the checkpoint scan reached it. I really can't
rule out the possibility that there's some way to make something in
this area work, but I don't know what it is, and I think it's a fairly
risky area to go tinkering.

Well, in the version of code that I was reviewing here, I don't think there is
> such a limit (there is a limit for buffers per undo record, but no limit
> on the number of records inserted together). I think Dilip added a limit
> since.  And we have the issue of a lot of IO happening while holding
> content locks on several pages.  So I don't think it's a straw man at
> all.

Hmm, what do you mean by "a lot of IO happening while holding content
locks on several pages"? We might XLogInsert() but there shouldn't be
any buffer I/O going on at that point.  If there is, I think that
should be redesigned.  We should collect buffer pins first, without
locking.  Then lock.  Then write.  Or maybe lock-and-write, but only
after everything's pinned.  The fact of calling XLogInsert() while
holding buffer locks is not great, but I don't think it's any worse
here than in any other part of the system, because the undo buffers
aren't going to be suffering concurrent access from any other backend,
and because there shouldn't be more than a few of them.
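
Spelled out, that ordering is something like the following sketch (how many undo buffers there are and what exactly gets copied are assumed):

#include "postgres.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"

/*
 * Pin everything before the critical section (so no I/O can happen
 * inside it), then lock and write.  The WAL insertion and PageSetLSN()
 * calls that belong in the same critical section are elided here.
 */
static void
write_undo_bytes_sketch(Buffer *bufs, int nbufs)
{
    int         i;

    /* phase 1: caller has already pinned bufs[0..nbufs-1]; no locks held */

    START_CRIT_SECTION();

    for (i = 0; i < nbufs; i++)
    {
        /* phase 2: lock, then copy this buffer's share of the record */
        LockBuffer(bufs[i], BUFFER_LOCK_EXCLUSIVE);
        /* ... memcpy record bytes into BufferGetPage(bufs[i]) ... */
        MarkBufferDirty(bufs[i]);
    }

    /* phase 3: XLogInsert() and PageSetLSN() on each page would go here */

    END_CRIT_SECTION();

    for (i = 0; i < nbufs; i++)
        LockBuffer(bufs[i], BUFFER_LOCK_UNLOCK);
}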

> > 2. The write-ahead logging protocol says that you're supposed to lock
> > all the buffers at once.  See src/backend/access/transam/README.  If
> > you want to go patch that file, then this patch can follow whatever
> > the locking rules in the patched version are.  But until then, the
> > patch should follow *the actual rules* not some other protocol based
> > on a hand-wavy explanation in an email someplace. Otherwise, you've
> > got the same sort of undocumented disaster-waiting-to-happen that you
> > keep complaining about in other parts of this patch.  We need fewer of
> > those, not more!
>
> But that's not what I'm asking for? I don't even know where you got the
> idea that I don't want this to be documented. I'm mainly asking for a
> comment explaining why the current behaviour is what it is. Because I
> don't think an *implicit* "normal WAL logging rules" is sufficient
> explanation, because all the locking here happens one or two layers away
> from the WAL logging site - so it's absolutely *NOT* obvious that that's
> the explanation. And I don't think any of the locking sites actually 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Dilip Kumar
On Wed, Aug 14, 2019 at 10:35 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote:
> > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:

> I don't think we can normally pin the undo buffers properly at that
> stage. Without knowing the correct contents of the table page - which we
> can't know without holding some form of lock preventing modifications -
> we can't know how big our undo records are going to be. And we can't
> just have buffers that don't exist on disk in shared memory, and we
> don't want to allocate undo that we then don't need. So I think what
> we'd have to do at that stage, is to "pre-allocate" buffers for the
> maximum amount of UNDO needed, but mark the associated bufferdesc as not
> yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID
> would not be set.
>
> So at the start of a function that will need to insert undo we'd need to
> pre-reserve the maximum number of buffers we could potentially
> need. That reservation stage would
>
> a) pin the page with the current end of the undo
> b) if needed pin the page of older undo that we need to update (e.g. to
>    update the next pointer)
> c) perform clock sweep etc to acquire (find or create) enough clean
>    buffers to hold the maximum amount of undo needed. These buffers would
>    be marked as !BM_TAG_VALID | BUF_REFCOUNT_ONE.
>
> I assume that we'd make a) cheap by keeping it pinned for undo logs that
> a backend is actively attached to. b) should only be needed once in a
> transaction, so it's not too bad. c) we'd probably need to amortize
> across multiple undo insertions, by keeping the unused buffers pinned
> until the end of the transaction.
>
> I assume that having the infrastructure for c) might also make some
> code already in postgres easier. There are obviously some issues around
> guaranteeing that the maximum number of such buffers isn't high.


I have analyzed this further, and I think there is a problem if the
record(s) will not fit into the current undo log and we have to switch
logs.  Before knowing the actual record length we are not sure whether
the undo log will switch or not, nor which undo log we will get.  And
without knowing the logno (rnode), how are we going to pin the buffers?
Am I missing something?

Thomas do you think we can get around this problem?

Apart from this, while analyzing the other code I have noticed that in
the current PG code we have a few occurrences where we try to read a
buffer while already holding a buffer lock.
1.
In gistplacetopage
{
...
for (; ptr; ptr = ptr->next)
{
/* Allocate new page */
ptr->buffer = gistNewBuffer(rel);
GISTInitBuffer(ptr->buffer, (is_leaf) ? F_LEAF : 0);
ptr->page = BufferGetPage(ptr->buffer);
ptr->block.blkno = BufferGetBlockNumber(ptr->buffer);
}
2. During a page split we find a new buffer while holding the lock on
the current buffer.

That doesn't mean we can't do better, but I am just referring to
existing code that already has such issues.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Amit Kapila
On Tue, Aug 20, 2019 at 2:46 AM Andres Freund  wrote:
>
> On 2019-08-19 17:52:24 +0530, Amit Kapila wrote:
> > On Sat, Aug 17, 2019 at 10:58 PM Andres Freund  wrote:
> > >
> > > > Well, I don't understand why you're on about this.  We've discussed it
> > > > a number of times but I'm still confused.
> > >
> > > There's two reasons here:
> > >
> > > The primary one in the context here is that if we do *not* have to lock
> > > the buffers all ahead of time, we can simplify the interface. We
> > > certainly can't lock the buffers over IO (due to buffer reclaim) as
> > > we're doing right now, so we'd need another phase, called by the "user"
> > > during undo insertion. But if we do not need to lock the buffers before
> > > the insertion over all starts, the inserting location doesn't have to
> > > care.
> > >
> > > Secondarily, all the reasoning for needing to lock all buffers ahead of
> > > time was imo fairly unconvincing. Following the "recipe" for WAL
> > > insertions is a good idea when writing a new run-of-the-mill WAL
> > > inserting location - but when writing a new fundamental facility, that
> > > already needs to modify how WAL works, then I find that much less
> > > convincing.
> > >
> >
> > One point to remember in this regard is that we do need to modify the
> > LSN in undo pages after writing WAL, so all the undo pages need to be
> > locked by that time or we again need to take the lock on them.
>
> Well, my main point, which so far has largely been ignored, was that we
> may not acquire page locks when we still need to search for victim
> buffers later. If we don't need to lock the pages up-front, but only do
> so once we're actually copying the records into the undo pages, then we
> don't a separate phase to acquire the locks. We can still hold all of
> the page locks at the same time, as long as we just acquire them at the
> later stage.
>

Okay, IIUC, this means that we should have a separate phase where we
call LockUndoBuffers (or something like that) after PrepareUndoInsert
and before InsertPreparedUndo.  LockUndoBuffers will lock all the
buffers pinned during PrepareUndoInsert.  We can probably call
LockUndoBuffers before entering the critical section to avoid any kind
of failure inside the critical section.  If so, that sounds reasonable
to me.
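
To make sure we are talking about the same thing, here is a rough
sketch of the call sequence being proposed (LockUndoBuffers is the
hypothetical new phase; the other routines are from the patch, and the
exact arguments shown here are assumptions, not the real API):

    /* outside the critical section: compute sizes, find and pin undo buffers */
    urecptr = PrepareUndoInsert(&context, &undorecord, fxid);

    /* ... lock the table page, do any remaining work that could still fail ... */

    /* proposed new phase: lock the undo buffers pinned above */
    LockUndoBuffers(&context);

    START_CRIT_SECTION();

    /* modify the table page and copy the record into the locked undo buffers */
    InsertPreparedUndo(&context);
    MarkBufferDirty(undo_buffer);

    recptr = XLogInsert(rmid, info);
    PageSetLSN(undo_page, recptr);      /* undo pages get the WAL LSN too */

    END_CRIT_SECTION();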

>  My secondary point was that *none* of this actually is
> documented, even if it's entirely unobvious to the reader that the
> relevant code can only run during WAL insertion, due to being pretty far
> removed from that.
>

I think this can be clearly mentioned in the README or someplace else.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-19 Thread Andres Freund
On 2019-08-19 17:52:24 +0530, Amit Kapila wrote:
> On Sat, Aug 17, 2019 at 10:58 PM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2019-08-17 12:05:21 -0400, Robert Haas wrote:
> > > On Wed, Aug 14, 2019 at 12:39 PM Andres Freund  wrote:
> > > > > > Again, I think it's not ok to just assume you can lock an 
> > > > > > essentially
> > > > > > unbounded number of buffers. This seems almost guaranteed to result 
> > > > > > in
> > > > > > deadlocks. And there's limits on how many lwlocks one can hold etc.
> > > > >
> > > > > I think for controlling that we need to put a limit on max prepared
> > > > > undo?  I am not sure any other way of limiting the number of buffers
> > > > > because we must lock all the buffer in which we are going to insert
> > > > > the undo record under one WAL logged operation.
> > > >
> > > > I heard that a number of times. But I still don't know why that'd
> > > > actually be true. Why would it not be sufficient to just lock the buffer
> > > > currently being written to, rather than all buffers? It'd require a bit
> > > > of care updating the official current "logical end" of a log, but
> > > > otherwise ought to not be particularly hard? Only one backend can extend
> > > > the log after all, and until the log is externally visibly extended,
> > > > nobody can read or write those buffers, no?
> > >
> > > Well, I don't understand why you're on about this.  We've discussed it
> > > a number of times but I'm still confused.
> >
> > There's two reasons here:
> >
> > The primary one in the context here is that if we do *not* have to lock
> > the buffers all ahead of time, we can simplify the interface. We
> > certainly can't lock the buffers over IO (due to buffer reclaim) as
> > we're doing right now, so we'd need another phase, called by the "user"
> > during undo insertion. But if we do not need to lock the buffers before
> > the insertion over all starts, the inserting location doesn't have to
> > care.
> >
> > Secondarily, all the reasoning for needing to lock all buffers ahead of
> > time was imo fairly unconvincing. Following the "recipe" for WAL
> > insertions is a good idea when writing a new run-of-the-mill WAL
> > inserting location - but when writing a new fundamental facility, that
> > already needs to modify how WAL works, then I find that much less
> > convincing.
> >
> 
> One point to remember in this regard is that we do need to modify the
> LSN in undo pages after writing WAL, so all the undo pages need to be
> locked by that time or we again need to take the lock on them.

Well, my main point, which so far has largely been ignored, was that we
may not acquire page locks when we still need to search for victim
buffers later. If we don't need to lock the pages up-front, but only do
so once we're actually copying the records into the undo pages, then we
don't need a separate phase to acquire the locks. We can still hold all of
the page locks at the same time, as long as we just acquire them at the
later stage.  My secondary point was that *none* of this actually is
documented, even if it's entirely unobvious to the reader that the
relevant code can only run during WAL insertion, due to being pretty far
removed from that.

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-08-19 Thread Amit Kapila
On Sat, Aug 17, 2019 at 10:58 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-17 12:05:21 -0400, Robert Haas wrote:
> > On Wed, Aug 14, 2019 at 12:39 PM Andres Freund  wrote:
> > > > > Again, I think it's not ok to just assume you can lock an essentially
> > > > > unbounded number of buffers. This seems almost guaranteed to result in
> > > > > deadlocks. And there's limits on how many lwlocks one can hold etc.
> > > >
> > > > I think for controlling that we need to put a limit on max prepared
> > > > undo?  I am not sure any other way of limiting the number of buffers
> > > > because we must lock all the buffer in which we are going to insert
> > > > the undo record under one WAL logged operation.
> > >
> > > I heard that a number of times. But I still don't know why that'd
> > > actually be true. Why would it not be sufficient to just lock the buffer
> > > currently being written to, rather than all buffers? It'd require a bit
> > > of care updating the official current "logical end" of a log, but
> > > otherwise ought to not be particularly hard? Only one backend can extend
> > > the log after all, and until the log is externally visibly extended,
> > > nobody can read or write those buffers, no?
> >
> > Well, I don't understand why you're on about this.  We've discussed it
> > a number of times but I'm still confused.
>
> There's two reasons here:
>
> The primary one in the context here is that if we do *not* have to lock
> the buffers all ahead of time, we can simplify the interface. We
> certainly can't lock the buffers over IO (due to buffer reclaim) as
> we're doing right now, so we'd need another phase, called by the "user"
> during undo insertion. But if we do not need to lock the buffers before
> the insertion over all starts, the inserting location doesn't have to
> care.
>
> Secondarily, all the reasoning for needing to lock all buffers ahead of
> time was imo fairly unconvincing. Following the "recipe" for WAL
> insertions is a good idea when writing a new run-of-the-mill WAL
> inserting location - but when writing a new fundamental facility, that
> already needs to modify how WAL works, then I find that much less
> convincing.
>

One point to remember in this regard is that we do need to modify the
LSN in undo pages after writing WAL, so all the undo pages need to be
locked by that time or we again need to take the lock on them.

>
> > 1. It's absolutely fine to just put a limit on this, because the
> > higher-level facilities that use this shouldn't be doing a single
> > WAL-logged operation that touches a zillion buffers.

Right, by default a WAL record can only cover 4 buffers.  If we need
to touch more buffers, then the caller needs to call
XLogEnsureRecordSpace.  So, I agree with the point that generally it
should be a few buffers (2 or 3) of undo that need to be touched in a
single operation, and if there are more, either the callers need to
change or at the very least they need to be careful about the same.
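
Just to make the existing mechanism concrete, the caller would do
something along these lines (a sketch only; the rmgr id, info flag and
register flags are placeholders, not names from the patch):

    /* must be called before XLogBeginInsert() */
    XLogEnsureRecordSpace(nbuffers - 1, 0);     /* first argument is the max block_id */

    XLogBeginInsert();
    for (i = 0; i < nbuffers; i++)
        XLogRegisterBuffer(i, undo_buffers[i], 0);
    recptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDO_INSERT);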

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-19 Thread Dilip Kumar
On Wed, Aug 14, 2019 at 10:35 PM Andres Freund  wrote:
>

>
> > >   - When reading an undo record, the whole stage of UnpackUndoData()
> > > reading data into the UndoPackContext is omitted, reading directly
> > > into the UnpackedUndoRecord. That removes one further copy of the
> > > record format.
> > So we will read member by member to UnpackedUndoRecord?  because in
> > context we have at least a few headers packed and we can memcpy one
> > header at a time like UndoRecordHeader, UndoRecordBlock.
>
> Well, right now you then copy them again later, so not much is gained by
> that (although that later copy can happen without the content lock
> held). As I think I suggested before, I suspect that the best way would
> be to just memcpy() the data from the page(s) into an appropriately
> sized buffer with the content lock held, and then perform unpacking
> directly into UnpackedUndoRecord. Especially with the bulk API that will
> avoid having to do much work with locks held, and reduce memory usage by
> only unpacking the record(s) in a batch that are currently being looked
> at.
>
>
> > But that just a few of them so if we copy field by field in the
> > UnpackedUndoRecord then we can get rid of copying in context then copy
> > it back to the UnpackedUndoRecord.  Is this is what in your mind or
> > you want to store these structures (UndoRecordHeader, UndoRecordBlock)
> > directly into UnpackedUndoRecord?
>
> I at the moment see no reason not to?

Currently, in UnpackedUndoRecord we store directly all the members
that are set by the caller.  We store pointers to some headers which
are allocated internally by the undo layer, and the caller need not
worry about setting them.  So now you are suggesting putting the other
headers as structures in UnpackedUndoRecord as well.  I as such don't
have much problem in doing that, but I think Robert initially designed
the UnpackedUndoRecord structure this way, so it would be good if
Robert provides his opinion on this.
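
For what it's worth, a rough sketch of what embedding the headers could
look like; the member names and layout here are only illustrative, not
the patch's actual definition of UnpackedUndoRecord:

    typedef struct UnpackedUndoRecord
    {
        /* fixed-size headers stored inline instead of palloc'd pointers */
        UndoRecordHeader hdr;           /* always present */
        UndoRecordBlock  blk;           /* block information */

        /* variable-length parts still referenced separately */
        StringInfoData   payload;
        StringInfoData   tuple;
    } UnpackedUndoRecord;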

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-18 Thread Dilip Kumar
On Sat, Aug 17, 2019 at 9:35 PM Robert Haas  wrote:
>
> On Wed, Aug 14, 2019 at 12:39 PM Andres Freund  wrote:
> > > > Again, I think it's not ok to just assume you can lock an essentially
> > > > unbounded number of buffers. This seems almost guaranteed to result in
> > > > deadlocks. And there's limits on how many lwlocks one can hold etc.
> > >
> > > I think for controlling that we need to put a limit on max prepared
> > > undo?  I am not sure any other way of limiting the number of buffers
> > > because we must lock all the buffer in which we are going to insert
> > > the undo record under one WAL logged operation.
> >
> > I heard that a number of times. But I still don't know why that'd
> > actually be true. Why would it not be sufficient to just lock the buffer
> > currently being written to, rather than all buffers? It'd require a bit
> > of care updating the official current "logical end" of a log, but
> > otherwise ought to not be particularly hard? Only one backend can extend
> > the log after all, and until the log is externally visibly extended,
> > nobody can read or write those buffers, no?
>
> Well, I don't understand why you're on about this.  We've discussed it
> a number of times but I'm still confused.  I'll repeat my previous
> arguments on-list:
>
> 1. It's absolutely fine to just put a limit on this, because the
> higher-level facilities that use this shouldn't be doing a single
> WAL-logged operation that touches a zillion buffers.  We have been
> careful to avoid having WAL-logged operations touch an unbounded
> number of buffers in plenty of other places, like the btree code, and
> we are going to have to be similarly careful here for multiple
> reasons, deadlock avoidance being one.  So, saying, "hey, you're going
> to lock an unlimited number of buffers" is a straw man.  We aren't.
> We can't.

Right.  So basically, we need to put a limit on the maximum number of
undo records that can be prepared under a single WAL-logged operation,
and that will internally put a limit on the number of undo buffers.
Suppose we limit max_prepared_undo to 2; then we need to lock at most 5
undo buffers.  We need to somehow deal with the multi-insert code in
zheap, because in that code, for inserting into a single page, we write
one undo record per range if the tuples we are inserting on a single
page are interleaved.  But maybe we can handle that by just inserting
one undo record which can have multiple ranges.
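
To illustrate the multiple-ranges idea, the payload could be shaped
roughly like this (all names here are hypothetical, just to show the
idea):

    typedef struct UndoOffsetRange
    {
        OffsetNumber    start_offset;   /* first offset in the range */
        OffsetNumber    end_offset;     /* last offset in the range */
    } UndoOffsetRange;

    typedef struct UndoMultiInsertPayload
    {
        int             nranges;
        UndoOffsetRange ranges[FLEXIBLE_ARRAY_MEMBER];
    } UndoMultiInsertPayload;

That way a single WAL record still maps to a single undo record, no
matter how the new tuples are interleaved on the page.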

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-17 Thread Andres Freund
Hi,

On 2019-08-17 12:05:21 -0400, Robert Haas wrote:
> On Wed, Aug 14, 2019 at 12:39 PM Andres Freund  wrote:
> > > > Again, I think it's not ok to just assume you can lock an essentially
> > > > unbounded number of buffers. This seems almost guaranteed to result in
> > > > deadlocks. And there's limits on how many lwlocks one can hold etc.
> > >
> > > I think for controlling that we need to put a limit on max prepared
> > > undo?  I am not sure any other way of limiting the number of buffers
> > > because we must lock all the buffer in which we are going to insert
> > > the undo record under one WAL logged operation.
> >
> > I heard that a number of times. But I still don't know why that'd
> > actually be true. Why would it not be sufficient to just lock the buffer
> > currently being written to, rather than all buffers? It'd require a bit
> > of care updating the official current "logical end" of a log, but
> > otherwise ought to not be particularly hard? Only one backend can extend
> > the log after all, and until the log is externally visibly extended,
> > nobody can read or write those buffers, no?
>
> Well, I don't understand why you're on about this.  We've discussed it
> a number of times but I'm still confused.

There's two reasons here:

The primary one in the context here is that if we do *not* have to lock
the buffers all ahead of time, we can simplify the interface. We
certainly can't lock the buffers over IO (due to buffer reclaim) as
we're doing right now, so we'd need another phase, called by the "user"
during undo insertion. But if we do not need to lock the buffers before
the insertion over all starts, the inserting location doesn't have to
care.

Secondarily, all the reasoning for needing to lock all buffers ahead of
time was imo fairly unconvincing. Following the "recipe" for WAL
insertions is a good idea when writing a new run-of-the-mill WAL
inserting location - but when writing a new fundamental facility, that
already needs to modify how WAL works, then I find that much less
convincing.


> 1. It's absolutely fine to just put a limit on this, because the
> higher-level facilities that use this shouldn't be doing a single
> WAL-logged operation that touches a zillion buffers.  We have been
> careful to avoid having WAL-logged operations touch an unbounded
> number of buffers in plenty of other places, like the btree code, and
> we are going to have to be similarly careful here for multiple
> reasons, deadlock avoidance being one.  So, saying, "hey, you're going
> to lock an unlimited number of buffers" is a straw man.  We aren't.
> We can't.

Well, in the version of the code that I was reviewing here, I don't
think there is such a limit (there is a limit for buffers per undo
record, but no limit on the number of records inserted together). I
think Dilip added a limit since.  And we have the issue of a lot of IO
happening while holding content locks on several pages.  So I don't
think it's a straw man at all.


> 2. The write-ahead logging protocol says that you're supposed to lock
> all the buffers at once.  See src/backend/access/transam/README.  If
> you want to go patch that file, then this patch can follow whatever
> the locking rules in the patched version are.  But until then, the
> patch should follow *the actual rules* not some other protocol based
> on a hand-wavy explanation in an email someplace. Otherwise, you've
> got the same sort of undocumented disaster-waiting-to-happen that you
> keep complaining about in other parts of this patch.  We need fewer of
> those, not more!

But that's not what I'm asking for? I don't even know where you take
from that I don't want this to be documented. I'm mainly asking for a
comment explaining why the current behaviour is what it is. Because I
don't think an *implicit* "normal WAL logging rules" is sufficient
explanation, because all the locking here happens one or two layers away
from the WAL logging site - so it's absolutely *NOT* obvious that that's
the explanation. And I don't think any of the locking sites actually has
comments explaining why the locks are acquired at that time (in fact,
IIRC until the review some even only mentioned pinning, not locking).


> > > Suppose you insert one record for the transaction which is split across
> > > block1 and block2.  Now, before these blocks actually go to disk, the
> > > transaction commits and becomes all-visible, and the undo logs are
> > > discarded.  It's possible that block 1 is completely discarded but
> > > block 2 is not, because it might have undo for the next transaction.
> > > Now, during recovery (FPW is off), if block 1 is missing but block 2 is
> > > there, we need to skip inserting undo for block 1 as it does not
> > > exist.
> >
> > Hm. I'm quite doubtful this is a good idea. How will this not force us
> > to a emit a lot more expensive durable operations while writing undo?
> > And doesn't this reduce error detection quite remarkably?
> >
> > Thomas, Robert?
> 
> I think you're going to 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-17 Thread Robert Haas
On Wed, Aug 14, 2019 at 12:39 PM Andres Freund  wrote:
> > > Again, I think it's not ok to just assume you can lock an essentially
> > > unbounded number of buffers. This seems almost guaranteed to result in
> > > deadlocks. And there's limits on how many lwlocks one can hold etc.
> >
> > I think for controlling that we need to put a limit on max prepared
> > undo?  I am not sure any other way of limiting the number of buffers
> > because we must lock all the buffer in which we are going to insert
> > the undo record under one WAL logged operation.
>
> I heard that a number of times. But I still don't know why that'd
> actually be true. Why would it not be sufficient to just lock the buffer
> currently being written to, rather than all buffers? It'd require a bit
> of care updating the official current "logical end" of a log, but
> otherwise ought to not be particularly hard? Only one backend can extend
> the log after all, and until the log is externally visibly extended,
> nobody can read or write those buffers, no?

Well, I don't understand why you're on about this.  We've discussed it
a number of times but I'm still confused.  I'll repeat my previous
arguments on-list:

1. It's absolutely fine to just put a limit on this, because the
higher-level facilities that use this shouldn't be doing a single
WAL-logged operation that touches a zillion buffers.  We have been
careful to avoid having WAL-logged operations touch an unbounded
number of buffers in plenty of other places, like the btree code, and
we are going to have to be similarly careful here for multiple
reasons, deadlock avoidance being one.  So, saying, "hey, you're going
to lock an unlimited number of buffers" is a straw man.  We aren't.
We can't.

2. The write-ahead logging protocol says that you're supposed to lock
all the buffers at once.  See src/backend/access/transam/README.  If
you want to go patch that file, then this patch can follow whatever
the locking rules in the patched version are.  But until then, the
patch should follow *the actual rules* not some other protocol based
on a hand-wavy explanation in an email someplace. Otherwise, you've
got the same sort of undocumented disaster-waiting-to-happen that you
keep complaining about in other parts of this patch.  We need fewer of
those, not more!

3. There is no reason to care about all of the buffers being locked at
once, because they are not unlimited in number (see point #1) and
nobody else is looking at them anyway (see the final sentence of what
I quoted above).

I think we are, or ought to be, talking about locking 2 (or maybe in
rare cases 3 or 4) undo buffers in connection with a single WAL
record.  If we're talking about more than that, then I think the
higher-level code needs to be changed.  If we're talking about that
many, then we don't need to be clever.  We can just do the standard
thing that the rest of the system does, and it will be fine just like
it is everywhere else.

> > Suppose you insert one record for the transaction which is split across
> > block1 and block2.  Now, before these blocks actually go to disk, the
> > transaction commits and becomes all-visible, and the undo logs are
> > discarded.  It's possible that block 1 is completely discarded but
> > block 2 is not, because it might have undo for the next transaction.
> > Now, during recovery (FPW is off), if block 1 is missing but block 2 is
> > there, we need to skip inserting undo for block 1 as it does not
> > exist.
>
> Hm. I'm quite doubtful this is a good idea. How will this not force us
> to a emit a lot more expensive durable operations while writing undo?
> And doesn't this reduce error detection quite remarkably?
>
> Thomas, Robert?

I think you're going to need to spell out your assumptions in order
for me to be able to comment intelligently.  This is another thing
that seems pretty normal to me.  Generally, WAL replay might need to
recreate objects whose creation is not separately WAL-logged, and it
might need to skip operations on objects that have been dropped later
in the WAL stream and thus don't exist any more. This seems like an
instance of the latter pattern.  There's no reason to try to put valid
data into pages that we know have been discarded, and both inserting
and discarding undo data need to be logged anyway.
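
For comparison, redo code elsewhere in the tree follows roughly this
shape when the target block may no longer exist (a sketch, not the
patch's actual redo routine):

    XLogRedoAction  action;
    Buffer          buffer;

    action = XLogReadBufferForRedo(record, block_id, &buffer);
    if (action == BLK_NEEDS_REDO)
    {
        Page    page = BufferGetPage(buffer);

        /* ... re-apply the logged change to the page ... */
        PageSetLSN(page, lsn);
        MarkBufferDirty(buffer);
    }
    /* BLK_NOTFOUND means the block was dropped/discarded later in the
     * WAL stream, so there is nothing to re-apply. */
    if (BufferIsValid(buffer))
        UnlockReleaseBuffer(buffer);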

As a general point, I think the hope is that undo generated by
short-running transactions that commit and become all-visible quickly
will be cheap.  We should be able to dirty shared buffers but then
discard the data without ever writing it out to disk if we've logged a
discard of that data. Obviously, if you've got long-running
transactions that are either generating undo or holding old snapshots,
you're going to have to really flush the data, but we want to avoid
that when we can.  And the same is true on the standby: even if we
write the dirty data into shared buffers instead of skipping the write
altogether, we hope to be able to forget about those buffers when we
encounter a 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-16 Thread Dilip Kumar
On Fri, Aug 16, 2019 at 10:56 AM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-16 09:44:25 +0530, Dilip Kumar wrote:
> > On Wed, Aug 14, 2019 at 2:48 PM Dilip Kumar  wrote:
> > >
> > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:
> >
> > > >   I think that batch reading should just copy the underlying data into a
> > > >   char* buffer. Only the records that currently are being used by
> > > >   higher layers should get exploded into an unpacked record. That will
> > > >   reduce memory usage quite noticably (and I suspect it also drastically
> > > >   reduce the overhead due to a large context with a lot of small
> > > >   allocations that then get individually freed).
> > >
> > > Ok, I got your idea.  I will analyze it further and work on this if
> > > there is no problem.
> >
> > I think there is one problem that currently while unpacking the undo
> > record if the record is compressed (i.e. some of the fields does not
> > exist in the record) then we read those fields from the first record
> > on the page.  But, if we just memcpy the undo pages to the buffers and
> > delay the unpacking whenever it's needed seems that we would need to
> > know the page boundary and also we need to know the offset of the
> > first complete record on the page from where we can get that
> > information (which is currently in undo page header).
>
> I don't understand why that's a problem?
Okay, I was assuming that we would only be copying the data part, not
the complete page including the page header.  If we copy the page
header as well, we might be able to unpack the compressed record too.

>
>
> > As of now even if we leave this issue apart I am not very clear what
> > benefit you are seeing in the way you are describing compared to the
> > way I am doing it now?
> >
> > a) Is it the multiple palloc? If so then we can allocate memory at
> > once and flatten the undo records in that.  Earlier, I was doing that
> > but we need to align each unpacked undo record so that we can access
> > them directly and based on Robert's suggestion I have modified it to
> > multiple palloc.
>
> Part of it.
>
> > b) Is it the memory size problem that the unpack undo record will take
> > more memory compared to the packed record?
>
> Part of it.
>
> > c) Do you think that we will not need to unpack all the records?  But,
> > I think eventually, at the higher level we will have to unpack all the
> > undo records ( I understand that it will be one at a time)
>
> Part of it. There's a *huge* difference between having a few hundred to
> thousand unpacked records, each consisting of several independent
> allocations, in memory and having one large block containing all
> packed records in a batch, and a few allocations for the few unpacked
> records that need to exist.
>
> There's also d) we don't need separate tiny memory copies while holding
> buffer locks etc.

Yeah, that too.  Yet another problem is how we are going to process
those records.  For that we need to know all the undo record pointers
between start_urecptr and end_urecptr, right?  We just have the big
memory chunk, and we have no idea how many undo records are in it or
what their undo record pointers are.  And without knowing that
information, I am unable to imagine how we are going to sort them
based on block number.
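
One possible way around that would be to record where each record
starts while the batch is being read (at the point where the page
headers are available anyway), and then sort that small index rather
than the packed bytes themselves.  Something like this, where
UndoRecInfo is a hypothetical struct and not something from the patch:

    typedef struct UndoRecInfo
    {
        UndoRecPtr  urp;        /* pointer to the packed record */
        BlockNumber blkno;      /* block the record applies to */
        uint32      offset;     /* byte offset within the batch buffer */
    } UndoRecInfo;

    static int
    undo_rec_cmp(const void *a, const void *b)
    {
        const UndoRecInfo *ra = (const UndoRecInfo *) a;
        const UndoRecInfo *rb = (const UndoRecInfo *) b;

        if (ra->blkno != rb->blkno)
            return (ra->blkno < rb->blkno) ? -1 : 1;
        return 0;
    }

    /* after filling infos[0 .. nrecords - 1] while reading the batch */
    qsort(infos, nrecords, sizeof(UndoRecInfo), undo_rec_cmp);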

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-15 Thread Andres Freund
Hi,

On 2019-08-16 09:44:25 +0530, Dilip Kumar wrote:
> On Wed, Aug 14, 2019 at 2:48 PM Dilip Kumar  wrote:
> >
> > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:
> 
> > >   I think that batch reading should just copy the underlying data into a
> > >   char* buffer. Only the records that currently are being used by
> > >   higher layers should get exploded into an unpacked record. That will
> > >   reduce memory usage quite noticably (and I suspect it also drastically
> > >   reduce the overhead due to a large context with a lot of small
> > >   allocations that then get individually freed).
> >
> > Ok, I got your idea.  I will analyze it further and work on this if
> > there is no problem.
> 
> I think there is one problem that currently while unpacking the undo
> record if the record is compressed (i.e. some of the fields does not
> exist in the record) then we read those fields from the first record
> on the page.  But, if we just memcpy the undo pages to the buffers and
> delay the unpacking whenever it's needed seems that we would need to
> know the page boundary and also we need to know the offset of the
> first complete record on the page from where we can get that
> information (which is currently in undo page header).

I don't understand why that's a problem?


> As of now even if we leave this issue apart I am not very clear what
> benefit you are seeing in the way you are describing compared to the
> way I am doing it now?
> 
> a) Is it the multiple palloc? If so then we can allocate memory at
> once and flatten the undo records in that.  Earlier, I was doing that
> but we need to align each unpacked undo record so that we can access
> them directly and based on Robert's suggestion I have modified it to
> multiple palloc.

Part of it.

> b) Is it the memory size problem that the unpack undo record will take
> more memory compared to the packed record?

Part of it.

> c) Do you think that we will not need to unpack all the records?  But,
> I think eventually, at the higher level we will have to unpack all the
> undo records ( I understand that it will be one at a time)

Part of it. There's a *huge* difference between having a few hundred to
thousand unpacked records, each consisting of several independent
allocations, in memory and having one large block containing all
packed records in a batch, and a few allocations for the few unpacked
records that need to exist.

There's also d) we don't need separate tiny memory copies while holding
buffer locks etc.

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-08-15 Thread Thomas Munro
Hi Kuntal,

On Thu, Jul 25, 2019 at 5:40 PM Kuntal Ghosh  wrote:
> Here are some review comments on 0003-Add-undo-log-manager.patch. I've
> tried to avoid duplicate comments as much as possible.

Thanks!  Replies inline.  I'll be posting a new patch set shortly with
these and other fixes.

> 1. In UndoLogAllocate,
> + * time this backend as needed to write to an undo log at all or because
> s/as/has

Fixed.

> + * Maintain our tracking of the and the previous transaction start
> Do you mean current log's transaction start as well?

Right, fixed.

> 2. In UndoLogAllocateInRecovery,
> we try to find the current log from the first undo buffer. So, after a
> log switch, we always have to register at least one buffer from the
> current undo log first. If we're updating something in the previous
> log, the respective buffer should be registered after that. I think we
> should document this in the comments.

I'm not sure I understand. Is this working correctly today?

> 3. In UndoLogGetOldestRecord(UndoLogNumber logno, bool *full),
> it seems the 'full' parameter is not used anywhere. Do we still need this?
>
> + /* It's been recycled.  SO it must have been entirely discarded. */
> s/SO/So

Fixed.

> 4. In CleanUpUndoCheckPointFiles,
> we can emit a debug2 message with something similar to : 'removed
> unreachable undo metadata files'

Done.

> + if (unlink(path) != 0)
> + elog(ERROR, "could not unlink file \"%s\": %m", path);
> according to my observation, whenever we deal with a file operation,
> we usually emit a ereport message with errcode_for_file_access().
> Should we change it to ereport? There are other file operations as
> well including read(), OpenTransientFile() etc.

Done.

> 5. In CheckPointUndoLogs,
> + /* Capture snapshot while holding each mutex. */
> + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
> + serialized[num_logs++] = slot->meta;
> + LWLockRelease(&slot->mutex);
> why do we need an exclusive lock to read something from the slot? A
> share lock seems to be sufficient.

OK.

> pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC) is called
> after pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE)
> without calling pgstat_report_wait_end(). I think you've done the
> same to avoid an extra function call. But, it differs from other
> places in the PG code. Perhaps, we should follow this approach
> everywhere.

Ok, changed.
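
For reference, the usual pattern elsewhere in the tree wraps each
individual syscall, roughly like this (a sketch, not the patch's code;
the wait events are the ones named above):

    pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE);
    if (write(fd, data, len) != len)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write file \"%s\": %m", path)));
    pgstat_report_wait_end();

    pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC);
    if (pg_fsync(fd) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not fsync file \"%s\": %m", path)));
    pgstat_report_wait_end();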

> 6. In StartupUndoLogs,
> + if (fd < 0)
> + elog(ERROR, "cannot open undo checkpoint snapshot \"%s\": %m", path);
> assuming your agreement upon changing above elog to ereport, the
> message should be more user friendly. May be something like 'cannot
> open pg_undo file'.

Done.

> + if ((size = read(fd, &slot->meta, sizeof(slot->meta))) != 
> sizeof(slot->meta))
> The usage of sizeof doesn't look like a problem. But, we can save
> some extra padding bytes at the end if we use the (offsetof + sizeof)
> approach similar to other places in PG.

It currently ends in a 64-bit value, so there is no padding here.

> 7. In free_undo_log_slot,
> + /*
> + * When removing an undo log from a slot in shared memory, we acquire
> + * UndoLogLock, log->mutex and log->discard_lock, so that other code can
> + * hold any one of those locks to prevent the slot from being recycled.
> + */
> + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
> + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
> + Assert(slot->logno != InvalidUndoLogNumber);
> + slot->logno = InvalidUndoLogNumber;
> + memset(&slot->meta, 0, sizeof(slot->meta));
> + LWLockRelease(&slot->mutex);
> + LWLockRelease(UndoLogLock);
> you've not taken the discard_lock as mentioned in the comment.

Right, I was half-way between two different ideas about how that
interlocking should work, but I have straightened this out now, and
will write about the overall locking model separately.

> 8. In find_undo_log_slot,
> + * 1.  If the calling code knows that it is attached to this lock or is the
> s/lock/slot

Fixed.

BTW I am experimenting with macros that would actually make assertions
about those programming rules.

> + * 2.  All other code should acquire log->mutex before accessing any members,
> + * and after doing so, check that the logno hasn't moved.  If it is not, the
> + * entire undo log must be assumed to be discarded (as if this function
> + * returned NULL) and the caller must behave accordingly.
> Perhaps, you meant '..check that the logno remains same. If it is not..'.

Fixed.

> + /*
> + * If we didn't find it, then it must already have been entirely
> + * discarded.  We create a negative cache entry so that we can answer
> + * this question quickly next time.
> + *
> + * TODO: We could track the lowest known undo log number, to reduce
> + * the negative cache entry bloat.
> + */
> This is an interesting thought. But, I'm wondering how we are going to
> search the discarded logno in the simple hash. I guess that's why it's
> in the TODO list.

Done.  Each backend tracks its idea of the lowest undo log that
exists.  There is a shared low_logno that is recomputed 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-15 Thread Thomas Munro
Hi Amit,

I've combined three of your messages into one below, and responded
inline.  New patch set to follow shortly, with the fixes listed below
(and others from other reviewers).

On Wed, Jul 24, 2019 at 9:15 PM Amit Kapila  wrote:
> 0003-Add-undo-log-manager.patch
> 1.
> allocate_empty_undo_segment()
...
> + /* create two parents up if not exist */
> + parentdir = pstrdup(undo_path);
> + get_parent_directory(parentdir);
> + get_parent_directory(parentdir);
> + /* Can't create parent and it doesn't already exist? */
> + if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST)
>
>
> All of this code is almost same as we have code in
> TablespaceCreateDbspace still we have small differences like here you
> are using mkdir instead of MakePGDirectory which as far as I can see
> use similar permissions for creating directory.  Also, it checks
> whether the directory exists before trying to create it.  Is there a
> reason why we need to do a few things differently here if not, they
> can both the places use one common function?

Right, MakePGDirectory() arrived in commit da9b580d8990, and should
probably be used everywhere we create directories under pgdata.
Fixed.
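
Presumably something along these lines, mirroring the existing mkdir
call in the patch (the error message wording is only illustrative):

    /* Create the parent directory if it doesn't already exist. */
    if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not create directory \"%s\": %m",
                        parentdir)));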

Yeah, I think we could just use TablespaceCreateDbspace() for this, if
we are OK with teaching GetDatabasePath() and GetRelationPath() about
how to make special undo paths, OR we are OK with just using
"standard" paths, where undo files just live under database 9 (instead
of the special "undo" directory).  I stopped using a "9" directory in
earlier versions because undo moved to a separate namespace when we
agreed to use an extra discriminator in buffer tags and so forth; now
that we're back to using database number 9, the question of whether to
reflect that on the filesystem is back.  I have had some trouble
deciding which parts of the system should treat undo logs as some kind
of 'relation' (and the SLRU project will have to tackle the same
questions).  I'll think about that some more before making the change.

> 2.
> allocate_empty_undo_segment()
> {
> ..
> ..
> /* Flush the contents of the file to disk before the next checkpoint. */
> + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);
> ..
> }

> The comment in allocate_empty_undo_segment indicates that the code
> wants to flush before checkpoint, but the actual function tries to
> register the request with checkpointer.  Shouldn't this be similar to
> XLogFileInit where we use pg_fsync to flush the contents immediately?

I responded to the general question about when we sync files in an
earlier email.  I've updated the comments to make it clearer that it's
handing the work off, not doing it now.

> Another thing is that recently in commit 475861b261 (commit by you),
> we have introduced a mechanism to not fill the files with zeros for
> certain filesystems like ZFS.  Do we want similar behavior for undo
> files?

Good point.  I will start a separate thread to discuss the creation of
a central file allocation routine (possibly with a GUC), and see if we
can come up with something reusable for this but independently
committable.

> 3.
> +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
> +{
> + UndoLogSlot *slot;
> + size_t end;
> +
> + slot = find_undo_log_slot(logno, false);
> +
> + /* TODO review interlocking */
> +
> + Assert(slot != NULL);
> + Assert(slot->meta.end % UndoLogSegmentSize == 0);
> + Assert(new_end % UndoLogSegmentSize == 0);
> + Assert(InRecovery ||
> +CurrentSession->attached_undo_slots[slot->meta.category] == slot);
>
> Can you write some comments explaining the above Asserts?  Also, can
> you explain what interlocking issues are you worried about here?

I added comments about the assertions.  I will come back to the
interlocking in another message, which I've now addressed (alluded to
below as well).

> 4.
> while (end < new_end)
> + {
> + allocate_empty_undo_segment(logno, slot->meta.tablespace, end);
> + end += UndoLogSegmentSize;
> + }
> +
> + /* Flush the directory entries before next checkpoint. */
> + undofile_request_sync_dir(slot->meta.tablespace);
>
> I see that at two places after allocating empty undo segment, the
> patch performs undofile_request_sync_dir whereas it doesn't perform
> the same in UndoLogNewSegment? Is there a reason for the same or is it
> missed from one of the places?

You're right.  Done.

> 5.
> +static void
> +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
> {
> ..
> /*
> + * We didn't need to acquire the mutex to read 'end' above because only
> + * we write to it.  But we need the mutex to update it, because the
> + * checkpointer might read it concurrently.
>
> Is this assumption correct?  It seems the patch also modifies
> slot->meta.end during discard in the function UndoLogDiscard.  I am
> referring to the code below:
>
> +UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid)
> {
> ..
> + /* Update shmem to show the new discard and end pointers. */
> + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
> + 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-15 Thread Dilip Kumar
On Wed, Aug 14, 2019 at 2:48 PM Dilip Kumar  wrote:
>
> On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:

> >   I think that batch reading should just copy the underlying data into a
> >   char* buffer. Only the records that currently are being used by
> >   higher layers should get exploded into an unpacked record. That will
> >   reduce memory usage quite noticably (and I suspect it also drastically
> >   reduce the overhead due to a large context with a lot of small
> >   allocations that then get individually freed).
>
> Ok, I got your idea.  I will analyze it further and work on this if
> there is no problem.

I think there is one problem: currently, while unpacking the undo
record, if the record is compressed (i.e. some of the fields do not
exist in the record) then we read those fields from the first record
on the page.  But, if we just memcpy the undo pages to the buffers and
delay the unpacking until it's needed, it seems that we would need to
know the page boundary and also the offset of the first complete
record on the page from where we can get that information (which is
currently in the undo page header).

Even if we leave this issue apart, I am not very clear what benefit
you are seeing in the way you are describing compared to the way I am
doing it now:

a) Is it the multiple pallocs?  If so, then we can allocate memory at
once and flatten the undo records into that.  Earlier, I was doing
that, but we need to align each unpacked undo record so that we can
access them directly, and based on Robert's suggestion I have modified
it to multiple pallocs.
b) Is it the memory size problem, i.e. that the unpacked undo record
will take more memory compared to the packed record?
c) Do you think that we will not need to unpack all the records?  But
I think eventually, at the higher level, we will have to unpack all
the undo records (I understand that it will be one at a time).

Or am I completely missing something here?

>
>   That will make the
> >   sorting of undo a bit more CPU inefficient, because individual records
> >   will need to be partially unpacked for comparison, but I think that's
> >   going to be a far smaller loss than the win.
> Right.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-15 Thread Dilip Kumar
On Wed, Aug 14, 2019 at 10:35 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote:
> > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:
> > > - I think there's two fairly fundamental, and related, problems with
> > >   the sequence outlined above:
> > >
> > >   - We can't search for target buffers to store undo data, while holding
> > > the "table" page content locked.
> > >
> > > The easy way to solve would be to require sites performing UNDO
> > > logging to acquire victim pages before even acquiring other content
> > > locks. Perhaps the better approach could be for the undo layer to
> > > hold onto a number of clean buffers, and to keep the last page in an
> > > already written to undo log pinned.
> > >
> > >   - We can't search for victim buffers for further undo data while
> > > already holding other undo pages content locked. Doing so means that
> > > while we're e.g. doing IO to clean out the new page, old undo data
> > > on the previous page can't be read.
> > >
> > > This seems easier to fix. Instead of PrepareUndoInsert() acquiring,
> > > pinning and locking buffers, it'd need to be split into two
> > > operations. One that acquires buffers and pins them, and one that
> > > locks them.  I think it's quite possible that the locking operation
> > > could just be delayed until InsertPreparedUndo().  But if we solve
> > > the above problem, most of this might already be solved.
> >
> > Basically, that means
> > - the caller should call PreparedUndoInsert before acquiring table
> > page content lock right? because the PreparedUndoInsert just compute
> > the size, allocate the space and pin+lock the buffers and for pinning
> > the buffers we must compute the size and allocate the space using undo
> > storage layer.
>
> I don't think we can normally pin the undo buffers properly at that
> stage. Without knowing the correct contents of the table page - which we
> can't know without holding some form of lock preventing modifications -
> we can't know how big our undo records are going to be. And we can't
> just have buffers that don't exist on disk in shared memory, and we
> don't want to allocate undo that we then don't need. So I think what
> we'd have to do at that stage, is to "pre-allocate" buffers for the
> maximum amount of UNDO needed, but mark the associated bufferdesc as not
> yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID
> would not be set.
>
> So at the start of a function that will need to insert undo we'd need to
> pre-reserve the maximum number of buffers we could potentially
> need. That reservation stage would

Maybe we can provide an interface where the caller will input the
maximum number of prepared undo records (maybe in
BeginUndoRecordInsert), and based on that we can compute the maximum
number of buffers we could potentially need for this particular
operation.  Most operations (insert/update/delete) will need 1 or 2
undo records, so we can avoid pinning a very high number of buffers in
most cases.  Currently, only the multi-insert implementation of zheap
might need multiple undo records (one undo record per range of
tuples).
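
In other words, something like this at the start of the operation,
which is roughly what the patch's BeginUndoRecordInsert computes (the
macro names below are from the patch; the surrounding shape is an
assumption):

    /*
     * The caller states how many undo records this WAL-logged operation
     * will prepare, and the undo layer derives the worst-case number of
     * buffers to pin from that.
     */
    nbuffers = (nprepared + MAX_UNDO_UPDATE_INFO) * MAX_BUFFER_PER_UNDO;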

> a) pin the page with the current end of the undo
> b) if needed pin the page of older undo that we need to update (e.g. to
>update the next pointer)
> c) perform clock sweep etc to acquire (find or create) enough clean to
>hold the maximum amount of undo needed. These buffers would be marked
>as !BM_TAG_VALID | BUF_REFCOUNT_ONE.
>
> I assume that we'd make a) cheap by keeping it pinned for undo logs that
> a backend is actively attached to. b) should only be needed once in a
> transaction, so it's not too bad. c) we'd probably need to amortize
> across multiple undo insertions, by keeping the unused buffers pinned
> until the end of the transaction.
>
> I assume that having the infrastructure c) might also make some code
> for already in postgres easier. There's obviously some issues around
> guaranteeing that the maximum number of such buffers isn't high.
>
>
> > - So basically, if we delay the lock till InsertPreparedUndo and call
> > PrepareUndoInsert before acquiring table page content lock this
> > problem is solved?
> >
> > Although I haven't yet analyzed the AM specific part that whether it's
> > always possible to call the PrepareUndoInsert(basically getting all
> > the undo record ready) before the page content lock.  But, I am sure
> > that won't be much difficult part.
>
> I think that is somewhere between not possible, and so expensive in a
> lot of cases that we'd not want to do it anyway. You'd at least have to
> first acquire a content lock on the page, mark the target tuple as
> locked, then unlock the page, reserve undo, lock the table page,
> actually update it.
>
>
> > >   - When reading an undo record, the whole stage of UnpackUndoData()
> > > reading data into the UndoPackContext is omitted, reading directly
> > > into 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-14 Thread Andres Freund
Hi,

On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote:
> On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:
> > - I think there's two fairly fundamental, and related, problems with
> >   the sequence outlined above:
> >
> >   - We can't search for target buffers to store undo data, while holding
> > the "table" page content locked.
> >
> > The easy way to solve would be to require sites performing UNDO
> > logging to acquire victim pages before even acquiring other content
> > locks. Perhaps the better approach could be for the undo layer to
> > hold onto a number of clean buffers, and to keep the last page in an
> > already written to undo log pinned.
> >
> >   - We can't search for victim buffers for further undo data while
> > already holding other undo pages content locked. Doing so means that
> > while we're e.g. doing IO to clean out the new page, old undo data
> > on the previous page can't be read.
> >
> > This seems easier to fix. Instead of PrepareUndoInsert() acquiring,
> > pinning and locking buffers, it'd need to be split into two
> > operations. One that acquires buffers and pins them, and one that
> > locks them.  I think it's quite possible that the locking operation
> > could just be delayed until InsertPreparedUndo().  But if we solve
> > the above problem, most of this might already be solved.
> 
> Basically, that means
> - the caller should call PreparedUndoInsert before acquiring table
> page content lock right? because the PreparedUndoInsert just compute
> the size, allocate the space and pin+lock the buffers and for pinning
> the buffers we must compute the size and allocate the space using undo
> storage layer.

I don't think we can normally pin the undo buffers properly at that
stage. Without knowing the correct contents of the table page - which we
can't know without holding some form of lock preventing modifications -
we can't know how big our undo records are going to be. And we can't
just have buffers that don't exist on disk in shared memory, and we
don't want to allocate undo that we then don't need. So I think what
we'd have to do at that stage, is to "pre-allocate" buffers for the
maximum amount of UNDO needed, but mark the associated bufferdesc as not
yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID
would not be set.

So at the start of a function that will need to insert undo we'd need to
pre-reserve the maximum number of buffers we could potentially
need. That reservation stage would

a) pin the page with the current end of the undo
b) if needed pin the page of older undo that we need to update (e.g. to
   update the next pointer)
c) perform clock sweep etc to acquire (find or create) enough clean to
   hold the maximum amount of undo needed. These buffers would be marked
   as !BM_TAG_VALID | BUF_REFCOUNT_ONE.

I assume that we'd make a) cheap by keeping it pinned for undo logs that
a backend is actively attached to. b) should only be needed once in a
transaction, so it's not too bad. c) we'd probably need to amortize
across multiple undo insertions, by keeping the unused buffers pinned
until the end of the transaction.

I assume that having the infrastructure for c) might also make some
code already in postgres easier.  There's obviously some issues around
guaranteeing that the maximum number of such buffers isn't high.
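
To restate a), b) and c) as code, the reservation phase might look
roughly like this; every function named below is hypothetical and only
meant to visualize the proposal:

    /* Run before taking any table-page content locks. */
    static void
    UndoReserveBuffers(UndoInsertReservation *res, Size max_undo_bytes)
    {
        /* a) keep the page at the current end of the attached undo log pinned */
        res->end_buffer = PinCurrentUndoEndPage();              /* hypothetical */

        /* b) pin the older undo page whose next-pointer may need updating */
        res->prev_buffer = PinPreviousUndoHeaderPage();         /* hypothetical */

        /*
         * c) find or create enough clean victim buffers for the worst case,
         * leaving them pinned but not yet tagged (!BM_TAG_VALID, refcount 1).
         */
        res->nvictims = ReserveUntaggedBuffers(res->victims,    /* hypothetical */
                                               max_undo_bytes / BLCKSZ + 2);
    }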


> - So basically, if we delay the lock till InsertPreparedUndo and call
> PrepareUndoInsert before acquiring table page content lock this
> problem is solved?
> 
> Although I haven't yet analyzed the AM specific part that whether it's
> always possible to call the PrepareUndoInsert(basically getting all
> the undo record ready) before the page content lock.  But, I am sure
> that won't be much difficult part.

I think that is somewhere between not possible, and so expensive in a
lot of cases that we'd not want to do it anyway. You'd at least have to
first acquire a content lock on the page, mark the target tuple as
locked, then unlock the page, reserve undo, lock the table page,
actually update it.


> >   - When reading an undo record, the whole stage of UnpackUndoData()
> > > reading data into the UndoPackContext is omitted, reading directly
> > into the UnpackedUndoRecord. That removes one further copy of the
> > record format.
> So we will read member by member to UnpackedUndoRecord?  because in
> context we have at least a few headers packed and we can memcpy one
> header at a time like UndoRecordHeader, UndoRecordBlock.

Well, right now you then copy them again later, so not much is gained by
that (although that later copy can happen without the content lock
held). As I think I suggested before, I suspect that the best way would
be to just memcpy() the data from the page(s) into an appropriately
sized buffer with the content lock held, and then perform unpacking
directly into UnpackedUndoRecord. Especially with the bulk API that will
avoid having to do 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-14 Thread Andres Freund
Hi,

On 2019-08-13 17:05:27 +0530, Dilip Kumar wrote:
> On Mon, Aug 5, 2019 at 11:59 PM Andres Freund  wrote:
> > (as I was out of context due to dealing with bugs, I've switched to
> > looking at the current zheap/undoprocessing branch.)
> >
> > On 2019-07-30 01:02:20 -0700, Andres Freund wrote:
> > > +/*
> > > + * Insert a previously-prepared undo records.
> > > + *
> > > + * This function will write the actual undo record into the buffers which are
> > > + * already pinned and locked in PreparedUndoInsert, and mark them dirty.  This
> > > + * step should be performed inside a critical section.
> > > + */
> >
> > Again, I think it's not ok to just assume you can lock an essentially
> > unbounded number of buffers. This seems almost guaranteed to result in
> > deadlocks. And there's limits on how many lwlocks one can hold etc.
> 
> I think for controlling that we need to put a limit on max prepared
> undo?  I am not sure any other way of limiting the number of buffers
> because we must lock all the buffer in which we are going to insert
> the undo record under one WAL logged operation.

I heard that a number of times. But I still don't know why that'd
actually be true. Why would it not be sufficient to just lock the buffer
currently being written to, rather than all buffers? It'd require a bit
of care updating the official current "logical end" of a log, but
otherwise ought to not be particularly hard? Only one backend can extend
the log after all, and until the log is externally visibly extended,
nobody can read or write those buffers, no?


> >
> > As far as I can tell there's simply no deadlock avoidance scheme in use
> > here *at all*? I must be missing something.
> 
> We are always locking buffer in block order so I am not sure how it
> can deadlock?  Am I missing something?

Do we really in all circumstances? Note that we update the transinfo
(and also progress) from earlier in the log. But my main point is more
that there's no documented deadlock avoidance scheme. Which imo means
there's none, because nobody will know to maintain it.



> > > + /*
> > > +  * During recovery, there might be some blocks which are already
> > > +  * deleted due to some discard command so we can just skip
> > > +  * inserting into those blocks.
> > > +  */
> > > + if (!BufferIsValid(buffer))
> > > + {
> > > + Assert(InRecovery);
> > > +
> > > + /*
> > > +  * Skip actual writing just update the context so that we have
> > > +  * write offset for inserting into next blocks.
> > > +  */
> > > + SkipInsertingUndoData(&ucontext, BLCKSZ - starting_byte);
> > > + if (ucontext.stage == UNDO_PACK_STAGE_DONE)
> > > + break;
> > > + }
> >
> > How exactly can this happen?
> 
> Suppose you insert one record for the transaction which is split across
> block1 and block2.  Now, before these blocks actually go to disk, the
> transaction commits and becomes all-visible, and the undo logs are
> discarded.  It's possible that block 1 is completely discarded but
> block 2 is not, because it might have undo for the next transaction.
> Now, during recovery (FPW is off), if block 1 is missing but block 2 is
> there, we need to skip inserting undo for block 1 as it does not
> exist.

Hm. I'm quite doubtful this is a good idea. How will this not force us
to a emit a lot more expensive durable operations while writing undo?
And doesn't this reduce error detection quite remarkably?

Thomas, Robert?





> > > + /* Read the undo record. */
> > > + UndoGetOneRecord(uur, urecptr, rnode, category, 
> > > );
> > > +
> > > + /* Release the discard lock after fetching the 
> > > record. */
> > > + if (!InHotStandby)
> > > + LWLockRelease(>discard_lock);
> > > + }
> > > + else
> > > + UndoGetOneRecord(uur, urecptr, rnode, category, 
> > > );
> >
> >
> > And then we do none of this in !one_page mode.
> UndoBulkFetchRecord is always called for the aborted transaction, so
> its undo can never get discarded concurrently; ideally, we don't
> need to check for discard.

That's an undocumented assumption. Why would anybody reading the
interface know that?



> > > +static uint16
> > > +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer,
> > > +  UndoLogCategory category)
> > > +{
> > > + UndoLogOffset page_offset = UndoRecPtrGetPageOffset(urp);
> > > + BlockNumber cur_blk = UndoRecPtrGetBlockNum(urp);
> > > + Buffer  buffer = 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-14 Thread Andres Freund
Hi,

On 2019-08-13 13:53:59 +0530, Dilip Kumar wrote:
> On Tue, Jul 30, 2019 at 1:32 PM Andres Freund  wrote:
> > > + /* Loop until we have fetched all the buffers in which we need to write. */
> > > + while (size > 0)
> > > + {
> > > + bufidx = UndoGetBufferSlot(context, rnode, cur_blk, 
> > > RBM_NORMAL);
> > > + xact_info->idx_undo_buffers[index++] = bufidx;
> > > + size -= (BLCKSZ - starting_byte);
> > > + starting_byte = UndoLogBlockHeaderSize;
> > > + cur_blk++;
> > > + }
> >
> > So, this locks a very large number of undo buffers at the same time, do
> > I see that correctly?  What guarantees that there are no deadlocks due
> > to multiple buffers locked at the same time (I guess the order inside
> > the log)? What guarantees that this is a small enough number that we can
> > even lock all of them at the same time?
> 
> I think we are locking them in the block order and that should avoid
> the deadlock.  I have explained in the comments.

Sorry for harping on this so much: But please, please, *always* document
things like this *immediately*. This is among the most crucial things to
document. There shouldn't need to be a reviewer prodding you to do so
many months after the code has been written. For one you've likely
forgotten details by then, but more importantly dependencies on the
locking scheme will have crept into further places - if it's not well
thought through that can be hard to undo. And it wastes reviewer /
reader bandwidth.


> > Why do we need to lock all of them at the same time? That's not clear to
> > me.
> 
> Because this is called outside the critical section, we keep locked all
> the buffers that we want to update inside the critical section for a
> single WAL record.

I don't understand this explanation. What does keeping the buffers
locked have to do with the critical section? As explained in a later
email, I think the current approach is not acceptable - but even without
those issues, I don't see why we couldn't just lock the buffers at a
later stage?




> > > + for (i = 0; i < context->nprepared_undo_buffer; i++)
> > > + {
> >
> > How large do we expect this to get at most?
> >
> In BeginUndoRecordInsert we are computing it
> 
> + /* Compute number of buffers. */
> + nbuffers = (nprepared + MAX_UNDO_UPDATE_INFO) * MAX_BUFFER_PER_UNDO;

Since nprepared is variable, that doesn't really answer the question.



Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-08-14 Thread Dilip Kumar
On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:
>
> Hi,
>

> - I think there's two fairly fundamental, and related, problems with
>   the sequence outlined above:
>
>   - We can't search for target buffers to store undo data, while holding
> the "table" page content locked.
>
> The easy way to solve would be to require sites performing UNDO
> logging to acquire victim pages before even acquiring other content
> locks. Perhaps the better approach could be for the undo layer to
> hold onto a number of clean buffers, and to keep the last page in an
> already written to undo log pinned.
>
>   - We can't search for victim buffers for further undo data while
> already holding other undo pages content locked. Doing so means that
> while we're e.g. doing IO to clean out the new page, old undo data
> on the previous page can't be read.
>
> This seems easier to fix. Instead of PrepareUndoInsert() acquiring,
> pinning and locking buffers, it'd need to be split into two
> operations. One that acquires buffers and pins them, and one that
> locks them.  I think it's quite possible that the locking operation
> could just be delayed until InsertPreparedUndo().  But if we solve
> the above problem, most of this might already be solved.

Basically, that means
- the caller should call PrepareUndoInsert before acquiring the table
page content lock, right?  Because PrepareUndoInsert just computes the
size, allocates the space and pins+locks the buffers, and for pinning
the buffers we must compute the size and allocate the space using the
undo storage layer.
- So basically, if we delay the lock till InsertPreparedUndo and call
PrepareUndoInsert before acquiring the table page content lock, this
problem is solved?

Although I haven't yet analyzed the AM-specific part to check whether
it's always possible to call PrepareUndoInsert (basically getting all
the undo records ready) before taking the page content lock, I am sure
that won't be the difficult part.

> - To me the current split between the packed and unpacked UNDO record
>   formats makes very little sense, the reasoning behind having them is
>   poorly if at all documented, results in extremely verbose code, and
>   isn't extensible.
>
>   When preparing to insert an undo record the in-buffer size is computed
>   with UndoRecordHeaderSize() (needs to know about all optional data)
>   from within PrepareUndoInsert() (which has a bunch a bunch of
>   additional knowledge about the record format). Then during insertion
>   InsertPreparedUndo(), first copies the UnpackedUndoRecord into an
>   UndoPackContext (again needing ...), and then, via InsertUndoData(),
>   copies that in small increments into the respective buffers (again
>   needing knowledge about the complete record format, two copies
>   even). Beside the code duplication, that also means the memory copies
>   are very inefficient, because they're all done in tiny increments,
>   multiple times.
>
>   When reading undo it's similar: UnpackUndoData(), again in small
>   chunks, reads the buffer data into an UndoPackContext (another full
>   copy of the unpacked record format). But then FinishUnpackUndo()
>   *again* copies all that data, into an actual UnpackedUndoRecord
>   (again, with a copy of the record format, albeit slightly different
>   looking).
>
>   I'm not convinced by Heikki's argument that we shouldn't have
>   structure within undo records. In my opinion that is a significant
>   weakness of how WAL was initially designed, and even after Heikki's
>   work, still is a problem.  But this isn't the right design either.
>
>   Even if were to stay with the current fixed record format, I think
>   the current code needs a substantial redesign:
>
>   - I think 'packing' during insertion needs to serialize into a char*
> allocation during PrepareUndoInsert

ok
> computing the size in parallel
> (or perhaps in InsertPreparedUndo, but probably not).  The size of
> the record should never be split across record boundaries
> (i.e. we'll leave that space unused if we otherwise would need to
> split the size).
I think before UndoRecordAllocate we need to detect whether the size
field of the record would start from the last byte of the page, and if
so, allocate one extra byte for the undo record.  Or we could always
allocate one extra byte for the undo record to handle this case.  And,
in FinalizeUndoAdvance, only pass the size we have actually consumed.

>   The actual insertion would be a maximally sized
> memcpy() (so we'd do as many memcpys as the buffer fits in, rather than
> one for each sub-type of a record).
>
> That allows to remove most of the duplicated knowledge of the record
> format, and makes insertions faster (by doing only large memcpys
> while holding exclusive content locks).
Right.
>
>   - When reading an undo record, the whole stage of UnpackUndoData()
> reading data into the UndoPackContext 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-14 Thread Andres Freund
Hi,

On 2019-08-06 14:18:42 -0700, Andres Freund wrote:
> Here's the last section of my low-leve review. Plan to write a higher
> level summary afterwards, now that I have a better picture of the code.

General comments:

- For a feature of this complexity, there's very little architectural
  documentation. Some individual parts have a bit, but there's basically
  nothing over-arching. That makes it extremely hard for anybody that is
  not already involved to understand the design constraints, and even
  for people involved it's hard to understand.

  I think it's very important for this to have a document that first
  explains what the goals, and non-goals, of this feature are. And then
  secondly explains the chosen architecture referencing those
  constraints.  Once that's there it's a lot easier to review this
  patchset, to discuss the overall architecture, etc.


- There are too many different email threads and git branches. The same
  patches are discussed in different threads, layers exist in slightly
  diverging versions in different git trees. Again making it very hard
  for anybody not primarily focussing on undo to join the discussion.

  I think most of the older git branches should be renamed into
  something indicating their historic status. The remaining branches
  should be referenced from a wiki page (linked to in each submission of
  a new patch version), explaining what they're about.  I don't think
  it's realistic to have people infer meaning from the current branch
  names (undo, proposal-undo-log, undo-log-storage, undo-log-storage-v2,
  undo_interface_v1, undoprocessing).

  Given the size of the overall project it's quite possibly not
  realistic to manage the whole work in a single git branch. With
  separate git branches, as currently done, it's however hard to
  understand which version of a what layer is used.  I think at the very
  least higher layers need to indicate the version of the underlying
  layers is being used.  I suggest just adding a "v01: " style prefix to
  all commit subjects a branch is rebased onto.

  It's also currently hard to understand what version of a layer is
  being discussed. I think submissions all need to include a version
  number (c.f. git format-patch's -v option), and that version ought to
  be included in the subject line. Each new major version of a patch
  should be started as a reply to the first message of a thread, to
  keep the structure of a discussion in a manageable shape. New versions
  should include explicit explanations about the major changes compared
  to the last version.


- The code quality of pretty significant parts of the patchset is not
  even close to being good enough. There are areas with a lot of code
  duplication. There are very few higher level explanations for
  interfaces. There's a lot of "i++; /* increment i to increment it */"
  style comments, but not enough higher level comments. There are
  significant issues with parts of the code that aren't noted anywhere
  in comments, leading to reviewers having to repeatedly re-discover
  them (and wasting time on that).

  There's different naming styles in related code without a discernible
  pattern (e.g. UndoRecordSetInfo being followed by
  get_undo_rec_cid_offset). The word-ordering of different layers is
  confusing (e.g. BeginUndoRecordInsert vs UndoLogBeginInsert vs
  PrepareUndoInsert). Different important externally visible functions
  have names that don't allow to determine which is supposed to do what
  (PrepareUndoInsert vs BeginUndoRecordInsert).


More specific comments:

- The whole sequencing of undo record insertion in combination with WAL
  logging does not appear to be right. It's a bit hard to say, because
  there's very little documentation on what the intended default
  sequence of operations is.

  My understanding is that the currently expected pattern is to:

  1) Collect information / perform work needed to perform the action
 that needs to be UNDO logged. E.g. perform visibility
 determinations, wait for lockers, compute new infomask, etc.

 This will likely end with the "content" page(s) (e.g. a table's
 page) being exclusively locked.

  2) Estimate space needed for all UNDO logging (BeginUndoRecordInsert)

  3) Prepare for each undo record, this includes building the
 content for each undo record. PrepareUndoInsert(). This acquires,
 pins and locks buffers for undo.

  4) begin a critical section

  5) modify content page, mark buffer dirty

  6) write UNDO, using InsertPreparedUndo()

  7) associate undo with WAL record (RegisterUndoLogBuffers)

  8) write WAL

  9) End critical section
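
  In code form, the pattern described above would look roughly like this
  (a sketch only: signatures are simplified, and RM_FOO_ID / XLOG_FOO_OP
  stand for a hypothetical AM and its WAL record):

    /* steps 1-3: collect information, then size and prepare the undo
     * record; PrepareUndoInsert() pins and locks the undo buffers */
    BeginUndoRecordInsert(&context, UNDO_PERMANENT, 1 /* nprepared */, NULL);
    PrepareUndoInsert(&context, &undorecord, dbid);

    /* steps 4-9 */
    START_CRIT_SECTION();

    /* step 5: modify the content page */
    MarkBufferDirty(content_buffer);

    /* step 6: write the prepared undo into the already locked buffers */
    InsertPreparedUndo(&context);

    /* step 7: associate the undo buffers with the WAL record */
    RegisterUndoLogBuffers(&context, 1);

    /* step 8: emit WAL (XLogBeginInsert()/XLogRegisterBuffer() elided) */
    recptr = XLogInsert(RM_FOO_ID, XLOG_FOO_OP);
    PageSetLSN(BufferGetPage(content_buffer), recptr);

    /* step 9 */
    END_CRIT_SECTION();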

  But despite reading through the code, including READMEs, I'm doubtful
  that's quite the intended pattern. It REALLY can't be right that one
  needs to parse many function headers to figure out how the most basic
  use of undo could possibly work.  There needs to be very clear
  documentation about how to write undo 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-13 Thread Robert Haas
On Tue, Aug 13, 2019 at 6:28 AM Amit Kapila  wrote:
> So, for top-level transactions rollback, we can directly refer from
> UndoRequest *, the start and end locations.  But, what should we do
> for sub-transactions (rollback to savepoint)?  One related point is
> that we also need information about last_log_start_undo_location to
> update the undo apply progress (The basic idea is if the transaction's
> undo is spanned across multiple logs, we update the progress in each
> of the logs.).  We can remember that in the transaction state or
> undorequest *.  Any suggestion?

The UndoRequest is only for top-level rollback.  Any state that you
need in order to do subtransaction rollback needs to be maintained
someplace else, probably in the transaction state state, or some
subsidiary data structure.  The point here is that the UndoRequest is
going to be stored in shared memory, but there is no reason ever to
store the information about a subtransaction in shared memory, because
that undo always has to be completed by the backend that is
responsible for that transaction. Those things should not get mixed
together.
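
For illustration, one hypothetical shape for that backend-local state
(the names are purely illustrative, not from any patch):

    /*
     * Hypothetical backend-local bookkeeping for rollback-to-savepoint;
     * nothing here needs to live in shared memory.
     */
    typedef struct SubXactUndoState
    {
        SubTransactionId subxid;
        /* bounds of the undo written by this subtransaction, per category */
        UndoRecPtr  start_urec_ptr[UndoLogCategories];
        UndoRecPtr  end_urec_ptr[UndoLogCategories];
    } SubXactUndoState;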

> IIUC, for each transaction, we have to take a lock first time it
> attaches to a log and then the same lock at commit time.  It seems the
> work under lock is less, but still, can't this cause a contention?  It
> seems to me this is similar to what we saw in ProcArrayLock where work
> under lock was few instructions, but acquiring and releasing the lock
> by each backend at commit time was causing a bottleneck.

LWLocks are pretty fast these days and the critical section is pretty
short, so I think there's a chance it'll be just fine, but maybe it'll
cause enough cache line bouncing to be problematic. If so, I think
there are several possible ways to redesign the locking to improve
things, but it made sense to me to try the simple approach first.

> How will computation of oldestXidHavingUnappliedUndo will work?
>
> We can probably check the fxid queue and error queue to get that
> value.  However, I am not sure if that is sufficient because in case we
> perform the request in the foreground, it won't be present in queues.

Oh, I forgot about that requirement.  I think I can fix it so it does
that fairly easily, but it will require a little bit of redesign which
I won't have time to do this week.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-13 Thread Dilip Kumar
On Tue, Aug 6, 2019 at 1:26 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
> > Need to do something else for a bit. More later.
>
>
> > + /*
> > +  * Compute the header size of the undo record.
> > +  */
> > +Size
> > +UndoRecordHeaderSize(uint16 uur_info)
> > +{
> > + Sizesize;
> > +
> > + /* Add fixed header size. */
> > + size = SizeOfUndoRecordHeader;
> > +
> > + /* Add size of transaction header if it presets. */
> > + if ((uur_info & UREC_INFO_TRANSACTION) != 0)
> > + size += SizeOfUndoRecordTransaction;
> > +
> > + /* Add size of rmid if it presets. */
> > + if ((uur_info & UREC_INFO_RMID) != 0)
> > + size += sizeof(RmgrId);
> > +
> > + /* Add size of reloid if it presets. */
> > + if ((uur_info & UREC_INFO_RELOID) != 0)
> > + size += sizeof(Oid);
> > +
> > + /* Add size of fxid if it presets. */
> > + if ((uur_info & UREC_INFO_XID) != 0)
> > + size += sizeof(FullTransactionId);
> > +
> > + /* Add size of cid if it presets. */
> > + if ((uur_info & UREC_INFO_CID) != 0)
> > + size += sizeof(CommandId);
> > +
> > + /* Add size of forknum if it presets. */
> > + if ((uur_info & UREC_INFO_FORK) != 0)
> > + size += sizeof(ForkNumber);
> > +
> > + /* Add size of prevundo if it presets. */
> > + if ((uur_info & UREC_INFO_PREVUNDO) != 0)
> > + size += sizeof(UndoRecPtr);
> > +
> > + /* Add size of the block header if it presets. */
> > + if ((uur_info & UREC_INFO_BLOCK) != 0)
> > + size += SizeOfUndoRecordBlock;
> > +
> > + /* Add size of the log switch header if it presets. */
> > + if ((uur_info & UREC_INFO_LOGSWITCH) != 0)
> > + size += SizeOfUndoRecordLogSwitch;
> > +
> > + /* Add size of the payload header if it presets. */
> > + if ((uur_info & UREC_INFO_PAYLOAD) != 0)
> > + size += SizeOfUndoRecordPayload;
>
> There's numerous blocks with one if for each type, and the body copied
> basically the same for each alternative. That doesn't seem like a
> reasonable approach to me. Means that many places need to be adjusted
> when we invariably add another type, and seems likely to lead to bugs
> over time.
I think I have expressed my thought on this in another email
[https://www.postgresql.org/message-id/CAFiTN-vDrXuL6tHK1f_V9PAXp2%2BEFRpPtxCG_DRx08PZXAPkyw%40mail.gmail.com]
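
For reference, a table-driven sketch of the kind of thing being
suggested (illustrative only, not what the patch currently does; the
flags and sizes are the ones from the code quoted above):

    /* Each optional header chunk is described exactly once; the size,
     * pack and unpack code can then all loop over this table. */
    static const struct
    {
        uint16      flag;
        Size        size;
    } undo_optional_chunk[] =
    {
        {UREC_INFO_TRANSACTION, SizeOfUndoRecordTransaction},
        {UREC_INFO_RMID, sizeof(RmgrId)},
        {UREC_INFO_RELOID, sizeof(Oid)},
        {UREC_INFO_XID, sizeof(FullTransactionId)},
        {UREC_INFO_CID, sizeof(CommandId)},
        {UREC_INFO_FORK, sizeof(ForkNumber)},
        {UREC_INFO_PREVUNDO, sizeof(UndoRecPtr)},
        {UREC_INFO_BLOCK, SizeOfUndoRecordBlock},
        {UREC_INFO_LOGSWITCH, SizeOfUndoRecordLogSwitch},
        {UREC_INFO_PAYLOAD, SizeOfUndoRecordPayload},
    };

    Size
    UndoRecordHeaderSize(uint16 uur_info)
    {
        Size        size = SizeOfUndoRecordHeader;

        for (int i = 0; i < lengthof(undo_optional_chunk); i++)
        {
            if (uur_info & undo_optional_chunk[i].flag)
                size += undo_optional_chunk[i].size;
        }

        return size;
    }
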
>
> > + /* Add size of the payload header if it presets. */
>
> FWIW, repeating the same comment, with or without minor differences, 10
> times is a bad idea. Especially when the comment doesn't add *any* sort
> of information.
Ok, fixed
>
> Also, "if it presets" presumably is a typo?
Fixed
>
>
> > +/*
> > + * Compute and return the expected size of an undo record.
> > + */
> > +Size
> > +UndoRecordExpectedSize(UnpackedUndoRecord *uur)
> > +{
> > + Sizesize;
> > +
> > + /* Header size. */
> > + size = UndoRecordHeaderSize(uur->uur_info);
> > +
> > + /* Payload data size. */
> > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
> > + {
> > + size += uur->uur_payload.len;
> > + size += uur->uur_tuple.len;
> > + }
> > +
> > + /* Add undo record length size. */
> > + size += sizeof(uint16);
> > +
> > + return size;
> > +}
> > +
> > +/*
> > + * Calculate the size of the undo record stored on the page.
> > + */
> > +static inline Size
> > +UndoRecordSizeOnPage(char *page_ptr)
> > +{
> > + uint16  uur_info = ((UndoRecordHeader *) page_ptr)->urec_info;
> > + Sizesize;
> > +
> > + /* Header size. */
> > + size = UndoRecordHeaderSize(uur_info);
> > +
> > + /* Payload data size. */
> > + if ((uur_info & UREC_INFO_PAYLOAD) != 0)
> > + {
> > + UndoRecordPayload *payload = (UndoRecordPayload *) (page_ptr + size);
> > +
> > + size += payload->urec_payload_len;
> > + size += payload->urec_tuple_len;
> > + }
> > +
> > + return size;
> > +}
> > +
> > +/*
> > + * Compute size of the Unpacked undo record in memory
> > + */
> > +Size
> > +UnpackedUndoRecordSize(UnpackedUndoRecord *uur)
> > +{
> > + Sizesize;
> > +
> > + size = sizeof(UnpackedUndoRecord);
> > +
> > + /* Add payload size if record contains payload data. */
> > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
> > + {
> > + size += uur->uur_payload.len;
> > + size += uur->uur_tuple.len;
> > + }
> > +
> > + return size;
> > +}
>
> These functions are all basically the same. We shouldn't copy code over
> and over like this.
UnpackedUndoRecordSize computes the size of the unpacked undo record
in memory, so it's different from the above two; only the payload part
is common, so I moved the payload size computation into a common
function.

UndoRecordExpectedSize and UndoRecordSizeOnPage are two different
functions except for the header size 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-13 Thread Dilip Kumar
On Mon, Aug 5, 2019 at 11:59 PM Andres Freund  wrote:
>
> Hi,
>
> (as I was out of context due to dealing with bugs, I've switched to
> looking at the current zheap/undoprocessing branch.)
>
> On 2019-07-30 01:02:20 -0700, Andres Freund wrote:
> > +/*
> > + * Insert a previously-prepared undo records.
> > + *
> > + * This function will write the actual undo record into the buffers which 
> > are
> > + * already pinned and locked in PreparedUndoInsert, and mark them dirty.  
> > This
> > + * step should be performed inside a critical section.
> > + */
>
> Again, I think it's not ok to just assume you can lock an essentially
> unbounded number of buffers. This seems almost guaranteed to result in
> deadlocks. And there's limits on how many lwlocks one can hold etc.

I think for controlling that we need to put a limit on max prepared
undo?  I am not sure of any other way of limiting the number of buffers,
because we must lock all the buffers in which we are going to insert
the undo record under one WAL-logged operation.

>
> As far as I can tell there's simply no deadlock avoidance scheme in use
> here *at all*? I must be missing something.

We are always locking buffers in block order, so I am not sure how it
can deadlock.  Am I missing something?
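
For illustration, a minimal sketch of the ordering argument (not the
patch's actual code): as long as every backend acquires its undo buffer
locks in increasing block order, no two backends can wait on each other
in a cycle.

    /* bufs[]/blknos[] are assumed to have been collected by the prepare
     * step in increasing block-number order */
    static void
    lock_prepared_undo_buffers(Buffer *bufs, BlockNumber *blknos, int nbufs)
    {
        for (int i = 0; i < nbufs; i++)
        {
            Assert(i == 0 || blknos[i] > blknos[i - 1]);
            LockBuffer(bufs[i], BUFFER_LOCK_EXCLUSIVE);
        }
    }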

>
>
> > + /* Main loop for writing the undo record. */
> > + do
> > + {
>
> I'd prefer this to not be a do{} while(true) loop - as written I need to
> read to the end to see what the condition is. I don't think we have any
> loops like that in the code.
Right, changed
>
>
> > + /*
> > +  * During recovery, there might be some blocks which are already
> > +  * deleted due to some discard command so we can just skip
> > +  * inserting into those blocks.
> > +  */
> > + if (!BufferIsValid(buffer))
> > + {
> > +     Assert(InRecovery);
> > +
> > +     /*
> > +      * Skip actual writing just update the context so that we have
> > +      * write offset for inserting into next blocks.
> > +      */
> > +     SkipInsertingUndoData(&ucontext, BLCKSZ - starting_byte);
> > +     if (ucontext.stage == UNDO_PACK_STAGE_DONE)
> > +         break;
> > + }
>
> How exactly can this happen?

Suppose you insert one record for the transaction which is split across
block1 and block2.  Now, before these blocks actually go to disk, the
transaction commits and becomes all-visible, so the undo logs are
discarded.  It's possible that block 1 is completely discarded but
block 2 is not, because it might have undo for the next transaction.
Now, during recovery (FPW is off), if block 1 is missing but block 2 is
there, we need to skip inserting undo for block 1 as it does not
exist.

>
>
> > + else
> > + {
> > +     page = BufferGetPage(buffer);
> > +
> > +     /*
> > +      * Initialize the page whenever we try to write the first
> > +      * record in page.  We start writing immediately after the
> > +      * block header.
> > +      */
> > +     if (starting_byte == UndoLogBlockHeaderSize)
> > +         UndoPageInit(page, BLCKSZ, prepared_undo->urec->uur_info,
> > +                      ucontext.already_processed,
> > +                      prepared_undo->urec->uur_tuple.len,
> > +                      prepared_undo->urec->uur_payload.len);
> > +
> > +     /*
> > +      * Try to insert the record into the current page. If it
> > +      * doesn't succeed then recall the routine with the next page.
> > +      */
> > +     InsertUndoData(&ucontext, page, starting_byte);
> > +     if (ucontext.stage == UNDO_PACK_STAGE_DONE)
> > +     {
> > +         MarkBufferDirty(buffer);
> > +         break;
>
> At this point we're five indentation levels deep. I'd extract at least
> > either the per prepared undo code or the code performing the writing
> across block boundaries into a separate function. Perhaps both.

I have moved it to a separate function.
>
>
>
> > +/*
> > + * Helper function for UndoGetOneRecord
> > + *
> > + * If any of  rmid/reloid/xid/cid is not available in the undo record, then
> > + * it 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-13 Thread Amit Kapila
On Thu, Aug 8, 2019 at 7:01 PM Andres Freund  wrote:
> On 2019-08-07 14:50:17 +0530, Amit Kapila wrote:
> > On Tue, Aug 6, 2019 at 1:26 PM Andres Freund  wrote:
> > > On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
>
> > > >  typedef struct TwoPhaseFileHeader
> > > >  {
> > > > @@ -927,6 +928,16 @@ typedef struct TwoPhaseFileHeader
> > > >   uint16  gidlen; /* length of the GID - 
> > > > GID follows the header */
> > > >   XLogRecPtr  origin_lsn; /* lsn of this record at 
> > > > origin node */
> > > >   TimestampTz origin_timestamp;   /* time of prepare at origin node 
> > > > */
> > > > +
> > > > + /*
> > > > +  * We need the locations of the start and end undo record 
> > > > pointers when
> > > > +  * rollbacks are to be performed for prepared transactions using 
> > > > undo-based
> > > > +  * relations.  We need to store this information in the file as 
> > > > the user
> > > > +  * might rollback the prepared transaction after recovery and for 
> > > > that we
> > > > +  * need its start and end undo locations.
> > > > +  */
> > > > + UndoRecPtr  start_urec_ptr[UndoLogCategories];
> > > > + UndoRecPtr  end_urec_ptr[UndoLogCategories];
> > > >  } TwoPhaseFileHeader;
> > >
> > > Why do we not need that knowledge for undo processing of a non-prepared
> > > transaction?
>
> > The non-prepared transaction also needs to be aware of that.  It is
> > stored in TransactionStateData.  I am not sure if I understand your
> > question here.
>
> My concern is that I think it's fairly ugly to store data like this in
> the 2pc state file. And it's not an insubstantial amount of additional
> data either, compared to the current size, even when no undo is in
> use. There's a difference between an unused feature increasing backend
> local memory and increasing the size of WAL logged data. Obviously it's
> not by a huge amount, but still.  It also just feels wrong to me.
>
> We don't need the UndoRecPtr's when recovering from a crash/restart to
> process undo. Now we obviously don't want to unnecessarily search for
> data that is expensive to gather, which is a good reason for keeping
> track of this data. But I do wonder if this is the right approach.
>
> I know that Robert is working on a patch that revises the undo request
> layer somewhat, it's possible that this is best discussed afterwards.
>

Okay, we have started working on integrating with Robert's patch.  I
think not only this but many of the other things will also change.
So, I will respond to other comments after integrating with Robert's
patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-13 Thread Amit Kapila
On Fri, Aug 9, 2019 at 1:57 AM Robert Haas  wrote:
>
> On Thu, Aug 8, 2019 at 9:31 AM Andres Freund  wrote:
> > I know that Robert is working on a patch that revises the undo request
> > layer somewhat, it's possible that this is best discussed afterwards.
>
> Here's what I have at the moment.  This is not by any means a complete
> replacement for Amit's undo worker machinery, but it is a significant
> redesign (and I believe a significant improvement to) the queue
> management stuff from Amit's patch.  I wrote this pretty quickly, so
> while it passes simple testing, it probably has a number of bugs, and
> to actually use it, it would need to be integrated with xact.c; right
> now it's just a standalone module that doesn't do anything except let
> itself be tested.
>
> Some of the ways it is different from Amit's patches include:
>
> * It uses RBTree rather than binaryheap, so when we look ahead, we
> look ahead in the right order.
>
> * There's no limit to the lookahead distance; when looking ahead, it
> will search the entirety of all 3 RBTrees for an entry from the right
> database.
>
> * It doesn't have a separate hash table keyed by XID.  I didn't find
> that necessary.
>
> * It's better-isolated, as you can see from the fact that I've
> included a test module that tests this code without actually ever
> putting an UndoRequestManager in shared memory.  I would've liked to
> expand this test module, but I don't have time to do that today and
> felt it better to get this much sent out.
>
> * It has a lot of comments explaining the design and how it's intended
> to integrate with the rest of the system.
>
> Broadly, my vision for how this would get used is:
>
> - Create an UndoRequestManager in shared memory.
> - Before a transaction first attaches to a permanent or unlogged undo
> log, xact.c would call RegisterUndoRequest(); thereafter, xact.c would
> store a pointer to the UndoRequest for the lifetime of the toplevel
> transaction.

So, for top-level transactions rollback, we can directly refer from
UndoRequest *, the start and end locations.  But, what should we do
for sub-transactions (rollback to savepoint)?  One related point is
that we also need information about last_log_start_undo_location to
update the undo apply progress (The basic idea is if the transaction's
undo is spanned across multiple logs, we update the progress in each
of the logs.).  We can remember that in the transaction state or
undorequest *.  Any suggestion?

> - Immediately after attaching to a permanent or unlogged undo log,
> xact.c would call UndoRequestSetLocation.
> - xact.c would track the number of bytes of permanent and unlogged
> undo records the transaction generates.  If the transaction goes onto
> abort, it reports these by calling FinalizeUndoRequest.
> - If the transaction commits, it doesn't need that information, but
> does need to call UnregisterUndoRequest() as a post-commit step in
> CommitTransaction().
>

IIUC, for each transaction, we have to take a lock first time it
attaches to a log and then the same lock at commit time.  It seems the
work under lock is less, but still, can't this cause a contention?  It
seems to me this is similar to what we saw in ProcArrayLock where work
under lock was few instructions, but acquiring and releasing the lock
by each backend at commit time was causing a bottleneck.

It might be due to some reason this won't matter in a similar way in
which case we can find that after integrating it with other patches
from undo processing machinery and rebasing zheap branch over it?

How will computation of oldestXidHavingUnappliedUndo will work?

We can probably check the fxid queue and error queue to get that
value.  However, I am not sure if that is sufficient because in case we
perform the request in the foreground, it won't be present in queues.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-13 Thread Dilip Kumar
On Tue, Jul 30, 2019 at 1:32 PM Andres Freund  wrote:
>
> Hi,
>
> Amit, short note: The patches aren't attached in patch order. Obviously
> a miniscule thing, but still nicer if that's not the case.
>
> Dilip, this also contains the start of a review for the undo record
> interface further down.

> > Subject: [PATCH 07/14] Provide interfaces to store and fetch undo records.
> >
> > Add the capability to form undo records and store them in undo logs.  We
> > also provide the capability to fetch the undo records.  This layer will use
> > undo-log-storage to reserve the space for the undo records and buffer
> > management routines to write and read the undo records.
> >
>
> > Undo records are stored in sequential order in the undo log.
>
> Maybe "In each und log undo records are stored in sequential order."?
Done
>
>
>
> > +++ b/src/backend/access/undo/README.undointerface
> > @@ -0,0 +1,29 @@
> > +Undo record interface layer
> > +---
> > +This is the next layer which sits on top of the undo log storage, which 
> > will
> > +provide an interface for prepare, insert, or fetch the undo records.  This
> > +layer will use undo-log-storage to reserve the space for the undo records
> > +and buffer management routine to write and read the undo records.
>
> The reference to "undo log storage" kinda seems like a reference into
> nothingness...
Changed
>
>
> > +Writing an undo record
> > +--
> > +To prepare an undo record, first, it will allocate required space using
> > +undo log storage module.  Next, it will pin and lock the required buffers 
> > and
> > +return an undo record pointer where it will insert the record.  Finally, it
> > +calls the Insert routine for final insertion of prepared record.  
> > Additionally,
> > +there is a mechanism for multi-insert, wherein multiple records are 
> > prepared
> > +and inserted at a time.
>
> I'm not sure what this is telling me. Who is "it"?
>
> To me the filename ("interface"), and the title of this section,
> suggests this provides documentation on how to write code to insert undo
> records. But I don't think this does.
I have improved it
>
>
> > +Fetching and undo record
> > +
> > +To fetch an undo record, a caller must provide a valid undo record pointer.
> > +Optionally, the caller can provide a callback function with the 
> > information of
> > +the block and offset, which will help in faster retrieval of undo record,
> > +otherwise, it has to traverse the undo-chain.
>
> > +There is also an interface to bulk fetch the undo records.  Where the 
> > caller
> > +can provide a TO and FROM undo record pointer and the memory limit for 
> > storing
> > +the undo records.  This API will return all the undo record between FROM 
> > and TO
> > +undo record pointers if they can fit into provided memory limit otherwise, 
> > it
> > +return whatever can fit into the memory limit.  And, the caller can call it
> > +repeatedly until it fetches all the records.
>
> There's a lot of  terminology in this file that's not been introduced. I
> think this needs to be greatly expanded and restructured to allow people
> unfamiliar with the code to benefit.
I have improved it, but I think I still need to work on it to
introduce the terminology used.
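
As an illustration of the intended calling pattern for the bulk-fetch
interface described above (a rough sketch; the exact signature of
UndoBulkFetchRecord and the variable names here are assumptions):

    UndoRecPtr  from_urecptr = last_urecptr;    /* newest record to fetch */
    UndoRecPtr  to_urecptr = first_urecptr;     /* oldest record to fetch */

    do
    {
        int         nrecords;
        UnpackedUndoRecord *records;

        /* fetch as many records as fit into undo_apply_size bytes */
        records = UndoBulkFetchRecord(&from_urecptr, to_urecptr,
                                      undo_apply_size, &nrecords);

        /* ... apply/consume the nrecords records, then free them ... */

        /*
         * from_urecptr now points at the next record to fetch, or is
         * InvalidUndoRecPtr once everything up to to_urecptr was returned.
         */
    } while (UndoRecPtrIsValid(from_urecptr));
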
>
>
> > +/*-
> > + *
> > + * undoaccess.c
> > + * entry points for inserting/fetching undo records
>
> > + * NOTES:
> > + * Undo record layout:
> > + *
> > + * Undo records are stored in sequential order in the undo log.  Each undo
> > + * record consists of a variable length header, tuple data, and payload
> > + * information.
>
> Is that actually true? There's records without tuples, no?
Right, changed this
>
> > The first undo record of each transaction contains a
> > + * transaction header that points to the next transaction's start
> > header.
>
> Seems like this needs to reference different persistence levels,
> otherwise it seems misleading, given there can be multiple first records
> in multiple undo logs?
I have changed it.
>
>
> > + * This allows us to discard the entire transaction's log at one-shot
> > rather
>
> s/at/in/
Fixed
>
> > + * than record-by-record.  The callers are not aware of transaction header,
>
> s/of/of the/
Fixed
>
> > + * this is entirely maintained and used by undo record layer.   See
>
> s/this/it/
Fixed
>
> > + * undorecord.h for detailed information about undo record header.
>
> s/undo record/the undo record/
Fixed
>
>
> I think at the very least there's explanations missing for:
> - what is the locking protocol for multiple buffers
> - what are the contexts for insertion
> - what phases an undo insertion happens in
> - updating previous records in general
> - what "packing" actually is
>
>
> > +
> > +/* Prototypes for static functions. */
>
>
> Don't think we commonly include that...
Changed, removed all unwanted prototypes
>
> > +static UnpackedUndoRecord 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-12 Thread Dilip Kumar
On Tue, Jul 30, 2019 at 12:21 PM Thomas Munro  wrote:
>

> Hi Dilip,
>
> > commit 2f3c127b9e8bc7d27cf7adebff0a355684dfb94e
> > Author: Dilip Kumar 
> > Date:   Thu May 2 11:28:13 2019 +0530
> >
> >Provide interfaces to store and fetch undo records.
>
> +#include "commands/tablecmds.h"
> +#include "storage/block.h"
> +#include "storage/buf.h"
> +#include "storage/buf_internals.h"
> +#include "storage/bufmgr.h"
> +#include "miscadmin.h"
>
> "miscadmin.h" comes before "storage...".
Right, fixed.
>
> +/*
> + * Compute the size of the partial record on the undo page.
> + *
> + * Compute the complete record size by uur_info and variable field length
> + * stored in the page header and then subtract the offset of the record so 
> that
> + * we can get the exact size of partial record in this page.
> + */
> +static inline Size
> +UndoPagePartialRecSize(UndoPageHeader phdr)
> +{
> +Sizesize;
>
> We decided to use size_t everywhere in new code (except perhaps
> functions conforming to function pointer types that historically use
> Size in their type).
>
> +/*
> + * Compute the header size from undo record uur_info, stored in the page
> + * header.
> + */
> +size = UndoRecordHeaderSize(phdr->uur_info);
> +
> +/*
> + * Add length of the variable part and undo length. Now, we know the
> + * complete length of the undo record.
> + */
> +size += phdr->tuple_len + phdr->payload_len + sizeof(uint16);
> +
> +/*
> + * Subtract the size which is stored in the previous page to get the
> + * partial record size stored in this page.
> + */
> +size -= phdr->record_offset;
> +
> +return size;
>
> This is probably a stupid question but why isn't it enough to just
> store the offset of the first record that begins on this page, or 0
> for none yet?  Why do we need to worry about the partial record's
> payload etc?
Right, as this patch stands it would be enough to just store the offset
where the first complete record starts.  But for the undo page
consistency checker we need to mask the CID field in the partial record
as well.  So we need to know how many bytes of the partial record are
already written in the previous page (phdr->record_offset), which fields
are present in the partial record (uur_info), and the variable part, to
compute the next record offset.  Currently, I have improved it by
storing the complete record length instead of the payload and tuple
lengths, but we can improve this further by storing the next record
offset directly, which will avoid some computation.  I haven't worked on
the undo consistency patch much in this version, so I will analyze this
further in the next version.
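
In other words (a sketch, under the assumption that the partial record,
if any, always starts right after the undo page header):

    /* offset of the first complete record on this undo page */
    first_complete_off = UndoLogBlockHeaderSize + UndoPagePartialRecSize(phdr);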

>
> +UndoRecPtr
> +PrepareUndoInsert(UndoRecordInsertContext *context,
> +  UnpackedUndoRecord *urec,
> +  Oid dbid)
> +{
> ...
> +/* Fetch compression info for the transaction. */
> +compression_info = GetTopTransactionUndoCompressionInfo(category);
>
> How can this work correctly in recovery?  [Edit: it doesn't, as you
> just pointed out]
>
> I had started reviewing an older version of your patch (the version
> that had made it as far as the undoprocessing branch as of recently),
> before I had the bright idea to look for a newer version.  I was going
> to object to the global variable you had there in the earlier version.
> It seems to me that you have to be able to reproduce the exact same
> compression in recovery that you produced as "do" time, no?  How can
> TopTranasctionStateData be the right place for this in recovery?
>
> One data structure that could perhaps hold this would be
> UndoLogTableEntry (the per-backend cache, indexed by undo log number,
> with pretty fast lookups; used for things like
> UndoLogNumberGetCategory()).  As long as you never want to have
> inter-transaction compression, that should have the right scope to
> give recovery per-undo log tracking.  If you ever wanted to do
> compression between transactions too, maybe UndoLogSlot could work,
> but that'd have more complications.
Currently, I read it from the first record on the page.
>
> +/*
> + * Read undo records of the transaction in bulk
> + *
> + * Read undo records between from_urecptr and to_urecptr until we exhaust
> + * the memory size specified by undo_apply_size.  If we could not read all
> + * the records till to_urecptr then the caller should consume the current
> + * set of records and call this function again.
> + *
> + * from_urecptr    - Where to start fetching the undo records.  If we can not
> + *                   read all the records because of memory limit then this
> + *                   will be set to the previous undo record pointer from where
> + *                   we need to start fetching on next call.  Otherwise it will
> + *                   be set to InvalidUndoRecPtr.
> + * to_urecptr      - Last undo record pointer to be fetched.
> + * undo_apply_size - Memory segment limit to collect 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-12 Thread Dilip Kumar
On Wed, Jul 24, 2019 at 11:28 AM Rushabh Lathia wrote:
>
> Hi,
>
> I have started a review of
> 0008-Provide-interfaces-to-store-and-fetch-undo-records.patch, here are a few
> quick comments.
>
> 1) README.undointerface should provide more information like API details or
> the sequence in which API should get called.
I have improved the README to describe more user-specific details,
based on Robert's suggestions offlist.  I think it needs further
improvement to describe the order in which the APIs should be
called.  Unfortunately, that is not yet included in this patch set.
>
> 2) Information about the API's in the undoaccess.c file header block would
> good.  For reference please look at heapam.c.
Done
>
> 3) typo
>
> + * Later, during insert phase we will write actual records into thse buffers.
> + */
>
> %s/thse/these
Fixed
>
> 4) UndoRecordUpdateTransInfo() comments says that this must be called under
> the critical section, but seems like undo_xlog_apply_progress() do call it
> outside of critical section?  Is there exception, then should add comments?
> or Am I missing anything?
During recovery there is an exception, but we can add a comment for that.
I think I missed this in the latest patch; I will keep a note of it
and will do this in the next version.

>
>
> 5) In function UndoBlockGetFirstUndoRecord() below code:
>
> /* Calculate the size of the partial record. */
> partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) +
>phdr->tuple_len + phdr->payload_len -
>phdr->record_offset;
>
> can directly use UndoPagePartialRecSize().

This function is part of another patch in the undoprocessing patch set.
>
> 6)
>
> +static int
> +UndoGetBufferSlot(UndoRecordInsertContext *context,
> +  RelFileNode rnode,
> +  BlockNumber blk,
> +  ReadBufferMode rbm)
> +{
> +inti;
>
> In the above code variable "i" is mean "block index".  It would be good
> to give some valuable name to the variable, maybe "blockIndex" ?
>
Fixed
> 7)
>
> * We will also keep a previous undo record pointer to the first and last undo
>  * record of the transaction in the previous log.  The last undo record
>  * location is used find the previous undo record pointer during rollback.
>
>
> %s/used fine/used to find
Fixed
>
> 8)
>
> /*
>  * Defines the number of times we try to wait for rollback hash table to get
>  * initialized.  After these many attempts it will return error and the user
>  * can retry the operation.
>  */
> #define ROLLBACK_HT_INIT_WAIT_TRY  60
>
> %s/error/an error
This is part of a different patch in the undoprocessing patch set.
>
> 9)
>
>  * we can get the exact size of partial record in this page.
>  */
>
> %s/of partial/of the partial"
This comment is removed in the latest code
>
> 10)
>
> * urecptr - current transaction's undo record pointer which need to be set in
> * the previous transaction's header.
>
> %s/need/needs
Done
>
> 11)
>
> /*
>  * If we are writing first undo record for the page the we can set the
>  * compression so that subsequent records from the same transaction can
>  * avoid including common information in the undo records.
>  */
>
>
> %s/the page the/the page then
>
> 12)
>
> /*
>  * If the transaction's undo records are split across the undo logs.  So
>  * we need to  update our own transaction header in the previous log.
>  */
>
> double space between "to" and "update"
Fixed
>
> 13)
>
> * The undo record should be freed by the caller by calling ReleaseUndoRecord.
>  * This function will old the pin on the buffer where we read the previous 
> undo
>  * record so that when this function is called repeatedly with the same 
> context
>
> %s/old/hold
Fixed
>
> I will continue further review for the same patch.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-12 Thread Dilip Kumar
On Fri, Jul 19, 2019 at 2:28 PM Amit Kapila  wrote:
>
> On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar  wrote:
> >
> > On Thu, Jul 11, 2019 at 12:38 AM Robert Haas  wrote:
> > >
> > > I don't like the fact that undoaccess.c has a new global,
> > > undo_compression_info.  I haven't read the code thoroughly, but do we
> > > really need that?  I think it's never modified (so it could just be
> > > declared const),
> >
> > Actually, this will get modified otherwise across undo record
> > insertion how we will know what was the values of the common fields in
> > the first record of the page.  Another option could be that every time
> > we insert the record, read the value from the first complete undo
> > record on the page but that will be costly because for every new
> > insertion we need to read the first undo record of the page.
> >
>
> This information won't be shared across transactions, so can't we keep
> it in top transaction's state?   It seems to me that will be better
> than to maintain it as a global state.

As replied separately, during recovery we would not have the
transaction state, so I have decided to read it from the first record
on the page; please check the latest patch.
>
> Few more comments on this patch:
> 1.
> PrepareUndoInsert()
> {
> ..
> + if (logswitched)
> + {
> ..
> + }
> + else
> + {
> ..
> + resize = true;
> ..
> + }
> +
> ..
> +
> + do
> + {
> + bufidx = UndoGetBufferSlot(context, rnode, cur_blk, rbm);
> ..
> + rbm = RBM_ZERO;
> + cur_blk++;
> + } while (cur_size < size);
> +
> + /*
> + * Set/overwrite compression info if required and also exclude the common
> + * fields from the undo record if possible.
> + */
> + if (UndoSetCommonInfo(compression_info, urec, urecptr,
> +   context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]].buf))
> + resize = true;
> +
> + if (resize)
> + size = UndoRecordExpectedSize(urec);
>
> I see that in some cases where resize is possible are checked before
> buffer allocation and some are after.  Isn't it better to do all these
> checks before buffer allocation?  Also, isn't it better to even
> compute changed size before buffer allocation as that might sometimes
> help in lesser buffer allocations?

Right, fixed.
>
> Can you find a better way to write
> :context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]].buf)?
>  It makes the line too long and difficult to understand.  Check for
> similar instances in the patch and if possible, change them as well.
This code is gone.  While replying I realised that I haven't scanned
the complete code for such occurrences.  I will work on that in the
next version.
>
> 2.
> +InsertPreparedUndo(UndoRecordInsertContext *context)
> {
> ..
> /*
> + * Try to insert the record into the current page. If it
> + * doesn't succeed then recall the routine with the next page.
> + */
> + InsertUndoData(, page, starting_byte);
> + if (ucontext.stage == UNDO_PACK_STAGE_DONE)
> + {
> + MarkBufferDirty(buffer);
> + break;
> + }
> + MarkBufferDirty(buffer);
> ..
> }
>
> Can't we call MarkBufferDirty(buffer) just before 'if' check?  That
> will avoid calling it twice.

Done
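
For reference, roughly the restructured shape (the "&ucontext" argument
shown here is an assumption; it is not visible in the quoted hunk):

    /*
     * Try to insert the record into the current page; if it doesn't
     * succeed, the caller moves on to the next page.
     */
    InsertUndoData(&ucontext, page, starting_byte);
    MarkBufferDirty(buffer);
    if (ucontext.stage == UNDO_PACK_STAGE_DONE)
        break;
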
>
> 3.
> + * Later, during insert phase we will write actual records into thse buffers.
> + */
> +struct PreparedUndoBuffer
>
> /thse/these
Done
>
> 4.
> + /*
> + * If we are writing first undo record for the page the we can set the
> + * compression so that subsequent records from the same transaction can
> + * avoid including common information in the undo records.
> + */
> + if (first_complete_undo)
>
> /page the we/page then we
This code is gone
>
> 5.
> PrepareUndoInsert()
> {
> ..
> After
> + * allocation We'll only advance by as many bytes as we turn out to need.
> + */
> + UndoRecordSetInfo(urec);
>
> Change the beginning of comment as: "After allocation, we'll .."

Done
>
> 6.
> PrepareUndoInsert()
> {
> ..
> * TODO:  instead of storing this in the transaction header we can
> + * have separate undo log switch header and store it there.
> + */
> + prevlogurp =
> + MakeUndoRecPtr(UndoRecPtrGetLogNo(prevlog_insert_urp),
> +(UndoRecPtrGetOffset(prevlog_insert_urp) - prevlen));
> +
>
> I don't think this TODO is valid anymore because now the patch has a
> separate log-switch header.

Yup.  Anyway, the log switch design has now changed.
>
> 7.
> /*
> + * If undo log is switched then set the logswitch flag and also reset the
> + * compression info because we can use same compression info for the new
> + * undo log.
> + */
> + if (UndoRecPtrIsValid(prevlog_xact_start))
>
> /can/can't
Right.  But the compression code has now changed, so this comment no longer exists.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-12 Thread Dilip Kumar
On Thu, Jul 18, 2019 at 4:58 PM Amit Kapila  wrote:
>
> On Tue, Jul 16, 2019 at 2:20 PM Dilip Kumar  wrote:
> >
>
> Few comments on the new patch:
>
> 1.
> Additionally,
> +there is a mechanism for multi-insert, wherein multiple records are prepared
> +and inserted at a time.
>
> Which mechanism are you talking about here?  By any chance is this
> related to some old code?

In the current code also we have the option to prepare multiple
records and insert them at once.  I have enhanced the comments to make
this more clear.
>
> 2.
> +Fetching and undo record
> +
> +To fetch an undo record, a caller must provide a valid undo record pointer.
> +Optionally, the caller can provide a callback function with the information 
> of
> +the block and offset, which will help in faster retrieval of undo record,
> +otherwise, it has to traverse the undo-chain.
>
> I think this is out-dated information.  You seem to forget updating
> README after latest changes in API.
Right, fixed.
>
> 3.
> + * The cid/xid/reloid/rmid information will be added in the undo record 
> header
> + * in the following cases:
> + * a) The first undo record of the transaction.
> + * b) First undo record of the page.
> + * c) All subsequent record for the transaction which is not the first
> + *   transaction on the page.
> + * Except above cases,  If the rmid/reloid/xid/cid is same in the subsequent
> + * records this information will not be stored in the record, these 
> information
> + * will be retrieved from the first undo record of that page.
> + * If any of the member rmid/reloid/xid/cid has changed, the changed
> information
> + * will be stored in the undo record and the remaining information will be
> + * retrieved from the first complete undo record of the page
> + */
> +UndoCompressionInfo undo_compression_info[UndoLogCategories];
>
> a. Do we want to compress fork_number also?  It is an optional field
> and is only include when undo record is for not MAIN_FORKNUM.  For
> zheap, this means it will never be included, but in future, it could
> be included for some other AM or some other use case.  So, not sure if
> there is any benefit in compressing the same.
Yeah, so as of now I haven't compressed forkno.
>
> b. cid/xid/reloid/rmid - I think it is better to write it as rmid,
> reloid, xid, cid in the same order as you declare them in
> UndoPackStage.
>
> c. Some minor corrections. /Except above/Except for above/; /, If
> the/, if the/;  /is same/is the same/; /record, these
> information/record rather this information/
>
> d. I think there is no need to start the line "If any of the..." from
> a new line, it can be continued where the previous line ends.  Also,
> at the end of that line, add a full stop.

These comments are removed in the new patch.
>
> 4.
> /*
> + * Copy the compression global compression info to our context before
> + * starting prepare because this value might get updated multiple time in
> + * case of multi-prepare but the global value should be updated only after
> + * we have successfully inserted the undo record.
> + */
>
> In the above comment, the first 'compression' is not required. /time/times/
These comments have changed now as the design has changed.
>
> 5.
> +/*
> + * The below common information will be stored in the first undo
> record of the page.
> + * Every subsequent undo record will not store this information, if
> required this information
> + * will be retrieved from the first undo record of the page.
> + */
> +typedef struct UndoCompressionInfo
>
> The line length in the above comments exceeds the 80-char limit.  You
> might want to run pgindent to avoid such problems.

Fixed,
>
> 6.
> +/*
> + * Exclude the common info in undo record flag and also set the compression
> + * info in the context.
> + *
>
> 'flag' seems to be a redundant word here?
Obsolete comment as per new changes
>
> 7.
> +UndoSetCommonInfo(UndoCompressionInfo *compressioninfo,
> +   UnpackedUndoRecord *urec, UndoRecPtr urp,
> +   Buffer buffer)
> +{
> +
> + /*
> + * If we have valid compression info and the for the same transaction and
> + * the current undo record is on the same block as the last undo record
> + * then exclude the common information which are same as first complete
> + * record on the page.
> + */
> + if (compressioninfo->valid &&
> + FullTransactionIdEquals(compressioninfo->fxid, urec->uur_fxid) &&
> + UndoRecPtrGetBlockNum(urp) == UndoRecPtrGetBlockNum(lasturp))
>
> Here the comment is just a verbal for of if-check.  How about writing
> it as: "Exclude the common information from the record which is same
> as the first record on the page."

Tried to improve it in the new code.
>
> 8.
> UndoSetCommonInfo()
> {
> ..
> if (compressioninfo->valid &&
> + FullTransactionIdEquals(compressioninfo->fxid, urec->uur_fxid) &&
> + UndoRecPtrGetBlockNum(urp) == UndoRecPtrGetBlockNum(lasturp))
> + {
> + urec->uur_info &= ~UREC_INFO_XID;
> +
> + /* Don't include rmid if it's same. */
> + if (urec->uur_rmid == compressioninfo->rmid)

Re: POC: Cleaning up orphaned files using undo logs

2019-08-09 Thread Robert Haas
On Wed, Aug 7, 2019 at 6:57 AM Heikki Linnakangas  wrote:
> Yeah, that's also a problem with complicated WAL record types. Hopefully
> the complex cases are an exception, not the norm. A complex case is
> unlikely to fit any pre-defined set of fields anyway. (We could look at
> how e.g. protobuf works, if this is really a big problem. I'm not
> suggesting that we add a dependency just for this, but there might be
> some patterns or interfaces that we could mimic.)

I think what you're calling the complex cases are going to be pretty
normal cases, not something exotic, but I do agree with you that
making the infrastructure more generic is worth considering.  One idea
I had is to use the facilities from pqformat.h; have the generic code
read whatever the common fields are, and then pass the StringInfo to
the AM which can do whatever it wants with the rest of the record, but
probably these facilities would make it pretty easy to handle either a
series of fixed-length fields or alternatively variable-length data.
What do you think of that idea?

(That would not preclude doing compression on top, although I think
that feeding everything through pglz or even lz4/snappy may eat more
CPU cycles than we can really afford. The option is there, though.)
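
As a sketch of what that could look like on the AM side (illustrative
only; the field names and the way the StringInfo would be handed to and
from the undo layer are assumptions):

    #include "lib/stringinfo.h"
    #include "libpq/pqformat.h"

    /* AM-specific part of an undo record, serialized with pqformat.h */
    static void
    foo_pack_undo_payload(StringInfo buf, BlockNumber blkno,
                          OffsetNumber offnum, uint16 infomask)
    {
        pq_sendint32(buf, blkno);       /* block of the affected tuple */
        pq_sendint16(buf, offnum);      /* its item pointer offset */
        pq_sendint16(buf, infomask);    /* AM-specific flags */
    }

    static void
    foo_unpack_undo_payload(StringInfo buf, BlockNumber *blkno,
                            OffsetNumber *offnum, uint16 *infomask)
    {
        *blkno = pq_getmsgint(buf, 4);
        *offnum = pq_getmsgint(buf, 2);
        *infomask = pq_getmsgint(buf, 2);
    }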

> If you remember, we did a big WAL format refactoring in 9.5, which moved
> some information from AM-specific structs to the common headers. Namely,
> the information on the relation blocks that the WAL record applies to.
> That was a very handy refactoring, and allowed tools like pg_waldump to
> print more detailed information about all WAL record types. For WAL
> records, moving the block information was natural, because there was
> special handling for full-page images anyway. However, I don't think we
> have enough experience with UNDO log yet, to know which fields would be
> best to include in the common undo header, and which to leave as
> AM-specific payload. I think we should keep the common header slim, and
> delegate to the AM routines.

Yeah, I remember. I'm not really sure I totally buy your argument that
we don't know what besides XID should go into an undo record: tuples
are a pretty important concept, and although there might be some
exceptions here and there, I have a hard time imagining that undo is
going to be primarily about anything other than identifying a tuple
and recording something you did to it.  On the other hand, you might
want to identify several tuples, or identify a tuple with a TID that's
not 6 bytes, so that's a good reason for allowing more flexibility.

Another point in being favor of being more flexible is that it's not
clear that there's any use case for third-party tools that work using
undo.  WAL drives replication and logical decoding and could be used
to drive incremental backup, but it's not really clear that similar
applications exist for undo.  If it's just private to the AM, the AM
might as well be responsible for it.  If that leads to code
duplication, we can create a library of common routines and AM users
can use them if they want.

> Hmm. If you're following an UNDO chain, from newest to oldest, I would
> assume that the newer record has enough information to decide whether
> you need to look at the previous record. If the previous record is no
> longer interesting, it might already be discarded away, after all.

I actually thought zedstore might need this pattern.  If you store an
XID with each undo pointer, as the current zheap code mostly does,
then you have enough information to decide whether you care about the
previous undo record before you fetch it. But a tuple stores only an
undo pointer, and you determine that the undo isn't discarded, you
have to fetch the record first and then possibly decide that you had
the right version in the first place.  Now, maybe that pattern doesn't
repeat, because the undo records could be set up to contain both an
XMIN and an XMAX, but not necessarily.  I don't know exactly what you
have in mind, but it doesn't seem totally crazy that an undo record
might contain the XID that created that version but not the XID that
created the prior version, and if so, you'll iterate backwards until
you either hit the end of undo or go one undo record past the version
you can see.
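
In code form, the pattern I'm describing looks roughly like this; just a
sketch: FetchVisibleUndoVersion and the visibility check are invented
names, uur_prevundo/UndoRecPtrIsValid are assumed from the current
patches, and freeing of rejected records is omitted:

/* Walk the undo chain backwards until we find a version we can see. */
static UnpackedUndoRecord *
FetchVisibleUndoVersion(UndoRecPtr urec_ptr, Snapshot snapshot)
{
    UndoRecordFetchContext context;
    UnpackedUndoRecord *uur = NULL;

    BeginUndoFetch(&context);
    while (UndoRecPtrIsValid(urec_ptr))
    {
        uur = UndoFetchRecord(&context, urec_ptr);
        if (uur == NULL)
            break;              /* discarded: the chain end is all-visible */
        if (UndoVersionIsVisible(uur, snapshot))    /* hypothetical AM check */
            break;              /* this is the version we can see */
        /* Not visible yet; step back to the prior version. */
        urec_ptr = uur->uur_prevundo;
    }
    FinishUndoFetch(&context);

    return uur;
}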

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-09 Thread Amit Kapila
On Fri, Jul 26, 2019 at 9:57 PM Amit Khandekar  wrote:
>
> On Fri, 26 Jul 2019 at 12:25, Amit Kapila  wrote:
> > I agree with all your other comments.
>
> Thanks for addressing the comments. Below is the continuation of my
> comments from 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch
> :
>
>
> + * Perform rollback request.  We need to connect to the database for first
> + * request and that is required because we access system tables while
> for first request and that is required => for the first request. This
> is required
>

Changed as per suggestion.

> ---
>
> +UndoLauncherShmemSize(void)
> +{
> +Sizesize;
> +
> +/*
> + * Need the fixed struct and the array of LogicalRepWorker.
> + */
> +size = sizeof(UndoApplyCtxStruct);
>
> The fixed structure size should be  offsetof(UndoApplyCtxStruct,
> workers) rather than sizeof(UndoApplyCtxStruct)
>
> ---
>

Why?  I see similar code in ApplyLauncherShmemSize.  If there is a
problem with this, then we might have a similar problem in the existing
code as well.

> In UndoWorkerCleanup(), we set individual fields of the
> UndoApplyWorker structure, whereas in UndoLauncherShmemInit(), for all
> the UndoApplyWorker array elements, we just memset all the
> UndoApplyWorker structure elements to 0. I think we should be
> consistent for the two cases. I guess we can just memset to 0 as you
> do in UndoLauncherShmemInit(), but this will cause the
> worker->undo_worker_queue to be 0 i.e. XID_QUEUE , whereas in
> UndoWorkerCleanup(), it is set to -1. Is the -1 value essential, or we
> can just set it to XID_QUEUE initially ?

Either is fine, because before launching the worker we set the valid
value.  It is better to set it as InvalidUndoWorkerQueue though.

> Also, if we just use memset in UndoWorkerCleanup(), we need to first
> save generation into a temp variable, and then after memset(), restore
> it back.
>

This sounds like unnecessary trickery.

> That brought me to another point :
> We already have a macro ResetUndoRequestInfo(), so UndoWorkerCleanup()
> can just call ResetUndoRequestInfo().
> 
>

Hmm, both (UndoRequestInfo and UndoApplyWorker) are separate
structures, so how can we reuse them?

> +boolallow_peek;
> +
> +CHECK_FOR_INTERRUPTS();
> +
> +allow_peek = !TimestampDifferenceExceeds(started_at,
> Some comments would be good about what is allow_peek  used for. Something 
> like :
> "Arrange to prevent the worker from restarting quickly to switch databases"
>

Added a slightly different comment.

> -
> +++ b/src/backend/access/undo/README.UndoProcessing
> -
>
> +worker then start reading from one of the queues the requests for that
> start=>starts
> ---
>
> +work, it lingers for UNDO_WORKER_LINGER_MS (10s as default).  This avoids
> As per the latest definition, it is 20s. IMHO, there's no need to
> mention the default value in the readme.
> ---
>
> +++ b/src/backend/access/undo/discardworker.c
> ---
>
> + * portion of transaction that is overflowed into a separate log can
> be processed
> 80-col crossed.
>
> +#include "access/undodiscard.h"
> +#include "access/discardworker.h"
> Not in alphabetical order
>

Fixed all of the above four points.

>
> +++ b/src/backend/access/undo/undodiscard.c
> ---
>
> +next_insert = UndoLogGetNextInsertPtr(logno);
> I checked UndoLogGetNextInsertPtr() definition. It calls
> find_undo_log_slot() to get back the slot from logno. Why not make it
> accept slot as against logno ? At all other places, the slot->logno is
> passed, so it is convenient to just pass the slot there. And in
> UndoDiscardOneLog(), first call find_undo_log_slot() just before the
> above line (or call it at the end of the do-while loop).
>

I am not sure this is a good idea because find_undo_log_slot is
purely internal to the undolog module; exposing it outside doesn't seem
like a good idea to me.

> This way,
> during each of the UndoLogGetNextInsertPtr() calls in undorequest.c,
> we will have one less find_undo_log_slot() call.
>

I am not sure there is any performance benefit either, because there is
a cache for the slots and the lookup should be satisfied from there very
quickly.  Still, we can avoid the repeated call, and I have done so in
the attached patch.

> -
>
> In UndoDiscardOneLog(), there are at least 2 variable declarations
> that can be moved inside the do-while loop : uur and next_insert. I am
> not sure about the other variables viz : undofxid and
> latest_discardxid. Values of these variables in one iteration continue
> across to the second iteration. For latest_discardxid, it looks like
> we do want its value to be carried forward, but is it also true for
> undofxid ?
>

undofxid can be moved inside the loop; I have fixed that and the other
variables you pointed out.

> + /* If we reach here, this means there is something to discard. */
> + 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-09 Thread Amit Kapila
On Tue, Jul 23, 2019 at 8:12 PM Amit Khandekar  wrote:
>
> 
>
> Some further review comments for undoworker.c :
>
>
> +/* Sets the worker's lingering status. */
> +static void
> +UndoWorkerIsLingering(bool sleep)
> The function name sounds like "is the worker lingering ?". Can we
> rename it to something like "UndoWorkerSetLingering" ?
>

makes sense, changed as per suggestion.

> -
>
> + errmsg("undo worker slot %d is empty, cannot attach",
> + slot)));
>
> + }
> +
> + if (MyUndoWorker->proc)
> + {
> + LWLockRelease(UndoWorkerLock);
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("undo worker slot %d is already used by "
> + "another worker, cannot attach", slot)));
>
> These two error messages can have a common error message "could not
> attach to worker slot", with errdetail separate for each of them :
> slot %d is empty.
> slot %d is already used by another worker.
>
> --
>

Changed as per suggestion.

> +static int
> +IsUndoWorkerAvailable(void)
> +{
> + int i;
> + int alive_workers = 0;
> +
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
>
> Should have bool return value.
>
> Also, why is it keeping track of number of alive workers ? Sounds like
> earlier it used to return number of alive workers ? If it indeed needs
> to just return true/false, we can do away with alive_workers.
>
> Also, *exclusive* lock is unnecessary.
>
> --
>

Changed as per suggestion.  Additionally, I changed the name of the
function to UndoWorkerIsAvailable(), so that it is similar to other
functions in the file.

> +if (UndoGetWork(false, false, &urinfo, NULL) &&
> +IsUndoWorkerAvailable())
> +UndoWorkerLaunch(urinfo);
>
> There is no lock acquired between IsUndoWorkerAvailable() and
> UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable()
> returns true, there is a small window where UndoWorkerLaunch() does
> not find any worker slot with in_use false, causing assertion failure
> for (worker != NULL).
> --
>

I have removed the assert and instead added a warning.  I have also
added a comment at the place where we call UndoWorkerLaunch to mention
the race condition.

> +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo)
> +{
> + /* Block concurrent access. */
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> *Exclusive* lock is unnecessary.
> -
>

Right, changed to share lock.

> + LWLockRelease(UndoWorkerLock);
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("undo worker slot %d is empty",
> + slot)));
> I believe there is no need to explicitly release an lwlock before
> raising an error, since the lwlocks get released during error
> recovery. Please check all other places where this is done.
> -
>

Fixed.

> + * Start new undo apply background worker, if possible otherwise return 
> false.
> worker, if possible otherwise => worker if possible, otherwise
> -
>
> +static bool
> +UndoWorkerLaunch(UndoRequestInfo urinfo)
> We don't check UndoWorkerLaunch() return value. Can't we make it's
> return value type void ?
>

Now, the function returns void and accordingly I have adjusted the
comment which should address both the above comments.

> Also, it would be better to have urinfo as pointer to UndoRequestInfo
> rather than UndoRequestInfo, so as to avoid structure copy.
> -
>

Okay, changed as per suggestion.

> +{
> + BackgroundWorker bgw;
> + BackgroundWorkerHandle *bgw_handle;
> + uint16 generation;
> + int i;
> + int slot = 0;
> We can remove variable i, and use slot variable in place of i.
> ---
>
> + snprintf(bgw.bgw_name, BGW_MAXLEN, "undo apply worker");
> I think it would be trivial to also append the worker->generation in
> the bgw_name.
> -
>

I am not sure adding 'generation' is of any use. It might be better
to add the database id, as each worker works on a particular database,
so that could be useful information.

>
> + if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
> + {
> + /* Failed to start worker, so clean up the worker slot. */
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> + UndoWorkerCleanup(worker);
> + LWLockRelease(UndoWorkerLock);
> +
> + return false;
> + }
>
> Is it intentional that there is no (warning?) message logged when we
> can't register a bg worker ?
> -
>

Added a warning in that code path.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-09 Thread Amit Kapila
On Mon, Jul 22, 2019 at 8:39 PM Amit Khandekar  wrote:
>
> On Mon, 22 Jul 2019 at 14:21, Amit Kapila  wrote:
>
> I have started review of
> 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch. Below
> are some quick comments to start with:
>
> +++ b/src/backend/access/undo/undoworker.c
>
> +#include "access/xact.h"
> +#include "access/undorequest.h"
> Order is not alphabetical
>

Fixed this and a few others.

> + * Each undo worker then start reading from one of the queue the requests for
> start=>starts
> queue=>queues
>
> -
>
> + rc = WaitLatch(MyLatch,
> +WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
> +10L, WAIT_EVENT_BGWORKER_STARTUP);
> +
> + /* emergency bailout if postmaster has died */
> + if (rc & WL_POSTMASTER_DEATH)
> + proc_exit(1);
> I think now, thanks to commit cfdf4dc4fc9635a, you don't have to
> explicitly handle postmaster death; instead you can use
> WL_EXIT_ON_PM_DEATH. Please check at all such places where this is
> done in this patch.
>
> -
>

Fixed both of the above issues.

> +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo)
> +{
> + /* Block concurrent access. */
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> +
> + MyUndoWorker = >workers[slot];
> Not sure why MyUndoWorker is used here. Can't we use a local variable
> ? Or do we intentionally attach to the slot as a side-operation ?
>
> -
>

I have changed the code around this such that we first attach to the
slot and then get the required info.  Also, I don't see the need for an
exclusive lock here, so I changed it to a shared lock.

> + * Get the dbid where the wroker should connect to and get the worker
> wroker=>worker
>
> -
>
> + BackgroundWorkerInitializeConnectionByOid(urinfo.dbid, 0, 0);
> 0, 0 => InvalidOid, 0
>
> + * Set the undo worker request queue from which the undo worker start
> + * looking for a work.
> start => should start
> a work => work
>
> --
>

Fixed both of these.

> + if (!InsertRequestIntoErrorUndoQueue(urinfo))
> I was thinking what happens if for some reason
> InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the
> entry will not be marked invalid, and so there will be no undo action
> carried out because I think the undo worker will exit. What happens
> next with this entry ?

I think this will change after integration with Robert's latest patch
on queues, so will address along with it if required.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-08 Thread Amit Kapila
On Fri, Aug 9, 2019 at 1:57 AM Robert Haas  wrote:
>
> On Thu, Aug 8, 2019 at 9:31 AM Andres Freund  wrote:
> > I know that Robert is working on a patch that revises the undo request
> > layer somewhat, it's possible that this is best discussed afterwards.
>
> Here's what I have at the moment.  This is not by any means a complete
> replacement for Amit's undo worker machinery, but it is a significant
> redesign (and I believe a significant improvement to) the queue
> management stuff from Amit's patch.
>

Thanks for working on this.  Neither Kuntal nor I have had time to
look into this part in detail.

>  I wrote this pretty quickly, so
> while it passes simple testing, it probably has a number of bugs, and
> to actually use it, it would need to be integrated with xact.c;
>

I can look into this and integrate it with the other parts of the patch
next week, unless you are planning to do so.  Right now, I am working on
addressing some other comments raised on the patches, which I will share
today or early next week, after which I can start looking into this.  I
hope that is fine with you.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-08 Thread Robert Haas
On Thu, Aug 8, 2019 at 9:31 AM Andres Freund  wrote:
> I know that Robert is working on a patch that revises the undo request
> layer somewhat, it's possible that this is best discussed afterwards.

Here's what I have at the moment.  This is not by any means a complete
replacement for Amit's undo worker machinery, but it is a significant
redesign (and I believe a significant improvement to) the queue
management stuff from Amit's patch.  I wrote this pretty quickly, so
while it passes simple testing, it probably has a number of bugs, and
to actually use it, it would need to be integrated with xact.c; right
now it's just a standalone module that doesn't do anything except let
itself be tested.

Some of the ways it is different from Amit's patches include:

* It uses RBTree rather than binaryheap, so when we look ahead, we
look ahead in the right order.

* There's no limit to the lookahead distance; when looking ahead, it
will search the entirety of all 3 RBTrees for an entry from the right
database.

* It doesn't have a separate hash table keyed by XID.  I didn't find
that necessary.

* It's better-isolated, as you can see from the fact that I've
included a test module that tests this code without actually ever
putting an UndoRequestManager in shared memory.  I would've liked to
expand this test module, but I don't have time to do that today and
felt it better to get this much sent out.

* It has a lot of comments explaining the design and how it's intended
to integrate with the rest of the system.

Broadly, my vision for how this would get used is:

- Create an UndoRecordManager in shared memory.
- Before a transaction first attaches to a permanent or unlogged undo
log, xact.c would call RegisterUndoRequest(); thereafter, xact.c would
store a pointer to the UndoRecord for the lifetime of the toplevel
transaction.
- Immediately after attaching to a permanent or unlogged undo log,
xact.c would call UndoRequestSetLocation.
- xact.c would track the number of bytes of permanent and unlogged
undo records the transaction generates.  If the transaction goes onto
abort, it reports these by calling FinalizeUndoRequest.
- If the transaction commits, it doesn't need that information, but
does need to call UnregisterUndoRequest() as a post-commit step in
CommitTransaction().
- In the case of an abort, after calling FinalizeUndoRequest, xact.c
would call PerformUndoInBackground() to find out whether to do undo in
the background or the foreground.  If undo is to be done in the
foreground, the backend must go on to call UnregisterUndoRequest() if
undo succeeds, and RescheduleUndoRequest() if it fails (see the sketch
after this list).

- In the case of a prepared transaction, a pointer to the UndoRequest
would get stored in the GlobalTransaction (but nothing extra would get
stored in the twophase state file).
- COMMIT PREPARED calls UnregisterUndoRequest().
- ROLLBACK PREPARED calls PerformUndoInBackground; if told to do undo
in the foreground, it must go on to call either
UnregisterUndoRequest() on success or RescheduleUndoRequest() on
failure, just like in the regular abort case.

- After a crash, once recovery is complete but before we open for
connections, or at least before we allow any new undo activity, the
discard worker scans all the logs and makes a bunch of calls to
RecreateUndoRequest(). Then, for each prepared transaction that still
exists, it calls SuspendPreparedUndoRequest() and use the return value
to reset the UndoRequest pointer in the GlobalTransaction.  Only once
both of those steps are completed can undo workers be safely started.
- Undo workers call GetNextUndoRequest() to get the next task that
they should perform, and once they do, they "own" the undo request.
When undo succeeds or fails, they must call either
UnregisterUndoRequest() or RescheduleUndoRequest(), as appropriate,
just like for foreground undo.  Making sure this is water-tight will
probably require some well-done integration with xact.c, so that an
undo request that we "own" because we got it in a background undo
apply process looks exactly the same as one we "own" because it's our
transaction originally.
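
To make the abort-path part of that concrete, a rough sketch of what
xact.c might do follows; every signature below is an assumption made
for illustration, not something taken from the attached patch:

static void
AtAbort_UndoRequest(void)
{
    UndoRequest *req = CurrentUndoRequest;  /* assumed per-transaction state */

    if (req == NULL)
        return;                 /* this transaction wrote no undo */

    /* Report how much permanent/unlogged undo the transaction generated. */
    FinalizeUndoRequest(req, permanent_undo_bytes, unlogged_undo_bytes);

    /* Ask whether a background worker should take over. */
    if (PerformUndoInBackground(req))
        return;                 /* a worker now owns the request */

    /* Foreground undo: unregister on success, reschedule on failure. */
    if (ExecuteUndoActions(req))    /* assumed entry point for applying undo */
        UnregisterUndoRequest(req);
    else
        RescheduleUndoRequest(req);
}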

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


0001-Draft-of-new-undo-request-manager.patch
Description: Binary data


Re: POC: Cleaning up orphaned files using undo logs

2019-08-08 Thread Andres Freund
Hi,

On 2019-08-07 14:50:17 +0530, Amit Kapila wrote:
> On Tue, Aug 6, 2019 at 1:26 PM Andres Freund  wrote:
> > On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
> > > +/*
> > > + * Binary heap comparison function to compare the time at which an error
> > > + * occurred for transactions.
> > > + *
> > > + * The error queue is sorted by next_retry_at and err_occurred_at.  
> > > Currently,
> > > + * the next_retry_at has some constant delay time (see 
> > > PushErrorQueueElem), so
> > > + * it doesn't make much sense to sort by both values.  However, in 
> > > future, if
> > > + * we have some different algorithm for next_retry_at, then it will work
> > > + * seamlessly.
> > > + */
> >
> > Why is it useful to have error_occurred_at be part of the comparison at
> > all? If we need a tiebraker, err_occurred_at isn't that (if we can get
> > conflicts for next_retry_at, then we can also get conflicts in
> > err_occurred_at).  Seems better to use something actually guaranteed to
> > be unique for a tiebreaker.
> >
> 
> This was to distinguish the cases where the request is failing
> multiple times with the request failed the first time.  I agree that
> we need a better unique identifier like FullTransactionid though.  Do
> let me know if you have any other suggestion?

Sure, I get why you have the field. Even if it were just for debugging
or such. Was just commenting upon it being used as part of the
comparison.  I'd just go for (next_retry_at, fxid).
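
With an rbtree, that comparator could be as simple as the following (a
sketch; UndoErrorQueueElem stands in for whatever element type ends up
in the tree, with the field names from the current macros):

static int
undo_error_queue_cmp(const UndoErrorQueueElem *a, const UndoErrorQueueElem *b)
{
    /* Primary key: when the request becomes eligible for retry. */
    if (a->next_retry_at < b->next_retry_at)
        return -1;
    if (a->next_retry_at > b->next_retry_at)
        return 1;

    /* Tiebreaker: the full transaction id, unique per request. */
    if (FullTransactionIdPrecedes(a->full_xid, b->full_xid))
        return -1;
    if (FullTransactionIdPrecedes(b->full_xid, a->full_xid))
        return 1;
    return 0;
}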


> > > +  * backends. This will ensure that it won't get filled.
> > > +  */
> >
> > How does this ensure anything?
> >
> 
> Because based on this we will have a hard limit on the number of undo
> requests after which we won't allow more requests.  See some more
> detailed explanation for the same later in this email.   I think the
> comment needs to be updated.

Well, as your code stands, I don't think there is an actual hard limit
on the number of transactions needing to be undone due to the way errors
are handled. There's no consideration of prepared transactions.



> > > + START_CRIT_SECTION();
> > > +
> > > + /* Update the progress in the transaction header. */
> > > + UndoRecordUpdateTransInfo(, 0);
> > > +
> > > + /* WAL log the undo apply progress. */
> > > + {
> > > + XLogRecPtr  lsn;
> > > + xl_undoapply_progress xlrec;
> > > +
> > > + xlrec.urec_ptr = progress_urec_ptr;
> > > + xlrec.progress = block_num;
> > > +
> > > + XLogBeginInsert();
> > > + XLogRegisterData((char *) , sizeof(xlrec));
> > > +
> > > + RegisterUndoLogBuffers(, 1);
> > > + lsn = XLogInsert(RM_UNDOACTION_ID, 
> > > XLOG_UNDO_APPLY_PROGRESS);
> > > + UndoLogBuffersSetLSN(, lsn);
> > > + }
> > > +
> > > + END_CRIT_SECTION();
> > > +
> > > + /* Release undo buffers. */
> > > + FinishUndoRecordInsert();
> > > +}
> >
> > This whole prepare/execute split for updating apply pregress, and next
> > undo pointers makes no sense to me.
> >
> 
> Can you explain what is your concern here?  Basically, in the prepare
> phase, we do read and lock the buffer and in the actual update phase
> (which is under critical section), we update the contents in the
> shared buffer.  This is the same idea as we use in many places in
> code.

I'll comment on the concerns with the whole API separately.


> > >  typedef struct TwoPhaseFileHeader
> > >  {
> > > @@ -927,6 +928,16 @@ typedef struct TwoPhaseFileHeader
> > >   uint16  gidlen; /* length of the GID - GID 
> > > follows the header */
> > >   XLogRecPtr  origin_lsn; /* lsn of this record at 
> > > origin node */
> > >   TimestampTz origin_timestamp;   /* time of prepare at origin node */
> > > +
> > > + /*
> > > +  * We need the locations of the start and end undo record pointers 
> > > when
> > > +  * rollbacks are to be performed for prepared transactions using 
> > > undo-based
> > > +  * relations.  We need to store this information in the file as the 
> > > user
> > > +  * might rollback the prepared transaction after recovery and for 
> > > that we
> > > +  * need its start and end undo locations.
> > > +  */
> > > + UndoRecPtr  start_urec_ptr[UndoLogCategories];
> > > + UndoRecPtr  end_urec_ptr[UndoLogCategories];
> > >  } TwoPhaseFileHeader;
> >
> > Why do we not need that knowledge for undo processing of a non-prepared
> > transaction?

> The non-prepared transaction also needs to be aware of that.  It is
> stored in TransactionStateData.  I am not sure if I understand your
> question here.

My concern is that I think it's fairly ugly to store data like this in
the 2pc state file. And it's not an insubstantial amount of additional
data either, compared to the current size, even when no undo is in
use. There's a difference between an unused feature increasing backend
local memory and increasing the 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-08 Thread Amit Kapila
On Wed, Aug 7, 2019 at 5:06 PM Thomas Munro  wrote:
>
> On Thu, Aug 1, 2019 at 1:22 AM Amit Kapila  wrote:
> > On Wed, Jul 31, 2019 at 10:13 AM Amit Kapila  
> > wrote:
> > > On Tue, Jul 30, 2019 at 5:26 PM Thomas Munro  
> > > wrote:
> > > >  but
> > > > here's a small thing: I managed to reach an LWLock self-deadlock in
> > > > the undo worker launcher:
> > > >
> > >
> > > I could see the problem, will fix in next version.
> >
> > Fixed both of these problems in the patch just posted by me [1].
>
> I reran the script that found that problem, so I could play with the
> linger logic.
>

Thanks for the test.  I will look into it and get back to you.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Thomas Munro
On Thu, Aug 1, 2019 at 1:22 AM Amit Kapila  wrote:
> On Wed, Jul 31, 2019 at 10:13 AM Amit Kapila  wrote:
> > On Tue, Jul 30, 2019 at 5:26 PM Thomas Munro  wrote:
> > >  but
> > > here's a small thing: I managed to reach an LWLock self-deadlock in
> > > the undo worker launcher:
> > >
> >
> > I could see the problem, will fix in next version.
>
> Fixed both of these problems in the patch just posted by me [1].

I reran the script that found that problem, so I could play with the
linger logic.  It creates N databases, and then it creates tables in
random databases (because I'm testing with the orphaned table cleanup
patch) and commits or rolls back at (say) 100 tx/sec.  While it's
doing that, you can look at the pg_stat_undo_logs view to see the
discard and insert pointers whizzing along nicely, but if you look at
the process table with htop or similar you can see that it's forking
undo apply workers at 100/sec (the pid keeps changing), whenever there
is more than one database involved.  With a single database it lingers
as I was expecting (and then creates problems when you want to drop
the database).  What I was expecting to see is that if you configure
the test to generate undo work in  2, 3 or 4 dbs, and you have
max_undo_workers set to 4, then you should finish up with 4 undo apply
workers hanging around to service the work calmly without any new
forking happening.  If you generate undo work in more than 4
databases, I was expecting to see the undo workers exiting and being
forked so that a total of 4 workers (at any time) can work their way
around the more-than-4 databases, but not switching as fast as they
can, so that we don't waste all our energy on forking and setup (how
fast exactly they should switch, I don't know, that's what I wanted to
see).  A more advanced thing to worry about, not yet tested, is how
well they'll handle asymmetrical work distributions (not enough
workers, but some databases producing a lot and some a little undo
work).  Script attached.

-- 
Thomas Munro
https://enterprisedb.com
# Install: python-psycopg2 (Debian/Ubuntu)
# Run with python2 ./test_undo_worker_local_balancing.py

import psycopg2 as pg
import random
import time

def make_conn(dbname):
  return pg.connect("dbname=" + dbname)

def run_test(dbs, txs_per_sec, commit_ratio, runtime):

  # first, create a control connection that we'll use to create databases
  conn = make_conn("postgres")
  conn.set_isolation_level(pg.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
  cursor = conn.cursor()

  # recreate all the databases
  for n in range(dbs):
cursor.execute("drop database if exists db%d" % n)
cursor.execute("create database db%d" % n)

  # next, open a separate session to each database
  conns = [make_conn("db%d" % n) for n in range(dbs)]
  cursors = [conn.cursor() for conn in conns]

  # set up interesting GUCs in each session
  for cursor in cursors:
cursor.execute("set rollback_overflow_size = '0kB'")
cursor.connection.commit()

  # now do random work at the requested rate until our time runs out
  start = time.time()
  finish_at = start + runtime
  txs = 0
  table_number = 0
  now = start
  while now < finish_at:

# choose a random session, and run a transaction
cursor = random.choice(cursors)
cursor.execute("create table t%d ()" % table_number)
# decide whether to commit or roll back
if random.uniform(0.0, 1.0) < commit_ratio:
  cursor.connection.commit()
else:
  cursor.connection.rollback()
table_number += 1

# wait until it's time to start the next transaction
txs += 1
next_tx_at = (txs / txs_per_sec) + start
if next_tx_at < now:
  print "can't run transactions fast enough, not sleeping"
else:
  time.sleep(next_tx_at - now)
now = time.time()

if __name__ == "__main__":
  dbs = 4
  txs_per_sec = 1 #100.0
  commit_ratio = 0.0 # 0.0 = always roll back, 1.0 = always commit
  runtime = 120 # how long to run for, in seconds
  run_test(dbs, txs_per_sec, commit_ratio, runtime)


Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Heikki Linnakangas

On 07/08/2019 13:52, Dilip Kumar wrote:

I have one more problem related to compression of the command id
field.  Basically, the problem is that we don't set the command id in
the WAL and we will always store FirstCommandId in the undo[1].   So
suppose there were 2 operations under a different CID then during DO
time both the undo record will store the CID field in their respective
undo records but during REDO time, all the commands will store the
same CID(FirstCommandId) so as per the compression logic the
subsequent record for the same transaction will not store the CID
field.  I am not sure what is the best way to handle this but I have
few ideas.

1) Don't compress the CID field ever.
2) Write CID in WAL,  but just for compressing the CID field in undo
(which may not necessarily go to disk) we don't want to add extra 4
bytes in the WAL.


Most transactions have only a few commands, so you could optimize for 
that. If you use some kind of a variable-byte encoding for it, it could 
be a single byte or even just a few bits, for the common cases.
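
For illustration, a minimal varint-style encoding of a 4-byte CommandId
could look like this (not from the patch; with it, most transactions
would spend a single byte per record):

/* Encode a CommandId in 1-5 bytes, 7 bits per byte, low bits first. */
static int
encode_varint_cid(CommandId cid, unsigned char *out)
{
    int         n = 0;

    do
    {
        unsigned char b = cid & 0x7F;

        cid >>= 7;
        if (cid != 0)
            b |= 0x80;          /* more bytes follow */
        out[n++] = b;
    } while (cid != 0);

    return n;                   /* 1 byte for any cid below 128 */
}

static CommandId
decode_varint_cid(const unsigned char *in, int *consumed)
{
    CommandId   cid = 0;
    int         shift = 0;
    int         n = 0;
    unsigned char b;

    do
    {
        b = in[n++];
        cid |= (CommandId) (b & 0x7F) << shift;
        shift += 7;
    } while ((b & 0x80) != 0);

    *consumed = n;
    return cid;
}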


For the first version, I'd suggest keeping it simple, though, and 
optimize later.


- Heikki




Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Heikki Linnakangas

On 05/08/2019 16:24, Robert Haas wrote:

On Sun, Aug 4, 2019 at 5:16 AM Heikki Linnakangas  wrote:

I feel that the level of abstraction is not quite right. There are a
bunch of fields, like uur_block, uur_offset, uur_tuple, that are
probably useful for some UNDO resource managers (zheap I presume), but
seem kind of arbitrary. How is uur_tuple different from uur_payload?
Should they be named more generically as uur_payload1 and uur_payload2?
And why two, why not three or four different payloads? In the WAL record
format, there's a concept of "block id", which allows you to store N
number of different payloads in the record, I think that would be a
better approach. Or only have one payload, and let the resource manager
code divide it as it sees fit.

Many of the fields support a primitive type of compression, where a
field can be omitted if it has the same value as on the first record on
an UNDO page. That's handy. But again I don't like the fact that the
fields have been hard-coded into the UNDO record format. I can see e.g.
the relation oid to be useful for many AMs. But not all. And other AMs
might well want to store and deduplicate other things, aside from the
fields that are in the patch now. I'd like to move most of the fields to
AM specific code, and somehow generalize the compression. One approach
would be to let the AM store an arbitrary struct, and run it through a
general-purpose compression algorithm, using the UNDO page's first
record as the "dictionary".


I thought about this, too. I agree that there's something a little
unsatisfying about the current structure, but I haven't been able to
come up with something that seems definitively better. I think
something along the lines of what you are describing here might work
well, but I am VERY doubtful about the idea of a fixed-size struct. I
think AMs are going to want to store variable-length data: especially
tuples, but maybe also other stuff. For instance, imagine some AM that
wants to implement locking that's more fine-grained that the four
levels of tuple locks we have today: instead of just having key locks
and all-columns locks, you could want to store the exact columns to be
locked. Or maybe your TIDs are variable-width.


Sure, a fixed-size struct is quite limiting. My point is that all that 
packing of data into UNDO records should be AM-specific. Maybe it would 
be handy to have a few common fields in the undo record header itself, 
but most data should be in the AM-specific payload, because it varies 
across AMs.



And the problem is that as soon as you move to something where you
pack in a bunch of variable-sized fields, you lose the ability to
refer to thinks using reasonable names.  That's where I came up with
the idea of an UnpackedUndoRecord: give the common fields that
"everyone's going to need" human-readable names, and jam only the
strange, AM-specific stuff into the payload.  But if those needs are
not actually universal but very much AM-specific, then I'm afraid
we're going to end up with deeply inscrutable code for packing and
unpacking records.  I imagine it's possible to come up with a good
structure for that, but I don't think we have one today.


Yeah, that's also a problem with complicated WAL record types. Hopefully 
the complex cases are an exception, not the norm. A complex case is 
unlikely to fit any pre-defined set of fields anyway. (We could look at 
how e.g. protobuf works, if this is really a big problem. I'm not 
suggesting that we add a dependency just for this, but there might be 
some patterns or interfaces that we could mimic.)


If you remember, we did a big WAL format refactoring in 9.5, which moved 
some information from AM-specific structs to the common headers. Namely, 
the information on the relation blocks that the WAL record applies to. 
That was a very handy refactoring, and allowed tools like pg_waldump to 
print more detailed information about all WAL record types. For WAL 
records, moving the block information was natural, because there was 
special handling for full-page images anyway. However, I don't think we 
have enough experience with UNDO log yet, to know which fields would be 
best to include in the common undo header, and which to leave as 
AM-specific payload. I think we should keep the common header slim, and 
delegate to the AM routines.


For UNDO records, having an XID on every record probably makes sense; 
all the use cases for UNDO log we've discussed are transactional. The 
rules on which UNDO records to apply and what/when to discard, depend on 
whether a transaction committed or aborted and when, so you need the XID 
for that. Although, the rule also depends on the AM; for cleaning up 
orphaned files, an UNDO record for a committed transaction can be 
discarded immediately, while zheap and zedstore records need to be kept 
around longer. So the exact rules for that will need to be AM-specific, 
too. Or maybe there are only a few different cases and we can enumerate 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Dilip Kumar
On Tue, Jul 30, 2019 at 1:54 PM Dilip Kumar  wrote:
>
> On Tue, Jul 30, 2019 at 12:21 PM Thomas Munro  wrote:
> >
> > One data structure that could perhaps hold this would be
> > UndoLogTableEntry (the per-backend cache, indexed by undo log number,
> > with pretty fast lookups; used for things like
> > UndoLogNumberGetCategory()).  As long as you never want to have
> > inter-transaction compression, that should have the right scope to
> > give recovery per-undo log tracking.  If you ever wanted to do
> > compression between transactions too, maybe UndoLogSlot could work,
> > but that'd have more complications.
>
> I think this could be a good idea.  I had thought of keeping in the
> slot as my 3rd option but later I removed it thinking that we need to
> expose the compression field to the undo log layer.  I think keeping
> in the UndoLogTableEntry is a better idea than keeping in the slot.
> But, I still have the same problem that we need to expose undo
> record-level fields to undo log layer to compute the cache entry size.
>   OTOH, If we decide to get from the first record of the page (as I
> mentioned up thread) then I don't think there is any performance issue
> because we are inserting on the same page.  But, for doing that we
> need to unpack the complete undo record (guaranteed to be on one
> page).   And, UnpackUndoData will internally unpack the payload data
> as well which is not required in our case unless we change
> UnpackUndoData such that it unpacks only what the caller wants (one
> input parameter will do).
>
> I am not sure out of these two which idea is better?
>

I have one more problem related to compression of the command id
field.  Basically, the problem is that we don't set the command id in
the WAL and we will always store FirstCommandId in the undo[1].  So
suppose there were two operations under different CIDs: during DO time
the two undo records will each store the CID field, but during REDO
time all the commands will store the same CID (FirstCommandId), so as
per the compression logic the subsequent record for the same
transaction will not store the CID field.  I am not sure what the best
way to handle this is, but I have a few ideas.

1) Don't compress the CID field ever.
2) Write CID in WAL,  but just for compressing the CID field in undo
(which may not necessarily go to disk) we don't want to add extra 4
bytes in the WAL.

Any better idea to handle this?

[1] 
https://www.postgresql.org/message-id/CAFiTN-u2Ny2E-NgT8nmE65awJ7keOzePODZTEg98ceF%2BsNhRtw%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Amit Kapila
On Tue, Aug 6, 2019 at 1:26 PM Andres Freund  wrote:
> On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
>

I am responding to some of the points where I need more input or where
some discussion is required.  Some of the things need more thought, and
I will respond to those later; others are quite straightforward and
don't need much discussion.

>
> > +/*
> > + * Binary heap comparison function to compare the time at which an error
> > + * occurred for transactions.
> > + *
> > + * The error queue is sorted by next_retry_at and err_occurred_at.  
> > Currently,
> > + * the next_retry_at has some constant delay time (see 
> > PushErrorQueueElem), so
> > + * it doesn't make much sense to sort by both values.  However, in future, 
> > if
> > + * we have some different algorithm for next_retry_at, then it will work
> > + * seamlessly.
> > + */
>
> Why is it useful to have error_occurred_at be part of the comparison at
> all? If we need a tiebraker, err_occurred_at isn't that (if we can get
> conflicts for next_retry_at, then we can also get conflicts in
> err_occurred_at).  Seems better to use something actually guaranteed to
> be unique for a tiebreaker.
>

This was to distinguish the case where the request has failed multiple
times from the case where it has failed for the first time.  I agree
that we need a better unique identifier like FullTransactionId, though.
Do let me know if you have any other suggestion.

>
>
> > + /*
> > +  * The rollback hash table is used to avoid duplicate undo requests by
> > +  * backends and discard worker.  The table must be able to accomodate 
> > all
> > +  * active undo requests.  The undo requests must appear in both xid 
> > and
> > +  * size requests queues or neither.  In same transaction, there can 
> > be two
> > +  * requests one for logged relations and another for unlogged 
> > relations.
> > +  * So, the rollback hash table size should be equal to two request 
> > queues,
> > +  * an error queue (currently this is same as request queue) and max
>
> "the same"? I assume this intended to mean the same size?
>

Yes. I will add the word "size" to make it clearer.

>
> > +  * backends. This will ensure that it won't get filled.
> > +  */
>
> How does this ensure anything?
>

Because, based on this, we will have a hard limit on the number of undo
requests, after which we won't allow more requests.  There is a more
detailed explanation of this later in this email.  I think the comment
needs to be updated.

>
> > +  * the binary heaps which can change.
> > +  */
> > + Assert(LWLockHeldByMeInMode(RollbackRequestLock, LW_EXCLUSIVE));
> > +
> > + /*
> > +  * We normally push the rollback request to undo workers if the size 
> > of
> > +  * same is above a certain threshold.
> > +  */
> > + if (req_size >= rollback_overflow_size * 1024 * 1024)
> > + {
>
> Why is this being checked with the lock held? Seems like this should be
> handled in a pre-check?
>

Yeah, it can be a pre-check, but I thought it better to encapsulate
everything in the function, as this is not an expensive check.  I think
we can move it outside the lock to avoid any such confusion.
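
For what it's worth, the shape being suggested is just the following (a
sketch; the names come from the quoted code and the rest of the
function is elided):

/* Cheap pre-check first: the threshold test touches no shared state. */
if (req_size < (Size) rollback_overflow_size * 1024 * 1024)
    return false;   /* assume false means: undo it in the foreground */

LWLockAcquire(RollbackRequestLock, LW_EXCLUSIVE);
/* ... only the queue and hash-table manipulation happens under the lock ... */
LWLockRelease(RollbackRequestLock);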

>
> > + * allow_peek - if true, peeks a few element from each queue to check 
> > whether
> > + * any request matches current dbid.
> > + * remove_from_queue - if true, picks an element from the queue whose dbid
> > + * matches current dbid and remove it from the queue before returning the 
> > same
> > + * to caller.
> > + * urinfo - this is an OUT parameter that returns the details of undo 
> > request
> > + * whose undo action is still pending.
> > + * in_other_db_out - this is an OUT parameter.  If we've not found any work
> > + * for current database, but there is work for some other database, we set
> > + * this parameter as true.
> > + */
> > +bool
> > +UndoGetWork(bool allow_peek, bool remove_from_queue, UndoRequestInfo 
> > *urinfo,
> > + bool *in_other_db_out)
> > +{
>
>
> > + /*
> > +  * If some undo worker is already processing the rollback 
> > request or
> > +  * it is already processed, then we drop that request from 
> > the queue
> > +  * and fetch the next entry from the queue.
> > +  */
> > + if (!rh || UndoRequestIsInProgress(rh))
> > + {
> > + RemoveRequestFromQueue(cur_queue, 0);
> > + cur_undo_queue++;
> > + continue;
> > + }
>
> When is it possible to hit the in-progress case?
>

The same request is in two queues. It is possible that while the
request is being processed from the xid queue by one worker, the same
request is picked up from the other queue by another worker.  I think
this case won't exist after switching to rbtree-based queues.

> > +/*
> > + * UpdateUndoApplyProgress - Updates how far undo actions from a particular
> > + * log have been applied while rolling back a 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Dilip Kumar
On Tue, Aug 6, 2019 at 1:26 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
> > Need to do something else for a bit. More later.
>
> > + *   false, otherwise.
> > + */
> > +static bool
> > +UndoAlreadyApplied(FullTransactionId full_xid, UndoRecPtr to_urecptr)
> > +{
> > + UnpackedUndoRecord *uur = NULL;
> > + UndoRecordFetchContext  context;
> > +
> > + /* Fetch the undo record. */
> + BeginUndoFetch(&context);
> + uur = UndoFetchRecord(&context, to_urecptr);
> + FinishUndoFetch(&context);
>
> Literally all the places that fetch a record, fetch them with exactly
> this combination of calls. If that's the pattern, what do we gain by
> this split?   Note that UndoBulkFetchRecord does *NOT* use an
> UndoRecordFetchContext, for reasons that are beyond me.

Actually, this is for zheap or any other AM that needs to traverse a
transaction's undo chain. For example, in zheap we get the latest undo
record pointer from the slot, but we need to traverse the undo record
chain backward, using the prevundo pointer stored in each undo record,
to find the undo record for a particular tuple.  Earlier, there was a
loop in UndoFetchRecord which traversed the undo chain until it found
the matching record, and the record was matched using a callback.
There was also an optimization that if the current record didn't
satisfy the callback, we kept the pin held on the buffer and moved to
the previous record in the chain.  Later, based on Robert's review
comments, we decided that finding the matching undo record should be
the caller's responsibility, so we moved the loop out of
UndoFetchRecord and kept it in the zheap code.  The reason for keeping
the context is that we can keep the buffer pin held and remember that
buffer in the context, so that the caller can call UndoFetchRecord in a
loop and the pin will be held on the buffer from which we read the last
undo record.

I agree that in the undo-processing patch set we always need to fetch
exactly one record, so instead of repeating this pattern everywhere we
can write one function and move this sequence of calls into it.
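
Something as small as this would do (a sketch; the wrapper name is made
up, the rest is the existing fetch API from the patch set):

/* Convenience wrapper for the common fetch-exactly-one-record pattern. */
static UnpackedUndoRecord *
UndoFetchOneRecord(UndoRecPtr urecptr)
{
    UndoRecordFetchContext context;
    UnpackedUndoRecord *uur;

    BeginUndoFetch(&context);
    uur = UndoFetchRecord(&context, urecptr);
    FinishUndoFetch(&context);

    return uur;                 /* NULL if the record has already been discarded */
}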

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Dilip Kumar
On Tue, Aug 6, 2019 at 1:26 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
> > Need to do something else for a bit. More later.
>
> Here we go.
>
Thanks for the review.  I will work on the comments.  Right now, I need
suggestions on some of them.
>
> > + /*
> > +  * Compute the header size of the undo record.
> > +  */
> > +Size
> > +UndoRecordHeaderSize(uint16 uur_info)
> > +{
> > + Sizesize;
> > +
> > + /* Add fixed header size. */
> > + size = SizeOfUndoRecordHeader;
> > +
> > + /* Add size of transaction header if it presets. */
> > + if ((uur_info & UREC_INFO_TRANSACTION) != 0)
> > + size += SizeOfUndoRecordTransaction;
> > +
> > + /* Add size of rmid if it presets. */
> > + if ((uur_info & UREC_INFO_RMID) != 0)
> > + size += sizeof(RmgrId);
> > +
> > + /* Add size of reloid if it presets. */
> > + if ((uur_info & UREC_INFO_RELOID) != 0)
> > + size += sizeof(Oid);
> > +

> There's numerous blocks with one if for each type, and the body copied
> basically the same for each alternative. That doesn't seem like a
> reasonable approach to me. Means that many places need to be adjusted
> when we invariably add another type, and seems likely to lead to bugs
> over time.
>
I agree with the point that we are repeating this in a couple of
functions while performing different actions, e.g. in this function we
are computing the size and in some other function we are copying the
field.  I am not sure what the best way to handle it would be.  One
approach could be to write one function which handles all these cases,
with the caller specifying what action to take.  Basically, it would
look like this:

Function(uur_info, action)
{
    if ((uur_info & UREC_INFO_TRANSACTION) != 0)
    {
        /* if action is "compute header size" */
        size += SizeOfUndoRecordTransaction;
        /* else if action is "copy to dest" */
        dest = src;
        ...
    }
    /* Repeat for other types */
}

But, IMHO, it will be confusing for anyone reading that function to
work out what exactly it is trying to achieve.  If anyone has a better
idea, please suggest it.
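
One shape that avoids both the repeated ifs and an action parameter
would be a small table of the optional fields, which the size
computation (and, with per-field offsets added, the copy path) can
walk.  This is only a sketch against the constants visible in the
quoted code, not a proposal from the patch set:

typedef struct UndoOptionalField
{
    uint16      mask;           /* UREC_INFO_* bit */
    Size        size;           /* on-disk size of the field */
} UndoOptionalField;

static const UndoOptionalField undo_optional_fields[] = {
    {UREC_INFO_TRANSACTION, SizeOfUndoRecordTransaction},
    {UREC_INFO_RMID, sizeof(RmgrId)},
    {UREC_INFO_RELOID, sizeof(Oid)},
    {UREC_INFO_XID, sizeof(FullTransactionId)},
    {UREC_INFO_CID, sizeof(CommandId)},
    {UREC_INFO_FORK, sizeof(ForkNumber)},
    {UREC_INFO_PREVUNDO, sizeof(UndoRecPtr)},
    {UREC_INFO_BLOCK, SizeOfUndoRecordBlock},
    {UREC_INFO_LOGSWITCH, SizeOfUndoRecordLogSwitch},
    {UREC_INFO_PAYLOAD, SizeOfUndoRecordPayload},
};

Size
UndoRecordHeaderSize(uint16 uur_info)
{
    Size        size = SizeOfUndoRecordHeader;
    int         i;

    for (i = 0; i < lengthof(undo_optional_fields); i++)
    {
        if ((uur_info & undo_optional_fields[i].mask) != 0)
            size += undo_optional_fields[i].size;
    }
    return size;
}

The copy path would additionally need each field's offset and length
within UnpackedUndoRecord, which is where a table-driven approach gets
less tidy for variable-length fields.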

> > +/*
> > + * Insert the undo record into the input page from the unpack undo context.
> > + *
> > + * Caller can  call this function multiple times until desired stage is 
> > reached.
> > + * This will write the undo record into the page.
> > + */
> > +void
> > +InsertUndoData(UndoPackContext *ucontext, Page page, int starting_byte)
> > +{
> > + char   *writeptr = (char *) page + starting_byte;
> > + char   *endptr = (char *) page + BLCKSZ;
> > +
> > + switch (ucontext->stage)
> > + {
> > + case UNDO_PACK_STAGE_HEADER:
> > + /* Insert undo record header. */
> > + if (!InsertUndoBytes((char *) &ucontext->urec_hd,
> > +                      SizeOfUndoRecordHeader, &writeptr, endptr,
> > +                      &ucontext->already_processed,
> > +                      &ucontext->partial_bytes))
> > + return;
> > + ucontext->stage = UNDO_PACK_STAGE_TRANSACTION;
> > + /* fall through */
> > +
>
> I don't understand. The only purpose of this is that we can partially
> write a packed-but-not-actually-packed record onto a bunch of pages? And
> for that we have an endless chain of copy and pasted code calling
> InsertUndoBytes()? Copying data into shared buffers in tiny increments?
>
> If we need to this, what is the whole packed record format good for?
> Except for adding a bunch of functions with 10++ ifs and nearly
> identical code?
>
> Copying data is expensive. Copying data in tiny increments is more
> expensive. Copying data in tiny increments, with a bunch of branches, is
> even more expensive. Copying data in tiny increments, with a bunch of
> branches, is even more expensive, especially when it's shared
> memory. Copying data in tiny increments, with a bunch of branches, is
> even more expensive, especially when it's shared memory, especially when
> all that shared meory is locked at once.

My idea is, instead of keeping all these fields duplicated in the
context, to just allocate a single memory segment equal to the expected
record size (maybe the payload data can be kept separate).  Then, based
on uur_info, pack all the fields of UnpackedUndoRecord into that memory
segment.  After that, in InsertUndoData, we just need one call to
InsertUndoBytes to copy the complete header in one shot and another
call to copy the payload data.  Does this sound reasonable to you?
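
Concretely, InsertUndoData would then collapse to something like this
(a sketch of the proposal above; packed_header, packed_header_size,
payload_data and payload_size are assumed context fields, and the
bookkeeping for resuming across pages is elided):

void
InsertUndoData(UndoPackContext *ucontext, Page page, int starting_byte)
{
    char       *writeptr = (char *) page + starting_byte;
    char       *endptr = (char *) page + BLCKSZ;

    /* One call copies the whole pre-packed header ... */
    if (!InsertUndoBytes(ucontext->packed_header,
                         ucontext->packed_header_size,
                         &writeptr, endptr,
                         &ucontext->already_processed,
                         &ucontext->partial_bytes))
        return;

    /* ... and one more copies the payload data, if any. */
    if (ucontext->payload_size > 0)
        (void) InsertUndoBytes(ucontext->payload_data,
                               ucontext->payload_size,
                               &writeptr, endptr,
                               &ucontext->already_processed,
                               &ucontext->partial_bytes);
}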

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Kuntal Ghosh
Hello Andres,

On Tue, Aug 6, 2019 at 1:26 PM Andres Freund  wrote:
> > +/* Each worker queue is a binary heap. */
> > +typedef struct
> > +{
> > + binaryheap *bh;
> > + union
> > + {
> > + UndoXidQueue *xid_elems;
> > + UndoSizeQueue *size_elems;
> > + UndoErrorQueue *error_elems;
> > + }   q_choice;
> > +} UndoWorkerQueue;
>
> As we IIRC have decided to change this into a rbtree, I'll ignore
> related parts of the current code.  What is the status of that work?
> I've checked the git trees, without seeing anything? Your last mail with
> patches
> https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com
> doesn't seem to contain that either?
>
Yeah, we're changing this into an rbtree. This is still a work in progress.

>
> > .
> > +#define GetErrorQueueNthElem(n) \
> > +( \
> > + AssertMacro(!ErrorQueueIsEmpty()), \
> > + DatumGetPointer(binaryheap_nth(UndoWorkerQueues[ERROR_QUEUE].bh, n)) \
> > +)
>
>
> -ETOOMANYMACROS
>
> I think nearly all of these shouldn't exist. See further below.
>
>
> > +#define SetErrorQueueElem(elem, e_dbid, e_full_xid, e_start_urec_ptr, 
> > e_retry_at, e_occurred_at) \
> > +( \
> > + GetErrorQueueElem(elem).dbid = e_dbid, \
> > + GetErrorQueueElem(elem).full_xid = e_full_xid, \
> > + GetErrorQueueElem(elem).start_urec_ptr = e_start_urec_ptr, \
> > + GetErrorQueueElem(elem).next_retry_at = e_retry_at, \
> > + GetErrorQueueElem(elem).err_occurred_at = e_occurred_at \
> > +)
>
> It's very very rarely a good idea to have macros that evaluate their
> arguments multiple times. It'll also never be a good idea to get the
> same element multiple times from a queue.  If needed - I'm very doubtful
> of that, given that there's a single caller - it should be a static
> inline function that gets the element once, stores it in a local
> variable, and then updates all the fields.
>
Noted. Earlier, Robert also raised the point about using so many
macros. He suggested using a single type of object that stores all the
information we need, which will make things simpler and easier to
understand. In the upcoming patch set, we're removing all these
changes.
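
For reference, the kind of replacement Andres describes above would be
roughly (a sketch; UndoErrorQueueElem is assumed to be the element type
behind the GetErrorQueueElem() macro, and the caller is assumed to have
looked the element up once):

static inline void
SetErrorQueueElem(UndoErrorQueueElem *elem, Oid dbid,
                  FullTransactionId full_xid, UndoRecPtr start_urec_ptr,
                  TimestampTz next_retry_at, TimestampTz err_occurred_at)
{
    /* Each argument is evaluated exactly once, unlike with the macro. */
    elem->dbid = dbid;
    elem->full_xid = full_xid;
    elem->start_urec_ptr = start_urec_ptr;
    elem->next_retry_at = next_retry_at;
    elem->err_occurred_at = err_occurred_at;
}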


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Thomas Munro
Hi,

I'll be responding to a bunch of long review emails in this thread
point by point separately, but just picking out a couple of points
here that jumped out at me:

On Wed, Aug 7, 2019 at 9:18 AM Andres Freund  wrote:
> > + {
> > + /*
> > +  * For the "shared" category, we only discard 
> > when the
> > +  * rm_undo_status callback tells us we can.
> > +  */
>
> Is there a description as to what the rm_status callback is intended to
> do? It currently is mandatory, is that intended?  Why does this only
> apply to shared records? And why just for SHARED, not for any of the others?

Yeah, I will respond to this.  After recent discussions with Robert
the whole UNDO_SHARED concept looks a bit shaky, and there's a better
way trying to get out -- more on that soon.

> > See
> > +  * DiscardWorkerMain.
>
> Hm. This actually reminds me of a complaint I have about this. ISTM that
> the logic for discarding itself should be separate from the discard
> worker. I'd just add that, and a UDF to invoke it, in a separate commit.

That's not a bad idea -- I have a 'pg_force_discard()' SP, which I'll
include in my next patchset; it's a bit raw, and I'm planning to make
it a bit smarter -- it might make sense to use the same code path
for that.

-- 
Thomas Munro
https://enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-06 Thread Andres Freund
Hi,

On 2019-08-06 00:56:26 -0700, Andres Freund wrote:
> Out of energy.

Here's the last section of my low-level review.  I plan to write a
higher-level summary afterwards, now that I have a better picture of the
code.


> +static void
> +UndoDiscardOneLog(UndoLogSlot *slot, TransactionId xmin, bool *hibernate)

I think the naming here is pretty confusing.  We have UndoDiscard(),
UndoDiscardOneLog(), UndoLogDiscard(). I don't think anybody really can
be expected to understand what is supposed to be what from these names.


> + /* Loop until we run out of discardable transactions in the given log. 
> */
> + do
> + {

for(;;) or while (true)


> + TransactionId wait_xid = InvalidTransactionId;
> + bool pending_abort = false;
> + bool request_rollback = false;
> + UndoStatus status;
> + UndoRecordFetchContext  context;
> +
> + next_insert = UndoLogGetNextInsertPtr(logno);
> +
> + /* There must be some undo data for a transaction. */
> + Assert(next_insert != undo_recptr);
> +
> + /* Fetch the undo record for the given undo_recptr. */
> + BeginUndoFetch(&context);
> + uur = UndoFetchRecord(&context, undo_recptr);
> + FinishUndoFetch(&context);
> +
> + if (uur != NULL)
> + {
> + if (UndoRecPtrGetCategory(undo_recptr) == UNDO_SHARED)

FWIW, this is precisely my problem with exposing such small
informational functions, which actually have to perform some work. As
is, there are several places looking up the underlying undo slot within
just these lines of code.

We do it once in UndoLogGetNextInsertPtr(). Then again in
UndoFetchRecord(). And then again in UndoRecPtrGetCategory(). And then
later again multiple times when actually discarding.  That perhaps
doesn't matter from a performance POV, but for me that indicates that
the APIs aren't quite right.


> + {
> + /*
> +  * For the "shared" category, we only discard 
> when the
> +  * rm_undo_status callback tells us we can.
> +  */

Is there a description as to what the rm_status callback is intended to
do? It currently is mandatory, is that intended?  Why does this only
apply to shared records? And why just for SHARED, not for any of the others?



> + else
> + {
> + TransactionId xid = 
> XidFromFullTransactionId(uur->uur_fxid);
> +
> + /*
> +  * Otherwise we use the CLOG and xmin to decide 
> whether to
> +  * wait, discard or roll back.
> +  *
> +  * XXX: We've added the transaction-in-progress 
> check to
> +  * avoid xids of in-progress autovacuum as 
> those are not
> +  * computed for oldestxmin calculation.

Hm. xids of autovacuum?  The concern here is the xid that autovacuum
might acquire when locking a relation for truncating a table at the end,
with wal_level=replica?  Because otherwise it shouldn't have any xids?



> See
> +  * DiscardWorkerMain.

Hm. This actually reminds me of a complaint I have about this. ISTM that
the logic for discarding itself should be separate from the discard
worker. I'd just add that, and a UDF to invoke it, in a separate commit.



> + /*
> +  * Add the aborted transaction to the rollback request 
> queues.
> +  *
> +  * We can ignore the abort for transactions whose 
> corresponding
> +  * database doesn't exist.
> +  */
> + if (request_rollback && 
> dbid_exists(uur->uur_txn->urec_dbid))
> + {
> > + (void) RegisterUndoRequest(InvalidUndoRecPtr, undo_recptr,
> > +                            uur->uur_txn->urec_dbid, uur->uur_fxid);
> +
> + pending_abort = true;
> + }

As I think I said before: this imo should not be necessary.


> +
> + /*
> +  * We can discard upto this point when one of following 
> conditions is

*up to


> +  * met: (a) we need to wait for a transaction first. (b) there 
> is no
> +  * more log to process. (c) the transaction undo in current log 
> is
> +  * finished. (d) there is a pending abort.
> +  */

This comment is hard to understand. Perhaps you're missing some words?
Because it's e.g. 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-06 Thread Andres Freund
Hi,

On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
> Need to do something else for a bit. More later.

Here we go.




> + /*
> +  * Compute the header size of the undo record.
> +  */
> +Size
> +UndoRecordHeaderSize(uint16 uur_info)
> +{
> + Sizesize;
> +
> + /* Add fixed header size. */
> + size = SizeOfUndoRecordHeader;
> +
> + /* Add size of transaction header if it presets. */
> + if ((uur_info & UREC_INFO_TRANSACTION) != 0)
> + size += SizeOfUndoRecordTransaction;
> +
> + /* Add size of rmid if it presets. */
> + if ((uur_info & UREC_INFO_RMID) != 0)
> + size += sizeof(RmgrId);
> +
> + /* Add size of reloid if it presets. */
> + if ((uur_info & UREC_INFO_RELOID) != 0)
> + size += sizeof(Oid);
> +
> + /* Add size of fxid if it presets. */
> + if ((uur_info & UREC_INFO_XID) != 0)
> + size += sizeof(FullTransactionId);
> +
> + /* Add size of cid if it presets. */
> + if ((uur_info & UREC_INFO_CID) != 0)
> + size += sizeof(CommandId);
> +
> + /* Add size of forknum if it presets. */
> + if ((uur_info & UREC_INFO_FORK) != 0)
> + size += sizeof(ForkNumber);
> +
> + /* Add size of prevundo if it presets. */
> + if ((uur_info & UREC_INFO_PREVUNDO) != 0)
> + size += sizeof(UndoRecPtr);
> +
> + /* Add size of the block header if it presets. */
> + if ((uur_info & UREC_INFO_BLOCK) != 0)
> + size += SizeOfUndoRecordBlock;
> +
> + /* Add size of the log switch header if it presets. */
> + if ((uur_info & UREC_INFO_LOGSWITCH) != 0)
> + size += SizeOfUndoRecordLogSwitch;
> +
> + /* Add size of the payload header if it presets. */
> + if ((uur_info & UREC_INFO_PAYLOAD) != 0)
> + size += SizeOfUndoRecordPayload;

There are numerous blocks with one if for each type, with the body copied
nearly verbatim for each alternative. That doesn't seem like a
reasonable approach to me. It means that many places need to be adjusted
when we invariably add another type, and it seems likely to lead to bugs
over time.

> + /* Add size of the payload header if it presets. */

FWIW, repeating the same comment, with or without minor differences, 10
times is a bad idea. Especially when the comment doesn't add *any* sort
of information.

Also, "if it presets" presumably is a typo?


> +/*
> + * Compute and return the expected size of an undo record.
> + */
> +Size
> +UndoRecordExpectedSize(UnpackedUndoRecord *uur)
> +{
> + Sizesize;
> +
> + /* Header size. */
> + size = UndoRecordHeaderSize(uur->uur_info);
> +
> + /* Payload data size. */
> + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
> + {
> + size += uur->uur_payload.len;
> + size += uur->uur_tuple.len;
> + }
> +
> + /* Add undo record length size. */
> + size += sizeof(uint16);
> +
> + return size;
> +}
> +
> +/*
> + * Calculate the size of the undo record stored on the page.
> + */
> +static inline Size
> +UndoRecordSizeOnPage(char *page_ptr)
> +{
> + uint16  uur_info = ((UndoRecordHeader *) page_ptr)->urec_info;
> + Sizesize;
> +
> + /* Header size. */
> + size = UndoRecordHeaderSize(uur_info);
> +
> + /* Payload data size. */
> + if ((uur_info & UREC_INFO_PAYLOAD) != 0)
> + {
> + UndoRecordPayload *payload = (UndoRecordPayload *) (page_ptr + 
> size);
> +
> + size += payload->urec_payload_len;
> + size += payload->urec_tuple_len;
> + }
> +
> + return size;
> +}
> +
> +/*
> + * Compute size of the Unpacked undo record in memory
> + */
> +Size
> +UnpackedUndoRecordSize(UnpackedUndoRecord *uur)
> +{
> + Sizesize;
> +
> + size = sizeof(UnpackedUndoRecord);
> +
> + /* Add payload size if record contains payload data. */
> + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
> + {
> + size += uur->uur_payload.len;
> + size += uur->uur_tuple.len;
> + }
> +
> + return size;
> +}

These functions are all basically the same. We shouldn't copy code over
and over like this.
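
FWIW, a table-driven layout would remove most of that repetition. A sketch
(the table and its name are just an illustration; the UREC_INFO_* bits and
SizeOf* constants are the existing ones):

typedef struct UndoHeaderDef
{
	uint16	mask;		/* UREC_INFO_* bit */
	Size	size;		/* bytes this header part adds */
} UndoHeaderDef;

static const UndoHeaderDef undo_header_defs[] = {
	{UREC_INFO_TRANSACTION, SizeOfUndoRecordTransaction},
	{UREC_INFO_RMID, sizeof(RmgrId)},
	{UREC_INFO_RELOID, sizeof(Oid)},
	{UREC_INFO_XID, sizeof(FullTransactionId)},
	{UREC_INFO_CID, sizeof(CommandId)},
	{UREC_INFO_FORK, sizeof(ForkNumber)},
	{UREC_INFO_PREVUNDO, sizeof(UndoRecPtr)},
	{UREC_INFO_BLOCK, SizeOfUndoRecordBlock},
	{UREC_INFO_LOGSWITCH, SizeOfUndoRecordLogSwitch},
	{UREC_INFO_PAYLOAD, SizeOfUndoRecordPayload},
};

Size
UndoRecordHeaderSize(uint16 uur_info)
{
	Size	size = SizeOfUndoRecordHeader;
	int	i;

	for (i = 0; i < lengthof(undo_header_defs); i++)
	{
		if ((uur_info & undo_header_defs[i].mask) != 0)
			size += undo_header_defs[i].size;
	}

	return size;
}

The size-on-page and expected-size variants could then just add their
payload handling on top, instead of duplicating the whole ladder.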


> +/*
> + * Initiate inserting an undo record.
> + *
> + * This function will initialize the context for inserting and undo record
> + * which will be inserted by calling InsertUndoData.
> + */
> +void
> +BeginInsertUndo(UndoPackContext *ucontext, UnpackedUndoRecord *uur)
> +{
> + ucontext->stage = UNDO_PACK_STAGE_HEADER;
> + ucontext->already_processed = 0;
> + ucontext->partial_bytes = 0;
> +
> + /* Copy undo record header. */
> + ucontext->urec_hd.urec_type = uur->uur_type;
> + ucontext->urec_hd.urec_info = uur->uur_info;
> +
> + /* Copy undo record transaction header if it is present. */
> + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
> + memcpy(&ucontext->urec_txn, uur->uur_txn, SizeOfUndoRecordTransaction);
> +
> + /* 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-05 Thread Amit Kapila
On Mon, Aug 5, 2019 at 6:29 PM Robert Haas  wrote:
>
> On Mon, Aug 5, 2019 at 6:16 AM Amit Kapila  wrote:
> > For zheap, we collect all the records of a page, apply them together
> > and then write the entire page in WAL.  The progress of transaction is
> > updated at either transaction end (rollback complete) or after
> > processing some threshold of undo records.  So, generally, the WAL
> > won't be for each undo record apply.
>
> This explanation omits a crucial piece of the mechanism, because
> Heikki is asking what keeps the undo from being applied multiple
> times.
>

Okay, I didn't realize that.

>  When we apply the undo records to a page, we also adjust the
> undo pointers in the page.  Since we have an undo pointer per
> transaction slot, and each transaction has its own slot, if we apply
> all the undo for a transaction to a page, we can just clear the slot;
> if we somehow end up back at the same point later, we'll know not to
> apply the undo a second time because we'll see that there's no
> transaction slot pointing to the undo we were thinking of applying. If
> we roll back to a savepoint, or for some other reason choose to apply
> only some of the undo to a page, we can set the undo record pointer
> for the transaction back to the value it had before we generated any
> newer undo.  Then, we'll know that the newer undo doesn't need to be
> applied but the older undo can be applied.
>
> At least, I think that's how it's supposed to work.
>

Right, this is how it works.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-05 Thread Robert Haas
On Mon, Aug 5, 2019 at 12:42 PM Andres Freund  wrote:
> A good move in the right direction, imo.

I spent some more time thinking about this and talking to Thomas about
it, and I'd like to make a somewhat more aggressive restructuring
proposal, with the aim of getting a cleaner separation between the layers
of this patch set.

Right now, the undo log storage stuff knows nothing about the contents
of an undo log, whereas the undo interface storage knows everything
about the contents of an undo log. In particular, it knows that it's a
series of records, and those records are grouped into transactions,
and it knows both the format of the individual records and also the
details of how transaction headers work. Nothing can use the undo log
storage system except for the undo interface layer, because the undo
interface layer assumes that all the data in the undo storage system
conforms to the record/recordset format which it defines. However,
there are a few warts: while the undo log storage patch doesn't know
anything about the contents of undo logs, it does know that
transaction boundaries matter, and it signals to the undo interface
layer whether a transaction header should be inserted for a new
record.  That's a strange thing for the storage layer to be doing.
Also, in addition to three persistence levels, it knows about a fourth
undo log category for "special" data for multixact or TPD-like things.
That's another wart.

Suppose that we instead invent a new layer which sits on top of the
undo log storage layer.  This layer manages what I'm going to call
GHOBs, growable hunks of bytes.  (This is probably not the best name,
but I thought of it in 5 seconds during a complex technical
conversation, so bear with me.)  The GHOB layer supports
open/close/grow/write/overwrite operations.  Conceptually, you open a
GHOB with an initial size and a persistence level, and then you can
subsequently grow it unless you fill up the undo log in which case you
can't grow it any more; when you're done, you close it.  Opening and
closing a GHOB are operations that only make in-memory state changes.
Opening a GHOB finds a place where you could write the initial amount
of data you specify, but it doesn't actually write any data or change
any persistent state yet, except for making sure that nobody else can
grab that space as long as you have the GHOB open.  Closing a GHOB
tells the system that you're not going to grow the object any more,
which means some other GHOB can be placed immediately after the last
data you wrote.  Growing a GHOB doesn't do anything persistent either;
it just tests whether there would be room to write those bytes.  So,
the only operations that make actual persistent changes are write and
overwrite.  These operations just copy data into shared buffers and
mark them dirty, but they are set up so that you can integrate this
with whatever WAL-logging you're doing for those operations, so that you
can make the same writes happen at redo time.
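
In API terms I'm imagining something roughly like the following -- every
name and signature here is hypothetical, it's only meant to make the shape
of the proposal concrete:

typedef struct GHOB GHOB;	/* opaque; reserves space in some undo log */

/* In-memory only: find space, reserve more of it, release the reservation. */
extern GHOB *GHOBOpen(UndoPersistence persistence, Size initial_size);
extern bool GHOBGrow(GHOB *ghob, Size additional_bytes);
extern void GHOBClose(GHOB *ghob);

/* Persistent: dirty shared buffers; callers fold these into their own WAL. */
extern void GHOBWrite(GHOB *ghob, Size offset, const char *data, Size len);
extern void GHOBOverwrite(GHOB *ghob, Size offset, const char *data, Size len);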

Then, on top of the GHOB layer, you have separate submodules for
different kinds of GHOBs.  Most importantly, you have a
transaction-GHOB manager, which opens a GHOB per persistence level the
first time somebody wants to write to it and closes those GHOBs at
end-of-xact.  AMs push records into the transaction-GHOB manager, and
it pushes them into GHOBs on the other side. Then you can also have a
multi-GHOB manager, which would replace what Thomas now has as a
separate undo log category.  The undo-log-storage layer wouldn't have
any fixed limit on the number of GHOBs that could be open at the same
time; it would just be the sum of whatever the individual GHOB type
managers can open. It would be important to keep that number fairly
small since there's not an unlimited supply of undo logs, but that
doesn't seem like a problem for any of the uses we currently have in
mind.  Each GHOB would begin with a magic number identifying the GHOB
type, and would have callbacks for everything else, like "how big is
this GHOB?" and "is it discardable?".

I'm not totally sure I've thought through all of the problems here,
but it seems like this might help us fix some of the aforementioned
layering inversions. The undo log storage system only knows about
storage: it doesn't have to help with things like transaction
boundaries any more, and it continues to be indifferent to the actual
contents of the storage.  At the GHOB layer, we know that we've got
chunks of storage which are the unit of undo discard, and we know that
they start with a magic number that identifies the type, but it
doesn't know whether they are internally broken into records or, if
so, how those records are organized. The individual GHOB managers do
know that stuff; for example, the transaction-GHOB manager would know
that AMs insert undo records and how those records are compressed and
so forth.  One thing that feels good about this system is that you
could actually write something like the test_undo module that Thomas

Re: POC: Cleaning up orphaned files using undo logs

2019-08-05 Thread Andres Freund
Hi,

(As I was out of context due to dealing with bugs, I've switched to
looking at the current zheap/undoprocessing branch.)

On 2019-07-30 01:02:20 -0700, Andres Freund wrote:
> +/*
> + * Insert a previously-prepared undo records.
> + *
> + * This function will write the actual undo record into the buffers which are
> + * already pinned and locked in PreparedUndoInsert, and mark them dirty.  
> This
> + * step should be performed inside a critical section.
> + */

Again, I think it's not ok to just assume you can lock an essentially
unbounded number of buffers. This seems almost guaranteed to result in
deadlocks. And there are limits on how many lwlocks one can hold, etc.

As far as I can tell there's simply no deadlock avoidance scheme in use
here *at all*? I must be missing something.
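
For reference, the usual convention elsewhere is to sort the to-be-locked
buffers into a globally consistent order and lock them in that order,
before entering the critical section. A minimal sketch, assuming a
per-buffer struct that records the undo log number and block number (names
invented):

static int
undo_buffer_cmp(const void *a, const void *b)
{
	const UndoBuffer *ba = (const UndoBuffer *) a;
	const UndoBuffer *bb = (const UndoBuffer *) b;

	if (ba->logno != bb->logno)
		return (ba->logno < bb->logno) ? -1 : 1;
	if (ba->blkno != bb->blkno)
		return (ba->blkno < bb->blkno) ? -1 : 1;
	return 0;
}

	/* in the prepare step, before the critical section: */
	int	i;

	qsort(xact_bufs, nbufs, sizeof(UndoBuffer), undo_buffer_cmp);
	for (i = 0; i < nbufs; i++)
		LockBuffer(xact_bufs[i].buf, BUFFER_LOCK_EXCLUSIVE);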


> + /* Main loop for writing the undo record. */
> + do
> + {

I'd prefer this to not be a do{} while(true) loop - as written I need to
read to the end to see what the condition is. I don't think we have any
loops like that in the code.


> + /*
> +  * During recovery, there might be some blocks which 
> are already
> +  * deleted due to some discard command so we can just 
> skip
> +  * inserting into those blocks.
> +  */
> + if (!BufferIsValid(buffer))
> + {
> + Assert(InRecovery);
> +
> + /*
> +  * Skip actual writing just update the context 
> so that we have
> +  * write offset for inserting into next blocks.
> +  */
> + SkipInsertingUndoData(&ucontext, BLCKSZ - starting_byte);
> + if (ucontext.stage == UNDO_PACK_STAGE_DONE)
> + break;
> + }

How exactly can this happen?


> + else
> + {
> + page = BufferGetPage(buffer);
> +
> + /*
> +  * Initialize the page whenever we try to write 
> the first
> +  * record in page.  We start writing 
> immediately after the
> +  * block header.
> +  */
> + if (starting_byte == UndoLogBlockHeaderSize)
> + UndoPageInit(page, BLCKSZ, 
> prepared_undo->urec->uur_info,
> +  
> ucontext.already_processed,
> +  
> prepared_undo->urec->uur_tuple.len,
> +  
> prepared_undo->urec->uur_payload.len);
> +
> + /*
> +  * Try to insert the record into the current 
> page. If it
> +  * doesn't succeed then recall the routine with 
> the next page.
> +  */
> + InsertUndoData(&ucontext, page, starting_byte);
> + if (ucontext.stage == UNDO_PACK_STAGE_DONE)
> + {
> + MarkBufferDirty(buffer);
> + break;

At this point we're five indentation levels deep. I'd extract at least
either the per-prepared-undo code or the code performing the writing
across block boundaries into a separate function. Perhaps both.



> +/*
> + * Helper function for UndoGetOneRecord
> + *
> + * If any of  rmid/reloid/xid/cid is not available in the undo record, then
> + * it will get the information from the first complete undo record in the
> + * page.
> + */
> +static void
> +GetCommonUndoRecInfo(UndoPackContext *ucontext, UndoRecPtr urp,
> +  RelFileNode rnode, UndoLogCategory 
> category, Buffer buffer)
> +{
> + /*
> +  * If any of the common header field is not available in the current 
> undo
> +  * record then we must read it from the first complete record of the 
> page.
> +  */

How is it guaranteed that the first record on the page is actually from
the current transaction? Can't there be a situation where that's from
another transaction?



> +/*
> + * Helper function for UndoFetchRecord and UndoBulkFetchRecord
> + *
> + * curbuf - If an input buffer is valid then this function will not release 
> the
> + * pin on that buffer.  If the buffer is not valid then it will assign curbuf
> + * with the first buffer of the current undo record and also it will keep the
> + * pin and lock on that buffer in a hope that while traversing the undo chain
> + * the caller might want to read the previous undo record from the same 
> block.
> + */

Wait, so at exit 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-05 Thread Andres Freund
Hi,

On 2019-08-05 11:25:10 -0400, Robert Haas wrote:
> The obvious thing to do seems to be to have UndoLogControl objects own
> SmgrRelations. That would be something of a novelty, since it looks
> like currently only a Relation ever owns an SMgrRelation, but the smgr
> infrastructure seems to have been set up in a generic way so as to
> permit that sort of thing, so it seems like it should be workable.

Yea, I think that'd be a good step.


I'm not 100% convinced it's quite enough, due to the way the undo smgr
only ever has a single file descriptor open, and that undo log segments
are fairly small, and that there'll often be multiple persistence levels
active at the same time. But the undo fd handling is probably a separate
concern from who owns the smgr relations.


> I think this kind of design would address your concerns about using
> the unowned list, too, since the UndoLogControl objects would be
> owning the SMgrRelations.

Yup.


> How does all that sound?

A good move in the right direction, imo.

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-08-05 Thread Robert Haas
On Tue, Jul 30, 2019 at 4:02 AM Andres Freund  wrote:
> I'm a bit worried about expanding the use of
> ReadBufferWithoutRelcache(). Not so much because of the relcache itself,
> but because it requires doing separate smgropen() calls. While not
> crazily expensive, it's also not free. Especially combined with closing
> all such relations at transaction end (c.f. AtEOXact_SMgr).
>
> I'm somewhat inclined to think that this requires a slightly bigger
> refactoring than done in this patch. Imo at the very least the smgr
> entries ought not to be unowned. But working towards not haven to
> re-open the smgr entry for every single trival request ought to be part
> of this too.

I spent some time trying to analyze this today and I agree with you
that there seems to be room for improvement here. When I first looked
at your comments, I wasn't too convinced, because access patterns that
skip around between undo logs seem like they may be fairly common.
Admittedly, there are cases where we want to read from just one undo
log over and over again, and it would be good to optimize those, but I
was initially a bit unconvinced that there was a problem here
worth being concerned about. Then I realized that you would also
repeat the smgropen() if you read a single record that happens to be
split across two pages, which seems a little silly.

But then I realized that we're being a little silly even in the case
where we're reading a single undo record that is stored entirely on a
single page.  We are certainly going to need to look up the undo log,
but as things stand, we'll basically do it twice. For example, in the
write path, we'll call UndoLogAllocate() and it will look up an
UndoLogControl object for the undo log of interest, and then we'll
call ReadBufferWithoutRelcache() which will call smgropen() which will
do a hash table lookup to find the SMgrRelation associated with that
undo log. That's not a large cost, as you say, but it does seem like
it might be better to avoid having two different lookups in the same
commonly-used code path, each of which peeks into a different
backend-private data structure for information about the very same
undo log.

The obvious thing to do seems to be to have UndoLogControl objects own
SmgrRelations. That would be something of a novelty, since it looks
like currently only a Relation ever owns an SMgrRelation, but the smgr
infrastructure seems to have been set up in a generic way so as to
permit that sort of thing, so it seems like it should be workable.
Perhaps UndoLogAllocate() function could return a pointer to the
UndoLogControl object as well as UndoRecPtr. Then, there could be a
function UndoLogWrite(UndoLogControl *, UndoRecPtr, char *, Size).  On
the read side, instead of calling UndoRecPtrAssignRelFileNode, maybe
the undo log storage layer should provide a function that again
returns an UndoLogControl, and then we could have a matching function
UndoLogRead(UndoLogControl *, UndoRecPtr, char *, Size).

I think this kind of design would address your concerns about using
the unowned list, too, since the UndoLogControl objects would be
owning the SMgrRelations.  It took me a while to understand why you
were concerned about using the unowned list, so I'm going to repeat it
in my own words to make sure I've got it right, and also to possibly
help out anyone else who may also have had difficulty grokking your
concern.  If we have a bunch of short transactions each of which
accesses the same relation, the relcache entry will remain open and
the file won't get closed in between, but if we have a bunch of short
transactions each of which accesses the same undo log, the undo log
will be closed and reopened at the operating system level for each
individual transaction. That happens because when an SMgrRelation is
"owned," the owner takes care of closing it, and so can keep it open
across transactions, but when it's "unowned," it's automatically
closed during transaction cleanup.  And we should fix it, because
closing and reopening the same file for every transaction
unnecessarily might be expensive enough to matter, at least a little
bit.
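
Concretely, I think the fix could be as small as giving UndoLogControl an
SMgrRelation field and registering the control object as its owner -- the
field and helper below are hypothetical, while smgropen()/smgrsetowner()
are the existing smgr calls:

static SMgrRelation
undolog_get_smgr(UndoLogControl *log, RelFileNode rnode)
{
	if (log->smgr == NULL)
	{
		log->smgr = smgropen(rnode, InvalidBackendId);
		/* Owned SMgrRelations are left alone by AtEOXact_SMgr(). */
		smgrsetowner(&log->smgr, log->smgr);
	}
	return log->smgr;
}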

How does all that sound?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-05 Thread Robert Haas
On Sun, Aug 4, 2019 at 5:16 AM Heikki Linnakangas  wrote:
> I feel that the level of abstraction is not quite right. There are a
> bunch of fields, like uur_block, uur_offset, uur_tuple, that are
> probably useful for some UNDO resource managers (zheap I presume), but
> seem kind of arbitrary. How is uur_tuple different from uur_payload?
> Should they be named more generically as uur_payload1 and uur_payload2?
> And why two, why not three or four different payloads? In the WAL record
> format, there's a concept of "block id", which allows you to store N
> number of different payloads in the record, I think that would be a
> better approach. Or only have one payload, and let the resource manager
> code divide it as it sees fit.
>
> Many of the fields support a primitive type of compression, where a
> field can be omitted if it has the same value as on the first record on
> an UNDO page. That's handy. But again I don't like the fact that the
> fields have been hard-coded into the UNDO record format. I can see e.g.
> the relation oid to be useful for many AMs. But not all. And other AMs
> might well want to store and deduplicate other things, aside from the
> fields that are in the patch now. I'd like to move most of the fields to
> AM specific code, and somehow generalize the compression. One approach
> would be to let the AM store an arbitrary struct, and run it through a
> general-purpose compression algorithm, using the UNDO page's first
> record as the "dictionary".

I thought about this, too. I agree that there's something a little
unsatisfying about the current structure, but I haven't been able to
come up with something that seems definitively better. I think
something along the lines of what you are describing here might work
well, but I am VERY doubtful about the idea of a fixed-size struct. I
think AMs are going to want to store variable-length data: especially
tuples, but maybe also other stuff. For instance, imagine some AM that
wants to implement locking that's more fine-grained that the four
levels of tuple locks we have today: instead of just having key locks
and all-columns locks, you could want to store the exact columns to be
locked. Or maybe your TIDs are variable-width.

And the problem is that as soon as you move to something where you
pack in a bunch of variable-sized fields, you lose the ability to
refer to things using reasonable names.  That's where I came up with
the idea of an UnpackedUndoRecord: give the common fields that
"everyone's going to need" human-readable names, and jam only the
strange, AM-specific stuff into the payload.  But if those needs are
not actually universal but very much AM-specific, then I'm afraid
we're going to end up with deeply inscrutable code for packing and
unpacking records.  I imagine it's possible to come up with a good
structure for that, but I don't think we have one today.

> I don't like the way UndoFetchRecord returns a palloc'd
> UnpackedUndoRecord. I would prefer something similar to the xlogreader
> API, where a new call to UndoFetchRecord invalidates the previous
> result. On efficiency grounds, to avoid the palloc, but also to be
> consistent with xlogreader.

I don't think that's going to work very well, because we often need to
deal with multiple records at a time.  There is (or was) a bulk-fetch
interface, but I've also found while experimenting with this code that
it can be useful to do things like:

current = undo_fetch(starting_record);
for (;;)
{
    next = undo_fetch(current->next_record_ptr);
    if (some_test(next))
        break;
    undo_free(current);
    current = next;
}

I think we shouldn't view such cases as exceptions to the general
paradigm of looking at undo records one at a time, but instead as the
normal case for which everything is optimized.  Cases like orphaned
file cleanup where the number of undo records is probably small and
they're all independent of each other will, I think, turn out to be
the exception rather than the rule.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-05 Thread Robert Haas
On Mon, Aug 5, 2019 at 6:16 AM Amit Kapila  wrote:
> For zheap, we collect all the records of a page, apply them together
> and then write the entire page in WAL.  The progress of transaction is
> updated at either transaction end (rollback complete) or after
> processing some threshold of undo records.  So, generally, the WAL
> won't be for each undo record apply.

This explanation omits a crucial piece of the mechanism, because
Heikki is asking what keeps the undo from being applied multiple
times.  When we apply the undo records to a page, we also adjust the
undo pointers in the page.  Since we have an undo pointer per
transaction slot, and each transaction has its own slot, if we apply
all the undo for a transaction to a page, we can just clear the slot;
if we somehow end up back at the same point later, we'll know not to
apply the undo a second time because we'll see that there's no
transaction slot pointing to the undo we were thinking of applying. If
we roll back to a savepoint, or for some other reason choose to apply
only some of the undo to a page, we can set the undo record pointer
for the transaction back to the value it had before we generated any
newer undo.  Then, we'll know that the newer undo doesn't need to be
applied but the older undo can be applied.

At least, I think that's how it's supposed to work.  If you just
update the progress field, it doesn't guarantee anything, because in
the event of a crash, we could end up keeping the page changes but
losing the update to the progress, as they are part of separate undo
records.
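
In pseudocode, the check on the apply side would look something like this --
every name here is invented, it's just the shape of the logic:

	slot = PageGetTransactionSlot(page, fxid);	/* hypothetical helper */

	if (slot == NULL || slot->urec_ptr < undo_ptr_to_apply)
		return;		/* slot cleared, or already rewound past this undo */

	apply_undo_to_page(page, undo_ptr_to_apply);
	slot->urec_ptr = urec_ptr_before_this_undo;	/* or clear the slot */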

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-05 Thread Amit Kapila
On Mon, Aug 5, 2019 at 12:09 PM Heikki Linnakangas  wrote:
>
> On 05/08/2019 07:23, Thomas Munro wrote:
> > On Mon, Aug 5, 2019 at 3:54 PM Amit Kapila  wrote:
> >> On Sun, Aug 4, 2019 at 2:46 PM Heikki Linnakangas  wrote:
> >>> Could we leave out the UNDO and discard worker processes for now?
> >>> Execute all UNDO actions immediately at rollback, and after crash
> >>> recovery. That would be fine for cleaning up orphaned files,
> >>
> >> Even if we execute all the undo actions on rollback, we need discard
> >> worker to discard undo at regular intervals.  Also, what if we get an
> >> error while applying undo actions during rollback?  Right now, we have
> >> a mechanism to push such a request to background worker and allow the
> >> session to continue.  Instead, we might want to Panic in such cases if
> >> we don't want to have background undo workers.
> >>
> >>> and it
> >>> would cut down the size of the patch to review.
> >>
> >> If we can find some way to handle all cases and everyone agrees to it,
> >> that would be good. In fact, we can try to get the basic stuff
> >> committed first and then try to get the rest (undo-worker machinery)
> >> done.
> >
> > I think it's definitely worth exploring.
>
> Yeah. For cleaning up orphaned files, if unlink() fails, we can just log
> the error and move on. That's what we do in the main codepath, too. For
> any other error, PANIC seems ok. We're not expecting any errors during
> undo processing, so it doesn't seems safe to continue running.
>
> Hmm. Since applying the undo record is WAL-logged, you could run out of
> disk space while creating the WAL record. That seems unpleasant.
>

We might get away with doing some minimal error handling for the
orphaned-file cleanup patch, but this facility was supposed to be a generic
one.  Even assuming all of us agree on the error handling stuff, I still
think we won't be able to get away from the requirement for a discard
worker to discard the logs.

> >>> Can this race condition happen: Transaction A creates a table and an
> >>> UNDO record to remember it. The transaction is rolled back, and the file
> >>> is removed. Another transaction, B, creates a different table, and
> >>> chooses the same relfilenode. It loads the table with data, and commits.
> >>> Then the system crashes. After crash recovery, the UNDO record for the
> >>> first transaction is applied, and it removes the file that belongs to
> >>> the second table, created by transaction B.
> >>
> >> I don't think such a race exists, but we should verify it once.
> >> Basically, once the rollback is complete, we mark the transaction
> >> rollback as complete in the transaction header in undo and write a WAL
> >> for it.  After crash-recovery, we will skip such a transaction.  Isn't
> >> that sufficient to prevent such a race condition?
>
> Ok, I didn't realize there's a flag in the undo record to mark it as
> applied. Yeah, that fixes it. Seems a bit heavy-weight, but I guess it's
> fine. Do you do something different in zheap? I presume writing a WAL
> record for every applied undo record would be too heavy there.
>

For zheap, we collect all the records of a page, apply them together
and then write the entire page in WAL.  The progress of transaction is
updated at either transaction end (rollback complete) or after
processing some threshold of undo records.  So, generally, the WAL
won't be for each undo record apply.

> This needs some performance testing. We're creating one extra WAL record
> and one UNDO record for every file creation, and another WAL record on
> abort. It's probably cheap compared to all the other work done during
> table creation, but we should still get some numbers on it.
>

makes sense.





-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-05 Thread Heikki Linnakangas

On 05/08/2019 07:23, Thomas Munro wrote:

On Mon, Aug 5, 2019 at 3:54 PM Amit Kapila  wrote:

On Sun, Aug 4, 2019 at 2:46 PM Heikki Linnakangas  wrote:

Could we leave out the UNDO and discard worker processes for now?
Execute all UNDO actions immediately at rollback, and after crash
recovery. That would be fine for cleaning up orphaned files,


Even if we execute all the undo actions on rollback, we need discard
worker to discard undo at regular intervals.  Also, what if we get an
error while applying undo actions during rollback?  Right now, we have
a mechanism to push such a request to background worker and allow the
session to continue.  Instead, we might want to Panic in such cases if
we don't want to have background undo workers.


and it
would cut down the size of the patch to review.


If we can find some way to handle all cases and everyone agrees to it,
that would be good. In fact, we can try to get the basic stuff
committed first and then try to get the rest (undo-worker machinery)
done.


I think it's definitely worth exploring.


Yeah. For cleaning up orphaned files, if unlink() fails, we can just log 
the error and move on. That's what we do in the main codepath, too. For 
any other error, PANIC seems ok. We're not expecting any errors during 
undo processing, so it doesn't seem safe to continue running.


Hmm. Since applying the undo record is WAL-logged, you could run out of 
disk space while creating the WAL record. That seems unpleasant.



Can this race condition happen: Transaction A creates a table and an
UNDO record to remember it. The transaction is rolled back, and the file
is removed. Another transaction, B, creates a different table, and
chooses the same relfilenode. It loads the table with data, and commits.
Then the system crashes. After crash recovery, the UNDO record for the
first transaction is applied, and it removes the file that belongs to
the second table, created by transaction B.


I don't think such a race exists, but we should verify it once.
Basically, once the rollback is complete, we mark the transaction
rollback as complete in the transaction header in undo and write a WAL
for it.  After crash-recovery, we will skip such a transaction.  Isn't
that sufficient to prevent such a race condition?


Ok, I didn't realize there's a flag in the undo record to mark it as 
applied. Yeah, that fixes it. Seems a bit heavy-weight, but I guess it's 
fine. Do you do something different in zheap? I presume writing a WAL 
record for every applied undo record would be too heavy there.


This needs some performance testing. We're creating one extra WAL record 
and one UNDO record for every file creation, and another WAL record on 
abort. It's probably cheap compared to all the other work done during 
table creation, but we should still get some numbers on it.


Some regression tests would be nice too.

- Heikki




Re: POC: Cleaning up orphaned files using undo logs

2019-08-04 Thread Thomas Munro
On Mon, Aug 5, 2019 at 3:54 PM Amit Kapila  wrote:
> On Sun, Aug 4, 2019 at 2:46 PM Heikki Linnakangas  wrote:
> > I had a look at the UNDO patches at
> > https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com,
> > and at the patch to use the UNDO logs to clean up orphaned files, from
> > undo-2019-05-10.tgz earlier in this thread. Are these the latest ones to
> > review?
>
> Yes, I am not sure of cleanup orphan file patch (Thomas can confirm
> the same), but others are latest.

I have a new patch set to post soon, handling all the feedback that
arrived in the past couple of weeks from 5 different reviewers (thanks
all!).

> > There are similar issues in CREATE/DROP DATABASE code. If you crash in
> > the middle of CREATE DATABASE, you can be left with orphaned files in
> > the data directory, or if you crash in the middle of DROP DATABASE, the
> > data might be gone already but the pg_database entry is still there. We
> > should plug those holes too.
> >
>
> +1. Interesting.

Huh.  Right.

> > Could we leave out the UNDO and discard worker processes for now?
> > Execute all UNDO actions immediately at rollback, and after crash
> > recovery. That would be fine for cleaning up orphaned files,
> >
>
> Even if we execute all the undo actions on rollback, we need discard
> worker to discard undo at regular intervals.  Also, what if we get an
> error while applying undo actions during rollback?  Right now, we have
> a mechanism to push such a request to background worker and allow the
> session to continue.  Instead, we might want to Panic in such cases if
> we don't want to have background undo workers.
>
> > and it
> > would cut down the size of the patch to review.
>
> If we can find some way to handle all cases and everyone agrees to it,
> that would be good. In fact, we can try to get the basic stuff
> committed first and then try to get the rest (undo-worker machinery)
> done.

I think it's definitely worth exploring.

> > Can this race condition happen: Transaction A creates a table and an
> > UNDO record to remember it. The transaction is rolled back, and the file
> > is removed. Another transaction, B, creates a different table, and
> > chooses the same relfilenode. It loads the table with data, and commits.
> > Then the system crashes. After crash recovery, the UNDO record for the
> > first transaction is applied, and it removes the file that belongs to
> > the second table, created by transaction B.
>
> I don't think such a race exists, but we should verify it once.
> Basically, once the rollback is complete, we mark the transaction
> rollback as complete in the transaction header in undo and write a WAL
> for it.  After crash-recovery, we will skip such a transaction.  Isn't
> that sufficient to prevent such a race condition?

The usual protection against relfilenode recycling applies: we don't
actually remove the files on disk until after the next checkpoint,
following the successful rollback.  That is, executing the
rollback doesn't actually remove any files immediately, so you can't
reuse the OID yet.

There might be some problems like that if we tried to handle the
CREATE DATABASE orphans you mentioned too naively though.  Not sure.

> Thank you for looking into this work.

+1

-- 
Thomas Munro
https://enterprisedb.com



