Re: archive status ".ready" files may be created too early

2021-08-31 Thread alvhe...@alvh.no-ip.org
On 2021-Sep-01, Michael Paquier wrote:

> That's about 515e3d8, right?

Yes.

> I have not looked in detail at what you have here, but this produces
> a compilation warning on Windows for me with this part of the patch:

This seems a tiny speck in a sea of bogosity.  If you want to silence
the warning, be my guest, but in the long run I am inclined to revert
the whole commit once I have a better picture of a way forward.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/




Re: archive status ".ready" files may be created too early

2021-08-31 Thread Michael Paquier
On Tue, Aug 31, 2021 at 08:52:05PM -0400, alvhe...@alvh.no-ip.org wrote:
> Yeah, that's becoming my conclusion too -- undo that, and start from
> scratch using the other idea.

That's about 515e3d8, right?  I have not looked in detail at what you
have here, but this produces a compilation warning on Windows for me
with this part of the patch:
+RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr endpos)
+{
+   XLogSegNo   segno PG_USED_FOR_ASSERTS_ONLY;

segno gets to be an unreferenced local variable.
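
For what it's worth, the warning shows up because PG_USED_FOR_ASSERTS_ONLY
expands to pg_attribute_unused(), which is a no-op under MSVC, so in a
non-assert build the variable really is unreferenced.  A minimal, hypothetical
way to silence it (not part of the posted patch) would be to discard the
variable explicitly:

static void
RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr endpos)
{
	XLogSegNo	segno PG_USED_FOR_ASSERTS_ONLY;

	/* explicitly reference the variable so MSVC does not warn in non-assert builds */
	(void) segno;

	/* ... rest of the function unchanged ... */
}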
--
Michael




Re: archive status ".ready" files may be created too early

2021-08-31 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-31, Andres Freund wrote:

> Maybe, but this is getting uglier and uglier.
> 
> I think the patch should be reverted. It's not in a state that's appropriate for
> the backbranches.

Yeah, that's becoming my conclusion too -- undo that, and start from
scratch using the other idea.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/




Re: archive status ".ready" files may be created too early

2021-08-31 Thread Andres Freund
On 2021-08-31 23:31:15 +, Bossart, Nathan wrote:
> On 8/31/21, 1:30 PM, "Andres Freund"  wrote:
> > On 2021-08-31 18:09:36 +, Bossart, Nathan wrote:
> >> What appears to happen in this case is that bgwriter eventually creates a
> >> xl_running_xacts record and nudges walwriter to flush it to disk, at which
> >> point the .ready file(s) will be created.  That's admittedly a bit fragile.
> >
> > That's not guaranteed to happen. If e.g. the partial record is a checkpoint or
> > a xl_running_xacts, we'll not trigger further WAL writes in the background,
> > unless autovacuum ends up doing something.
> 
> Right.  Per the attached patch, a simple way to handle that could be
> to teach XLogBackgroundFlush() to flush to the "earliest" segment
> boundary if it doesn't find anything else to do.  I think you could
> still miss creating a .ready file for the previous segment in single-
> user mode, though.

Maybe, but this is getting uglier and uglier.

I think the patch should be reverted. It's not in a state that's appropriate for
the backbranches.




Re: archive status ".ready" files may be created too early

2021-08-31 Thread Andres Freund
Hi,

On 2021-08-31 18:09:36 +, Bossart, Nathan wrote:
> On 8/31/21, 10:21 AM, "Andres Freund"  wrote:
> > What would trigger the flushing? We don't write out partially filled pages
> > unless
> > a) we're explicitly flushing an LSN on the partial page (e.g. because a
> >    synchronous commit record resides on it)
> > b) there's an async commit (i.e. commit with synchronous_commit=off) on the
> >    page
> 
> Ah, so your point is that an open transaction that has written a
> partial page on the next segment wouldn't trigger a flush.

Doesn't have to be a transaction, can be a checkpoint or xl_running_xacts, or
... as well.


> What appears to happen in this case is that bgwriter eventually creates a
> xl_running_xacts record and nudges walwriter to flush it to disk, at which
> point the .ready file(s) will be created.  That's admittedly a bit fragile.

That's not guaranteed to happen. If e.g. the partial record is a checkpoint or
a xl_running_xacts, we'll not trigger further WAL writes in the background,
unless autovacuum ends up doing something.

Regards,

Andres




Re: archive status ".ready" files may be created too early

2021-08-31 Thread Bossart, Nathan
On 8/31/21, 10:21 AM, "Andres Freund"  wrote:
> What would trigger the flushing? We don't write out partially filled pages
> unless
> a) we're explicitly flushing an LSN on the partial page (e.g. because a
>    synchronous commit record resides on it)
> b) there's an async commit (i.e. commit with synchronous_commit=off) on the
>    page

Ah, so your point is that an open transaction that has written a
partial page on the next segment wouldn't trigger a flush.  What
appears to happen in this case is that bgwriter eventually creates a
xl_running_xacts record and nudges walwriter to flush it to disk, at
which point the .ready file(s) will be created.  That's admittedly a
bit fragile.

Nathan



Re: archive status ".ready" files may be created too early

2021-08-31 Thread Andres Freund
Hi,

On 2021-08-31 17:01:31 +, Bossart, Nathan wrote:
> > If the record ending at s4 + 10 isn't an async commit (and thus
> > XLogCtl->asyncXactLSN is smaller), and there are no further records, we can
> > end up waiting effectively forever for s2 (and s3) to be archived. If all
> > other connections (and autovac etc) are idle, there will be no XLogFlush()
> > calls, nor will XLogBackgroundFlush() do anything, because it'll hit the "If
> > already known flushed" path, because the first page in s4 is only
> > partially filled.
> 
> I'm not following why s4 wouldn't be flushed in this example.  Even if
> the first page in s4 is only partially filled, that portion of the
> record should still get flushed, and we'll create the .ready files for
> s2 and s3 at that time.

What would trigger the flushing? We don't write out partially filled pages
unless
a) we're explicitly flushing an LSN on the partial page (e.g. because a
   synchronous commit record resides on it)
b) there's an async commit (i.e. commit with synchronous_commit=off) on the
   page
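
For anyone following along, the relevant part of XLogBackgroundFlush() looks
roughly like this (a simplified paraphrase, not the verbatim source), which is
why a record tail sitting on a partially filled page produces nothing to flush
unless an async commit has bumped asyncXactLSN:

	/* read the current write request (done under info_lck in the real code) */
	WriteRqst = XLogCtl->LogwrtRqst;

	/* back off to last completed page boundary */
	WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;

	/* if we have already flushed that far, consider async commit records */
	if (WriteRqst.Write <= LogwrtResult.Flush)
	{
		SpinLockAcquire(&XLogCtl->info_lck);
		WriteRqst.Write = XLogCtl->asyncXactLSN;
		SpinLockRelease(&XLogCtl->info_lck);
	}

	/* "If already known flushed", there is nothing to do -- and no .ready files */
	if (WriteRqst.Write <= LogwrtResult.Flush)
		return false;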

Greetings,

Andres Freund




Re: archive status ".ready" files may be created too early

2021-08-31 Thread Bossart, Nathan
On 8/31/21, 12:44 AM, "Andres Freund"  wrote:
> If there's now a flush request including all of s3, we'll have the following
> sequence of notifies:
>
> NotifySegmentsReadyForArchive(s1)
> nothing happens, smaller than s1+10
>
> NotifySegmentsReadyForArchive(s2)
> earliestSegBoundary = s4
> earliestSegBoundaryEndPtr = s4+10
> latestSegBoundary = s4
> latestSegBoundaryEndPtr = s4 + 10
> latest_boundary_seg = s1
>
> NotifySegmentsReadyForArchive(s3)
> nothing happens, flush is smaller than s4

When earliestSegBoundary is set to s4, latestSegBoundary will be set
to MaxXLogSegNo.

> If the record ending at s4 + 10 isn't an async commit (and thus
> XLogCtl->asyncXactLSN is smaller), and there are no further records, we can
> end up waiting effectively forever for s2 (and s3) to be archived. If all
> other connections (and autovac etc) are idle, there will be no XLogFlush()
> calls, nor will XLogBackgroundFlush() do anything, because it'll hit the "If
> already known flushed" path, because the first page in s4 is only
> partially filled.

I'm not following why s4 wouldn't be flushed in this example.  Even if
the first page in s4 is only partially filled, that portion of the
record should still get flushed, and we'll create the .ready files for
s2 and s3 at that time.  I tested this by adding some debug logging
and creating a small record that crossed segment boundaries but didn't
fill the first page on the next segment, and the .ready file was
created as expected.  Is there a case where we wouldn't flush the end
of the record to disk?

During my testing, I did find an obvious bug.  We probably shouldn't
be calling NotifySegmentsReadyForArchive() when archiving isn't
enabled.

diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 6a1e16edc2..8ca0d8e616 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,7 +253,8 @@ WalWriterMain(void)
 		 * here to handle a race condition where WAL is flushed to disk prior
 		 * to registering the segment boundary.
 		 */
-		NotifySegmentsReadyForArchive(GetFlushRecPtr());
+		if (XLogArchivingActive())
+			NotifySegmentsReadyForArchive(GetFlushRecPtr());
 
 		/*
 		 * Do what we're here for; then, if XLogBackgroundFlush() found useful

Nathan



Re: archive status ".ready" files may be created too early

2021-08-31 Thread Andres Freund
Hi

On 2021-08-31 06:45:06 +, Bossart, Nathan wrote:
> On 8/30/21, 7:39 PM, "Andres Freund"  wrote:
> > On 2021-08-30 22:39:04 +, Bossart, Nathan wrote:
> >> If we called NotifySegmentsReadyForArchive() before we updated the
> >> flush location in shared memory, we might skip nudging the WAL writer
> >> even though we should.
> >
> > That's trivial to address - just have a local variable saying whether we need
> > to call NotifySegmentsReadyForArchive().
>
> I think we can remove the race condition entirely by moving boundary
> registration to before WALInsertLockRelease().  I attached a patch for
> discussion.

I think it's a bad idea to move more code to before
WALInsertLockRelease. There's a very limited number of xlog insert slots, and
WAL flushes (e.g. commits) need to wait for insertions to finish.


> >> > When s1 is fully written out, NotifySegmentsReadyForArchive() will set
> >> > earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 -
> >> > ok. But when s2 is flushed, we'll afaict happily create .ready files
> >> > for s2, s3 despite s4 not yet being written, because earliestSegBoundary is
> >> > now s4.
> >>
> >> In this case, the .ready files for s2 and s3 wouldn't be created until
> >> s4 is flushed to disk.
> >
> > I don't think that's true as the code stands today? The
> > NotifySegmentsReadyForArchive() for s2 will update earliestSegBoundary to s4,
> > because latestSegBoundary = 4 and earliestSegBoundary = 1, triggering the
> > keep_latest branch. Any subsequent NotifySegmentsReadyForArchive() with a
> > segment < 4 will then be able to flush s2 and s3?
>
> When flushRecPtr is less than both of the segment boundaries,
> NotifySegmentsReadyForArchive() will return without doing anything.
> At least, that was the intent.  If there is some reason it's not
> actually working that way, I can work on fixing it.

But that's not OK either! Consider a scenario where there are small records, each
spanning just a bit into the next segment, and initially all the data is in
wal_buffers.

RegisterSegmentBoundary(s1, s1+10)
earliestSegBoundary = s1
earliestSegBoundaryEndPtr = s1+10

RegisterSegmentBoundary(s2, s2+10)
earliestSegBoundary = s1
earliestSegBoundaryEndPtr = s1+10
latestSegBoundary = s2
latestSegBoundaryEndPtr = s2 + 10

RegisterSegmentBoundary(s3, s3+10)
earliestSegBoundary = s1
earliestSegBoundaryEndPtr = s1+10
latestSegBoundary = s2
latestSegBoundaryEndPtr = s2 + 10

RegisterSegmentBoundary(s4, s4+10)
earliestSegBoundary = s1
earliestSegBoundaryEndPtr = s1+10
latestSegBoundary = s4
latestSegBoundaryEndPtr = s4 + 10

If there's now a flush request including all of s3, we'll have the following
sequence of notifies:

NotifySegmentsReadyForArchive(s1)
nothing happens, smaller than s1+10

NotifySegmentsReadyForArchive(s2)
earliestSegBoundary = s4
earliestSegBoundaryEndPtr = s4+10
latestSegBoundary = s4
latestSegBoundaryEndPtr = s4 + 10
latest_boundary_seg = s1

NotifySegmentsReadyForArchive(s3)
nothing happens, flush is smaller than s4

If the record ending at s4 + 10 isn't an async commit (and thus
XLogCtl->asyncXactLSN is smaller), and there are no further records, we can
end up waiting effectively forever for s2 (and s3) to be archived. If all
other connections (and autovac etc) are idle, there will be no XLogFlush()
calls, nor will XLogBackgroundFlush() do anything, because it'll hit the "If
already known flushed" path, because the first page in s4 is only
partially filled.

Am I missing something?
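
To spell out the transition described above: the keep_latest branch of
NotifySegmentsReadyForArchive() slides the latest boundary into the earliest
slot once the earliest one is flushed, so s2 and s3 end up gated on s4's end
pointer.  A condensed sketch of that update (hypothetical field handling,
not the committed code verbatim):

	SpinLockAcquire(&XLogCtl->segtrack_lck);
	if (XLogCtl->earliestSegBoundary != MaxXLogSegNo &&
		XLogCtl->earliestSegBoundaryEndPtr <= flushRecPtr)
	{
		/* everything up to the earliest boundary's segment may be notified */
		latest_boundary_seg = XLogCtl->earliestSegBoundary;

		/* keep_latest: slide the latest boundary down into the earliest slot */
		XLogCtl->earliestSegBoundary = XLogCtl->latestSegBoundary;
		XLogCtl->earliestSegBoundaryEndPtr = XLogCtl->latestSegBoundaryEndPtr;
		XLogCtl->latestSegBoundary = MaxXLogSegNo;
		XLogCtl->latestSegBoundaryEndPtr = InvalidXLogRecPtr;
	}
	SpinLockRelease(&XLogCtl->segtrack_lck);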


> > I don't think it's sensible to fix these separately. It'd be one thing to do
> > that for HEAD, but on the back branches? And that this patch hasn't gotten any
> > performance testing is scary.
>
> Are there any specific performance tests you'd like to see?  I don't
> mind running a couple.

- Parallel copy with > 8 processes

- Parallel non-transactional insertion of small-medium records
  Simulates inserting rows within a transaction
- Parallel transactional insertion of small-medium sized records, with fsync=on
  Plain oltp writes
- Parallel transactional insertion of small-medium sized records, with fsync=off
  fsync=off to simulate a fast server-class SSD (where fsync is
  instantaneous). Of course, if you have one of those, you can also use that.

For the oltp ones I've had good experience simulating workloads with
pg_logical_emit_message(). That just hits the WAL path, *drastically* reducing
the variance / shortening the required test duration.

Greetings,

Andres Freund




Re: archive status ".ready" files may be created too early

2021-08-30 Thread Andres Freund
Hi,

On 2021-08-30 22:39:04 +, Bossart, Nathan wrote:
> On 8/30/21, 2:06 PM, "Andres Freund"  wrote:
> > When were you thinking of doing the latch sets? Adding a latch set for every
> > XLogWrite() wouldn't be cheap either. Both because syscalls under a lock
> > aren't free and because waking up walsender even more often isn't free (we
> > already have a few threads about reducing the signalling frequency).
> >
> > There's also the question of what to do with single user mode. We shouldn't
> > just skip creating .ready files there...
> 
> Good point.
> 
> > Although, the more I think about it, the more I am confused about the trailing
> > if (XLogArchivingActive())
> > NotifySegmentsReadyForArchive(LogwrtResult.Flush);
> >
> > in XLogWrite(). Shouldn't that at the very least be inside the "If asked to
> > flush, do so" branch? Outside that and the finishing_seg branch
> > LogwrtResult.Flush won't have moved, right? So the call to
> > NotifySegmentsReadyForArchive() can't do anything, no?
> 
> The registration logic looks like this:
> 1. Register boundary
> 2. Get flush location from shared memory
> 3. If flush location >= our just-registered boundary, nudge
>the WAL writer to create .ready files if needed
> 
> If we called NotifySegmentsReadyForArchive() before we updated the
> flush location in shared memory, we might skip nudging the WAL writer
> even though we should.

That's trivial to address - just have a local variable saying whether we need
to call NotifySegmentsReadyForArchive().

Note that the finishing_seg path currently calls
NotifySegmentsReadyForArchive() before the shared memory flush location is
updated.


> > When s1 is fully written out, NotifySegmentsReadyForArchive() will set
> > earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 -
> > ok. But when s2 is flushed, we'll afaict happily create .ready files
> > for s2, s3 despite s4 not yet being written, because earliestSegBoundary is
> > now s4.
> 
> In this case, the .ready files for s2 and s3 wouldn't be created until
> s4 is flushed to disk.

I don't think that's true as the code stands today? The
NotifySegmentsReadyForArchive() for s2 will update earliestSegBoundary to s4,
because latestSegBoundary = 4 and earliestSegBoundary = 1, triggering the
keep_latest branch. Any subsequent NotifySegmentsReadyForArchive() with a
segment < 4 will then be able to flush s2 and s3?


> > The whole approach here of delaying .ready creation for these types of
> > segments seems wrong to me. Doesn't the exact same problem also exist for
> > streaming rep - which one can also use to maintain a PITR archive? walsender
> > sends up to the flush location, and pg_receivewal's FindStreamingStart() will
> > afaict just continue receiving from after that point.
> 
> The problem with streaming replication is being discussed in a new
> thread [0].

I don't think it's sensible to fix these separately. It'd be one thing to do
that for HEAD, but on the back branches? And that this patch hasn't gotten any
performance testing is scary.

Greetings,

Andres Freund




Re: archive status ".ready" files may be created too early

2021-08-30 Thread Bossart, Nathan
On 8/30/21, 3:40 PM, "Bossart, Nathan"  wrote:
> On 8/30/21, 2:06 PM, "Andres Freund"  wrote:
>> Although, the more I think about it, the more I am confused about the trailing
>> if (XLogArchivingActive())
>> NotifySegmentsReadyForArchive(LogwrtResult.Flush);
>>
>> in XLogWrite(). Shouldn't that at the very least be inside the "If asked to
>> flush, do so" branch? Outside that and the finishing_seg branch
>> LogwrtResult.Flush won't have moved, right? So the call to
>> NotifySegmentsReadyForArchive() can't do anything, no?
>
> The registration logic looks like this:
> 1. Register boundary
> 2. Get flush location from shared memory
> 3. If flush location >= our just-registered boundary, nudge
>the WAL writer to create .ready files if needed
>
> If we called NotifySegmentsReadyForArchive() before we updated the
> flush location in shared memory, we might skip nudging the WAL writer
> even though we should.

In the other thread [0], we're considering moving boundary
registration to before WALInsertLockRelease().  If I'm right that this
removes the race condition in question, we should be able to move the
call to NotifySegmentsReadyForArchive() at the end of XLogWrite() to
the if-asked-to-flush branch.

Nathan

[0] https://postgr.es/m/DE60B9AA-9670-47DA-9678-6C79BCD884E3%40amazon.com



Re: archive status ".ready" files may be created too early

2021-08-30 Thread Bossart, Nathan
On 8/30/21, 2:06 PM, "Andres Freund"  wrote:
> When were you thinking of doing the latch sets? Adding a latch set for every
> XLogWrite() wouldn't be cheap either. Both because syscalls under a lock
> aren't free and because waking up walsender even more often isn't free (we
> already have a few threads about reducing the signalling frequency).
>
> There's also the question of what to do with single user mode. We shouldn't
> just skip creating .ready files there...

Good point.

> Although, the more I think about it, the more I am confused about the trailing
> if (XLogArchivingActive())
> NotifySegmentsReadyForArchive(LogwrtResult.Flush);
>
> in XLogWrite(). Shouldn't that at the very least be inside the "If asked to
> flush, do so" branch? Outside that and the finishing_seg branch
> LogwrtResult.Flush won't have moved, right? So the call to
> NotifySegmentsReadyForArchive() can't do anything, no?

The registration logic looks like this:
1. Register boundary
2. Get flush location from shared memory
3. If flush location >= our just-registered boundary, nudge
   the WAL writer to create .ready files if needed

If we called NotifySegmentsReadyForArchive() before we updated the
flush location in shared memory, we might skip nudging the WAL writer
even though we should.
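
Concretely, the ordering in XLogInsertRecord() is roughly the following
(paraphrased from the committed change; the latch nudge is the part that
depends on the boundary already being registered):

	XLByteToSeg(StartPos, StartSeg, wal_segment_size);
	XLByteToSeg(EndPos, EndSeg, wal_segment_size);

	/* 1. register the boundary of a cross-segment record */
	if (StartSeg != EndSeg && XLogArchivingActive())
		RegisterSegmentBoundary(EndSeg, EndPos);

	/*
	 * 2-3. if the record turns out to be flushed already, nudge walwriter so
	 * the .ready files that were skipped at flush time still get created
	 */
	if (StartSeg != EndSeg && XLogArchivingActive() &&
		LogwrtResult.Flush >= EndPos && ProcGlobal->walwriterLatch)
		SetLatch(ProcGlobal->walwriterLatch);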

> Nor does it seem like we'd ever need to call NotifySegmentsReadyForArchive()
> if we started writing on the current page - flushRecPtr can't move across a
> segment boundary in that case.

I think there is a chance that we've crossed one of our recorded
segment boundaries anytime the flush pointer moves.

> When s1 is fully written out, NotifySegmentsReadyForArchive() will set
> earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 -
> ok. But when s2 is flushed, we'll afaict happily create .ready files
> for s2, s3 despite s4 not yet being written, because earliestSegBoundary is
> now s4.

In this case, the .ready files for s2 and s3 wouldn't be created until
s4 is flushed to disk.

> The whole approach here of delaying .ready creation for these types of
> segments seems wrong to me. Doesn't the exact same problem also exist for
> streaming rep - which one can also use to maintain a PITR archive? walsender
> sends up to the flush location, and pg_receivewal's FindStreamingStart() will
> afaict just continue receiving from after that point.

The problem with streaming replication is being discussed in a new
thread [0].

Nathan

[0] https://postgr.es/m/202108232252.dh7uxf6oxwcy%40alvherre.pgsql



Re: archive status ".ready" files may be created too early

2021-08-30 Thread Andres Freund
Hi,

On 2021-08-30 15:51:54 -0400, alvhe...@alvh.no-ip.org wrote:
> On 2021-Aug-28, Andres Freund wrote:
>
> > While rebasing the aio patchset on top of HEAD I noticed that this commit added
> > another atomic operation to XLogWrite() with archiving enabled. The WAL write
> > path is really quite hot, and at least some of the
> > NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.
> >
> > I think we should at least try to make the fast-path where no segment
> > boundaries were crossed use no atomic operations.
>
> I think the best way to achieve this is to rely completely on
> walwriter doing the segment notification, so that the WAL write done by
> backend would only do a latch set.

When were you thinking of doing the latch sets? Adding a latch set for every
XLogWrite() wouldn't be cheap either. Both because syscalls under a lock
aren't free and because waking up walsender even more often isn't free (we
already have a few threads about reducing the signalling frequency).

There's also the question of what to do with single user mode. We shouldn't
just skip creating .ready files there...


Although, the more I think about it, the more I am confused about the trailing
if (XLogArchivingActive())
NotifySegmentsReadyForArchive(LogwrtResult.Flush);

in XLogWrite(). Shouldn't that at the very least be inside the "If asked to
flush, do so" branch? Outside that and the finishing_seg branch
LogwrtResult.Flush won't have moved, right? So the call to
NotifySegmentsReadyForArchive() can't do anything, no?

Nor does it seem like we'd ever need to call NotifySegmentsReadyForArchive()
if we started writing on the current page - flushRecPtr can't move across a
segment boundary in that case.


I hadn't yet realized that this commit doesn't just make XLogWrite() more
expensive, it also makes XLogInsertRecord() more expensive :(. Adding two
divisions to XLogInsertRecord() isn't nice, especially as it happens
even if !XLogArchivingActive().


I can't really convince myself this deals correctly with multiple
segment-spanning records and with records spanning more than one segment? It'd be
easier to understand if the new XLogCtlData variables were documented...

If there's one record from segment s0 to s1 and one from s1 to s4, and
wal_buffers is big enough to contain them all, the first record will set
earliestSegBoundary = s1
the second
latestSegBoundary = s4.

When s1 is fully written out, NotifySegmentsReadyForArchive() will set
earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 -
ok. But when s2 is flushed, we'll afaict happily create .ready files
for s2, s3 despite s4 not yet being written, because earliestSegBoundary is
now s4.

I think there's other issues as well.


The more I look at this commit, the less I believe it's right.


The whole approach here of delaying .ready creation for these types of
segments seems wrong to me. Doesn't the exact same problem also exist for
streaming rep - which one can also use to maintain a PITR archive? walsender
sends up to the flush location, and pg_receivewal's FindStreamingStart() will
afaict just continue receiving from after that point.


Greetings,

Andres Freund




Re: archive status ".ready" files may be created too early

2021-08-30 Thread Bossart, Nathan
On 8/30/21, 12:52 PM, "alvhe...@alvh.no-ip.org"  wrote:
> On 2021-Aug-28, Andres Freund wrote:
>
>> While rebasing the aio patchset on top of HEAD I noticed that this commit added
>> another atomic operation to XLogWrite() with archiving enabled. The WAL write
>> path is really quite hot, and at least some of the
>> NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.
>>
>> I think we should at least try to make the fast-path where no segment
>> boundaries were crossed use no atomic operations.
>
> I think the best way to achieve this is to rely completely on
> walwriter doing the segment notification, so that the WAL write done by
> backend would only do a latch set.

+1.  If we do that, we may also want to move
NotifySegmentsReadyForArchive() to after the call to
XLogBackgroundFlush() in the WAL writer.

Nathan



Re: archive status ".ready" files may be created too early

2021-08-30 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-28, Andres Freund wrote:

> While rebasing the aio patchset on top of HEAD I noticed that this commit added
> another atomic operation to XLogWrite() with archiving enabled. The WAL write
> path is really quite hot, and at least some of the
> NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.
> 
> I think we should at least try to make the fast-path where no segment
> boundaries were crossed use no atomic operations.

I think the best way to achieve this is to rely completely on
walwriter doing the segment notification, so that the WAL write done by
backend would only do a latch set.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/




Re: archive status ".ready" files may be created too early

2021-08-28 Thread Andres Freund
Hi,

On 2021-08-23 15:55:03 -0400, alvhe...@alvh.no-ip.org wrote:
> On 2021-Aug-23, Bossart, Nathan wrote:
> 
> > Ah, okay.  BTW the other changes you mentioned made sense to me.
> 
> Thanks.  I've pushed this now to all live branches.

While rebasing the aio patchset ontop of HEAD I noticed that this commit added
another atomic operation to XLogWrite() with archiving enabled. The WAL write
path is really quite hot, and at least some of the
NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.

I think we should at least try to make the fast-path where no segment
boundaries were crossed use no atomic operations.

Greetings,

Andres Freund




Re: archive status ".ready" files may be created too early

2021-08-25 Thread Bossart, Nathan
On 8/25/21, 11:01 AM, "Fujii Masao"  wrote:
> If LogwrtResult.Flush >= EndPos, that means that another process has already
> flushed the record concurrently and updated XLogCtl->LogwrtResult.Flush.
> This situation also means that another process called
> NotifySegmentsReadyForArchive(LogwrtResult.Flush). Right?

If the segment boundary wasn't registered before the other process
called NotifySegmentsReadyForArchive(), then it couldn't have used the
boundary for deciding which .ready files to create.

> If this understanding is right, there seems to be no need to wake walwriter up here
> so that it can call NotifySegmentsReadyForArchive(LogwrtResult.Flush) again.
> Thought?

We're actually discussing this right now in another thread [0].  I
think we might be able to get rid of that part if we move the boundary
registration to before we release the WAL insert lock(s).

Nathan

[0] https://postgr.es/m/DE60B9AA-9670-47DA-9678-6C79BCD884E3%40amazon.com



Re: archive status ".ready" files may be created too early

2021-08-25 Thread Fujii Masao




On 2021/08/24 4:55, alvhe...@alvh.no-ip.org wrote:

On 2021-Aug-23, Bossart, Nathan wrote:


Ah, okay.  BTW the other changes you mentioned made sense to me.


Thanks.  I've pushed this now to all live branches.


Thanks a lot!

+	/*
+	 * There's a chance that the record was already flushed to disk and we
+	 * missed marking segments as ready for archive.  If this happens, we
+	 * nudge the WALWriter, which will take care of notifying segments as
+	 * needed.
+	 */
+	if (StartSeg != EndSeg && XLogArchivingActive() &&
+		LogwrtResult.Flush >= EndPos && ProcGlobal->walwriterLatch)
+		SetLatch(ProcGlobal->walwriterLatch);

Is this really necessary?

If LogwrtResult.Flush >= EndPos, that means that another process has already
flushed the record concurrently and updated XLogCtl->LogwrtResult.Flush.
This situation also means that another process called
NotifySegmentsReadyForArchive(LogwrtResult.Flush). Right?

If this understanding is right, there seems to be no need to wake walwriter up here
so that it can call NotifySegmentsReadyForArchive(LogwrtResult.Flush) again.
Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: archive status ".ready" files may be created too early

2021-08-23 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-23, Bossart, Nathan wrote:

> On 8/23/21, 12:55 PM, "alvhe...@alvh.no-ip.org"  wrote:
> > Thanks.  I've pushed this now to all live branches.
> 
> Thank you!  I appreciate the thorough reviews.  Should we make a new
> thread for the streaming replication fix?

Yeah, this one is long enough :-)

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: archive status ".ready" files may be created too early

2021-08-23 Thread Bossart, Nathan
On 8/23/21, 12:55 PM, "alvhe...@alvh.no-ip.org"  wrote:
> Thanks.  I've pushed this now to all live branches.

Thank you!  I appreciate the thorough reviews.  Should we make a new
thread for the streaming replication fix?

Nathan



Re: archive status ".ready" files may be created too early

2021-08-23 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-23, Bossart, Nathan wrote:

> Ah, okay.  BTW the other changes you mentioned made sense to me.

Thanks.  I've pushed this now to all live branches.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
 It does it in a really, really complicated way
 why does it need to be complicated?
 Because it's MakeMaker.




Re: archive status ".ready" files may be created too early

2021-08-23 Thread Bossart, Nathan
On 8/23/21, 10:31 AM, "alvhe...@alvh.no-ip.org"  wrote:
> On 2021-Aug-23, alvhe...@alvh.no-ip.org wrote:
>
>> The only way .ready files are created is that XLogNotifyWrite() is
>> called.  For regular WAL files during regular operation, that only
>> happens in XLogNotifyWriteSeg().  That, in turn, only happens in
>> NotifySegmentsReadyForArchive().  But if the system runs and never
>> writes WAL records that cross WAL boundaries, that function will see
>> that both earliestSegBoundary and latestSegBoundary are MaxXLogSegno,
>> and return without doing anything.  So no segments will be notified.
>
> Nevermind -- I realized that all segments get registered, not just those
> for which we generate continuation records.

Ah, okay.  BTW the other changes you mentioned made sense to me.

Nathan



Re: archive status ".ready" files may be created too early

2021-08-23 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-23, alvhe...@alvh.no-ip.org wrote:

> The only way .ready files are created is that XLogNotifyWrite() is
> called.  For regular WAL files during regular operation, that only
> happens in XLogNotifyWriteSeg().  That, in turn, only happens in
> NotifySegmentsReadyForArchive().  But if the system runs and never
> writes WAL records that cross WAL boundaries, that function will see
> that both earliestSegBoundary and latestSegBoundary are MaxXLogSegno,
> and return without doing anything.  So no segments will be notified.

Nevermind -- I realized that all segments get registered, not just those
for which we generate continuation records.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"




Re: archive status ".ready" files may be created too early

2021-08-23 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-23, Bossart, Nathan wrote:

> Sorry, I'm still not following this one.  If we skipped creating
> .ready files due to a crash, we rely on RemoveOldXlogFiles() to
> create them as needed in the end-of-recovery checkpoint.  If a record
> fits perfectly in the end of a segment, we'll still register it as a
> boundary for the next segment (hence why we use XLByteToSeg() instead
> of XLByteToPrevSeg()).  If database activity stops completely, there
> shouldn't be anything to mark ready.

The only way .ready files are created is that XLogNotifyWrite() is
called.  For regular WAL files during regular operation, that only
happens in XLogNotifyWriteSeg().  That, in turn, only happens in
NotifySegmentsReadyForArchive().  But if the system runs and never
writes WAL records that cross WAL boundaries, that function will see
that both earliestSegBoundary and latestSegBoundary are MaxXLogSegno,
and return without doing anything.  So no segments will be notified.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/




Re: archive status ".ready" files may be created too early

2021-08-23 Thread Bossart, Nathan
On 8/23/21, 9:33 AM, "alvhe...@alvh.no-ip.org"  wrote:
> On 2021-Aug-23, Bossart, Nathan wrote:
>
>> Hm.  My expectation would be that if there are no registrations, we
>> cannot create .ready files for the flushed segments.  The scenario
>> where I can see that happening is when a record gets flushed to disk
>> prior to registration.  In that case, we'll still eventually register
>> the record and wake up the WAL writer process, which will take care of
>> creating the .ready files that were missed earlier.  Is there another
>> case you are thinking of where we could miss registration for a cross-
>> segment record altogether?
>
> I'm thinking of the case where no record crosses segment boundaries ever.

Sorry, I'm still not following this one.  If we skipped creating
.ready files due to a crash, we rely on RemoveOldXlogFiles() to
create them as needed in the end-of-recovery checkpoint.  If a record
fits perfectly in the end of a segment, we'll still register it as a
boundary for the next segment (hence why we use XLByteToSeg() instead
of XLByteToPrevSeg()).  If database activity stops completely, there
shouldn't be anything to mark ready.
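
As a concrete illustration of the XLByteToSeg() point (not code from the
patch): with the default 16MB segments, a record whose end LSN falls exactly
on a segment boundary maps to the next segment, which is what makes the
just-filled segment notifiable as soon as the flushed pointer reaches that
boundary.

	XLogRecPtr	EndPos = (XLogRecPtr) 32 * 1024 * 1024;	/* exactly at the seg1/seg2 boundary */
	XLogSegNo	EndSeg;
	XLogSegNo	PrevSeg;

	XLByteToSeg(EndPos, EndSeg, wal_segment_size);		/* EndSeg == 2, the next segment */
	XLByteToPrevSeg(EndPos, PrevSeg, wal_segment_size);	/* PrevSeg == 1, the segment just filled */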

Nathan



Re: archive status ".ready" files may be created too early

2021-08-23 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-23, Bossart, Nathan wrote:

> Hm.  My expectation would be that if there are no registrations, we
> cannot create .ready files for the flushed segments.  The scenario
> where I can see that happening is when a record gets flushed to disk
> prior to registration.  In that case, we'll still eventually register
> the record and wake up the WAL writer process, which will take care of
> creating the .ready files that were missed earlier.  Is there another
> case you are thinking of where we could miss registration for a cross-
> segment record altogether?

I'm thinking of the case where no record crosses segment boundaries ever.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: archive status ".ready" files may be created too early

2021-08-23 Thread Bossart, Nathan
On 8/23/21, 8:50 AM, "alvhe...@alvh.no-ip.org"  wrote:
> ... while reading the resulting code after backpatching to all branches,
> I realized that if there are no registrations whatsoever, then archiving
> won't do anything, which surely is the wrong thing to do.  The correct
> behavior should be "if there are no registrations, then *all* flushed
> segments can be notified".

Hm.  My expectation would be that if there are no registrations, we
cannot create .ready files for the flushed segments.  The scenario
where I can see that happening is when a record gets flushed to disk
prior to registration.  In that case, we'll still eventually register
the record and wake up the WAL writer process, which will take care of
creating the .ready files that were missed earlier.  Is there another
case you are thinking of where we could miss registration for a cross-
segment record altogether?

Nathan



Re: archive status ".ready" files may be created too early

2021-08-23 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-21, Bossart, Nathan wrote:

> > Well, the thing I realized is that these three helper functions have
> > exactly one caller each.  I think the compiler is going to inline them,
> > so there isn't going to be a function call in the assembly.  I haven't
> > verified this, though.
> 
> Good point.  It looks like they're getting inlined for me.

I still didn't like it, because it looks like we're creating an API for
which there can be only one caller.  So I expanded the functions in the
caller.  It doesn't look too bad.  However ...

... while reading the resulting code after backpatching to all branches,
I realized that if there are no registrations whatsoever, then archiving
won't do anything, which surely is the wrong thing to do.  The correct
behavior should be "if there are no registrations, then *all* flushed
segments can be notified".

I'll fix that ...

Another thing I didn't like is that you used a name ending in RecPtr for
the LSN, which gives no indication that it really is the *end* LSN, not
the start pointer.  And it won't play nice with the need to add the
*start* LSN which we'll need to implement solving the equivalent problem
for streaming replication.  I'll rename those to
earliestSegBoundaryEndPtr and latestSegBoundaryEndPtr.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
Syntax error: function hell() needs an argument.
Please choose what hell you want to involve.
From 6e4c3fd6f5687ec5762de121344ce35c1c890812 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 23 Aug 2021 09:06:09 -0400
Subject: [PATCH v15] Avoid creating archive status ".ready" files too early

WAL records may span multiple segments, but XLogWrite() does not
wait for the entire record to be written out to disk before
creating archive status files.  Instead, as soon as the last WAL page of
the segment is written, the archive status file is created, and the
archiver may process it.  If PostgreSQL crashes before it is able to
write and flush the rest of the record (in the next WAL segment), the
wrong version of the first segment file lingers in the archive, which
causes operations such as point-in-time restores to fail.

To fix this, keep track of records that span across segments and ensure
that segments are only marked ready-for-archival once such records have
been completely written to disk.

This has always been wrong, so backpatch all the way back.

Author: Nathan Bossart 
Reviewed-by: Kyotaro Horiguchi 
Reviewed-by: Ryo Matsumura 
Reviewed-by: Andrey Borodin 
Discussion: https://postgr.es/m/cbddfa01-6e40-46bb-9f98-9340f4379...@amazon.com
---
 src/backend/access/transam/timeline.c|   2 +-
 src/backend/access/transam/xlog.c| 220 +--
 src/backend/access/transam/xlogarchive.c |  17 +-
 src/backend/postmaster/walwriter.c   |   7 +
 src/backend/replication/walreceiver.c|   6 +-
 src/include/access/xlog.h|   1 +
 src/include/access/xlogarchive.h |   4 +-
 src/include/access/xlogdefs.h|   1 +
 8 files changed, 234 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..acd5c2431d 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	if (XLogArchivingActive())
 	{
 		TLHistoryFileName(histfname, newTLI);
-		XLogArchiveNotify(histfname);
+		XLogArchiveNotify(histfname, true);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..8e7c3a364a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -724,6 +724,18 @@ typedef struct XLogCtlData
 	XLogRecPtr	lastFpwDisableRecPtr;
 
 	slock_t		info_lck;		/* locks shared variables shown above */
+
+	/*
+	 * Variables used to track cross-segment records.  Protected by
+	 * segtrack_lck.
+	 */
+	XLogSegNo	lastNotifiedSeg;
+	XLogSegNo	earliestSegBoundary;
+	XLogRecPtr	earliestSegBoundaryRecPtr;
+	XLogSegNo	latestSegBoundary;
+	XLogRecPtr	latestSegBoundaryRecPtr;
+
+	slock_t		segtrack_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
 static XLogCtlData *XLogCtl = NULL;
@@ -920,6 +932,7 @@ static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 		   XLogSegNo *endlogSegNo);
 static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
+static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr pos);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
@@ -1154,23 +1167,56 @@ XLogInsertRecord(XLogRecData *rdata,
 	END_CRIT_SECTION();
 
 	/*
-	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
+	 * If we crossed page boundary, update LogwrtRqst.Write; if we crossed
+	 * 

Re: archive status ".ready" files may be created too early

2021-08-20 Thread Bossart, Nathan
On 8/20/21, 4:52 PM, "alvhe...@alvh.no-ip.org"  wrote:
> On 2021-Aug-20, Bossart, Nathan wrote:
>
>> I was looking at moving the function calls out of the spinlock region.
>> I don't think the functions are doing anything too expensive, and they
>> help clean up NotifySegmentsReadyForArchive() quite a bit, but I
>> understand why it might be against project policy to do something like
>> that.  It would be easy enough to get rid of the helper functions if
>> that was a concern.
>
> Well, the thing I realized is that these three helper functions have
> exactly one caller each.  I think the compiler is going to inline them,
> so there isn't going to be a function call in the assembly.  I haven't
> verified this, though.

Good point.  It looks like they're getting inlined for me.

Nathan



Re: archive status ".ready" files may be created too early

2021-08-20 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-20, Bossart, Nathan wrote:

> I was looking at moving the function calls out of the spinlock region.
> I don't think the functions are doing anything too expensive, and they
> help clean up NotifySegmentsReadyForArchive() quite a bit, but I
> understand why it might be against project policy to do something like
> that.  It would be easy enough to get rid of the helper functions if
> that was a concern.

Well, the thing I realized is that these three helper functions have
exactly one caller each.  I think the compiler is going to inline them,
so there isn't going to be a function call in the assembly.  I haven't
verified this, though.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Los dioses no protegen a los insensatos.  Éstos reciben protección de
otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)




Re: archive status ".ready" files may be created too early

2021-08-20 Thread Bossart, Nathan
On 8/20/21, 4:00 PM, "alvhe...@alvh.no-ip.org"  wrote:
> Attached is v14 which uses a separate spinlock.

This looks good to me.

I was looking at moving the function calls out of the spinlock region.
I don't think the functions are doing anything too expensive, and they
help clean up NotifySegmentsReadyForArchive() quite a bit, but I
understand why it might be against project policy to do something like
that.  It would be easy enough to get rid of the helper functions if
that was a concern.

Nathan



Re: archive status ".ready" files may be created too early

2021-08-20 Thread alvhe...@alvh.no-ip.org
Attached is v14 which uses a separate spinlock.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"No me acuerdo, pero no es cierto.  No es cierto, y si fuera cierto,
 no me acuerdo." (Augusto Pinochet a una corte de justicia)
From 4180334f51a1f343ce5795d0b62f3aa298b472da Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 20 Aug 2021 19:25:31 +
Subject: [PATCH v14] Avoid creating archive status ".ready" files too early

WAL records may span multiple segments, but XLogWrite() does not
wait for the entire record to be written out to disk before
creating archive status files.  Instead, as soon as the last WAL page of
the segment is written, the archive status file is created, and the
archiver may process it.  If PostgreSQL crashes before it is able to
write and flush the rest of the record (in the next WAL segment), the
wrong version of the first segment file lingers in the archive, which
causes operations such as point-in-time restores to fail.

To fix this, keep track of records that span across segments and ensure
that segments are only marked ready-for-archival once such records have
been completely written to disk.

Author: Nathan Bossart 
Reviewed-by: Kyotaro Horiguchi 
Reviewed-by: Ryo Matsumura 
Reviewed-by: Andrey Borodin 
Discussion: https://postgr.es/m/cbddfa01-6e40-46bb-9f98-9340f4379...@amazon.com
---
 src/backend/access/transam/timeline.c|   2 +-
 src/backend/access/transam/xlog.c| 242 +--
 src/backend/access/transam/xlogarchive.c |  17 +-
 src/backend/postmaster/walwriter.c   |   7 +
 src/backend/replication/walreceiver.c|   6 +-
 src/include/access/xlog.h|   1 +
 src/include/access/xlogarchive.h |   4 +-
 src/include/access/xlogdefs.h|   1 +
 8 files changed, 256 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..acd5c2431d 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	if (XLogArchivingActive())
 	{
 		TLHistoryFileName(histfname, newTLI);
-		XLogArchiveNotify(histfname);
+		XLogArchiveNotify(histfname, true);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..95f03adef8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -724,6 +724,18 @@ typedef struct XLogCtlData
 	XLogRecPtr	lastFpwDisableRecPtr;
 
 	slock_t		info_lck;		/* locks shared variables shown above */
+
+	/*
+	 * Variables used to track cross-segment records.  Protected by
+	 * segtrack_lck.
+	 */
+	XLogSegNo	lastNotifiedSeg;
+	XLogSegNo	earliestSegBoundary;
+	XLogRecPtr	earliestSegBoundaryRecPtr;
+	XLogSegNo	latestSegBoundary;
+	XLogRecPtr	latestSegBoundaryRecPtr;
+
+	slock_t		segtrack_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
 static XLogCtlData *XLogCtl = NULL;
@@ -920,6 +932,9 @@ static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 		   XLogSegNo *endlogSegNo);
 static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
+static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr pos);
+static bool GetLatestSegmentBoundary(XLogRecPtr flushed, XLogSegNo *latest_boundary_seg);
+static void RemoveSegmentBoundariesUpTo(XLogSegNo seg);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
@@ -1154,23 +1169,56 @@ XLogInsertRecord(XLogRecData *rdata,
 	END_CRIT_SECTION();
 
 	/*
-	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
+	 * If we crossed page boundary, update LogwrtRqst.Write; if we crossed
+	 * segment boundary, register that and wake up walwriter.
 	 */
 	if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
 	{
+		XLogSegNo	StartSeg;
+		XLogSegNo	EndSeg;
+
+		XLByteToSeg(StartPos, StartSeg, wal_segment_size);
+		XLByteToSeg(EndPos, EndSeg, wal_segment_size);
+
+		/*
+		 * Register our crossing the segment boundary if that occurred.
+		 *
+		 * Note that we did not use XLByteToPrevSeg() for determining the
+		 * ending segment.  This is so that a record that fits perfectly into
+		 * the end of the segment is marked ready for archival as soon as the
+		 * flushed pointer jumps to the next segment.
+		 */
+		if (StartSeg != EndSeg && XLogArchivingActive())
+			RegisterSegmentBoundary(EndSeg, EndPos);
+
+		/*
+		 * Advance LogwrtRqst.Write so that it includes new block(s).
+		 *
+		 * We do this after registering the segment boundary so that the
+		 * comparison with the flushed pointer below can use the latest value
+		 * known globally.
+		 */
 		SpinLockAcquire(&XLogCtl->info_lck);
-		/* advance global request to include new block(s) */
 		if (XLogCtl->LogwrtRqst.Write 

Re: archive status ".ready" files may be created too early

2021-08-20 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-20, Bossart, Nathan wrote:

> > On Fri, Aug 20, 2021 at 1:29 PM Bossart, Nathan  wrote:

> >> This led me to revisit the two-element
> >> approach that was discussed upthread.  What if we only stored the
> >> earliest and latest segment boundaries at any given time?  Once the
> >> earliest boundary is added, it never changes until the segment is
> >> flushed and it is removed.  The latest boundary, however, will be
> >> updated any time we register another segment.  Once the earliest
> >> boundary is removed, we replace it with the latest boundary.  This
> >> strategy could cause us to miss intermediate boundaries, but AFAICT
> >> the worst case scenario is that we hold off creating .ready files a
> >> bit longer than necessary.

> I've attached a patch to demonstrate what I'm thinking.

There is only one thing I didn't like in this new version, which is that
we're holding info_lck too much.  I've seen info_lck contention be a
problem in some workloads and I'd rather not add more stuff to it.  I'd
rather we stick with using a new lock object to protect all the data we
need for this job.

Should this new lock object be a spinlock or an lwlock?  I think a
spinlock would generally be better because it's lower overhead and we
can't use it in shared mode anywhere, which would be the greatest
argument for an lwlock.  However, I think we avoid letting code run with
spinlocks held that's not straight-line code, and we have some function
calls there.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/




Re: archive status ".ready" files may be created too early

2021-08-20 Thread Robert Haas
On Fri, Aug 20, 2021 at 1:52 PM alvhe...@alvh.no-ip.org
 wrote:
> Actually, you were right.  Hash tables in shared memory can be expanded.
> There are some limitations (the hash "directory" is fixed size, which
> means the hash table get less efficient if it grows too much), but you
> can definitely create more hash entries than the initial size.  See for
> example element_alloc(), which covers the case of a hash table being
> IS_PARTITIONED -- something that only shmem hash tables can be.  Note
> that ShmemInitHash passes the HASH_ALLOC flag and uses ShmemAllocNoError
> as allocation function, which acquires memory from the shared segment.

I realize that the code supports this ... but I thought we had
established a policy that only the main lock manager's shared hash
tables, and not any others, are actually allowed to make use of this
functionality.  See commit 7c797e7194d969f974abf579cacf30ffdccdbb95.

It seems like a dangerous thing to rely on in any case, since we can't
predict how much extra shared memory might actually be available.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: archive status ".ready" files may be created too early

2021-08-20 Thread Robert Haas
On Fri, Aug 20, 2021 at 1:29 PM Bossart, Nathan  wrote:
> Thinking about this stuff further, I was wondering if one way to
> handle the bounded shared hash table problem would be to replace the
> latest boundary in the map whenever it was full.  But at that point,
> do we even need a hash table?  This led me to revisit the two-element
> approach that was discussed upthread.  What if we only stored the
> earliest and latest segment boundaries at any given time?  Once the
> earliest boundary is added, it never changes until the segment is
> flushed and it is removed.  The latest boundary, however, will be
> updated any time we register another segment.  Once the earliest
> boundary is removed, we replace it with the latest boundary.  This
> strategy could cause us to miss intermediate boundaries, but AFAICT
> the worst case scenario is that we hold off creating .ready files a
> bit longer than necessary.

I think this is a promising approach. We could also have a small
fixed-size array, so that we only have to risk losing track of
anything when we overflow the array. But I guess I'm still unconvinced
that there's a real possibility of genuinely needing multiple
elements. Suppose we are thinking of adding a second element to the
array (or the hash table). I feel like it's got to be safe to just
remove the first one. If not, then apparently the WAL record that
caused us to make the first entry isn't totally flushed yet - which I
still think is impossible.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: archive status ".ready" files may be created too early

2021-08-20 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-20, Bossart, Nathan wrote:

> On 8/20/21, 8:29 AM, "Robert Haas"  wrote:

> > We can't expand the hash table either. It has an initial and maximum
> > size of 16 elements, which means it's basically an expensive array,
> > and which also means that it imposes a new limit of 16 *
> > wal_segment_size on the size of WAL records. If you exceed that limit,
> > I think things just go boom... which I think is not acceptable. I
> > think we can have records in the multi-GB range of wal_level=logical
> > and someone chooses a stupid replica identity setting.
> 
> I was under the impression that shared hash tables could be expanded
> as necessary, but from your note and the following comment, that does
> not seem to be true:

Actually, you were right.  Hash tables in shared memory can be expanded.
There are some limitations (the hash "directory" is fixed size, which
means the hash table gets less efficient if it grows too much), but you
can definitely create more hash entries than the initial size.  See for
example element_alloc(), which covers the case of a hash table being
IS_PARTITIONED -- something that only shmem hash tables can be.  Note
that ShmemInitHash passes the HASH_ALLOC flag and uses ShmemAllocNoError
as allocation function, which acquires memory from the shared segment.

This is a minor thing -- it doesn't affect the fact that the hash table
is possibly being misused and inefficient -- but I thought it was worth
pointing out.


As an example, consider the LOCK / PROCLOCK hash tables.  These can
contain more elements than max_backends * max_locks_per_transaction.
Those elements consume shared memory from the "allocation slop" in the
shared memory segment.  It's tough when it happens (as far as I know the
memory is never "returned" once such a hash table grows to use that
space), but it does work.
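
For reference, that's the ShmemInitHash() pattern: init_size entries are
allocated up front, max_size only sizes the hash directory, and further
entries come out of shared-memory slop via ShmemAllocNoError().  A schematic
example (hypothetical table and name, not from this patch):

	HASHCTL		info;
	HTAB	   *boundary_table;		/* hypothetical */

	info.keysize = sizeof(XLogSegNo);
	info.entrysize = sizeof(XLogSegNo);

	boundary_table = ShmemInitHash("segment boundary table",	/* hypothetical name */
								   16,		/* init_size: pre-allocated entries */
								   16,		/* max_size: a sizing hint, not a hard cap */
								   &info,
								   HASH_ELEM | HASH_BLOBS);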

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/




Re: archive status ".ready" files may be created too early

2021-08-20 Thread Bossart, Nathan
On 8/20/21, 10:08 AM, "Robert Haas"  wrote:
> On Fri, Aug 20, 2021 at 12:36 PM Bossart, Nathan  wrote:
>> If a record spans multiple segments, we only register one segment
>> boundary.  For example, if I insert a record that starts at segment
>> number 1 and stops at 10, I'll insert one segment boundary for segment
>> 10.  We'll only create .ready files for segments 1 through 9 once this
>> record is completely flushed to disk.
>
> Oh ... OK. So is there any experimental scenario in which the hash
> table ends up with more than 1 entry? And if so, how does that happen?

I was able to do this by turning synchronous_commit off, increasing
wal_buffers substantially, and adding sleeps to XLogWrite().

>> If there isn't a way to ensure that the number of entries we need to
>> store is bounded, I'm tempted to propose my original patch [0], which
>> just moves .ready file creation to the very end of XLogWrite().  It's
>> probably not a complete solution, but it might be better than what's
>> there today.
>
> Doesn't that allocate memory inside a critical section? I would have
> thought it would cause an immediate assertion failure.

I could probably replace the list with two local variables (start and
end segments).

Thinking about this stuff further, I was wondering if one way to
handle the bounded shared hash table problem would be to replace the
latest boundary in the map whenever it was full.  But at that point,
do we even need a hash table?  This led me to revisit the two-element
approach that was discussed upthread.  What if we only stored the
earliest and latest segment boundaries at any given time?  Once the
earliest boundary is added, it never changes until the segment is
flushed and it is removed.  The latest boundary, however, will be
updated any time we register another segment.  Once the earliest
boundary is removed, we replace it with the latest boundary.  This
strategy could cause us to miss intermediate boundaries, but AFAICT
the worst case scenario is that we hold off creating .ready files a
bit longer than necessary.

I'll work on a patch to illustrate what I'm thinking.
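
A minimal sketch of that two-element registration (hypothetical field names;
the v14 patch attached elsewhere in this thread is the real thing) might look
like:

static void
RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr endpos)
{
	SpinLockAcquire(&XLogCtl->segtrack_lck);

	if (XLogCtl->earliestSegBoundary == MaxXLogSegNo)
	{
		/* earliest slot free: this is the boundary we will wait on */
		XLogCtl->earliestSegBoundary = seg;
		XLogCtl->earliestSegBoundaryEndPtr = endpos;
	}
	else if (seg > XLogCtl->earliestSegBoundary)
	{
		/* otherwise just remember the most recent boundary seen */
		XLogCtl->latestSegBoundary = seg;
		XLogCtl->latestSegBoundaryEndPtr = endpos;
	}

	SpinLockRelease(&XLogCtl->segtrack_lck);
}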

Nathan



Re: archive status ".ready" files may be created too early

2021-08-20 Thread Robert Haas
On Fri, Aug 20, 2021 at 12:36 PM Bossart, Nathan  wrote:
> If a record spans multiple segments, we only register one segment
> boundary.  For example, if I insert a record that starts at segment
> number 1 and stops at 10, I'll insert one segment boundary for segment
> 10.  We'll only create .ready files for segments 1 through 9 once this
> record is completely flushed to disk.

Oh ... OK. So is there any experimental scenario in which the hash
table ends up with more than 1 entry? And if so, how does that happen?

> > It's actually not clear to me why we need to track multiple entries
> > anyway. The scenario postulated by Horiguchi-san in
> > https://www.postgresql.org/message-id/20201014.090628.839639906081252194.horikyota@gmail.com
> > seems to require that the write position be multiple segments ahead of
> > the flush position, but that seems impossible with the present code,
> > because XLogWrite() calls issue_xlog_fsync() at once if the segment is
> > filled. So I think, at least with the present code, any record that
> > isn't completely flushed to disk has to be at least partially in the
> > current segment. And there can be only one record that starts in some
> > earlier segment and ends in this one.
>
> We register the boundaries in XLogInsertRecord(), which AFAICT just bumps
> the global write request pointer ahead, so I'm not sure we can make
> any assumptions about what is written/flushed at that time.  (I see
> that we do end up calling XLogFlush() for XLOG_SWITCH records in
> XLogInsertRecord(), but I don't see any other cases where we actually
> write anything in this function.)  Am I missing something?

Well, I'm not sure. But I *think* that the code as it exists today is
smart enough not to try to archive a segment that hasn't been
completely flushed, and the gap is only that even though the segment
might be completely flushed, some portion of the record that is part
of a later segment might not be flushed, and thus after a crash we
might overwrite the already-flushed contents. The patch can make an
implementation choice to do some work at XLogInsertRecord() time if it
likes, but there's no real hazard at that point. The hazard only
exists, or so I think, once a segment that contains part of the record
is fully on disk. But that means, if my previous logic is correct,
that the hazard can only exist for at most 1 record at any point in
time.

> If there isn't a way to ensure that the number of entries we need to
> store is bounded, I'm tempted to propose my original patch [0], which
> just moves .ready file creation to the very end of XLogWrite().  It's
> probably not a complete solution, but it might be better than what's
> there today.

Doesn't that allocate memory inside a critical section? I would have
thought it would cause an immediate assertion failure.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: archive status ".ready" files may be created too early

2021-08-20 Thread Bossart, Nathan
On 8/20/21, 8:29 AM, "Robert Haas"  wrote:
> On Fri, Aug 20, 2021 at 10:50 AM alvhe...@alvh.no-ip.org
>  wrote:
>> 1. We use a hash table in shared memory.  That's great.  The part that's
>>not so great is that in both places where we read items from it, we
>>have to iterate in some way.  This seems a bit silly.  An array would
>>serve us better, if only we could expand it as needed.  However, in
>>shared memory we can't do that.  (I think the list of elements we
>>need to memoize is arbitrarily long, if enough processes can be writing
>>WAL at the same time.)
>
> We can't expand the hash table either. It has an initial and maximum
> size of 16 elements, which means it's basically an expensive array,
> and which also means that it imposes a new limit of 16 *
> wal_segment_size on the size of WAL records. If you exceed that limit,
> I think things just go boom... which I think is not acceptable. I
> think we can have records in the multi-GB range if wal_level=logical
> and someone chooses a stupid replica identity setting.

If a record spans multiple segments, we only register one segment
boundary.  For example, if I insert a record that starts at segment
number 1 and stops at 10, I'll insert one segment boundary for segment
10.  We'll only create .ready files for segments 1 through 9 once this
record is completely flushed to disk.

I was under the impression that shared hash tables could be expanded
as necessary, but from your note and the following comment, that does
not seem to be true:

* Note: for a shared-memory hashtable, nelem needs to be a pretty good
* estimate, since we can't expand the table on the fly.  But an unshared
* hashtable can be expanded on-the-fly, so it's better for nelem to be
* on the small side and let the table grow if it's exceeded.  An overly
* large nelem will penalize hash_seq_search speed without buying much.

> It's actually not clear to me why we need to track multiple entries
> anyway. The scenario postulated by Horiguchi-san in
> https://www.postgresql.org/message-id/20201014.090628.839639906081252194.horikyota@gmail.com
> seems to require that the write position be multiple segments ahead of
> the flush position, but that seems impossible with the present code,
> because XLogWrite() calls issue_xlog_fsync() at once if the segment is
> filled. So I think, at least with the present code, any record that
> isn't completely flushed to disk has to be at least partially in the
> current segment. And there can be only one record that starts in some
> earlier segment and ends in this one.

We register the boundaries in XLogInsertRecord(), which AFAICT just bumps
the global write request pointer ahead, so I'm not sure we can make
any assumptions about what is written/flushed at that time.  (I see
that we do end up calling XLogFlush() for XLOG_SWITCH records in
XLogInsertRecord(), but I don't see any other cases where we actually
write anything in this function.)  Am I missing something?

> I will be the first to admit that the forced end-of-segment syncs
> suck. They often stall every backend in the entire system at the same
> time. Everyone fills up the xlog segment really fast and then stalls
> HARD while waiting for that sync to happen. So it's arguably better
> not to do more things that depend on that being how it works, but I
> think needing a variable-size amount of shared memory is even worse.
> If we're going to track multiple entries here we need some rule that
> bounds how many of them we can need to track. If the number of entries
> is defined by the number of segment boundaries that a particular
> record crosses, it's effectively unbounded, because right now WAL
> records can be pretty much arbitrarily big.

If there isn't a way to ensure that the number of entries we need to
store is bounded, I'm tempted to propose my original patch [0], which
just moves .ready file creation to the very end of XLogWrite().  It's
probably not a complete solution, but it might be better than what's
there today.

Nathan

[0] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com



Re: archive status ".ready" files may be created too early

2021-08-20 Thread Robert Haas
On Fri, Aug 20, 2021 at 10:50 AM alvhe...@alvh.no-ip.org
 wrote:
> 1. We use a hash table in shared memory.  That's great.  The part that's
>not so great is that in both places where we read items from it, we
>have to iterate in some way.  This seems a bit silly.  An array would
>serve us better, if only we could expand it as needed.  However, in
>shared memory we can't do that.  (I think the list of elements we
>need to memoize is arbitrarily long, if enough processes can be writing
>WAL at the same time.)

We can't expand the hash table either. It has an initial and maximum
size of 16 elements, which means it's basically an expensive array,
and which also means that it imposes a new limit of 16 *
wal_segment_size on the size of WAL records. If you exceed that limit,
I think things just go boom... which I think is not acceptable. I
think we can have records in the multi-GB range if wal_level=logical
and someone chooses a stupid replica identity setting.

It's actually not clear to me why we need to track multiple entries
anyway. The scenario postulated by Horiguchi-san in
https://www.postgresql.org/message-id/20201014.090628.839639906081252194.horikyota@gmail.com
seems to require that the write position be multiple segments ahead of
the flush position, but that seems impossible with the present code,
because XLogWrite() calls issue_xlog_fsync() at once if the segment is
filled. So I think, at least with the present code, any record that
isn't completely flushed to disk has to be at least partially in the
current segment. And there can be only one record that starts in some
earlier segment and ends in this one.

I will be the first to admit that the forced end-of-segment syncs
suck. They often stall every backend in the entire system at the same
time. Everyone fills up the xlog segment really fast and then stalls
HARD while waiting for that sync to happen. So it's arguably better
not to do more things that depend on that being how it works, but I
think needing a variable-size amount of shared memory is even worse.
If we're going to track multiple entries here we need some rule that
bounds how many of them we can need to track. If the number of entries
is defined by the number of segment boundaries that a particular
record crosses, it's effectively unbounded, because right now WAL
records can be pretty much arbitrarily big.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: archive status ".ready" files may be created too early

2021-08-20 Thread alvhe...@alvh.no-ip.org
Two things.

1. We use a hash table in shared memory.  That's great.  The part that's
   not so great is that in both places where we read items from it, we
   have to iterate in some way.  This seems a bit silly.  An array would
   serve us better, if only we could expand it as needed.  However, in
   shared memory we can't do that.  (I think the list of elements we
   need to memoize is arbitrarily long, if enough processes can be writing
   WAL at the same time.)

   Now that I think about this again, maybe it's limited by
   NUM_XLOGINSERT_LOCKS, since there can only be that many records being
   written down at a time ...

2. There is a new LWLock acquisition that may be a new contention point.
   We acquire the lock in these cases:
   - WAL record insert, when a segment boundary is crossed (rare
 enough).
   - XLogWrite, when a segment needs to be notified.

   Looking again, I think the last point might be a problem actually,
   because XLogWrite is called with WALWriteLock held.  Maybe we should
   take the NotifySegmentsReadyForArchive() call outside the section
   locked by WALWriteLock (so put it in XLogWrite callers instead of
   XLogWrite itself).
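
   Something along these lines, purely as an illustration (the real
   callers also recheck the request against LogwrtResult first):

LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
XLogWrite(WriteRqst, false);
LWLockRelease(WALWriteLock);

/* WALWriteLock is no longer held, so notifying cannot stall WAL writes */
if (XLogArchivingActive())
    NotifySegmentsReadyForArchive(LogwrtResult.Flush);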

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Cómo ponemos nuestros dedos en la arcilla del otro. Eso es la amistad; jugar
al alfarero y ver qué formas se pueden sacar del otro" (C. Halloway en
La Feria de las Tinieblas, R. Bradbury)




Re: archive status ".ready" files may be created too early

2021-08-20 Thread alvhe...@alvh.no-ip.org
In v12 I moved the code around a bit and reworded some comments.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
>From 7d1475578431e265a5e7f8b94a6b0791b68a9a31 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 17 Aug 2021 03:52:43 +
Subject: [PATCH v12] Avoid creating archive status ".ready" files too early

WAL records may span multiple segments, but XLogWrite() does not
wait for the entire record to be written out to disk before
creating archive status files.  Instead, as soon as the last WAL page of
the segment is written, the archive status file is created, and the
archiver may process it.  If PostgreSQL crashes before it is able to
write and flush the rest of the record (in the next WAL segment), the
wrong version of the first segment file lingers in the archive, which
causes operations such as point-in-time restores to fail.

To fix this, keep track of records that span across segments and ensure
that segments are only marked ready-for-archival once such records have
been completely written to disk.

XXX Note that this may add a new LWLock acquisition in the transaction
commit path.  (It doesn't happen in all cases.)

Author: Nathan Bossart 
Reviewed-by: Kyotaro Horiguchi 
Reviewed-by: Ryo Matsumura 
Reviewed-by: Andrey Borodin 
Discussion: https://postgr.es/m/cbddfa01-6e40-46bb-9f98-9340f4379...@amazon.com
---
 src/backend/access/transam/timeline.c|   2 +-
 src/backend/access/transam/xlog.c| 307 ++-
 src/backend/access/transam/xlogarchive.c |  17 +-
 src/backend/postmaster/walwriter.c   |   7 +
 src/backend/replication/walreceiver.c|   6 +-
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/include/access/xlog.h|   1 +
 src/include/access/xlogarchive.h |   4 +-
 src/include/access/xlogdefs.h|   1 +
 src/tools/pgindent/typedefs.list |   1 +
 10 files changed, 323 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..acd5c2431d 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	if (XLogArchivingActive())
 	{
 		TLHistoryFileName(histfname, newTLI);
-		XLogArchiveNotify(histfname);
+		XLogArchiveNotify(histfname, true);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..01da4ab7b8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -512,6 +512,16 @@ typedef enum ExclusiveBackupState
  */
 static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
 
+/*
+ * Entries for SegmentBoundaryMap.  Each such entry represents a WAL
+ * record that ends in endpos and crosses the WAL segment boundary.
+ */
+typedef struct SegmentBoundaryEntry
+{
+	XLogSegNo	seg;			/* must be first */
+	XLogRecPtr	endpos;
+} SegmentBoundaryEntry;
+
 /*
  * Shared state data for WAL insertion.
  */
@@ -723,6 +733,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * The last segment we've marked ready for archival.  Protected by
+	 * info_lck.
+	 */
+	XLogSegNo	lastNotifiedSeg;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -736,6 +752,23 @@ static WALInsertLockPadded *WALInsertLocks = NULL;
  */
 static ControlFileData *ControlFile = NULL;
 
+/*
+ * Segment boundary map.
+ *
+ * This hash table tracks WAL records that cross WAL segment boundaries.
+ * For WAL archiving it is critical not to mark a segment as archivable
+ * until both halves of the record are flushed: otherwise if we crash
+ * after archiving the segment containing the first half and before
+ * flushing the second half of the record, then after recovery the
+ * primary would overwrite the first half with new data, making the
+ * segment archived prior to the crash a bogus copy.
+ *
+ * XXX note that an equivalent problem exists with streaming replication.
+ *
+ * Protected by SegmentBoundaryLock.
+ */
+static HTAB *SegmentBoundaryMap = NULL;
+
 /*
  * Calculate the amount of space left on the page after 'endptr'. Beware
  * multiple evaluation!
@@ -920,6 +953,13 @@ static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 		   XLogSegNo *endlogSegNo);
 static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
+static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr pos);
+static XLogSegNo GetLastNotifiedSegment(void);
+static void SetLastNotifiedSegment(XLogSegNo seg);
+static bool GetLatestSegmentBoundary(XLogSegNo last_notified,
+	 XLogRecPtr flushed,
+	 XLogSegNo *latest_boundary_seg);
+static void RemoveSegmentBoundariesUpTo(XLogSegNo seg);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord 

Re: archive status ".ready" files may be created too early

2021-08-18 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-18, Bossart, Nathan wrote:

> As long as XLogBackgroundFlush() found work to do, it would take care
> of notifying, but I don't think we can depend on that.  However, since
> we're primarily using the WAL writer to take care of the case when the
> record has already been flushed, notifying beforehand seems fine to
> me.  If XLogBackgroundFlush() does end up calling XLogWrite(), it'll
> call it again, anyway.

Agreed.

> In the attached patch, I modified XLogInsertRecord() to simply set the
> latch if we detect that flushRecPtr has advanced.

Right, that's what I was thinking.  I modified that slightly to use
LogwrtResult.Flush (which should be fresh enough) instead of calling
GetFlushRecPtr again, which saves a bit.  I also changed it to > instead
of >=, because if I understand you correctly we only care to notify if
the flush pointer advanced, not in the case it stayed the same.
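
So the nudge in XLogInsertRecord ends up looking roughly like this (a
sketch only; I'm assuming the usual ProcGlobal->walwriterLatch pointer for
waking the walwriter):

if (StartSeg != EndSeg && XLogArchivingActive())
{
    RegisterSegmentBoundary(EndSeg, EndPos);

    /*
     * If the flush pointer already moved past the boundary we just
     * registered, nobody else may come along to notice it, so nudge the
     * walwriter to run NotifySegmentsReadyForArchive().
     */
    if (LogwrtResult.Flush > EndPos && ProcGlobal->walwriterLatch)
        SetLatch(ProcGlobal->walwriterLatch);
}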

I made a few other cosmetic tweaks -- added comment to
SegmentBoundaryEntry and renamed the 'pos' to 'endpos'; renamed argument
'notify' of XLogArchiveNotify to 'nudge' (because having two different
"notify" concepts in that function seemed confusing).

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"It takes less than 2 seconds to get to 78% complete; that's a good sign.
A few seconds later it's at 90%, but it seems to have stuck there.  Did
somebody make percentages logarithmic while I wasn't looking?"
http://smylers.hates-software.com/2005/09/08/1995c749.html
>From 39190a4e33a4cb35402ec9451ce6678f6e9aa494 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 17 Aug 2021 03:52:43 +
Subject: [PATCH v10] Avoid creating archive status ".ready" files too early.

WAL records may span multiple segments, but XLogWrite() does not
wait for the entire record to be written out to disk before
creating archive status files.  Instead, as soon as the last WAL
page of the segment is written, the archive status file will be
created.  If PostgreSQL crashes before it is able to write the rest
of the record, it will end up reusing segments that have already
been marked as ready-for-archival.  However, the archiver process
may have already processed the old version of the segment, so the
wrong version of the segment may be backed-up.  This backed-up
segment will cause operations such as point-in-time restores to
fail.

To fix this, we keep track of records that span across segments and
ensure that segments are only marked ready-for-archival once such
records have been completely written to disk.
---
 src/backend/access/transam/timeline.c|   2 +-
 src/backend/access/transam/xlog.c| 291 ++-
 src/backend/access/transam/xlogarchive.c |  17 +-
 src/backend/postmaster/walwriter.c   |   7 +
 src/backend/replication/walreceiver.c|   6 +-
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/include/access/xlog.h|   1 +
 src/include/access/xlogarchive.h |   4 +-
 src/include/access/xlogdefs.h|   1 +
 src/tools/pgindent/typedefs.list |   1 +
 10 files changed, 309 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..acd5c2431d 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	if (XLogArchivingActive())
 	{
 		TLHistoryFileName(histfname, newTLI);
-		XLogArchiveNotify(histfname);
+		XLogArchiveNotify(histfname, true);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..d2ccf2a7bb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -512,6 +512,16 @@ typedef enum ExclusiveBackupState
  */
 static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
 
+/*
+ * Entries for SegmentBoundaryMap.  Each such entry represents a WAL
+ * record that ends in endpos and crosses a WAL segment boundary.
+ */
+typedef struct SegmentBoundaryEntry
+{
+	XLogSegNo	seg;			/* must be first */
+	XLogRecPtr	endpos;
+} SegmentBoundaryEntry;
+
 /*
  * Shared state data for WAL insertion.
  */
@@ -723,6 +733,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * The last segment we've marked ready for archival.  Protected by
+	 * info_lck.
+	 */
+	XLogSegNo	lastNotifiedSeg;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -736,6 +752,12 @@ static WALInsertLockPadded *WALInsertLocks = NULL;
  */
 static ControlFileData *ControlFile = NULL;
 
+/*
+ * Segment boundary map, used for marking segments as ready for archival.
+ * Protected by SegmentBoundaryLock.
+ */
+static HTAB *SegmentBoundaryMap = NULL;
+
 /*
  * Calculate the amount of space left on the page after 'endptr'. Beware
  * multiple evaluation!
@@ -962,6 +984,13 @@ static XLogRecPtr 

Re: archive status ".ready" files may be created too early

2021-08-18 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-18, Bossart, Nathan wrote:

> I'll add it after XLogBackgroundFlush().

I was wondering which would be better: before or after.
XLogBackgroundFlush would do it anyway, so if you do it after then it's
not clear to me that it'd do anything (I mean we should not do any new
calls of NotifySegmentsReadyForArchive and just rely on the one in
XLogBackgroundFlush -> XLogWrite).

The advantage of doing NotifySegmentsReadyForArchive before
XLogBackgroundFlush is that the files would be created sooner, so the
archiver can be working in parallel while walwriter does its other
thing; then we'd reach the NotifySegmentsReadyForArchive in
XLogBackgroundFlush and it'd find nothing to do most of the time, which
is just fine.

> I think we'll also want to set the WAL writer's latch in case it is
> hibernating.

Yeah.  (That's another advantage of doing it in walwriter rather than
bgwriter: we don't publish bgwriter's latch anywhere AFAICS).

> Another approach could be to keep the NotifySegmentsReadyForArchive()
> call in XLogInsertRecord(), but only call it if the flush pointer is
> beyond the boundary we just registered.  Or we could only set the
> latch in XLogInsertRecord() if we detect that the flush pointer has
> advanced.

Hmm.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/




Re: archive status ".ready" files may be created too early

2021-08-18 Thread Bossart, Nathan
On 8/18/21, 10:48 AM, "alvhe...@alvh.no-ip.org"  wrote:
> On 2021-Aug-18, alvhe...@alvh.no-ip.org wrote:
>
>> I realize this means there's a contradiction with my previous argument,
>> in that synchronous transaction commit calls XLogWrite at some point, so
>> we *are* putting the client-connected backend in charge of creating the
>> notify files.  However, that only happens on transaction commit, where
>> we already accept responsibility for the WAL flush, not on each
>> individual XLOG record insert; also, the WAL writer will take care of it
>> sometimes, for transactions that are long-enough lived.
>
> Eh.  I just said WAL writer will sometimes do it, and that's true
> because it'll occur in XLogBackgroundFlush.  But upthread I wimped out
> of having WAL writer call NotifySegmentsReadyForArchive() and instead
> opined to give responsibility to bgwriter.  However, thinking about it
> again, maybe it does make sense to have walwriter do it too directly.
> This causes no harm to walwriter's time constraints, since *it will have
> to do it via XLogBackgroundFlush anyway*.

I'll add it after XLogBackgroundFlush().  I think we'll also want to
set the WAL writer's latch in case it is hibernating.

Another approach could be to keep the NotifySegmentsReadyForArchive()
call in XLogInsertRecord(), but only call it if the flush pointer is
beyond the boundary we just registered.  Or we could only set the
latch in XLogInsertRecord() if we detect that the flush pointer has
advanced.

Nathan



Re: archive status ".ready" files may be created too early

2021-08-18 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-18, alvhe...@alvh.no-ip.org wrote:

> I realize this means there's a contradiction with my previous argument,
> in that synchronous transaction commit calls XLogWrite at some point, so
> we *are* putting the client-connected backend in charge of creating the
> notify files.  However, that only happens on transaction commit, where
> we already accept responsibility for the WAL flush, not on each
> individual XLOG record insert; also, the WAL writer will take care of it
> sometimes, for transactions that are long-enough lived.

Eh.  I just said WAL writer will sometimes do it, and that's true
because it'll occur in XLogBackgroundFlush.  But upthread I wimped out
of having WAL writer call NotifySegmentsReadyForArchive() and instead
opined to give responsibility to bgwriter.  However, thinking about it
again, maybe it does make sense to have walwriter do it too directly.
This causes no harm to walwriter's time constraints, since *it will have
to do it via XLogBackgroundFlush anyway*.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/




Re: archive status ".ready" files may be created too early

2021-08-18 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-18, Bossart, Nathan wrote:

> On 8/18/21, 10:06 AM, "alvhe...@alvh.no-ip.org"  
> wrote:
> > So that comment suggests that we should give the responsibility to bgwriter.
> > This seems good enough to me.  I suppose if bgwriter has a long run of
> > buffers to write it could take a little bit of time (a few hundred
> > milliseconds?) but I think that should be okay.
> 
> Do you think bgwriter should be the only caller of
> NotifySegmentsReadyForArchive(), or should we still have XLogWrite()
> call it?

I think XLogWrite should absolutely be the primary caller.  The one in
bgwriter should be a backstop for the case you describe where the flush
pointer advanced past the registration point in XLogInsertRecord.

I realize this means there's a contradiction with my previous argument,
in that synchronous transaction commit calls XLogWrite at some point, so
we *are* putting the client-connected backend in charge of creating the
notify files.  However, that only happens on transaction commit, where
we already accept responsibility for the WAL flush, not on each
individual XLOG record insert; also, the WAL writer will take care of it
sometimes, for transactions that are long-enough lived.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"How strange it is to find the words "Perl" and "saner" in such close
proximity, with no apparent sense of irony. I doubt that Larry himself
could have managed it." (ncm, http://lwn.net/Articles/174769/)




Re: archive status ".ready" files may be created too early

2021-08-18 Thread Bossart, Nathan
On 8/18/21, 10:06 AM, "alvhe...@alvh.no-ip.org"  wrote:
> So that comment suggests that we should give the responsibility to bgwriter.
> This seems good enough to me.  I suppose if bgwriter has a long run of
> buffers to write it could take a little bit of time (a few hundred
> milliseconds?) but I think that should be okay.

Do you think bgwriter should be the only caller of
NotifySegmentsReadyForArchive(), or should we still have XLogWrite()
call it?

Nathan



Re: archive status ".ready" files may be created too early

2021-08-18 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-17, alvhe...@alvh.no-ip.org wrote:

> However, why do it in a WAL-producing client-connected backend?  It
> strikes me as a bad thing to do, because you are possibly causing delays
> for client-connected backends.  I suggest that we should give this task
> to the WAL writer process -- say, have XLogBackgroundFlush do it.

Reading the comments on walwriter.c I am hesitant of having walwriter do
it:

>  * Because the walwriter's cycle is directly linked to the maximum delay
>  * before async-commit transactions are guaranteed committed, it's probably
>  * unwise to load additional functionality onto it.  For instance, if you've
>  * got a yen to create xlog segments further in advance, that'd be better done
>  * in bgwriter than in walwriter.

So that comment suggests that we should give the responsibility to bgwriter.
This seems good enough to me.  I suppose if bgwriter has a long run of
buffers to write it could take a little bit of time (a few hundred
milliseconds?) but I think that should be okay.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"No renuncies a nada. No te aferres a nada."




Re: archive status ".ready" files may be created too early

2021-08-17 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-17, Bossart, Nathan wrote:
> On 8/17/21, 2:13 PM, "alvhe...@alvh.no-ip.org"  
> wrote:
>
> > So, why isn't it that we call Register in XLogInsertRecord, and
> > Notify in XLogWrite?
> 
> We do.  However, we also call NotifySegmentsReadyForArchive() in
> XLogInsertRecord() to handle the probably-unlikely scenario that the
> flush pointer has already advanced past the to-be-registered boundary.
> This ensures that the .ready files are created as soon as possible.

I see.

I have two thoughts on that.  First, why not do it outside the block
that tests for crossing a segment boundary?  If that's a good thing to
do, then we should do it always.

However, why do it in a WAL-producing client-connected backend?  It
strikes me as a bad thing to do, because you are possibly causing delays
for client-connected backends.  I suggest that we should give this task
to the WAL writer process -- say, have XLogBackgroundFlush do it.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"All rings of power are equal,
But some rings of power are more equal than others."
 (George Orwell's The Lord of the Rings)




Re: archive status ".ready" files may be created too early

2021-08-17 Thread Bossart, Nathan
On 8/17/21, 2:13 PM, "alvhe...@alvh.no-ip.org"  wrote:
> On 2021-Aug-17, Bossart, Nathan wrote:
>
>> The main reason for registering the boundaries in XLogInsertRecord()
>> is that it has the required information about the WAL record
>> boundaries.  I do not think that XLogWrite() has this information.
>
> Doh, of course.  So, why isn't it that we call Register in
> XLogInsertRecord, and Notify in XLogWrite?

We do.  However, we also call NotifySegmentsReadyForArchive() in
XLogInsertRecord() to handle the probably-unlikely scenario that the
flush pointer has already advanced past the to-be-registered boundary.
This ensures that the .ready files are created as soon as possible.

Nathan



Re: archive status ".ready" files may be created too early

2021-08-17 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-17, Bossart, Nathan wrote:

> The main reason for registering the boundaries in XLogInsertRecord()
> is that it has the required information about the WAL record
> boundaries.  I do not think that XLogWrite() has this information.

Doh, of course.  So, why isn't it that we call Register in
XLogInsertRecord, and Notify in XLogWrite?

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"La fuerza no está en los medios físicos
sino que reside en una voluntad indomable" (Gandhi)




Re: archive status ".ready" files may be created too early

2021-08-17 Thread alvhe...@alvh.no-ip.org
The thing I still don't understand about this patch is why we call
RegisterSegmentBoundaryEntry and NotifySegmentsReadyForArchive in
XLogInsertRecord.

My model concept of this would have these routines called only in
XLogFlush / XLogWrite, which are conceptually about transferring data
from in-memory WAL buffers into the on-disk WAL store (pg_xlog files).
As I understand, XLogInsertRecord is about copying bytes from the
high-level operation (heap insert etc) into WAL buffers.  So doing the
register/notify dance in both places should be redundant and
unnecessary.


In the NotifySegmentsReadyForArchive() call at the bottom of XLogWrite,
we use flushed=InvalidXLogRecPtr.  Why is that?  Surely we can use
LogwrtResult.Flush, just like in the other callsite there?  (If we're
covering for somebody advancing FlushRecPtr concurrently, I think we
should add a comment to explain that.  But why do we need to do that in the
first place?)

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The eagle never lost so much time, as
when he submitted to learn of the crow." (William Blake)




Re: archive status ".ready" files may be created too early

2021-08-17 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-17, Bossart, Nathan wrote:

> I think we are in agreement.  If we assume that the flush pointer
> jumps along record boundaries and segment boundaries, the solution
> would be to avoid using the flush pointer when it points to a segment
> boundary (given that the segment boundary is not also a record
> boundary).  Instead, we'd only send up to the start position of the
> last record in the segment to standbys.

Agreed.

An implementation of that would be to test whether the flush pointer sits
exactly on a segment boundary, and in that case (acquire the segment
boundary lock and) test for its presence in the segment boundary map.  If
present, retreat the pointer to the record's start address.
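
A rough sketch of what that could look like -- note it assumes the
boundary entries also carry the record's start position (a field the
current patch does not have), and it only handles the common case of the
record ending in the immediately following segment:

static XLogRecPtr
ClampFlushPtrToRecordStart(XLogRecPtr flushed)
{
    /* only interesting when the flush pointer sits exactly on a boundary */
    if (XLogSegmentOffset(flushed, wal_segment_size) == 0)
    {
        SegmentBoundaryEntry *entry;
        XLogSegNo   seg;

        XLByteToSeg(flushed, seg, wal_segment_size);

        LWLockAcquire(SegmentBoundaryLock, LW_SHARED);
        entry = (SegmentBoundaryEntry *)
            hash_search(SegmentBoundaryMap, (void *) &seg, HASH_FIND, NULL);
        if (entry != NULL)
            flushed = entry->startpos;  /* hypothetical extra field */
        LWLockRelease(SegmentBoundaryLock);
    }
    return flushed;
}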

This means that we acquire the segment boundary lock rarely.  I was
concerned that we'd need to acquire it every time we read the flush
pointer, which would have been a disaster.

Thanks

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: archive status ".ready" files may be created too early

2021-08-17 Thread Bossart, Nathan
On 8/17/21, 10:44 AM, "alvhe...@alvh.no-ip.org"  wrote:
> On 2021-Aug-17, Bossart, Nathan wrote:
>> I've been thinking about the next steps for this one, too.  ISTM we'll
>> need to basically assume that the flush pointer jumps along record
>> boundaries except for the cross-segment records.  I don't know if that
>> is the safest assumption, but I think the alternative involves
>> recording every record boundary in the map.
>
> I'm not sure I understand your idea correctly.  Perhaps another solution
> is to assume that the flush pointer jumps along record boundaries
> *including* for cross-segment records.  The problem stems precisely from
> the fact that we set the flush pointer at segment boundaries, even when
> they aren't a record boundary.

I think we are in agreement.  If we assume that the flush pointer
jumps along record boundaries and segment boundaries, the solution
would be to avoid using the flush pointer when it points to a segment
boundary (given that the segment boundary is not also a record
boundary).  Instead, we'd only send up to the start position of the
last record in the segment to standbys.

Nathan



Re: archive status ".ready" files may be created too early

2021-08-17 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-17, Bossart, Nathan wrote:

> On 8/16/21, 5:09 PM, "alvhe...@alvh.no-ip.org"  
> wrote:
> > The reason for the latter is that I suspect the segment boundary map
> > will also have a use to fix the streaming replication issue I mentioned
> > earlier in the thread.  This also makes me think that we'll want the wal
> > record *start* address to be in the hash table too, not just its *end*
> > address.  So we'll use the start-1 as position to send, rather than the
> > end-of-segment when GetFlushRecPtr() returns that.
> 
> I've been thinking about the next steps for this one, too.  ISTM we'll
> need to basically assume that the flush pointer jumps along record
> boundaries except for the cross-segment records.  I don't know if that
> is the safest assumption, but I think the alternative involves
> recording every record boundary in the map.

I'm not sure I understand your idea correctly.  Perhaps another solution
is to assume that the flush pointer jumps along record boundaries
*including* for cross-segment records.  The problem stems precisely from
the fact that we set the flush pointer at segment boundaries, even when
they aren't a record boundary.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Bob [Floyd] used to say that he was planning to get a Ph.D. by the "green
stamp method," namely by saving envelopes addressed to him as 'Dr. Floyd'.
After collecting 500 such letters, he mused, a university somewhere in
Arizona would probably grant him a degree.  (Don Knuth)




Re: archive status ".ready" files may be created too early

2021-08-16 Thread alvhe...@alvh.no-ip.org
So why do we call this structure "record boundary map", when the
boundaries it refers to are segment boundaries?  I think we should call
it "segment boundary map" instead ... and also I think we should use
that name in the lock that protects it, so instead of ArchNotifyLock, it
could be SegmentBoundaryLock or perhaps WalSegmentBoundaryLock.

The reason for the latter is that I suspect the segment boundary map
will also have a use to fix the streaming replication issue I mentioned
earlier in the thread.  This also makes me think that we'll want the wal
record *start* address to be in the hash table too, not just its *end*
address.  So we'll use the start-1 as position to send, rather than the
end-of-segment when GetFlushRecPtr() returns that.

As for 0xFFFFFFFFFFFFFFFF, I think it would be cleaner to do a
#define MaxXLogSegNo with that value in the line immediately after
typedef XLogSegNo, rather than use the numerical value directly in the
assignment.

Typo in comment atop RemoveRecordBoundariesUpTo: it reads "up to an",
should read "up to and".

I think the API of GetLatestRecordBoundarySegment would be better by
returning the boolean and having the segment as an out argument.  Then the
caller could be written more cleanly:

if (GetLatestRecordBoundarySegment(last_notified, flushed,
                                   &latest_boundary_segment))
{
    SetLastNotified( ... );
    RemoveRecordBoundaries( ... );
    LWLockRelease( ... );
    for (...)
        XLogArchiveNotifySeg(...);
    PgArchWakeup();
}
else
    LWLockRelease(...);

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"La virtud es el justo medio entre dos defectos" (Aristóteles)




Re: archive status ".ready" files may be created too early

2021-08-06 Thread Kyotaro Horiguchi
At Fri, 6 Aug 2021 00:21:34 +, "Bossart, Nathan"  
wrote in 
> On 8/5/21, 12:39 AM, "Kyotaro Horiguchi"  wrote:
> > Honestly I don't like having this initialization in XLogWrite.  We
> > should and I think can initialize it earlier.  It seems to me the most
> > appropriate timing to initialize the variable is just before running
> > the end-of-recovery checkpoint, since StartupXLOG knows the first
> > segment to write at that point.  If it were set to 0, that wouldn't matter at all.
> > We can get rid of the new symbol by doing this.
> 
> This seems like a good idea to me.  I made this change in v5.  I
> performed some basic testing, and it seems to reliably initialize
> lastNotifiedSeg correctly.
> 
> > +   /*
> > +* EndOfLog resides on the next segment of the last finished one. Set 
> > the
> > +* last finished segment as lastNotifiedSeg now.  In the case where the
> > +* last crash has left the last several segments not being marked as
> > +* .ready, the checkpoint just after does that for all finished 
> > segments.
> > +* There's a corner case where the checkpoint advances segment, but that
> > +* ends up at most with a duplicate archive notification.
> > +*/
> 
> I'm not quite following the corner case you've described here.  Is it
> possible that the segment that EndOfLog points to will be eligible for
> removal after the checkpoint?

Archiving doesn't immediately mean removal.  A finished segment
ought to be archived right away.  Since the EndOfLog segment must not
get marked .ready, setting lastNotifiedSeg to the previous segment is
quite right, but if the end-of-recovery checkpoint advances the segment,
EndOfLog is marked .ready at the XLogFlush just after.  But, sorry,
what I forgot at the time was that the checkpoint also moves
lastNotifiedSeg.  So, sorry, that corner case does not exist.

> In v5 of the patch, I've also added an extra call to
> NotifySegmentsReadyForArchive() in the same place we previously
> created the .ready files.  I think this helps notify archiver sooner
> in certain cases (e.g., asynchronous commit).

In v5, NotifySegmentsReadyForArchive() still holds ArchNotifyLock while
creating the .ready files. Since the notification loop doesn't need the
hash itself, couldn't the loop be taken out of the locked section?

current:
LWLockAcquire(ArchNotifyLock, LW_EXCLUSIVE);
last_notified = GetLastNotifiedSegment();
latest_boundary_seg = GetLatestRecordBoundarySegment(last_notified, flushed, &found);

if (found)
{
SetLastNotifiedSegment(latest_boundary_seg - 1);
for (seg = last_notified + 1; seg < latest_boundary_seg; seg++)
XLogArchiveNotifySeg(seg, false);

RemoveRecordBoundariesUpTo(latest_boundary_seg);

PgArchWakeup();
}
LWLockRelease(ArchNotifyLock);

But we can release the lock earlier.

LWLockAcquire(ArchNotifyLock, LW_EXCLUSIVE);
last_notified = GetLastNotifiedSegment();
latest_boundary_seg = GetLatestRecordBoundarySegment(last_notified, flushed, &found);

if (found)
{
SetLastNotifiedSegment(latest_boundary_seg - 1);
RemoveRecordBoundariesUpTo(latest_boundary_seg);
}
LWLockRelease(ArchNotifyLock);

if (found)
{
for (seg = last_notified + 1; seg < latest_boundary_seg; seg++)
XLogArchiveNotifySeg(seg, false);

PgArchWakeup();
}

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: archive status ".ready" files may be created too early

2021-08-05 Thread Kyotaro Horiguchi
At Thu, 5 Aug 2021 05:15:04 +, "Bossart, Nathan"  
wrote in 
> On 8/4/21, 9:05 PM, "Kyotaro Horiguchi"  wrote:
> > By the way about the v3 patch,
> >
> > +#define InvalidXLogSegNo   ((XLogSegNo) 0xFFFFFFFFFFFFFFFF)
> >
> > Like InvalidXLogRecPtr, the first valid segment number is 1 so we can
> > use 0 as InvalidXLogSegNo.
> 
> It's been a while since I wrote this, but if I remember correctly, the
> issue with using 0 is that we could end up initializing
> lastNotifiedSeg to InvalidXLogSegNo in XLogWrite().  Eventually, we'd
> initialize it to 1, but we will have skipped creating the .ready file
> for the first segment.

Maybe this?

+   SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1);

Hmm. Theoretically 0 is invalid as a segment number. So we'd better not
use 0 as a valid value of lastNotifiedSeg.

Honestly I don't like having this initialization in XLogWrite.  We
should and I think can initialize it earlier.  It seems to me the most
appropriate timing to initialize the variable is just before running
the end-of-recovery checkpoint, since StartupXLOG knows the first
segment to write at that point.  If it were set to 0, that wouldn't matter at all.
We can get rid of the new symbol by doing this.

Maybe something like this:

>   {
>   /*
>* There is no partial block to copy. Just set InitializedUpTo, 
> and
>* let the first attempt to insert a log record to initialize 
> the next
>* buffer.
>*/
>   XLogCtl->InitializedUpTo = EndOfLog;
>   }
> 
+   /*
+* EndOfLog resides on the next segment of the last finished one. Set 
the
+* last finished segment as lastNotifiedSeg now.  In the case where the
+* last crash has left the last several segments not being marked as
+* .ready, the checkpoint just after does that for all finished 
segments.
+* There's a corner case where the checkpoint advances segment, but that
+* ends up at most with a duplicate archive notification.
+*/
+   XLByteToSeg(EndOfLog, EndOfLogSeg, wal_segment_size);
+   Assert(EndOfLogSeg > 0);
+   SetLastNotifiedSegment(EndOfLogSeg - 1);
+
>   LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;

Does this makes sense?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: archive status ".ready" files may be created too early

2021-08-04 Thread Bossart, Nathan
On 8/4/21, 9:05 PM, "Kyotaro Horiguchi"  wrote:
> By the way about the v3 patch,
>
> +#define InvalidXLogSegNo   ((XLogSegNo) 0xFFFFFFFFFFFFFFFF)
>
> Like InvalidXLogRecPtr, the first valid segment number is 1 so we can
> use 0 as InvalidXLogSegNo.

It's been a while since I wrote this, but if I remember correctly, the
issue with using 0 is that we could end up initializing
lastNotifiedSeg to InvalidXLogSegNo in XLogWrite().  Eventually, we'd
initialize it to 1, but we will have skipped creating the .ready file
for the first segment.

Nathan



Re: archive status ".ready" files may be created too early

2021-08-04 Thread Kyotaro Horiguchi
By the way about the v3 patch,

+#define InvalidXLogSegNo   ((XLogSegNo) 0xFFFFFFFFFFFFFFFF)

Like InvalidXLogRecPtr, the first valid segment number is 1 so we can
use 0 as InvalidXLogSegNo.

BootStrapXLOG():
/* Create first XLOG segment file */
openLogFile = XLogFileInit(1);

KeepLogSeg():
/* avoid underflow, don't go below 1 */
if (currSegNo <= keep_segs)
segno = 1;

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: archive status ".ready" files may be created too early

2021-08-04 Thread Kyotaro Horiguchi
At Tue, 3 Aug 2021 21:32:18 +, "Bossart, Nathan"  
wrote in 
> On 8/2/21, 7:37 PM, "Kyotaro Horiguchi"  wrote:
> > I'm afraid that using hash to store boundary info is too much. Isn't a
> > ring buffer enough for this use?  In that case it is enough to
> > remember only the end LSN of the segment spanning records.  It is easy
> > to expand the buffer if needed.
> 
> I agree that the hash table requires a bit more memory than what is
> probably necessary, but I'm not sure I agree that maintaining a custom
> data structure to save a few kilobytes of memory is worth the effort.

Memory is one of my concerns, but the more significant point was the CPU
cycles required by GetLatestRecordBoundarySegment.  So I don't mind it
using a hash if the loop over the hash didn't block other backends.

In addition to that, while NotifySegmentsReadyForArchive() is notifying
pending segments, other backends that simultaneously reach there are
blocked until the notification, including file creation, finishes.  I
don't think that's great. Couldn't we set lastNotifiedSegment before
the loop?  The moment a backend decides to notify some segments,
others no longer need to consider those segments.  Even if the backend
crashes meanwhile, as you mentioned below, it's safe since the
unnotified segments are notified after restart.

> > @@ -2602,7 +2855,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
> > LogwrtResult.Flush = LogwrtResult.Write;
> > /* end of page */
> >
> > if (XLogArchivingActive())
> > -   XLogArchiveNotifySeg(openLogSegNo);
> > +   SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1);
> >
> > Is it safe?  If the server didn't notify the WAL files for the 3 most
> > recently finished segments in the previous server life, they need to be
> > archived in this lifetime. But this maybe omits all of the three.
> > (I didn't confirm that behavior.)
> 
> I tested this scenario out earlier [0].  It looks like the call to
> XLogArchiveCheckDone() in RemoveOldXlogFiles() will take care of
> creating any .ready files we missed.

Yeah, I recalled that behavior. In that case crash recovery reads up
to just before the last (continued) record in the last finished
segment.  On the other hand, if crash recovery was able to read that
record, it's safe to archive the last segment immediately after
recovery.  So that behavior is safe. Thanks!

> >> I believe my worry was that we'd miss notifying a segment as soon as
> >> possible if the record was somehow flushed prior to registering the
> >> record boundary in the map.  If that's actually impossible, then I
> >> would agree that the extra call to NotifySegmentsReadyForArchive() is
> >> unnecessary.
> >
> > I don't think that XLogWrite(up to LSN=X) can happen before
> > XLogInsert(endpos = X) ends.
> 
> Is there anything preventing that from happening?  At the location
> where we are registering the record boundary, we've already called
> CopyXLogRecordToWAL(), and neither the WAL insertion lock nor the
> WALWriteLock are held.  Even if we register the boundary before
> updating the shared LogwrtRqst.Write, there's a chance that someone
> else has already moved it ahead and called XLogWrite().  I think the
> worst case scenario is that we hold off creating .ready files longer
> than necessary, but IMO that's still a worthwhile thing to do.

Oh, boundary registration actually happens after an insertion ends
(but before XLogInsert ends). The missed segment is never processed
due to the qualification by lastNotifiedSeg.

Would it work for RegisterRecordBoundaryEntry to omit registering the
boundary if it finds that lastNotifiedSeg has gone too far?

> Nathan
> 
> [0] https://postgr.es/m/DA71434B-7340-4984-9B91-F085BC47A778%40amazon.com
> 

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: archive status ".ready" files may be created too early

2021-08-03 Thread Bossart, Nathan
On 8/2/21, 7:37 PM, "Kyotaro Horiguchi"  wrote:
> I'm afraid that using hash to store boundary info is too much. Isn't a
> ring buffer enough for this use?  In that case it is enough to
> remember only the end LSN of the segment spanning records.  It is easy
> to expand the buffer if needed.

I agree that the hash table requires a bit more memory than what is
probably necessary, but I'm not sure I agree that maintaining a custom
data structure to save a few kilobytes of memory is worth the effort.

> @@ -2602,7 +2855,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
> LogwrtResult.Flush = LogwrtResult.Write;  
>   /* end of page */
>
> if (XLogArchivingActive())
> -   XLogArchiveNotifySeg(openLogSegNo);
> +   SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1);
>
> Is it safe?  If the server didn't notify the WAL files for the 3 most
> recently finished segments in the previous server life, they need to be
> archived in this lifetime. But this maybe omits all of the three.
> (I didn't confirm that behavior.)

I tested this scenario out earlier [0].  It looks like the call to
XLogArchiveCheckDone() in RemoveOldXlogFiles() will take care of
creating any .ready files we missed.

>> I believe my worry was that we'd miss notifying a segment as soon as
>> possible if the record was somehow flushed prior to registering the
>> record boundary in the map.  If that's actually impossible, then I
>> would agree that the extra call to NotifySegmentsReadyForArchive() is
>> unnecessary.
>
> I don't think that XLogWrite(up to LSN=X) can happen before
> XLogInsert(endpos = X) ends.

Is there anything preventing that from happening?  At the location
where we are registering the record boundary, we've already called
CopyXLogRecordToWAL(), and neither the WAL insertion lock nor the
WALWriteLock are held.  Even if we register the boundary before
updating the shared LogwrtRqst.Write, there's a chance that someone
else has already moved it ahead and called XLogWrite().  I think the
worst case scenario is that we hold off creating .ready files longer
than necessary, but IMO that's still a worthwhile thing to do.

Nathan

[0] https://postgr.es/m/DA71434B-7340-4984-9B91-F085BC47A778%40amazon.com



Re: archive status ".ready" files may be created too early

2021-08-02 Thread Kyotaro Horiguchi
At Mon, 2 Aug 2021 23:28:19 +, "Bossart, Nathan"  
wrote in 
> On 8/2/21, 2:42 PM, "Alvaro Herrera"  wrote:
> > I think it's okay to make lastNotifiedSeg protected by just info_lck,
> > and RecordBoundaryMap protected by just ArchNotifyLock.  It's okay to
> > acquire the spinlock inside the lwlock-protected area, as long as we
> > make sure never to do the opposite.  (And we sure don't want to hold
> > info_lck long enough that a LWLock acquisition would occur in the
> > meantime).  So I modified things that way, and also added another
> > function to set the seg if it's unset, with a single spinlock
> > acquisition (rather than acquire, read, release, acquire, set,
> > release, which sounds like it would have trouble behaving.)
> 
> The patch looks good to me.

+   for (seg = flushed_seg; seg > last_notified; seg--)
+   {
+   RecordBoundaryEntry *entry;
+
+   entry = (RecordBoundaryEntry *) hash_search(RecordBoundaryMap,
+   (void *) &seg, HASH_FIND,

I'm afraid that using hash to store boundary info is too much. Isn't a
ring buffer enough for this use?  In that case it is enough to
remember only the end LSN of the segment spanning records.  It is easy
to expand the buffer if needed.
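
(For illustration, such a ring buffer could be as simple as the sketch
below; the size and names are made up.)

#define MAX_PENDING_BOUNDARIES  64      /* illustrative */

typedef struct SegBoundaryRing
{
    XLogRecPtr  endpos[MAX_PENDING_BOUNDARIES];
    int         head;                   /* index of the oldest entry */
    int         count;                  /* number of valid entries */
} SegBoundaryRing;

/* Append a new boundary; caller holds the protecting lock. */
static void
RingPushBoundary(SegBoundaryRing *ring, XLogRecPtr endpos)
{
    Assert(ring->count < MAX_PENDING_BOUNDARIES);
    ring->endpos[(ring->head + ring->count) % MAX_PENDING_BOUNDARIES] = endpos;
    ring->count++;
}

/* Drop boundaries whose records are fully flushed; caller holds the lock. */
static void
RingTrimBoundaries(SegBoundaryRing *ring, XLogRecPtr flushed)
{
    while (ring->count > 0 && ring->endpos[ring->head] <= flushed)
    {
        ring->head = (ring->head + 1) % MAX_PENDING_BOUNDARIES;
        ring->count--;
    }
}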

+   if (!XLogSegNoIsInvalid(latest_boundary_seg))

It is a matter of taste, but I see latest_boundary_seg !=
InvalidXLogSegNo more frequently, maybe to avoid the double negation.


@@ -1167,10 +1195,33 @@ XLogInsertRecord(XLogRecData *rdata,
SpinLockRelease(&XLogCtl->info_lck);
}
 
+   /*
+* Record the record boundary if we crossed the segment boundary.  This is
...
+   XLByteToSeg(StartPos, StartSeg, wal_segment_size);
+   XLByteToSeg(EndPos, EndSeg, wal_segment_size);
+
+   if (StartSeg != EndSeg && XLogArchivingActive())
+   {

The immediately preceding if block is for cross-page records, so we can
reduce the overhead of the above calculations by moving them into that
if-block.


+RegisterRecordBoundaryEntry(XLogSegNo seg, XLogRecPtr pos)

The seg is restricted to the segment that pos resides on.  The caller
would be freed from caring about that restriction if the function took
only pos.  It adds a small overhead to calculate the segment number from
the LSN, but I don't think it matters much. (Or, if we don't use a hash,
that calculation is not required at all.)


@@ -2602,7 +2855,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
LogwrtResult.Flush = LogwrtResult.Write;
/* end of page */
 
if (XLogArchivingActive())
-   XLogArchiveNotifySeg(openLogSegNo);
+   SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1);

Is it safe?  If the server didn't notify the WAL files for the 3 most
recently finished segments in the previous server life, they need to be
archived in this lifetime. But this maybe omits all of the three.
(I didn't confirm that behavior.)

> > I find it highly suspicious that the patch does an archiver notify (i.e.
> > creation of the .ready file) in XLogInsertRecord().  Is that a sane
> > thing to do?  Sounds to me like that should be attempted in XLogFlush
> > only.  This appeared after Kyotaro's patch at [1] and before your patch
> > at [2].
> 
> I believe my worry was that we'd miss notifying a segment as soon as
> possible if the record was somehow flushed prior to registering the
> record boundary in the map.  If that's actually impossible, then I
> would agree that the extra call to NotifySegmentsReadyForArchive() is
> unnecessary.

I don't think that XLogWrite(up to LSN=X) can happen before
XLogInsert(endpos = X) ends.

> >> I think the main downside of making lastNotifiedSeg an atomic is that
> >> the value we first read in NotifySegmentsReadyForArchive() might not
> >> be up-to-date, which means we might hold off creating .ready files
> >> longer than necessary.
> >
> > I'm not sure I understand how this would be a problem.  If we block
> > somebody from setting a newer value, they'll just set the value
> > immediately after we release the lock.  Will we reread the value
> > afterwards to see if it changed?
> 
> I think you are right.  If we see an old value for lastNotifiedSeg,
> the worst case is that we take the ArchNotifyLock, read
> lastNotifiedSeg again (which should then be up-to-date), and then

Agreed.

> basically do nothing.  I suspect initializing lastNotifiedSeg might
> still be a little tricky, though.  Do you think it is important to try
> to avoid this spinlock for lastNotifiedSeg?  IIUC it's acquired at the
> end of every call to XLogWrite() already, and we'd still need to
> acquire it for the flush pointer, anyway.

As mentioned above, I think it needs more consideration.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: archive status ".ready" files may be created too early

2021-08-02 Thread Bossart, Nathan
On 8/2/21, 2:42 PM, "Alvaro Herrera"  wrote:
> I think it's okay to make lastNotifiedSeg protected by just info_lck,
> and RecordBoundaryMap protected by just ArchNotifyLock.  It's okay to
> acquire the spinlock inside the lwlock-protected area, as long as we
> make sure never to do the opposite.  (And we sure don't want to hold
> info_lck long enough that a LWLock acquisition would occur in the
> meantime).  So I modified things that way, and also added another
> function to set the seg if it's unset, with a single spinlock
> acquisition (rather than acquire, read, release, acquire, set,
> release, which sounds like it would have trouble behaving.)

The patch looks good to me.

> I find it highly suspicious that the patch does an archiver notify (i.e.
> creation of the .ready file) in XLogInsertRecord().  Is that a sane
> thing to do?  Sounds to me like that should be attempted in XLogFlush
> only.  This appeared after Kyotaro's patch at [1] and before your patch
> at [2].

I believe my worry was that we'd miss notifying a segment as soon as
possible if the record was somehow flushed prior to registering the
record boundary in the map.  If that's actually impossible, then I
would agree that the extra call to NotifySegmentsReadyForArchive() is
unnecessary.

>> I think the main downside of making lastNotifiedSeg an atomic is that
>> the value we first read in NotifySegmentsReadyForArchive() might not
>> be up-to-date, which means we might hold off creating .ready files
>> longer than necessary.
>
> I'm not sure I understand how this would be a problem.  If we block
> somebody from setting a newer value, they'll just set the value
> immediately after we release the lock.  Will we reread the value
> afterwards to see if it changed?

I think you are right.  If we see an old value for lastNotifiedSeg,
the worst case is that we take the ArchNotifyLock, read
lastNotifiedSeg again (which should then be up-to-date), and then
basically do nothing.  I suspect initializing lastNotifiedSeg might
still be a little tricky, though.  Do you think it is important to try
to avoid this spinlock for lastNotifiedSeg?  IIUC it's acquired at the
end of every call to XLogWrite() already, and we'd still need to
acquire it for the flush pointer, anyway.

Nathan



Re: archive status ".ready" files may be created too early

2021-08-02 Thread Alvaro Herrera
On 2021-Jul-31, Bossart, Nathan wrote:

> This is why I was trying to get away with just using info_lck for
> reading lastNotifiedSeg.  ArchNotifyLock is mostly intended to protect
> RecordBoundaryMap.  However, since lastNotifiedSeg is used in
> functions like GetLatestRecordBoundarySegment() that access the map, I
> found it easier to reason about things if I knew that it wouldn't
> change as long as I held ArchNotifyLock.

I think it's okay to make lastNotifiedSeg protected by just info_lck,
and RecordBoundaryMap protected by just ArchNotifyLock.  It's okay to
acquire the spinlock inside the lwlock-protected area, as long as we
make sure never to do the opposite.  (And we sure don't want to hold
info_lck long enough that a LWLock acquisition would occur in the
meantime).  So I modified things that way, and also added another
function to set the seg if it's unset, with a single spinlock
acquisition (rather than acquire, read, release, acquire, set,
release, which sounds like it would have trouble behaving.)
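
For illustration only (a sketch in the spirit of that helper, not the patch
itself), the single-acquisition "set if unset" function could look roughly
like this; treating segment number 0 as the "unset" value is an assumption
made for the example:

static void
SetLastNotifiedSegmentIfInvalid(XLogSegNo seg)
{
	/* one spinlock acquisition covers both the test and the set */
	SpinLockAcquire(&XLogCtl->info_lck);
	if (XLogCtl->lastNotifiedSeg == 0)	/* assumption: 0 means "unset" */
		XLogCtl->lastNotifiedSeg = seg;
	SpinLockRelease(&XLogCtl->info_lck);
}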

I haven't tried your repro with this yet.

I find it highly suspicious that the patch does an archiver notify (i.e.
creation of the .ready file) in XLogInsertRecord().  Is that a sane
thing to do?  Sounds to me like that should be attempted in XLogFlush
only.  This appeared after Kyotaro's patch at [1] and before your patch
at [2].

[1] 
https://postgr.es/m/20201014.090628.839639906081252194.horikyota@gmail.com
[2] https://postgr.es/m/eff40306-8e8a-4259-b181-c84f3f066...@amazon.com

I also just realized that Kyotaro's patch there also tried to handle the
streaming replication issue I was talking about.

> I think the main downside of making lastNotifiedSeg an atomic is that
> the value we first read in NotifySegmentsReadyForArchive() might not
> be up-to-date, which means we might hold off creating .ready files
> longer than necessary.

I'm not sure I understand how this would be a problem.  If we block
somebody from setting a newer value, they'll just set the value
immediately after we release the lock.  Will we reread the value
afterwards to see if it changed?

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
>From 87c77464049c719c2075bd94d4e9ed168c7cf159 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 30 Jul 2021 18:35:52 -0400
Subject: [PATCH v3] Avoid creating archive status ".ready" files too early.

WAL records may span multiple segments, but XLogWrite() does not
wait for the entire record to be written out to disk before
creating archive status files.  Instead, as soon as the last WAL
page of the segment is written, the archive status file will be
created.  If PostgreSQL crashes before it is able to write the rest
of the record, it will end up reusing segments that have already
been marked as ready-for-archival.  However, the archiver process
may have already processed the old version of the segment, so the
wrong version of the segment may be backed-up.  This backed-up
segment will cause operations such as point-in-time restores to
fail.

To fix this, we keep track of records that span across segments and
ensure that segments are only marked ready-for-archival once such
records have been completely written to disk.
---
 src/backend/access/transam/timeline.c|   2 +-
 src/backend/access/transam/xlog.c| 286 ++-
 src/backend/access/transam/xlogarchive.c |  14 +-
 src/backend/replication/walreceiver.c|   6 +-
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/include/access/xlogarchive.h |   4 +-
 src/include/access/xlogdefs.h|   5 +
 7 files changed, 296 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..acd5c2431d 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	if (XLogArchivingActive())
 	{
 		TLHistoryFileName(histfname, newTLI);
-		XLogArchiveNotify(histfname);
+		XLogArchiveNotify(histfname, true);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f84c0bb01e..99fe1ac0a2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -512,6 +512,13 @@ typedef enum ExclusiveBackupState
  */
 static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
 
+/* entries for RecordBoundaryMap, used to mark segments ready for archival */
+typedef struct RecordBoundaryEntry
+{
+	XLogSegNo seg;	/* must be first */
+	XLogRecPtr pos;
+} RecordBoundaryEntry;
+
 /*
  * Shared state data for WAL insertion.
  */
@@ -723,6 +730,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * The last segment we've marked ready for archival.  Protected by
+	 * info_lck.
+	 */
+	XLogSegNo lastNotifiedSeg;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 

Re: archive status ".ready" files may be created too early

2021-07-31 Thread Bossart, Nathan
On 7/31/21, 9:12 AM, "Alvaro Herrera"  wrote:
> On 2021-Jul-31, Bossart, Nathan wrote:
>
>> On 7/30/21, 4:52 PM, "Alvaro Herrera"  wrote:
>
>> > I noticed that XLogCtl->lastNotifiedSeg is protected by both the
>> > info_lck and ArchNotifyLock.  I think if it's going to be protected by
>> > the lwlock, then we should drop the use of the spinlock.
>>
>> That seems reasonable to me.  This means that the lock is acquired at
>> the end of every XLogWrite(),
>
> Uhh, actually that there sounds really bad; it's going to be a serious
> contention point.
>
> Another option might be to make it an atomic.

This is why I was trying to get away with just using info_lck for
reading lastNotifiedSeg.  ArchNotifyLock is mostly intended to protect
RecordBoundaryMap.  However, since lastNotifiedSeg is used in
functions like GetLatestRecordBoundarySegment() that access the map, I
found it easier to reason about things if I knew that it wouldn't
change as long as I held ArchNotifyLock.

I think the main downside of making lastNotifiedSeg an atomic is that
the value we first read in NotifySegmentsReadyForArchive() might not
be up-to-date, which means we might hold off creating .ready files
longer than necessary.
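
To sketch the atomic idea (an illustration only, not code from the patch):
if the value only ever moves forward via a compare-and-swap loop, a stale
first read is harmless, because the CAS either succeeds or hands back the
newer value and we retry.  The shared-memory placement and initialization
are assumptions here:

/* assumed to live in shared memory and be pg_atomic_init_u64()'d at startup */
static pg_atomic_uint64 *lastNotifiedSegAtomic;

static void
AdvanceLastNotifiedSeg(XLogSegNo seg)
{
	uint64		cur = pg_atomic_read_u64(lastNotifiedSegAtomic);

	/* only move forward; a concurrent, newer update simply wins */
	while (cur < seg &&
		   !pg_atomic_compare_exchange_u64(lastNotifiedSegAtomic, &cur, seg))
	{
		/* cur now holds the latest value; re-check and retry */
	}
}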

Nathan



Re: archive status ".ready" files may be created too early

2021-07-31 Thread Alvaro Herrera
On 2021-Jul-31, Bossart, Nathan wrote:

> On 7/30/21, 4:52 PM, "Alvaro Herrera"  wrote:

> > I noticed that XLogCtl->lastNotifiedSeg is protected by both the
> > info_lck and ArchNotifyLock.  I think if it's going to be protected by
> > the lwlock, then we should drop the use of the spinlock.
> 
> That seems reasonable to me.  This means that the lock is acquired at
> the end of every XLogWrite(),

Uhh, actually that there sounds really bad; it's going to be a serious
contention point.

Another option might be to make it an atomic.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: archive status ".ready" files may be created too early

2021-07-30 Thread Bossart, Nathan
On 7/30/21, 4:52 PM, "Alvaro Herrera"  wrote:
> I think that creating files out of order might be problematic.  On the
> archiver side, pgarch_readyXlog() expects to return the oldest
> archivable file; but if we create a newer segment's .ready file first,
> it is possible that a directory scan would return that newer file before
> the older segment's .ready file appears.
>
> However, the comments in pgarch_readyXlog() aren't super convincing that
> processing the files in order is actually a correctness requirement, so
> perhaps it doesn't matter all that much.

I can't think of a reason it'd be needed from a correctness
perspective.  After a quick scan, I couldn't find any promises about
archival order in the documentation, either.  In any case, it doesn't
look like there's a risk that the archiver will skip files when the
.ready files are created out of order.

> I noticed that XLogCtl->lastNotifiedSeg is protected by both the
> info_lck and ArchNotifyLock.  I think if it's going to be protected by
> the lwlock, then we should drop the use of the spinlock.

That seems reasonable to me.  This means that the lock is acquired at
the end of every XLogWrite(), but the other places that acquire the
lock only do so once per WAL segment.  Plus, the call to
NotifySegmentsReadyForArchive() at the end of every XLogWrite() should
usually only need the lock for a short amount of time to retrieve a
value from shared memory.

> We set archiver's latch on each XLogArchiveNotify(), but if we're doing
> it in a loop such as in NotifySegmentsReadyForArchive() perhaps it is
> better to create all the .ready files first and do PgArchWakeup() at the
> end.  I'm not convinced that this is useful but let's at least discard
> the idea explicitly if not.

I don't have a terribly strong opinion, but I would lean towards
setting the latch for each call to XLogArchiveNotify() so that the
archiver process can get started as soon as a segment is ready.
However, I doubt that holding off until the end of the loop has any
discernible effect in most cases.
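
For illustration, the two strategies would look roughly like this (a sketch
only; the boolean "wake the archiver" argument on XLogArchiveNotifySeg() is
an assumption borrowed from the patch under discussion):

static void
NotifyRangeWakePerSegment(XLogSegNo from, XLogSegNo to)
{
	XLogSegNo	seg;

	/* archiver can start as soon as the first .ready file appears */
	for (seg = from; seg <= to; seg++)
		XLogArchiveNotifySeg(seg, true);
}

static void
NotifyRangeWakeOnce(XLogSegNo from, XLogSegNo to)
{
	XLogSegNo	seg;

	/* create all .ready files first, then a single wake-up */
	for (seg = from; seg <= to; seg++)
		XLogArchiveNotifySeg(seg, false);
	PgArchWakeup();
}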

Nathan



Re: archive status ".ready" files may be created too early

2021-07-30 Thread Alvaro Herrera
On 2021-Jul-30, Alvaro Herrera wrote:

> We set archiver's latch on each XLogArchiveNotify(), but if we're doing
> it in a loop such as in NotifySegmentsReadyForArchive() perhaps it is
> better to create all the .ready files first and do PgArchWakeup() at the
> end.  I'm not convinced that this is useful but let's at least discard
> the idea explicitly if not.

hm, this causes an ABI change so it's not backpatchable.


-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"This is what I like so much about PostgreSQL.  Most of the surprises
are of the "oh wow!  That's cool" Not the "oh shit!" kind.  :)"
Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php




Re: archive status ".ready" files may be created too early

2021-07-30 Thread Alvaro Herrera
On 2021-Jul-30, Bossart, Nathan wrote:

> Yes, that was what I was worried about.  However, I just performed a
> variety of tests with just 0001 applied, and I am beginning to suspect
> my concerns were unfounded.  With wal_buffers set very high,
> synchronous_commit set to off, and a long sleep at the end of
> XLogWrite(), I can reliably cause the archive status files to lag far
> behind the current open WAL segment.  However, even if I crash at this
> time, the .ready files are created when the server restarts (albeit
> out of order).

I think that creating files out of order might be problematic.  On the
archiver side, pgarch_readyXlog() expects to return the oldest
archivable file; but if we create a newer segment's .ready file first,
it is possible that a directory scan would return that newer file before
the older segment's .ready file appears.

However, the comments in pgarch_readyXlog() aren't super convincing that
processing the files in order is actually a correctness requirement, so
perhaps it doesn't matter all that much.


I noticed that XLogCtl->lastNotifiedSeg is protected by both the
info_lck and ArchNotifyLock.  I think if it's going to be protected by
the lwlock, then we should drop the use of the spinlock.


We set archiver's latch on each XLogArchiveNotify(), but if we're doing
it in a loop such as in NotifySegmentsReadyForArchive() perhaps it is
better to create all the .ready files first and do PgArchWakeup() at the
end.  I'm not convinced that this is useful but let's at least discard
the idea explicitly if not.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Always assume the user will do much worse than the stupidest thing
you can imagine."(Julien PUYDT)
>From f9d94c614d6cf640fdaa09785d0817a1d86b9658 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 30 Jul 2021 18:35:52 -0400
Subject: [PATCH v2] Avoid creating archive status ".ready" files too early.

WAL records may span multiple segments, but XLogWrite() does not
wait for the entire record to be written out to disk before
creating archive status files.  Instead, as soon as the last WAL
page of the segment is written, the archive status file will be
created.  If PostgreSQL crashes before it is able to write the rest
of the record, it will end up reusing segments that have already
been marked as ready-for-archival.  However, the archiver process
may have already processed the old version of the segment, so the
wrong version of the segment may be backed-up.  This backed-up
segment will cause operations such as point-in-time restores to
fail.

To fix this, we keep track of records that span across segments and
ensure that segments are only marked ready-for-archival once such
records have been completely written to disk.
---
 src/backend/access/transam/timeline.c|   2 +-
 src/backend/access/transam/xlog.c| 279 ++-
 src/backend/access/transam/xlogarchive.c |  14 +-
 src/backend/replication/walreceiver.c|   6 +-
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/include/access/xlogarchive.h |   4 +-
 src/include/access/xlogdefs.h|   3 +
 7 files changed, 289 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..acd5c2431d 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	if (XLogArchivingActive())
 	{
 		TLHistoryFileName(histfname, newTLI);
-		XLogArchiveNotify(histfname);
+		XLogArchiveNotify(histfname, true);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 26fa2b6c8f..700bbccff8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -531,6 +531,13 @@ typedef enum ExclusiveBackupState
  */
 static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
 
+/* entries for RecordBoundaryMap, used to mark segments ready for archival */
+typedef struct RecordBoundaryEntry
+{
+	XLogSegNo seg;	/* must be first */
+	XLogRecPtr pos;
+} RecordBoundaryEntry;
+
 /*
  * Shared state data for WAL insertion.
  */
@@ -742,6 +749,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * The last segment we've marked ready for archival.  Protected by
+	 * ArchNotifyLock.
+	 */
+	XLogSegNo lastNotifiedSeg;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -755,6 +768,12 @@ static WALInsertLockPadded *WALInsertLocks = NULL;
  */
 static ControlFileData *ControlFile = NULL;
 
+/*
+ * Record boundary map, used for marking segments as ready for archival.
+ * Protected by ArchNotifyLock.
+ */
+static HTAB *RecordBoundaryMap = NULL;
+
 /*
  * Calculate the amount of space left on the page after 'endptr'. Beware
  * multiple evaluation!
@@ -983,6 +1002,12 @@ 

Re: archive status ".ready" files may be created too early

2021-07-30 Thread Bossart, Nathan
On 7/30/21, 3:23 PM, "Alvaro Herrera"  wrote:
> That's great to hear.  I'll give 0001 a look again.

Much appreciated.

Nathan



Re: archive status ".ready" files may be created too early

2021-07-30 Thread Alvaro Herrera
On 2021-Jul-30, Bossart, Nathan wrote:

> On 7/30/21, 11:34 AM, "Alvaro Herrera"  wrote:
> > Hmm ... I'm not sure we're prepared to backpatch this kind of change.
> > It seems a bit too disruptive to how replay works.  I think we should
> > be focusing solely on patch 0001 to surgically fix the precise
> > bug you see.  Does patch 0002 exist because you think that a system with
> > only 0001 will not correctly deal with a crash at the right time?
> 
> Yes, that was what I was worried about.  However, I just performed a
> variety of tests with just 0001 applied, and I am beginning to suspect
> my concerns were unfounded.  With wal_buffers set very high,
> synchronous_commit set to off, and a long sleep at the end of
> XLogWrite(), I can reliably cause the archive status files to lag far
> behind the current open WAL segment.  However, even if I crash at this
> time, the .ready files are created when the server restarts (albeit
> out of order).  This appears to be due to the call to
> XLogArchiveCheckDone() in RemoveOldXlogFiles().  Therefore, we can
> likely abandon 0002.

That's great to hear.  I'll give 0001 a look again.

> > Now, the reason I'm looking at this patch series is that we're seeing a
> > related problem with walsender/walreceiver, which apparently are capable
> > of creating a file in the replica that ends up not existing in the
> > primary after a crash, for a reason closely related to what you
> > describe for WAL archival.  I'm not sure what is going on just yet, so
> > I'm not going to try and explain because I'm likely to get it wrong.
> 
> I've suspected that this is due to the use of the flushed location for
> the send pointer, which AFAICT needn't align with a WAL record
> boundary.

Yeah, I had gotten as far as the GetFlushRecPtr but haven't tracked down
what happens with a contrecord.


-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: archive status ".ready" files may be created too early

2021-07-30 Thread Bossart, Nathan
On 7/30/21, 11:34 AM, "Alvaro Herrera"  wrote:
> Hmm ... I'm not sure we're prepared to backpatch this kind of change.
> It seems a bit too disruptive to how replay works.  I think we should
> be focusing solely on patch 0001 to surgically fix the precise
> bug you see.  Does patch 0002 exist because you think that a system with
> only 0001 will not correctly deal with a crash at the right time?

Yes, that was what I was worried about.  However, I just performed a
variety of tests with just 0001 applied, and I am beginning to suspect
my concerns were unfounded.  With wal_buffers set very high,
synchronous_commit set to off, and a long sleep at the end of
XLogWrite(), I can reliably cause the archive status files to lag far
behind the current open WAL segment.  However, even if I crash at this
time, the .ready files are created when the server restarts (albeit
out of order).  This appears to be due to the call to
XLogArchiveCheckDone() in RemoveOldXlogFiles().  Therefore, we can
likely abandon 0002.

> Now, the reason I'm looking at this patch series is that we're seeing a
> related problem with walsender/walreceiver, which apparently are capable
> of creating a file in the replica that ends up not existing in the
> primary after a crash, for a reason closely related to what you
> describe for WAL archival.  I'm not sure what is going on just yet, so
> I'm not going to try and explain because I'm likely to get it wrong.

I've suspected that this is due to the use of the flushed location for
the send pointer, which AFAICT needn't align with a WAL record
boundary.

        /*
         * Streaming the current timeline on a primary.
         *
         * Attempt to send all data that's already been written out and
         * fsync'd to disk.  We cannot go further than what's been written out
         * given the current implementation of WALRead().  And in any case
         * it's unsafe to send WAL that is not securely down to disk on the
         * primary: if the primary subsequently crashes and restarts, standbys
         * must not have applied any WAL that got lost on the primary.
         */
        SendRqstPtr = GetFlushRecPtr();

Nathan



Re: archive status ".ready" files may be created too early

2021-07-30 Thread Alvaro Herrera
On 2021-Jul-28, Bossart, Nathan wrote:

> On 7/27/21, 6:05 PM, "Alvaro Herrera"  wrote:

> > I'm not sure I understand what's the reason not to store this value in
> > pg_control; I feel like I'm missing something.  Can you please explain?
> 
> Thanks for taking a look.
> 
> The only reason I can think of is that it could make back-patching
> difficult.  I don't mind working on a version of the patch that uses
> pg_control.  Back-patching this fix might be a stretch, anyway.

Hmm ... I'm not sure we're prepared to backpatch this kind of change.
It seems a bit too disruptive to how replay works.  I think we should
be focusing solely on patch 0001 to surgically fix the precise
bug you see.  Does patch 0002 exist because you think that a system with
only 0001 will not correctly deal with a crash at the right time?


Now, the reason I'm looking at this patch series is that we're seeing a
related problem with walsender/walreceiver, which apparently are capable
of creating a file in the replica that ends up not existing in the
primary after a crash, for a reason closely related to what you
describe for WAL archival.  I'm not sure what is going on just yet, so
I'm not going to try and explain because I'm likely to get it wrong.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: archive status ".ready" files may be created too early

2021-07-27 Thread Bossart, Nathan
On 7/27/21, 6:05 PM, "Alvaro Herrera"  wrote:
> On 2021-Feb-19, Bossart, Nathan wrote:
>
>> 0002 adds logic for persisting the last notified segment through
>> crashes.  This is needed because a poorly-timed crash could otherwise
>> cause us to skip marking segments as ready-for-archival altogether.
>> This file is only used for primary servers, as there exists a separate
>> code path for marking segments as ready-for-archive for standbys.
>
> I'm not sure I understand what's the reason not to store this value in
> pg_control; I feel like I'm missing something.  Can you please explain?

Thanks for taking a look.

The only reason I can think of is that it could make back-patching
difficult.  I don't mind working on a version of the patch that uses
pg_control.  Back-patching this fix might be a stretch, anyway.

> There were some comments earlier in the thread about the maximum size of
> a record.  As I recall, you can have records of arbitrary size if you
> have COMMIT with a large number of relation invalidation messages being
> included in the xlog record, or a large number of XIDs of
> subtransactions in the transaction.  Spanning several segments is
> possible, AFAIU.

This is my understanding, too.

Nathan



Re: archive status ".ready" files may be created too early

2021-07-27 Thread Alvaro Herrera
On 2021-Feb-19, Bossart, Nathan wrote:

> 0002 adds logic for persisting the last notified segment through
> crashes.  This is needed because a poorly-timed crash could otherwise
> cause us to skip marking segments as ready-for-archival altogether.
> This file is only used for primary servers, as there exists a separate
> code path for marking segments as ready-for-archive for standbys.

I'm not sure I understand what's the reason not to store this value in
pg_control; I feel like I'm missing something.  Can you please explain?

There were some comments earlier in the thread about the maximum size of
a record.  As I recall, you can have records of arbitrary size if you
have COMMIT with a large number of relation invalidation messages being
included in the xlog record, or a large number of XIDs of
subtransactions in the transaction.  Spanning several segments is
possible, AFAIU.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)




Re: archive status ".ready" files may be created too early

2021-01-27 Thread Bossart, Nathan
On 1/26/21, 6:36 PM, "Kyotaro Horiguchi"  wrote:
> At Tue, 26 Jan 2021 19:13:57 +, "Bossart, Nathan"  
> wrote in
>> On 12/17/20, 9:15 PM, "Kyotaro Horiguchi"  wrote:
>> > At Thu, 17 Dec 2020 22:20:35 +, "Bossart, Nathan" 
>> >  wrote in
>> >> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi"  wrote:
>> >> > You're right in that regard. There's a window where partial record is
>> >> > written when write location passes F0 after insertion location passes
>> >> > F1. However, remembering all spanning records seems overkilling to me.
>> >>
>> >> I'm curious why you feel that recording all cross-segment records is
>> >> overkill.  IMO it seems far simpler to just do that rather than try to
>> >
>> > Sorry, my words are not enough. Remembering all spanning records in
>> > *shared memory* seems to be overkilling.  Much more if it is stored in
>> > shared hash table.  Even though it rarely the case, it can fail hard
>> > way when reaching the limit. If we could do well by remembering just
>> > two locations, we wouldn't need to worry about such a limitation.
>>
>> I don't think it will fail if we reach max_size for the hash table.
>> The comment above ShmemInitHash() has this note:
>>
>> * max_size is the estimated maximum number of hashtable entries.  This is
>> * not a hard limit, but the access efficiency will degrade if it is
>> * exceeded substantially (since it's used to compute directory size and
>> * the hash table buckets will get overfull).
>
> That description means that a shared hash has a directory with fixed
> size thus there may be synonyms, which causes degradation. Even though
> buckets are preallocated with the specified number, since the minimum
> directory size is 256, buckets are allocated at least 256 in a long
> run. Minimum on-the-fly allocation size is 32. I haven't calculated
> further precisely, but I'm worried about the amount of spare shared
> memory the hash can allocate.

On my machine, hash_estimate_size() for the table returns 5,968 bytes.
That estimate is for a max_size of 16.  In my testing, I've been able
to need up to 6 elements in this table, but that required turning off
synchronous_commit, adding a long sleep at the end of XLogWrite(), and
increasing wal_buffers substantially.  This leads me to think that a
max_size of 16 elements is typically sufficient.  (I may have also
accidentally demonstrated that only storing two record boundaries
could be insufficient.)
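
For what it's worth, that figure can be reproduced with the usual estimator;
a sketch (the max_size of 16 matches the number above, the function and
entry type are the ones already shown in the patch):

#include "postgres.h"
#include "utils/hsearch.h"

/* rough shared-memory footprint of the record boundary map */
static Size
RecordBoundaryMapShmemSize(void)
{
	return hash_estimate_size(16, sizeof(RecordBoundaryEntry));
}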

>> > Another concern about the concrete patch:
>> >
>> > NotifySegmentsReadyForArchive() searches the shared hash, acquiring an
>> > LWLock, every time XLogWrite is called while segment archiving is being
>> > held off.  I don't think it is acceptable and I think it could be a
>> > problem when many backends are competing on WAL.
>>
>> This is a fair point.  I did some benchmarking with a few hundred
>> connections all doing writes, and I was not able to discern any
>> noticeable performance impact.  My guess is that contention on this
>> new lock is unlikely because callers of XLogWrite() must already hold
>> WALWriteLock.  Otherwise, I believe we only acquire ArchNotifyLock no
>> more than once per segment to record new record boundaries.
>
> Thanks. I agree that the reader-reader contention is not a problem due
> to existing serialization by WALWriteLock. Adding an entry happens
> only at segment boundary so the ArchNotifyLock doesn't seem to be a
> problem.
>
> However the function prolongs the WALWriteLock section. Couldn't we
> somehow move the call to NotifySegmentsReadyForArchive in XLogWrite
> out of the WALWriteLock section?

I don't see a clean way to do that.  XLogWrite() assumes that
WALWriteLock is held when it is called, and it doesn't release it at
any point.  I think we'd have to move NotifySegmentsReadyForArchive()
to the callers of XLogWrite() if we wanted to avoid holding onto
WALWriteLock for longer.  Unless we can point to a measurable
performance penalty, I'm not sure this is worth it.
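
To make the idea concrete, a hedged sketch of what the caller-side variant
might look like (this is not the patch; the exact argument that
NotifySegmentsReadyForArchive() would take is an assumption):

	/* e.g. in XLogFlush(), around the existing write path */
	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
	XLogWrite(WriteRqst, false);
	LWLockRelease(WALWriteLock);

	/*
	 * WALWriteLock is no longer held, so taking ArchNotifyLock here does
	 * not lengthen the WALWriteLock critical section.
	 */
	NotifySegmentsReadyForArchive(LogwrtResult.Flush);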

Nathan



Re: archive status ".ready" files may be created too early

2021-01-26 Thread Kyotaro Horiguchi
At Tue, 26 Jan 2021 19:13:57 +, "Bossart, Nathan"  
wrote in 
> On 12/17/20, 9:15 PM, "Kyotaro Horiguchi"  wrote:
> > At Thu, 17 Dec 2020 22:20:35 +, "Bossart, Nathan"  
> > wrote in
> >> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi"  wrote:
> >> > You're right in that regard. There's a window where partial record is
> >> > written when write location passes F0 after insertion location passes
> >> > F1. However, remembering all spanning records seems overkilling to me.
> >>
> >> I'm curious why you feel that recording all cross-segment records is
> >> overkill.  IMO it seems far simpler to just do that rather than try to
> >
> > Sorry, my words are not enough. Remembering all spanning records in
> > *shared memory* seems to be overkilling.  Much more if it is stored in
> > shared hash table.  Even though it rarely the case, it can fail hard
> > way when reaching the limit. If we could do well by remembering just
> > two locations, we wouldn't need to worry about such a limitation.
> 
> I don't think it will fail if we reach max_size for the hash table.
> The comment above ShmemInitHash() has this note:
> 
> * max_size is the estimated maximum number of hashtable entries.  This is
> * not a hard limit, but the access efficiency will degrade if it is
> * exceeded substantially (since it's used to compute directory size and
> * the hash table buckets will get overfull).

That description means that a shared hash has a directory with a fixed
size, thus there may be synonyms, which causes degradation. Even though
buckets are preallocated with the specified number, since the minimum
directory size is 256, at least 256 buckets are allocated in the long
run. The minimum on-the-fly allocation size is 32. I haven't calculated it
further precisely, but I'm worried about the amount of spare shared
memory the hash can allocate.

> > Another concern about the concrete patch:
> >
> > NotifySegmentsReadyForArchive() searches the shared hash, acquiring an
> > LWLock, every time XLogWrite is called while segment archiving is being
> > held off.  I don't think it is acceptable and I think it could be a
> > problem when many backends are competing on WAL.
> 
> This is a fair point.  I did some benchmarking with a few hundred
> connections all doing writes, and I was not able to discern any
> noticeable performance impact.  My guess is that contention on this
> new lock is unlikely because callers of XLogWrite() must already hold
> WALWriteLock.  Otherwise, I believe we only acquire ArchNotifyLock no
> more than once per segment to record new record boundaries.

Thanks. I agree that the reader-reader contention is not a problem due
to existing serialization by WALWriteLock. Adding an entry happens
only at segment boundary so the ArchNotifyLock doesn't seem to be a
problem.

However the function prolongs the WALWriteLock section. Couldn't we
somehow move the call to NotifySegmentsReadyForArchive in XLogWrite
out of the WALWriteLock section?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: archive status ".ready" files may be created too early

2021-01-26 Thread Bossart, Nathan
On 1/2/21, 8:55 AM, "Andrey Borodin"  wrote:
> Recently we had somewhat related incident. Do I understand correctly that 
> this incident is related to the bug discussed in this thread?

I'm not sure that we can rule it out, but the log pattern I've
typically seen for this is "invalid contrecord length."  The issue is
that we're marking segments as ready for archive when the segment is
fully written versus when its WAL records are fully written (since its
WAL records may cross into the next segment).  The fact that you're
seeing zeroes at the end of your archived segments leads me to think
it is unlikely that you are experiencing this bug.

Nathan



Re: archive status ".ready" files may be created too early

2021-01-26 Thread Bossart, Nathan
On 12/17/20, 9:15 PM, "Kyotaro Horiguchi"  wrote:
> At Thu, 17 Dec 2020 22:20:35 +, "Bossart, Nathan"  
> wrote in
>> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi"  wrote:
>> > You're right in that regard. There's a window where partial record is
>> > written when write location passes F0 after insertion location passes
>> > F1. However, remembering all spanning records seems overkilling to me.
>>
>> I'm curious why you feel that recording all cross-segment records is
>> overkill.  IMO it seems far simpler to just do that rather than try to
>
> Sorry, my words are not enough. Remembering all spanning records in
> *shared memory* seems to be overkilling.  Much more if it is stored in
> shared hash table.  Even though it rarely the case, it can fail hard
> way when reaching the limit. If we could do well by remembering just
> two locations, we wouldn't need to worry about such a limitation.

I don't think it will fail if we reach max_size for the hash table.
The comment above ShmemInitHash() has this note:

* max_size is the estimated maximum number of hashtable entries.  This is
* not a hard limit, but the access efficiency will degrade if it is
* exceeded substantially (since it's used to compute directory size and
* the hash table buckets will get overfull).

> Another concern about the concrete patch:
>
> NotifySegmentsReadyForArchive() searches the shared hash, acquiring an
> LWLock, every time XLogWrite is called while segment archiving is being
> held off.  I don't think it is acceptable and I think it could be a
> problem when many backends are competing on WAL.

This is a fair point.  I did some benchmarking with a few hundred
connections all doing writes, and I was not able to discern any
noticeable performance impact.  My guess is that contention on this
new lock is unlikely because callers of XLogWrite() must already hold
WALWriteLock.  Otherwise, I believe we only acquire ArchNotifyLock no
more than once per segment to record new record boundaries.

Nathan



Re: archive status ".ready" files may be created too early

2021-01-05 Thread Andrey Borodin
Hi!

Thanks for working on this.

> On 18 Dec 2020, at 10:42, Kyotaro Horiguchi  
> wrote:
> 
> I noticed that we can cause the continuation record to be flushed
> immediately. 

I've taken a look at the code and want to share some thoughts.

1. Maybe we could avoid interlacing fields protected by different locks
in XLogCtlData? We could place lastNotifiedSeg somewhere near a field that is
protected by WALWriteLock. I'm not sure it's a useful idea.
2. In XLogInsertRecord() we release XLogCtl->info_lck just to compute a few
bytes, and possibly acquire it back. Maybe just hold the lock a little longer?
3. Couldn't what GetLastNotifiedSegment() does just be an atomic read?
I'm not sure whether that's common practice.

Thanks!

Best regards, Andrey Borodin.



Re: archive status ".ready" files may be created too early

2021-01-02 Thread Andrey Borodin
Hi!

I was looking to review something in CF. This seems like a thread of some 
interest to me.

Recently we had a somewhat related incident. Do I understand correctly that this 
incident is related to the bug discussed in this thread?

Primary instance was killed by OOM
[ 2020-11-12 15:27:03.732 MSK ,,,739,0 ]:LOG:  server process (PID 40189) 
was terminated by signal 9: Killed
after recovery it archived some WAL segments.
[ 2020-11-12 15:27:31.477 MSK ,,,739,0 ]:LOG:  database system is ready to 
accept connections
INFO: 2020/11/12 15:27:32.059541 FILE PATH: 000E0001C02F00AF.br
INFO: 2020/11/12 15:27:32.114319 FILE PATH: 000E0001C02F00B3.br

then PITR failed on another host
[ 2020-11-12 16:26:33.024 MSK ,,,51414,0 ]:LOG:  restored log file 
"000E0001C02F00B3" from archive
[ 2020-11-12 16:26:33.042 MSK ,,,51414,0 ]:LOG:  invalid record length at 
1C02F/B3FFF778: wanted 24, got 0
[ 2020-11-12 16:26:33.042 MSK ,,,51414,0 ]:LOG:  invalid record length at 
1C02F/B3FFF778: wanted 24, got 0

archived segment has some zeroes at the end
rmgr: XLOGlen (rec/tot): 51/  1634, tx:  0, lsn: 
1C02F/B3FFF058, prev 1C02F/B3FFEFE8, desc: FPI_FOR_HINT , blkref #0: rel 
1663/14030/16384 blk 140 FPW
rmgr: Heaplen (rec/tot):129/   129, tx: 3890578935, lsn: 
1C02F/B3FFF6C0, prev 1C02F/B3FFF058, desc: HOT_UPDATE off 34 xmax 3890578935 ; 
new off 35 xmax 0, blkref #0: rel 1663/14030/16384 blk 140
rmgr: Transaction len (rec/tot): 46/46, tx: 3890578935, lsn: 
1C02F/B3FFF748, prev 1C02F/B3FFF6C0, desc: COMMIT 2020-11-12 15:27:31.507363 MSK
pg_waldump: FATAL:  error in WAL record at 1C02F/**B3FFF748**: invalid record 
length at 1C02F/**B3FFF778**: wanted 24, got 0

Meanwhile next segment points to previous record at **B3FFF748**
postgres@man-odszl7u4361o8m3z:/tmp$ pg_waldump 000E0001C02F00B4| head
rmgr: Heaplen (rec/tot):129/   129, tx: 3890578936, lsn: 
1C02F/B4000A68, prev 1C02F/**B3FFF778**, desc: HOT_UPDATE off 35 xmax 
3890578936 ; new off 36 xmax 0, blkref #0: rel 1663/14030/16384 blk 140
rmgr: Transaction len (rec/tot): 46/46, tx: 3890578936, lsn: 
1C02F/B4000AF0, prev 1C02F/B4000A68, desc: COMMIT 2020-11-12 15:27:32.509443 MSK

Best regards, Andrey Borodin.



Re: archive status ".ready" files may be created too early

2020-12-17 Thread Kyotaro Horiguchi
At Wed, 16 Dec 2020 11:01:20 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> - Record the beginning LSN of the first cross-seg record and the end
>   LSN of the last cross-seg record in consecutive segments bonded by
>   cross-seg records. Specifically, X and Y below.
> 
>X Z Y
>[recA]  [recB] [recC]
>   [seg A] [seg B] [seg C] [seg D] [seg E]
> (1)(2.2)(2.2)  (2.1)   (2.1)   (1)
> 
> 1. If we wrote up to before X or beyond Y at a segment boundary, notify
>   the finished segment immediately.
> 
>   1.1. If we have written beyond Y, clear the recorded region.
> 
> 2. Otherwise we don't notify the segment immediately:
> 
>   2.1. If write request was up to exactly the current segment boundary
> and we know the end LSN of the record there (that is, it is recC
> above), extend the request to the end LSN. Then notify the segment
> after the record is written to the end.
> 
>   2.2. Otherwise (that is recA or recB), we don't know whether the
> last record of the last segment ends just at the segment boundary
> (Z) or a record spans between segments (recB). Anyway even if there
> is such a record there, we don't know where it ends.  As the result
> what we can do there is only to refrain from notifying. It doesn't
> matter so much since we have already inserted recC so we will soon
> reach recC and will notify up to seg C.
> 
> There might be a case where we insert up to Y before writing up to Z,
> the segment-region X-Y contains non-connected segment boundary in that
> case. It is processed as if it is a connected segment
> boundary. However, like 2.2 above, It doesn't matter since we write up
> to Y soon.

I noticed that we can cause the continuation record to be flushed
immediately. So in the attached,

1. If there's no remembered cross-segment boundary or we're out of the
  region X-Y, notify the finished segment immediately.

2. Otherwise we don't notify the segment immediately:

 2.1. If we are finishing the last segment known to continue into the
   next segment, extend the write request to the end of the record *and*
   force a write, then flush up to there.

 2.2. (the same as above)

3. In the case of 2.1, we can flush the previous segment immediately,
  so do that.

X. When we notify a segment, clear the remembered region if we have got
  out of the region.
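
Roughly, the decision above could be sketched as follows (an illustration
only, not the attached patch; locking is elided and the field names are the
ones the patch adds to XLogCtlData):

static bool
OkToNotifyFinishedSegment(XLogRecPtr flushEnd)
{
	/* case 1: no remembered boundary, or we are before X / at-or-past Y */
	if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr ||
		flushEnd <= XLogCtl->firstSegContRecStart ||
		flushEnd >= XLogCtl->lastSegContRecEnd)
		return true;

	/*
	 * case 2: a record crossing this boundary may still be only partially
	 * flushed; hold off.  (Case 2.1 additionally extends the write/flush
	 * request to lastSegContRecEnd so the hold-off ends quickly.)
	 */
	return false;
}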


The attached is changed in the following points:

- Fixed some bugs where I confusedly referred to the write LSN instead of the flush LSN.

- Changed to urge flushing up to the end of a continuation record, not
  only waiting for the record to be written.

- More aggressively clear the remembered region.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 95e438ee448c1686c946909f1fc84ec95ee6c7d4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 18 Dec 2020 13:45:29 +0900
Subject: [PATCH v5 1/2] Avoid archiving a WAL segment that continues to the
 next segment

If the last record of a finished segment continues to the next segment,
we need to defer archiving of the segment until the record is flushed
to the end. Otherwise crash recovery can overwrite the last record of
a segment and history diverges between archive and pg_wal.
---
 src/backend/access/transam/xlog.c   | 221 +++-
 src/backend/replication/walsender.c |  14 +-
 src/include/access/xlog.h   |   1 +
 3 files changed, 224 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b1e5d2dbff..b0d3ba2c5a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -733,6 +733,16 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/* The last segment notified to be archived. Protected by WALWriteLock */
+	XLogSegNo	lastNotifiedSeg;
+
+	/*
+	 * Remember the region we need to consider refraining from archiving
+	 * finished segments immediately. Protected by info_lck.
+	 */
+	XLogRecPtr	firstSegContRecStart;
+	XLogRecPtr	lastSegContRecEnd;
+	
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -1170,6 +1180,9 @@ XLogInsertRecord(XLogRecData *rdata,
 	 */
 	if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
 	{
+		XLogSegNo startseg;
+		XLogSegNo endseg;
+
 		SpinLockAcquire(&XLogCtl->info_lck);
 		/* advance global request to include new block(s) */
 		if (XLogCtl->LogwrtRqst.Write < EndPos)
@@ -1177,6 +1190,22 @@ XLogInsertRecord(XLogRecData *rdata,
 		/* update local result copy while I have the chance */
 		LogwrtResult = XLogCtl->LogwrtResult;
 		SpinLockRelease(&XLogCtl->info_lck);
+
+		/*
+		 * Remember the range of segment boundaries that are connected by a
+		 * continuation record.
+		 */
+		XLByteToSeg(StartPos, startseg, wal_segment_size);
+		XLByteToPrevSeg(EndPos, endseg, wal_segment_size);
+
+		if (startseg != endseg)
+		{
+			SpinLockAcquire(&XLogCtl->info_lck);
+			if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr)
+

Re: archive status ".ready" files may be created too early

2020-12-17 Thread Kyotaro Horiguchi
At Thu, 17 Dec 2020 22:20:35 +, "Bossart, Nathan"  
wrote in 
> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi"  wrote:
> > You're right in that regard. There's a window where partial record is
> > written when write location passes F0 after insertion location passes
> > F1. However, remembering all spanning records seems overkilling to me.
> 
> I'm curious why you feel that recording all cross-segment records is
> overkill.  IMO it seems far simpler to just do that rather than try to

Sorry, my words were not enough. Remembering all spanning records in
*shared memory* seems to be overkill.  Even more so if it is stored in a
shared hash table.  Even though it is rarely the case, it can fail in a hard
way when reaching the limit. If we could do well by remembering just
two locations, we wouldn't need to worry about such a limitation.

> reason about all these different scenarios and rely on various
> (and possibly fragile) assumptions.  You only need to record the end

After the previous mail was sent, I noticed that the assumption on
record length was not needed. So that way no longer needs any of the
assumptions ^^;

> location of records that cross into the next segment (or that fit
> perfectly into the end of the current one) and to evaluate which
> segments to mark .ready as the "flushed" LSN advances.  I'd expect
> that in most cases we wouldn't need to store more than a couple of
> record boundaries, so it's not like we'd normally be storing dozens of
> boundaries.  Even if we did need to store several boundaries, AFAICT
> the approach I'm proposing should still work well enough.

I didn't say it doesn't work, just overkill.

Another concern about the concrete patch:

NotifySegmentsReadyForArchive() searches the shared hash, acquiring an
LWLock, every time XLogWrite is called while segment archiving is being
held off.  I don't think that is acceptable, and I think it could be a
problem when many backends are competing on WAL.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: archive status ".ready" files may be created too early

2020-12-17 Thread Bossart, Nathan
On 12/15/20, 2:33 AM, "Kyotaro Horiguchi"  wrote:
> You're right in that regard. There's a window where partial record is
> written when write location passes F0 after insertion location passes
> F1. However, remembering all spanning records seems overkilling to me.

I'm curious why you feel that recording all cross-segment records is
overkill.  IMO it seems far simpler to just do that rather than try to
reason about all these different scenarios and rely on various
(and possibly fragile) assumptions.  You only need to record the end
location of records that cross into the next segment (or that fit
perfectly into the end of the current one) and to evaluate which
segments to mark .ready as the "flushed" LSN advances.  I'd expect
that in most cases we wouldn't need to store more than a couple of
record boundaries, so it's not like we'd normally be storing dozens of
boundaries.  Even if we did need to store several boundaries, AFAICT
the approach I'm proposing should still work well enough.

Nathan



Re: archive status ".ready" files may be created too early

2020-12-15 Thread Kyotaro Horiguchi
At Tue, 15 Dec 2020 19:32:57 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Mon, 14 Dec 2020 18:25:23 +, "Bossart, Nathan"  
> wrote in 
> > I wonder if these are safe assumptions to make.  For your example, if
> > we've written record B to the WAL buffers, but neither record A nor B
> > have been written to disk or flushed, aren't we still in trouble?
> 
> You're right in that regard. There's a window where partial record is
> written when write location passes F0 after insertion location passes
> F1. However, remembering all spanning records seems overkilling to me.
> 
> I modified the previous patch so that it remembers the start LSN of the
> *oldest* cross-segment continuation record in the last consecutive
> bonded segments, and the end LSN of the latest cross-segment
> continuation record. This doesn't forget past segment boundaries.
> The region is cleared when the WAL-write LSN goes beyond the remembered end
> LSN.  So the region may contain several wal-segments that are not
> connected to the next one, but that doesn't matter so much.

Mmm. Even though it's a PoC, it was too bogus. I fixed it to work in a saner
way.

- Record the beginning LSN of the first cross-seg record and the end
  LSN of the last cross-seg record in consecutive segments bonded by
  cross-seg records. Specifically, X and Y below.

   X Z Y
   [recA]  [recB] [recC]
  [seg A] [seg B] [seg C] [seg D] [seg E]
(1)(2.2)(2.2)  (2.1)   (2.1)   (1)

1. If we wrote up to before X or beyond Y at a segment boundary, notify
  the finished segment immediately.

  1.1. If we have written beyond Y, clear the recorded region.

2. Otherwise we don't notify the segment immediately:

  2.1. If write request was up to exactly the current segment boundary
and we know the end LSN of the record there (that is, it is recC
above), extend the request to the end LSN. Then notify the segment
after the record is written to the end.

  2.2. Otherwise (that is, recA or recB), we don't know whether the
last record of the last segment ends just at the segment boundary
(Z) or a record spans between segments (recB). Anyway, even if there
is such a record there, we don't know where it ends.  As a result,
all we can do there is refrain from notifying. It doesn't
matter so much since we have already inserted recC, so we will soon
reach recC and will notify up to seg C.

There might be a case where we insert up to Y before writing up to Z;
in that case the segment region X-Y contains a non-connected segment
boundary. It is processed as if it were a connected segment
boundary. However, like 2.2 above, it doesn't matter since we write up
to Y soon.

At Tue, 15 Dec 2020 19:32:57 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
me> I added an assertion that a record must be shorter than a wal segment
me> to XLogRecordAssemble(). This guarantees the assumption to be true?
me> (The condition is tentative, would need to be adjusted.)

Changed the assertion to a more direct way.

me> Also, the attached is a PoC.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 50b3b05dd0eed79cd0b97991e82090b9d569cbac Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 16 Dec 2020 10:36:14 +0900
Subject: [PATCH v4 1/2] Avoid archiving a WAL segment that continues to the
 next  segment

If the last record of a finished segment continues to the next segment,
we need to defer archiving of the segment until the record is flushed
to the end. Otherwise crash recovery can overwrite the last record of
a segment and history diverges between archive and pg_wal.
---
 src/backend/access/transam/xlog.c   | 214 +++-
 src/backend/replication/walsender.c |  14 +-
 src/include/access/xlog.h   |   1 +
 3 files changed, 217 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8dd225c2e1..8705809160 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -723,6 +723,16 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/* The last segment notified to be archived. Protected by WALWriteLock */
+	XLogSegNo	lastNotifiedSeg;
+
+	/*
+	 * Remember the oldest and newest known segment that ends with a
+	 * continuation record.
+	 */
+	XLogRecPtr	firstSegContRecStart;
+	XLogRecPtr	lastSegContRecEnd;
+	
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -1160,6 +1170,9 @@ XLogInsertRecord(XLogRecData *rdata,
 	 */
 	if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
 	{
+		XLogSegNo startseg;
+		XLogSegNo endseg;
+
 		SpinLockAcquire(&XLogCtl->info_lck);
 		/* advance global request to include new block(s) */
 		if (XLogCtl->LogwrtRqst.Write < EndPos)
@@ -1167,6 +1180,24 @@ XLogInsertRecord(XLogRecData *rdata,
 		/* update local result copy while I have the chance */
 		LogwrtResult = XLogCtl->LogwrtResult;
 		SpinLockRelease(&XLogCtl->info_lck);

Re: archive status ".ready" files may be created too early

2020-12-15 Thread Kyotaro Horiguchi
At Mon, 14 Dec 2020 18:25:23 +, "Bossart, Nathan"  
wrote in 
> Apologies for the long delay.
> 
> I've spent a good amount of time thinking about this bug and trying
> out a few different approaches for fixing it.  I've attached a work-
> in-progress patch for my latest attempt.
> 
> On 10/13/20, 5:07 PM, "Kyotaro Horiguchi"  wrote:
> >   F0F1
> > A  F  B
> > |-|-|-|
> >seg Xseg X+1   seg X+2
> >
> > Matsumura-san has a concern about the case where there are two (or
> > more) partially-flushed segment-spanning records at the same time.
> >
> > This patch remembers only the last cross-segment record. If we were
> > going to flush up to F0 after Record-B had been written, we would fail
> > to hold off archiving seg-X. This patch is based on an assumption that
> > that case cannot happen because we don't leave a pending page at the
> > time of segment switch and no record spans over three or more
> > segments.
> 
> I wonder if these are safe assumptions to make.  For your example, if
> we've written record B to the WAL buffers, but neither record A nor B
> have been written to disk or flushed, aren't we still in trouble?

You're right in that regard. There's a window where a partial record is
written when the write location passes F0 after the insertion location passes
F1. However, remembering all spanning records seems like overkill to me.

I modified the previous patch so that it remembers the start LSN of the
*oldest* cross-segment continuation record in the last consecutive
bonded segments, and the end LSN of the latest cross-segment
continuation record. This doesn't forget past segment boundaries.
The region is cleared when the WAL-write LSN goes beyond the remembered end
LSN.  So the region may contain several WAL segments that are not
connected to the next one, but that doesn't matter so much.


> Also, is there actually any limit on WAL record length that means that
> it is impossible for a record to span over three or more segments?

Even though it is not a hard limit, AFAICS, as mentioned before, the
longest possible record is what log_newpages() emits; that is up to
about 500 kB for now. I think we don't want to make the length any
longer. If we set wal_segment_size to 1MB and set the block size
to 16kB or more, we would have a record spanning over three or more
segments, but I don't think that is a sane configuration, and that kind
of issue could happen elsewhere.

> Perhaps these assumptions are true, but it doesn't seem obvious to me
> that they are, and they might be pretty fragile.

I added an assertion to XLogRecordAssemble() that a record must be shorter
than a WAL segment. Does this guarantee the assumption to be true?
(The condition is tentative and would need to be adjusted.)
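
For reference, the sketch of such an assertion is tiny (the exact bound is
tentative, as said; it assumes total_len is already computed at that point):

	/* at the end of XLogRecordAssemble(), once total_len is known */
	Assert(total_len < wal_segment_size);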

> The attached patch doesn't make use of these assumptions.  Instead, we
> track the positions of the records that cross segment boundaries in a
> small hash map, and we use that to determine when it is safe to mark a
> segment as ready for archival.  I think this approach resembles
> Matsumura-san's patch from June.
> 
> As before, I'm not handling replication, archive_timeout, and
> persisting latest-marked-ready through crashes yet.  For persisting
> the latest-marked-ready segment through crashes, I was thinking of
> using a new file that stores the segment number.


Also, the attached is a PoC.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 5714bd064d61135c41a64ecc39aeff74c25a0e74 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 15 Dec 2020 16:24:13 +0900
Subject: [PATCH v3] PoC: Avoid archiving a WAL segment that continues to the
 next segment

If the last record of a finished segment continues to the next segment,
we need to defer archiving of the segment until the record is flushed
to the end. Otherwise crash recovery can overwrite the last record of
a segment and history diverges between archive and pg_wal.
---
 src/backend/access/transam/xlog.c   | 187 +++-
 src/backend/access/transam/xloginsert.c |   3 +
 src/backend/replication/walsender.c |  14 +-
 src/include/access/xlog.h   |   1 +
 4 files changed, 193 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8dd225c2e1..98da521601 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -723,6 +723,16 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/* The last segment notified to be archived. Protected by WALWriteLock */
+	XLogSegNo	lastNotifiedSeg;
+
+	/*
+	 * Remember the oldest and newest known segment that ends with a
+	 * continuation record.
+	 */
+	XLogRecPtr	firstSegContRecStart;
+	XLogRecPtr	lastSegContRecEnd;
+	
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -1160,6 +1170,9 @@ XLogInsertRecord(XLogRecData *rdata,
 	 */
 	if (StartPos / XLOG_BLCKSZ != EndPos / 

Re: archive status ".ready" files may be created too early

2020-11-29 Thread Anastasia Lubennikova

On 14.10.2020 03:06, Kyotaro Horiguchi wrote:

The patch includes a fix for primary->standby case. But I'm not sure
we can do that in the cascaded case. A standby is not aware of the
structure of a WAL blob and has no idea of up-to-where to send the
received blobs. However, if we can rely on the behavior of CopyData
that we always receive a blob as a whole sent from the sender at once,
the cascaded standbys are free from the trouble (as far as the
cascaded-standby doesn't crash just before writing the last-half of a
record into pg_wal and after archiving the last full-segment, which
seems unlikely.).

regards.



Status update for a commitfest entry.

This entry was "Waiting on author" during this CF. As I see it, the latest 
message contains a new version of the patch.

Does it need more work? Are you going to continue working on it?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: archive status ".ready" files may be created too early

2020-10-13 Thread Kyotaro Horiguchi
Thanks for visiting this thread.

At Mon, 12 Oct 2020 15:04:40 +0300, Heikki Linnakangas  wrote 
in 
> On 07/07/2020 12:02, matsumura@fujitsu.com wrote:
> > At Monday, July 6, 2020 05:13:40 +, "Kyotaro Horiguchi
> > " wrote in
>  after the WAL buffer is filled up to the requested position. So when it
>  crosses a segment boundary we know that all past cross-segment-boundary
>  records are stable. That means all we need to remember is only the
>  position of the latest cross-boundary record.
> >>>
> >>> I could not agree. In the following case, it may not work well.
> >>> - record-A and record-B (record-B is the newer one) are copied, and
> >>> - lastSegContRecStart/End points to record-B's, and
> >>> - FlushPtr has proceeded to the middle of record-A.
> >>
> >> IIUC, that means record-B is a cross-segment-border record and we have
> >> flushed beyond record-B. In that case crash recovery afterwards
> >> can read the complete record-B and will finish recovery *after*
> >> record-B. That's what we need here.
> > I'm sorry I didn't explain enough.
> > Record-A and Record-B are cross segment-border records.
> > Record-A spans segment X and X+1
> > Record-B spans segment X+2 and X+3.
> > If both records have been inserted into the WAL buffer,
> > lastSegContRecStart/End points to Record-B.
> > If a writer flushes up to the middle of segment-X+1,
> > NotifyStableSegments() allows the writer to notify segment-X.
> > Is my understanding correct?
> 
> I think this little ASCII drawing illustrates the above scenario:
> 
> A  F  B
> |-|-|-|
>seg Xseg X+1   seg X+2
> 
> A and B are Record-A and Record-B. F is the current flush
> pointer.

I modified the figure a bit for the explanation below.

  F0F1
A  F  B
|-|-|-|
   seg Xseg X+1   seg X+2

Matsumura-san has a concern about the case where there are two (or
more) partially-flushed segment-spanning records at the same time.

This patch remembers only the last cross-segment record. If we were
going to flush up to F0 after Record-B had been written, we would fail
to hold off archiving seg-X. This patch is based on the assumption that
that case cannot happen, because we don't leave a pending page at the
time of a segment switch and no record spans three or more
segments.

> In this case, it would be OK to notify segment X, as long as F is
> greater than the end of record A. And if I'm reading Kyotaro's patch
> correctly, that's what would happen with the patch.
> 
> The patch seems correct to me. I'm a bit sad that we have to track yet
> another WAL position (two, actually) to fix this, but I don't see a
> better way.

Does "the two" mean Record-A and B? Is it needed even with the
assumption above?

> I wonder if we should arrange things so that XLogwrtResult.Flush never
> points in the middle of a record? I'm not totally convinced that all

That happens at a good percentage of page boundaries. And a record can
span three or more pages. Do we need to avoid all such cases?

I did that only for the cross-segment case.

> the current callers of GetFlushRecPtr() are OK with a middle-of-WAL
> record value. Could we get into similar trouble if a standby
> replicates half of a cross-segment record to a cascaded standby, and
> the cascaded standby has WAL archiving enabled?

The patch includes a fix for the primary->standby case. But I'm not sure
we can do the same in the cascaded case. A standby is not aware of the
structure of a WAL blob and has no idea up to where it may send the
received blobs. However, if we can rely on the behavior of CopyData,
namely that we always receive a blob as a whole, exactly as the sender sent it,
the cascaded standbys are free from the trouble (as long as the
cascaded standby doesn't crash just before writing the last half of a
record into pg_wal and after archiving the last full segment, which
seems unlikely).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From e8320f7486c057ccb97be56ce1859296b8072256 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 13 Oct 2020 20:41:33 +0900
Subject: [PATCH v2] Avoid archiving a WAL segment that continues to the next
 segment

If the last record of a finished segment continues to the next segment,
we need to defer archiving of the segment until the record is flushed
to the end. Otherwise crash recovery can overwrite the last record of
a segment and history diverges between archive and pg_wal.
---
 src/backend/access/transam/xlog.c   | 160 +++-
 src/backend/replication/walsender.c |  14 +--
 src/include/access/xlog.h   |   1 +
 3 files changed, 163 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 52a67b1170..8fd0fdb598 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -723,6 +723,16 @@ typedef struct XLogCtlData

Re: archive status ".ready" files may be created too early

2020-10-12 Thread Heikki Linnakangas

On 07/07/2020 12:02, matsumura@fujitsu.com wrote:

At Monday, July 6, 2020 05:13:40 +,  "Kyotaro Horiguchi 
" wrote in

after the WAL buffer is filled up to the requested position. So when it
crosses a segment boundary we know that all past cross-segment-boundary
records are stable. That means all we need to remember is only the
position of the latest cross-boundary record.


I could not agree. In the following case, it may not work well.
- record-A and record-B (record-B is the newer one) are copied, and
- lastSegContRecStart/End points to record-B's, and
- FlushPtr has proceeded to the middle of record-A.


IIUC, that means record-B is a cross-segment-border record and we have
flushed beyond record-B. In that case crash recovery afterwards
can read the complete record-B and will finish recovery *after*
record-B. That's what we need here.


I'm sorry I didn't explain enough.

Record-A and Record-B are cross segment-border records.
Record-A spans segment X and X+1
Record-B spans segment X+2 and X+3.
If both records have been inserted into the WAL buffer, lastSegContRecStart/End
points to Record-B.
If a writer flushes up to the middle of segment-X+1, NotifyStableSegments()
allows the writer to notify segment-X.
Is my understanding correct?


I think this little ASCII drawing illustrates the above scenario:

A  F  B
|-|-|-|
   seg Xseg X+1   seg X+2

A and B are Record-A and Record-B. F is the current flush pointer.

In this case, it would be OK to notify segment X, as long as F is 
greater than the end of record A. And if I'm reading Kyotaro's patch 
correctly, that's what would happen with the patch.


The patch seems correct to me. I'm a bit sad that we have to track yet 
another WAL position (two, actually) to fix this, but I don't see a 
better way.


I wonder if we should arrange things so that XLogwrtResult.Flush never 
points in the middle of a record? I'm not totally convinced that all the 
current callers of GetFlushRecPtr() are OK with a middle-of-WAL record 
value. Could we get into similar trouble if a standby replicates half of 
a cross-segment record to a cascaded standby, and the cascaded standby 
has WAL archiving enabled?


- Heikki




Re: archive status ".ready" files may be created too early

2020-09-06 Thread Michael Paquier
On Wed, Jul 22, 2020 at 02:53:49AM +, matsumura@fujitsu.com wrote:
> Then, Record-A may be invalidated by crash recovery and overwritten by a new
> WAL record.
> Segment-X is then not the same as the archived one.

Please note that the latest patch fails to apply per the CF bot, so a
rebase would be in order to have at least some automated tests for the
last patch.
--
Michael




RE: archive status ".ready" files may be created too early

2020-07-21 Thread matsumura....@fujitsu.com
Hello,

> At Mon, 13 Jul 2020 01:57:36 +, "Kyotaro Horiguchi 
> " wrote in
> Am I missing something here?

Let me explain in more detail (*).

  Record-A and Record-B are cross segment-border records.
  Record-A spans segment X and X+1.
  Record-B spans segment X+2 and X+3.
  If both records have been inserted into the WAL buffer, lastSegContRecStart/End
points to Record-B.
* If a writer flushes segment X and a part of X+1 but record-A is not flushed
completely,
  NotifyStableSegments() allows the writer to notify segment-X.

Then, Record-A may be invalidated by crash recovery and overwritten by a new WAL
record.
Segment-X is then not the same as the archived one.

Regards
Ryo Matsumura




Re: archive status ".ready" files may be created too early

2020-07-12 Thread Kyotaro Horiguchi
Hello.

# Sorry, I wrongly thought that I had already replied to this thread.

At Tue, 7 Jul 2020 09:02:56 +, "matsumura@fujitsu.com" 
 wrote in 
> At Monday, July 6, 2020 05:13:40 +,  "Kyotaro Horiguchi 
> " wrote in
> > > > after the WAL buffer is filled up to the requested position. So when it
> > > > crosses a segment boundary we know that all past cross-segment-boundary
> > > > records are stable. That means all we need to remember is only the
> > > > position of the latest cross-boundary record.
> > > 
> > > I could not agree. In the following case, it may not work well.
> > > - record-A and record-B (record-B is the newer one) are copied, and
> > > - lastSegContRecStart/End points to record-B's, and
> > > - FlushPtr has proceeded to the middle of record-A.
> >
> > IIUC, that means record-B is a cross-segment-border record and we have
> > flushed beyond record-B. In that case crash recovery afterwards
> > can read the complete record-B and will finish recovery *after*
> > record-B. That's what we need here.
> 
> I'm sorry I didn't explain enough.
> 
> Record-A and Record-B are cross segment-border records.
> Record-A spans segment X and X+1
> Record-B spans segment X+2 and X+3.

Ok.


> If both records have been inserted into the WAL buffer, lastSegContRecStart/End
> points to Record-B.
> If a writer flushes up to the middle of segment-X+1, NotifyStableSegments()
> allows the writer to notify segment-X.
> Is my understanding correct?

I think that cannot happen, since segment X must have been
flushed by the time Record-A is completely flushed out. When we write
to the next segment, we have already flushed and closed the whole previous
segment. If that were not the case, we would archive segment files that are not
fully flushed, and would get broken archive files.

Am I missing something here?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: archive status ".ready" files may be created too early

2020-07-07 Thread matsumura....@fujitsu.com
Hello, Horiguchi-san

At Monday, July 6, 2020 05:13:40 +,  "Kyotaro Horiguchi 
" wrote in
> > > after the WAL buffer is filled up to the requested position. So when it
> > > crosses a segment boundary we know that all past cross-segment-boundary
> > > records are stable. That means all we need to remember is only the
> > > position of the latest cross-boundary record.
> > 
> > I could not agree. In the following case, it may not work well.
> > - record-A and record-B (record-B is the newer one) are copied, and
> > - lastSegContRecStart/End points to record-B's, and
> > - FlushPtr has proceeded to the middle of record-A.
>
> IIUC, that means record-B is a cross-segment-border record and we have
> flushed beyond record-B. In that case crash recovery afterwards
> can read the complete record-B and will finish recovery *after*
> record-B. That's what we need here.

I'm sorry I didn't explain enough.

Record-A and Record-B are cross segment-border records.
Record-A spans segment X and X+1
Record-B spans segment X+2 and X+3.
If both records have been inserted into the WAL buffer, lastSegContRecStart/End
points to Record-B.
If a writer flushes up to the middle of segment-X+1, NotifyStableSegments()
allows the writer to notify segment-X.
Is my understanding correct?

Regards
Ryo Matsumura




Re: archive status ".ready" files may be created too early

2020-07-05 Thread Kyotaro Horiguchi
Hello, Matsumura-san.

At Mon, 6 Jul 2020 04:02:23 +, "matsumura@fujitsu.com" 
 wrote in 
> Hello, Horiguchi-san
> 
> Thank you for your comment and patch.
> 
> At Thursday, June 25, 2020 3:36 PM(JST),  "Kyotaro Horiguchi 
> " wrote in
> > I think we don't need most of that shmem stuff.  XLogWrite is called
> 
> I wanted no more shmem stuff too, but other ideas need more locking
> that excludes inserter and writer from each other.
> 
> > after the WAL buffer is filled up to the requested position. So when it
> > crosses a segment boundary we know that all past cross-segment-boundary
> > records are stable. That means all we need to remember is only the
> > position of the latest cross-boundary record.
> 
> I could not agree. In the following case, it may not work well.
> - record-A and record-B (record-B is the newer one) are copied, and
> - lastSegContRecStart/End points to record-B's, and
> - FlushPtr has proceeded to the middle of record-A.

IIUC, that means record-B is a cross-segment-border record and we have
flushed beyond record-B. In that case crash recovery afterwards
can read the complete record-B and will finish recovery *after*
record-B. That's what we need here.

> In the above case, the writer should notify segments before record-A,
> but it notifies ones before record-B. If the writer notifies

If you mean that NotifyStableSegments notifies up to the segment previous
to the one where record-A is placed, that's wrong. The
issue here is that crash recovery sees an incomplete record at a
segment border. So it is sufficient that crash recovery can read the
last record by looking at pg_wal.

> only when it flushes the latest record completely, it works well.

It checks that "lastSegContRecEnd < LogwrtResult.Flush", which means
the last record (B) is completely flushed out, doesn't it? So it works
well.
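Spelled out, the hold-off rule is roughly the following sketch. The field names
are the ones from the patch; the function shape itself is only an illustration,
not the patch code:

    /*
     * Illustrative sketch: may segment 'seg' be notified (.ready created)
     * after WAL has been flushed up to 'flushed'?
     */
    static bool
    SegmentCanBeNotified(XLogSegNo seg, XLogRecPtr flushed)
    {
        XLogRecPtr  contStart;
        XLogRecPtr  contEnd;
        XLogSegNo   contStartSeg;

        SpinLockAcquire(&XLogCtl->info_lck);
        contStart = XLogCtl->lastSegContRecStart;
        contEnd = XLogCtl->lastSegContRecEnd;
        SpinLockRelease(&XLogCtl->info_lck);

        XLByteToSeg(contStart, contStartSeg, wal_segment_size);

        /*
         * If the latest known cross-segment record starts in or before this
         * segment and is not yet completely flushed, hold the segment back.
         */
        if (contStartSeg <= seg && flushed < contEnd)
            return false;

        return true;
    }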

> But the writer may not be able to notify any segment forever when
> WAL records crossing segment boundaries are inserted continuously.

No. As I mentioned in the previous mail, if we see a
cross-segment-boundary record, the previous cross-segment-boundary
record has been flushed completely, and the segment containing the
first half of the previous cross-segment-boundary record has already
been flushed.  I didn't do that, but we can put an assertion in
XLogInsertRecord like this:

 +  /* Remember the range of the record if it spans over segments */
 +  XLByteToSeg(StartPos, startseg, wal_segment_size);
 +  XLByteToPrevSeg(EndPos, endseg, wal_segment_size);
 +
 +  if (startseg != endseg)
 +  {
++      /* we shouldn't have a record spanning over three or more segments */
++      Assert(endseg == startseg + 1);
 +      SpinLockAcquire(&XLogCtl->info_lck);
 +      if (XLogCtl->lastSegContRecEnd < StartPos)
 +      {
 +          XLogCtl->lastSegContRecStart = StartPos;
 +          XLogCtl->lastSegContRecEnd = EndPos;

> So I think that we must remember all such cross-segment-boundary records'
> EndRecPtr in a buffer.
> 
> 
> > If we call XLogMarkEndRecPtrIfNeeded() there, the function is called
> > every time a record is written, and most of those calls are wasteful.
> > XLogInsertRecord already has a code block executed only at every page
> > boundary.
> 
> I agree.
> XLogMarkEndRecPtrIfNeeded() is moved into the code block before updating
> LogwrtRqst.Write, to avoid racing with the writer.
> 
> 
> > Now we can identify stable portion of WAL stream. It's enough to
> > prevent walsender from sending data that can be overwritten
> > afterwards. GetReplicationTargetRecPtr() in the attached does that.
> 
> I didn't notice it.
> I agree basically, but it is based on lastSegContRecStart/End.
> 
> So, first of all, we have to agree on what should be remembered.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: archive status ".ready" files may be created too early

2020-07-05 Thread matsumura....@fujitsu.com
Hello, Horiguchi-san

Thank you for your comment and patch.

At Thursday, June 25, 2020 3:36 PM(JST),  "Kyotaro Horiguchi 
" wrote in
> I think we don't need most of that shmem stuff.  XLogWrite is called

I wanted no more shmem stuff too, but other ideas need more locking
that excludes inserter and writer from each other.

> after the WAL buffer is filled up to the requested position. So when it
> crosses a segment boundary we know that all past cross-segment-boundary
> records are stable. That means all we need to remember is only the
> position of the latest cross-boundary record.

I could not agree. In the following case, it may not work well.
- record-A and record-B (record-B is the newer one) are copied, and
- lastSegContRecStart/End points to record-B's, and
- FlushPtr has proceeded to the middle of record-A.

In the above case, the writer should notify segments before record-A,
but it notifies ones before record-B. If the writer notifies
only when it flushes the latest record completely, it works well.
But the writer may not be able to notify any segment forever when
WAL records crossing segment boundaries are inserted continuously.

So I think that we must remember all such cross-segment-boundary records'
EndRecPtr in a buffer.


> If we call XLogMarkEndRecPtrIfNeeded() there, the function is called
> every time a record is written, and most of those calls are wasteful.
> XLogInsertRecord already has a code block executed only at every page
> boundary.

I agree.
XLogMarkEndRecPtrIfNeeded() is moved into the code block before updating
LogwrtRqst.Write, to avoid racing with the writer.


> Now we can identify stable portion of WAL stream. It's enough to
> prevent walsender from sending data that can be overwritten
> afterwards. GetReplicationTargetRecPtr() in the attached does that.

I didn't notice it.
I agree basically, but it is based on lastSegContRecStart/End.

So, first of all, we have to agree on what should be remembered.


Regards
Ryo Matsumura




Re: archive status ".ready" files may be created too early

2020-06-25 Thread Kyotaro Horiguchi
Hello, Matsumura-san.

I agree that the WAL writer is not the place to notify segments, and the
direction you suggested would work.

At Fri, 19 Jun 2020 10:18:34 +, "matsumura@fujitsu.com" 
 wrote in 
> 1. Description in primary side
> 
> [Basic problem]
>   A process flushing a WAL record doesn't know whether the flushed RecPtr is
>   the EndRecPtr of a cross-segment-boundary WAL record or not, because only the
>   process inserting the WAL record knows it, and it never memorizes the
>   information anywhere.
> 
> [Basic concept of the patch in primary]
>   A process inserting a cross-segment-boundary WAL record memorizes its
>   EndRecPtr (I call it the CrossBoundaryEndRecPtr) in a new structure in XLogCtl.
>   A flushing process creates .ready (later, I call it just 'notify') for
>   the segment previous to the one including a CrossBoundaryEndRecPtr, only when
>   its flushed RecPtr is equal to or greater than that CrossBoundaryEndRecPtr.
...
>   See also the definition of XLogCtl, XLOGShmemSize(), and XLOGShmemInit() in 
> my patch.

I think we don't need most of that shmem stuff.  XLogWrite is called
after the WAL buffer is filled up to the requested position. So when it
crosses a segment boundary we know that all past cross-segment-boundary
records are stable. That means all we need to remember is only the
position of the latest cross-boundary record.
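In other words, two shared values would be enough for this scheme; something
like the following, shown only as an illustration (the names are the ones that
appear later in the thread):

    /* Illustrative: the only extra shared state this approach needs. */
    XLogRecPtr  lastSegContRecStart;  /* start of the latest record that
                                       * crosses a segment boundary */
    XLogRecPtr  lastSegContRecEnd;    /* its end; the segment holding its
                                       * start must not be notified until
                                       * the flush pointer passes this */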

> * Action of inserting process
>   An inserting process memorizes its CrossBoundaryEndRecPtr in the
>   CrossBoundaryEndRecPtr array element calculated by XLogRecPtrToBufIdx() from
>   that CrossBoundaryEndRecPtr.
>   If the WAL record crosses many segments, only the element for the last segment,
>   the one including the EndRecPtr, is set; the others are not.
>   See also CopyXLogRecordToWAL() in my patch.

If we call XLogMarkEndRecPtrIfNeeded() there, the function is called
every time a record is written, and most of those calls are wasteful.
XLogInsertRecord already has a code block executed only at every page
boundary.

> * Action of flushing process
>   The overview has already been written as the following.
> A flushing process creates .ready (later, I call it just 'notify') for
> the segment previous to the one including a CrossBoundaryEndRecPtr, only when
> its flushed RecPtr is equal to or greater than that CrossBoundaryEndRecPtr.
> 
>   An additional detail is as the following.  The flushing process may notify
>   many segments if the record crosses many segments, so the process memorizes
>   latest notified segment number to latestArchiveNotifiedSegNo in XLogCtl.
>   The process notifies from latestArchiveNotifiedSegNo + 1 to
>   flushing segment number - 1.
>
>   And latestArchiveNotifiedSegNo is set to EndOfLog after Startup process 
> exits
>   the replay loop.  A standby also sets it at the same timing (= before promoting).
> 
>   Mutual exclusion for latestArchiveNotifiedSegNo is not required because
>   the timing of accessing has been already included in WALWriteLock critical 
> section.

Looks reasonable.

> 2. Description in standby side
> 
> * Who notifies?
>   walreceiver also doesn't know whether the flushed RecPtr is EndRecPtr of
>   cross-segment-boundary WAL record or not.  In standby, only Startup process
>   knows the information because it is hidden in WAL record itself and only
>   Startup process reads and builds WAL record.

A standby doesn't write its own WAL records.  Even if the primary sent an
immature record on a segment boundary, the standby would just promote to a new
TLI and start its own history. Nothing breaks.  However, it could be a
problem if a standby that crashed in the problematic way were started
as-is as a primary; such a scenario is out of our scope.

Now we can identify the stable portion of the WAL stream. That is enough to
prevent walsender from sending data that can be overwritten
afterwards. GetReplicationTargetRecPtr() in the attached patch does that.
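Roughly, the idea is the following; the function name is from the attached
patch, but the body here is only a sketch of the intent, not the actual code:

    /*
     * Illustrative sketch: never let walsender read beyond the start of a
     * cross-segment record that has not been completely flushed, since the
     * flushed-but-incomplete part could be overwritten after a crash.
     */
    static XLogRecPtr
    GetReplicationTargetRecPtr(void)
    {
        XLogRecPtr  flushed = GetFlushRecPtr();
        XLogRecPtr  contStart;
        XLogRecPtr  contEnd;

        SpinLockAcquire(&XLogCtl->info_lck);
        contStart = XLogCtl->lastSegContRecStart;
        contEnd = XLogCtl->lastSegContRecEnd;
        SpinLockRelease(&XLogCtl->info_lck);

        /* stop before the partially flushed cross-segment record, if any */
        if (contStart < flushed && flushed < contEnd)
            return contStart;

        return flushed;
    }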

> * Action of Startup process
>   Therefore, I implemented it so that walreceiver never notifies and the Startup
>   process does it.
>   In detail, when the Startup process reads one full-length WAL record, it notifies
>   from the segment that includes the head (ReadRecPtr) of the record up to the
>   segment just before the one that includes the EndRecPtr of the record.

I don't like that archiving on a standby relies on replay progress.  We
should avoid that, and fortunately I think we don't need it.

>   Now, we must pay attention to timeline switches.
>   The last segment of previous TimeLineID must be notified before switching.
>   This case is considered when RM_XLOG_ID is detected.

That segment is archived later, after being renamed to ".partial". We don't
archive the last incomplete segment of the previous timeline as-is.

> 3. About other notifying for segment
> Two kinds of segment notification remain.  They do not need to be fixed.
> 
> (1) Notifying for partial segment
> It is not needed to be completed, so it's OK to notify without special 
> consideration.
> 
> (2) Re-notifying
> Currently, Checkpointer has notified through XLogArchiveCheckDone().
> It is a 

RE: archive status ".ready" files may be created too early

2020-06-19 Thread matsumura....@fujitsu.com
> On 5/28/20, 11:42 PM, "matsumura@fujitsu.com" 
> wrote:
> > I'm preparing a patch in which a backend inserting a segment-cross-boundary
> > WAL record leaves its EndRecPtr, and whoever flushes it checks
> > the EndRecPtr and notifies.


I'm sorry for my slow work.

I attach a patch.
I also attach a simple target test for primary.


1. Description in primary side

[Basic problem]
  A process flushing a WAL record doesn't know whether the flushed RecPtr is
  the EndRecPtr of a cross-segment-boundary WAL record or not, because only the
  process inserting the WAL record knows it, and it never memorizes the
  information anywhere.

[Basic concept of the patch in primary]
  A process inserting a cross-segment-boundary WAL record memorizes its
  EndRecPtr (I call it the CrossBoundaryEndRecPtr) in a new structure in XLogCtl.
  A flushing process creates .ready (later, I call it just 'notify') for
  the segment previous to the one including a CrossBoundaryEndRecPtr, only when
  its flushed RecPtr is equal to or greater than that CrossBoundaryEndRecPtr.

[Detail of implementation in primary]
* Structure of CrossBoundaryEndRecPtrs
  The requirements for the structure are as follows:
  - The system must memorize multiple CrossBoundaryEndRecPtrs.
  - A flushing process must be able to decide quickly whether to notify, using
    only the flushed RecPtr.

  Therefore, I implemented the structure as an array (I call it the
  CrossBoundaryEndRecPtr array) of the same length as the xlblocks array.
  Strictly, a length of 'xbuffers/wal_segment_size' would be enough, but I chose
  'xbuffers' for simplicity, which lets the flushing process use
  XLogRecPtrToBufIdx().
  See also the definition of XLogCtl, XLOGShmemSize(), and XLOGShmemInit() in 
my patch.
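As an illustration of the shape described above (illustrative only; the
attached patch is authoritative):

    /* Illustrative sketch of the additions to XLogCtlData described above. */

    /*
     * One slot per WAL buffer page, indexed like xlblocks via
     * XLogRecPtrToBufIdx().  A slot holds the EndRecPtr of a
     * cross-segment-boundary record whose end lies on that page, or
     * InvalidXLogRecPtr if there is none.
     */
    XLogRecPtr *CrossBoundaryEndRecPtrs;

    /* Latest segment already notified; see below for locking. */
    XLogSegNo   latestArchiveNotifiedSegNo;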

* Action of inserting process
  An inserting process memorizes its CrossBoundaryEndRecPtr in the
  CrossBoundaryEndRecPtr array element calculated by XLogRecPtrToBufIdx() from
  that CrossBoundaryEndRecPtr.
  If the WAL record crosses many segments, only the element for the last segment,
  the one including the EndRecPtr, is set; the others are not.
  See also CopyXLogRecordToWAL() in my patch.
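A rough sketch of this inserter-side action (illustrative only, not the
attached patch):

    XLogSegNo   startSegNo;
    XLogSegNo   endSegNo;

    XLByteToSeg(StartPos, startSegNo, wal_segment_size);
    XLByteToPrevSeg(EndPos, endSegNo, wal_segment_size);

    if (startSegNo != endSegNo)
    {
        /*
         * The record crosses at least one segment boundary: remember its
         * end position in the slot of the WAL buffer page holding that end.
         */
        XLogCtl->CrossBoundaryEndRecPtrs[XLogRecPtrToBufIdx(EndPos)] = EndPos;
    }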

* Action of flushing process
  The overview has already been written as the following.
A flushing process creates .ready (later, I call it just 'notify') for
the segment previous to the one including a CrossBoundaryEndRecPtr, only when
its flushed RecPtr is equal to or greater than that CrossBoundaryEndRecPtr.

  An additional detail is as follows.  The flushing process may notify
  many segments if the record crosses many segments, so the process memorizes
  the latest notified segment number in latestArchiveNotifiedSegNo in XLogCtl.
  The process notifies from latestArchiveNotifiedSegNo + 1 to
  the flushing segment number - 1.

  latestArchiveNotifiedSegNo is set to EndOfLog after the Startup process exits
  the replay loop.  A standby also sets it at the same timing (= before promoting).

  Mutual exclusion for latestArchiveNotifiedSegNo is not required because
  all accesses to it already happen inside the WALWriteLock critical
section.
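Put together, the flusher-side rule might look like the sketch below;
CrossBoundaryEndRecPtrForSeg() is a hypothetical helper standing in for the
array lookup, and the function as a whole is an illustration, not the attached
code:

    static void
    NotifyStableSegments(XLogRecPtr flushed)
    {
        XLogSegNo   flushSegNo;
        XLogSegNo   seg;

        /* segment that contains the flush pointer */
        XLByteToSeg(flushed, flushSegNo, wal_segment_size);

        for (seg = XLogCtl->latestArchiveNotifiedSegNo + 1; seg < flushSegNo; seg++)
        {
            XLogRecPtr  crossEnd = CrossBoundaryEndRecPtrForSeg(seg);

            /*
             * Hold this (and every later) segment back until the record that
             * crosses out of it has been flushed completely.
             */
            if (!XLogRecPtrIsInvalid(crossEnd) && flushed < crossEnd)
                break;

            XLogArchiveNotifySeg(seg);
            XLogCtl->latestArchiveNotifiedSegNo = seg;
        }
    }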


2. Description in standby side

* Who notifies?
  walreceiver also doesn't know whether the flushed RecPtr is the EndRecPtr of
  a cross-segment-boundary WAL record or not.  On a standby, only the Startup
  process knows that, because it is hidden in the WAL record itself and only
  the Startup process reads and assembles WAL records.

* Action of Startup process
  Therefore, I implemented it so that walreceiver never notifies and the Startup
  process does it.
  In detail, when the Startup process reads one full-length WAL record, it notifies
  from the segment that includes the head (ReadRecPtr) of the record up to the
  segment just before the one that includes the EndRecPtr of the record.
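A minimal sketch of this rule (illustrative; suppression of duplicate
notifications via latestArchiveNotifiedSegNo is omitted here):

    XLogSegNo   firstSegNo;
    XLogSegNo   endSegNo;
    XLogSegNo   seg;

    /* segment containing the start of the just-read record */
    XLByteToSeg(xlogreader->ReadRecPtr, firstSegNo, wal_segment_size);
    /* segment containing the end of the record */
    XLByteToSeg(xlogreader->EndRecPtr, endSegNo, wal_segment_size);

    /* notify every segment before the one containing the record's end */
    for (seg = firstSegNo; seg < endSegNo; seg++)
        XLogArchiveNotifySeg(seg);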

  Now, we must pay attention to timeline switches.
  The last segment of the previous TimeLineID must be notified before switching.
  This case is handled when RM_XLOG_ID is detected.


3. About other notifying for segment
Two kinds of segment notification remain.  They do not need to be fixed.

(1) Notifying for partial segment
It does not need to be complete, so it's OK to notify without special
consideration.

(2) Re-notifying
Currently, the Checkpointer notifies through XLogArchiveCheckDone().
It is a safety net for failures of notification by a backend or the WAL writer.
A backend or the WAL writer doesn't retry a failed notification, but the
Checkpointer retries when it removes old segments. If it fails to notify, then
it does not remove the segment, which makes the Checkpointer retry until the
notification succeeds.
Also, in this case, we can just notify without special consideration,
because the Checkpointer guarantees that all WAL records included in the segment
have already been flushed.


Please send your review and comments.


Regards
Ryo Matsumura


test_in_primary.sh
Description: test_in_primary.sh


bugfix_early_archiving_v1.0.patch
Description: bugfix_early_archiving_v1.0.patch


Re: archive status ".ready" files may be created too early

2020-05-31 Thread Bossart, Nathan
On 5/28/20, 11:42 PM, "matsumura@fujitsu.com"  
wrote:
> I'm preparing a patch in which a backend inserting a segment-cross-boundary
> WAL record leaves its EndRecPtr, and whoever flushes it checks
> the EndRecPtr and notifies.

Thank you for sharing your thoughts.  I will be happy to take a look
at your patch.

Nathan


