Re: archive status ".ready" files may be created too early
On 2021-Sep-01, Michael Paquier wrote:

> That's about 515e3d8, right?

Yes.

> I have not looked in detail at what you have here, but this produces
> a compilation warning on Windows for me with this part of the patch:

This seems a tiny speck in a sea of bogosity.  If you want to silence the warning, be my guest, but in the long run I am inclined to revert the whole commit once I have a better picture of a way forward.

-- 
Álvaro Herrera  39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On Tue, Aug 31, 2021 at 08:52:05PM -0400, alvhe...@alvh.no-ip.org wrote:
> Yeah, that's becoming my conclusion too -- undo that, and start from
> scratch using the other idea.

That's about 515e3d8, right?

I have not looked in detail at what you have here, but this produces a compilation warning on Windows for me with this part of the patch:

+RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr endpos)
+{
+	XLogSegNo	segno PG_USED_FOR_ASSERTS_ONLY;

segno gets to be an unreferenced local variable.
--
Michael
Re: archive status ".ready" files may be created too early
On 2021-Aug-31, Andres Freund wrote:

> Maybe, but this is getting uglier and uglier.
>
> I think the patch should be reverted. It's not in a state that's
> appropriate for the backbranches.

Yeah, that's becoming my conclusion too -- undo that, and start from scratch using the other idea.

-- 
Álvaro Herrera               Valdivia, Chile  —  https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 2021-08-31 23:31:15 +0000, Bossart, Nathan wrote:
> On 8/31/21, 1:30 PM, "Andres Freund" wrote:
> > On 2021-08-31 18:09:36 +0000, Bossart, Nathan wrote:
> >> What appears to happen in this case is that bgwriter eventually creates a
> >> xl_running_xacts record and nudges walwriter to flush it to disk, at which
> >> point the .ready file(s) will be created.  That's admittedly a bit fragile.
> >
> > That's not guaranteed to happen. If e.g. the partial record is a checkpoint or
> > a xl_running_xacts, we'll not trigger further WAL writes in the background,
> > unless autovacuum ends up doing something.
>
> Right.  Per the attached patch, a simple way to handle that could be
> to teach XLogBackgroundFlush() to flush to the "earliest" segment
> boundary if it doesn't find anything else to do.  I think you could
> still miss creating a .ready file for the previous segment in single-
> user mode, though.

Maybe, but this is getting uglier and uglier.

I think the patch should be reverted. It's not in a state that's appropriate for the backbranches.
Re: archive status ".ready" files may be created too early
Hi,

On 2021-08-31 18:09:36 +0000, Bossart, Nathan wrote:
> On 8/31/21, 10:21 AM, "Andres Freund" wrote:
> > What would trigger the flushing? We don't write out partially filled pages unless
> > a) we're explicitly flushing an LSN on the partial page (e.g. because a
> >    synchronous commit record resides on it)
> > b) there's an async commit (i.e. commit with synchronous_commit=off) on the
> >    page
>
> Ah, so your point is that an open transaction that has written a
> partial page on the next segment wouldn't trigger a flush.

Doesn't have to be a transaction, can be a checkpoint or xl_running_xacts, or ... as well.

> What appears to happen in this case is that bgwriter eventually creates a
> xl_running_xacts record and nudges walwriter to flush it to disk, at which
> point the .ready file(s) will be created.  That's admittedly a bit fragile.

That's not guaranteed to happen. If e.g. the partial record is a checkpoint or a xl_running_xacts, we'll not trigger further WAL writes in the background, unless autovacuum ends up doing something.

Regards,

Andres
Re: archive status ".ready" files may be created too early
On 8/31/21, 10:21 AM, "Andres Freund" wrote:
> What would trigger the flushing? We don't write out partially filled pages unless
> a) we're explicitly flushing an LSN on the partial page (e.g. because a
>    synchronous commit record resides on it)
> b) there's an async commit (i.e. commit with synchronous_commit=off) on the
>    page

Ah, so your point is that an open transaction that has written a partial page on the next segment wouldn't trigger a flush.

What appears to happen in this case is that bgwriter eventually creates a xl_running_xacts record and nudges walwriter to flush it to disk, at which point the .ready file(s) will be created.  That's admittedly a bit fragile.

Nathan
Re: archive status ".ready" files may be created too early
Hi,

On 2021-08-31 17:01:31 +0000, Bossart, Nathan wrote:
> > If the record ending at s4 + 10 isn't an async commit (and thus
> > XLogCtl->asyncXactLSN is smaller), and there are no further records, we can
> > end up waiting effectively forever for s2 (and s3) to be archived.  If all
> > other connections (and autovac etc) are idle, there will be no XLogFlush()
> > calls, nor will XLogBackgroundFlush() do anything, because it'll hit the "If
> > already known flushed" path, because the first page in s4 is only
> > partially filled.
>
> I'm not following why s4 wouldn't be flushed in this example.  Even if
> the first page in s4 is only partially filled, that portion of the
> record should still get flushed, and we'll create the .ready files for
> s2 and s3 at that time.

What would trigger the flushing? We don't write out partially filled pages unless

a) we're explicitly flushing an LSN on the partial page (e.g. because a
   synchronous commit record resides on it)
b) there's an async commit (i.e. commit with synchronous_commit=off) on the
   page

Greetings,

Andres Freund
Re: archive status ".ready" files may be created too early
On 8/31/21, 12:44 AM, "Andres Freund" wrote:
> If there's now a flush request including all of s3, we'll have the following
> sequence of notifies:
>
> NotifySegmentsReadyForArchive(s1)
>   nothing happens, smaller than s1+10
>
> NotifySegmentsReadyForArchive(s2)
>   earliestSegBoundary = s4
>   earliestSegBoundaryEndPtr = s4+10
>   latestSegBoundary = s4
>   latestSegBoundaryEndPtr = s4 + 10
>   latest_boundary_seg = s1
>
> NotifySegmentsReadyForArchive(s3)
>   nothing happens, flush is smaller than s4

When earliestSegBoundary is set to s4, latestSegBoundary will be set to MaxXLogSegNo.

> If the record ending at s4 + 10 isn't an async commit (and thus
> XLogCtl->asyncXactLSN is smaller), and there are no further records, we can
> end up waiting effectively forever for s2 (and s3) to be archived.  If all
> other connections (and autovac etc) are idle, there will be no XLogFlush()
> calls, nor will XLogBackgroundFlush() do anything, because it'll hit the "If
> already known flushed" path, because the first page in s4 is only
> partially filled.

I'm not following why s4 wouldn't be flushed in this example.  Even if the first page in s4 is only partially filled, that portion of the record should still get flushed, and we'll create the .ready files for s2 and s3 at that time.  I tested this by adding some debug logging and creating a small record that crossed segment boundaries but didn't fill the first page on the next segment, and the .ready file was created as expected.  Is there a case where we wouldn't flush the end of the record to disk?

During my testing, I did find an obvious bug.  We probably shouldn't be calling NotifySegmentsReadyForArchive() when archiving isn't enabled.
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 6a1e16edc2..8ca0d8e616 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,7 +253,8 @@ WalWriterMain(void)
 	 * here to handle a race condition where WAL is flushed to disk prior
 	 * to registering the segment boundary.
 	 */
-	NotifySegmentsReadyForArchive(GetFlushRecPtr());
+	if (XLogArchivingActive())
+		NotifySegmentsReadyForArchive(GetFlushRecPtr());
 
 	/*
 	 * Do what we're here for; then, if XLogBackgroundFlush() found useful

Nathan
Re: archive status ".ready" files may be created too early
Hi,

On 2021-08-31 06:45:06 +0000, Bossart, Nathan wrote:
> On 8/30/21, 7:39 PM, "Andres Freund" wrote:
> > On 2021-08-30 22:39:04 +0000, Bossart, Nathan wrote:
> >> If we called NotifySegmentsReadyForArchive() before we updated the
> >> flush location in shared memory, we might skip nudging the WAL writer
> >> even though we should.
> >
> > That's trivial to address - just have a local variable saying whether we need
> > to call NotifySegmentsReadyForArchive().
>
> I think we can remove the race condition entirely by moving boundary
> registration to before WALInsertLockRelease().  I attached a patch for
> discussion.

I think it's a bad idea to move more code to before WALInsertLockRelease. There's a very limited number of xlog insert slots, and WAL flushes (e.g. commits) need to wait for insertions to finish.

> >> > When s1 is fully written out, NotifySegmentsReadyForArchive() will set
> >> > earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 -
> >> > ok. But when s2 is flushed, we'll afaict happily create .ready files
> >> > for s2, s3 despite s4 not yet being written, because earliestSegBoundary is
> >> > now s4.
> >>
> >> In this case, the .ready files for s2 and s3 wouldn't be created until
> >> s4 is flushed to disk.
> >
> > I don't think that's true as the code stands today? The
> > NotifySegmentsReadyForArchive() for s2 will update earliestSegBoundary to s4,
> > because latestSegBoundary = 4 and earliestSegBoundary = 1, triggering the
> > keep_latest branch. Any subsequent NotifySegmentsReadyForArchive() with a
> > segment < 4 will then be able to flush s2 and s3?
>
> When flushRecPtr is less than both of the segment boundaries,
> NotifySegmentsReadyForArchive() will return without doing anything.
> At least, that was the intent.  If there is some reason it's not
> actually working that way, I can work on fixing it.

But that's not OK either!
Consider a scenario where there's small records each spanning just a bit into the next segment, and initially all the data is in wal_buffers.

RegisterSegmentBoundary(s1, s1+10)
  earliestSegBoundary = s1
  earliestSegBoundaryEndPtr = s1+10

RegisterSegmentBoundary(s2, s2+10)
  earliestSegBoundary = s1
  earliestSegBoundaryEndPtr = s1+10
  latestSegBoundary = s2
  latestSegBoundaryEndPtr = s2 + 10

RegisterSegmentBoundary(s3, s3+10)
  earliestSegBoundary = s1
  earliestSegBoundaryEndPtr = s1+10
  latestSegBoundary = s2
  latestSegBoundaryEndPtr = s2 + 10

RegisterSegmentBoundary(s4, s4+10)
  earliestSegBoundary = s1
  earliestSegBoundaryEndPtr = s1+10
  latestSegBoundary = s4
  latestSegBoundaryEndPtr = s4 + 10

If there's now a flush request including all of s3, we'll have the following sequence of notifies:

NotifySegmentsReadyForArchive(s1)
  nothing happens, smaller than s1+10

NotifySegmentsReadyForArchive(s2)
  earliestSegBoundary = s4
  earliestSegBoundaryEndPtr = s4+10
  latestSegBoundary = s4
  latestSegBoundaryEndPtr = s4 + 10
  latest_boundary_seg = s1

NotifySegmentsReadyForArchive(s3)
  nothing happens, flush is smaller than s4

If the record ending at s4 + 10 isn't an async commit (and thus XLogCtl->asyncXactLSN is smaller), and there are no further records, we can end up waiting effectively forever for s2 (and s3) to be archived.  If all other connections (and autovac etc) are idle, there will be no XLogFlush() calls, nor will XLogBackgroundFlush() do anything, because it'll hit the "If already known flushed" path, because the first page in s4 is only partially filled.

Am I missing something?

> > I don't think it's sensible to fix these separately. It'd be one thing to do
> > that for HEAD, but on the back branches? And that this patch hasn't gotten any
> > performance testing is scary.
>
> Are there any specific performance tests you'd like to see?  I don't
> mind running a couple.
- Parallel copy with > 8 processes

- Parallel non-transactional insertion of small-medium records
  Simulates inserting rows within a transaction

- Parallel transactional insertion of small-medium sized records, with fsync=on
  Plain oltp writes

- Parallel transactional insertion of small-medium sized records, with fsync=off
  fsync=off to simulate a fast server-class SSD (where fsync is instantaneous).
  Of course, if you have one of those, you can also use that.

For the oltp ones I've had good experience simulating workloads with pg_logical_emit_message(). That just hits the WAL path, *drastically* reducing the variance / shortening the required test duration.

Greetings,

Andres Freund
Re: archive status ".ready" files may be created too early
Hi,

On 2021-08-30 22:39:04 +0000, Bossart, Nathan wrote:
> On 8/30/21, 2:06 PM, "Andres Freund" wrote:
> > When were you thinking of doing the latch sets? Adding a latch set for every
> > XLogWrite() wouldn't be cheap either. Both because syscalls under a lock
> > aren't free and because waking up walsender even more often isn't free (we
> > already have a few threads about reducing the signalling frequency).
> >
> > There's also the question of what to do with single user mode. We shouldn't
> > just skip creating .ready files there...
>
> Good point.
>
> > Although, the more I think about it, the more I am confused about the trailing
> >     if (XLogArchivingActive())
> >         NotifySegmentsReadyForArchive(LogwrtResult.Flush);
> >
> > in XLogWrite(). Shouldn't that at the very least be inside the "If asked to
> > flush, do so" branch? Outside that and the finishing_seg branch
> > LogwrtResult.Flush won't have moved, right? So the call to
> > NotifySegmentsReadyForArchive() can't do anything, no?
>
> The registration logic looks like this:
>   1. Register boundary
>   2. Get flush location from shared memory
>   3. If flush location >= our just-registered boundary, nudge
>      the WAL writer to create .ready files if needed
>
> If we called NotifySegmentsReadyForArchive() before we updated the
> flush location in shared memory, we might skip nudging the WAL writer
> even though we should.

That's trivial to address - just have a local variable saying whether we need to call NotifySegmentsReadyForArchive().

Note that the finishing_seg path currently calls NotifySegmentsReadyForArchive() before the shared memory flush location is updated.

> > When s1 is fully written out, NotifySegmentsReadyForArchive() will set
> > earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 -
> > ok. But when s2 is flushed, we'll afaict happily create .ready files
> > for s2, s3 despite s4 not yet being written, because earliestSegBoundary is
> > now s4.
> In this case, the .ready files for s2 and s3 wouldn't be created until
> s4 is flushed to disk.

I don't think that's true as the code stands today? The NotifySegmentsReadyForArchive() for s2 will update earliestSegBoundary to s4, because latestSegBoundary = 4 and earliestSegBoundary = 1, triggering the keep_latest branch. Any subsequent NotifySegmentsReadyForArchive() with a segment < 4 will then be able to flush s2 and s3?

> > The whole approach here of delaying .ready creation for these types of
> > segments seems wrong to me. Doesn't the exact same problem also exist for
> > streaming rep - which one can also use to maintain a PITR archive? walsender
> > sends up to the flush location, and pg_receivewal's FindStreamingStart() will
> > afaict just continue receiving from after that point.
>
> The problem with streaming replication is being discussed in a new
> thread [0].

I don't think it's sensible to fix these separately. It'd be one thing to do that for HEAD, but on the back branches? And that this patch hasn't gotten any performance testing is scary.

Greetings,

Andres Freund
Re: archive status ".ready" files may be created too early
On 8/30/21, 3:40 PM, "Bossart, Nathan" wrote:
> On 8/30/21, 2:06 PM, "Andres Freund" wrote:
>> Although, the more I think about it, the more I am confused about the trailing
>>     if (XLogArchivingActive())
>>         NotifySegmentsReadyForArchive(LogwrtResult.Flush);
>>
>> in XLogWrite(). Shouldn't that at the very least be inside the "If asked to
>> flush, do so" branch? Outside that and the finishing_seg branch
>> LogwrtResult.Flush won't have moved, right? So the call to
>> NotifySegmentsReadyForArchive() can't do anything, no?
>
> The registration logic looks like this:
>   1. Register boundary
>   2. Get flush location from shared memory
>   3. If flush location >= our just-registered boundary, nudge
>      the WAL writer to create .ready files if needed
>
> If we called NotifySegmentsReadyForArchive() before we updated the
> flush location in shared memory, we might skip nudging the WAL writer
> even though we should.

In the other thread [0], we're considering moving boundary registration to before WALInsertLockRelease().  If I'm right that this removes the race condition in question, we should be able to move the call to NotifySegmentsReadyForArchive() at the end of XLogWrite() to the if-asked-to-flush branch.

Nathan

[0] https://postgr.es/m/DE60B9AA-9670-47DA-9678-6C79BCD884E3%40amazon.com
Re: archive status ".ready" files may be created too early
On 8/30/21, 2:06 PM, "Andres Freund" wrote:
> When were you thinking of doing the latch sets? Adding a latch set for every
> XLogWrite() wouldn't be cheap either. Both because syscalls under a lock
> aren't free and because waking up walsender even more often isn't free (we
> already have a few threads about reducing the signalling frequency).
>
> There's also the question of what to do with single user mode. We shouldn't
> just skip creating .ready files there...

Good point.

> Although, the more I think about it, the more I am confused about the trailing
>     if (XLogArchivingActive())
>         NotifySegmentsReadyForArchive(LogwrtResult.Flush);
>
> in XLogWrite(). Shouldn't that at the very least be inside the "If asked to
> flush, do so" branch? Outside that and the finishing_seg branch
> LogwrtResult.Flush won't have moved, right? So the call to
> NotifySegmentsReadyForArchive() can't do anything, no?

The registration logic looks like this:
  1. Register boundary
  2. Get flush location from shared memory
  3. If flush location >= our just-registered boundary, nudge
     the WAL writer to create .ready files if needed

If we called NotifySegmentsReadyForArchive() before we updated the flush location in shared memory, we might skip nudging the WAL writer even though we should.

> Nor does it seem like we'd ever need to call NotifySegmentsReadyForArchive()
> if we started writing on the current page - flushRecPtr can't move across a
> segment boundary in that case.

I think there is a chance that we've crossed one of our recorded segment boundaries anytime the flush pointer moves.

> When s1 is fully written out, NotifySegmentsReadyForArchive() will set
> earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 -
> ok. But when s2 is flushed, we'll afaict happily create .ready files
> for s2, s3 despite s4 not yet being written, because earliestSegBoundary is
> now s4.

In this case, the .ready files for s2 and s3 wouldn't be created until s4 is flushed to disk.
> The whole approach here of delaying .ready creation for these types of
> segments seems wrong to me. Doesn't the exact same problem also exist for
> streaming rep - which one can also use to maintain a PITR archive? walsender
> sends up to the flush location, and pg_receivewal's FindStreamingStart() will
> afaict just continue receiving from after that point.

The problem with streaming replication is being discussed in a new thread [0].

Nathan

[0] https://postgr.es/m/202108232252.dh7uxf6oxwcy%40alvherre.pgsql
Re: archive status ".ready" files may be created too early
Hi,

On 2021-08-30 15:51:54 -0400, alvhe...@alvh.no-ip.org wrote:
> On 2021-Aug-28, Andres Freund wrote:
>
> > While rebasing the aio patchset on top of HEAD I noticed that this commit added
> > another atomic operation to XLogWrite() with archiving enabled. The WAL write
> > path is really quite hot, and at least some of the
> > NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.
> >
> > I think we should at least try to make the fast-path where no segment
> > boundaries were crossed use no atomic operations.
>
> I think the best way to achieve this is to rely completely on
> walwriter doing the segment notification, so that the WAL write done by
> backend would only do a latch set.

When were you thinking of doing the latch sets? Adding a latch set for every XLogWrite() wouldn't be cheap either. Both because syscalls under a lock aren't free and because waking up walsender even more often isn't free (we already have a few threads about reducing the signalling frequency).

There's also the question of what to do with single user mode. We shouldn't just skip creating .ready files there...

Although, the more I think about it, the more I am confused about the trailing

    if (XLogArchivingActive())
        NotifySegmentsReadyForArchive(LogwrtResult.Flush);

in XLogWrite(). Shouldn't that at the very least be inside the "If asked to flush, do so" branch? Outside that and the finishing_seg branch LogwrtResult.Flush won't have moved, right? So the call to NotifySegmentsReadyForArchive() can't do anything, no?

Nor does it seem like we'd ever need to call NotifySegmentsReadyForArchive() if we started writing on the current page - flushRecPtr can't move across a segment boundary in that case.

I hadn't yet realized that this commit doesn't just make XLogWrite() more expensive, it also makes XLogInsertRecord() more expensive :(. Adding two divisions to XLogInsertRecord() isn't nice, especially as it happens even if !XLogArchivingActive().
I can't really convince myself this deals correctly with multiple segment-spanning records and with records spanning more than one segment? It'd be easier to understand if the new XLogCtlData variables were documented...

If there's one record from segment s0 to s1 and one from s1 to s4, and wal_buffers is big enough to contain them all, the first record will set earliestSegBoundary = s1, the second latestSegBoundary = s4.

When s1 is fully written out, NotifySegmentsReadyForArchive() will set earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 - ok. But when s2 is flushed, we'll afaict happily create .ready files for s2, s3 despite s4 not yet being written, because earliestSegBoundary is now s4.

I think there's other issues as well. The more I look at this commit, the less I believe it's right.

The whole approach here of delaying .ready creation for these types of segments seems wrong to me. Doesn't the exact same problem also exist for streaming rep - which one can also use to maintain a PITR archive? walsender sends up to the flush location, and pg_receivewal's FindStreamingStart() will afaict just continue receiving from after that point.

Greetings,

Andres Freund
Re: archive status ".ready" files may be created too early
On 8/30/21, 12:52 PM, "alvhe...@alvh.no-ip.org" wrote:
> On 2021-Aug-28, Andres Freund wrote:
>
>> While rebasing the aio patchset on top of HEAD I noticed that this commit added
>> another atomic operation to XLogWrite() with archiving enabled. The WAL write
>> path is really quite hot, and at least some of the
>> NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.
>>
>> I think we should at least try to make the fast-path where no segment
>> boundaries were crossed use no atomic operations.
>
> I think the best way to achieve this is to rely completely on
> walwriter doing the segment notification, so that the WAL write done by
> backend would only do a latch set.

+1.  If we do that, we may also want to move NotifySegmentsReadyForArchive() to after the call to XLogBackgroundFlush() in the WAL writer.

Nathan
Re: archive status ".ready" files may be created too early
On 2021-Aug-28, Andres Freund wrote:

> While rebasing the aio patchset on top of HEAD I noticed that this commit added
> another atomic operation to XLogWrite() with archiving enabled. The WAL write
> path is really quite hot, and at least some of the
> NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.
>
> I think we should at least try to make the fast-path where no segment
> boundaries were crossed use no atomic operations.

I think the best way to achieve this is to rely completely on walwriter doing the segment notification, so that the WAL write done by backend would only do a latch set.

-- 
Álvaro Herrera               Valdivia, Chile  —  https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
Hi,

On 2021-08-23 15:55:03 -0400, alvhe...@alvh.no-ip.org wrote:
> On 2021-Aug-23, Bossart, Nathan wrote:
>
> > Ah, okay.  BTW the other changes you mentioned made sense to me.
>
> Thanks.  I've pushed this now to all live branches.

While rebasing the aio patchset on top of HEAD I noticed that this commit added another atomic operation to XLogWrite() with archiving enabled. The WAL write path is really quite hot, and at least some of the NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.

I think we should at least try to make the fast-path where no segment boundaries were crossed use no atomic operations.

Greetings,

Andres Freund
Re: archive status ".ready" files may be created too early
On 8/25/21, 11:01 AM, "Fujii Masao" wrote:
> If LogwrtResult.Flush >= EndPos, which means that another process already
> has flushed the record concurrently and updated XLogCtl->LogwrtResult.Flush.
> This situation also means that another process called
> NotifySegmentsReadyForArchive(LogwrtResult.Flush).  Right?

If the segment boundary wasn't registered before the other process called NotifySegmentsReadyForArchive(), then it couldn't have used the boundary for deciding which .ready files to create.

> If this understanding is right, there seems no need to wake walwriter up here
> so that it can call NotifySegmentsReadyForArchive(LogwrtResult.Flush) again.
> Thought?

We're actually discussing this right now in another thread [0].  I think we might be able to get rid of that part if we move the boundary registration to before we release the WAL insert lock(s).

Nathan

[0] https://postgr.es/m/DE60B9AA-9670-47DA-9678-6C79BCD884E3%40amazon.com
Re: archive status ".ready" files may be created too early
On 2021/08/24 4:55, alvhe...@alvh.no-ip.org wrote:
> On 2021-Aug-23, Bossart, Nathan wrote:
>
>> Ah, okay.  BTW the other changes you mentioned made sense to me.
>
> Thanks.  I've pushed this now to all live branches.

Thanks a lot!

+	/*
+	 * There's a chance that the record was already flushed to disk and we
+	 * missed marking segments as ready for archive.  If this happens, we
+	 * nudge the WALWriter, which will take care of notifying segments as
+	 * needed.
+	 */
+	if (StartSeg != EndSeg && XLogArchivingActive() &&
+		LogwrtResult.Flush >= EndPos && ProcGlobal->walwriterLatch)
+		SetLatch(ProcGlobal->walwriterLatch);

Is this really necessary?

If LogwrtResult.Flush >= EndPos, which means that another process already has flushed the record concurrently and updated XLogCtl->LogwrtResult.Flush. This situation also means that another process called NotifySegmentsReadyForArchive(LogwrtResult.Flush). Right?

If this understanding is right, there seems no need to wake walwriter up here so that it can call NotifySegmentsReadyForArchive(LogwrtResult.Flush) again. Thought?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: archive status ".ready" files may be created too early
On 2021-Aug-23, Bossart, Nathan wrote:

> On 8/23/21, 12:55 PM, "alvhe...@alvh.no-ip.org" wrote:
> > Thanks.  I've pushed this now to all live branches.
>
> Thank you!  I appreciate the thorough reviews.  Should we make a new
> thread for the streaming replication fix?

Yeah, this one is long enough :-)

-- 
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 8/23/21, 12:55 PM, "alvhe...@alvh.no-ip.org" wrote:
> Thanks.  I've pushed this now to all live branches.

Thank you!  I appreciate the thorough reviews.  Should we make a new thread for the streaming replication fix?

Nathan
Re: archive status ".ready" files may be created too early
On 2021-Aug-23, Bossart, Nathan wrote:

> Ah, okay.  BTW the other changes you mentioned made sense to me.

Thanks.  I've pushed this now to all live branches.

-- 
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"It does it in a really, really complicated way.
Why does it need to be complicated? Because it's MakeMaker."
Re: archive status ".ready" files may be created too early
On 8/23/21, 10:31 AM, "alvhe...@alvh.no-ip.org" wrote:
> On 2021-Aug-23, alvhe...@alvh.no-ip.org wrote:
>
>> The only way .ready files are created is that XLogNotifyWrite() is
>> called.  For regular WAL files during regular operation, that only
>> happens in XLogNotifyWriteSeg().  That, in turn, only happens in
>> NotifySegmentsReadyForArchive().  But if the system runs and never
>> writes WAL records that cross WAL boundaries, that function will see
>> that both earliestSegBoundary and latestSegBoundary are MaxXLogSegno,
>> and return without doing anything.  So no segments will be notified.
>
> Nevermind -- I realized that all segments get registered, not just those
> for which we generate continuation records.

Ah, okay.  BTW the other changes you mentioned made sense to me.

Nathan
Re: archive status ".ready" files may be created too early
On 2021-Aug-23, alvhe...@alvh.no-ip.org wrote:

> The only way .ready files are created is that XLogNotifyWrite() is
> called.  For regular WAL files during regular operation, that only
> happens in XLogNotifyWriteSeg().  That, in turn, only happens in
> NotifySegmentsReadyForArchive().  But if the system runs and never
> writes WAL records that cross WAL boundaries, that function will see
> that both earliestSegBoundary and latestSegBoundary are MaxXLogSegno,
> and return without doing anything.  So no segments will be notified.

Nevermind -- I realized that all segments get registered, not just those for which we generate continuation records.

-- 
Álvaro Herrera  39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"
Re: archive status ".ready" files may be created too early
On 2021-Aug-23, Bossart, Nathan wrote:

> Sorry, I'm still not following this one.  If we skipped creating
> .ready segments due to a crash, we rely on RemoveOldXlogFiles() to
> create them as needed in the end-of-recovery checkpoint.  If a record
> fits perfectly in the end of a segment, we'll still register it as a
> boundary for the next segment (hence why we use XLByteToSeg() instead
> of XLByteToPrevSeg()).  If database activity stops completely, there
> shouldn't be anything to mark ready.

The only way .ready files are created is that XLogNotifyWrite() is called.  For regular WAL files during regular operation, that only happens in XLogNotifyWriteSeg().  That, in turn, only happens in NotifySegmentsReadyForArchive().  But if the system runs and never writes WAL records that cross WAL boundaries, that function will see that both earliestSegBoundary and latestSegBoundary are MaxXLogSegno, and return without doing anything.  So no segments will be notified.

-- 
Álvaro Herrera  39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 8/23/21, 9:33 AM, "alvhe...@alvh.no-ip.org" wrote:
> On 2021-Aug-23, Bossart, Nathan wrote:
>
>> Hm.  My expectation would be that if there are no registrations, we
>> cannot create .ready files for the flushed segments.  The scenario
>> where I can see that happening is when a record gets flushed to disk
>> prior to registration.  In that case, we'll still eventually register
>> the record and wake up the WAL writer process, which will take care of
>> creating the .ready files that were missed earlier.  Is there another
>> case you are thinking of where we could miss registration for a cross-
>> segment record altogether?
>
> I'm thinking of the case where no record crosses segment boundaries ever.

Sorry, I'm still not following this one.  If we skipped creating .ready segments due to a crash, we rely on RemoveOldXlogFiles() to create them as needed in the end-of-recovery checkpoint.  If a record fits perfectly in the end of a segment, we'll still register it as a boundary for the next segment (hence why we use XLByteToSeg() instead of XLByteToPrevSeg()).  If database activity stops completely, there shouldn't be anything to mark ready.

Nathan
Re: archive status ".ready" files may be created too early
On 2021-Aug-23, Bossart, Nathan wrote: > Hm. My expectation would be that if there are no registrations, we > cannot create .ready files for the flushed segments. The scenario > where I can see that happening is when a record gets flushed to disk > prior to registration. In that case, we'll still eventually register > the record and wake up the WAL writer process, which will take care of > creating the .ready files that were missed earlier. Is there another > case you are thinking of where we could miss registration for a > cross-segment record altogether? I'm thinking of the case where no record crosses segment boundaries ever. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 8/23/21, 8:50 AM, "alvhe...@alvh.no-ip.org" wrote: > ... while reading the resulting code after backpatching to all branches, > I realized that if there are no registrations whatsoever, then archiving > won't do anything, which surely is the wrong thing to do. The correct > behavior should be "if there are no registrations, then *all* flushed > segments can be notified". Hm. My expectation would be that if there are no registrations, we cannot create .ready files for the flushed segments. The scenario where I can see that happening is when a record gets flushed to disk prior to registration. In that case, we'll still eventually register the record and wake up the WAL writer process, which will take care of creating the .ready files that were missed earlier. Is there another case you are thinking of where we could miss registration for a cross-segment record altogether? Nathan
Re: archive status ".ready" files may be created too early
On 2021-Aug-21, Bossart, Nathan wrote:

> > Well, the thing I realized is that these three helper functions have
> > exactly one caller each.  I think the compiler is going to inline them,
> > so there isn't going to be a function call in the assembly.  I haven't
> > verified this, though.
>
> Good point.  It looks like they're getting inlined for me.

I still didn't like it, because it looks like we're creating an API for
which there can be only one caller.  So I expanded the functions in the
caller.  It doesn't look too bad.  However ...

... while reading the resulting code after backpatching to all branches,
I realized that if there are no registrations whatsoever, then archiving
won't do anything, which surely is the wrong thing to do.  The correct
behavior should be "if there are no registrations, then *all* flushed
segments can be notified".  I'll fix that ...

Another thing I didn't like is that you used a name ending in RecPtr for
the LSN, which gives no indication that it really is the *end* LSN, not
the start pointer.  And it won't play nice with the need to add the
*start* LSN which we'll need to implement solving the equivalent problem
for streaming replication.  I'll rename those to
earliestSegBoundaryEndPtr and latestSegBoundaryEndPtr.

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
Syntax error: function hell() needs an argument.
Please choose what hell you want to involve.

>From 6e4c3fd6f5687ec5762de121344ce35c1c890812 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Mon, 23 Aug 2021 09:06:09 -0400
Subject: [PATCH v15] Avoid creating archive status ".ready" files too early

WAL records may span multiple segments, but XLogWrite() does not wait
for the entire record to be written out to disk before creating archive
status files.  Instead, as soon as the last WAL page of the segment is
written, the archive status file is created, and the archiver may
process it.  If PostgreSQL crashes before it is able to write and flush
the rest of the record (in the next WAL segment), the wrong version of
the first segment file lingers in the archive, which causes operations
such as point-in-time restores to fail.

To fix this, keep track of records that span across segments and ensure
that segments are only marked ready-for-archival once such records have
been completely written to disk.

This has always been wrong, so backpatch all the way back.

Author: Nathan Bossart
Reviewed-by: Kyotaro Horiguchi
Reviewed-by: Ryo Matsumura
Reviewed-by: Andrey Borodin
Discussion: https://postgr.es/m/cbddfa01-6e40-46bb-9f98-9340f4379...@amazon.com
---
 src/backend/access/transam/timeline.c    |   2 +-
 src/backend/access/transam/xlog.c        | 220 +--
 src/backend/access/transam/xlogarchive.c |  17 +-
 src/backend/postmaster/walwriter.c       |   7 +
 src/backend/replication/walreceiver.c    |   6 +-
 src/include/access/xlog.h                |   1 +
 src/include/access/xlogarchive.h         |   4 +-
 src/include/access/xlogdefs.h            |   1 +
 8 files changed, 234 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..acd5c2431d 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	if (XLogArchivingActive())
 	{
 		TLHistoryFileName(histfname, newTLI);
-		XLogArchiveNotify(histfname);
+		XLogArchiveNotify(histfname, true);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..8e7c3a364a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -724,6 +724,18 @@ typedef struct XLogCtlData
 	XLogRecPtr	lastFpwDisableRecPtr;
 
 	slock_t		info_lck;		/* locks shared variables shown above */
+
+	/*
+	 * Variables used to track cross-segment records.  Protected by
+	 * segtrack_lck.
+	 */
+	XLogSegNo	lastNotifiedSeg;
+	XLogSegNo	earliestSegBoundary;
+	XLogRecPtr	earliestSegBoundaryRecPtr;
+	XLogSegNo	latestSegBoundary;
+	XLogRecPtr	latestSegBoundaryRecPtr;
+
+	slock_t		segtrack_lck;	/* locks shared variables shown above */
 } XLogCtlData;
 
 static XLogCtlData *XLogCtl = NULL;
@@ -920,6 +932,7 @@ static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 						   XLogSegNo *endlogSegNo);
 static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
+static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr pos);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
@@ -1154,23 +1167,56 @@ XLogInsertRecord(XLogRecData *rdata,
 	END_CRIT_SECTION();
 
 	/*
-	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
+	 * If we crossed page boundary, update LogwrtRqst.Write; if we crossed
+	 *
Re: archive status ".ready" files may be created too early
On 8/20/21, 4:52 PM, "alvhe...@alvh.no-ip.org" wrote: > On 2021-Aug-20, Bossart, Nathan wrote: > >> I was looking at moving the function calls out of the spinlock region. >> I don't think the functions are doing anything too expensive, and they >> help clean up NotifySegmentsReadyForArchive() quite a bit, but I >> understand why it might be against project policy to do something like >> that. It would be easy enough to get rid of the helper functions if >> that was concern. > > Well, the thing I realized is that these three helper functions have > exactly one caller each. I think the compiler is going to inline them, > so there isn't going to be a function call in the assembly. I haven't > verified this, though. Good point. It looks like they're getting inlined for me. Nathan
Re: archive status ".ready" files may be created too early
On 2021-Aug-20, Bossart, Nathan wrote: > I was looking at moving the function calls out of the spinlock region. > I don't think the functions are doing anything too expensive, and they > help clean up NotifySegmentsReadyForArchive() quite a bit, but I > understand why it might be against project policy to do something like > that. It would be easy enough to get rid of the helper functions if > that was a concern. Well, the thing I realized is that these three helper functions have exactly one caller each. I think the compiler is going to inline them, so there isn't going to be a function call in the assembly. I haven't verified this, though. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Los dioses no protegen a los insensatos. Éstos reciben protección de otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)
Re: archive status ".ready" files may be created too early
On 8/20/21, 4:00 PM, "alvhe...@alvh.no-ip.org" wrote: > Attached is v14 which uses a separate spinlock. This looks good to me. I was looking at moving the function calls out of the spinlock region. I don't think the functions are doing anything too expensive, and they help clean up NotifySegmentsReadyForArchive() quite a bit, but I understand why it might be against project policy to do something like that. It would be easy enough to get rid of the helper functions if that was a concern. Nathan
Re: archive status ".ready" files may be created too early
Attached is v14 which uses a separate spinlock.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"No me acuerdo, pero no es cierto.  No es cierto, y si fuera cierto,
no me acuerdo." (Augusto Pinochet a una corte de justicia)

>From 4180334f51a1f343ce5795d0b62f3aa298b472da Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Fri, 20 Aug 2021 19:25:31 +
Subject: [PATCH v14] Avoid creating archive status ".ready" files too early

WAL records may span multiple segments, but XLogWrite() does not wait
for the entire record to be written out to disk before creating archive
status files.  Instead, as soon as the last WAL page of the segment is
written, the archive status file is created, and the archiver may
process it.  If PostgreSQL crashes before it is able to write and flush
the rest of the record (in the next WAL segment), the wrong version of
the first segment file lingers in the archive, which causes operations
such as point-in-time restores to fail.

To fix this, keep track of records that span across segments and ensure
that segments are only marked ready-for-archival once such records have
been completely written to disk.

Author: Nathan Bossart
Reviewed-by: Kyotaro Horiguchi
Reviewed-by: Ryo Matsumura
Reviewed-by: Andrey Borodin
Discussion: https://postgr.es/m/cbddfa01-6e40-46bb-9f98-9340f4379...@amazon.com
---
 src/backend/access/transam/timeline.c    |   2 +-
 src/backend/access/transam/xlog.c        | 242 +--
 src/backend/access/transam/xlogarchive.c |  17 +-
 src/backend/postmaster/walwriter.c       |   7 +
 src/backend/replication/walreceiver.c    |   6 +-
 src/include/access/xlog.h                |   1 +
 src/include/access/xlogarchive.h         |   4 +-
 src/include/access/xlogdefs.h            |   1 +
 8 files changed, 256 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..acd5c2431d 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	if (XLogArchivingActive())
 	{
 		TLHistoryFileName(histfname, newTLI);
-		XLogArchiveNotify(histfname);
+		XLogArchiveNotify(histfname, true);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..95f03adef8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -724,6 +724,18 @@ typedef struct XLogCtlData
 	XLogRecPtr	lastFpwDisableRecPtr;
 
 	slock_t		info_lck;		/* locks shared variables shown above */
+
+	/*
+	 * Variables used to track cross-segment records.  Protected by
+	 * segtrack_lck.
+	 */
+	XLogSegNo	lastNotifiedSeg;
+	XLogSegNo	earliestSegBoundary;
+	XLogRecPtr	earliestSegBoundaryRecPtr;
+	XLogSegNo	latestSegBoundary;
+	XLogRecPtr	latestSegBoundaryRecPtr;
+
+	slock_t		segtrack_lck;	/* locks shared variables shown above */
 } XLogCtlData;
 
 static XLogCtlData *XLogCtl = NULL;
@@ -920,6 +932,9 @@ static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 						   XLogSegNo *endlogSegNo);
 static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
+static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr pos);
+static bool GetLatestSegmentBoundary(XLogRecPtr flushed, XLogSegNo *latest_boundary_seg);
+static void RemoveSegmentBoundariesUpTo(XLogSegNo seg);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
@@ -1154,23 +1169,56 @@ XLogInsertRecord(XLogRecData *rdata,
 	END_CRIT_SECTION();
 
 	/*
-	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
+	 * If we crossed page boundary, update LogwrtRqst.Write; if we crossed
+	 * segment boundary, register that and wake up walwriter.
 	 */
 	if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
 	{
+		XLogSegNo	StartSeg;
+		XLogSegNo	EndSeg;
+
+		XLByteToSeg(StartPos, StartSeg, wal_segment_size);
+		XLByteToSeg(EndPos, EndSeg, wal_segment_size);
+
+		/*
+		 * Register our crossing the segment boundary if that occurred.
+		 *
+		 * Note that we did not use XLByteToPrevSeg() for determining the
+		 * ending segment.  This is so that a record that fits perfectly into
+		 * the end of the segment is marked ready for archival as soon as the
+		 * flushed pointer jumps to the next segment.
+		 */
+		if (StartSeg != EndSeg && XLogArchivingActive())
+			RegisterSegmentBoundary(EndSeg, EndPos);
+
+		/*
+		 * Advance LogwrtRqst.Write so that it includes new block(s).
+		 *
+		 * We do this after registering the segment boundary so that the
+		 * comparison with the flushed pointer below can use the latest value
+		 * known globally.
+		 */
 		SpinLockAcquire(&XLogCtl->info_lck);
-		/* advance global request to include new block(s) */
 		if (XLogCtl->LogwrtRqst.Write
Re: archive status ".ready" files may be created too early
On 2021-Aug-20, Bossart, Nathan wrote: > > On Fri, Aug 20, 2021 at 1:29 PM Bossart, Nathan wrote: > >> This led me to revisit the two-element > >> approach that was discussed upthread. What if we only stored the > >> earliest and latest segment boundaries at any given time? Once the > >> earliest boundary is added, it never changes until the segment is > >> flushed and it is removed. The latest boundary, however, will be > >> updated any time we register another segment. Once the earliest > >> boundary is removed, we replace it with the latest boundary. This > >> strategy could cause us to miss intermediate boundaries, but AFAICT > >> the worst case scenario is that we hold off creating .ready files a > >> bit longer than necessary. > I've attached a patch to demonstrate what I'm thinking. There is only one thing I didn't like in this new version, which is that we're holding info_lck too much. I've seen info_lck contention be a problem in some workloads and I'd rather not add more stuff to it. I'd rather we stick with using a new lock object to protect all the data we need for this job. Should this new lock object be a spinlock or an lwlock? I think a spinlock would generally be better because it's lower overhead and we can't use it in shared mode anywhere, which would be the greatest argument for an lwlock. However, I think we avoid letting code run with spinlocks held that's not straight-line code, and we have some function calls there. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On Fri, Aug 20, 2021 at 1:52 PM alvhe...@alvh.no-ip.org wrote: > Actually, you were right. Hash tables in shared memory can be expanded. > There are some limitations (the hash "directory" is fixed size, which > means the hash table gets less efficient if it grows too much), but you > can definitely create more hash entries than the initial size. See for > example element_alloc(), which covers the case of a hash table being > IS_PARTITIONED -- something that only shmem hash tables can be. Note > that ShmemInitHash passes the HASH_ALLOC flag and uses ShmemAllocNoError > as allocation function, which acquires memory from the shared segment. I realize that the code supports this ... but I thought we had established a policy that only the main lock manager's shared hash tables, and not any others, are actually allowed to make use of this functionality. See commit 7c797e7194d969f974abf579cacf30ffdccdbb95. It seems like a dangerous thing to rely on in any case, since we can't predict how much extra shared memory might actually be available. -- Robert Haas EDB: http://www.enterprisedb.com
Re: archive status ".ready" files may be created too early
On Fri, Aug 20, 2021 at 1:29 PM Bossart, Nathan wrote: > Thinking about this stuff further, I was wondering if one way to > handle the bounded shared hash table problem would be to replace the > latest boundary in the map whenever it was full. But at that point, > do we even need a hash table? This led me to revisit the two-element > approach that was discussed upthread. What if we only stored the > earliest and latest segment boundaries at any given time? Once the > earliest boundary is added, it never changes until the segment is > flushed and it is removed. The latest boundary, however, will be > updated any time we register another segment. Once the earliest > boundary is removed, we replace it with the latest boundary. This > strategy could cause us to miss intermediate boundaries, but AFAICT > the worst case scenario is that we hold off creating .ready files a > bit longer than necessary. I think this is a promising approach. We could also have a small fixed-size array, so that we only have to risk losing track of anything when we overflow the array. But I guess I'm still unconvinced that there's a real possibility of genuinely needing multiple elements. Suppose we are thinking of adding a second element to the array (or the hash table). I feel like it's got to be safe to just remove the first one. If not, then apparently the WAL record that caused us to make the first entry isn't totally flushed yet - which I still think is impossible. -- Robert Haas EDB: http://www.enterprisedb.com
Re: archive status ".ready" files may be created too early
On 2021-Aug-20, Bossart, Nathan wrote: > On 8/20/21, 8:29 AM, "Robert Haas" wrote: > > We can't expand the hash table either. It has an initial and maximum > > size of 16 elements, which means it's basically an expensive array, > > and which also means that it imposes a new limit of 16 * > > wal_segment_size on the size of WAL records. If you exceed that limit, > > I think things just go boom... which I think is not acceptable. I > > think we can have records in the multi-GB range if wal_level=logical > > and someone chooses a stupid replica identity setting. > > I was under the impression that shared hash tables could be expanded > as necessary, but from your note and the following comment, that does > not seem to be true: Actually, you were right. Hash tables in shared memory can be expanded. There are some limitations (the hash "directory" is fixed size, which means the hash table gets less efficient if it grows too much), but you can definitely create more hash entries than the initial size. See for example element_alloc(), which covers the case of a hash table being IS_PARTITIONED -- something that only shmem hash tables can be. Note that ShmemInitHash passes the HASH_ALLOC flag and uses ShmemAllocNoError as allocation function, which acquires memory from the shared segment. This is a minor thing -- it doesn't affect the fact that the hash table is possibly being misused and inefficient -- but I thought it was worth pointing out. As an example, consider the LOCK / PROCLOCK hash tables. These can contain more elements than max_backends * max_locks_per_transaction. Those elements consume shared memory from the "allocation slop" in the shared memory segment. It's tough when it happens (as far as I know the memory is never "returned" once such a hash table grows to use that space), but it does work. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 8/20/21, 10:08 AM, "Robert Haas" wrote: > On Fri, Aug 20, 2021 at 12:36 PM Bossart, Nathan wrote: >> If a record spans multiple segments, we only register one segment >> boundary. For example, if I insert a record that starts at segment >> number 1 and stops at 10, I'll insert one segment boundary for segment >> 10. We'll only create .ready files for segments 1 through 9 once this >> record is completely flushed to disk. > > Oh ... OK. So is there any experimental scenario in which the hash > table ends up with more than 1 entry? And if so, how does that happen? I was able to do this by turning synchronous_commit off, increasing wal_buffers substantially, and adding sleeps to XLogWrite(). >> If there isn't a way to ensure that the number of entries we need to >> store is bounded, I'm tempted to propose my original patch [0], which >> just moves .ready file creation to the very end of XLogWrite(). It's >> probably not a complete solution, but it might be better than what's >> there today. > > Doesn't that allocate memory inside a critical section? I would have > thought it would cause an immediate assertion failure. I could probably replace the list with two local variables (start and end segments). Thinking about this stuff further, I was wondering if one way to handle the bounded shared hash table problem would be to replace the latest boundary in the map whenever it was full. But at that point, do we even need a hash table? This led me to revisit the two-element approach that was discussed upthread. What if we only stored the earliest and latest segment boundaries at any given time? Once the earliest boundary is added, it never changes until the segment is flushed and it is removed. The latest boundary, however, will be updated any time we register another segment. Once the earliest boundary is removed, we replace it with the latest boundary. 
This strategy could cause us to miss intermediate boundaries, but AFAICT the worst case scenario is that we hold off creating .ready files a bit longer than necessary. I'll work on a patch to illustrate what I'm thinking. Nathan
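The two-element scheme Nathan describes can be sketched as follows — a hypothetical model, not the actual patch: only the earliest and latest boundaries are kept, intermediate ones are deliberately forgotten, and on removal the latest boundary is promoted into the earliest slot.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogSegNo;
typedef uint64_t XLogRecPtr;

#define NO_BOUNDARY UINT64_MAX	/* hypothetical "empty slot" sentinel */

/* The two tracked boundaries; in the real patch these live in shared
 * memory under a spinlock.  Names are illustrative. */
static XLogSegNo earliestSeg = NO_BOUNDARY;
static XLogRecPtr earliestEnd = 0;
static XLogSegNo latestSeg = NO_BOUNDARY;
static XLogRecPtr latestEnd = 0;

/* Register a cross-segment record ending in segment seg at endpos. */
void
register_boundary(XLogSegNo seg, XLogRecPtr endpos)
{
	if (earliestSeg == NO_BOUNDARY)
	{
		earliestSeg = seg;
		earliestEnd = endpos;
	}

	/* The latest slot is always overwritten, so boundaries between
	 * earliest and latest are forgotten. */
	latestSeg = seg;
	latestEnd = endpos;
}

/* Once the earliest boundary's record is fully flushed, drop it and
 * promote the latest boundary into its place. */
void
remove_earliest_boundary(void)
{
	if (latestSeg != earliestSeg)
	{
		earliestSeg = latestSeg;
		earliestEnd = latestEnd;
	}
	else
	{
		earliestSeg = NO_BOUNDARY;
		latestSeg = NO_BOUNDARY;
	}
}
```

Losing an intermediate boundary only delays notification: .ready files are held back until the (later) earliest boundary's record is flushed, which matches the "worst case" described above.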
Re: archive status ".ready" files may be created too early
On Fri, Aug 20, 2021 at 12:36 PM Bossart, Nathan wrote: > If a record spans multiple segments, we only register one segment > boundary. For example, if I insert a record that starts at segment > number 1 and stops at 10, I'll insert one segment boundary for segment > 10. We'll only create .ready files for segments 1 through 9 once this > record is completely flushed to disk. Oh ... OK. So is there any experimental scenario in which the hash table ends up with more than 1 entry? And if so, how does that happen? > > It's actually not clear to me why we need to track multiple entries > > anyway. The scenario postulated by Horiguchi-san in > > https://www.postgresql.org/message-id/20201014.090628.839639906081252194.horikyota@gmail.com > > seems to require that the write position be multiple segments ahead of > > the flush position, but that seems impossible with the present code, > > because XLogWrite() calls issue_xlog_fsync() at once if the segment is > > filled. So I think, at least with the present code, any record that > > isn't completely flushed to disk has to be at least partially in the > > current segment. And there can be only one record that starts in some > > earlier segment and ends in this one. > > We register the boundaries XLogInsertRecord(), which AFAICT just bumps > the global write request pointer ahead, so I'm not sure we can make > any assumptions about what is written/flushed at that time. (I see > that we do end up calling XLogFlush() for XLOG_SWITCH records in > XLogInsertRecord(), but I don't see any other cases where we actually > write anything in this function.) Am I missing something? Well, I'm not sure. 
But I *think* that the code as it exists today is smart enough not to try to archive a segment that hasn't been completely flushed, and the gap is only that even though the segment might be completely flushed, some portion of the record that is part of a later segment might not be flushed, and thus after a crash we might overwrite the already-flushed contents. The patch can make an implementation choice to do some work at XLogInsertRecord() time if it likes, but there's no real hazard at that point. The hazard only exists, or so I think, once a segment that contains part of the record is fully on disk. But that means, if my previous logic is correct, that the hazard can only exist for at most 1 record at any point in time. > If there isn't a way to ensure that the number of entries we need to > store is bounded, I'm tempted to propose my original patch [0], which > just moves .ready file creation to the very end of XLogWrite(). It's > probably not a complete solution, but it might be better than what's > there today. Doesn't that allocate memory inside a critical section? I would have thought it would cause an immediate assertion failure. -- Robert Haas EDB: http://www.enterprisedb.com
Re: archive status ".ready" files may be created too early
On 8/20/21, 8:29 AM, "Robert Haas" wrote: > On Fri, Aug 20, 2021 at 10:50 AM alvhe...@alvh.no-ip.org > wrote: >> 1. We use a hash table in shared memory. That's great. The part that's >>not so great is that in both places where we read items from it, we >>have to iterate in some way. This seems a bit silly. An array would >>serve us better, if only we could expand it as needed. However, in >>shared memory we can't do that. (I think the list of elements we >>need to memoize is arbitrary long, if enough processes can be writing >>WAL at the same time.) > > We can't expand the hash table either. It has an initial and maximum > size of 16 elements, which means it's basically an expensive array, > and which also means that it imposes a new limit of 16 * > wal_segment_size on the size of WAL records. If you exceed that limit, > I think things just go boom... which I think is not acceptable. I > think we can have records in the multi-GB range of wal_level=logical > and someone chooses a stupid replica identity setting. If a record spans multiple segments, we only register one segment boundary. For example, if I insert a record that starts at segment number 1 and stops at 10, I'll insert one segment boundary for segment 10. We'll only create .ready files for segments 1 through 9 once this record is completely flushed to disk. I was under the impression that shared hash tables could be expanded as necessary, but from your note and the following comment, that does not seem to be true: * Note: for a shared-memory hashtable, nelem needs to be a pretty good * estimate, since we can't expand the table on the fly. But an unshared * hashtable can be expanded on-the-fly, so it's better for nelem to be * on the small side and let the table grow if it's exceeded. An overly * large nelem will penalize hash_seq_search speed without buying much. > It's actually not clear to me why we need to track multiple entries > anyway. 
The scenario postulated by Horiguchi-san in > https://www.postgresql.org/message-id/20201014.090628.839639906081252194.horikyota@gmail.com > seems to require that the write position be multiple segments ahead of > the flush position, but that seems impossible with the present code, > because XLogWrite() calls issue_xlog_fsync() at once if the segment is > filled. So I think, at least with the present code, any record that > isn't completely flushed to disk has to be at least partially in the > current segment. And there can be only one record that starts in some > earlier segment and ends in this one. We register the boundaries in XLogInsertRecord(), which AFAICT just bumps the global write request pointer ahead, so I'm not sure we can make any assumptions about what is written/flushed at that time. (I see that we do end up calling XLogFlush() for XLOG_SWITCH records in XLogInsertRecord(), but I don't see any other cases where we actually write anything in this function.) Am I missing something? > I will be the first to admit that the forced end-of-segment syncs > suck. They often stall every backend in the entire system at the same > time. Everyone fills up the xlog segment really fast and then stalls > HARD while waiting for that sync to happen. So it's arguably better > not to do more things that depend on that being how it works, but I > think needing a variable-size amount of shared memory is even worse. > If we're going to track multiple entries here we need some rule that > bounds how many of them we can need to track. If the number of entries > is defined by the number of segment boundaries that a particular > record crosses, it's effectively unbounded, because right now WAL > records can be pretty much arbitrarily big. If there isn't a way to ensure that the number of entries we need to store is bounded, I'm tempted to propose my original patch [0], which just moves .ready file creation to the very end of XLogWrite().
It's probably not a complete solution, but it might be better than what's there today. Nathan [0] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com
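The fallback Nathan mentions — tracking the written span in plain variables and deferring .ready creation until after the flush — might look roughly like this. Everything here is a stand-in (the real XLogWrite() tracks LSNs and calls issue_xlog_fsync(), and `flush_stub`/`notify_ready_stub` are hypothetical); the point is only that no allocation happens inside the critical section and notification follows the fsync.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogSegNo;

static int	flushed = 0;		/* has the fsync happened yet? */
static int	notify_calls = 0;
static int	notified_after_flush = 0;

static void
flush_stub(void)				/* stand-in for issue_xlog_fsync() */
{
	flushed = 1;
}

static void
notify_ready_stub(XLogSegNo seg)	/* stand-in for creating a .ready file */
{
	(void) seg;
	notify_calls++;
	if (flushed)
		notified_after_flush++;
}

/*
 * Toy model of the idea: the span of segments completed during one
 * XLogWrite() call is carried in two plain variables (no list, so no
 * allocation inside the critical section), and .ready notification
 * happens only after the flush.  Returns the number of segments notified.
 */
int
xlog_write_model(XLogSegNo first_completed, XLogSegNo last_completed)
{
	XLogSegNo	seg;
	int			n = 0;

	/* ... pages are written here; the two parameters model the only
	 * bookkeeping the critical section needs ... */

	flush_stub();

	/* Only now are the completed segments safe to hand to the archiver. */
	for (seg = first_completed; seg <= last_completed; seg++)
	{
		notify_ready_stub(seg);
		n++;
	}
	return n;
}
```

As Robert notes in reply, the original list-based version of this idea risked allocating inside a critical section; the two-variable form sketched here avoids that.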
Re: archive status ".ready" files may be created too early
On Fri, Aug 20, 2021 at 10:50 AM alvhe...@alvh.no-ip.org wrote: > 1. We use a hash table in shared memory. That's great. The part that's > not so great is that in both places where we read items from it, we > have to iterate in some way. This seems a bit silly. An array would > serve us better, if only we could expand it as needed. However, in > shared memory we can't do that. (I think the list of elements we > need to memoize is arbitrarily long, if enough processes can be writing > WAL at the same time.) We can't expand the hash table either. It has an initial and maximum size of 16 elements, which means it's basically an expensive array, and which also means that it imposes a new limit of 16 * wal_segment_size on the size of WAL records. If you exceed that limit, I think things just go boom... which I think is not acceptable. I think we can have records in the multi-GB range if wal_level=logical and someone chooses a stupid replica identity setting. It's actually not clear to me why we need to track multiple entries anyway. The scenario postulated by Horiguchi-san in https://www.postgresql.org/message-id/20201014.090628.839639906081252194.horikyota@gmail.com seems to require that the write position be multiple segments ahead of the flush position, but that seems impossible with the present code, because XLogWrite() calls issue_xlog_fsync() at once if the segment is filled. So I think, at least with the present code, any record that isn't completely flushed to disk has to be at least partially in the current segment. And there can be only one record that starts in some earlier segment and ends in this one. I will be the first to admit that the forced end-of-segment syncs suck. They often stall every backend in the entire system at the same time. Everyone fills up the xlog segment really fast and then stalls HARD while waiting for that sync to happen.
So it's arguably better not to do more things that depend on that being how it works, but I think needing a variable-size amount of shared memory is even worse. If we're going to track multiple entries here we need some rule that bounds how many of them we can need to track. If the number of entries is defined by the number of segment boundaries that a particular record crosses, it's effectively unbounded, because right now WAL records can be pretty much arbitrarily big. -- Robert Haas EDB: http://www.enterprisedb.com
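To put numbers on the concern above, here is a sketch of the bound a fixed-size boundary table would impose *if* one entry were needed per crossed segment. The constants are the defaults (16 entries, 16 MB wal_segment_size), chosen for illustration, not read from a running server.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bound implied by a fixed-size boundary table if each crossed segment
 * consumed one of nentries slots: nentries * wal_segment_size bytes.
 */
static uint64_t
boundary_capped_record_len(uint64_t nentries, uint64_t wal_segment_size)
{
	return nentries * wal_segment_size;
}
```

With 16 entries and 16 MB segments that is 256 MB — far below the multi-GB records logical decoding can produce, hence the objection; note that elsewhere in the thread it is pointed out that only a record's *ending* segment is actually registered, so the real patch does not hit this limit.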
Re: archive status ".ready" files may be created too early
Two things. 1. We use a hash table in shared memory. That's great. The part that's not so great is that in both places where we read items from it, we have to iterate in some way. This seems a bit silly. An array would serve us better, if only we could expand it as needed. However, in shared memory we can't do that. (I think the list of elements we need to memoize is arbitrarily long, if enough processes can be writing WAL at the same time.) Now that I think about this again, maybe it's limited by NUM_XLOGINSERT_LOCKS, since there can only be that many records being written down at a time ... 2. There is a new LWLock acquisition that may be a new contention point. We acquire the lock in these cases: - WAL record insert, when a segment boundary is crossed (rare enough). - XLogWrite, when a segment needs to be notified. Looking again, I think the last point might be a problem actually, because XLogWrite is called with WALWriteLock held. Maybe we should take the NotifySegmentsReadyForArchive() call outside the section locked by WALWriteLock (so put it in XLogWrite callers instead of XLogWrite itself). -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Cómo ponemos nuestros dedos en la arcilla del otro. Eso es la amistad; jugar al alfarero y ver qué formas se pueden sacar del otro" (C. Halloway en La Feria de las Tinieblas, R. Bradbury)
Re: archive status ".ready" files may be created too early
In v12 I moved the code around a bit and reworded some comments. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ >From 7d1475578431e265a5e7f8b94a6b0791b68a9a31 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 17 Aug 2021 03:52:43 + Subject: [PATCH v12] Avoid creating archive status ".ready" files too early WAL records may span multiple segments, but XLogWrite() does not wait for the entire record to be written out to disk before creating archive status files. Instead, as soon as the last WAL page of the segment is written, the archive status file is created, and the archiver may process it. If PostgreSQL crashes before it is able to write and flush the rest of the record (in the next WAL segment), the wrong version of the first segment file lingers in the archive, which causes operations such as point-in-time restores to fail. To fix this, keep track of records that span across segments and ensure that segments are only marked ready-for-archival once such records have been completely written to disk. XXX Note that this may add a new LWLock acquisition in the transaction commit path. (It doesn't happen in all cases.) 
Author: Nathan Bossart Reviewed-by: Kyotaro Horiguchi Reviewed-by: Ryo Matsumura Reviewed-by: Andrey Borodin Discussion: https://postgr.es/m/cbddfa01-6e40-46bb-9f98-9340f4379...@amazon.com --- src/backend/access/transam/timeline.c| 2 +- src/backend/access/transam/xlog.c| 307 ++- src/backend/access/transam/xlogarchive.c | 17 +- src/backend/postmaster/walwriter.c | 7 + src/backend/replication/walreceiver.c| 6 +- src/backend/storage/lmgr/lwlocknames.txt | 1 + src/include/access/xlog.h| 1 + src/include/access/xlogarchive.h | 4 +- src/include/access/xlogdefs.h| 1 + src/tools/pgindent/typedefs.list | 1 + 10 files changed, 323 insertions(+), 24 deletions(-) diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index 8d0903c175..acd5c2431d 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, if (XLogArchivingActive()) { TLHistoryFileName(histfname, newTLI); - XLogArchiveNotify(histfname); + XLogArchiveNotify(histfname, true); } } diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e51a7a749d..01da4ab7b8 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -512,6 +512,16 @@ typedef enum ExclusiveBackupState */ static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE; +/* + * Entries for SegmentBoundaryMap. Each such entry represents a WAL + * record that ends in endpos and crosses the WAL segment boundary. + */ +typedef struct SegmentBoundaryEntry +{ + XLogSegNo seg; /* must be first */ + XLogRecPtr endpos; +} SegmentBoundaryEntry; + /* * Shared state data for WAL insertion. */ @@ -723,6 +733,12 @@ typedef struct XLogCtlData */ XLogRecPtr lastFpwDisableRecPtr; + /* + * The last segment we've marked ready for archival. Protected by + * info_lck. 
+ */ + XLogSegNo lastNotifiedSeg; + slock_t info_lck; /* locks shared variables shown above */ } XLogCtlData; @@ -736,6 +752,23 @@ static WALInsertLockPadded *WALInsertLocks = NULL; */ static ControlFileData *ControlFile = NULL; +/* + * Segment boundary map. + * + * This hash table tracks WAL records that cross WAL segment boundaries. + * For WAL archiving it is critical not to mark a segment as archivable + * until both halves of the record are flushed: otherwise if we crash + * after archiving the segment containing the first half and before + * flushing the second half of the record, then after recovery the + * primary would overwrite the first half with new data, making the + * segment archived prior to the crash a bogus copy. + * + * XXX note that an equivalent problem exists with streaming replication. + * + * Protected by SegmentBoundaryLock. + */ +static HTAB *SegmentBoundaryMap = NULL; + /* * Calculate the amount of space left on the page after 'endptr'. Beware * multiple evaluation! @@ -920,6 +953,13 @@ static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo, XLogSegNo *endlogSegNo); static void UpdateLastRemovedPtr(char *filename); static void ValidateXLOGDirectoryStructure(void); +static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr pos); +static XLogSegNo GetLastNotifiedSegment(void); +static void SetLastNotifiedSegment(XLogSegNo seg); +static bool GetLatestSegmentBoundary(XLogSegNo last_notified, + XLogRecPtr flushed, + XLogSegNo *latest_boundary_seg); +static void RemoveSegmentBoundariesUpTo(XLogSegNo seg); static void CleanupBackupHistory(void); static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force); static XLogRecord
Re: archive status ".ready" files may be created too early
On 2021-Aug-18, Bossart, Nathan wrote: > As long as XLogBackgroundFlush() found work to do, it would take care > of notifying, but I don't think we can depend on that. However, since > we're primarily using the WAL writer to take care of the case when the > record has already been flushed, notifying beforehand seems fine to > me. If XLogBackgroundFlush() does end up calling XLogWrite(), it'll > call it again, anyway. Agreed. > In the attached patch, I modified XLogInsertRecord() to simply set the > latch if we detect that flushRecPtr has advanced. Right, that's what I was thinking. I modified that slightly to use LogwrtResult.Flush (which should be fresh enough) instead of calling GetFlushRecPtr again, which saves a bit. I also changed it to > instead of >=, because if I understand you correctly we only care to notify if the flush pointer advanced, not in the case it stayed the same. I made a few other cosmetic tweaks -- added comment to SegmentBoundaryEntry and renamed the 'pos' to 'endpos'; renamed argument 'notify' of XLogArchiveNotify to 'nudge' (because having two different "notify" concepts in that function seemed confusing). -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "It takes less than 2 seconds to get to 78% complete; that's a good sign. A few seconds later it's at 90%, but it seems to have stuck there. Did somebody make percentages logarithmic while I wasn't looking?" http://smylers.hates-software.com/2005/09/08/1995c749.html >From 39190a4e33a4cb35402ec9451ce6678f6e9aa494 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 17 Aug 2021 03:52:43 + Subject: [PATCH v10] Avoid creating archive status ".ready" files too early. WAL records may span multiple segments, but XLogWrite() does not wait for the entire record to be written out to disk before creating archive status files. Instead, as soon as the last WAL page of the segment is written, the archive status file will be created. 
If PostgreSQL crashes before it is able to write the rest of the record, it will end up reusing segments that have already been marked as ready-for-archival. However, the archiver process may have already processed the old version of the segment, so the wrong version of the segment may be backed-up. This backed-up segment will cause operations such as point-in-time restores to fail. To fix this, we keep track of records that span across segments and ensure that segments are only marked ready-for-archival once such records have been completely written to disk. --- src/backend/access/transam/timeline.c| 2 +- src/backend/access/transam/xlog.c| 291 ++- src/backend/access/transam/xlogarchive.c | 17 +- src/backend/postmaster/walwriter.c | 7 + src/backend/replication/walreceiver.c| 6 +- src/backend/storage/lmgr/lwlocknames.txt | 1 + src/include/access/xlog.h| 1 + src/include/access/xlogarchive.h | 4 +- src/include/access/xlogdefs.h| 1 + src/tools/pgindent/typedefs.list | 1 + 10 files changed, 309 insertions(+), 22 deletions(-) diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index 8d0903c175..acd5c2431d 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, if (XLogArchivingActive()) { TLHistoryFileName(histfname, newTLI); - XLogArchiveNotify(histfname); + XLogArchiveNotify(histfname, true); } } diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e51a7a749d..d2ccf2a7bb 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -512,6 +512,16 @@ typedef enum ExclusiveBackupState */ static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE; +/* + * Entries for SegmentBoundaryMap. Each such entry represents a WAL + * record that ends in endpos and crosses a WAL segment boundary. 
+ */ +typedef struct SegmentBoundaryEntry +{ + XLogSegNo seg; /* must be first */ + XLogRecPtr endpos; +} SegmentBoundaryEntry; + /* * Shared state data for WAL insertion. */ @@ -723,6 +733,12 @@ typedef struct XLogCtlData */ XLogRecPtr lastFpwDisableRecPtr; + /* + * The last segment we've marked ready for archival. Protected by + * info_lck. + */ + XLogSegNo lastNotifiedSeg; + slock_t info_lck; /* locks shared variables shown above */ } XLogCtlData; @@ -736,6 +752,12 @@ static WALInsertLockPadded *WALInsertLocks = NULL; */ static ControlFileData *ControlFile = NULL; +/* + * Segment boundary map, used for marking segments as ready for archival. + * Protected by SegmentBoundaryLock. + */ +static HTAB *SegmentBoundaryMap = NULL; + /* * Calculate the amount of space left on the page after 'endptr'. Beware * multiple evaluation! @@ -962,6 +984,13 @@ static XLogRecPtr
Re: archive status ".ready" files may be created too early
On 2021-Aug-18, Bossart, Nathan wrote: > I'll add it after XLogBackgroundFlush(). I was wondering which would be better: before or after. XLogBackgroundFlush would do it anyway, so if you do it after then it's not clear to me that it'd do anything (I mean we should not do any new calls of NotifySegmentsReadyForArchive and just rely on the one in XLogBackgroundFlush -> XLogWrite). The advantage of doing NotifySegmentsReadyForArchive before XLogBackgroundFlush is that the files would be created sooner, so the archiver can be working in parallel while walwriter does its other thing; then we'd reach the NotifySegmentsReadyForArchive in XLogBackgroundFlush and it'd find nothing to do most of the time, which is just fine. > I think we'll also want to set the WAL writer's latch in case it is > hibernating. Yeah. (That's another advantage of doing it in walwriter rather than bgwriter: we don't publish bgwriter's latch anywhere AFAICS). > Another approach could be to keep the NotifySegmentsReadyForArchive() > call in XLogInsertRecord(), but only call it if the flush pointer is > beyond the boundary we just registered. Or we could only set the > latch in XLogInsertRecord() if we detect that the flush pointer has > advanced. Hmm. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 8/18/21, 10:48 AM, "alvhe...@alvh.no-ip.org" wrote: > On 2021-Aug-18, alvhe...@alvh.no-ip.org wrote: > >> I realize this means there's a contradiction with my previous argument, >> in that synchronous transaction commit calls XLogWrite at some point, so >> we *are* putting the client-connected backend in charge of creating the >> notify files. However, that only happens on transaction commit, where >> we already accept responsibility for the WAL flush, not on each >> individual XLOG record insert; also, the WAL writer will take care of it >> sometimes, for transactions that are long-enough lived. > > Eh. I just said WAL writer will sometimes do it, and that's true > because it'll occur in XLogBackgroundFlush. But upthread I wimped out > of having WAL writer call NotifySegmentsReadyForArchive() and instead > opined to give responsibility to bgwriter. However, thinking about it > again, maybe it does make sense to have walwriter do it too directly. > This causes no harm to walwriter's time constraints, since *it will have > to do it via XLogBackgroundFlush anyway*. I'll add it after XLogBackgroundFlush(). I think we'll also want to set the WAL writer's latch in case it is hibernating. Another approach could be to keep the NotifySegmentsReadyForArchive() call in XLogInsertRecord(), but only call it if the flush pointer is beyond the boundary we just registered. Or we could only set the latch in XLogInsertRecord() if we detect that the flush pointer has advanced. Nathan
Re: archive status ".ready" files may be created too early
On 2021-Aug-18, alvhe...@alvh.no-ip.org wrote: > I realize this means there's a contradiction with my previous argument, > in that synchronous transaction commit calls XLogWrite at some point, so > we *are* putting the client-connected backend in charge of creating the > notify files. However, that only happens on transaction commit, where > we already accept responsibility for the WAL flush, not on each > individual XLOG record insert; also, the WAL writer will take care of it > sometimes, for transactions that are long-enough lived. Eh. I just said WAL writer will sometimes do it, and that's true because it'll occur in XLogBackgroundFlush. But upthread I wimped out of having WAL writer call NotifySegmentsReadyForArchive() and instead opined to give responsibility to bgwriter. However, thinking about it again, maybe it does make sense to have walwriter do it too directly. This causes no harm to walwriter's time constraints, since *it will have to do it via XLogBackgroundFlush anyway*. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 2021-Aug-18, Bossart, Nathan wrote: > On 8/18/21, 10:06 AM, "alvhe...@alvh.no-ip.org" > wrote: > > So that comment suggests that we should give the responsibility to bgwriter. > > This seems good enough to me. I suppose if bgwriter has a long run of > > buffers to write it could take a little bit of time (a few hundred > > milliseconds?) but I think that should be okay. > > Do you think bgwriter should be the only caller of > NotifySegmentsReadyForArchive(), or should we still have XLogWrite() > call it? I think XLogWrite should absolutely be the primary caller. The one in bgwriter should be a backstop for the case you describe where the flush pointer advanced past the registration point in XLogInsertRecord. I realize this means there's a contradiction with my previous argument, in that synchronous transaction commit calls XLogWrite at some point, so we *are* putting the client-connected backend in charge of creating the notify files. However, that only happens on transaction commit, where we already accept responsibility for the WAL flush, not on each individual XLOG record insert; also, the WAL writer will take care of it sometimes, for transactions that are long-enough lived. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "How strange it is to find the words "Perl" and "saner" in such close proximity, with no apparent sense of irony. I doubt that Larry himself could have managed it." (ncm, http://lwn.net/Articles/174769/)
Re: archive status ".ready" files may be created too early
On 8/18/21, 10:06 AM, "alvhe...@alvh.no-ip.org" wrote: > So that comment suggests that we should give the responsibility to bgwriter. > This seems good enough to me. I suppose if bgwriter has a long run of > buffers to write it could take a little bit of time (a few hundred > milliseconds?) but I think that should be okay. Do you think bgwriter should be the only caller of NotifySegmentsReadyForArchive(), or should we still have XLogWrite() call it? Nathan
Re: archive status ".ready" files may be created too early
On 2021-Aug-17, alvhe...@alvh.no-ip.org wrote: > However, why do it in a WAL-producing client-connected backend? It > strikes me as a bad thing to do, because you are possibly causing delays > for client-connected backends. I suggest that we should give this task > to the WAL writer process -- say, have XLogBackgroundFlush do it. Reading the comments on walwriter.c I am hesitant of having walwriter do it: > * Because the walwriter's cycle is directly linked to the maximum delay > * before async-commit transactions are guaranteed committed, it's probably > * unwise to load additional functionality onto it. For instance, if you've > * got a yen to create xlog segments further in advance, that'd be better done > * in bgwriter than in walwriter. So that comment suggests that we should give the responsibility to bgwriter. This seems good enough to me. I suppose if bgwriter has a long run of buffers to write it could take a little bit of time (a few hundred milliseconds?) but I think that should be okay. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "No renuncies a nada. No te aferres a nada."
Re: archive status ".ready" files may be created too early
On 2021-Aug-17, Bossart, Nathan wrote: > On 8/17/21, 2:13 PM, "alvhe...@alvh.no-ip.org" > wrote: > > > So, why isn't it that we call Register in XLogInsertRecord, and > > Notify in XLogWrite? > > We do. However, we also call NotifySegmentsReadyForArchive() in > XLogInsertRecord() to handle the probably-unlikely scenario that the > flush pointer has already advanced past the to-be-registered boundary. > This ensures that the .ready files are created as soon as possible. I see. I have two thoughts on that. First, why not do it outside the block that tests for crossing a segment boundary? If that's a good thing to do, then we should do it always. However, why do it in a WAL-producing client-connected backend? It strikes me as a bad thing to do, because you are possibly causing delays for client-connected backends. I suggest that we should give this task to the WAL writer process -- say, have XLogBackgroundFlush do it. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "All rings of power are equal, But some rings of power are more equal than others." (George Orwell's The Lord of the Rings)
Re: archive status ".ready" files may be created too early
On 8/17/21, 2:13 PM, "alvhe...@alvh.no-ip.org" wrote: > On 2021-Aug-17, Bossart, Nathan wrote: > >> The main reason for registering the boundaries in XLogInsertRecord() >> is that it has the required information about the WAL record >> boundaries. I do not think that XLogWrite() has this information. > > Doh, of course. So, why isn't it that we call Register in > XLogInsertRecord, and Notify in XLogWrite? We do. However, we also call NotifySegmentsReadyForArchive() in XLogInsertRecord() to handle the probably-unlikely scenario that the flush pointer has already advanced past the to-be-registered boundary. This ensures that the .ready files are created as soon as possible. Nathan
Re: archive status ".ready" files may be created too early
On 2021-Aug-17, Bossart, Nathan wrote: > The main reason for registering the boundaries in XLogInsertRecord() > is that it has the required information about the WAL record > boundaries. I do not think that XLogWrite() has this information. Doh, of course. So, why isn't it that we call Register in XLogInsertRecord, and Notify in XLogWrite? -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "La fuerza no está en los medios físicos sino que reside en una voluntad indomable" (Gandhi)
Re: archive status ".ready" files may be created too early
The thing I still don't understand about this patch is why we call RegisterSegmentBoundaryEntry and NotifySegmentsReadyForArchive in XLogInsertRecord. My mental model of this would have these routines called only in XLogFlush / XLogWrite, which are conceptually about transferring data from in-memory WAL buffers into the on-disk WAL store (pg_xlog files). As I understand it, XLogInsertRecord is about copying bytes from the high-level operation (heap insert etc.) into WAL buffers. So doing the register/notify dance in both places should be redundant and unnecessary.

In the NotifySegmentsReadyForArchive() call at the bottom of XLogWrite, we use flushed=InvalidXLogRecPtr. Why is that? Surely we can use LogwrtResult.Flush, just like in the other callsite there? (If we're covering for somebody advancing FlushRecPtr concurrently, I think we should add a comment to explain that. But why do we need to do that in the first place?)

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"The eagle never lost so much time, as when he submitted to learn of the crow." (William Blake)
Re: archive status ".ready" files may be created too early
On 2021-Aug-17, Bossart, Nathan wrote: > I think we are in agreement. If we assume that the flush pointer > jumps along record boundaries and segment boundaries, the solution > would be to avoid using the flush pointer when it points to a segment > boundary (given that the segment boundary is not also a record > boundary). Instead, we'd only send up to the start position of the > last record in the segment to standbys. Agreed. An implementation for that would be to test the flush pointer for it being a segment boundary, and in that case we (acquire segment boundary lock and) test for presence in the segment boundary map. If present, then retreat the pointer to the record's start address. This means that we acquire the segment boundary lock rarely. I was concerned that we'd need to acquire it every time we read the flush pointer, which would have been a disaster. Thanks -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 8/17/21, 10:44 AM, "alvhe...@alvh.no-ip.org" wrote: > On 2021-Aug-17, Bossart, Nathan wrote: >> I've been thinking about the next steps for this one, too. ISTM we'll >> need to basically assume that the flush pointer jumps along record >> boundaries except for the cross-segment records. I don't know if that >> is the safest assumption, but I think the alternative involves >> recording every record boundary in the map. > > I'm not sure I understand your idea correctly. Perhaps another solution > is to assume that the flush pointer jumps along record boundaries > *including* for cross-segment records. The problem stems precisely from > the fact that we set the flush pointer at segment boundaries, even when > they aren't record boundary. I think we are in agreement. If we assume that the flush pointer jumps along record boundaries and segment boundaries, the solution would be to avoid using the flush pointer when it points to a segment boundary (given that the segment boundary is not also a record boundary). Instead, we'd only send up to the start position of the last record in the segment to standbys. Nathan
Re: archive status ".ready" files may be created too early
On 2021-Aug-17, Bossart, Nathan wrote:
> On 8/16/21, 5:09 PM, "alvhe...@alvh.no-ip.org" wrote:
> > The reason for the latter is that I suspect the segment boundary map
> > will also have a use to fix the streaming replication issue I mentioned
> > earlier in the thread. This also makes me think that we'll want the wal
> > record *start* address to be in the hash table too, not just its *end*
> > address. So we'll use the start-1 as position to send, rather than the
> > end-of-segment when GetFlushRecPtr() returns that.
>
> I've been thinking about the next steps for this one, too. ISTM we'll
> need to basically assume that the flush pointer jumps along record
> boundaries except for the cross-segment records. I don't know if that
> is the safest assumption, but I think the alternative involves
> recording every record boundary in the map.

I'm not sure I understand your idea correctly. Perhaps another solution is to assume that the flush pointer jumps along record boundaries *including* for cross-segment records. The problem stems precisely from the fact that we set the flush pointer at segment boundaries, even when they aren't record boundaries.

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
Bob [Floyd] used to say that he was planning to get a Ph.D. by the "green stamp method," namely by saving envelopes addressed to him as 'Dr. Floyd'. After collecting 500 such letters, he mused, a university somewhere in Arizona would probably grant him a degree. (Don Knuth)
Re: archive status ".ready" files may be created too early
So why do we call this structure "record boundary map", when the boundaries it refers to are segment boundaries? I think we should call it "segment boundary map" instead ... and also I think we should use that name in the lock that protects it, so instead of ArchNotifyLock, it could be SegmentBoundaryLock or perhaps WalSegmentBoundaryLock.

The reason for the latter is that I suspect the segment boundary map will also have a use to fix the streaming replication issue I mentioned earlier in the thread. This also makes me think that we'll want the wal record *start* address to be in the hash table too, not just its *end* address. So we'll use the start-1 as position to send, rather than the end-of-segment when GetFlushRecPtr() returns that.

As for 0x, I think it would be cleaner to do a #define MaxXLogSegNo with that value in the line immediately after typedef XLogSegNo, rather than use the numerical value directly in the assignment.

Typo in comment atop RemoveRecordBoundariesUpTo: it reads "up to an", should read "up to and".

I think the API of GetLatestRecordBoundarySegment would be better by returning the boolean and having the segment as an out argument. Then the caller could be written more cleanly:

	if (GetLatestRecordBoundarySegment(last_notified, flushed, &latest_boundary_segment))
	{
		SetLastNotified( ... );
		RemoveRecordBoundaries( ... );
		LWLockRelease( ... );
		for (..)
			XLogArchiveNotifySeg(...);
		PgArchWakeup();
	}
	else
		LWLockRelease(...);

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"La virtud es el justo medio entre dos defectos" (Aristóteles)
Re: archive status ".ready" files may be created too early
At Fri, 6 Aug 2021 00:21:34 +, "Bossart, Nathan" wrote in
> On 8/5/21, 12:39 AM, "Kyotaro Horiguchi" wrote:
> > Honestly I don't like having this initialization in XLogWrite. We
> > should and I think can initialize it earlier. It seems to me the most
> > appropriate timing to initialize the variable is just before running
> > the end-of-recovery checkpoint. Since StartupXLOG knows the first
> > segment to write. If it were set to 0, that doesn't matter at all.
> > We can get rid of the new symbol by doing this.
>
> This seems like a good idea to me. I made this change in v5. I
> performed some basic testing, and it seems to reliably initialize
> lastNotifiedSeg correctly.
>
> > +	/*
> > +	 * EndOfLog resides on the next segment of the last finished one. Set the
> > +	 * last finished segment as lastNotifiedSeg now. In the case where the
> > +	 * last crash has left the last several segments not being marked as
> > +	 * .ready, the checkpoint just after does that for all finished segments.
> > +	 * There's a corner case where the checkpoint advances segment, but that
> > +	 * ends up at most with a duplicate archive notification.
>
> I'm not quite following the corner case you've described here. Is it
> possible that the segment that EndOfLog points to will be eligible for
> removal after the checkpoint?

Archiving doesn't immediately mean removal. A finished segment ought to be archived right away. Since the EndOfLog segment must not get marked .ready, setting lastNotifiedSeg to the previous segment is quite right, but if the end-of-recovery checkpoint advances the segment, EndOfLog is marked .ready at the XLogFlush just after. But, sorry, what I forgot at the time was that the checkpoint also moves lastNotifiedSeg. So, sorry, that corner case does not exist.

> In v5 of the patch, I've also added an extra call to
> NotifySegmentsReadyForArchive() in the same place we previously
> created the .ready files.
> I think this helps notify archiver sooner
> in certain cases (e.g., asynchronous commit).

In v5, NotifySegmentsReadyForArchive() still holds ArchNotifyLock including the .ready file creations. Since the notification loop doesn't need the hash itself, the loop can be taken out of the lock section.

current:

	LWLockAcquire(ArchNotifyLock, LW_EXCLUSIVE);
	last_notified = GetLastNotifiedSegment();
	latest_boundary_seg = GetLatestRecordBoundarySegment(last_notified, flushed, &found);
	if (found)
	{
		SetLastNotifiedSegment(latest_boundary_seg - 1);
		for (seg = last_notified + 1; seg < latest_boundary_seg; seg++)
			XLogArchiveNotifySeg(seg, false);
		RemoveRecordBoundariesUpTo(latest_boundary_seg);
		PgArchWakeup();
	}
	LWLockRelease(ArchNotifyLock);

But we can release the lock earlier:

	LWLockAcquire(ArchNotifyLock, LW_EXCLUSIVE);
	last_notified = GetLastNotifiedSegment();
	latest_boundary_seg = GetLatestRecordBoundarySegment(last_notified, flushed, &found);
	if (found)
	{
		SetLastNotifiedSegment(latest_boundary_seg - 1);
		RemoveRecordBoundariesUpTo(latest_boundary_seg);
	}
	LWLockRelease(ArchNotifyLock);
	if (found)
	{
		for (seg = last_notified + 1; seg < latest_boundary_seg; seg++)
			XLogArchiveNotifySeg(seg, false);
		PgArchWakeup();
	}

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: archive status ".ready" files may be created too early
At Thu, 5 Aug 2021 05:15:04 +, "Bossart, Nathan" wrote in
> On 8/4/21, 9:05 PM, "Kyotaro Horiguchi" wrote:
> > By the way about the v3 patch,
> >
> > +#define InvalidXLogSegNo ((XLogSegNo) 0x)
> >
> > Like InvalidXLogRecPtr, the first valid segment number is 1 so we can
> > use 0 as InvalidXLogSegNo.
>
> It's been a while since I wrote this, but if I remember correctly, the
> issue with using 0 is that we could end up initializing
> lastNotifiedSeg to InvalidXLogSegNo in XLogWrite(). Eventually, we'd
> initialize it to 1, but we will have skipped creating the .ready file
> for the first segment.

Maybe this?

+	SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1);

Hmm. Theoretically 0 is invalid as a segment number. So we'd better not use 0 as a valid value of lastNotifiedSeg.

Honestly I don't like having this initialization in XLogWrite. We should and I think can initialize it earlier. It seems to me the most appropriate timing to initialize the variable is just before running the end-of-recovery checkpoint, since StartupXLOG knows the first segment to write. If it were set to 0, that doesn't matter at all. We can get rid of the new symbol by doing this.

Maybe something like this:

> 	{
> 		/*
> 		 * There is no partial block to copy. Just set InitializedUpTo, and
> 		 * let the first attempt to insert a log record to initialize the next
> 		 * buffer.
> 		 */
> 		XLogCtl->InitializedUpTo = EndOfLog;
> 	}
>
+	/*
+	 * EndOfLog resides on the next segment of the last finished one. Set the
+	 * last finished segment as lastNotifiedSeg now. In the case where the
+	 * last crash has left the last several segments not being marked as
+	 * .ready, the checkpoint just after does that for all finished segments.
+	 * There's a corner case where the checkpoint advances segment, but that
+	 * ends up at most with a duplicate archive notification.
+	 */
+	XLByteToSeg(EndOfLog, EndOfLogSeg, wal_segment_size);
+	Assert(EndOfLogSeg > 0);
+	SetLastNotifiedSegment(EndOfLogSeg - 1);
+
> 	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;

Does this make sense?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: archive status ".ready" files may be created too early
On 8/4/21, 9:05 PM, "Kyotaro Horiguchi" wrote: > By the way about the v3 patch, > > +#define InvalidXLogSegNo ((XLogSegNo) 0x) > > Like InvalidXLogRecPtr, the first valid segment number is 1 so we can > use 0 as InvalidXLogSegNo. It's been a while since I wrote this, but if I remember correctly, the issue with using 0 is that we could end up initializing lastNotifiedSeg to InvalidXLogSegNo in XLogWrite(). Eventually, we'd initialize it to 1, but we will have skipped creating the .ready file for the first segment. Nathan
Re: archive status ".ready" files may be created too early
By the way about the v3 patch, +#define InvalidXLogSegNo ((XLogSegNo) 0x) Like InvalidXLogRecPtr, the first valid segment number is 1 so we can use 0 as InvalidXLogSegNo. BootStrapXLOG(): /* Create first XLOG segment file */ openLogFile = XLogFileInit(1); KeepLogSeg(): /* avoid underflow, don't go below 1 */ if (currSegNo <= keep_segs) segno = 1; regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: archive status ".ready" files may be created too early
At Tue, 3 Aug 2021 21:32:18 +, "Bossart, Nathan" wrote in > On 8/2/21, 7:37 PM, "Kyotaro Horiguchi" wrote: > > I'm afraid that using hash to store boundary info is too much. Isn't a > > ring buffer enough for this use? In that case it is enough to > > remember only the end LSN of the segment spanning records. It is easy > > to expand the buffer if needed. > > I agree that the hash table requires a bit more memory than what is > probably necessary, but I'm not sure I agree that maintaining a custom > data structure to save a few kilobytes of memory is worth the effort. Memory is one of my concerns, but the more significant point was the CPU cycles required by GetLatestRecordBoundarySegment. So I don't mind it using a hash if the loop on the hash didn't block other backends. In addition to that, while NotifySegmentsReadyForArchive() is notifying pending segments, other backends that simultaneously reach there are blocked until the notification, including file creation, finishes. I don't think that's great. Couldn't we set lastNotifiedSegment before the loop? At the moment a backend decides to notify some segments, others no longer need to consider those segments. Even if the backend crashes meanwhile, as you mentioned below, it's safe since the unnotified segments are notified after restart. > > @@ -2602,7 +2855,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) > > LogwrtResult.Flush = LogwrtResult.Write; > > /* end of page */ > > > > if (XLogArchivingActive()) > > - XLogArchiveNotifySeg(openLogSegNo); > > + > > SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1); > > > > Is it safe? If the server didn't notify of WAL files for the recent 3 > > finished segments in the previous server life, they need to be > > archived in this lifetime. But this may omit all of the three. > > (I didn't confirm that behavior..) > > I tested this scenario out earlier [0]. It looks like the call to > XLogArchiveCheckDone() in RemoveOldXlogFiles() will take care of > creating any .ready files we missed. 
Yeah, I recalled that behavior. In that case crash recovery reads up to just before the last (continued) record in the last finished segment. On the other hand, if crash recovery was able to read that record, it's safe to archive the last segment immediately after recovery. So that behavior is safe. Thanks! > >> I believe my worry was that we'd miss notifying a segment as soon as > >> possible if the record was somehow flushed prior to registering the > >> record boundary in the map. If that's actually impossible, then I > >> would agree that the extra call to NotifySegmentsReadyForArchive() is > >> unnecessary. > > > > I don't think that XLogWrite(up to LSN=X) can happen before > > XLogInsert(endpos = X) ends. > > Is there anything preventing that from happening? At the location > where we are registering the record boundary, we've already called > CopyXLogRecordToWAL(), and neither the WAL insertion lock nor the > WALWriteLock are held. Even if we register the boundary before > updating the shared LogwrtRqst.Write, there's a chance that someone > else has already moved it ahead and called XLogWrite(). I think the > worst case scenario is that we hold off creating .ready files longer > than necessary, but IMO that's still a worthwhile thing to do. Oh, boundary registration actually happens after an insertion ends (but before XLogInsert ends). The missed segment is never processed due to the qualification by lastNotifiedSeg. Would it work for RegisterRecordBoundaryEntry to omit registering the boundary if it finds lastNotifiedSeg has gone too far? > Nathan > > [0] https://postgr.es/m/DA71434B-7340-4984-9B91-F085BC47A778%40amazon.com > regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: archive status ".ready" files may be created too early
On 8/2/21, 7:37 PM, "Kyotaro Horiguchi" wrote: > I'm afraid that using hash to store boundary info is too much. Isn't a > ring buffer enough for this use? In that case it is enough to > remember only the end LSN of the segment spanning records. It is easy > to expand the buffer if needed. I agree that the hash table requires a bit more memory than what is probably necessary, but I'm not sure I agree that maintaining a custom data structure to save a few kilobytes of memory is worth the effort. > @@ -2602,7 +2855,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) > LogwrtResult.Flush = LogwrtResult.Write; > /* end of page */ > > if (XLogArchivingActive()) > - XLogArchiveNotifySeg(openLogSegNo); > + > SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1); > > Is it safe? If the server didn't notify of WAL files for the recent 3 > finished segments in the previous server life, they need to be > archived in this lifetime. But this may omit all of the three. > (I didn't confirm that behavior..) I tested this scenario out earlier [0]. It looks like the call to XLogArchiveCheckDone() in RemoveOldXlogFiles() will take care of creating any .ready files we missed. >> I believe my worry was that we'd miss notifying a segment as soon as >> possible if the record was somehow flushed prior to registering the >> record boundary in the map. If that's actually impossible, then I >> would agree that the extra call to NotifySegmentsReadyForArchive() is >> unnecessary. > > I don't think that XLogWrite(up to LSN=X) can happen before > XLogInsert(endpos = X) ends. Is there anything preventing that from happening? At the location where we are registering the record boundary, we've already called CopyXLogRecordToWAL(), and neither the WAL insertion lock nor the WALWriteLock are held. Even if we register the boundary before updating the shared LogwrtRqst.Write, there's a chance that someone else has already moved it ahead and called XLogWrite(). 
I think the worst case scenario is that we hold off creating .ready files longer than necessary, but IMO that's still a worthwhile thing to do. Nathan [0] https://postgr.es/m/DA71434B-7340-4984-9B91-F085BC47A778%40amazon.com
Re: archive status ".ready" files may be created too early
At Mon, 2 Aug 2021 23:28:19 +, "Bossart, Nathan" wrote in > On 8/2/21, 2:42 PM, "Alvaro Herrera" wrote: > > I think it's okay to make lastNotifiedSeg protected by just info_lck, > > and RecordBoundaryMap protected by just ArchNotifyLock. It's okay to > > acquire the spinlock inside the lwlock-protected area, as long as we > > make sure never to do the opposite. (And we sure don't want to hold > > info_lck long enough that a LWLock acquisition would occur in the > > meantime). So I modified things that way, and also added another > > function to set the seg if it's unset, with a single spinlock > > acquisition (rather than acquire, read, release, acquire, set, > > release, which sounds like it would have trouble behaving.) > > The patch looks good to me. + for (seg = flushed_seg; seg > last_notified; seg--) + { + RecordBoundaryEntry *entry; + + entry = (RecordBoundaryEntry *) hash_search(RecordBoundaryMap, + (void *) , HASH_FIND, I'm afraid that using hash to store boundary info is too much. Isn't a ring buffer enough for this use? In that case it is enough to remember only the end LSN of the segment spanning records. It is easy to expand the buffer if needed. + if (!XLogSegNoIsInvalid(latest_boundary_seg)) It is a matter of taste, but I see latest_boundary_seg != InvalidXLogSegNo more frequently, maybe to avoid double negation. @@ -1167,10 +1195,33 @@ XLogInsertRecord(XLogRecData *rdata, SpinLockRelease(>info_lck); } + /* +* Record the record boundary if we crossed the segment boundary. This is ... + XLByteToSeg(StartPos, StartSeg, wal_segment_size); + XLByteToSeg(EndPos, EndSeg, wal_segment_size); + + if (StartSeg != EndSeg && XLogArchivingActive()) + { The immediately preceding if block is for cross-page records, so we can reduce the overhead of the above calculations by moving them into that if block. +RegisterRecordBoundaryEntry(XLogSegNo seg, XLogRecPtr pos) The seg is restricted to the segment that pos resides on. 
The caller is freed from caring about that restriction if the function takes only pos. It adds a small overhead to calculate the segment number from the LSN, but I think it doesn't matter so much. (Or if we don't use a hash, that calculation is not required at all.) @@ -2602,7 +2855,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) LogwrtResult.Flush = LogwrtResult.Write; /* end of page */ if (XLogArchivingActive()) - XLogArchiveNotifySeg(openLogSegNo); + SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1); Is it safe? If the server didn't notify of WAL files for the recent 3 finished segments in the previous server life, they need to be archived in this lifetime. But this may omit all of the three. (I didn't confirm that behavior..) > > I find it highly suspicious that the patch does an archiver notify (i.e. > > creation of the .ready file) in XLogInsertRecord(). Is that a sane > > thing to do? Sounds to me like that should be attempted in XLogFlush > > only. This appeared after Kyotaro's patch at [1] and before your patch > > at [2]. > > I believe my worry was that we'd miss notifying a segment as soon as > possible if the record was somehow flushed prior to registering the > record boundary in the map. If that's actually impossible, then I > would agree that the extra call to NotifySegmentsReadyForArchive() is > unnecessary. I don't think that XLogWrite(up to LSN=X) can happen before XLogInsert(endpos = X) ends. > >> I think the main downside of making lastNotifiedSeg an atomic is that >> the value we first read in NotifySegmentsReadyForArchive() might not >> be up-to-date, which means we might hold off creating .ready files >> longer than necessary. > > > > I'm not sure I understand how this would be a problem. If we block > > somebody from setting a newer value, they'll just set the value > > immediately after we release the lock. Will we reread the value > > afterwards to see if it changed? > > I think you are right. 
If we see an old value for lastNotifiedSeg, > the worst case is that we take the ArchNotifyLock, read > lastNotifiedSeg again (which should then be up-to-date), and then Agreed. > basically do nothing. I suspect initializing lastNotifiedSeg might > still be a little tricky, though. Do you think it is important to try > to avoid this spinlock for lastNotifiedSeg? IIUC it's acquired at the > end of every call to XLogWrite() already, and we'd still need to > acquire it for the flush pointer, anyway. As mentioned above, I think it needs more consideration. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: archive status ".ready" files may be created too early
On 8/2/21, 2:42 PM, "Alvaro Herrera" wrote: > I think it's okay to make lastNotifiedSeg protected by just info_lck, > and RecordBoundaryMap protected by just ArchNotifyLock. It's okay to > acquire the spinlock inside the lwlock-protected area, as long as we > make sure never to do the opposite. (And we sure don't want to hold > info_lck long enough that a LWLock acquisition would occur in the > meantime). So I modified things that way, and also added another > function to set the seg if it's unset, with a single spinlock > acquisition (rather than acquire, read, release, acquire, set, > release, which sounds like it would have trouble behaving.) The patch looks good to me. > I find it highly suspicious that the patch does an archiver notify (i.e. > creation of the .ready file) in XLogInsertRecord(). Is that a sane > thing to do? Sounds to me like that should be attempted in XLogFlush > only. This appeared after Kyotaro's patch at [1] and before your patch > at [2]. I believe my worry was that we'd miss notifying a segment as soon as possible if the record was somehow flushed prior to registering the record boundary in the map. If that's actually impossible, then I would agree that the extra call to NotifySegmentsReadyForArchive() is unnecessary. >> I think the main downside of making lastNotifiedSeg an atomic is that >> the value we first read in NotifySegmentsReadyForArchive() might not >> be up-to-date, which means we might hold off creating .ready files >> longer than necessary. > > I'm not sure I understand how this would be a problem. If we block > somebody from setting a newer value, they'll just set the value > immediately after we release the lock. Will we reread the value > afterwards to see if it changed? I think you are right. If we see an old value for lastNotifiedSeg, the worst case is that we take the ArchNotifyLock, read lastNotifiedSeg again (which should then be up-to-date), and then basically do nothing. 
I suspect initializing lastNotifiedSeg might still be a little tricky, though. Do you think it is important to try to avoid this spinlock for lastNotifiedSeg? IIUC it's acquired at the end of every call to XLogWrite() already, and we'd still need to acquire it for the flush pointer, anyway. Nathan
Re: archive status ".ready" files may be created too early
On 2021-Jul-31, Bossart, Nathan wrote: > This is why I was trying to get away with just using info_lck for > reading lastNotifiedSeg. ArchNotifyLock is mostly intended to protect > RecordBoundaryMap. However, since lastNotifiedSeg is used in > functions like GetLatestRecordBoundarySegment() that access the map, I > found it easier to reason about things if I knew that it wouldn't > change as long as I held ArchNotifyLock. I think it's okay to make lastNotifiedSeg protected by just info_lck, and RecordBoundaryMap protected by just ArchNotifyLock. It's okay to acquire the spinlock inside the lwlock-protected area, as long as we make sure never to do the opposite. (And we sure don't want to hold info_lck long enough that a LWLock acquisition would occur in the meantime). So I modified things that way, and also added another function to set the seg if it's unset, with a single spinlock acquisition (rather than acquire, read, release, acquire, set, release, which sounds like it would have trouble behaving.) I haven't tried your repro with this yet. I find it highly suspicious that the patch does an archiver notify (i.e. creation of the .ready file) in XLogInsertRecord(). Is that a sane thing to do? Sounds to me like that should be attempted in XLogFlush only. This appeared after Kyotaro's patch at [1] and before your patch at [2]. [1] https://postgr.es/m/20201014.090628.839639906081252194.horikyota@gmail.com [2] https://postgr.es/m/eff40306-8e8a-4259-b181-c84f3f066...@amazon.com I also just realized that Kyotaro's patch there also tried to handle the streaming replication issue I was talking about. > I think the main downside of making lastNotifiedSeg an atomic is that > the value we first read in NotifySegmentsReadyForArchive() might not > be up-to-date, which means we might hold off creating .ready files > longer than necessary. I'm not sure I understand how this would be a problem. 
If we block somebody from setting a newer value, they'll just set the value immediately after we release the lock. Will we reread the value afterwards to see if it changed? -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ >From 87c77464049c719c2075bd94d4e9ed168c7cf159 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 30 Jul 2021 18:35:52 -0400 Subject: [PATCH v3] Avoid creating archive status ".ready" files too early. WAL records may span multiple segments, but XLogWrite() does not wait for the entire record to be written out to disk before creating archive status files. Instead, as soon as the last WAL page of the segment is written, the archive status file will be created. If PostgreSQL crashes before it is able to write the rest of the record, it will end up reusing segments that have already been marked as ready-for-archival. However, the archiver process may have already processed the old version of the segment, so the wrong version of the segment may be backed-up. This backed-up segment will cause operations such as point-in-time restores to fail. To fix this, we keep track of records that span across segments and ensure that segments are only marked ready-for-archival once such records have been completely written to disk. 
--- src/backend/access/transam/timeline.c| 2 +- src/backend/access/transam/xlog.c| 286 ++- src/backend/access/transam/xlogarchive.c | 14 +- src/backend/replication/walreceiver.c| 6 +- src/backend/storage/lmgr/lwlocknames.txt | 1 + src/include/access/xlogarchive.h | 4 +- src/include/access/xlogdefs.h| 5 + 7 files changed, 296 insertions(+), 22 deletions(-) diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index 8d0903c175..acd5c2431d 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, if (XLogArchivingActive()) { TLHistoryFileName(histfname, newTLI); - XLogArchiveNotify(histfname); + XLogArchiveNotify(histfname, true); } } diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f84c0bb01e..99fe1ac0a2 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -512,6 +512,13 @@ typedef enum ExclusiveBackupState */ static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE; +/* entries for RecordBoundaryMap, used to mark segments ready for archival */ +typedef struct RecordBoundaryEntry +{ + XLogSegNo seg; /* must be first */ + XLogRecPtr pos; +} RecordBoundaryEntry; + /* * Shared state data for WAL insertion. */ @@ -723,6 +730,12 @@ typedef struct XLogCtlData */ XLogRecPtr lastFpwDisableRecPtr; + /* + * The last segment we've marked ready for archival. Protected by + * info_lck. + */ + XLogSegNo lastNotifiedSeg; + slock_t info_lck; /* locks shared variables shown above */ } XLogCtlData;
Re: archive status ".ready" files may be created too early
On 7/31/21, 9:12 AM, "Alvaro Herrera" wrote: > On 2021-Jul-31, Bossart, Nathan wrote: > >> On 7/30/21, 4:52 PM, "Alvaro Herrera" wrote: > >> > I noticed that XLogCtl->lastNotifiedSeg is protected by both the >> > info_lck and ArchNotifyLock. I think if it's going to be protected by >> > the lwlock, then we should drop the use of the spinlock. >> >> That seems reasonable to me. This means that the lock is acquired at >> the end of every XLogWrite(), > > Uhh, actually that there sounds really bad; it's going to be a serious > contention point. > > Another option might be to make it an atomic. This is why I was trying to get away with just using info_lck for reading lastNotifiedSeg. ArchNotifyLock is mostly intended to protect RecordBoundaryMap. However, since lastNotifiedSeg is used in functions like GetLatestRecordBoundarySegment() that access the map, I found it easier to reason about things if I knew that it wouldn't change as long as I held ArchNotifyLock. I think the main downside of making lastNotifiedSeg an atomic is that the value we first read in NotifySegmentsReadyForArchive() might not be up-to-date, which means we might hold off creating .ready files longer than necessary. Nathan
Re: archive status ".ready" files may be created too early
On 2021-Jul-31, Bossart, Nathan wrote: > On 7/30/21, 4:52 PM, "Alvaro Herrera" wrote: > > I noticed that XLogCtl->lastNotifiedSeg is protected by both the > > info_lck and ArchNotifyLock. I think if it's going to be protected by > > the lwlock, then we should drop the use of the spinlock. > > That seems reasonable to me. This means that the lock is acquired at > the end of every XLogWrite(), Uhh, actually that there sounds really bad; it's going to be a serious contention point. Another option might be to make it an atomic. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 7/30/21, 4:52 PM, "Alvaro Herrera" wrote: > I think that creating files out of order might be problematic. On the > archiver side, pgarch_readyXlog() expects to return the oldest > archivable file; but if we create a newer segment's .ready file first, > it is possible that a directory scan would return that newer file before > the older segment's .ready file appears. > > However, the comments in pgarch_readyXlog() aren't super convincing that > processing the files in order is actually a correctness requirement, so > perhaps it doesn't matter all that much. I can't think of a reason it'd be needed from a correctness perspective. After a quick scan, I couldn't find any promises about archival order in the documentation, either. In any case, it doesn't look like there's a risk that the archiver will skip files when the .ready files are created out of order. > I noticed that XLogCtl->lastNotifiedSeg is protected by both the > info_lck and ArchNotifyLock. I think if it's going to be protected by > the lwlock, then we should drop the use of the spinlock. That seems reasonable to me. This means that the lock is acquired at the end of every XLogWrite(), but the other places that acquire the lock only do so once per WAL segment. Plus, the call to NotifySegmentsReadyForArchive() at the end of every XLogWrite() should usually only need the lock for a short amount of time to retrieve a value from shared memory. > We set archiver's latch on each XLogArchiveNotify(), but if we're doing > it in a loop such as in NotifySegmentsReadyForArchive() perhaps it is > better to create all the .ready files first and do PgArchWakeup() at the > end. I'm not convinced that this is useful but let's at least discard > the idea explicitly if not. I don't have a terribly strong opinion, but I would lean towards setting the latch for each call to XLogArchiveNotify() so that the archiver process can get started as soon as a segment is ready. 
However, I doubt that holding off until the end of the loop has any discernible effect in most cases. Nathan
Re: archive status ".ready" files may be created too early
On 2021-Jul-30, Alvaro Herrera wrote: > We set archiver's latch on each XLogArchiveNotify(), but if we're doing > it in a loop such as in NotifySegmentsReadyForArchive() perhaps it is > better to create all the .ready files first and do PgArchWakeup() at the > end. I'm not convinced that this is useful but let's at least discard > the idea explicitly if not. hm, this causes an ABI change so it's not backpatchable. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "This is what I like so much about PostgreSQL. Most of the surprises are of the "oh wow! That's cool" Not the "oh shit!" kind. :)" Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php
Re: archive status ".ready" files may be created too early
On 2021-Jul-30, Bossart, Nathan wrote: > Yes, that was what I was worried about. However, I just performed a > variety of tests with just 0001 applied, and I am beginning to suspect > my concerns were unfounded. With wal_buffers set very high, > synchronous_commit set to off, and a long sleep at the end of > XLogWrite(), I can reliably cause the archive status files to lag far > behind the current open WAL segment. However, even if I crash at this > time, the .ready files are created when the server restarts (albeit > out of order). I think that creating files out of order might be problematic. On the archiver side, pgarch_readyXlog() expects to return the oldest archivable file; but if we create a newer segment's .ready file first, it is possible that a directory scan would return that newer file before the older segment's .ready file appears. However, the comments in pgarch_readyXlog() aren't super convincing that processing the files in order is actually a correctness requirement, so perhaps it doesn't matter all that much. I noticed that XLogCtl->lastNotifiedSeg is protected by both the info_lck and ArchNotifyLock. I think if it's going to be protected by the lwlock, then we should drop the use of the spinlock. We set archiver's latch on each XLogArchiveNotify(), but if we're doing it in a loop such as in NotifySegmentsReadyForArchive() perhaps it is better to create all the .ready files first and do PgArchWakeup() at the end. I'm not convinced that this is useful but let's at least discard the idea explicitly if not. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Always assume the user will do much worse than the stupidest thing you can imagine." (Julien PUYDT) >From f9d94c614d6cf640fdaa09785d0817a1d86b9658 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 30 Jul 2021 18:35:52 -0400 Subject: [PATCH v2] Avoid creating archive status ".ready" files too early. 
WAL records may span multiple segments, but XLogWrite() does not wait for the entire record to be written out to disk before creating archive status files. Instead, as soon as the last WAL page of the segment is written, the archive status file will be created. If PostgreSQL crashes before it is able to write the rest of the record, it will end up reusing segments that have already been marked as ready-for-archival. However, the archiver process may have already processed the old version of the segment, so the wrong version of the segment may be backed-up. This backed-up segment will cause operations such as point-in-time restores to fail. To fix this, we keep track of records that span across segments and ensure that segments are only marked ready-for-archival once such records have been completely written to disk. --- src/backend/access/transam/timeline.c| 2 +- src/backend/access/transam/xlog.c| 279 ++- src/backend/access/transam/xlogarchive.c | 14 +- src/backend/replication/walreceiver.c| 6 +- src/backend/storage/lmgr/lwlocknames.txt | 1 + src/include/access/xlogarchive.h | 4 +- src/include/access/xlogdefs.h| 3 + 7 files changed, 289 insertions(+), 20 deletions(-) diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index 8d0903c175..acd5c2431d 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, if (XLogArchivingActive()) { TLHistoryFileName(histfname, newTLI); - XLogArchiveNotify(histfname); + XLogArchiveNotify(histfname, true); } } diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 26fa2b6c8f..700bbccff8 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -531,6 +531,13 @@ typedef enum ExclusiveBackupState */ static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE; +/* entries for RecordBoundaryMap, used to mark 
segments ready for archival */ +typedef struct RecordBoundaryEntry +{ + XLogSegNo seg; /* must be first */ + XLogRecPtr pos; +} RecordBoundaryEntry; + /* * Shared state data for WAL insertion. */ @@ -742,6 +749,12 @@ typedef struct XLogCtlData */ XLogRecPtr lastFpwDisableRecPtr; + /* + * The last segment we've marked ready for archival. Protected by + * ArchNotifyLock. + */ + XLogSegNo lastNotifiedSeg; + slock_t info_lck; /* locks shared variables shown above */ } XLogCtlData; @@ -755,6 +768,12 @@ static WALInsertLockPadded *WALInsertLocks = NULL; */ static ControlFileData *ControlFile = NULL; +/* + * Record boundary map, used for marking segments as ready for archival. + * Protected by ArchNotifyLock. + */ +static HTAB *RecordBoundaryMap = NULL; + /* * Calculate the amount of space left on the page after 'endptr'. Beware * multiple evaluation! @@ -983,6 +1002,12 @@
Re: archive status ".ready" files may be created too early
On 7/30/21, 3:23 PM, "Alvaro Herrera" wrote: > That's great to hear. I'll give 0001 a look again. Much appreciated. Nathan
Re: archive status ".ready" files may be created too early
On 2021-Jul-30, Bossart, Nathan wrote: > On 7/30/21, 11:34 AM, "Alvaro Herrera" wrote: > > Hmm ... I'm not sure we're prepared to backpatch this kind of change. > > It seems a bit too disruptive to how replay works. I think patch we > > should be focusing solely on patch 0001 to surgically fix the precise > > bug you see. Does patch 0002 exist because you think that a system with > > only 0001 will not correctly deal with a crash at the right time? > > Yes, that was what I was worried about. However, I just performed a > variety of tests with just 0001 applied, and I am beginning to suspect > my concerns were unfounded. With wal_buffers set very high, > synchronous_commit set to off, and a long sleep at the end of > XLogWrite(), I can reliably cause the archive status files to lag far > behind the current open WAL segment. However, even if I crash at this > time, the .ready files are created when the server restarts (albeit > out of order). This appears to be due to the call to > XLogArchiveCheckDone() in RemoveOldXlogFiles(). Therefore, we can > likely abandon 0002. That's great to hear. I'll give 0001 a look again. > > Now, the reason I'm looking at this patch series is that we're seeing a > > related problem with walsender/walreceiver, which apparently are capable > > of creating a file in the replica that ends up not existing in the > > primary after a crash, for a reason closely related to what you > > describe for WAL archival. I'm not sure what is going on just yet, so > > I'm not going to try and explain because I'm likely to get it wrong. > > I've suspected that this is due to the use of the flushed location for > the send pointer, which AFAICT needn't align with a WAL record > boundary. Yeah, I had gotten as far as the GetFlushRecPtr but haven't tracked down what happens with a contrecord. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 7/30/21, 11:34 AM, "Alvaro Herrera" wrote: > Hmm ... I'm not sure we're prepared to backpatch this kind of change. > It seems a bit too disruptive to how replay works. I think we > should be focusing solely on patch 0001 to surgically fix the precise > bug you see. Does patch 0002 exist because you think that a system with > only 0001 will not correctly deal with a crash at the right time? Yes, that was what I was worried about. However, I just performed a variety of tests with just 0001 applied, and I am beginning to suspect my concerns were unfounded. With wal_buffers set very high, synchronous_commit set to off, and a long sleep at the end of XLogWrite(), I can reliably cause the archive status files to lag far behind the current open WAL segment. However, even if I crash at this time, the .ready files are created when the server restarts (albeit out of order). This appears to be due to the call to XLogArchiveCheckDone() in RemoveOldXlogFiles(). Therefore, we can likely abandon 0002. > Now, the reason I'm looking at this patch series is that we're seeing a > related problem with walsender/walreceiver, which apparently are capable > of creating a file in the replica that ends up not existing in the > primary after a crash, for a reason closely related to what you > describe for WAL archival. I'm not sure what is going on just yet, so > I'm not going to try and explain because I'm likely to get it wrong. I've suspected that this is due to the use of the flushed location for the send pointer, which AFAICT needn't align with a WAL record boundary. /* * Streaming the current timeline on a primary. * * Attempt to send all data that's already been written out and * fsync'd to disk. We cannot go further than what's been written out * given the current implementation of WALRead(). 
And in any case * it's unsafe to send WAL that is not securely down to disk on the * primary: if the primary subsequently crashes and restarts, standbys * must not have applied any WAL that got lost on the primary. */ SendRqstPtr = GetFlushRecPtr(); Nathan
Re: archive status ".ready" files may be created too early
On 2021-Jul-28, Bossart, Nathan wrote: > On 7/27/21, 6:05 PM, "Alvaro Herrera" wrote: > > I'm not sure I understand what's the reason not to store this value in > > pg_control; I feel like I'm missing something. Can you please explain? > > Thanks for taking a look. > > The only reason I can think of is that it could make back-patching > difficult. I don't mind working on a version of the patch that uses > pg_control. Back-patching this fix might be a stretch, anyway. Hmm ... I'm not sure we're prepared to backpatch this kind of change. It seems a bit too disruptive to how replay works. I think we should be focusing solely on patch 0001 to surgically fix the precise bug you see. Does patch 0002 exist because you think that a system with only 0001 will not correctly deal with a crash at the right time? Now, the reason I'm looking at this patch series is that we're seeing a related problem with walsender/walreceiver, which apparently are capable of creating a file in the replica that ends up not existing in the primary after a crash, for a reason closely related to what you describe for WAL archival. I'm not sure what is going on just yet, so I'm not going to try and explain because I'm likely to get it wrong. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 7/27/21, 6:05 PM, "Alvaro Herrera" wrote: > On 2021-Feb-19, Bossart, Nathan wrote: > >> 0002 adds logic for persisting the last notified segment through >> crashes. This is needed because a poorly-timed crash could otherwise >> cause us to skip marking segments as ready-for-archival altogether. >> This file is only used for primary servers, as there exists a separate >> code path for marking segments as ready-for-archive for standbys. > > I'm not sure I understand what's the reason not to store this value in > pg_control; I feel like I'm missing something. Can you please explain? Thanks for taking a look. The only reason I can think of is that it could make back-patching difficult. I don't mind working on a version of the patch that uses pg_control. Back-patching this fix might be a stretch, anyway. > There were some comments earlier in the thread about the maximum size of > a record. As I recall, you can have records of arbitrary size if you > have COMMIT with a large number of relation invalidation messages being > included in the xlog record, or a large number of XIDs of > subtransactions in the transaction. Spanning several segments is > possible, AFAIU. This is my understanding, too. Nathan
Re: archive status ".ready" files may be created too early
On 2021-Feb-19, Bossart, Nathan wrote: > 0002 adds logic for persisting the last notified segment through > crashes. This is needed because a poorly-timed crash could otherwise > cause us to skip marking segments as ready-for-archival altogether. > This file is only used for primary servers, as there exists a separate > code path for marking segments as ready-for-archive for standbys. I'm not sure I understand what's the reason not to store this value in pg_control; I feel like I'm missing something. Can you please explain? There were some comments earlier in the thread about the maximum size of a record. As I recall, you can have records of arbitrary size if you have COMMIT with a large number of relation invalidation messages being included in the xlog record, or a large number of XIDs of subtransactions in the transaction. Spanning several segments is possible, AFAIU. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)
Re: archive status ".ready" files may be created too early
On 1/26/21, 6:36 PM, "Kyotaro Horiguchi" wrote: > At Tue, 26 Jan 2021 19:13:57 +, "Bossart, Nathan" > wrote in >> On 12/17/20, 9:15 PM, "Kyotaro Horiguchi" wrote: >> > At Thu, 17 Dec 2020 22:20:35 +, "Bossart, Nathan" >> > wrote in >> >> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" wrote: >> >> > You're right in that regard. There's a window where partial record is >> >> > written when write location passes F0 after insertion location passes >> >> > F1. However, remembering all spanning records seems overkilling to me. >> >> >> >> I'm curious why you feel that recording all cross-segment records is >> >> overkill. IMO it seems far simpler to just do that rather than try to >> > >> > Sorry, my words are not enough. Remembering all spanning records in >> > *shared memory* seems to be overkilling. Much more if it is stored in >> > shared hash table. Even though it rarely the case, it can fail hard >> > way when reaching the limit. If we could do well by remembering just >> > two locations, we wouldn't need to worry about such a limitation. >> >> I don't think it will fail if we reach max_size for the hash table. >> The comment above ShmemInitHash() has this note: >> >> * max_size is the estimated maximum number of hashtable entries. This is >> * not a hard limit, but the access efficiency will degrade if it is >> * exceeded substantially (since it's used to compute directory size and >> * the hash table buckets will get overfull). > > That description means that a shared hash has a directory with fixed > size thus there may be synonyms, which causes degradation. Even though > buckets are preallocated with the specified number, since the minimum > directory size is 256, buckets are allocated at least 256 in a long > run. Minimum on-the-fly allocation size is 32. I haven't calcuated > further precicely, but I'm worried about the amount of spare shared > memory the hash can allocate. On my machine, hash_estimate_size() for the table returns 5,968 bytes. 
That estimate is for a max_size of 16. In my testing, I've been able to need up to 6 elements in this table, but that required turning off synchronous_commit, adding a long sleep at the end of XLogWrite(), and increasing wal_buffers substantially. This leads me to think that a max_size of 16 elements is typically sufficient. (I may have also accidentally demonstrated that only storing two record boundaries could be insufficient.) >> > Another concern about the concrete patch: >> > >> > NotifySegmentsReadyForArchive() searches the shared hashacquiaing a >> > LWLock every time XLogWrite is called while segment archive is being >> > held off. I don't think it is acceptable and I think it could be a >> > problem when many backends are competing on WAL. >> >> This is a fair point. I did some benchmarking with a few hundred >> connections all doing writes, and I was not able to discern any >> noticeable performance impact. My guess is that contention on this >> new lock is unlikely because callers of XLogWrite() must already hold >> WALWriteLock. Otherwise, I believe we only acquire ArchNotifyLock no >> more than once per segment to record new record boundaries. > > Thanks. I agree that the reader-reader contention is not a problem due > to existing serialization by WALWriteLock. Adding an entry happens > only at segment boundary so the ArchNotifyLock doesn't seem to be a > problem. > > However the function prolongs the WALWriteLock section. Couldn't we > somehow move the call to NotifySegmentsReadyForArchive in XLogWrite > out of the WALWriteLock section? I don't see a clean way to do that. XLogWrite() assumes that WALWriteLock is held when it is called, and it doesn't release it at any point. I think we'd have to move NotifySegmentsReadyForArchive() to the callers of XLogWrite() if we wanted to avoid holding onto WALWriteLock for longer. Unless we can point to a measurable performance penalty, I'm not sure this is worth it. Nathan
Re: archive status ".ready" files may be created too early
At Tue, 26 Jan 2021 19:13:57 +, "Bossart, Nathan" wrote in > On 12/17/20, 9:15 PM, "Kyotaro Horiguchi" wrote: > > At Thu, 17 Dec 2020 22:20:35 +, "Bossart, Nathan" > > wrote in > >> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" wrote: > >> > You're right in that regard. There's a window where partial record is > >> > written when write location passes F0 after insertion location passes > >> > F1. However, remembering all spanning records seems overkilling to me. > >> > >> I'm curious why you feel that recording all cross-segment records is > >> overkill. IMO it seems far simpler to just do that rather than try to > > > > Sorry, my words are not enough. Remembering all spanning records in > > *shared memory* seems to be overkilling. Much more if it is stored in > > shared hash table. Even though it rarely the case, it can fail hard > > way when reaching the limit. If we could do well by remembering just > > two locations, we wouldn't need to worry about such a limitation. > > I don't think it will fail if we reach max_size for the hash table. > The comment above ShmemInitHash() has this note: > > * max_size is the estimated maximum number of hashtable entries. This is > * not a hard limit, but the access efficiency will degrade if it is > * exceeded substantially (since it's used to compute directory size and > * the hash table buckets will get overfull). That description means that a shared hash has a directory with fixed size, thus there may be synonyms, which causes degradation. Even though buckets are preallocated with the specified number, since the minimum directory size is 256, at least 256 buckets are allocated in the long run. The minimum on-the-fly allocation size is 32. I haven't calculated further precisely, but I'm worried about the amount of spare shared memory the hash can allocate. 
> > Another concern about the concrete patch: > > > > NotifySegmentsReadyForArchive() searches the shared hashacquiaing a > > LWLock every time XLogWrite is called while segment archive is being > > held off. I don't think it is acceptable and I think it could be a > > problem when many backends are competing on WAL. > > This is a fair point. I did some benchmarking with a few hundred > connections all doing writes, and I was not able to discern any > noticeable performance impact. My guess is that contention on this > new lock is unlikely because callers of XLogWrite() must already hold > WALWriteLock. Otherwise, I believe we only acquire ArchNotifyLock no > more than once per segment to record new record boundaries. Thanks. I agree that the reader-reader contention is not a problem due to existing serialization by WALWriteLock. Adding an entry happens only at segment boundary so the ArchNotifyLock doesn't seem to be a problem. However the function prolongs the WALWriteLock section. Couldn't we somehow move the call to NotifySegmentsReadyForArchive in XLogWrite out of the WALWriteLock section? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: archive status ".ready" files may be created too early
On 1/2/21, 8:55 AM, "Andrey Borodin" wrote: > Recently we had somewhat related incident. Do I understand correctly that > this incident is related to the bug discussed in this thread? I'm not sure that we can rule it out, but the log pattern I've typically seen for this is "invalid contrecord length." The issue is that we're marking segments as ready for archive when the segment is fully written versus when its WAL records are fully written (since its WAL records may cross into the next segment). The fact that you're seeing zeroes at the end of your archived segments leads me to think it is unlikely that you are experiencing this bug. Nathan
Re: archive status ".ready" files may be created too early
On 12/17/20, 9:15 PM, "Kyotaro Horiguchi" wrote: > At Thu, 17 Dec 2020 22:20:35 +, "Bossart, Nathan" > wrote in >> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" wrote: >> > You're right in that regard. There's a window where partial record is >> > written when write location passes F0 after insertion location passes >> > F1. However, remembering all spanning records seems overkilling to me. >> >> I'm curious why you feel that recording all cross-segment records is >> overkill. IMO it seems far simpler to just do that rather than try to > > Sorry, my words are not enough. Remembering all spanning records in > *shared memory* seems to be overkilling. Much more if it is stored in > shared hash table. Even though it rarely the case, it can fail hard > way when reaching the limit. If we could do well by remembering just > two locations, we wouldn't need to worry about such a limitation. I don't think it will fail if we reach max_size for the hash table. The comment above ShmemInitHash() has this note: * max_size is the estimated maximum number of hashtable entries. This is * not a hard limit, but the access efficiency will degrade if it is * exceeded substantially (since it's used to compute directory size and * the hash table buckets will get overfull). > Another concern about the concrete patch: > > NotifySegmentsReadyForArchive() searches the shared hashacquiaing a > LWLock every time XLogWrite is called while segment archive is being > held off. I don't think it is acceptable and I think it could be a > problem when many backends are competing on WAL. This is a fair point. I did some benchmarking with a few hundred connections all doing writes, and I was not able to discern any noticeable performance impact. My guess is that contention on this new lock is unlikely because callers of XLogWrite() must already hold WALWriteLock. Otherwise, I believe we only acquire ArchNotifyLock no more than once per segment to record new record boundaries. Nathan
Re: archive status ".ready" files may be created too early
Hi! Thanks for working on this. > On 18 Dec 2020, at 10:42, Kyotaro Horiguchi > wrote: > > I noticed that we can cause the continuation record to be flushed > immediately. I've taken a look at the code and want to share some thoughts. 1. Maybe we could avoid interleaving fields protected by different locks in XLogCtlData? We can place lastNotifiedSeg somewhere near a field that is protected by WALWriteLock. I'm not sure it's a useful idea. 2. In XLogInsertRecord() we release XLogCtl->info_lck just to compute a few bytes. And possibly acquire it back. Maybe just hold the lock a little longer? 3. Things that are done by GetLastNotifiedSegment() could just be an atomic read? I'm not sure it's common practice. Thanks! Best regards, Andrey Borodin.
Re: archive status ".ready" files may be created too early
Hi! I was looking to review something in CF. This seems like a thread of some interest to me. Recently we had a somewhat related incident. Do I understand correctly that this incident is related to the bug discussed in this thread? Primary instance was killed by OOM [ 2020-11-12 15:27:03.732 MSK ,,,739,0 ]:LOG: server process (PID 40189) was terminated by signal 9: Killed after recovery it archived some WAL segments. [ 2020-11-12 15:27:31.477 MSK ,,,739,0 ]:LOG: database system is ready to accept connections INFO: 2020/11/12 15:27:32.059541 FILE PATH: 000E0001C02F00AF.br INFO: 2020/11/12 15:27:32.114319 FILE PATH: 000E0001C02F00B3.br then PITR failed on another host [ 2020-11-12 16:26:33.024 MSK ,,,51414,0 ]:LOG: restored log file "000E0001C02F00B3" from archive [ 2020-11-12 16:26:33.042 MSK ,,,51414,0 ]:LOG: invalid record length at 1C02F/B3FFF778: wanted 24, got 0 [ 2020-11-12 16:26:33.042 MSK ,,,51414,0 ]:LOG: invalid record length at 1C02F/B3FFF778: wanted 24, got 0 archived segment has some zeroes at the end rmgr: XLOG len (rec/tot): 51/ 1634, tx: 0, lsn: 1C02F/B3FFF058, prev 1C02F/B3FFEFE8, desc: FPI_FOR_HINT , blkref #0: rel 1663/14030/16384 blk 140 FPW rmgr: Heap len (rec/tot): 129/ 129, tx: 3890578935, lsn: 1C02F/B3FFF6C0, prev 1C02F/B3FFF058, desc: HOT_UPDATE off 34 xmax 3890578935 ; new off 35 xmax 0, blkref #0: rel 1663/14030/16384 blk 140 rmgr: Transaction len (rec/tot): 46/46, tx: 3890578935, lsn: 1C02F/B3FFF748, prev 1C02F/B3FFF6C0, desc: COMMIT 2020-11-12 15:27:31.507363 MSK pg_waldump: FATAL: error in WAL record at 1C02F/**B3FFF748**: invalid record length at 1C02F/**B3FFF778**: wanted 24, got 0 Meanwhile the next segment points to the previous record at **B3FFF748** postgres@man-odszl7u4361o8m3z:/tmp$ pg_waldump 000E0001C02F00B4 | head rmgr: Heap len (rec/tot): 129/ 129, tx: 3890578936, lsn: 1C02F/B4000A68, prev 1C02F/**B3FFF778**, desc: HOT_UPDATE off 35 xmax 3890578936 ; new off 36 xmax 0, blkref #0: rel 1663/14030/16384 blk 140 rmgr: Transaction len (rec/tot): 46/46, tx: 3890578936, lsn: 1C02F/B4000AF0, prev 1C02F/B4000A68, desc: COMMIT 2020-11-12 15:27:32.509443 MSK Best regards, Andrey Borodin.
Re: archive status ".ready" files may be created too early
At Wed, 16 Dec 2020 11:01:20 +0900 (JST), Kyotaro Horiguchi wrote in > - Record the beginning LSN of the first cross-seg record and the end > LSN of the last cross-seg record in consecutive segments bonded by > cross-seg records. Specifically X and Y below. > >X Z Y >[recA] [recB] [recC] > [seg A] [seg B] [seg C] [seg D] [seg E] > (1)(2.2)(2.2) (2.1) (2.1) (1) > > 1. If we wrote up to before X or beyond Y at a segment boundary, notify > the finished segment immediately. > > 1.1. If we have written beyond Y, clear the recorded region. > > 2. Otherwise we don't notify the segment immediately: > > 2.1. If the write request was up to exactly the current segment boundary > and we know the end LSN of the record there (that is, it is recC > above), extend the request to the end LSN. Then notify the segment > after the record is written to the end. > > 2.2. Otherwise (that is, recA or recB), we don't know whether the > last record of the last segment ends just at the segment boundary > (Z) or a record spans between segments (recB). Anyway even if there > is such a record there, we don't know where it ends. As a result > all we can do there is refrain from notifying. It doesn't > matter so much since we have already inserted recC, so we will soon > reach recC and will notify up to seg C. > > There might be a case where we insert up to Y before writing up to Z; > in that case the segment region X-Y contains a non-connected segment > boundary. It is processed as if it is a connected segment > boundary. However, like 2.2 above, it doesn't matter since we write up > to Y soon. I noticed that we can cause the continuation record to be flushed immediately. So in the attached, 1. If there's no remembered cross-segment boundary or we're out of the region X-Y, notify the finished segment immediately. 2. Otherwise we don't notify the segment immediately: 2.1.
If we are finishing the last segment known to continue to the next segment, extend the write request to the end of the record *and* force a write, then flush up to there. 2.2. (the same as the above) 3. In the case of 2.1, we can flush the previous segment immediately, so do that. X. When we notify a segment, clear the remembered region if we have got out of the region. The attached is changed in the following points: - Fixed some bugs where I confusedly referred to the write LSN instead of the flush LSN. - Changed to urge flushing up to the end of a continuation record, not only waiting for the record to be written. - More aggressively clear the remembered region. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 95e438ee448c1686c946909f1fc84ec95ee6c7d4 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 18 Dec 2020 13:45:29 +0900 Subject: [PATCH v5 1/2] Avoid archiving a WAL segment that continues to the next segment If the last record of a finished segment continues to the next segment, we need to defer archiving of the segment until the record is flushed to the end. Otherwise crash recovery can overwrite the last record of a segment and history diverges between archive and pg_wal. --- src/backend/access/transam/xlog.c | 221 +++- src/backend/replication/walsender.c | 14 +- src/include/access/xlog.h | 1 + 3 files changed, 224 insertions(+), 12 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index b1e5d2dbff..b0d3ba2c5a 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -733,6 +733,16 @@ typedef struct XLogCtlData */ XLogRecPtr lastFpwDisableRecPtr; + /* The last segment notified to be archived. Protected by WALWriteLock */ + XLogSegNo lastNotifiedSeg; + + /* + * Remember the region we need to consider refraining from archiving + * finished segments immediately. Protected by info_lck. 
+ */ + XLogRecPtr firstSegContRecStart; + XLogRecPtr lastSegContRecEnd; + slock_t info_lck; /* locks shared variables shown above */ } XLogCtlData; @@ -1170,6 +1180,9 @@ XLogInsertRecord(XLogRecData *rdata, */ if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ) { + XLogSegNo startseg; + XLogSegNo endseg; + SpinLockAcquire(&XLogCtl->info_lck); /* advance global request to include new block(s) */ if (XLogCtl->LogwrtRqst.Write < EndPos) @@ -1177,6 +1190,22 @@ /* update local result copy while I have the chance */ LogwrtResult = XLogCtl->LogwrtResult; SpinLockRelease(&XLogCtl->info_lck); + + /* + * Remember the range of segment boundaries that are connected by a + * continuation record. + */ + XLByteToSeg(StartPos, startseg, wal_segment_size); + XLByteToPrevSeg(EndPos, endseg, wal_segment_size); + + if (startseg != endseg) + { + SpinLockAcquire(&XLogCtl->info_lck); + if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr) +
Re: archive status ".ready" files may be created too early
At Thu, 17 Dec 2020 22:20:35 +, "Bossart, Nathan" wrote in > On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" wrote: > > You're right in that regard. There's a window where partial record is > > written when write location passes F0 after insertion location passes > > F1. However, remembering all spanning records seems overkilling to me. > > I'm curious why you feel that recording all cross-segment records is > overkill. IMO it seems far simpler to just do that rather than try to Sorry, my words were not enough. Remembering all spanning records in *shared memory* seems to be overkill. Much more so if it is stored in a shared hash table. Even though it is rarely the case, it can fail in a hard way when reaching the limit. If we could do well by remembering just two locations, we wouldn't need to worry about such a limitation. > reason about all these different scenarios and rely on various > (and possibly fragile) assumptions. You only need to record the end After the previous mail was sent, I noticed that the assumption on record length was not needed. So that way no longer needs any of the assumptions^^; > location of records that cross into the next segment (or that fit > perfectly into the end of the current one) and to evaluate which > segments to mark .ready as the "flushed" LSN advances. I'd expect > that in most cases we wouldn't need to store more than a couple of > record boundaries, so it's not like we'd normally be storing dozens of > boundaries. Even if we did need to store several boundaries, AFAICT > the approach I'm proposing should still work well enough. I didn't say it doesn't work, just overkill. Another concern about the concrete patch: NotifySegmentsReadyForArchive() searches the shared hash, acquiring a LWLock, every time XLogWrite is called while segment archiving is being held off. I don't think that is acceptable and I think it could be a problem when many backends are competing on WAL. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: archive status ".ready" files may be created too early
On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" wrote: > You're right in that regard. There's a window where partial record is > written when write location passes F0 after insertion location passes > F1. However, remembering all spanning records seems overkilling to me. I'm curious why you feel that recording all cross-segment records is overkill. IMO it seems far simpler to just do that rather than try to reason about all these different scenarios and rely on various (and possibly fragile) assumptions. You only need to record the end location of records that cross into the next segment (or that fit perfectly into the end of the current one) and to evaluate which segments to mark .ready as the "flushed" LSN advances. I'd expect that in most cases we wouldn't need to store more than a couple of record boundaries, so it's not like we'd normally be storing dozens of boundaries. Even if we did need to store several boundaries, AFAICT the approach I'm proposing should still work well enough. Nathan
Re: archive status ".ready" files may be created too early
At Tue, 15 Dec 2020 19:32:57 +0900 (JST), Kyotaro Horiguchi wrote in > At Mon, 14 Dec 2020 18:25:23 +, "Bossart, Nathan" > wrote in > > I wonder if these are safe assumptions to make. For your example, if > > we've written record B to the WAL buffers, but neither record A nor B > > have been written to disk or flushed, aren't we still in trouble? > > You're right in that regard. There's a window where a partial record is > written when the write location passes F0 after the insertion location passes > F1. However, remembering all spanning records seems overkill to me. > > I modified the previous patch so that it remembers the start LSN of the > *oldest* cross-segment continuation record in the last consecutive > bonded segments, and the end LSN of the latest cross-segment > continuation record. This doesn't forget past segment boundaries. > The region is cleared when the WAL-write LSN goes beyond the remembered end > LSN. So the region may contain several wal-segments that are not > connected to the next one, but that doesn't matter so much. Mmm. Even though it's a PoC, it was too bogus. I fixed it to work in a saner way. - Record the beginning LSN of the first cross-seg record and the end LSN of the last cross-seg record in consecutive segments bonded by cross-seg records. Specifically X and Y below. X Z Y [recA] [recB] [recC] [seg A] [seg B] [seg C] [seg D] [seg E] (1)(2.2)(2.2) (2.1) (2.1) (1) 1. If we wrote up to before X or beyond Y at a segment boundary, notify the finished segment immediately. 1.1. If we have written beyond Y, clear the recorded region. 2. Otherwise we don't notify the segment immediately: 2.1. If the write request was up to exactly the current segment boundary and we know the end LSN of the record there (that is, it is recC above), extend the request to the end LSN. Then notify the segment after the record is written to the end. 2.2. 
Otherwise (that is, recA or recB), we don't know whether the last record of the last segment ends just at the segment boundary (Z) or a record spans between segments (recB). Anyway even if there is such a record there, we don't know where it ends. As a result, all we can do there is refrain from notifying. It doesn't matter so much since we have already inserted recC, so we will soon reach recC and will notify up to seg C. There might be a case where we insert up to Y before writing up to Z; in that case the segment region X-Y contains a non-connected segment boundary. It is processed as if it is a connected segment boundary. However, like 2.2 above, it doesn't matter since we write up to Y soon. At Tue, 15 Dec 2020 19:32:57 +0900 (JST), Kyotaro Horiguchi wrote in me> I added an assertion that a record must be shorter than a wal segment me> to XLogRecordAssemble(). This guarantees the assumption to be true? me> (The condition is tentative, would need to be adjusted.) Changed the assertion to a more direct form. me> Also, the attached is a PoC. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 50b3b05dd0eed79cd0b97991e82090b9d569cbac Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Wed, 16 Dec 2020 10:36:14 +0900 Subject: [PATCH v4 1/2] Avoid archiving a WAL segment that continues to the next segment If the last record of a finished segment continues to the next segment, we need to defer archiving of the segment until the record is flushed to the end. Otherwise crash recovery can overwrite the last record of a segment and history diverges between archive and pg_wal. 
--- src/backend/access/transam/xlog.c | 214 +++- src/backend/replication/walsender.c | 14 +- src/include/access/xlog.h | 1 + 3 files changed, 217 insertions(+), 12 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 8dd225c2e1..8705809160 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -723,6 +723,16 @@ typedef struct XLogCtlData */ XLogRecPtr lastFpwDisableRecPtr; + /* The last segment notified to be archived. Protected by WALWriteLock */ + XLogSegNo lastNotifiedSeg; + + /* + * Remember the oldest and newest known segment that ends with a + * continuation record. + */ + XLogRecPtr firstSegContRecStart; + XLogRecPtr lastSegContRecEnd; + slock_t info_lck; /* locks shared variables shown above */ } XLogCtlData; @@ -1160,6 +1170,9 @@ XLogInsertRecord(XLogRecData *rdata, */ if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ) { + XLogSegNo startseg; + XLogSegNo endseg; + SpinLockAcquire(&XLogCtl->info_lck); /* advance global request to include new block(s) */ if (XLogCtl->LogwrtRqst.Write < EndPos) @@ -1167,6 +1180,24 @@ /* update local result copy while I have the chance */ LogwrtResult = XLogCtl->LogwrtResult; SpinLockRelease(&XLogCtl->info_lck);
Re: archive status ".ready" files may be created too early
At Mon, 14 Dec 2020 18:25:23 +, "Bossart, Nathan" wrote in > Apologies for the long delay. > > I've spent a good amount of time thinking about this bug and trying > out a few different approaches for fixing it. I've attached a work- > in-progress patch for my latest attempt. > > On 10/13/20, 5:07 PM, "Kyotaro Horiguchi" wrote: > > F0F1 > > A F B > > |-|-|-| > >seg Xseg X+1 seg X+2 > > > > Matsumura-san has a concern about the case where there are two (or > > more) partially-flushed segment-spanning records at the same time. > > > > This patch remembers only the last cross-segment record. If we were > > going to flush up to F0 after Record-B had been written, we would fail > > to hold off archiving seg-X. This patch is based on an assumption that > > that case cannot happen because we don't leave a pending page at the > > time of segment switch and no records span over three or more > > segments. > > I wonder if these are safe assumptions to make. For your example, if > we've written record B to the WAL buffers, but neither record A nor B > have been written to disk or flushed, aren't we still in trouble? You're right in that regard. There's a window where a partial record is written when the write location passes F0 after the insertion location passes F1. However, remembering all spanning records seems overkill to me. I modified the previous patch so that it remembers the start LSN of the *oldest* cross-segment continuation record in the last consecutive bonded segments, and the end LSN of the latest cross-segment continuation record. This doesn't forget past segment boundaries. The region is cleared when the WAL-write LSN goes beyond the remembered end LSN. So the region may contain several wal-segments that are not connected to the next one, but that doesn't matter so much. > Also, is there actually any limit on WAL record length that means that > it is impossible for a record to span over three or more segments? 
Even though it is not a hard limit, AFAICS as mentioned before the longest possible record is what log_newpages() emits; that is up to about 500 kBytes for now. I think we don't want to make the length longer. If we set wal_segment_size to 1MB and set the block size to 16kB or more, we would have a record spanning over three or more segments, but I don't think that is a sane configuration and that kind of issue could happen elsewhere. > Perhaps these assumptions are true, but it doesn't seem obvious to me > that they are, and they might be pretty fragile. I added an assertion that a record must be shorter than a wal segment to XLogRecordAssemble(). This guarantees the assumption to be true? (The condition is tentative, would need to be adjusted.) > The attached patch doesn't make use of these assumptions. Instead, we > track the positions of the records that cross segment boundaries in a > small hash map, and we use that to determine when it is safe to mark a > segment as ready for archival. I think this approach resembles > Matsumura-san's patch from June. > > As before, I'm not handling replication, archive_timeout, and > persisting latest-marked-ready through crashes yet. For persisting > the latest-marked-ready segment through crashes, I was thinking of > using a new file that stores the segment number. Also, the attached is a PoC. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 5714bd064d61135c41a64ecc39aeff74c25a0e74 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Tue, 15 Dec 2020 16:24:13 +0900 Subject: [PATCH v3] PoC: Avoid archiving a WAL segment that continues to the next segment If the last record of a finished segment continues to the next segment, we need to defer archiving of the segment until the record is flushed to the end. Otherwise crash recovery can overwrite the last record of a segment and history diverges between archive and pg_wal. 
---
 src/backend/access/transam/xlog.c       | 187 +++-
 src/backend/access/transam/xloginsert.c |   3 +
 src/backend/replication/walsender.c     |  14 +-
 src/include/access/xlog.h               |   1 +
 4 files changed, 193 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8dd225c2e1..98da521601 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -723,6 +723,16 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;

+	/* The last segment notified to be archived. Protected by WALWriteLock */
+	XLogSegNo	lastNotifiedSeg;
+
+	/*
+	 * Remember the oldest and newest known segment that ends with a
+	 * continuation record.
+	 */
+	XLogRecPtr	firstSegContRecStart;
+	XLogRecPtr	lastSegContRecEnd;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;

@@ -1160,6 +1170,9 @@ XLogInsertRecord(XLogRecData *rdata,
 	 */
 	if (StartPos / XLOG_BLCKSZ != EndPos /
Re: archive status ".ready" files may be created too early
On 14.10.2020 03:06, Kyotaro Horiguchi wrote:
> The patch includes a fix for the primary->standby case. But I'm not
> sure we can do that in the cascaded case. A standby is not aware of
> the structure of a WAL blob and has no idea of up to where to send the
> received blobs. However, if we can rely on the behavior of CopyData
> that we always receive a blob as a whole, sent from the sender at
> once, the cascaded standbys are free from the trouble (as far as the
> cascaded standby doesn't crash just before writing the last half of a
> record into pg_wal and after archiving the last full segment, which
> seems unlikely).
>
> regards.

Status update for a commitfest entry.

This entry was "Waiting on author" during this CF. As I see, the latest
message contains a new version of the patch. Does it need more work? Are
you going to continue working on it?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: archive status ".ready" files may be created too early
Thanks for visiting this thread.

At Mon, 12 Oct 2020 15:04:40 +0300, Heikki Linnakangas wrote in
> On 07/07/2020 12:02, matsumura@fujitsu.com wrote:
> > At Monday, July 6, 2020 05:13:40 +, "Kyotaro Horiguchi" wrote in
> >>>> after WAL buffer is filled up to the requested position. So when it
> >>>> crosses segment boundary we know the all past cross segment-boundary
> >>>> records are stable. That means all we need to remember is only the
> >>>> position of the latest cross-boundary record.
> >>>
> >>> I could not agree. In the following case, it may not work well.
> >>> - record-A and record-B (record-B is a newer one) is copied, and
> >>> - lastSegContRecStart/End points to record-B's, and
> >>> - FlushPtr is proceeded to in the middle of record-A.
> >>
> >> IIUC, that means record-B is a cross segment-border record and we have
> >> flushed beyond the record-B. In that case crash recovery afterwards
> >> can read the complete record-B and will finish recovery *after* the
> >> record-B. That's what we need here.
> >
> > I'm sorry I didn't explain enough.
> > Record-A and Record-B are cross segment-border records.
> > Record-A spans segment X and X+1
> > Record-B spans segment X+2 and X+3.
> > If both records have been inserted to WAL buffer,
> > lastSegContRecStart/End points to Record-B.
> > If a writer flushes up to the middle of segment X+1,
> > NotifyStableSegments() allows the writer to notify segment X.
> > Is my understanding correct?
>
> I think this little ASCII drawing illustrates the above scenario:
>
>     A          F      B
>  |-------|-------|-------|
>    seg X   seg X+1  seg X+2
>
> A and B are Record-A and Record-B. F is the current flush
> pointer.

I modified the figure a bit for the explanation below.

           F0      F1
    A          F      B
 |-------|-------|-------|
   seg X   seg X+1  seg X+2

Matsumura-san has a concern about the case where there are two (or
more) partially-flushed segment-spanning records at the same time.

This patch remembers only the last cross-segment record. If we were
going to flush up to F0 after Record-B had been written, we would fail
to hold off archiving seg X. This patch is based on the assumption that
that case cannot happen because we don't leave a pending page at the
time of a segment switch and no record spans three or more segments.

> In this case, it would be OK to notify segment X, as long as F is
> greater than the end of record A. And if I'm reading Kyotaro's patch
> correctly, that's what would happen with the patch.
>
> The patch seems correct to me. I'm a bit sad that we have to track yet
> another WAL position (two, actually) to fix this, but I don't see a
> better way.

Does "the two" mean Record-A and B? Is it needed even with the
assumption above?

> I wonder if we should arrange things so that XLogwrtResult.Flush never
> points in the middle of a record? I'm not totally convinced that all

That happens at a good percentage of page boundaries. And a record can
span three or more pages. Do we need to avoid all such cases? I did
that only for the cross-segment case.

> the current callers of GetFlushRecPtr() are OK with a middle-of-WAL
> record value. Could we get into similar trouble if a standby
> replicates half of a cross-segment record to a cascaded standby, and
> the cascaded standby has WAL archiving enabled?

The patch includes a fix for the primary->standby case. But I'm not
sure we can do that in the cascaded case. A standby is not aware of the
structure of a WAL blob and has no idea of up to where to send the
received blobs. However, if we can rely on the behavior of CopyData
that we always receive a blob as a whole, sent from the sender at once,
the cascaded standbys are free from the trouble (as far as the cascaded
standby doesn't crash just before writing the last half of a record
into pg_wal and after archiving the last full segment, which seems
unlikely).

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center

>From e8320f7486c057ccb97be56ce1859296b8072256 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Tue, 13 Oct 2020 20:41:33 +0900
Subject: [PATCH v2] Avoid archiving a WAL segment that continues to the next
 segment

If the last record of a finished segment continues to the next segment,
we need to defer archiving of the segment until the record is flushed to
the end. Otherwise crash recovery can overwrite the last record of a
segment and history diverges between archive and pg_wal.
---
 src/backend/access/transam/xlog.c   | 160 +++-
 src/backend/replication/walsender.c |  14 +-
 src/include/access/xlog.h           |   1 +
 3 files changed, 163 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 52a67b1170..8fd0fdb598 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -723,6 +723,16 @@ typedef struct XLogCtlData
Re: archive status ".ready" files may be created too early
On 07/07/2020 12:02, matsumura@fujitsu.com wrote:
> At Monday, July 6, 2020 05:13:40 +, "Kyotaro Horiguchi" wrote in
> >>> after WAL buffer is filled up to the requested position. So when it
> >>> crosses segment boundary we know the all past cross segment-boundary
> >>> records are stable. That means all we need to remember is only the
> >>> position of the latest cross-boundary record.
> >>
> >> I could not agree. In the following case, it may not work well.
> >> - record-A and record-B (record-B is a newer one) is copied, and
> >> - lastSegContRecStart/End points to record-B's, and
> >> - FlushPtr is proceeded to in the middle of record-A.
> >
> > IIUC, that means record-B is a cross segment-border record and we have
> > flushed beyond the record-B. In that case crash recovery afterwards
> > can read the complete record-B and will finish recovery *after* the
> > record-B. That's what we need here.
>
> I'm sorry I didn't explain enough.
> Record-A and Record-B are cross segment-border records.
> Record-A spans segment X and X+1
> Record-B spans segment X+2 and X+3.
> If both records have been inserted to WAL buffer,
> lastSegContRecStart/End points to Record-B.
> If a writer flushes up to the middle of segment X+1,
> NotifyStableSegments() allows the writer to notify segment X.
> Is my understanding correct?

I think this little ASCII drawing illustrates the above scenario:

    A          F      B
 |-------|-------|-------|
   seg X   seg X+1  seg X+2

A and B are Record-A and Record-B. F is the current flush pointer. In
this case, it would be OK to notify segment X, as long as F is greater
than the end of record A. And if I'm reading Kyotaro's patch correctly,
that's what would happen with the patch.

The patch seems correct to me. I'm a bit sad that we have to track yet
another WAL position (two, actually) to fix this, but I don't see a
better way.

I wonder if we should arrange things so that XLogwrtResult.Flush never
points in the middle of a record? I'm not totally convinced that all
the current callers of GetFlushRecPtr() are OK with a middle-of-WAL
record value.
Could we get into similar trouble if a standby replicates half of a cross-segment record to a cascaded standby, and the cascaded standby has WAL archiving enabled? - Heikki
Re: archive status ".ready" files may be created too early
On Wed, Jul 22, 2020 at 02:53:49AM +, matsumura@fujitsu.com wrote:
> Then, Record-A may be invalidated by crash-recovery and overwritten by new
> WAL record.
> The segment-X is not same as the archived one.

Please note that the latest patch fails to apply per the CF bot, so a
rebase would be in order to have at least some automated tests for the
last patch.
--
Michael
RE: archive status ".ready" files may be created too early
Hello,

> At Mon, 13 Jul 2020 01:57:36 +, "Kyotaro Horiguchi" wrote in
> Am I missing something here?

I write it in more detail (*).

Record-A and Record-B are cross segment-border records.
Record-A spans segment X and X+1.
Record-B spans segment X+2 and X+3.
If both records have been inserted to WAL buffer,
lastSegContRecStart/End points to Record-B.

* If a writer flushes segment X and a part of X+1 but record-A is not
flushed completely, NotifyStableSegments() allows the writer to notify
segment X.

Then, Record-A may be invalidated by crash-recovery and overwritten by a
new WAL record. The segment X is not the same as the archived one.

Regards
Ryo Matsumura
Re: archive status ".ready" files may be created too early
Hello.

# Sorry, I wrongly thought that I had replied to this thread..

At Tue, 7 Jul 2020 09:02:56 +, "matsumura@fujitsu.com" wrote in
> At Monday, July 6, 2020 05:13:40 +, "Kyotaro Horiguchi" wrote in
> >>>> after WAL buffer is filled up to the requested position. So when it
> >>>> crosses segment boundary we know the all past cross segment-boundary
> >>>> records are stable. That means all we need to remember is only the
> >>>> position of the latest cross-boundary record.
> >>>
> >>> I could not agree. In the following case, it may not work well.
> >>> - record-A and record-B (record-B is a newer one) is copied, and
> >>> - lastSegContRecStart/End points to record-B's, and
> >>> - FlushPtr is proceeded to in the middle of record-A.
> >>
> >> IIUC, that means record-B is a cross segment-border record and we have
> >> flushed beyond the record-B. In that case crash recovery afterwards
> >> can read the complete record-B and will finish recovery *after* the
> >> record-B. That's what we need here.
>
> I'm sorry I didn't explain enough.
>
> Record-A and Record-B are cross segment-border records.
> Record-A spans segment X and X+1
> Record-B spans segment X+2 and X+3.

Ok.

> If both records have been inserted to WAL buffer, lastSegContRecStart/End
> points to Record-B.
> If a writer flushes up to the middle of segment X+1, NotifyStableSegments()
> allows the writer to notify segment X.
> Is my understanding correct?

I think that cannot happen, since segment X must have been flushed by
the time Record-A is completely flushed out. When we write to the next
segment, we have already flushed and closed the whole previous segment.
If that were not the case, we would be archiving segment files that are
not fully flushed, and would get broken archive files. Am I missing
something here?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
RE: archive status ".ready" files may be created too early
Hello, Horiguchi-san

At Monday, July 6, 2020 05:13:40 +, "Kyotaro Horiguchi" wrote in
> >> after WAL buffer is filled up to the requested position. So when it
> >> crosses segment boundary we know the all past cross segment-boundary
> >> records are stable. That means all we need to remember is only the
> >> position of the latest cross-boundary record.
> >
> > I could not agree. In the following case, it may not work well.
> > - record-A and record-B (record-B is a newer one) is copied, and
> > - lastSegContRecStart/End points to record-B's, and
> > - FlushPtr is proceeded to in the middle of record-A.
>
> IIUC, that means record-B is a cross segment-border record and we have
> flushed beyond the record-B. In that case crash recovery afterwards
> can read the complete record-B and will finish recovery *after* the
> record-B. That's what we need here.

I'm sorry I didn't explain enough.

Record-A and Record-B are cross segment-border records.
Record-A spans segment X and X+1
Record-B spans segment X+2 and X+3.
If both records have been inserted to WAL buffer,
lastSegContRecStart/End points to Record-B.
If a writer flushes up to the middle of segment X+1,
NotifyStableSegments() allows the writer to notify segment X.
Is my understanding correct?

Regards
Ryo Matsumura
Re: archive status ".ready" files may be created too early
Hello, Matsumura-san.

At Mon, 6 Jul 2020 04:02:23 +, "matsumura@fujitsu.com" wrote in
> Hello, Horiguchi-san
>
> Thank you for your comment and patch.
>
> At Thursday, June 25, 2020 3:36 PM (JST), "Kyotaro Horiguchi" wrote in
> > I think we don't need most of that shmem stuff. XLogWrite is called
>
> I wanted no more shmem stuff too, but other ideas need more locking
> that excludes inserter and writer from each other.
>
> > after WAL buffer is filled up to the requested position. So when it
> > crosses segment boundary we know the all past cross segment-boundary
> > records are stable. That means all we need to remember is only the
> > position of the latest cross-boundary record.
>
> I could not agree. In the following case, it may not work well.
> - record-A and record-B (record-B is a newer one) is copied, and
> - lastSegContRecStart/End points to record-B's, and
> - FlushPtr is proceeded to in the middle of record-A.

IIUC, that means record-B is a cross segment-border record and we have
flushed beyond the record-B. In that case crash recovery afterwards
can read the complete record-B and will finish recovery *after* the
record-B. That's what we need here.

> In the above case, the writer should notify segments before record-A,
> but it notifies ones before record-B. If the writer notifies

If you mean that NotifyStableSegments notifies up to the previous
segment of the segment where record-A is placed, that's wrong. The
issue here is that crash recovery sees an incomplete record at a
segment border. So it is sufficient that crash recovery can read the
last record by looking at pg_wal.

> only when it flushes the latest record completely, it works well.

It confirms that "lastSegContRecEnd < LogwrtResult.Flush", which means
the last record (B) is completely flushed out, doesn't it? So it works
well.

> But the writer may not be able to notify any segment forever when
> WAL records crossing a segment boundary are inserted continuously.

No. As I mentioned in the previous mail, if we see a
cross-segment-boundary record, the previous cross-segment-boundary
record has been flushed completely, and the segment containing the
first half of the previous cross-segment-boundary record has already
been flushed. I didn't do that, but we can put an assertion in
XLogInsertRecord like this:

+	/* Remember the range of the record if it spans over segments */
+	XLByteToSeg(StartPos, startseg, wal_segment_size);
+	XLByteToPrevSeg(EndPos, endseg, wal_segment_size);
+
+	if (startseg != endseg)
+	{
++		/* we shouldn't have a record spanning over three or more segments */
++		Assert(endseg == startseg + 1);
+		SpinLockAcquire(&XLogCtl->info_lck);
+		if (XLogCtl->lastSegContRecEnd < StartPos)
+		{
+			XLogCtl->lastSegContRecStart = StartPos;
+			XLogCtl->lastSegContRecEnd = EndPos;

> So I think that we must remember all such cross-segment-boundary
> records' EndRecPtr in a buffer.
>
> > If we call XLogMarkEndRecPtrIfNeeded() there, the function is called
> > every time a record is written, most of which are wasteful.
> > XLogInsertRecord already has a code block executed only at every page
> > boundary.
>
> I agree.
> XLogMarkEndRecPtrIfNeeded() is moved into the code block before updating
> LogwrtRqst.Write for avoiding passing each other with the writer.
>
> > Now we can identify the stable portion of the WAL stream. It's enough
> > to prevent walsender from sending data that can be overwritten
> > afterwards. GetReplicationTargetRecPtr() in the attached does that.
>
> I didn't notice it.
> I agree basically, but it is based on lastSegContRecStart/End.
>
> So, first of all, we have to agree on what should be remembered.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
RE: archive status ".ready" files may be created too early
Hello, Horiguchi-san

Thank you for your comment and patch.

At Thursday, June 25, 2020 3:36 PM (JST), "Kyotaro Horiguchi" wrote in
> I think we don't need most of that shmem stuff. XLogWrite is called

I wanted no more shmem stuff too, but other ideas need more locking
that excludes inserter and writer from each other.

> after WAL buffer is filled up to the requested position. So when it
> crosses segment boundary we know the all past cross segment-boundary
> records are stable. That means all we need to remember is only the
> position of the latest cross-boundary record.

I could not agree. In the following case, it may not work well.
- record-A and record-B (record-B is a newer one) is copied, and
- lastSegContRecStart/End points to record-B's, and
- FlushPtr is proceeded to in the middle of record-A.

In the above case, the writer should notify segments before record-A,
but it notifies ones before record-B. If the writer notifies only when
it flushes the latest record completely, it works well. But the writer
may never be able to notify any segment when WAL records crossing a
segment boundary are inserted continuously. So I think that we must
remember all such cross-segment-boundary records' EndRecPtr in a
buffer.

> If we call XLogMarkEndRecPtrIfNeeded() there, the function is called
> every time a record is written, most of which are wasteful.
> XLogInsertRecord already has a code block executed only at every page
> boundary.

I agree.
XLogMarkEndRecPtrIfNeeded() is moved into the code block before updating
LogwrtRqst.Write to avoid passing each other with the writer.

> Now we can identify stable portion of WAL stream. It's enough to
> prevent walsender from sending data that can be overwritten
> afterwards. GetReplicationTargetRecPtr() in the attached does that.

I didn't notice it.
I agree basically, but it is based on lastSegContRecStart/End.

So, first of all, we have to agree on what should be remembered.

Regards
Ryo Matsumura
Re: archive status ".ready" files may be created too early
Hello, Matsumura-san.

I agree that WAL writer is not the place to notify segments. And the
direction you suggested would work.

At Fri, 19 Jun 2020 10:18:34 +, "matsumura@fujitsu.com" wrote in
> 1. Description in primary side
>
> [Basic problem]
> A process flushing WAL record doesn't know whether the flushed RecPtr is
> EndRecPtr of cross-segment-boundary WAL record or not because only the
> process inserting the WAL record knows it and it never memorizes the
> information anywhere.
>
> [Basic concept of the patch in primary]
> A process inserting a cross-segment-boundary WAL record memorizes its
> EndRecPtr (I call it CrossBoundaryEndRecPtr) in a new structure in XLogCtl.
> A flushing process creates .ready (later, I call it just 'notify') against
> a segment that is the previous one including CrossBoundaryEndRecPtr only
> when its flushed RecPtr is equal to or greater than the
> CrossBoundaryEndRecPtr.
...
> See also the definition of XLogCtl, XLOGShmemSize(), and XLOGShmemInit() in
> my patch.

I think we don't need most of that shmem stuff. XLogWrite is called
after the WAL buffer is filled up to the requested position. So when it
crosses a segment boundary, we know that all past cross-segment-boundary
records are stable. That means all we need to remember is only the
position of the latest cross-boundary record.

> * Action of inserting process
> An inserting process memorizes its CrossBoundaryEndRecPtr in the
> CrossBoundaryEndRecPtr array element calculated by XLogRecPtrToBufIdx with
> its CrossBoundaryEndRecPtr.
> If the WAL record crosses many segments, only the element against the last
> segment including the EndRecPtr is set and the others are not set.
> See also CopyXLogRecordToWAL() in my patch.

If we call XLogMarkEndRecPtrIfNeeded() there, the function is called
every time a record is written, most of which is wasteful.
XLogInsertRecord already has a code block executed only at every page
boundary.

> * Action of flushing process
> The overview has already been written as follows.
>
> A flushing process creates .ready (later, I call it just 'notify') against
> a segment that is the previous one including CrossBoundaryEndRecPtr only
> when its flushed RecPtr is equal to or greater than the
> CrossBoundaryEndRecPtr.
>
> An additional detail is as follows. The flushing process may notify
> many segments if the record crosses many segments, so the process memorizes
> the latest notified segment number in latestArchiveNotifiedSegNo in XLogCtl.
> The process notifies from latestArchiveNotifiedSegNo + 1 to
> the flushing segment number - 1.
>
> And latestArchiveNotifiedSegNo is set to EndOfLog after the Startup process
> exits the replay loop. Standby also sets it at the same timing (= before
> promoting).
>
> Mutual exclusion for latestArchiveNotifiedSegNo is not required because
> the timing of accessing it is already included in the WALWriteLock
> critical section.

Looks reasonable.

> 2. Description in standby side
>
> * Who notifies?
> walreceiver also doesn't know whether the flushed RecPtr is EndRecPtr of
> a cross-segment-boundary WAL record or not. In standby, only the Startup
> process knows the information because it is hidden in the WAL record itself
> and only the Startup process reads and builds WAL records.

A standby doesn't write its own WAL records. Even if the primary sent an
immature record on a segment boundary, the standby would just promote to
a new TLI and start its own history. Nothing breaks. However, it could
be a problem if a standby that crashed the problematic way were started
as-is as a primary; such a scenario is out of our scope.

Now we can identify the stable portion of the WAL stream. It's enough to
prevent walsender from sending data that can be overwritten afterwards.
GetReplicationTargetRecPtr() in the attached does that.

> * Action of Startup process
> Therefore, I implemented it so that walreceiver never notifies and the
> Startup process does it.
> In detail, when the Startup process reads one full-length WAL record, it
> notifies from the segment that includes the head (ReadRecPtr) of the record
> to the previous segment that includes the EndRecPtr of the record.

I don't like that archiving on standby relies on replay progress. We
should avoid that, and fortunately I think we don't need it.

> Now, we must pay attention to switching the timeline.
> The last segment of the previous TimeLineID must be notified before
> switching. This case is considered when RM_XLOG_ID is detected.

That segment is archived after being renamed as ".partial" later. We
don't archive the last incomplete segment of the previous timeline
as-is.

> 3. About other notifying for segments
> Two notifyings for segments remain. They do not need to be fixed.
>
> (1) Notifying for partial segment
> It is not needed to be completed, so it's OK to notify without special
> consideration.
>
> (2) Re-notifying
> Currently, Checkpointer has notified through XLogArchiveCheckDone().
> It is a
RE: archive status ".ready" files may be created too early
> On 5/28/20, 11:42 PM, "matsumura@fujitsu.com" wrote:
> > I'm preparing a patch that backend inserting segment-crossboundary
> > WAL record leaves its EndRecPtr and someone flushing it checks
> > the EndRecPtr and notifies..

I'm sorry for my slow work. I attach a patch. I also attach a simple
targeted test for the primary.

1. Description in primary side

[Basic problem]
A process flushing a WAL record doesn't know whether the flushed RecPtr
is the EndRecPtr of a cross-segment-boundary WAL record or not, because
only the process inserting the WAL record knows it, and it never
memorizes the information anywhere.

[Basic concept of the patch in primary]
A process inserting a cross-segment-boundary WAL record memorizes its
EndRecPtr (I call it CrossBoundaryEndRecPtr) in a new structure in
XLogCtl. A flushing process creates .ready (later, I call it just
'notify') against a segment that is the previous one including
CrossBoundaryEndRecPtr only when its flushed RecPtr is equal to or
greater than the CrossBoundaryEndRecPtr.

[Detail of implementation in primary]

* Structure of CrossBoundaryEndRecPtrs
The requirements for the structure are as follows:
- The system must memorize multiple CrossBoundaryEndRecPtrs.
- A flushing process must be able to determine whether to notify or not
  with only the flushed RecPtr, briefly.
Therefore, I implemented the structure as an array (I call it the
CrossBoundaryEndRecPtr array) that is the same as the xlblck array.
Strictly, it is enough that the length is 'xbuffers/wal_segment_size',
but I chose 'xbuffers' for simplicity, which enables the flushing
process to use XLogRecPtrToBufIdx().
See also the definition of XLogCtl, XLOGShmemSize(), and XLOGShmemInit()
in my patch.

* Action of inserting process
An inserting process memorizes its CrossBoundaryEndRecPtr in the
CrossBoundaryEndRecPtr array element calculated by XLogRecPtrToBufIdx
with its CrossBoundaryEndRecPtr. If the WAL record crosses many
segments, only the element against the last segment including the
EndRecPtr is set and the others are not set.
See also CopyXLogRecordToWAL() in my patch.

* Action of flushing process
The overview has already been written above:

  A flushing process creates .ready (later, I call it just 'notify')
  against a segment that is the previous one including
  CrossBoundaryEndRecPtr only when its flushed RecPtr is equal to or
  greater than the CrossBoundaryEndRecPtr.

An additional detail is as follows. The flushing process may notify
many segments if the record crosses many segments, so the process
memorizes the latest notified segment number in
latestArchiveNotifiedSegNo in XLogCtl. The process notifies from
latestArchiveNotifiedSegNo + 1 to the flushing segment number - 1.

And latestArchiveNotifiedSegNo is set to EndOfLog after the Startup
process exits the replay loop. Standby also sets it at the same timing
(= before promoting).

Mutual exclusion for latestArchiveNotifiedSegNo is not required because
the timing of accessing it is already included in the WALWriteLock
critical section.

2. Description in standby side

* Who notifies?
walreceiver also doesn't know whether the flushed RecPtr is the
EndRecPtr of a cross-segment-boundary WAL record or not. In standby,
only the Startup process knows the information, because it is hidden in
the WAL record itself and only the Startup process reads and builds WAL
records.

* Action of Startup process
Therefore, I implemented it so that walreceiver never notifies and the
Startup process does it. In detail, when the Startup process reads one
full-length WAL record, it notifies from the segment that includes the
head (ReadRecPtr) of the record to the previous segment that includes
the EndRecPtr of the record.

Now, we must pay attention to switching the timeline. The last segment
of the previous TimeLineID must be notified before switching. This case
is considered when RM_XLOG_ID is detected.

3. About other notifying for segments
Two notifyings for segments remain. They do not need to be fixed.

(1) Notifying for partial segment
It is not needed to be completed, so it's OK to notify without special
consideration.

(2) Re-notifying
Currently, Checkpointer has notified through XLogArchiveCheckDone(). It
is a safety net for failure of notifying by a backend or WAL writer. A
backend or WAL writer doesn't retry to notify if it fails, but
Checkpointer retries to notify when it removes old segments. If it
fails to notify, then it does not remove the segment. This makes
Checkpointer retry notifying until the notifying succeeds. Also, in
this case, we can just notify without special consideration because
Checkpointer guarantees that all WAL records included in the segment
have already been flushed.

Please review and comment.

Regards
Ryo Matsumura

Attachments: test_in_primary.sh, bugfix_early_archiving_v1.0.patch
Re: archive status ".ready" files may be created too early
On 5/28/20, 11:42 PM, "matsumura@fujitsu.com" wrote:
> I'm preparing a patch that backend inserting segment-crossboundary
> WAL record leaves its EndRecPtr and someone flushing it checks
> the EndRecPtr and notifies..

Thank you for sharing your thoughts. I will be happy to take a look at
your patch.

Nathan