Re: Seek failure at end of FSM file during WAL replay (in 11)
On Wed, Jul 24, 2019 at 01:30:42PM -0400, Tom Lane wrote: > Hm. AFAICS the immediate issuer of the error must have been > _mdnblocks(); there are other matches to that error string but > they are in places where we can tell which file the seek must > have been applied to, and it wasn't a FSM file. Yep, that matches my guesses. _mdnblocks() is the only match among the three places with this error string. > lseek() per se presumably would never return ENOENT. A more likely > theory is that the file wasn't actually open but only had a leftover > VFD entry, and when FileSize() -> FileAccess() tried to open it, > the open failed with ENOENT --- but _mdnblocks() would still call it > a seek failure. > > So I'd opine that this is a pretty high-level failure --- what are > we doing trying to replay WAL against a table that's been dropped? > Or if it wasn't dropped, why was the FSM removed? Interesting theory. In this particular workload, all DDLs are run when the product runs firstboot and the schema does not change afterwards, so the stuff does not drop any tables. Actually I have been able to extract more information from the log bundles I have as this stuff does a lookup of all the files of the data folder at the moment a log bundle is taken. For this relation the main fork exists on the primary and there is no FSM and VM. On the standby, the main fork also exists but there is a FSM, which I guess has been rebuilt at the follow-up recovery. So could it be possible that a FSM has been removed on the primary, with its removal done on the standby because of that without the proper VFD entry cleaned up? -- Michael signature.asc Description: PGP signature
Re: Seek failure at end of FSM file during WAL replay (in 11)
Michael Paquier writes: > Recently, one of the test beds we use has blown up once when doing > streaming replication like that: > FATAL: could not seek to end of file "base/16386/19817_fsm": No such >file or directory > CONTEXT: WAL redo at 60/8DA22448 for Heap2/CLEAN: remxid 65751197 > LOG: startup process (PID 44886) exited with exit code 1 > All the WAL records have been wiped out since, so I don't know exactly > what happened, but I could track down that this FSM file got removed > a couple of hours before as I got my hands on some FS-level logs which > showed a deletion. Hm. AFAICS the immediate issuer of the error must have been _mdnblocks(); there are other matches to that error string but they are in places where we can tell which file the seek must have been applied to, and it wasn't a FSM file. > Before blaming a lower level of > the application stack, I am wondering if we have some issues with > mdfd_vfd meaning that the file has been removed but that it is still > tracked as opened. lseek() per se presumably would never return ENOENT. A more likely theory is that the file wasn't actually open but only had a leftover VFD entry, and when FileSize() -> FileAccess() tried to open it, the open failed with ENOENT --- but _mdnblocks() would still call it a seek failure. So I'd opine that this is a pretty high-level failure --- what are we doing trying to replay WAL against a table that's been dropped? Or if it wasn't dropped, why was the FSM removed? regards, tom lane
Seek failure at end of FSM file during WAL replay (in 11)
Hi all, Recently, one of the test beds we use has blown up once when doing streaming replication like that: FATAL: could not seek to end of file "base/16386/19817_fsm": No such file or directory CONTEXT: WAL redo at 60/8DA22448 for Heap2/CLEAN: remxid 65751197 LOG: startup process (PID 44886) exited with exit code 1 All the WAL records have been wiped out since, so I don't know exactly what happened, but I could track down that this FSM file got removed a couple of hours before as I got my hands on some FS-level logs which showed a deletion. This happens in the context of a WAL record XLOG_HEAP2_CLEAN, and the redo logic is in heap_xlog_clean(), where there are FSM lookups within XLogRecordPageWithFreeSpace() -> XLogReadBufferExtended(). At the subsequent restart, recovery has been able to move on after the failing record, so the FSM has been rebuilt correctly, still that caused an HA setup to be less... Available. However, we are rather careful in those code paths to call smgrcreate() so as the file gets created at redo if it is not around. Before blaming a lower level of the application stack, I am wondering if we have some issues with mdfd_vfd meaning that the file has been removed but that it is still tracked as opened. A quick lookup of the code does not show any issues, has anyone seen this particular error recently? The last commit on REL_11_STABLE which touched this area is this one FWIW: commit: 6872c2be6a97057aa736110e31f0390a53305c9e author: Alvaro Herrera date: Wed, 15 Aug 2018 18:09:29 -0300 Update FSM on WAL replay of page all-visible/frozen Also, this setup was using 11.2 (I know this one lags behind a bit, anyway...). Thanks, -- Michael signature.asc Description: PGP signature