Re: Seek failure at end of FSM file during WAL replay (in 11)

2019-07-25 Thread Michael Paquier
On Wed, Jul 24, 2019 at 01:30:42PM -0400, Tom Lane wrote:
> Hm.  AFAICS the immediate issuer of the error must have been
> _mdnblocks(); there are other matches to that error string but
> they are in places where we can tell which file the seek must
> have been applied to, and it wasn't a FSM file.

Yep, that matches my guesses.  _mdnblocks() is the only match among
the three places with this error string.

> lseek() per se presumably would never return ENOENT.  A more likely
> theory is that the file wasn't actually open but only had a leftover
> VFD entry, and when FileSize() -> FileAccess() tried to open it,
> the open failed with ENOENT --- but _mdnblocks() would still call it
> a seek failure.
> 
> So I'd opine that this is a pretty high-level failure --- what are
> we doing trying to replay WAL against a table that's been dropped?
> Or if it wasn't dropped, why was the FSM removed?

Interesting theory.  In this particular workload, all DDLs are run
when the product runs firstboot and the schema does not change
afterwards, so the stuff does not drop any tables.  Actually I have
been able to extract more information from the log bundles I have as
this stuff does a lookup of all the files of the data folder at the
moment a log bundle is taken.  For this relation the main fork exists
on the primary and there is no FSM and VM.  On the standby, the main
fork also exists but there is a FSM, which I guess has been rebuilt at
the follow-up recovery.  So could it be possible that a FSM has been
removed on the primary, with its removal done on the standby because
of that without the proper VFD entry cleaned up?
--
Michael


signature.asc
Description: PGP signature


Re: Seek failure at end of FSM file during WAL replay (in 11)

2019-07-24 Thread Tom Lane
Michael Paquier  writes:
> Recently, one of the test beds we use has blown up once when doing
> streaming replication like that:
> FATAL:  could not seek to end of file "base/16386/19817_fsm": No such
>file or directory 
> CONTEXT:  WAL redo at 60/8DA22448 for Heap2/CLEAN: remxid 65751197
> LOG:  startup process (PID 44886) exited with exit code 1

> All the WAL records have been wiped out since, so I don't know exactly
> what happened, but I could track down that this FSM file got removed
> a couple of hours before as I got my hands on some FS-level logs which
> showed a deletion.

Hm.  AFAICS the immediate issuer of the error must have been
_mdnblocks(); there are other matches to that error string but
they are in places where we can tell which file the seek must
have been applied to, and it wasn't a FSM file.

> Before blaming a lower level of
> the application stack, I am wondering if we have some issues with
> mdfd_vfd meaning that the file has been removed but that it is still
> tracked as opened.

lseek() per se presumably would never return ENOENT.  A more likely
theory is that the file wasn't actually open but only had a leftover
VFD entry, and when FileSize() -> FileAccess() tried to open it,
the open failed with ENOENT --- but _mdnblocks() would still call it
a seek failure.

So I'd opine that this is a pretty high-level failure --- what are
we doing trying to replay WAL against a table that's been dropped?
Or if it wasn't dropped, why was the FSM removed?

regards, tom lane




Seek failure at end of FSM file during WAL replay (in 11)

2019-07-24 Thread Michael Paquier
Hi all,

Recently, one of the test beds we use has blown up once when doing
streaming replication like that:
FATAL:  could not seek to end of file "base/16386/19817_fsm": No such
   file or directory 
CONTEXT:  WAL redo at 60/8DA22448 for Heap2/CLEAN: remxid 65751197
LOG:  startup process (PID 44886) exited with exit code 1

All the WAL records have been wiped out since, so I don't know exactly
what happened, but I could track down that this FSM file got removed
a couple of hours before as I got my hands on some FS-level logs which
showed a deletion.

This happens in the context of a WAL record XLOG_HEAP2_CLEAN, and the
redo logic is in heap_xlog_clean(), where there are FSM lookups within
XLogRecordPageWithFreeSpace() -> XLogReadBufferExtended().  At the
subsequent restart, recovery has been able to move on after the
failing record, so the FSM has been rebuilt correctly, still that
caused an HA setup to be less...  Available.  However, we are rather
careful in those code paths to call smgrcreate() so as the file gets
created at redo if it is not around.  Before blaming a lower level of
the application stack, I am wondering if we have some issues with
mdfd_vfd meaning that the file has been removed but that it is still
tracked as opened.  A quick lookup of the code does not show any
issues, has anyone seen this particular error recently?

The last commit on REL_11_STABLE which touched this area is this one
FWIW:
commit: 6872c2be6a97057aa736110e31f0390a53305c9e
author: Alvaro Herrera 
date: Wed, 15 Aug 2018 18:09:29 -0300
Update FSM on WAL replay of page all-visible/frozen

Also, this setup was using 11.2 (I know this one lags behind a bit,
anyway...).

Thanks,
--
Michael


signature.asc
Description: PGP signature