Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-08 Thread Noah Misch
On Sat, May 08, 2021 at 04:57:54PM +1200, Thomas Munro wrote: > On Sat, May 8, 2021 at 2:30 AM Tom Lane wrote: > > May 07 03:31:39 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11 > > That's -EAGAIN (assuming errnos match x86) and I guess it indicates > that VDC_MAX_RETRIES is exceeded

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Thomas Munro
On Sat, May 8, 2021 at 2:30 AM Tom Lane wrote: > May 07 03:31:39 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11 That's -EAGAIN (assuming errnos match x86) and I guess it indicates that VDC_MAX_RETRIES is exceeded here:

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Noah Misch
On Fri, May 07, 2021 at 10:18:14PM -0400, Tom Lane wrote: > Andres Freund writes: > > On 2021-05-07 17:14:18 -0700, Noah Misch wrote: > >> Having a flaky buildfarm member is bad news. I'll LD_PRELOAD the attached to > >> prevent fsync from reaching the kernel. Hopefully, that will make

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Tom Lane
Andres Freund writes: > On 2021-05-07 17:14:18 -0700, Noah Misch wrote: >> Having a flaky buildfarm member is bad news. I'll LD_PRELOAD the attached to >> prevent fsync from reaching the kernel. Hopefully, that will make the >> hardware-or-kernel trouble unreachable. (Changing

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Andres Freund
Hi, On 2021-05-07 17:14:18 -0700, Noah Misch wrote: > Having a flaky buildfarm member is bad news. I'll LD_PRELOAD the attached to > prevent fsync from reaching the kernel. Hopefully, that will make the > hardware-or-kernel trouble unreachable. (Changing 008_fsm_truncation.pl > wouldn't avoid

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Michael Paquier
On Fri, May 07, 2021 at 04:42:46PM +1200, Thomas Munro wrote: > Oh, and I see that 13 has 9989d37d "Remove XLogFileNameP() from the > tree" to fix this exact problem. I don't see that we'd be able to get a redesign of this area safe enough for a backpatch, but perhaps we (I?) had better put some

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Michael Paquier
On Fri, May 07, 2021 at 04:30:00PM -0400, Tom Lane wrote: > I can certainly see an argument for running some buildfarm animals > with fsync on (for all tests). I don't see a reason for forcing > them all to run some tests that way; and if I were going to do that, > I doubt that

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Noah Misch
On Fri, May 07, 2021 at 01:18:19PM -0400, Tom Lane wrote: > Realizing that 9989d37d prevents the assertion failure, I went > to see if thorntail had shown EIO failures without assertions. > Looking back 180 days, I found these: > > sysname |branch | snapshot | stage

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Tom Lane
Andres Freund writes: > Isn't this a good reason to have at least some tests run with fsync=on? Why? I can certainly see an argument for running some buildfarm animals with fsync on (for all tests). I don't see a reason for forcing them all to run some tests that way; and if I were going to do

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Andres Freund
Hi, On 2021-05-07 10:29:58 -0400, Tom Lane wrote: > I wrote: > > 1. No wonder we could not reproduce it anywhere else. I've warned > > the cfarm admins that their machine may be having hardware issues. > > I heard back from the machine's admin. The time of the crash I observed > matches

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Andrew Dunstan
On 5/7/21 11:27 AM, Andrew Dunstan wrote: > On 5/7/21 12:38 AM, Andres Freund wrote: >> Hi, >> >> On 2021-05-07 00:30:11 -0400, Tom Lane wrote: >>> Andres Freund writes: >>>> On 2021-05-06 21:43:32 -0400, Tom Lane wrote: >>>>> That I'm not sure about. gdb is certainly installed, and thorntail

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Tom Lane
I wrote: > Thomas Munro writes: >> Oh, and I see that 13 has 9989d37d "Remove XLogFileNameP() from the >> tree" to fix this exact problem. > Hah, so that maybe explains why thorntail has only shown this in > the v12 branch. Should we consider back-patching that? Realizing that 9989d37d

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Andrew Dunstan
On 5/7/21 12:38 AM, Andres Freund wrote: > Hi, > > On 2021-05-07 00:30:11 -0400, Tom Lane wrote: >> Andres Freund writes: >>> On 2021-05-06 21:43:32 -0400, Tom Lane wrote: >>>> That I'm not sure about. gdb is certainly installed, and thorntail is visibly running the current buildfarm

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Tom Lane
I wrote: > 1. No wonder we could not reproduce it anywhere else. I've warned > the cfarm admins that their machine may be having hardware issues. I heard back from the machine's admin. The time of the crash I observed matches exactly to these events in the kernel log: May 07 03:31:39 gcc202

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-06 Thread Tom Lane
Thomas Munro writes: > On Fri, May 7, 2021 at 1:43 PM Tom Lane wrote: >> The interesting part of this is frame 6, which points here: > Oh, and I see that 13 has 9989d37d "Remove XLogFileNameP() from the > tree" to fix this exact problem. Hah, so that maybe explains why thorntail has only shown
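
For readers following the 9989d37d reference: the commit title suggests the hazard came from a helper that returned a freshly allocated file-name string while an error was being reported. A minimal sketch of the allocation-free alternative, formatting into a caller-supplied stack buffer, might look like the following. Everything here is a hypothetical model, not PostgreSQL code: `model_wal_file_name`, `MODEL_FNAME_LEN`, and `MODEL_SEGS_PER_ID` are invented names, and the 24-hex-digit layout and 16 MB segment size are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sizes: 24 hex digits plus the terminating NUL. */
#define MODEL_FNAME_LEN   25
/* Assumed segments per 4 GB "log id" with 16 MB segments. */
#define MODEL_SEGS_PER_ID 256u

/*
 * Format a WAL-style segment name into a buffer owned by the caller.
 * No heap allocation happens here, so calling it while building an
 * error message inside a critical section cannot trip an
 * allocation assertion.  Returns the number of characters written.
 */
int
model_wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t segno)
{
    assert(buflen >= MODEL_FNAME_LEN);
    return snprintf(buf, buflen, "%08X%08X%08X",
                    (unsigned int) tli,
                    (unsigned int) (segno / MODEL_SEGS_PER_ID),
                    (unsigned int) (segno % MODEL_SEGS_PER_ID));
}
```

A caller would declare `char name[MODEL_FNAME_LEN];` on its own stack and pass it in, so the string lives exactly as long as the error-reporting call needs it.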

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-06 Thread Thomas Munro
On Fri, May 7, 2021 at 1:43 PM Tom Lane wrote: > The interesting part of this is frame 6, which points here: > > case SYNC_METHOD_FDATASYNC: > if (pg_fdatasync(fd) != 0) > ereport(PANIC, > (errcode_for_file_access(), >

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-06 Thread Andres Freund
Hi, On 2021-05-07 00:30:11 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2021-05-06 21:43:32 -0400, Tom Lane wrote: > >> That I'm not sure about. gdb is certainly installed, and thorntail is > >> visibly running the current buildfarm client and is configured with the > >> correct

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-06 Thread Tom Lane
Andres Freund writes: > On 2021-05-06 21:43:32 -0400, Tom Lane wrote: >> That I'm not sure about. gdb is certainly installed, and thorntail is >> visibly running the current buildfarm client and is configured with the >> correct core_file_glob, and I can report that the crash did leave a 'core'

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-06 Thread Andres Freund
Hi, On 2021-05-06 21:43:32 -0400, Tom Lane wrote: > 2. We evidently need to put a bit more effort into this error > reporting logic. More generally, I wonder how we could audit > the code for similar hazards elsewhere, because I bet there are > some. (Or ... could it be sane to run functions

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-06 Thread Noah Misch
On Thu, May 06, 2021 at 09:43:32PM -0400, Tom Lane wrote: > 2. We evidently need to put a bit more effort into this error > reporting logic. More generally, I wonder how we could audit > the code for similar hazards elsewhere, because I bet there are > some. (Or ... could it be sane to run

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-06 Thread Tom Lane
Thomas Munro writes: > While looking for something else, I noticed thorntail has failed twice > like this, on REL_12_STABLE: > TRAP: FailedAssertion("!(CritSectionCount == 0 || > (context)->allowInCritSection)", File: >

Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-06 Thread Thomas Munro
Hi, While looking for something else, I noticed thorntail has failed twice like this, on REL_12_STABLE: TRAP: FailedAssertion("!(CritSectionCount == 0 || (context)->allowInCritSection)", File:
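
The condition in the TRAP message can be read as "allocation is fine unless we are inside a critical section and this context has not opted in". A small self-contained model of that check, assuming nothing beyond the two names visible in the assertion text, is sketched below. `ModelContext`, `alloc_permitted`, and `model_alloc` are hypothetical stand-ins and not the mcxt.c implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory context; only the flag named in
 * the assertion is modeled. */
typedef struct ModelContext
{
    bool allowInCritSection;
} ModelContext;

/* Nesting depth of critical sections; zero means "not in one". */
int CritSectionCount = 0;

/* The predicate from the TRAP message, written as a function. */
bool
alloc_permitted(const ModelContext *context)
{
    return CritSectionCount == 0 || context->allowInCritSection;
}

/* Toy allocator: asserts the predicate, then hands out a static
 * scratch area instead of touching the heap. */
static char model_scratch[64];

void *
model_alloc(const ModelContext *context, size_t size)
{
    assert(alloc_permitted(context));
    return size <= sizeof(model_scratch) ? model_scratch : NULL;
}
```

In this model, an ordinary context passes the check only while `CritSectionCount` is zero; a context created with `allowInCritSection` set passes regardless, which is why the flag matters for code that must build a message while reporting a failure.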