Re: "pg_ctl: the PID file ... is empty" at end of make check

2020-11-13 Thread Tom Lane
Thomas Munro  writes:
> My bug report got an automated-looking message telling me to retest in
> Big Sur beta 6 back in September, and I've just now upgraded an old
> x86 Mac to Big Sur 11.01 and I can no longer reproduce the problem, so
> it looks like it was fixed!  Thanks, Apple.

Hah!  I've occasionally despaired that Apple actually reads anything
I file there, so this is glad tidings indeed.

regards, tom lane




Re: "pg_ctl: the PID file ... is empty" at end of make check

2020-11-13 Thread Thomas Munro
On Fri, Oct 18, 2019 at 2:26 PM Thomas Munro  wrote:
> On Fri, Oct 18, 2019 at 1:21 AM Tom Lane  wrote:
> > Thomas Munro  writes:
> > > On Tue, Oct 15, 2019 at 1:55 PM Tom Lane  wrote:
> > >> and now prairiedog has shown it too:
> > >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog=2019-10-14%2021%3A45%3A47
> > >> which is positively fascinating, because prairiedog is running a
> > >> bronze-age version of macOS that surely never heard of APFS.
> > >> So this makes it look like this is a basic macOS bug that's not
> > >> as filesystem-dependent as one might think.
> >
> > > Does https://github.com/macdice/unlinktest show the problem on that 
> > > system?
> >
> > It does, though with a very low frequency:
> >
> > $ ./unlinktest
> > $ ./unlinktest 1
> > read 0 bytes, unexpected
> > $ ./unlinktest 1
> > read 0 bytes, unexpected
> > read 0 bytes, unexpected
> > $ ./unlinktest 1
> > $
> >
> > The failure rate on my recent-vintage laptop is more like one
> > failure every five loops.
>
> Wow.  Ok, I'll add a note to the bug report to say it's reproducible
> on "Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007;
> root:xnu-792.24.17~1/RELEASE_PPC" next time I'm near an Apple device
> that will let me log into the bug thing.  On the off-chance that
> someone from Apple stumbles on this and is interested, the Radar
> number is rdar://46318934 and the title is "unlink(2) is not atomic
> (kernel/file system bug)".

My bug report got an automated-looking message telling me to retest in
Big Sur beta 6 back in September, and I've just now upgraded an old
x86 Mac to Big Sur 11.01 and I can no longer reproduce the problem, so
it looks like it was fixed!  Thanks, Apple.




Re: "pg_ctl: the PID file ... is empty" at end of make check

2019-10-17 Thread Thomas Munro
On Fri, Oct 18, 2019 at 1:21 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > On Tue, Oct 15, 2019 at 1:55 PM Tom Lane  wrote:
> >> and now prairiedog has shown it too:
> >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog=2019-10-14%2021%3A45%3A47
> >> which is positively fascinating, because prairiedog is running a
> >> bronze-age version of macOS that surely never heard of APFS.
> >> So this makes it look like this is a basic macOS bug that's not
> >> as filesystem-dependent as one might think.
>
> > Does https://github.com/macdice/unlinktest show the problem on that system?
>
> It does, though with a very low frequency:
>
> $ ./unlinktest
> $ ./unlinktest 1
> read 0 bytes, unexpected
> $ ./unlinktest 1
> read 0 bytes, unexpected
> read 0 bytes, unexpected
> $ ./unlinktest 1
> $
>
> The failure rate on my recent-vintage laptop is more like one
> failure every five loops.

Wow.  Ok, I'll add a note to the bug report to say it's reproducible
on "Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007;
root:xnu-792.24.17~1/RELEASE_PPC" next time I'm near an Apple device
that will let me log into the bug thing.  On the off-chance that
someone from Apple stumbles on this and is interested, the Radar
number is rdar://46318934 and the title is "unlink(2) is not atomic
(kernel/file system bug)".




Re: "pg_ctl: the PID file ... is empty" at end of make check

2019-10-17 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Oct 15, 2019 at 1:55 PM Tom Lane  wrote:
>> and now prairiedog has shown it too:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog=2019-10-14%2021%3A45%3A47
>> which is positively fascinating, because prairiedog is running a
>> bronze-age version of macOS that surely never heard of APFS.
>> So this makes it look like this is a basic macOS bug that's not
>> as filesystem-dependent as one might think.

> Does https://github.com/macdice/unlinktest show the problem on that system?

It does, though with a very low frequency:

$ ./unlinktest 
$ ./unlinktest 1
read 0 bytes, unexpected
$ ./unlinktest 1
read 0 bytes, unexpected
read 0 bytes, unexpected
$ ./unlinktest 1
$ 

The failure rate on my recent-vintage laptop is more like one
failure every five loops.

regards, tom lane




Re: "pg_ctl: the PID file ... is empty" at end of make check

2019-10-14 Thread Thomas Munro
On Tue, Oct 15, 2019 at 1:55 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Agreed.  Secret non-shareable bug report filed.  Fingers crossed.
>
> Since that conversation, longfin has shown the same symptom
> just once more:
>
>  longfin | REL_11_STABLE | 2019-07-28 22:29:03 | recoveryCheck | waiting for 
> ser
> ver to shut down..pg_ctl: the PID file 
> "/Users/buildfarm/bf-data/REL_11_STAB
> LE/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdat
> a/postmaster.pid" is empty
>
> and now prairiedog has shown it too:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog=2019-10-14%2021%3A45%3A47
>
> which is positively fascinating, because prairiedog is running a
> bronze-age version of macOS that surely never heard of APFS.
> So this makes it look like this is a basic macOS bug that's not
> as filesystem-dependent as one might think.

Does https://github.com/macdice/unlinktest show the problem on that system?

> Trawling the buildfarm database finds no other matches in the last year,
> so whatever it is, it's pretty darn improbable.
>
> Did you ever get any reaction to that bug report?

No acknowledgement.   I did learn from an anonymous source that "it’s
prioritized and in the queue".




Re: "pg_ctl: the PID file ... is empty" at end of make check

2019-10-14 Thread Tom Lane
[ blast from the past dept. ]

Thomas Munro  writes:
> On Thu, Nov 29, 2018 at 3:30 AM Tom Lane  wrote:
>> Thomas Munro  writes:
>>> https://github.com/macdice/unlinktest

>> Bleah.  But you can do better than ask whether it's a bug: you can
>> quote POSIX:
>> ...
>> Not a lot of wiggle room there.

> Agreed.  Secret non-shareable bug report filed.  Fingers crossed.

Since that conversation, longfin has shown the same symptom
just once more:

 longfin | REL_11_STABLE | 2019-07-28 22:29:03 | recoveryCheck | waiting for ser
ver to shut down..pg_ctl: the PID file "/Users/buildfarm/bf-data/REL_11_STAB
LE/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdat
a/postmaster.pid" is empty

and now prairiedog has shown it too:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog=2019-10-14%2021%3A45%3A47

which is positively fascinating, because prairiedog is running a
bronze-age version of macOS that surely never heard of APFS.
So this makes it look like this is a basic macOS bug that's not
as filesystem-dependent as one might think.

Trawling the buildfarm database finds no other matches in the last year,
so whatever it is, it's pretty darn improbable.

Did you ever get any reaction to that bug report?

regards, tom lane




Re: "pg_ctl: the PID file ... is empty" at end of make check

2018-11-28 Thread Thomas Munro
On Thu, Nov 29, 2018 at 3:30 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > On Wed, Nov 28, 2018 at 6:47 PM Tom Lane  wrote:
> >> Is it possible that unlink() on APFS is not atomic?
>
> > I think you might be right.
> > https://github.com/macdice/unlinktest
>
> Bleah.  But you can do better than ask whether it's a bug: you can
> quote POSIX:
>
> The unlink() function shall remove a link to a file. If path names a
> symbolic link, unlink() shall remove the symbolic link named by path
> and shall not affect any file or directory named by the contents of
> the symbolic link. Otherwise, unlink() shall remove the link named by
> the pathname pointed to by path and shall decrement the link count of
> the file referenced by the link.
>
> When the file's link count becomes 0 and no process has the file open,
> the space occupied by the file shall be freed and the file shall no
> longer be accessible. If one or more processes have the file open when
> the last link is removed, the link shall be removed before unlink()
> returns, but the removal of the file contents shall be postponed until
> all references to the file are closed.
>
> Not a lot of wiggle room there.

Agreed.  Secret non-shareable bug report filed.  Fingers crossed.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: "pg_ctl: the PID file ... is empty" at end of make check

2018-11-28 Thread Tom Lane
Thomas Munro  writes:
> On Wed, Nov 28, 2018 at 6:47 PM Tom Lane  wrote:
>> Is it possible that unlink() on APFS is not atomic?

> I think you might be right.
> https://github.com/macdice/unlinktest

Bleah.  But you can do better than ask whether it's a bug: you can
quote POSIX:

The unlink() function shall remove a link to a file. If path names a
symbolic link, unlink() shall remove the symbolic link named by path
and shall not affect any file or directory named by the contents of
the symbolic link. Otherwise, unlink() shall remove the link named by
the pathname pointed to by path and shall decrement the link count of
the file referenced by the link.

When the file's link count becomes 0 and no process has the file open,
the space occupied by the file shall be freed and the file shall no
longer be accessible. If one or more processes have the file open when
the last link is removed, the link shall be removed before unlink()
returns, but the removal of the file contents shall be postponed until
all references to the file are closed.

Not a lot of wiggle room there.

regards, tom lane



Re: "pg_ctl: the PID file ... is empty" at end of make check

2018-11-28 Thread Thomas Munro
On Wed, Nov 28, 2018 at 6:47 PM Tom Lane  wrote:
> Is it possible that unlink() on APFS is not atomic?  That is, the
> sequence of events is something like
>
> pg_ctl: open("postmaster.pid")
>  postmaster: unlink("postmaster.pid")
> pg_ctl: reads file, gets zero bytes
>
> But that idea is in the OS-bug class, which we agree seems unlikely.

I think you might be right.

https://github.com/macdice/unlinktest

-- 
Thomas Munro
http://www.enterprisedb.com



Re: "pg_ctl: the PID file ... is empty" at end of make check

2018-11-27 Thread Tom Lane
Thomas Munro  writes:
> On Wed, Nov 28, 2018 at 5:28 PM Tom Lane  wrote:
>> So my theory is we broke something in HEAD a couple weeks ago.  But what?

> Hmm.  Not seeing it.  I'm trying to do it again, with a make check loop.

>> The fsync changes you made are suspiciously close to this issue (ie one
>> could explain it as written data not getting out), and were committed in
>> the right time frame, but that change didn't affect writes to
>> postmaster.pid did it?

> Commit 9ccdd7f6 doesn't affect writes to anything.  It just changes
> the elevel if certain fsync calls fail (and incidentally none near
> this code, and in any case there was no failure).

Yeah ... the other weird thing about this is that the postmaster log
shows normal shutdown.

Is it possible that unlink() on APFS is not atomic?  That is, the
sequence of events is something like

pg_ctl: open("postmaster.pid")
 postmaster: unlink("postmaster.pid")
pg_ctl: reads file, gets zero bytes

But that idea is in the OS-bug class, which we agree seems unlikely.

regards, tom lane



Re: "pg_ctl: the PID file ... is empty" at end of make check

2018-11-27 Thread Tom Lane
Thomas Munro  writes:
> Today I saw a one-off case of $SUBJECT, on macOS.  I can't reproduce
> it, but I noticed exactly the same thing on longfin the other day:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin=2018-11-25%2005%3A39%3A04

I trawled the buildfarm logs and discovered a second instance of exactly
the same thing:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin=2018-11-19%2018%3A37%3A00

There have not been any other occurrences in the past 3 months, which is
as far back as I went.  (lorikeet has half a dozen occurrences of "could
not stop postmaster", which is what I was grepping for, but they all
are associated with that machine's intermittent postmaster crashes.)

So that lets out the flaky-hardware theory: that occurrence is before
longfin's hardware transplant.

Also, I don't think I believe the OS-bug idea either, given that you
saw it on 10.14.0.  longfin's been running 10.14.something since
2018-09-26, and has accumulated circa 200 runs since then just on HEAD,
never mind the back branches.  It'd be pretty unlikely to see it only
in the past week, and only on HEAD, if it were an OS bug introduced two
months ago.

So my theory is we broke something in HEAD a couple weeks ago.  But what?

The fsync changes you made are suspiciously close to this issue (ie one
could explain it as written data not getting out), and were committed in
the right time frame, but that change didn't affect writes to
postmaster.pid did it?

regards, tom lane



Re: "pg_ctl: the PID file ... is empty" at end of make check

2018-11-27 Thread Thomas Munro
On Wed, Nov 28, 2018 at 4:10 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Today I saw a one-off case of $SUBJECT, on macOS.  I can't reproduce
> > it, but I noticed exactly the same thing on longfin the other day:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin=2018-11-25%2005%3A39%3A04
> > Anyone know what that's about?
>
> No :-(.  I was wondering about that longfin failure too.  What macOS
> version are you running?
>
> (longfin is up to date, 10.14.1.  Is it worth noting that I jacked
> up the name and drove new hardware underneath it just last week?
> Hard to see how that could be related, but ...)

I have not yet accepted the offer to update to 10.14.1, so I guess it
must be 10.14.0, though the about box just says 10.14.  According to
sw_vers, it's 10.14 build 18A391.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: "pg_ctl: the PID file ... is empty" at end of make check

2018-11-27 Thread Tom Lane
Thomas Munro  writes:
> Today I saw a one-off case of $SUBJECT, on macOS.  I can't reproduce
> it, but I noticed exactly the same thing on longfin the other day:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin=2018-11-25%2005%3A39%3A04
> Anyone know what that's about?

No :-(.  I was wondering about that longfin failure too.  What macOS
version are you running?

(longfin is up to date, 10.14.1.  Is it worth noting that I jacked
up the name and drove new hardware underneath it just last week?
Hard to see how that could be related, but ...)

regards, tom lane



"pg_ctl: the PID file ... is empty" at end of make check

2018-11-27 Thread Thomas Munro
Hello,

Today I saw a one-off case of $SUBJECT, on macOS.  I can't reproduce
it, but I noticed exactly the same thing on longfin the other day:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin=2018-11-25%2005%3A39%3A04

Anyone know what that's about?

-- 
Thomas Munro
http://www.enterprisedb.com