Re: [HACKERS] spoonbill vs. -HEAD

2013-05-07 Thread Stefan Kaltenbrunner

On 04/04/2013 02:18 AM, Tom Lane wrote:

Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:

On 04/03/2013 12:59 AM, Tom Lane wrote:

BTW, on further thought it seems like maybe this is an OpenBSD bug,
at least in part: what is evidently happening is that the temporary
blockage of SIGINT during the handler persists even after we've
longjmp'd back to the main loop.  But we're using sigsetjmp(..., 1)
to establish that longjmp handler --- so why isn't the original signal
mask reinstalled when we return to the main loop?

If (your version of) OpenBSD is getting this wrong, it'd explain why
we've not seen similar behavior elsewhere.



hmm trolling the openbsd cvs history brings up this:
http://www.openbsd.org/cgi-bin/cvsweb/src/sys/arch/sparc64/sparc64/machdep.c?r1=1.143;sortby=date#rev1.143


That's about alternate signal stacks, which we're not using.

I put together a simple test program (attached) and tried it on
spoonbill, and it says that the signal *does* get unblocked when control
returns to the sigsetjmp(...,1).  So now I'm really confused.  Somehow
the results we're getting in a full-fledged backend do not match up with
the results gotten by this test program ... but why?


As a followup to this - I spent some time upgrading spoonbill to the 
latest OpenBSD release (5.3) the other day, and it is now able to pass 
a full regression test run when started manually. I will add it to 
the regular reporting schedule again, but it seems at least part of the 
problem is/was an operating-system-level issue that got fixed in either 
5.2 or 5.3 (spoonbill was on 5.1 before).



Stefan




Re: [HACKERS] spoonbill vs. -HEAD

2013-04-03 Thread Stefan Kaltenbrunner
On 04/03/2013 12:59 AM, Tom Lane wrote:
 I wrote:
 I think the simplest fix is to insert PG_SETMASK(&UnBlockSig) into
 StatementCancelHandler() and any other handlers that might exit via
 longjmp.  I'm a bit inclined to only do this on platforms where a
 problem is demonstrable, which so far is only OpenBSD.  (You'd
 think that all BSDen would have the same issue, but the buildfarm
 shows otherwise.)
 
 BTW, on further thought it seems like maybe this is an OpenBSD bug,
 at least in part: what is evidently happening is that the temporary
 blockage of SIGINT during the handler persists even after we've
 longjmp'd back to the main loop.  But we're using sigsetjmp(..., 1)
 to establish that longjmp handler --- so why isn't the original signal
 mask reinstalled when we return to the main loop?
 
 If (your version of) OpenBSD is getting this wrong, it'd explain why
 we've not seen similar behavior elsewhere.

hmm trolling the openbsd cvs history brings up this:

http://www.openbsd.org/cgi-bin/cvsweb/src/sys/arch/sparc64/sparc64/machdep.c?r1=1.143;sortby=date#rev1.143


Stefan




Re: [HACKERS] spoonbill vs. -HEAD

2013-04-03 Thread Tom Lane
Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
 On 04/03/2013 12:59 AM, Tom Lane wrote:
 BTW, on further thought it seems like maybe this is an OpenBSD bug,
 at least in part: what is evidently happening is that the temporary
 blockage of SIGINT during the handler persists even after we've
 longjmp'd back to the main loop.  But we're using sigsetjmp(..., 1)
 to establish that longjmp handler --- so why isn't the original signal
 mask reinstalled when we return to the main loop?
 
 If (your version of) OpenBSD is getting this wrong, it'd explain why
 we've not seen similar behavior elsewhere.

 hmm trolling the openbsd cvs history brings up this:
 http://www.openbsd.org/cgi-bin/cvsweb/src/sys/arch/sparc64/sparc64/machdep.c?r1=1.143;sortby=date#rev1.143

That's about alternate signal stacks, which we're not using.

I put together a simple test program (attached) and tried it on
spoonbill, and it says that the signal *does* get unblocked when control
returns to the sigsetjmp(...,1).  So now I'm really confused.  Somehow
the results we're getting in a full-fledged backend do not match up with
the results gotten by this test program ... but why?

regards, tom lane

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <setjmp.h>

typedef void (*pqsigfunc) (int signo);

sigjmp_buf *PG_exception_stack;

static void
printmask(const char *labl)
{
	sigset_t	sigs;
	int			i;

	printf("masked signals at %s: ", labl);
	if (sigprocmask(SIG_SETMASK, NULL, &sigs) == 0)
	{
		for (i = 1; i <= 32; i++)
		{
//			if (sigs.sigset[0] & sigmask(i))
			if (sigs & sigmask(i))
				printf("%d, ", i);
		}
	}
	else
		printf("failed!");
	printf("\n");
	fflush(stdout);
}

static void
sigint_handler(int signo)
{
	printmask("sigint_handler");
	siglongjmp(*PG_exception_stack, 1);
	abort();
}

static pqsigfunc
pqsignal(int signo, pqsigfunc func)
{
	struct sigaction act,
				oact;

	act.sa_handler = func;
	sigemptyset(&act.sa_mask);
	act.sa_flags = 0;
	if (signo != SIGALRM)
		act.sa_flags |= SA_RESTART;
#ifdef SA_NOCLDSTOP
	if (signo == SIGCHLD)
		act.sa_flags |= SA_NOCLDSTOP;
#endif
	if (sigaction(signo, &act, &oact) < 0)
		return SIG_ERR;
	return oact.sa_handler;
}

void
intermediate_func(int level)
{
	do {
		sigjmp_buf *save_exception_stack = PG_exception_stack;
		sigjmp_buf	local_sigjmp_buf;

		if (sigsetjmp(local_sigjmp_buf, 0) == 0)
		{
			PG_exception_stack = &local_sigjmp_buf;

			if (level > 0)
				intermediate_func(level - 1);
			else
			{
				kill(getpid(), SIGINT);
				sleep(1);
				abort();
			}
		}
		else
		{
			PG_exception_stack = save_exception_stack;

			printmask("inner longjmp catcher");
			siglongjmp(*PG_exception_stack, 1);
			abort();
		}
		PG_exception_stack = save_exception_stack;
	} while (0);
}

int
main(int argc, char **argv)
{
	sigjmp_buf	outer_sigjmp_buf;

	printmask("startup");
	pqsignal(SIGINT, sigint_handler);

	if (sigsetjmp(outer_sigjmp_buf, 1) != 0)
	{
		printmask("outer longjmp catcher");
		return 0;
	}
	PG_exception_stack = &outer_sigjmp_buf;

	intermediate_func(2);

	printmask("unexpected exit");
	return 0;
}



Re: [HACKERS] spoonbill vs. -HEAD

2013-04-02 Thread Tom Lane
Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
 On 03/26/2013 11:30 PM, Tom Lane wrote:
 A different line of thought is that the cancel was received by the
 backend but didn't succeed in cancelling the query for some reason.

 I added the "PQcancel failed" codepath you suggested but it does not
 seem to get triggered at all, so the above might actually be what is
 happening...

Stefan was kind enough to grant me access to spoonbill, and after
some experimentation I found out the problem.  It seems that OpenBSD
blocks additional deliveries of a signal while the signal handler is
in progress, and that this is implemented by just calling sigprocmask()
before and after calling the handler.  Therefore, if the handler doesn't
return normally --- like, say, it longjmps --- the restoration of the
previous mask never happens.  So we're left with the signal still
blocked, meaning second and subsequent attempts to interrupt the backend
don't work.

This isn't revealed by the regular regression tests because they don't
exercise PQcancel, but several recently-added isolation tests do attempt
to PQcancel the same backend more than once.

It's a bit surprising that it's taken us this long to recognize the
problem.  Typical use of PQcancel doesn't necessarily cause a failure:
StatementCancelHandler() won't exit through longjmp unless
ImmediateInterruptOK is true, which is only the case while waiting for a
heavyweight lock.  But still, you'd think somebody would've run into
the case in normal usage.

I think the simplest fix is to insert PG_SETMASK(&UnBlockSig) into
StatementCancelHandler() and any other handlers that might exit via
longjmp.  I'm a bit inclined to only do this on platforms where a
problem is demonstrable, which so far is only OpenBSD.  (You'd
think that all BSDen would have the same issue, but the buildfarm
shows otherwise.)
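
For illustration, a minimal sketch of the shape that fix takes: the
handler explicitly clears the signal mask itself before escaping, since
the kernel's implicit restore-on-handler-return never runs on the
longjmp path.  (This sketch uses plain sigprocmask() instead of the
PG_SETMASK macro, and the handler name is made up; it is not the actual
patch.)

#include <signal.h>
#include <setjmp.h>

extern sigjmp_buf *PG_exception_stack;

static void
sketch_cancel_handler(int signo)
{
	sigset_t	unblock;

	/*
	 * Hypothetical workaround: re-enable all signals before escaping via
	 * siglongjmp, because the implicit "unblock on handler return" will
	 * never happen on this path.
	 */
	sigemptyset(&unblock);
	sigprocmask(SIG_SETMASK, &unblock, NULL);

	siglongjmp(*PG_exception_stack, 1);
}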

BTW, this does not seem to explain the symptoms shown at
http://www.postgresql.org/message-id/4fe4d89a.8020...@kaltenbrunner.cc
because what we were seeing there was that *all* signals appeared to be
blocked.  However, after this round of debugging I no longer have a lot
of faith in OpenBSD's ps, because it was lying to me about whether the
process had signals blocked or not (or at least, it couldn't see the
effects of the interrupt signal disable, although when I added debugging
code to print the active signal mask according to sigprocmask() I got
told the truth).  So I'm not sure how much trust to put in those older
ps results.  It's possible that the previous failures were a
manifestation of something related to this bug.

regards, tom lane




Re: [HACKERS] spoonbill vs. -HEAD

2013-04-02 Thread Tom Lane
I wrote:
 I think the simplest fix is to insert PG_SETMASK(&UnBlockSig) into
 StatementCancelHandler() and any other handlers that might exit via
 longjmp.  I'm a bit inclined to only do this on platforms where a
 problem is demonstrable, which so far is only OpenBSD.  (You'd
 think that all BSDen would have the same issue, but the buildfarm
 shows otherwise.)

BTW, on further thought it seems like maybe this is an OpenBSD bug,
at least in part: what is evidently happening is that the temporary
blockage of SIGINT during the handler persists even after we've
longjmp'd back to the main loop.  But we're using sigsetjmp(..., 1)
to establish that longjmp handler --- so why isn't the original signal
mask reinstalled when we return to the main loop?

If (your version of) OpenBSD is getting this wrong, it'd explain why
we've not seen similar behavior elsewhere.

This theory doesn't really exonerate our code completely, because we use
sigsetjmp(..., 0) in PG_TRY, which means that in a code path where we
catch a longjmp and don't PG_RE_THROW it, we could be left with the
signal disabled.  I don't believe that is happening in the
isolation-test cases, though.
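
To make the savemask distinction concrete, here is a small standalone
sketch (mine, not from the thread): with sigsetjmp(env, 1) the mask in
effect at setjmp time is saved and restored by siglongjmp, while with
sigsetjmp(env, 0) whatever mask the handler left installed survives the
jump - which is exactly the PG_TRY hazard described above.

#include <signal.h>
#include <setjmp.h>
#include <stdio.h>
#include <unistd.h>

static sigjmp_buf env;

static void
handler(int signo)
{
	/* SIGINT is implicitly blocked while the handler runs ... */
	siglongjmp(env, 1);			/* ... and we escape without returning */
}

int
main(void)
{
	struct sigaction sa;
	sigset_t	cur;

	sa.sa_handler = handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;			/* no SA_NODEFER: SIGINT is blocked in the handler */
	sigaction(SIGINT, &sa, NULL);

	/* Change the 1 to 0 here and SIGINT stays blocked after the jump */
	if (sigsetjmp(env, 1) == 0)
		kill(getpid(), SIGINT);

	sigprocmask(SIG_SETMASK, NULL, &cur);
	printf("SIGINT is %sblocked after the longjmp\n",
		   sigismember(&cur, SIGINT) ? "still " : "not ");
	return 0;
}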

regards, tom lane




Re: [HACKERS] spoonbill vs. -HEAD

2013-03-27 Thread Stefan Kaltenbrunner
On 03/26/2013 11:30 PM, Tom Lane wrote:
 Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
 hmm - will look into that in a bit - but I also just noticed that on the
 same day spoonbill broke there was also a commit to that file,
 immediately before that code, adding the fflush() calls.
 
 It's hard to see how those would be related to this symptom.  My bet
 is that the new fk-deadlock test exposed some pre-existing issue in
 isolationtester.  Not quite clear what yet, though.

yeah removing them does not seem to change the behaviour at all


 
 A different line of thought is that the cancel was received by the
 backend but didn't succeed in cancelling the query for some reason.

I added the "PQcancel failed" codepath you suggested but it does not
seem to get triggered at all, so the above might actually be what is
happening...


Stefan




Re: [HACKERS] spoonbill vs. -HEAD

2013-03-27 Thread Stefan Kaltenbrunner
On 03/26/2013 10:18 PM, Tom Lane wrote:
 Andrew Dunstan and...@dunslane.net writes:
 There is some timeout code already in the buildfarm client. It was 
 originally put there to help us when we got CVS hangs, a not infrequent 
 occurrence in the early days, so it's currently only used if configured 
 for the checkout phase, but it could easily be used to create a build 
 timeout which would kill the whole process group if the timeout expired. 
 It wouldn't work on Windows, and of course it won't solve whatever 
 problem caused the hang in the first place, but it still might be worth 
 doing.
 
 +1 --- at least then we'd get reports of failures, rather than the
 current behavior where the animal just stops reporting.

yeah I have had multiple buildfarm members running into similar issues
(like the still-unexplained issues on spoonbill from a year back:
http://www.postgresql.org/message-id/4fe4b674.3020...@kaltenbrunner.cc)
so I would really like to see an option for a global timeout.



Stefan




Re: [HACKERS] spoonbill vs. -HEAD

2013-03-27 Thread Andrew Dunstan


On 03/27/2013 03:49 PM, Stefan Kaltenbrunner wrote:

On 03/26/2013 10:18 PM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

There is some timeout code already in the buildfarm client. It was
originally put there to help us when we got CVS hangs, a not infrequent
occurrence in the early days, so it's currently only used if configured
for the checkout phase, but it could easily be used to create a build
timeout which would kill the whole process group if the timeout expired.
It wouldn't work on Windows, and of course it won't solve whatever
problem caused the hang in the first place, but it still might be worth
doing.

+1 --- at least then we'd get reports of failures, rather than the
current behavior where the animal just stops reporting.

yeah I have had multiple buildfarm members running into similar issues
(like the still-unexplained issues on spoonbill from a year back:
http://www.postgresql.org/message-id/4fe4b674.3020...@kaltenbrunner.cc)
so I would really like to see an option for a global timeout.



I have been looking at it briefly. There are a few wrinkles, but it's on 
the TODO list.


cheers

andrew





Stefan







[HACKERS] spoonbill vs. -HEAD

2013-03-26 Thread Stefan Kaltenbrunner
Hi all!


I finally started to investigate why spoonbill stopped reporting to the
buildfarm about 2 months ago.
It seems that the foreign-key locking patch (or some commit very close
to January 23rd) broke it in a fairly annoying way - running the
buildfarm script seems to consistently stall during the isolationtester
part of the regression testing, leaving the postgresql instance running
and causing all future buildfarm runs to fail...


The process listing at that time looks like:

https://www.kaltenbrunner.cc/files/process_listing.txt


pg_stat_activity of the running instance:


https://www.kaltenbrunner.cc/files/pg_stat_activity.txt


pg_locks:

https://www.kaltenbrunner.cc/files/pg_locks.txt


backtraces of the three backends:

https://www.kaltenbrunner.cc/files/bt_20467.txt
https://www.kaltenbrunner.cc/files/bt_20897.txt
https://www.kaltenbrunner.cc/files/bt_24285.txt




Stefan




Re: [HACKERS] spoonbill vs. -HEAD

2013-03-26 Thread Tom Lane
Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
 I finally started to investigate why spoonbill stopped reporting to the
 buildfarm about 2 months ago.
 It seems that the foreign-key locking patch (or some commit very close
 to January 23rd) broke it in a fairly annoying way - running the
 buildfarm script seems to consistently stall during the isolationtester
 part of the regression testing, leaving the postgresql instance running
 and causing all future buildfarm runs to fail...

It looks from here like the isolationtester client is what's dropping
the ball --- the backend states are unsurprising, and two of them are
waiting for a new client command.  Can you get a stack trace from the
isolationtester process?

regards, tom lane




Re: [HACKERS] spoonbill vs. -HEAD

2013-03-26 Thread Stefan Kaltenbrunner
On 03/26/2013 08:45 PM, Tom Lane wrote:
 Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
 I finally started to investigate why spoonbill stopped reporting to the
 buildfarm about 2 months ago.
 It seems that the foreign-key locking patch (or some commit very close
 to January 23rd) broke it in a fairly annoying way - running the
 buildfarm script seems to consistently stall during the isolationtester
 part of the regression testing, leaving the postgresql instance running
 and causing all future buildfarm runs to fail...
 
 It looks from here like the isolationtester client is what's dropping
 the ball --- the backend states are unsurprising, and two of them are
 waiting for a new client command.  Can you get a stack trace from the
 isolationtester process?


https://www.kaltenbrunner.cc/files/isolationtester.txt


Stefan




Re: [HACKERS] spoonbill vs. -HEAD

2013-03-26 Thread Tom Lane
Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
 On 03/26/2013 08:45 PM, Tom Lane wrote:
 It looks from here like the isolationtester client is what's dropping
 the ball --- the backend states are unsurprising, and two of them are
 waiting for a new client command.  Can you get a stack trace from the
 isolationtester process?

 https://www.kaltenbrunner.cc/files/isolationtester.txt

Hmm ... isolationtester.c:584 is in the code that tries to cancel the
current permutation (test case) after realizing that it's constructed
an invalid permutation.  It looks like the preceding PQcancel() failed
for some reason, since the waiting backend is still waiting.  The
isolationtester code doesn't bother to check for an error result there,
which is kinda bad, not that it's clear what it could do about it.
Could you look at the contents of the local variable "buf" in the
run_permutation stack frame?  Or else try modifying the code along the
lines of

-	PQcancel(cancel, buf, sizeof(buf));
+	if (!PQcancel(cancel, buf, sizeof(buf)))
+		fprintf(stderr, "PQcancel failed: %s\n", buf);

and see if it prints anything interesting before hanging up.
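
For reference, the surrounding libpq calls look roughly like the sketch
below.  PQgetCancel(), PQcancel(), and PQfreeCancel() are the real libpq
entry points; the wrapper function and its error handling are just
illustrative, not isolationtester's actual code.

#include <stdio.h>
#include <libpq-fe.h>

/*
 * Sketch: issue a checked cancel request against an open connection.
 * PQcancel() returns 0 on failure and puts the reason into errbuf.
 */
static void
cancel_checked(PGconn *conn)
{
	char		errbuf[256];
	PGcancel   *cancel = PQgetCancel(conn);

	if (cancel == NULL)
	{
		fprintf(stderr, "PQgetCancel failed\n");
		return;
	}
	if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
		fprintf(stderr, "PQcancel failed: %s\n", errbuf);
	PQfreeCancel(cancel);
}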

regards, tom lane




Re: [HACKERS] spoonbill vs. -HEAD

2013-03-26 Thread Andrew Dunstan


On 03/26/2013 02:50 PM, Stefan Kaltenbrunner wrote:

Hi all!


I finally started to investigate why spoonbill stopped reporting to the
buildfarm about 2 months ago.
It seems that the foreign-key locking patch (or some commit very close
to January 23rd) broke it in a fairly annoying way - running the
buildfarm script seems to consistently stall during the isolationtester
part of the regression testing, leaving the postgresql instance running
and causing all future buildfarm runs to fail...




There is some timeout code already in the buildfarm client. It was 
originally put there to help us when we got CVS hangs, a not infrequent 
occurrence in the early days, so it's currently only used if configured 
for the checkout phase, but it could easily be used to create a build 
timeout which would kill the whole process group if the timeout expired. 
It wouldn't work on Windows, and of course it won't solve whatever 
problem caused the hang in the first place, but it still might be worth 
doing.
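
The mechanics described here - run the build in its own process group
and kill the whole group when a timer expires - would look roughly like
the sketch below (in C for illustration only; the function is made up,
and the buildfarm client itself is a Perl script):

#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Sketch: run a command in its own process group and SIGKILL the whole
 * group if it has not finished within timeout_secs.
 */
static int
run_with_timeout(char **argv, int timeout_secs)
{
	pid_t		child = fork();
	int			status;

	if (child == 0)
	{
		setpgid(0, 0);			/* new process group, so descendants die too */
		execvp(argv[0], argv);
		_exit(127);				/* exec failed */
	}
	setpgid(child, child);		/* set it in the parent too, avoiding a race */
	for (int waited = 0; waited < timeout_secs; waited++)
	{
		if (waitpid(child, &status, WNOHANG) == child)
			return status;		/* finished in time */
		sleep(1);
	}
	killpg(child, SIGKILL);		/* timeout: take down the whole group */
	waitpid(child, &status, 0);
	return status;
}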


cheers

andrew





Re: [HACKERS] spoonbill vs. -HEAD

2013-03-26 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 There is some timeout code already in the buildfarm client. It was 
 originally put there to help us when we got CVS hangs, a not infrequent 
 occurrence in the early days, so it's currently only used if configured 
 for the checkout phase, but it could easily be used to create a build 
 timeout which would kill the whole process group if the timeout expired. 
 It wouldn't work on Windows, and of course it won't solve whatever 
 problem caused the hang in the first place, but it still might be worth 
 doing.

+1 --- at least then we'd get reports of failures, rather than the
current behavior where the animal just stops reporting.

regards, tom lane




Re: [HACKERS] spoonbill vs. -HEAD

2013-03-26 Thread Stefan Kaltenbrunner
On 03/26/2013 09:33 PM, Tom Lane wrote:
 Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
 On 03/26/2013 08:45 PM, Tom Lane wrote:
 It looks from here like the isolationtester client is what's dropping
 the ball --- the backend states are unsurprising, and two of them are
 waiting for a new client command.  Can you get a stack trace from the
 isolationtester process?
 
 https://www.kaltenbrunner.cc/files/isolationtester.txt
 
 Hmm ... isolationtester.c:584 is in the code that tries to cancel the
 current permutation (test case) after realizing that it's constructed
 an invalid permutation.  It looks like the preceding PQcancel() failed
 for some reason, since the waiting backend is still waiting.  The
 isolationtester code doesn't bother to check for an error result there,
 which is kinda bad, not that it's clear what it could do about it.
 Could you look at the contents of the local variable "buf" in the
 run_permutation stack frame?  Or else try modifying the code along the
 lines of
 
 -	PQcancel(cancel, buf, sizeof(buf));
 +	if (!PQcancel(cancel, buf, sizeof(buf)))
 +		fprintf(stderr, "PQcancel failed: %s\n", buf);
 
 and see if it prints anything interesting before hanging up.

hmm - will look into that in a bit - but I also just noticed that on the
same day spoonbill broke there was also a commit to that file,
immediately before that code, adding the fflush() calls.


Stefan




Re: [HACKERS] spoonbill vs. -HEAD

2013-03-26 Thread Tom Lane
Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
 hmm - will look into that in a bit - but I also just noticed that on the
 same day spoonbill broke there was also a commit to that file,
 immediately before that code, adding the fflush() calls.

It's hard to see how those would be related to this symptom.  My bet
is that the new fk-deadlock test exposed some pre-existing issue in
isolationtester.  Not quite clear what yet, though.

A different line of thought is that the cancel was received by the
backend but didn't succeed in cancelling the query for some reason.

regards, tom lane

