Re: [HACKERS] Multithreaded SIGPIPE race in libpq on Solaris

2014-08-29 Thread Thomas Munro
On 29 August 2014 01:04, Thomas Munro  wrote:
> On 28 August 2014 23:45, Tom Lane  wrote:
>> I don't claim to be an expert on this stuff, but I had the idea that
>> multithreaded environments were supposed to track signal state per-thread
>> not just per-process, precisely because of issues like this.
>
> After some googling, I found reply #3 in
> https://community.oracle.com/thread/1950900?start=0&tstart=0 and
> various other sources which say that in Solaris 10 they
> changed SIGPIPE delivery from per process (as specified by UNIX98) to
> per thread (as specified by POSIX:2001).  But we are on version 11, so
> my theory doesn't look great.  (Though 9 is probably still in use out
> there somewhere...)

I discovered that the machine we saw the problem on was running
Solaris 9 at the time, but has been upgraded since.  So I think my
sigwait race theory may have been correct, but we can just put this
down to a historical quirk and forget about it, because Solaris 9 is
basically deceased ("extended support" ends in October 2014).  Sorry
for the noise.

Best regards,
Thomas Munro




Re: [HACKERS] Multithreaded SIGPIPE race in libpq on Solaris

2014-08-28 Thread Thomas Munro
On 28 August 2014 23:45, Tom Lane  wrote:
> I don't claim to be an expert on this stuff, but I had the idea that
> multithreaded environments were supposed to track signal state per-thread
> not just per-process, precisely because of issues like this.

After some googling, I found reply #3 in
https://community.oracle.com/thread/1950900?start=0&tstart=0 and
various other sources which say that in Solaris 10 they
changed SIGPIPE delivery from per process (as specified by UNIX98) to
per thread (as specified by POSIX:2001).  But we are on version 11, so
my theory doesn't look great.  (Though 9 is probably still in use out
there somewhere...)

I also found this article:
http://krokisplace.blogspot.co.uk/2010/02/suppressing-sigpipe-in-library.html

The author recommends an approach nearly identical to PostgreSQL's,
except s/he says:  "to do this we use sigtimedwait() with
zero timeout; this is to avoid blocking in a scenario where malicious
user sent SIGPIPE manually to a whole process: in this case we will
see it pending, but other thread may handle it before we had a
[chance] to wait for it".
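
To illustrate, here's a rough, untested sketch of the kind of
non-blocking discard the article describes (this is not our actual
code, and the helper name is made up):

#include <errno.h>
#include <signal.h>
#include <time.h>

/*
 * Hypothetical helper: discard a pending SIGPIPE without ever blocking.
 * Assumes SIGPIPE is blocked in this thread (e.g. via pthread_sigmask)
 * so that it stays pending rather than being delivered.  sigtimedwait()
 * with a zero timeout returns immediately with EAGAIN if the signal is
 * not (or no longer) pending, so we can't hang even if some other
 * thread consumes it first.
 */
static void
discard_pending_sigpipe(void)
{
    sigset_t    sigpipe_set;
    siginfo_t   info;
    struct timespec zero_timeout = {0, 0};  /* 0s, 0ns: just poll */

    sigemptyset(&sigpipe_set);
    sigaddset(&sigpipe_set, SIGPIPE);

    /* Retry if interrupted by an unrelated signal. */
    while (sigtimedwait(&sigpipe_set, &info, &zero_timeout) < 0 &&
           errno == EINTR)
        continue;
}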

Maybe we have malicious users sending signals to processes.  It seems
more likely, though, that the crashing database triggered this somehow,
perhaps in combination with something else the client app was doing,
although I can't think what could eat another thread's SIGPIPE in
between the sigpending and sigwait syscalls.
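
To make the window concrete, here's a simplified sketch (my own
reconstruction, not the real libpq code) of the discard sequence I'm
talking about; if pending SIGPIPE state is tracked per-process, two
threads can both see the signal pending and both call sigwait(), but
the single merged signal wakes only one of them:

#include <signal.h>

/*
 * Simplified sketch of the discard-after-EPIPE pattern under
 * discussion.  With per-process SIGPIPE tracking (the old Solaris
 * behaviour), two threads can both observe the signal pending here...
 */
static void
discard_sigpipe_racy(void)
{
    sigset_t    pending;
    sigset_t    sigpipe_set;
    int         signo;

    sigpending(&pending);
    if (sigismember(&pending, SIGPIPE))
    {
        sigemptyset(&sigpipe_set);
        sigaddset(&sigpipe_set, SIGPIPE);

        /*
         * ...but the one pending SIGPIPE is merged, so only one
         * sigwait() returns; the other thread sits here until another
         * SIGPIPE happens to arrive.
         */
        sigwait(&sigpipe_set, &signo);
    }
}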

Best regards,
Thomas Munro




Re: [HACKERS] Multithreaded SIGPIPE race in libpq on Solaris

2014-08-28 Thread Tom Lane
Thomas Munro  writes:
> My theory is that if two connections accessed by different threads get
> shut down around the same time, there is a race scenario where each of
> them fails to write to its socket, sees errno == EPIPE and then sees a
> pending SIGPIPE with sigpending(), but only one thread returns from
> sigwait() due to signal merging.

Hm, that does sound like it could be a problem, if the platform fails
to track pending SIGPIPE on a per-thread basis.

> We never saw the problem again after we made the following change:
> ...
> Does this make any sense?

I don't think that patch would fix the problem if it's real.  It would
prevent libpq from hanging up when it's trying to throw away a pending
SIGPIPE, but the fundamental issue is that that action could cause a
SIGPIPE that's meant for some other thread to get lost; and that other
thread isn't necessarily doing anything with libpq.

I don't claim to be an expert on this stuff, but I had the idea that
multithreaded environments were supposed to track signal state per-thread
not just per-process, precisely because of issues like this.
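
For reference, the arrangement I have in mind looks roughly like this
(a sketch only, and it assumes the platform really does keep
synchronously generated signals pending on the thread that caused
them, per POSIX:2001):

#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <time.h>

/*
 * Sketch: block SIGPIPE in the calling thread around a write that may
 * hit a closed socket, then drain any SIGPIPE that write generated.
 * On a conforming system the SIGPIPE is pending on this thread alone,
 * so draining it here cannot steal a SIGPIPE destined for another
 * thread.
 */
static void
with_sigpipe_blocked(void (*do_write) (void *), void *arg)
{
    sigset_t    sigpipe_set;
    sigset_t    saved_mask;
    siginfo_t   info;
    struct timespec zero = {0, 0};

    sigemptyset(&sigpipe_set);
    sigaddset(&sigpipe_set, SIGPIPE);

    /* Affects only this thread's signal mask. */
    pthread_sigmask(SIG_BLOCK, &sigpipe_set, &saved_mask);

    do_write(arg);

    /* Zero timeout: never block if nothing is pending. */
    while (sigtimedwait(&sigpipe_set, &info, &zero) < 0 && errno == EINTR)
        continue;

    pthread_sigmask(SIG_SETMASK, &saved_mask, NULL);
}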

regards, tom lane

