Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-04 Thread Andres Freund
On 2016-01-03 13:57:18 -0500, Tom Lane wrote:
> Done, we'll soon see what the buildfarm thinks.

Thanks.

I wonder if we ought to backport this further: e.g. walsender
continously uses nonblocking sockets via pq_getbyte_if_available(). On
the other hand I can't immediately see a problem with that, besides
differing messages on windows/the rest of the world.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-04 Thread Andres Freund
On 2016-01-04 10:20:43 -0500, Tom Lane wrote:
> I'm slightly worried about breaking 3rd-party code that might be using
> recv() and somehow expecting the current behavior.  However, it's equally
> arguable that such code would have Windows-specific problems that would be
> fixed by the patch.

I don't really see how somebody could validly depend on the old
behaviour, given it always was windows only, and there only with
nonblocking sockets. Seems more likely to be an undiscovered bug for
such a hyptothetical user.

I also think there's rather few drivers / clients actually shutting down
sockets. Here it AFAICS really was a more or less unintentional side
effect.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-04 Thread Magnus Hagander
On Mon, Jan 4, 2016 at 4:20 PM, Tom Lane  wrote:

> Andres Freund  writes:
> > I wonder if we ought to backport this further: e.g. walsender
> > continously uses nonblocking sockets via pq_getbyte_if_available(). On
> > the other hand I can't immediately see a problem with that, besides
> > differing messages on windows/the rest of the world.
>
> I'm slightly worried about breaking 3rd-party code that might be using
> recv() and somehow expecting the current behavior.  However, it's equally
> arguable that such code would have Windows-specific problems that would be
> fixed by the patch.  Now that we've seen a successful round of buildfarm
> results, I'd be okay with back-patching 90e61df8 personally.
>
> Any other opinions out there?
>

Maybe holdoff until the release with the new code has been out for a while,
but make sure we get it into the next set of minors? That'll give us at
least some real world deployment to notice any issues with it?


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-04 Thread Tom Lane
Magnus Hagander  writes:
> On Mon, Jan 4, 2016 at 4:20 PM, Tom Lane  wrote:
>> I'm slightly worried about breaking 3rd-party code that might be using
>> recv() and somehow expecting the current behavior.  However, it's equally
>> arguable that such code would have Windows-specific problems that would be
>> fixed by the patch.  Now that we've seen a successful round of buildfarm
>> results, I'd be okay with back-patching 90e61df8 personally.
>> 
>> Any other opinions out there?

> Maybe holdoff until the release with the new code has been out for a while,
> but make sure we get it into the next set of minors? That'll give us at
> least some real world deployment to notice any issues with it?

Well, we could push it now before we forget about it, and we'd still have
exactly that result, given that the next set of minors are scheduled for
early February.  If there is anything badly wrong we should hear about
it from early adopters of 9.5.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-04 Thread Tom Lane
Robert Haas  writes:
> OK, well, if the consensus is in favor of a back-patch, so be it.  It
> seems a little strange to me to back-patch a commit that doesn't fix
> anything, but I just work here.

Well, it's true that we can't point to specific field reports and say
that this will fix those.  But it's not like our Windows port is so
rock-solid-reliable that we should give it the benefit of the doubt
about existing behaviors being correct.  We do know that the code path
in question is used in previous branches --- we put it there for a
reason --- and I think it's probably possible that it gets exercised
in corner cases, even pre-9.5.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-04 Thread Tom Lane
Andres Freund  writes:
> On 2016-01-04 10:35:12 -0500, Robert Haas wrote:
>> If we don't know of a specific problem that would be fixed by
>> back-patching this commit to pre-9.5 branches, and it seems like we
>> don't, then I don't really see much upside to back-patching it.  I
>> mean, yeah, we think that this is wrong because we think we know that
>> the behavior of Windows is different than what we thought when the
>> code was written.  But if we were wrong then, we could be wrong now,
>> too.  If so, it would be better to only have broken 9.5.

> I think it always was just a typo, given code a few lines down in the
> same function, added by the same commit, treated that case differently.

And, indeed, it was only because that code further down handled the case
correctly that nobody noticed for so long.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-04 Thread Robert Haas
On Mon, Jan 4, 2016 at 10:22 AM, Magnus Hagander  wrote:
> On Mon, Jan 4, 2016 at 4:20 PM, Tom Lane  wrote:
>> Andres Freund  writes:
>> > I wonder if we ought to backport this further: e.g. walsender
>> > continously uses nonblocking sockets via pq_getbyte_if_available(). On
>> > the other hand I can't immediately see a problem with that, besides
>> > differing messages on windows/the rest of the world.
>>
>> I'm slightly worried about breaking 3rd-party code that might be using
>> recv() and somehow expecting the current behavior.  However, it's equally
>> arguable that such code would have Windows-specific problems that would be
>> fixed by the patch.  Now that we've seen a successful round of buildfarm
>> results, I'd be okay with back-patching 90e61df8 personally.
>>
>> Any other opinions out there?
>
> Maybe holdoff until the release with the new code has been out for a while,
> but make sure we get it into the next set of minors? That'll give us at
> least some real world deployment to notice any issues with it?

If we don't know of a specific problem that would be fixed by
back-patching this commit to pre-9.5 branches, and it seems like we
don't, then I don't really see much upside to back-patching it.  I
mean, yeah, we think that this is wrong because we think we know that
the behavior of Windows is different than what we thought when the
code was written.  But if we were wrong then, we could be wrong now,
too.  If so, it would be better to only have broken 9.5.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-04 Thread Andres Freund
On 2016-01-04 10:35:12 -0500, Robert Haas wrote:
> If we don't know of a specific problem that would be fixed by
> back-patching this commit to pre-9.5 branches, and it seems like we
> don't, then I don't really see much upside to back-patching it.  I
> mean, yeah, we think that this is wrong because we think we know that
> the behavior of Windows is different than what we thought when the
> code was written.  But if we were wrong then, we could be wrong now,
> too.  If so, it would be better to only have broken 9.5.

I think it always was just a typo, given code a few lines down in the
same function, added by the same commit, treated that case differently.

Anyway, I don't have a particularly strong opinion here, so ...


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-04 Thread Robert Haas
On Mon, Jan 4, 2016 at 10:49 AM, Tom Lane  wrote:
> Andres Freund  writes:
>> On 2016-01-04 10:35:12 -0500, Robert Haas wrote:
>>> If we don't know of a specific problem that would be fixed by
>>> back-patching this commit to pre-9.5 branches, and it seems like we
>>> don't, then I don't really see much upside to back-patching it.  I
>>> mean, yeah, we think that this is wrong because we think we know that
>>> the behavior of Windows is different than what we thought when the
>>> code was written.  But if we were wrong then, we could be wrong now,
>>> too.  If so, it would be better to only have broken 9.5.
>
>> I think it always was just a typo, given code a few lines down in the
>> same function, added by the same commit, treated that case differently.
>
> And, indeed, it was only because that code further down handled the case
> correctly that nobody noticed for so long.

OK, well, if the consensus is in favor of a back-patch, so be it.  It
seems a little strange to me to back-patch a commit that doesn't fix
anything, but I just work here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-04 Thread Tom Lane
Andres Freund  writes:
> I wonder if we ought to backport this further: e.g. walsender
> continously uses nonblocking sockets via pq_getbyte_if_available(). On
> the other hand I can't immediately see a problem with that, besides
> differing messages on windows/the rest of the world.

I'm slightly worried about breaking 3rd-party code that might be using
recv() and somehow expecting the current behavior.  However, it's equally
arguable that such code would have Windows-specific problems that would be
fixed by the patch.  Now that we've seen a successful round of buildfarm
results, I'd be okay with back-patching 90e61df8 personally.

Any other opinions out there?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-04 Thread Robert Haas
On Mon, Jan 4, 2016 at 10:59 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> OK, well, if the consensus is in favor of a back-patch, so be it.  It
>> seems a little strange to me to back-patch a commit that doesn't fix
>> anything, but I just work here.
>
> Well, it's true that we can't point to specific field reports and say
> that this will fix those.  But it's not like our Windows port is so
> rock-solid-reliable that we should give it the benefit of the doubt
> about existing behaviors being correct.  We do know that the code path
> in question is used in previous branches --- we put it there for a
> reason --- and I think it's probably possible that it gets exercised
> in corner cases, even pre-9.5.

Yeah, I'm just worried about collateral damage.  If you're convinced
that there won't be any, have at it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-03 Thread Andres Freund
On 2016-01-03 10:03:41 +0530, Amit Kapila wrote:
> On Sun, Jan 3, 2016 at 3:01 AM, Andres Freund  wrote:
> > Indeed it does use shutdown(). If I read the npgsql code that'll even be
> > done in the exception handling path. So fixing the 0 byte case might
> > already do the trick.
> >
> 
> I think this true for a TCP socket, but this code-path is used for UDP
> (SOCK_DGRAM) sockets as well and there is a comment below in
> that function which seems to be indicating why originally 0 byte case
> has not been handled at the place suggested by you (now it seems to
> be much less relevant).

I'm not sure what the origin of that comment is, it's been there all the
way since a4c40f14. But it doesn't really have much real effect: If
WSARecv in the retry loop returns 0 bytes, we'll not retry again as r !=
SOCKET_ERROR but actually return 0.

Note that the whole retry loop in pgwin32_recv(), which kinda mitigates
the problem explained above, isn't entered anymore as the FE/BE socket
is now always operated in non-blocking mode. I.e.
if (pgwin32_noblock)
{
/*
 * No data received, and we are in "emulated non-blocking 
mode", so
 * return indicating that we'd block if we were to continue.
 */
errno = EWOULDBLOCK;
return -1;
}
will always be taken. Which exactly explains the problem, together with
the edge-triggered behaviour of
WaitForMultipleObjects()/WSAEventSelect().

I really think we have a host of buggy code around the event handling -
but most of it has been used for a long while. So I think fixing the 0
byte case for 9.5 is good enough.

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-03 Thread Tom Lane
Andres Freund  writes:
> On January 3, 2016 6:23:20 PM GMT+01:00, Tom Lane  wrote:
>> Agreed.  Let's do it and ship this puppy.

> Unless somebody beats me to it, I'll push in the European morning.

Um.  For something that at least potentially has portability issues
(we think not, but we could be wrong), it's pretty scary to push only
a couple of hours before the wrap deadline.  I'd like to get it done
now so we can get a full day's buildfarm testing.  If you like I can
push something ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-03 Thread Andres Freund
On January 3, 2016 7:04:29 PM GMT+01:00, Tom Lane  wrote:
>Andres Freund  writes:
>> On January 3, 2016 6:23:20 PM GMT+01:00, Tom Lane 
>wrote:
>>> Agreed.  Let's do it and ship this puppy.
>
>> Unless somebody beats me to it, I'll push in the European morning.
>
>Um.  For something that at least potentially has portability issues
>(we think not, but we could be wrong), it's pretty scary to push only
>a couple of hours before the wrap deadline.  I'd like to get it done
>now so we can get a full day's buildfarm testing.  If you like I can
>push something ...

I agree. But I'm at an event and only have my phone with me. So I can't get to 
it right now. I might be able to push in 5ish hours. Otherwise it's in 15h. 
Sorry.

Fell free to push earlier, if you can make the time.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-03 Thread Tom Lane
Andres Freund  writes:
> On January 3, 2016 7:04:29 PM GMT+01:00, Tom Lane  wrote:
>> Um.  For something that at least potentially has portability issues
>> (we think not, but we could be wrong), it's pretty scary to push only
>> a couple of hours before the wrap deadline.  I'd like to get it done
>> now so we can get a full day's buildfarm testing.  If you like I can
>> push something ...

> I agree. But I'm at an event and only have my phone with me. So I can't get 
> to it right now. I might be able to push in 5ish hours. Otherwise it's in 
> 15h. Sorry.

> Fell free to push earlier, if you can make the time.

Done, we'll soon see what the buildfarm thinks.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-03 Thread Andres Freund
On January 3, 2016 6:23:20 PM GMT+01:00, Tom Lane  wrote:

>> I really think we have a host of buggy code around the event handling
>-
>> but most of it has been used for a long while. So I think fixing the
>0
>> byte case for 9.5 is good enough.
>
>Agreed.  Let's do it and ship this puppy.

Unless somebody beats me to it, I'll push in the European morning.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-03 Thread Tom Lane
Andres Freund  writes:
> On 2016-01-03 10:03:41 +0530, Amit Kapila wrote:
>> I think this true for a TCP socket, but this code-path is used for UDP
>> (SOCK_DGRAM) sockets as well and there is a comment below in
>> that function which seems to be indicating why originally 0 byte case
>> has not been handled at the place suggested by you (now it seems to
>> be much less relevant).

> I'm not sure what the origin of that comment is, it's been there all the
> way since a4c40f14. But it doesn't really have much real effect: If
> WSARecv in the retry loop returns 0 bytes, we'll not retry again as r !=
> SOCKET_ERROR but actually return 0.

That comment is a bit scary because it suggests that there might be some
cross-Windows-versions differences in the behavior of WSARecv.  However,
I poked around on msdn.microsoft.com until I found a version of the
WSARecv man page that claimed to apply to Windows 2000, and it says the
same thing as the newer versions: b==0 implies graceful connection closure
(if stream protocol) or zero-size message (if message-oriented protocol)
and in neither case is it appropriate for this code to block.

So I think the exclusion of zero is an outright bug and always has been.

Actually, we could remove the test on b altogether and then simplify the
next line; I see no indication in Microsoft's docs that b<0 is a possible
case.  That would make the code here more nearly match what is in the
retry loop --- and we now realize that it's only because we used to fall
through to the retry loop that the EOF case behaved sanely.

> I really think we have a host of buggy code around the event handling -
> but most of it has been used for a long while. So I think fixing the 0
> byte case for 9.5 is good enough.

Agreed.  Let's do it and ship this puppy.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Andres Freund
Hi Petr,

On 2016-01-02 09:17:02 +0100, Petr Jelinek wrote:
> so the commit which triggers this issue is
> 387da18874afa17156ee3af63766f17efb53c4b9 , not sure why yet (wanted to give
> heads up early since multiple people are looking at this). Note that the
> compilation around this commit is made harder by the fact that commit
> 91fa7b4719ac583420d9143132ba4ccddefbc5b2 breaks linking and fix is few
> commits later.

Thanks for bisecting! What exactly did you use to reproduce? Shay's
npgsql binary? Or a plain psql session where you killed the client?
Given that Amit couldn't reproduce with the latter that seems an
interesting difference.

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Andres Freund
On 2016-01-02 14:26:47 +0100, Andres Freund wrote:
> On 2016-01-02 18:40:38 +0530, Amit Kapila wrote:
> > If we
> > remember the closed socket event and then take appropriate action,
> > then this problem won't happen.  Attached patch which by no-means
> > a complete fix shows what I wanted to say and after this the problem
> > mentioned by Shay doesn't happen, although I get LOG message
> > which is due to the reason that proper handling for socket closure
> > needs to be done in this path.  This patch is based on the code
> > after commit 387da18874afa17156ee3af63766f17efb53c4b9.  I
> > will do testing and refine the fix based on HEAD later as I am done
> > for the today.
> 
> It's weird that this fixes the problem. As we were previously, according
> to Shay, not busy looping, this seems to indicate that FD_CLOSE is only
> reported once or somesuch?
> 
> It'd be very interesting to add a debug elog() into the
>   if (resEvents.lNetworkEvents & FD_CLOSE)
>   {
>   if (wakeEvents & WL_SOCKET_READABLE)
>   result |= WL_SOCKET_READABLE;
>   if (wakeEvents & WL_SOCKET_WRITEABLE)
>   result |= WL_SOCKET_WRITEABLE;
>   }
> 
> path in WaitLatchOrSocket. If it actually returns with the current code,
> we have a better idea where to look for problems.


I wonder if the following is the problem: The docs for WSAEventSelect()
says:
"Having successfully recorded the occurrence of the network event (by
setting the corresponding bit in the internal network event record) and
signaled the associated event object, no further actions are taken for
that network event until the application makes the function call that
implicitly reenables the setting of that network event and signaling of
the associated event object."
and also notes specifically for FD_CLOSE that there's no re-enabling
functions.

See
https://msdn.microsoft.com/en-us/library/windows/desktop/ms741576%28v=vs.85%29.aspx
which goes on to talk about some level triggered events (FD_READ, ...)
and others being edge triggered. It's not clear to me from that whether
FD_CLOSE is supposed to be edge or level triggered.

If FD_CLOSE is indeed edge and not level triggered - which imo would be
supremely insane - we'd be in trouble. It'd explain why some failures
are noticed and others not.

ISTM this should relatively easily be debuggable by adding a few debug
elogs.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Petr Jelinek

On 2016-01-02 12:05, Amit Kapila wrote:

On Sat, Jan 2, 2016 at 3:16 PM, Andres Freund > wrote:

Hi Petr,

On 2016-01-02 09:17:02 +0100, Petr Jelinek wrote:
> so the commit which triggers this issue is
> 387da18874afa17156ee3af63766f17efb53c4b9 , not sure why yet (wanted to 
give
> heads up early since multiple people are looking at this). Note that the
> compilation around this commit is made harder by the fact that commit
> 91fa7b4719ac583420d9143132ba4ccddefbc5b2 breaks linking and fix is few
> commits later.

Thanks for bisecting! What exactly did you use to reproduce? Shay's
npgsql binary? Or a plain psql session where you killed the client?
Given that Amit couldn't reproduce with the latter that seems an
interesting difference.


I am also able to reproduce now. The reason was that I didn't have
latest .Net framework and Visual Studio, which is must for the recent
version of Npgsql.

One probable reason of the problem seems to be that now for windows, we
are emulating non-blocking behaviour by setting pgwin32_noblock = true
which makes function pgwin32_recv() return EWOULDBLOCK and it would
wait using WaitLatchOrSocket() instead of pgwin32_waitforsinglesocket().
There are some differences in the way both the API's (WaitLatchOrSocket()
and pgwin32_waitforsinglesocket()) do wait, now may be that is the reason
for this behaviour.  One thing I have tried is that if I don't
set pgwin32_noblock
in secure_raw_read(), then this problem won't occur which lead to above
reasoning.  I am still investigating.



Well, without pgwin32_noblock = true we never enter the code block which 
calls WaitLatchOrSocket and hangs as in my testing this was always 
called because of EWOULDBLOCK.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Andres Freund
On 2016-01-02 18:40:38 +0530, Amit Kapila wrote:
> What I wanted to say is that the handling of socket closure is not
> same in WaitLatchOrSocket() and pgwin32_waitforsinglesocket()
> due to which this problem can arise and it seems that is the
> right line of direction to pursue.  I have found that
> in WaitLatchOrSocket(),
> even when the socket is closed, we remember the result as
> WL_SOCKET_READABLE and again tries to wait whereas the
> same is handled properly in pgwin32_waitforsinglesocket().

That's actually intentional, and part of the design:
 * When waiting on a socket, EOF and error conditions are reported by
 * returning the socket as readable/writable or both, depending on
 * WL_SOCKET_READABLE/WL_SOCKET_WRITEABLE being specified.

The way this is supposed to work, and does on unixoid systems, is that
WaitLatchOS returns, the recv is retried and signals an error.

> If we
> remember the closed socket event and then take appropriate action,
> then this problem won't happen.  Attached patch which by no-means
> a complete fix shows what I wanted to say and after this the problem
> mentioned by Shay doesn't happen, although I get LOG message
> which is due to the reason that proper handling for socket closure
> needs to be done in this path.  This patch is based on the code
> after commit 387da18874afa17156ee3af63766f17efb53c4b9.  I
> will do testing and refine the fix based on HEAD later as I am done
> for the today.

It's weird that this fixes the problem. As we were previously, according
to Shay, not busy looping, this seems to indicate that FD_CLOSE is only
reported once or somesuch?

It'd be very interesting to add a debug elog() into the
if (resEvents.lNetworkEvents & FD_CLOSE)
{
if (wakeEvents & WL_SOCKET_READABLE)
result |= WL_SOCKET_READABLE;
if (wakeEvents & WL_SOCKET_WRITEABLE)
result |= WL_SOCKET_WRITEABLE;
}

path in WaitLatchOrSocket. If it actually returns with the current code,
we have a better idea where to look for problems.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Petr Jelinek

Hi,

so the commit which triggers this issue is 
387da18874afa17156ee3af63766f17efb53c4b9 , not sure why yet (wanted to 
give heads up early since multiple people are looking at this). Note 
that the compilation around this commit is made harder by the fact that 
commit 91fa7b4719ac583420d9143132ba4ccddefbc5b2 breaks linking and fix 
is few commits later.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Amit Kapila
On Sat, Jan 2, 2016 at 3:16 PM, Andres Freund  wrote:

> Hi Petr,
>
> On 2016-01-02 09:17:02 +0100, Petr Jelinek wrote:
> > so the commit which triggers this issue is
> > 387da18874afa17156ee3af63766f17efb53c4b9 , not sure why yet (wanted to
> give
> > heads up early since multiple people are looking at this). Note that the
> > compilation around this commit is made harder by the fact that commit
> > 91fa7b4719ac583420d9143132ba4ccddefbc5b2 breaks linking and fix is few
> > commits later.
>
> Thanks for bisecting! What exactly did you use to reproduce? Shay's
> npgsql binary? Or a plain psql session where you killed the client?
> Given that Amit couldn't reproduce with the latter that seems an
> interesting difference.
>
>
I am also able to reproduce now. The reason was that I didn't have
latest .Net framework and Visual Studio, which is must for the recent
version of Npgsql.

One probable reason of the problem seems to be that now for windows, we
are emulating non-blocking behaviour by setting pgwin32_noblock = true
which makes function pgwin32_recv() return EWOULDBLOCK and it would
wait using WaitLatchOrSocket() instead of pgwin32_waitforsinglesocket().
There are some differences in the way both the API's (WaitLatchOrSocket()
and pgwin32_waitforsinglesocket()) do wait, now may be that is the reason
for this behaviour.  One thing I have tried is that if I don't
set pgwin32_noblock
in secure_raw_read(), then this problem won't occur which lead to above
reasoning.  I am still investigating.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Petr Jelinek

On 2016-01-02 10:46, Andres Freund wrote:

Hi Petr,

On 2016-01-02 09:17:02 +0100, Petr Jelinek wrote:

so the commit which triggers this issue is
387da18874afa17156ee3af63766f17efb53c4b9 , not sure why yet (wanted to give
heads up early since multiple people are looking at this). Note that the
compilation around this commit is made harder by the fact that commit
91fa7b4719ac583420d9143132ba4ccddefbc5b2 breaks linking and fix is few
commits later.


Thanks for bisecting! What exactly did you use to reproduce? Shay's
npgsql binary? Or a plain psql session where you killed the client?
Given that Amit couldn't reproduce with the latter that seems an
interesting difference.



I used the npgsql binary to reproduce, when I try killing psql from task 
manager it behaves correctly.



--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Amit Kapila
On Sat, Jan 2, 2016 at 5:02 PM, Petr Jelinek  wrote:

> On 2016-01-02 12:05, Amit Kapila wrote:
>>
>> I am also able to reproduce now. The reason was that I didn't have
>> latest .Net framework and Visual Studio, which is must for the recent
>> version of Npgsql.
>>
>> One probable reason of the problem seems to be that now for windows, we
>> are emulating non-blocking behaviour by setting pgwin32_noblock = true
>> which makes function pgwin32_recv() return EWOULDBLOCK and it would
>> wait using WaitLatchOrSocket() instead of pgwin32_waitforsinglesocket().
>> There are some differences in the way both the API's (WaitLatchOrSocket()
>> and pgwin32_waitforsinglesocket()) do wait, now may be that is the reason
>> for this behaviour.  One thing I have tried is that if I don't
>> set pgwin32_noblock
>> in secure_raw_read(), then this problem won't occur which lead to above
>> reasoning.  I am still investigating.
>>
>>
> Well, without pgwin32_noblock = true we never enter the code block which
> calls WaitLatchOrSocket and hangs as in my testing this was always called
> because of EWOULDBLOCK.
>
>
What I wanted to say is that the handling of socket closure is not
same in WaitLatchOrSocket() and pgwin32_waitforsinglesocket()
due to which this problem can arise and it seems that is the
right line of direction to pursue.  I have found that
in WaitLatchOrSocket(),
even when the socket is closed, we remember the result as
WL_SOCKET_READABLE and again tries to wait whereas the
same is handled properly in pgwin32_waitforsinglesocket().  If we
remember the closed socket event and then take appropriate action,
then this problem won't happen.  Attached patch which by no-means
a complete fix shows what I wanted to say and after this the problem
mentioned by Shay doesn't happen, although I get LOG message
which is due to the reason that proper handling for socket closure
needs to be done in this path.  This patch is based on the code
after commit 387da18874afa17156ee3af63766f17efb53c4b9.  I
will do testing and refine the fix based on HEAD later as I am done
for the today.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


win_socket_wait_issue_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Andres Freund
On 2016-01-02 16:20:58 +0100, Andres Freund wrote:
> I really right now can see only two somewhat surgical fixes:
> 
> 1) We do a nonblocking or select() *after* registering our events. Both
>in WaitLatchOrSocket() and waitforsinglesocket. Since select/poll are
>explicitly level triggered, that should make us notice any events we
>might have missed. select() appears to have been available for a fair
>while.
> 
> 2) We explicitly shutdown(SD_BOTH) the socket whenever we get a FD_CLOSE
>object. I *think* this should trigger errors in WSArecv, WSAEventSelect
>et al. Doesn't solve the problem that we might miss important events
>though.
> 
> 
> Given 2) isn't a complete fix and I can't find reliable documentation
> since when shutdown() is supported I'm inclined to go with 1).


Attached is an attempt to blindly implement 1). It needs some further
polishing and thought, but I'd like to verify it actually fixes things
before investing more time in this.

Greetings,

Andres Freund
diff --git a/src/backend/port/win32_latch.c b/src/backend/port/win32_latch.c
index 0e3aaee..b4441d8 100644
--- a/src/backend/port/win32_latch.c
+++ b/src/backend/port/win32_latch.c
@@ -177,6 +177,65 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	/* Ensure that signals are serviced even if latch is already set */
 	pgwin32_dispatch_queued_signals();
 
+	/*
+	 * FD_WRITE and FD_CLOSE are edge and not level triggered
+	 * (c.f. WSAEventSelect's documentation), and at least FD_CLOSE is only
+	 * signalled when the state changes while the event is attached to the
+	 * socket. To avoid waiting, potentially indefinitely, do a non-blocking
+	 * select() on the socket diagnosing the current state. Check this after
+	 * the signal dispatching above, to avoid starving signal dispatch.
+	 *
+	 * Unfortunately, according to the documentation, windows doesn't signal
+	 * writeability when a socket is closed. Only WSAPoll() does so - but it's
+	 * only available in newer versions of windows. XXX: Is this an active
+	 * race condition?
+	 */
+	if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+	{
+		fd_set		input_mask;
+		fd_set	   *input_maskp = NULL;
+		fd_set		output_mask;
+		fd_set	   *output_maskp = NULL;
+		struct timeval tv;
+
+		FD_ZERO(_mask);
+		FD_ZERO(_mask);
+		tv.tv_sec = 0;
+		tv.tv_usec = 0;
+
+		if (wakeEvents & WL_SOCKET_READABLE)
+		{
+			FD_SET(sock, _mask);
+			input_maskp = _mask;
+		}
+		if (wakeEvents & WL_SOCKET_WRITEABLE)
+		{
+			FD_SET(sock, _mask);
+			output_maskp = _mask;
+		}
+
+		rc = select(1, input_maskp, output_maskp, NULL, );
+
+		if (rc < 0)
+		{
+			elog(ERROR, "select() failed: %lu", GetLastError());
+		}
+		else if (rc > 0)
+		{
+			if ((wakeEvents & WL_SOCKET_READABLE) && FD_ISSET(sock, _mask))
+			{
+/* data available in socket, or EOF */
+result |= WL_SOCKET_READABLE;
+			}
+			if ((wakeEvents & WL_SOCKET_WRITEABLE) && FD_ISSET(sock, _mask))
+			{
+/* socket is writable */
+result |= WL_SOCKET_WRITEABLE;
+			}
+			return result;
+		}
+	}
+
 	do
 	{
 		/*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Tom Lane
Andres Freund  writes:
> A bit of searching around brought up that we saw issues around this
> before:
> http://www.postgresql.org/message-id/4351.1336927...@sss.pgh.pa.us

Indeed.  It doesn't look like any of the cleanup I suggested in that
thread has ever gotten done.  I suspect that we'll continue to see
problems until we get rid of the transient event object attachments.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Andres Freund
On 2016-01-02 13:00:09 -0500, Tom Lane wrote:
> : More generally, it seems clear to me that Microsoft's code is designed
> : around the assumption that an event object remains attached to a socket
> : for the lifetime of the socket.  This business of transiently associating
> : event objects with sockets looks quite inefficient and is evidently
> : triggering a lot of unpleasant corner-case behaviors.  I wonder whether we
> : should not try to make "pgsocket" encapsulate a socket and an associated
> : event object as a single entity on Windows.  (Such a struct would be a
> : good place to keep a per-socket noblock flag, too.)  I'm not going to
> : tackle that myself though.
> 
> which I think is the same as what you're suggesting.

Pretty similar, yes. I think we're going to additionally need a 'pending
events' flag or something to store edge triggered
status. E.g. 'writable' which'd only be cleared by EWOULDBLOCK being
returned by send, and 'closed' which is never cleared.

I guess it'd make sense to do this for all platforms; so most code can
just do pg_send/recv/... and the platform differences live in
pgsock_win/posix or somesuch.

I'm inclined to fix this for 9.5 with something like the select() hack I
posted upthread, and then revamp this in 9.6. I don't want to push a
relatively large API overhaul into 9.5 at this point.



Because of the recent thread about WL_POSTMASTER_DEATH scaling badly I'm
also wondering if we should change the latch API to make
OwnLatch/InitLatch have a fixed association with the socket that can be
waited on. My testing shows that e.g. linux' epoll is noticeably faster,
even measurably so in the single threaded case or without
WL_POSTMASTER_DEATH, because it requires less locking on the kernel side
than setting up poll/select datastructures in the kernel everytime we're
blocking.  Afaics there are no uses of latches where we'd want to use
different sockets.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Andres Freund
On 2016-01-02 15:40:03 +0100, Andres Freund wrote:
> I wonder if the following is the problem: The docs for WSAEventSelect()
> says:
> "Having successfully recorded the occurrence of the network event (by
> setting the corresponding bit in the internal network event record) and
> signaled the associated event object, no further actions are taken for
> that network event until the application makes the function call that
> implicitly reenables the setting of that network event and signaling of
> the associated event object."
> and also notes specifically for FD_CLOSE that there's no re-enabling
> functions.
> 
> See
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms741576%28v=vs.85%29.aspx
> which goes on to talk about some level triggered events (FD_READ, ...)
> and others being edge triggered. It's not clear to me from that whether
> FD_CLOSE is supposed to be edge or level triggered.
> 
> If FD_CLOSE is indeed edge and not level triggered - which imo would be
> supremely insane - we'd be in trouble. It'd explain why some failures
> are noticed and others not.

I found a few more resources confirming that FD_CLOSE is edge
triggered. Which probably doesn't just make our code buggy when waiting
twice on the same socket, but probably also makes it very timing
dependent: As the event is only triggered when the close actually occurs
it's possible that we don't have any event associated with that socket:
We only do so for shorts amount of time in WaitLatchOrSocket() and
pgwin32_waitforsinglesocket().

A bit of searching around brought up that we saw issues around this
before:
http://www.postgresql.org/message-id/4351.1336927...@sss.pgh.pa.us


I really right now can see only two somewhat surgical fixes:

1) We do a nonblocking or select() *after* registering our events. Both
   in WaitLatchOrSocket() and waitforsinglesocket. Since select/poll are
   explicitly level triggered, that should make us notice any events we
   might have missed. select() appears to have been available for a fair
   while.

2) We explicitly shutdown(SD_BOTH) the socket whenever we get a FD_CLOSE
   object. I *think* this should trigger errors in WSArecv, WSAEventSelect
   et al. Doesn't solve the problem that we might miss important events
   though.


Given 2) isn't a complete fix and I can't find reliable documentation
since when shutdown() is supported I'm inclined to go with 1).


Better ideas?


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Andres Freund
On January 2, 2016 6:28:10 PM GMT+01:00, Tom Lane  wrote:
>Andres Freund  writes:
>> A bit of searching around brought up that we saw issues around this
>> before:
>> http://www.postgresql.org/message-id/4351.1336927...@sss.pgh.pa.us
>
>Indeed.  It doesn't look like any of the cleanup I suggested in that
>thread has ever gotten done.  I suspect that we'll continue to see
>problems until we get rid of the transient event object attachments.

That'd address some of the problem, but that'd not address the edge triggered 
behaviour of FD-CLOSE. I think we'll have to abstract away windows sockets, and 
store the event & state there.

Andres


--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Andres Freund
On 2016-01-02 15:40:03 +0100, Andres Freund wrote:
> If FD_CLOSE is indeed edge and not level triggered - which imo would be
> supremely insane - we'd be in trouble. It'd explain why some failures
> are noticed and others not.

I wonder if the FD_CLOSE and FD_WRITE being edge-triggered is the actual
root cause for:
/*
 * Just a workaround of unknown locking problem with writing in UDP 
socket
 * under high load: Client's pgsql backend sleeps infinitely in
 * WaitForMultipleObjectsEx, pgstat process sleeps in pgwin32_select().
 * So, we will wait with small timeout(0.1 sec) and if socket is still
 * blocked, try WSASend (see comments in pgwin32_select) and wait again.
 */
if ((what & FD_WRITE) && isUDP)
in pgwin32_waitforsinglesocket().

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Tom Lane
Andres Freund  writes:
> On January 2, 2016 6:28:10 PM GMT+01:00, Tom Lane  wrote:
>> Indeed.  It doesn't look like any of the cleanup I suggested in that
>> thread has ever gotten done.  I suspect that we'll continue to see
>> problems until we get rid of the transient event object attachments.

> That'd address some of the problem, but that'd not address the edge triggered 
> behaviour of FD-CLOSE. I think we'll have to abstract away windows sockets, 
> and store the event & state there.

Right.  What I wrote in the 2012 thread was

: More generally, it seems clear to me that Microsoft's code is designed
: around the assumption that an event object remains attached to a socket
: for the lifetime of the socket.  This business of transiently associating
: event objects with sockets looks quite inefficient and is evidently
: triggering a lot of unpleasant corner-case behaviors.  I wonder whether we
: should not try to make "pgsocket" encapsulate a socket and an associated
: event object as a single entity on Windows.  (Such a struct would be a
: good place to keep a per-socket noblock flag, too.)  I'm not going to
: tackle that myself though.

which I think is the same as what you're suggesting.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Tom Lane
Andres Freund  writes:
> I found a few more resources confirming that FD_CLOSE is edge
> triggered. Which probably doesn't just make our code buggy when waiting
> twice on the same socket, but probably also makes it very timing
> dependent: As the event is only triggered when the close actually occurs
> it's possible that we don't have any event associated with that socket:
> We only do so for shorts amount of time in WaitLatchOrSocket() and
> pgwin32_waitforsinglesocket().

Does the timing dependence explain why we've not been able to trigger this
by killing psql?

If the bug only occurs when the client connection drops when we're not
waiting for input, that would likely explain why nobody noticed it for
ten months.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Petr Jelinek

On 2016-01-02 22:31, Andres Freund wrote:

On 2016-01-02 22:25:31 +0100, Brar Piening wrote:

Andres Freund wrote:

That seems like a pretty straight forward bug. But it hinges on the
client side calling shutdown() on the socket. I don't know enough about
.net's internals to judge wether it does so. I've traced things far
enough to find
"Disposing a Stream object flushes any buffered data, and essentially
calls the Flush method for you. Dispose also releases operating system
resources such as file handles, network connections, or memory used for
any internal buffering. The BufferedStream class provides the capability
of wrapping a buffered stream around another stream in order to improve
read and write performance."
https://msdn.microsoft.com/en-us/library/system.io.stream%28v=vs.110%29.aspx

which'd plausibly use shutdown().


In the new days of Microsoft you can confirm that even more...
http://referencesource.microsoft.com/#System/net/System/Net/Sockets/Socket.cs,6245


Thanks for digging thatup!

Indeed it does use shutdown(). If I read the npgsql code that'll even be
done in the exception handling path. So fixing the 0 byte case might
already do the trick.

Could any of the windows user try to check whether fixing
if (r != SOCKET_ERROR && b > 0)
/* Read succeeded right away */
return b;
to
if (r != SOCKET_ERROR && b >= 0)
/* Read succeeded right away */
return b;
already does the trick?



Yes it does indeed fix it.

--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Andres Freund
On 2016-01-02 15:11:42 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > I found a few more resources confirming that FD_CLOSE is edge
> > triggered. Which probably doesn't just make our code buggy when waiting
> > twice on the same socket, but probably also makes it very timing
> > dependent: As the event is only triggered when the close actually occurs
> > it's possible that we don't have any event associated with that socket:
> > We only do so for shorts amount of time in WaitLatchOrSocket() and
> > pgwin32_waitforsinglesocket().
> 
> Does the timing dependence explain why we've not been able to trigger this
> by killing psql?

I think so. Possibly it'd be reproducible in psql by sending a
interrupting while executing pg_sleep(3000) or something involving the
copy protocol?

> If the bug only occurs when the client connection drops when we're not
> waiting for input, that would likely explain why nobody noticed it for
> ten months.

Yea. I think there also might also be another issue: Windows' recv() -
which we're not using, but I don't see differing documentation for
WSARecv() - returns 0 bytes if a socket was closed 'gracefully',
i.e. shutdown(SD_SEND) was called on the client side (similar to
unix' recv).

pgwin32_recv() converts a 0 byte return from WSARecv() into
if (pgwin32_noblock)
{
/*
 * No data received, and we are in "emulated non-blocking 
mode", so
 * return indicating that we'd block if we were to continue.
 */
errno = EWOULDBLOCK;
return -1;
}

which would explain why we'd eat the FD_CLOSE and then just continue
waiting...

That seems like a pretty straight forward bug. But it hinges on the
client side calling shutdown() on the socket. I don't know enough about
.net's internals to judge wether it does so. I've traced things far
enough to find
"Disposing a Stream object flushes any buffered data, and essentially
calls the Flush method for you. Dispose also releases operating system
resources such as file handles, network connections, or memory used for
any internal buffering. The BufferedStream class provides the capability
of wrapping a buffered stream around another stream in order to improve
read and write performance."
https://msdn.microsoft.com/en-us/library/system.io.stream%28v=vs.110%29.aspx

which'd plausibly use shutdown().

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Brar Piening

Andres Freund wrote:

That seems like a pretty straight forward bug. But it hinges on the
client side calling shutdown() on the socket. I don't know enough about
.net's internals to judge wether it does so. I've traced things far
enough to find
"Disposing a Stream object flushes any buffered data, and essentially
calls the Flush method for you. Dispose also releases operating system
resources such as file handles, network connections, or memory used for
any internal buffering. The BufferedStream class provides the capability
of wrapping a buffered stream around another stream in order to improve
read and write performance."
https://msdn.microsoft.com/en-us/library/system.io.stream%28v=vs.110%29.aspx

which'd plausibly use shutdown().


In the new days of Microsoft you can confirm that even more...
http://referencesource.microsoft.com/#System/net/System/Net/Sockets/Socket.cs,6245

Regards,
Brar


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Andres Freund
On 2016-01-02 22:25:31 +0100, Brar Piening wrote:
> Andres Freund wrote:
> >That seems like a pretty straight forward bug. But it hinges on the
> >client side calling shutdown() on the socket. I don't know enough about
> >.net's internals to judge wether it does so. I've traced things far
> >enough to find
> >"Disposing a Stream object flushes any buffered data, and essentially
> >calls the Flush method for you. Dispose also releases operating system
> >resources such as file handles, network connections, or memory used for
> >any internal buffering. The BufferedStream class provides the capability
> >of wrapping a buffered stream around another stream in order to improve
> >read and write performance."
> >https://msdn.microsoft.com/en-us/library/system.io.stream%28v=vs.110%29.aspx
> >
> >which'd plausibly use shutdown().
> 
> In the new days of Microsoft you can confirm that even more...
> http://referencesource.microsoft.com/#System/net/System/Net/Sockets/Socket.cs,6245

Thanks for digging thatup!

Indeed it does use shutdown(). If I read the npgsql code that'll even be
done in the exception handling path. So fixing the 0 byte case might
already do the trick.

Could any of the windows user try to check whether fixing
if (r != SOCKET_ERROR && b > 0)
/* Read succeeded right away */
return b;
to
if (r != SOCKET_ERROR && b >= 0)
/* Read succeeded right away */
return b;
already does the trick?

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-02 Thread Amit Kapila
On Sun, Jan 3, 2016 at 3:01 AM, Andres Freund  wrote:
> On 2016-01-02 22:25:31 +0100, Brar Piening wrote:
> > Andres Freund wrote:
> > >That seems like a pretty straight forward bug. But it hinges on the
> > >client side calling shutdown() on the socket. I don't know enough about
> > >.net's internals to judge wether it does so. I've traced things far
> > >enough to find
> > >"Disposing a Stream object flushes any buffered data, and essentially
> > >calls the Flush method for you. Dispose also releases operating system
> > >resources such as file handles, network connections, or memory used for
> > >any internal buffering. The BufferedStream class provides the
capability
> > >of wrapping a buffered stream around another stream in order to improve
> > >read and write performance."
> > >
https://msdn.microsoft.com/en-us/library/system.io.stream%28v=vs.110%29.aspx
> > >
> > >which'd plausibly use shutdown().
> >
> > In the new days of Microsoft you can confirm that even more...
> >
http://referencesource.microsoft.com/#System/net/System/Net/Sockets/Socket.cs,6245
>
> Thanks for digging thatup!
>
> Indeed it does use shutdown(). If I read the npgsql code that'll even be
> done in the exception handling path. So fixing the 0 byte case might
> already do the trick.
>

I think this true for a TCP socket, but this code-path is used for UDP
(SOCK_DGRAM) sockets as well and there is a comment below in
that function which seems to be indicating why originally 0 byte case
has not been handled at the place suggested by you (now it seems to
be much less relevant).

pgwin32_recv()

{

..

/* No error, zero bytes (win2000+) or error+WSAEWOULDBLOCK (<=nt4) */


for (n = 0; n < 5; n++)

{
..
}


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-01 Thread Shay Rojansky
>
> On googling, it seems this is related to .Net framework compatibility. I am
> using .Net Framework 4 to build the program.cs and that is what I have
> on my m/c.  Are you using the same for Npgsql or some different version?
>

That is probably the problem. Npgsql 3.0 is only available for .NET
Framework 4.5 and above, you should be able to build and run the sample
with VS2013 or VS2015 (note that Microsoft release totally free community
editions of these which you can use). Let me know if you run into any
issues.

Regardless, apologies I haven't had more time to work on this recently. I
now have debug builds of 9.5rc1 (with the bug) and 9.4.5 (without the bug)
and can reliably debug through application as the error occurs. I will
probably have some time tomorrow or Monday to dive into this. Amit, it
would be great if you could confirm the error happening too.


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-01 Thread Amit Kapila
On Wed, Dec 30, 2015 at 10:31 PM, Shay Rojansky  wrote:

> OK, I finally found some time to dive into this.
>
> The backends seem to hang when the client closes a socket without first
> sending a Terminate message - some of the tests make this happen. I've
> confirmed this happens with 9.5rc1 running on Windows (versions 10 and 7),
> but this does not occur on Ubuntu 15.10. The client runs on Windows as well
> (although I doubt that's important).
>
> In case it helps, here's a gist
>  with some .NET code
> that uses Npgsql 3.0.4 to reproduce this.
>
>
I am trying to setup an environment to reproduce this issue.
I have installed Npgsql as below:
PM> Install-Package Npgsql -Version 3.0.4
Installing 'Npgsql 3.0.4'.
Successfully installed 'Npgsql 3.0.4'.

Now, I am trying to use program.cs present in the above link, but
it is neither printing any error nor able to perform any action.
I have just created program.cs file as mentioned by you in one
of the test folders and then trying to run it via:

>program.cs "Host=localhost;Username=amit;Database=postgres"

This program neither prints any error, nor I think it does any action.

Can you guide me in reproducing the issue?


I have tried to close/kill the psql client in various ways (closing the
window/
End Process), but it works as expected.  So I think it is important to
reproduce
it via Npgsql.

Also I am trying this on latest PostgreSQL code (9.6 version code), but I
think that doesn't matter because once I am able to execute this program
successfully, I can even use 9.5 line to see the issue if it is not
reproducible
on 9.6.

I am using Win-7 and Visual Studio 2010.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2016-01-01 Thread Amit Kapila
On Fri, Jan 1, 2016 at 4:40 PM, Amit Kapila  wrote:

> On Wed, Dec 30, 2015 at 10:31 PM, Shay Rojansky  wrote:
>
>> OK, I finally found some time to dive into this.
>>
>> The backends seem to hang when the client closes a socket without first
>> sending a Terminate message - some of the tests make this happen. I've
>> confirmed this happens with 9.5rc1 running on Windows (versions 10 and 7),
>> but this does not occur on Ubuntu 15.10. The client runs on Windows as well
>> (although I doubt that's important).
>>
>> In case it helps, here's a gist
>>  with some .NET code
>> that uses Npgsql 3.0.4 to reproduce this.
>>
>>
> I am trying to setup an environment to reproduce this issue.
> I have installed Npgsql as below:
> PM> Install-Package Npgsql -Version 3.0.4
> Installing 'Npgsql 3.0.4'.
> Successfully installed 'Npgsql 3.0.4'.
>
> Now, I am trying to use program.cs present in the above link, but
> it is neither printing any error nor able to perform any action.
> I have just created program.cs file as mentioned by you in one
> of the test folders and then trying to run it via:
>
> >program.cs "Host=localhost;Username=amit;Database=postgres"
>
> This program neither prints any error, nor I think it does any action.
>
>
I think, I need to create Visual C# project to use the program.cs code
mentioned by you and by implementing it as Visual C# Console Application,
I am able to build and run it, but getting below error:

Unhandled Exception: System.TypeInitializationException: The type
initializer fo
r 'Npgsql.NpgsqlConnectionStringBuilder' threw an exception. --->
System.TypeLoa
dException: Could not load type
'System.Reflection.CustomAttributeExtensions' fr
om assembly 'mscorlib, Version=4.0.0.0, Culture=neutral,
PublicKeyToken=b77a5c56
1934e089'.
   at Npgsql.NpgsqlConnectionStringBuilder.<>c.<.cctor>b__8_0(PropertyInfo
p)
   at System.Linq.Enumerable.WhereArrayIterator`1.MoveNext()
   at System.Linq.Buffer`1..ctor(IEnumerable`1 source)
   at System.Linq.Enumerable.ToArray[TSource](IEnumerable`1 source)
   at Npgsql.NpgsqlConnectionStringBuilder..cctor()
   --- End of inner exception stack trace ---
   at Npgsql.NpgsqlConnectionStringBuilder..ctor(String connectionString)
   at Npgsql.NpgsqlConnection.set_ConnectionString(String value)
   at Npgsql.NpgsqlConnection..ctor(String connectionString)
   at PGHang.Program.Main() in
E:\PG_Patch\ConsoleApplication1\Program.cs:line 11
Press any key to continue . . .


On googling, it seems this is related to .Net framework compatibility. I am
using .Net Framework 4 to build the program.cs and that is what I have
on my m/c.  Are you using the same for Npgsql or some different version?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Tom Lane
Andres Freund  writes:
> FWIW, the
>   if (sock == PGINVALID_SOCKET)
>   wakeEvents &= ~(WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
> block in both latch implementations looks like a problem waiting to happen.

You think it should throw an error instead?  Seems reasonable to me.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Andres Freund
On 2015-12-30 19:38:23 +0200, Shay Rojansky wrote:
> > Hm. So that seems to indicate that, on windows, we're not properly
> > recognizing dead sockets in the latch code. Could you check, IIRC with
> > netstat or something like it, in what state the connections are?

> netstat shows the socket is in FIN_WAIT_2.


> > Any chance you could single-step through WaitLatchOrSocket() with a
> > debugger? Without additional information this is rather hard to
> > diagnose.
> >
> 
> Uh I sure can, but I have no idea what to look for :) Anything
> specific?

Things that'd be interesting:
1) what are the arguments passed to WaitLatchOrSocket(), most
   importantly wakeEvents and sock
2) are we busy looping, or is WaitForMultipleObjects() blocking
   endlessly
3) If you kill -9 (well, terminate in the task manager) a client, while
   stepping serverside in WaitLatchOrSocket, does
   WaitForMultipleObjects() return? If so, what paths are we taking?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Andres Freund
Hi,

On 2015-12-30 19:01:10 +0200, Shay Rojansky wrote:
> OK, I finally found some time to dive into this.
> 
> The backends seem to hang when the client closes a socket without first
> sending a Terminate message - some of the tests make this happen. I've
> confirmed this happens with 9.5rc1 running on Windows (versions 10 and 7),
> but this does not occur on Ubuntu 15.10. The client runs on Windows as well
> (although I doubt that's important).

Hm. So that seems to indicate that, on windows, we're not properly
recognizing dead sockets in the latch code. Could you check, IIRC with
netstat or something like it, in what state the connections are?

Any chance you could single-step through WaitLatchOrSocket() with a
debugger? Without additional information this is rather hard to
diagnose.

> On Wed, Dec 30, 2015 at 5:32 AM, Amit Kapila 
> wrote:
> > What procedure do you use to kill backends?  Normally, if we kill
> > via task manager using "End Process", it is considered as backend
> > crash and the server gets restarted and all other backends got
> > disconnected.

Unless I miss something major here the problem is clients disconnecting
and leaving backends hanging. The killing of backends only comes into
play after that's already the case.

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Tom Lane
Shay Rojansky  writes:
> The backends seem to hang when the client closes a socket without first
> sending a Terminate message - some of the tests make this happen. I've
> confirmed this happens with 9.5rc1 running on Windows (versions 10 and 7),
> but this does not occur on Ubuntu 15.10.

Nor OS X.  Ugh.  My first thought was that ac1d7945f broke this, but
that's only in HEAD not 9.5, so some earlier change must be responsible.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Andres Freund
On 2015-12-30 12:41:56 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > FWIW, the
> > if (sock == PGINVALID_SOCKET)
> > wakeEvents &= ~(WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
> > block in both latch implementations looks like a problem waiting to happen.
> 
> You think it should throw an error instead?  Seems reasonable to me.

Yea. Error or maybe just an assert. That path seems to always indicate
something having gone wrong.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Tom Lane
Andres Freund  writes:
> On 2015-12-30 12:30:43 -0500, Tom Lane wrote:
>> Nor OS X.  Ugh.  My first thought was that ac1d7945f broke this, but
>> that's only in HEAD not 9.5, so some earlier change must be responsible.

> The backtrace in
> http://archives.postgresql.org/message-id/CADT4RqBo79_0Vx%3D-%2By%3DnFv3zdnm_-CgGzbtSv9LhxrFEoYMVFg%40mail.gmail.com
> seems to indicate that it's really WaitLatchOrSocket() not noticing the
> socket is closed.

Right, and what I was wondering was whether adding the additional wait-for
condition had exposed some pre-existing flaw in the Windows latch code.
But that's not it, so we're left with the conclusion that we broke
something that used to work.

Are we sure this is a 9.5-only bug?  Shay, can you try 9.4 branch tip
and see if it misbehaves?  Can anyone else reproduce the problem?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Andres Freund
On 2015-12-30 12:50:58 -0500, Tom Lane wrote:
> Right, and what I was wondering was whether adding the additional wait-for
> condition had exposed some pre-existing flaw in the Windows latch code.
> But that's not it, so we're left with the conclusion that we broke
> something that used to work.

4bad60e is another suspect. Besides wondering why I moved the FD_CLOSE
case out of the existing if cases, I don't see anything suspicious
though.  If we were hitting the write-path here, it'd be plausible that
we're hitting an issue with FD_CLOSE and waiting for writability; but
we're not.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Andres Freund
On 2015-12-30 20:12:07 +0200, Shay Rojansky wrote:
> >
> > Is this in a backend with ssl?
> >
> 
> No.

There goes that theory. Amongst others. The aforementioned problem with
waitfor doesn't seem to be actually armed because waitfor is only used
if errno == EWOULDBLOCK || errno == EAGAIN.

> If you go up one frame, what value does port->sock have?

> For some reason VS is telling me "Unable to read memory" on port->sock... I
> have no idea why that is...

Hm. Is this with a self compiled postgres? If so, is it with assertions
enabled?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Andres Freund
On 2015-12-30 20:18:52 +0200, Shay Rojansky wrote:
> Tom's probably right about the optimized code. I could try compiling a
> debug version..

Seems to be the next unfortunately. Sorry.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Shay Rojansky
>
> > The backends seem to hang when the client closes a socket without first
> > sending a Terminate message - some of the tests make this happen. I've
> > confirmed this happens with 9.5rc1 running on Windows (versions 10 and
> 7),
> > but this does not occur on Ubuntu 15.10. The client runs on Windows as
> well
> > (although I doubt that's important).
>
> Hm. So that seems to indicate that, on windows, we're not properly
> recognizing dead sockets in the latch code. Could you check, IIRC with
> netstat or something like it, in what state the connections are?
>

netstat shows the socket is in FIN_WAIT_2.


> Any chance you could single-step through WaitLatchOrSocket() with a
> debugger? Without additional information this is rather hard to
> diagnose.
>

Uh I sure can, but I have no idea what to look for :) Anything specific?


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Shay Rojansky
>
> Hm. Is this with a self compiled postgres? If so, is it with assertions
> enabled?
>

No, it's just the EnterpriseDB 9.5rc1 installer...

Tom's probably right about the optimized code. I could try compiling a
debug version..


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Tom Lane
Andres Freund  writes:
> On 2015-12-30 19:54:19 +0200, Shay Rojansky wrote:
>> wakeEvents is 8387808 and so is sock.

> Hm. That seems like an extremely weird value.

Probably just means the debugger is confused by optimized code.

> I think it's indicative of
> a bug in 80788a431e. Specifically secure_read/write's waitfor isn't
> necessarily set in the ssl case. Importantly not in the
> SSL_ERROR_SYSCALL case, which we're likely hitting here.

Hmm, that does look bogus ... and the Assert wouldn't fire in a non-assert
build.  But waitfor would be zero if that was what was happening.

> Is this in a backend with ssl?

I thought we'd eliminated SSL already, or have I lost track of the
bidding?  But I suspect you've identified a different bug.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Andres Freund
Hi,

On 2015-12-30 13:17:47 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2015-12-30 19:54:19 +0200, Shay Rojansky wrote:
> >> wakeEvents is 8387808 and so is sock.
> 
> > Hm. That seems like an extremely weird value.
> 
> Probably just means the debugger is confused by optimized code.

Yea.


> > I think it's indicative of
> > a bug in 80788a431e. Specifically secure_read/write's waitfor isn't
> > necessarily set in the ssl case. Importantly not in the
> > SSL_ERROR_SYSCALL case, which we're likely hitting here.
> 
> Hmm, that does look bogus ... and the Assert wouldn't fire in a non-assert
> build.

After a bit of thinking it seems this can't actually happen because
waitfor is only used if errno is EWOULDBLOCK/EAGAIN. And all the other
branches allowing for n < 0 set errno to a different value.  Not
particularly obvious.


> > Is this in a backend with ssl?
> 
> I thought we'd eliminated SSL already, or have I lost track of the
> bidding?

We know a very similar problem happens without ssl. But we could have
ended up two different bugs...


Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Shay Rojansky
OK, I finally found some time to dive into this.

The backends seem to hang when the client closes a socket without first
sending a Terminate message - some of the tests make this happen. I've
confirmed this happens with 9.5rc1 running on Windows (versions 10 and 7),
but this does not occur on Ubuntu 15.10. The client runs on Windows as well
(although I doubt that's important).

In case it helps, here's a gist
 with some .NET code
that uses Npgsql 3.0.4 to reproduce this.

If there's anything else I can do please let me know.

Shay

On Wed, Dec 30, 2015 at 5:32 AM, Amit Kapila 
wrote:

>
>
> On Tue, Dec 29, 2015 at 7:04 PM, Shay Rojansky  wrote:
>
>> Could you describe the worklad a bit more? Is this rather concurrent? Do
>>> you use optimized or debug builds? How long did you wait for the
>>> backends to die? Is this all over localhost, external ip but local,
>>> remotely?
>>>
>>
>> The workload is a a rather diverse set of integration tests executed with
>> Npgsql. There's no concurrency whatsoever - tests are executed serially.
>> The backends stay alive indefinitely, until they are killed. All this is
>> over localhost with TCP. I can try other scenarios if that'll help.
>>
>>
>
> What procedure do you use to kill backends?  Normally, if we kill
> via task manager using "End Process", it is considered as backend
> crash and the server gets restarted and all other backends got
> disconnected.
>
>
>> > Note that the number of backends that stay stuck after the tests is
>>> > constant (always 12).
>>>
>>> Can you increase the number of backends used in the test? And check
>>> whether it's still 12?
>>>
>>
>> Well, I ran the testsuite twice in parallel, and got... 23 backends stuck
>> at the end.
>>
>>
>>> How are your clients disconnecting? Possibly without properly
>>> disconnecting?
>>>
>>
>> That's possible, definitely in some of the test cases.
>>
>> What I can do is try to isolate things further by playing around with the
>> tests and trying to see if a more minimal repro can be done - I'll try
>> doing this later today or tomorrow. If anyone has any other specific tests
>> or checks I should do let me know.
>>
>
> I think first we should try to isolate whether the hanged backends
> are due to the reason that they are not disconnected properly or
> there is some other factor involved as well, so you can try to kill/
> disconnect the sessions connected via psql in the same way as
> you are doing for connections with Npgsql and see if you can
> reproduce the same behaviour.
>
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Andres Freund
On 2015-12-30 19:54:19 +0200, Shay Rojansky wrote:
> >
> > Things that'd be interesting:
> > 1) what are the arguments passed to WaitLatchOrSocket(), most
> >importantly wakeEvents and sock
> >
> 
> wakeEvents is 8387808 and so is sock.

Hm. That seems like an extremely weird value. I think it's indicative of
a bug in 80788a431e. Specifically secure_read/write's waitfor isn't
necessarily set in the ssl case. Importantly not in the
SSL_ERROR_SYSCALL case, which we're likely hitting here.

Is this in a backend with ssl?

If you go up one frame, what value does port->sock have?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Shay Rojansky
>
> > > Any chance you could single-step through WaitLatchOrSocket() with a
> > > debugger? Without additional information this is rather hard to
> > > diagnose.
> > >
> >
> > Uh I sure can, but I have no idea what to look for :) Anything
> > specific?
>
> Things that'd be interesting:
> 1) what are the arguments passed to WaitLatchOrSocket(), most
>importantly wakeEvents and sock
> 2) are we busy looping, or is WaitForMultipleObjects() blocking
>endlessly
> 3) If you kill -9 (well, terminate in the task manager) a client, while
>stepping serverside in WaitLatchOrSocket, does
>WaitForMultipleObjects() return? If so, what paths are we taking?
>

The process definitely isn't busy looping - zero CPU usage.

I'll try to set up debugging, it may take some time though (unfamiliar with
PostgreSQL internals and Windows debugging techniques).


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Shay Rojansky
>
> Things that'd be interesting:
> 1) what are the arguments passed to WaitLatchOrSocket(), most
>importantly wakeEvents and sock
>

wakeEvents is 8387808 and so is sock.

Tom, this bug doesn't occur with 9.4.4 (will try to download 9.4.5 and
test).


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Shay Rojansky
>
> Are we sure this is a 9.5-only bug?  Shay, can you try 9.4 branch tip
> and see if it misbehaves?  Can anyone else reproduce the problem?
>
>
Doesn't occur with 9.4.5 either. The first version I tested which exhibited
this was 9.5beta2.


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Shay Rojansky
>
> Is this in a backend with ssl?
>

No.

If you go up one frame, what value does port->sock have?
>

For some reason VS is telling me "Unable to read memory" on port->sock... I
have no idea why that is...


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Tom Lane
Andres Freund  writes:
> There goes that theory. Amongst others. The aforementioned problem with
> waitfor doesn't seem to be actually armed because waitfor is only used
> if errno == EWOULDBLOCK || errno == EAGAIN.

Mumble.  It is clearly possible that we'd reach the Assert failure if
SSL_read were to return -1 and set errno to EWOULDBLOCK or EAGAIN.
I doubt that is what is happening here, because those errnos don't
seem sensible for an EOF condition, but I'd still feel more comfortable
if be_tls_read/be_tls_write handled SSL_ERROR_SYSCALL like this:

if (n != -1)
{
errno = ECONNRESET;
n = -1;
}
+   else
+   {
+   /* just in case errno is EWOULDBLOCK/EAGAIN */
+   *waitfor = WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE;
+   }

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Alvaro Herrera
Shay Rojansky wrote:
> >
> > Are we sure this is a 9.5-only bug?  Shay, can you try 9.4 branch tip
> > and see if it misbehaves?  Can anyone else reproduce the problem?
> >
> >
> Doesn't occur with 9.4.5 either. The first version I tested which exhibited
> this was 9.5beta2.

Maybe it's time for some git bisect'ing?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Andres Freund
On 2015-12-30 13:26:34 -0500, Tom Lane wrote:
> I doubt that is what is happening here, because those errnos don't
> seem sensible for an EOF condition, but I'd still feel more comfortable
> if be_tls_read/be_tls_write handled SSL_ERROR_SYSCALL like this:
> 
> if (n != -1)
> {
> errno = ECONNRESET;
> n = -1;
> }
> +   else
> +   {
> +   /* just in case errno is EWOULDBLOCK/EAGAIN */
> +   *waitfor = WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE;
> +   }

Being a bit more defensive here sounds sensible. But I think it might be
better to only set WL_SOCKET_READABLE/WL_SOCKET_WRITEABLE in
be_tls_read/be_tls_write correspondingly. It's possible that we'd
otherwise just end up in an endless loop.


Shay, if you trace the backend with process explorer or something like
that, what system calls do you seen when you kill the client?

Hm, so pgwin32_recv() has the following block:
if (pgwin32_waitforsinglesocket(s, FD_READ | FD_CLOSE | 
FD_ACCEPT,

INFINITE) == 0)
return -1;  /* errno already set */
and there's several paths in pgwin32_waitforsinglesocket() that return
0. But I can't see any of those reasonably return EAGAIN/EWOULDBLOCK


Shay earlier said that he sees the connection in state FIN_WAIT_2. The
WSAEventSelect() docs say:
"The FD_CLOSE network event is recorded when a close indication is
received for the virtual circuit corresponding to the socket. In TCP
terms, this means that the FD_CLOSE is recorded when the connection goes
into the TIME WAIT or CLOSE WAIT states.".
To me that seems to imply that the socket Shay observed is the client's
and that the server should see at least see CLOSE WAIT.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Tom Lane
Andres Freund  writes:
> On 2015-12-30 13:26:34 -0500, Tom Lane wrote:
>> I doubt that is what is happening here, because those errnos don't
>> seem sensible for an EOF condition, but I'd still feel more comfortable
>> if be_tls_read/be_tls_write handled SSL_ERROR_SYSCALL like this:
>> ...

> Being a bit more defensive here sounds sensible. But I think it might be
> better to only set WL_SOCKET_READABLE/WL_SOCKET_WRITEABLE in
> be_tls_read/be_tls_write correspondingly. It's possible that we'd
> otherwise just end up in an endless loop.

Mumble.  We don't have enough visibility into SSL's internal protocol
to know whether it's currently trying to read the socket or write it,
unless it tells us.  I think if we wait for only readable or only
writable based on what the high-level request is, we risk getting stuck.

(In practice, if the syscall error is persistent, we're likely to fail
pretty soon anyway.  But if it were some transient error, we'd be
more likely to successfully recover if we wait for either condition.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Andres Freund
On 2015-12-30 12:30:43 -0500, Tom Lane wrote:
> Nor OS X.  Ugh.  My first thought was that ac1d7945f broke this, but
> that's only in HEAD not 9.5, so some earlier change must be responsible.

The backtrace in
http://archives.postgresql.org/message-id/CADT4RqBo79_0Vx%3D-%2By%3DnFv3zdnm_-CgGzbtSv9LhxrFEoYMVFg%40mail.gmail.com
seems to indicate that it's really WaitLatchOrSocket() not noticing the
socket is closed.

For a moment I had the theory that Port->sock might be invalid because
it somehow got closed. That'd then remove the socket from the waited-on
events, which would explain the behaviour. But afaics that's really only
possible via pq_init()'s on_proc_exit(socket_close, 0); And I can't see
how that could be reached.

FWIW, the
if (sock == PGINVALID_SOCKET)
wakeEvents &= ~(WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
block in both latch implementations looks like a problem waiting to happen.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-30 Thread Tom Lane
Andres Freund  writes:
> On 2015-12-30 19:01:10 +0200, Shay Rojansky wrote:
>> The backends seem to hang when the client closes a socket without first
>> sending a Terminate message - some of the tests make this happen. I've
>> confirmed this happens with 9.5rc1 running on Windows (versions 10 and 7),
>> but this does not occur on Ubuntu 15.10. The client runs on Windows as well
>> (although I doubt that's important).

> Hm. So that seems to indicate that, on windows, we're not properly
> recognizing dead sockets in the latch code.

Or we just broke EOF detection on Windows sockets in general.  It might be
worth checking if the problem appears on the client side; that is, given a
psql running on Windows, do local-equivalent-of-kill-9 on the connected
backend, and see if psql notices.  (Hm, although if it's idle psql wouldn't
notice until you next try a command, so it might be hard to tell.  Maybe
kill -9 while the backend is in process of a long query?)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-29 Thread Amit Kapila
On Tue, Dec 29, 2015 at 7:04 PM, Shay Rojansky  wrote:

> Could you describe the worklad a bit more? Is this rather concurrent? Do
>> you use optimized or debug builds? How long did you wait for the
>> backends to die? Is this all over localhost, external ip but local,
>> remotely?
>>
>
> The workload is a a rather diverse set of integration tests executed with
> Npgsql. There's no concurrency whatsoever - tests are executed serially.
> The backends stay alive indefinitely, until they are killed. All this is
> over localhost with TCP. I can try other scenarios if that'll help.
>
>

What procedure do you use to kill backends?  Normally, if we kill
via task manager using "End Process", it is considered as backend
crash and the server gets restarted and all other backends got
disconnected.


> > Note that the number of backends that stay stuck after the tests is
>> > constant (always 12).
>>
>> Can you increase the number of backends used in the test? And check
>> whether it's still 12?
>>
>
> Well, I ran the testsuite twice in parallel, and got... 23 backends stuck
> at the end.
>
>
>> How are your clients disconnecting? Possibly without properly
>> disconnecting?
>>
>
> That's possible, definitely in some of the test cases.
>
> What I can do is try to isolate things further by playing around with the
> tests and trying to see if a more minimal repro can be done - I'll try
> doing this later today or tomorrow. If anyone has any other specific tests
> or checks I should do let me know.
>

I think first we should try to isolate whether the hanged backends
are due to the reason that they are not disconnected properly or
there is some other factor involved as well, so you can try to kill/
disconnect the sessions connected via psql in the same way as
you are doing for connections with Npgsql and see if you can
reproduce the same behaviour.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-29 Thread Andres Freund
On 2015-12-29 12:41:40 +0200, Shay Rojansky wrote:
> >
> > > The tests run for a couple minutes, open and close some connection. With
> > my
> > > pre-9.5 backends, the moment the test runner exits I can see that all
> > > backend processes exit immediately, and pg_activity_stat has no rows
> > > (except the querying one). With 9.5beta2, however, some backend processes
> > > continue to stay alive beyond the test runner, and pg_activity_stat
> > > contains extra rows (state idle, waiting false). This situation persists
> > > until I restart PostgreSQL.

Could you describe the worklad a bit more? Is this rather concurrent? Do
you use optimized or debug builds? How long did you wait for the
backends to die? Is this all over localhost, external ip but local,
remotely?

> Note that the number of backends that stay stuck after the tests is
> constant (always 12).

Can you increase the number of backends used in the test? And check
whether it's still 12?


> Here's are stack dumps of the same process taken with both VS2015 Community
> and Process Explorer, I went over 4 processes and saw the same thing. Let
> me know what I else I can provide to help.
> 
> From VS2015 Community:
> 
> Main Thread
> > ntdll.dll!NtWaitForMultipleObjects() Unknown
>   KernelBase.dll!WaitForMultipleObjectsEx() Unknown
>   KernelBase.dll!WaitForMultipleObjects() Unknown
>   postgres.exe!WaitLatchOrSocket(volatile Latch * latch, int wakeEvents,
> unsigned __int64 sock, long timeout) Line 202 C
>   postgres.exe!secure_read(Port * port, void * ptr, unsigned __int64 len)
> Line 151 C
>   postgres.exe!pq_getbyte() Line 926 C
>   postgres.exe!SocketBackend(StringInfoData * inBuf) Line 345 C
>   postgres.exe!PostgresMain(int argc, char * * argv, const char * dbname,
> const char * username) Line 3984 C
>   postgres.exe!BackendRun(Port * port) Line 4236 C
>   postgres.exe!SubPostmasterMain(int argc, char * * argv) Line 4727 C
>   postgres.exe!main(int argc, char * * argv) Line 211 C
>   postgres.exe!__tmainCRTStartup() Line 626 C
>   kernel32.dll!BaseThreadInitThunk() Unknown
>   ntdll.dll!RtlUserThreadStart() Unknown

Hm. So we're waiting for the latch, and expecting to get a FD_CLOSE
error back because the socket is actually closed. Which should happen
always in that path - a read through win32_latch.c doesn't show any
obvious problems. But then I really have not too much clue about windows
development.

How are your clients disconnecting? Possibly without properly
disconnecting?

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-29 Thread Shay Rojansky
>
> > The tests run for a couple minutes, open and close some connection. With
> my
> > pre-9.5 backends, the moment the test runner exits I can see that all
> > backend processes exit immediately, and pg_activity_stat has no rows
> > (except the querying one). With 9.5beta2, however, some backend processes
> > continue to stay alive beyond the test runner, and pg_activity_stat
> > contains extra rows (state idle, waiting false). This situation persists
> > until I restart PostgreSQL.
>
> No idea what's happening, but a couple of questions:
>
> * Are you using SSL connections?
>
> * Can you get stack traces from the seemingly-stuck backends?
>

Most of my tests don't use SSL but some do. Looking at the query field in
pg_stat_activity I can see queries that don't seem to originate from SSL
tests.

Note that the number of backends that stay stuck after the tests is
constant (always 12).

Here's are stack dumps of the same process taken with both VS2015 Community
and Process Explorer, I went over 4 processes and saw the same thing. Let
me know what I else I can provide to help.

>From VS2015 Community:

Main Thread
> ntdll.dll!NtWaitForMultipleObjects() Unknown
  KernelBase.dll!WaitForMultipleObjectsEx() Unknown
  KernelBase.dll!WaitForMultipleObjects() Unknown
  postgres.exe!WaitLatchOrSocket(volatile Latch * latch, int wakeEvents,
unsigned __int64 sock, long timeout) Line 202 C
  postgres.exe!secure_read(Port * port, void * ptr, unsigned __int64 len)
Line 151 C
  postgres.exe!pq_getbyte() Line 926 C
  postgres.exe!SocketBackend(StringInfoData * inBuf) Line 345 C
  postgres.exe!PostgresMain(int argc, char * * argv, const char * dbname,
const char * username) Line 3984 C
  postgres.exe!BackendRun(Port * port) Line 4236 C
  postgres.exe!SubPostmasterMain(int argc, char * * argv) Line 4727 C
  postgres.exe!main(int argc, char * * argv) Line 211 C
  postgres.exe!__tmainCRTStartup() Line 626 C
  kernel32.dll!BaseThreadInitThunk() Unknown
  ntdll.dll!RtlUserThreadStart() Unknown

Worker Thread
> ntdll.dll!NtWaitForWorkViaWorkerFactory() Unknown
  ntdll.dll!TppWorkerThread() Unknown
  kernel32.dll!BaseThreadInitThunk() Unknown
  ntdll.dll!RtlUserThreadStart() Unknown

Worker Thread
> ntdll.dll!NtFsControlFile() Unknown
  KernelBase.dll!ConnectNamedPipe() Unknown
  postgres.exe!pg_signal_thread(void * param) Line 279 C
  kernel32.dll!BaseThreadInitThunk() Unknown
  ntdll.dll!RtlUserThreadStart() Unknown

Worker Thread
> ntdll.dll!NtWaitForSingleObject() Unknown
  KernelBase.dll!WaitForSingleObjectEx() Unknown
  postgres.exe!pg_timer_thread(void * param) Line 49 C
  kernel32.dll!BaseThreadInitThunk() Unknown
  ntdll.dll!RtlUserThreadStart() Unknown

>From Process Explorer (slightly different):

ntoskrnl.exe!KeSynchronizeExecution+0x3de6
ntoskrnl.exe!KeWaitForSingleObject+0xc7a
ntoskrnl.exe!KeWaitForSingleObject+0x709
ntoskrnl.exe!KeWaitForSingleObject+0x375
ntoskrnl.exe!IoQueueWorkItem+0x370
ntoskrnl.exe!KeRemoveQueueEx+0x16ba
ntoskrnl.exe!KeWaitForSingleObject+0xe8e
ntoskrnl.exe!KeWaitForSingleObject+0x709
ntoskrnl.exe!KeWaitForMultipleObjects+0x24e
ntoskrnl.exe!ObWaitForMultipleObjects+0x2bd
ntoskrnl.exe!IoWMIRegistrationControl+0x2402
ntoskrnl.exe!setjmpex+0x3943
ntdll.dll!NtWaitForMultipleObjects+0x14
KERNELBASE.dll!WaitForMultipleObjectsEx+0xef
KERNELBASE.dll!WaitForMultipleObjects+0xe
postgres.exe!WaitLatchOrSocket+0x243
postgres.exe!secure_read+0xb0
postgres.exe!pq_getbyte+0xec
postgres.exe!get_stats_option_name+0x392
postgres.exe!PostgresMain+0x537
postgres.exe!ShmemBackendArrayAllocation+0x2a6a
postgres.exe!SubPostmasterMain+0x273
postgres.exe!main+0x480
postgres.exe!pgwin32_popen+0x130b
KERNEL32.DLL!BaseThreadInitThunk+0x22
ntdll.dll!RtlUserThreadStart+0x34


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-29 Thread Shay Rojansky
>
> Could you describe the worklad a bit more? Is this rather concurrent? Do
> you use optimized or debug builds? How long did you wait for the
> backends to die? Is this all over localhost, external ip but local,
> remotely?
>

The workload is a a rather diverse set of integration tests executed with
Npgsql. There's no concurrency whatsoever - tests are executed serially.
The backends stay alive indefinitely, until they are killed. All this is
over localhost with TCP. I can try other scenarios if that'll help.


> > Note that the number of backends that stay stuck after the tests is
> > constant (always 12).
>
> Can you increase the number of backends used in the test? And check
> whether it's still 12?
>

Well, I ran the testsuite twice in parallel, and got... 23 backends stuck
at the end.


> How are your clients disconnecting? Possibly without properly
> disconnecting?
>

That's possible, definitely in some of the test cases.

What I can do is try to isolate things further by playing around with the
tests and trying to see if a more minimal repro can be done - I'll try
doing this later today or tomorrow. If anyone has any other specific tests
or checks I should do let me know.


Re: [HACKERS] Some 9.5beta2 backend processes not terminating properly?

2015-12-28 Thread Tom Lane
Shay Rojansky  writes:
> After setting up 9.5beta2 on the Npgsql build server and running the Npgsql
> test suite against I've noticed some weird behavior.

> The tests run for a couple minutes, open and close some connection. With my
> pre-9.5 backends, the moment the test runner exits I can see that all
> backend processes exit immediately, and pg_activity_stat has no rows
> (except the querying one). With 9.5beta2, however, some backend processes
> continue to stay alive beyond the test runner, and pg_activity_stat
> contains extra rows (state idle, waiting false). This situation persists
> until I restart PostgreSQL.

No idea what's happening, but a couple of questions:

* Are you using SSL connections?

* Can you get stack traces from the seemingly-stuck backends?

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers