Re: detecting "broken" TCP connections

2016-12-01 Thread Alen Vrečko
Thank you all for your advice. I dismissed keepalive (KA) early on for the
wrong reasons; I thought there must be something better available that I
had missed. I'll go with keepalive.
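
A minimal sketch of one way to switch keepalive on without editing the
legacy read loop, assuming the legacy library accepts a
javax.net.SocketFactory (the thread does not say whether it does). The class
name KeepAliveSocketFactory and the helper keepAliveSetOnLoopback are
hypothetical; only setKeepAlive and the SocketFactory methods are real JDK
API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import javax.net.SocketFactory;

/** Hypothetical wrapper: every socket it hands out has SO_KEEPALIVE enabled. */
final class KeepAliveSocketFactory extends SocketFactory {
    private final SocketFactory delegate;

    KeepAliveSocketFactory(SocketFactory delegate) {
        this.delegate = delegate;
    }

    private static Socket withKeepAlive(Socket socket) throws IOException {
        socket.setKeepAlive(true); // kernel starts probing after its idle timer expires
        return socket;
    }

    @Override
    public Socket createSocket() throws IOException {
        return withKeepAlive(delegate.createSocket());
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        return withKeepAlive(delegate.createSocket(host, port));
    }

    @Override
    public Socket createSocket(String host, int port, InetAddress localHost, int localPort)
            throws IOException {
        return withKeepAlive(delegate.createSocket(host, port, localHost, localPort));
    }

    @Override
    public Socket createSocket(InetAddress host, int port) throws IOException {
        return withKeepAlive(delegate.createSocket(host, port));
    }

    @Override
    public Socket createSocket(InetAddress address, int port, InetAddress localAddress, int localPort)
            throws IOException {
        return withKeepAlive(delegate.createSocket(address, port, localAddress, localPort));
    }

    /** Demo helper: connects to a throwaway loopback listener and reports the flag. */
    static boolean keepAliveSetOnLoopback() {
        SocketFactory factory = new KeepAliveSocketFactory(SocketFactory.getDefault());
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket socket = factory.createSocket(server.getInetAddress(), server.getLocalPort())) {
            return socket.getKeepAlive();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("SO_KEEPALIVE enabled: " + keepAliveSetOnLoopback());
    }
}
```

This only sets the flag. How long the kernel waits before probing is still
governed by the OS settings unless they are overridden per socket.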

2016-11-29 13:16 GMT+01:00 Greg Young:
> In my experience, protocol-level TCP keepalives don't always work
> between implementations. BSD to Windows used to be a primary culprit:
> even though keepalives were set, they would not fire in some cases.
> Things may be better today. Within a single implementation they should
> work quite well. Definitely worth testing if you deal with multiple
> implementations.

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: detecting "broken" TCP connections

2016-11-29 Thread Greg Young
In my experience, protocol-level TCP keepalives don't always work
between implementations. BSD to Windows used to be a primary culprit:
even though keepalives were set, they would not fire in some cases.
Things may be better today. Within a single implementation they should
work quite well. Definitely worth testing if you deal with multiple
implementations.

On Tue, Nov 29, 2016 at 12:00 PM, Justin Mason wrote:
> I think that, as the Zalando blog post suggested, you could use OS-level TCP
> keepalive to test the connections regularly, so the kernel will eventually
> notice that the TCP connection is now dead:
> http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html -- by default
> this waits for 2 hours of inactivity, which seems too long for many use
> cases.
>
> I generally prefer to perform app-level keepalives with app-controlled
> timeouts and retry settings, but in this case if it's legacy code, a
> kernel-level sysctl tweak may be more palatable!
>
> --j.



-- 
Studying for the Turing test



Re: detecting "broken" TCP connections

2016-11-29 Thread Justin Mason
I think that, as the Zalando blog post suggested, you could use OS-level
TCP keepalive to test the connections regularly, so the kernel will
eventually notice that the TCP connection is now dead:
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html -- by default
this waits for 2 hours of inactivity, which seems too long for many use
cases.
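
To put numbers on that default: on Linux the relevant sysctls are
net.ipv4.tcp_keepalive_time (7200 s), net.ipv4.tcp_keepalive_intvl (75 s)
and net.ipv4.tcp_keepalive_probes (9). A rough Java sketch of the arithmetic,
plus a per-socket override, follows. It assumes JDK 11 or later, where
jdk.net.ExtendedSocketOptions exists; the class and method names are
hypothetical, and the per-socket options are only applied when the platform
reports them as supported.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketOption;
import java.util.Set;
import jdk.net.ExtendedSocketOptions;

final class KeepAliveTuning {

    /** Approximate seconds until a silent peer is declared dead: idle time plus all failed probes. */
    static int worstCaseDetectionSeconds(int idleSecs, int intervalSecs, int probes) {
        return idleSecs + intervalSecs * probes;
    }

    /**
     * Enables SO_KEEPALIVE and, where supported, overrides the kernel-wide
     * timers for this socket only. Returns true if the timers were applied.
     */
    static boolean enableKeepAlive(Socket socket, int idleSecs, int intervalSecs, int probes)
            throws IOException {
        socket.setKeepAlive(true);
        Set<SocketOption<?>> supported = socket.supportedOptions();
        boolean tunable = supported.contains(ExtendedSocketOptions.TCP_KEEPIDLE)
                && supported.contains(ExtendedSocketOptions.TCP_KEEPINTERVAL)
                && supported.contains(ExtendedSocketOptions.TCP_KEEPCOUNT);
        if (!tunable) {
            return false; // fall back to the OS-wide sysctl values
        }
        socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, idleSecs);
        socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, intervalSecs);
        socket.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, probes);
        return true;
    }

    /** Demo helper: applies the settings to a throwaway loopback connection. */
    static boolean keepAliveEnabledOnLoopback() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket socket = new Socket(server.getInetAddress(), server.getLocalPort())) {
            enableKeepAlive(socket, 60, 10, 3);
            return socket.getKeepAlive();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Linux defaults: ~" + worstCaseDetectionSeconds(7200, 75, 9) + " s");
        System.out.println("60/10/3:        ~" + worstCaseDetectionSeconds(60, 10, 3) + " s");
        System.out.println("keepalive on:   " + keepAliveEnabledOnLoopback());
    }
}
```

With the Linux defaults that works out to roughly 7875 seconds, a little over
two hours, before a silent dead peer is noticed. The 60/10/3 values are
illustrative only, not a recommendation.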

I generally prefer to perform app-level keepalives with app-controlled
timeouts and retry settings, but in this case if it's legacy code, a
kernel-level sysctl tweak may be more palatable!
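
For comparison, a minimal sketch of what an app-level keepalive with
app-controlled timeout and retry settings could look like. The one-byte
PING/PONG protocol, the class name and both method names are invented for
illustration; a real protocol would need its own heartbeat message that both
ends understand.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class AppLevelKeepAlive {
    static final int PING = 0x01; // hypothetical heartbeat request byte
    static final int PONG = 0x02; // hypothetical heartbeat reply byte

    /** Sends PING up to maxAttempts times; true as soon as one PONG arrives within the timeout. */
    static boolean isPeerAlive(Socket socket, int timeoutMillis, int maxAttempts) throws IOException {
        socket.setSoTimeout(timeoutMillis); // bounds each blocking read below
        OutputStream out = socket.getOutputStream();
        InputStream in = socket.getInputStream();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            out.write(PING);
            out.flush();
            try {
                int reply = in.read();
                if (reply == -1) {
                    return false; // peer closed the connection
                }
                if (reply == PONG) {
                    return true;
                }
            } catch (SocketTimeoutException e) {
                // no reply in time; fall through and retry
            }
        }
        return false;
    }

    /** Demo helper: probes a loopback peer that either answers every PING or stays silent. */
    static boolean probeLoopback(boolean peerAnswers) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket peer = server.accept()) {
            Thread responder = new Thread(() -> {
                if (!peerAnswers) {
                    return; // silent peer: connection stays open, nothing is sent
                }
                try {
                    InputStream in = peer.getInputStream();
                    OutputStream out = peer.getOutputStream();
                    while (in.read() == PING) {
                        out.write(PONG);
                        out.flush();
                    }
                } catch (IOException ignored) {
                    // connection closed at the end of the demo
                }
            });
            responder.setDaemon(true);
            responder.start();
            return isPeerAlive(client, 200, 3);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("answering peer alive: " + probeLoopback(true));
        System.out.println("silent peer alive:    " + probeLoopback(false));
    }
}
```

The trade-off is the one described above: the application decides how fast a
dead peer is detected, but both ends have to be changed to speak the
heartbeat.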

--j.

On Tue, 29 Nov 2016 at 09:48 Alen Vrečko wrote:

> No. It is just a typical "off the shelf" Linux setup. Thanks for the
> insight.


Re: detecting "broken" TCP connections

2016-11-29 Thread Alen Vrečko
No. It is just a typical "off the shelf" Linux setup. Thanks for the insight.

2016-11-29 10:35 GMT+01:00 Wojciech Kudla:
> Any chance that socket connection is handled by some sort of kernel bypass?
> All bets with blocking IO are off when running with onload/offload drivers.


Re: detecting "broken" TCP connections

2016-11-29 Thread Wojciech Kudla
Any chance that socket connection is handled by some sort of kernel bypass?
All bets with blocking IO are off when running with onload/offload drivers.

On Tue, 29 Nov 2016, 09:29 Alen Vrečko wrote:

> Got a situation where a thread hung on a socket read (old-school
> blocking socket I/O code). One side was in TCP ESTABLISHED while the
> other was in FIN_WAIT_2. The customer was "upgrading" the switches at
> the time this happened.
>
> The thread will never complete. It should get a timeout exception, but
> it doesn't. There is a call to Socket#setSoTimeout in the code, which
> should do the job. My first thought was that there must be a bug in
> setSoTimeout; I never had much faith in SoTimeout. I was not surprised
> to find a lot of bug reports related to socketRead0 hangs. It reminded
> me of this blog post about a hung Postgres connection [1].
>
> I'd use NIO and app-level timeouts, but it is legacy code that I
> can't/don't want to touch.
>
> I've been thinking of using a custom SocketFactory that wraps the
> sockets with some monitoring code. Pretty ugly. It doesn't feel right.
>
> I found quite a few discussions about this, but not really any
> solutions that don't require app-level changes.
>
> Any thoughts? Anybody in a similar boat?
>
> [1] https://tech.zalando.com/blog/hack-to-terminate-tcp-conn-postgres/
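
As a baseline for the setSoTimeout expectation in the quoted post, here is a
small sketch of how SO_TIMEOUT normally behaves against a peer that stays
connected but sends nothing. It does not reproduce the reported hang; it only
shows the expected behaviour to compare against. The class and method names
are hypothetical. Two documented details matter when checking legacy code: a
timeout of 0 means wait forever, and the timeout applies to reads only, not
writes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class ReadTimeoutDemo {

    /** Returns true if a blocking read on a silent loopback peer ends with SocketTimeoutException. */
    static boolean readTimesOut(int timeoutMillis) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket silentPeer = server.accept()) {
            // Must be set on the socket that performs the read, before the read starts.
            client.setSoTimeout(timeoutMillis);
            try {
                client.getInputStream().read(); // peer never writes, so this should time out
                return false;
            } catch (SocketTimeoutException expected) {
                return true; // the socket itself is still open and usable after a timeout
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```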