Re: detecting "broken" TCP connections
Thank you all for your advice. I dismissed keepalive (KA) early on for the
wrong reasons. I thought there must be something better available that I
missed. I'll go with keepalive.

2016-11-29 13:16 GMT+01:00 Greg Young:
> In my experience, protocol-level TCP keepalives don't always work
> between implementations. BSD - Windows used to be a primary culprit:
> though they were set, they would not get hit in some cases. Things may
> be better today. On the same implementation they should work quite well.
> Definitely worth testing if you deal with multiple implementations.

--
You received this message because you are subscribed to the Google Groups
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
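A minimal sketch of what going with keepalive could look like on the Java side without touching the legacy read loop. It assumes the legacy code lets you plug in a SocketFactory, which the thread does not confirm. KeepAliveSocketFactory and demo() are hypothetical names. Note that Socket#setKeepAlive only switches SO_KEEPALIVE on; the probe timing still comes from the kernel settings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import javax.net.SocketFactory;

/** Hypothetical wrapper: delegates to another factory and enables SO_KEEPALIVE. */
class KeepAliveSocketFactory extends SocketFactory {
    private final SocketFactory delegate;

    KeepAliveSocketFactory(SocketFactory delegate) {
        this.delegate = delegate;
    }

    // Turn on kernel-level keepalive probes for this socket.
    private static Socket withKeepAlive(Socket socket) throws IOException {
        socket.setKeepAlive(true);
        return socket;
    }

    @Override
    public Socket createSocket() throws IOException {
        return withKeepAlive(delegate.createSocket());
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        return withKeepAlive(delegate.createSocket(host, port));
    }

    @Override
    public Socket createSocket(String host, int port, InetAddress localHost, int localPort)
            throws IOException {
        return withKeepAlive(delegate.createSocket(host, port, localHost, localPort));
    }

    @Override
    public Socket createSocket(InetAddress host, int port) throws IOException {
        return withKeepAlive(delegate.createSocket(host, port));
    }

    @Override
    public Socket createSocket(InetAddress address, int port, InetAddress localAddress, int localPort)
            throws IOException {
        return withKeepAlive(delegate.createSocket(address, port, localAddress, localPort));
    }

    /** Connects to a throwaway loopback listener and reports whether SO_KEEPALIVE is set. */
    static boolean demo() {
        SocketFactory factory = new KeepAliveSocketFactory(SocketFactory.getDefault());
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = factory.createSocket(loopback, server.getLocalPort())) {
            return client.getKeepAlive();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("SO_KEEPALIVE enabled: " + demo());
    }
}
```

If the legacy code creates its sockets directly with new Socket(...), this wrapper does not help, and the option would have to be set wherever the socket is constructed.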
Re: detecting "broken" TCP connections
In my experience, protocol-level TCP keepalives don't always work between
implementations. BSD - Windows used to be a primary culprit: though they
were set, they would not get hit in some cases. Things may be better today.
On the same implementation they should work quite well. Definitely worth
testing if you deal with multiple implementations.

On Tue, Nov 29, 2016 at 12:00 PM, Justin Mason wrote:
> I think that, as the Zalando blog post suggested, you could use OS-level TCP
> keepalive to test the connections regularly, so the kernel will eventually
> notice that the TCP connection is now dead:
> http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html -- by default
> this waits for 2 hours of inactivity, which seems too long for many use
> cases.
>
> I generally prefer to perform app-level keepalives with app-controlled
> timeouts and retry settings, but in this case if it's legacy code, a
> kernel-level sysctl tweak may be more palatable!
>
> --j.

--
Studying for the Turing test
Re: detecting "broken" TCP connections
I think that, as the Zalando blog post suggested, you could use OS-level TCP
keepalive to test the connections regularly, so the kernel will eventually
notice that the TCP connection is now dead:
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html -- by default
this waits for 2 hours of inactivity, which seems too long for many use
cases.

I generally prefer to perform app-level keepalives with app-controlled
timeouts and retry settings, but in this case if it's legacy code, a
kernel-level sysctl tweak may be more palatable!

--j.

On Tue, 29 Nov 2016 at 09:48 Alen Vrečko wrote:
> No. It is just a typical "off the shelf" Linux setup. Thanks for the
> insight.
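Rough arithmetic for the 2 hour default mentioned above, assuming the stock Linux sysctl values (net.ipv4.tcp_keepalive_time=7200, net.ipv4.tcp_keepalive_intvl=75, net.ipv4.tcp_keepalive_probes=9). The "tuned" numbers are made up for illustration, and KeepAliveBudget is a hypothetical name. The estimate is idle time plus one interval per unanswered probe.

```java
import java.time.Duration;

class KeepAliveBudget {
    /**
     * Approximate time for the kernel to declare an idle, dead connection broken:
     * idle time before the first probe, plus one interval per unanswered probe.
     */
    static Duration worstCaseDetection(Duration idle, Duration interval, int probes) {
        return idle.plus(interval.multipliedBy(probes));
    }

    public static void main(String[] args) {
        // Stock Linux values: tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes.
        Duration defaults =
                worstCaseDetection(Duration.ofSeconds(7200), Duration.ofSeconds(75), 9);
        // Hypothetical tuned values, not a recommendation from the thread.
        Duration tuned =
                worstCaseDetection(Duration.ofSeconds(60), Duration.ofSeconds(10), 6);

        System.out.println("defaults: " + defaults.getSeconds() + " s");
        System.out.println("tuned:    " + tuned.getSeconds() + " s");
    }
}
```

With the defaults this comes to a little over two hours, which is why a sysctl change (or a per-socket override where the platform offers one) is usually part of relying on keepalive.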
Re: detecting "broken" TCP connections
No. It is just a typical "off the shelf" Linux setup. Thanks for the insight.

2016-11-29 10:35 GMT+01:00 Wojciech Kudla:
> Any chance that socket connection is handled by some sort of kernel bypass?
> All bets with blocking IO are off when running with onload/offload drivers.
Re: detecting "broken" TCP connections
Any chance that the socket connection is handled by some sort of kernel
bypass? All bets with blocking IO are off when running with onload/offload
drivers.

On Tue, 29 Nov 2016, 09:29 Alen Vrečko wrote:
> Got a situation where a thread hung on a socket read (old school socket
> bio code). One side was in TCP ESTABLISHED while the other was in
> FIN_WAIT_2. The customer was "upgrading" the switches at the time this
> happened.
>
> The thread will never complete. It should get a timeout exception. But
> it doesn't. There is a call to Socket#setSoTimeout in the code. It
> should do the job. My first thought was there must be a bug in
> setSoTimeout. I never had much faith in SoTimeout. Was not surprised
> to find a lot of bug reports related to socketRead0 hangs. Reminded me
> of this blog post about a hung Postgres connection [1].
>
> I'd use nio and app-level timeouts. But it is legacy code that I
> can't/don't want to touch.
>
> Been thinking of using a custom SocketFactory that wraps the sockets
> with some monitoring code. Pretty ugly. It doesn't feel right.
>
> Found quite a few discussions about this. But not really any solutions
> that don't require app-level changes.
>
> Any thoughts? Anybody in a similar boat?
>
> [1] https://tech.zalando.com/blog/hack-to-terminate-tcp-conn-postgres/
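For reference, a small sketch of the behaviour the original post expects from Socket#setSoTimeout: a blocking read against a peer that never writes should end in SocketTimeoutException. This runs against a loopback listener and does not reproduce the hang described above; SoTimeoutDemo and readTimesOut are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class SoTimeoutDemo {
    /** Returns true if a blocking read on a silent peer ends with SocketTimeoutException. */
    static boolean readTimesOut(int timeoutMillis) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket peer = server.accept()) {
            // SO_TIMEOUT bounds how long a blocking read may wait for data.
            client.setSoTimeout(timeoutMillis);
            try {
                client.getInputStream().read(); // the peer never writes anything
                return false;                   // only reached if data or EOF arrived
            } catch (SocketTimeoutException expected) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

SO_TIMEOUT applies to blocking reads (and accept), not to writes, so it is worth checking that the hung thread is actually parked in a read on a socket that had the timeout set.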