Apologies for not posting the link to the old thread. Here it is: http://bit.ly/1JAaJaJ
Thanks
Powell.

On 8/31/15, 2:34 PM, "Powell Molleti" <[email protected]> wrote:

>In reference to:
>https://issues.apache.org/jira/browse/ZOOKEEPER-2246
>
>Plainly removing sock.setSoTimeout(0) from http://s.apache.org/TfI has
>the unintended consequence of shutting down both the RecvWorker and
>SendWorker threads in all cases. It seems the current code is designed
>to keep the socket alive (and the threads running) so that the channel
>can be reused to communicate again with a peer node that is still
>alive but needs to redo leader election.
>
>I could not reproduce any issue when the threads shut down after the
>timeout, since new threads are created for the next iteration of leader
>election. I would rather reuse the threads and the channel, hence I
>propose the following approach.
>
>The alternative I suggest is to still remove setSoTimeout(0) from
>http://s.apache.org/TfI, also enable SO_KEEPALIVE via setKeepAlive() on
>this socket, and not consider it an error when a timeout occurs here:
>http://bit.ly/1JHIdVY
>but consider it an error when it happens here:
>http://bit.ly/1NTjQ9R
>
>This means that users can tune the TCP keep-alive timeouts so that
>socket failures propagate to user space sooner, and ZooKeeper also
>resets the socket if it detects that the other side is not responding
>when it knows it needs a response within some bounded time.
>
>Ideally, I wish there were userspace pings on every socket channel
>between ZooKeeper nodes to detect dead channels quickly. It seems one
>exists for sockets that do Follow/Lead after leader election is done,
>but not for this one? Such a feature could be added with care towards
>making it backward compatible.
>
>I posted the above text to Jira. Please also point out any wrong
>assumptions I have made and provide comments and suggestions.
>
>Thanks
>Powell.
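
For readers following along, here is a minimal sketch of the behaviour proposed above, under stated assumptions. It is not the ZooKeeper QuorumCnxManager code: the class and method names (KeepAliveTimeoutSketch, configure, readIdle, readExpected) and the 200 ms timeout are hypothetical and chosen only for illustration. It uses only standard java.net calls: a finite setSoTimeout(), setKeepAlive(true), a read path where a timeout is tolerated, and a read path where a timeout closes the socket.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration only; none of these names come from ZooKeeper.
class KeepAliveTimeoutSketch {

    /** Finite read timeout plus SO_KEEPALIVE, instead of setSoTimeout(0). */
    static void configure(Socket sock, int readTimeoutMs) throws IOException {
        sock.setKeepAlive(true);          // let the OS probe an idle connection
        sock.setSoTimeout(readTimeoutMs); // blocking reads now time out
    }

    /**
     * Idle receive path: a timeout is not an error. Returns the value read,
     * or null if nothing arrived in time. The socket stays open.
     * Simplification: a timeout in the middle of a 4-byte value would lose
     * the bytes already consumed; real code would need to handle that.
     */
    static Integer readIdle(DataInputStream in) throws IOException {
        try {
            return in.readInt();
        } catch (SocketTimeoutException e) {
            return null;
        }
    }

    /** Path where a reply is required in bounded time: a timeout is an error. */
    static int readExpected(Socket sock, DataInputStream in) throws IOException {
        try {
            return in.readInt();
        } catch (SocketTimeoutException e) {
            sock.close(); // reset the channel, then report the failure
            throw e;
        }
    }

    /** Runs both paths over a loopback connection and summarizes what happened. */
    static String demo() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket peer = server.accept()) {
            configure(peer, 200);
            DataInputStream in = new DataInputStream(peer.getInputStream());
            StringBuilder out = new StringBuilder();

            // Nothing sent yet: the idle read times out and the socket survives.
            out.append("idle=").append(readIdle(in))
               .append(peer.isClosed() ? ",closed" : ",open");

            // The same channel is reused once the peer talks again.
            client.getOutputStream().write(new byte[] {0, 0, 0, 42});
            client.getOutputStream().flush();
            out.append(";data=").append(readIdle(in));

            // No reply arrives where one is required: the socket is closed.
            try {
                readExpected(peer, in);
                out.append(";expected=reply");
            } catch (SocketTimeoutException e) {
                out.append(";expected=timeout")
                   .append(peer.isClosed() ? ",closed" : ",open");
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note that setKeepAlive(true) only switches the probes on; how soon a dead peer is detected depends on the operating system's keep-alive settings, which is the tuning knob mentioned in the message above.
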
>
>
>> From Raúl Gutiérrez Segalés <[email protected]>
>> Subject Re: quorum connection manager shutdown takes long time
>> Date Thu, 10 Jul 2014 18:02:37 GMT
>> On 9 July 2014 08:28, Michi Mutsuzaki <[email protected]> wrote:
>
>>> I don't know how I missed that :) QA said this is reproducible, so
>>> I'll try commenting this line out. Thanks Flavio!
>>>
>
>> I am curious, was it that?
>> -rgs
>
