Cao Manh Dat created SOLR-14356:
-----------------------------------

             Summary: PeerSync with hanging nodes
                 Key: SOLR-14356
                 URL: https://issues.apache.org/jira/browse/SOLR-14356
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Cao Manh Dat

Right now in {{PeerSync}} (during leader election), if requesting versions from a node fails with an exception, we skip that node when the exception is one of the following types:
* ConnectTimeoutException
* NoHttpResponseException
* SocketException

Sometimes the other node is effectively hung but still accepts connections. In that case a {{SocketTimeoutException}} is thrown, we consider the {{PeerSync}} process failed, and the whole shard stays leaderless forever (as long as the hung node is still there).

We can't just blindly add {{SocketTimeoutException}} to the list above, since [~shalin] mentioned that timeouts can sometimes happen for genuine reasons too, e.g. a temporary GC pause.

I think the general idea here is that we obey the {{leaderVoteWait}} restriction and retry syncing with the other nodes.
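
To make the idea a bit more concrete, here is a rough sketch of what such a retry could look like. This is not Solr's actual {{PeerSync}} code: the class {{PeerSyncRetrySketch}}, the {{VersionRequest}} interface, the {{syncWithRetry}} method and the {{pauseMs}} parameter are all made up for illustration, and {{leaderVoteWaitMs}} simply stands in for the configured {{leaderVoteWait}}. To stay JDK-only it handles just {{SocketException}} from the skip list (the two HttpClient exception types are left out). It also assumes that a node which is still timing out once the window has expired gets skipped like the others, which is only one possible reading of the proposal.

```java
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only; none of these names come from Solr. */
final class PeerSyncRetrySketch {

  /** Hypothetical stand-in for "request versions from one replica". */
  interface VersionRequest {
    void fetchVersions(String replica) throws Exception;
  }

  /**
   * Requests versions from every replica. Replicas that time out are retried
   * until leaderVoteWaitMs has elapsed; after that they are skipped
   * (assumption, see the note above).
   *
   * @return false if a replica failed with an exception that is neither a
   *         timeout nor a skippable connection error, true otherwise
   */
  static boolean syncWithRetry(List<String> replicas, VersionRequest request,
                               long leaderVoteWaitMs, long pauseMs)
      throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(leaderVoteWaitMs);
    List<String> pending = new ArrayList<>(replicas);

    while (!pending.isEmpty()) {
      List<String> timedOut = new ArrayList<>();
      for (String replica : pending) {
        try {
          request.fetchVersions(replica);
        } catch (SocketTimeoutException e) {
          // Could be a hung node or just a GC pause: remember it and retry later.
          timedOut.add(replica);
        } catch (SocketException e) {
          // Already on the skip list today: ignore this replica.
        } catch (Exception e) {
          // Anything else still fails the sync, as it does now.
          return false;
        }
      }
      pending = timedOut;
      // Stop when nothing is left to retry or the leaderVoteWait window is over.
      if (pending.isEmpty() || System.nanoTime() - deadline >= 0) {
        break;
      }
      Thread.sleep(pauseMs);
    }
    // Replicas still in 'pending' here are treated as skipped.
    return true;
  }
}
```

With this shape, a node that was only paused by GC gets another chance on the next round, while a node that hangs for good can no longer block the election for longer than {{leaderVoteWait}}.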