Hoss Man created SOLR-13100:
-------------------------------

             Summary: harden/manage connectionpool used for intra-cluster communication when we know nodes go down
                 Key: SOLR-13100
                 URL: https://issues.apache.org/jira/browse/SOLR-13100
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Hoss Man
I'm spinning this idea off of some comments i made in SOLR-13028...

In that issue, in discussion of some test failures that can happen after a node is shutdown/restarted (new emphasis added)...

{quote}
The bit where the test fails is that it:
# shuts down a jetty instance
# starts the jetty instance again
# does some waiting for all the collections to be "active" and all the replicas to be "live"
# tries to send an auto-scaling 'set-cluster-preferences' config change to the cluster

The bit of test code where it does this creates an entirely new CloudSolrClient, ignoring the existing one except for the ZKServer address, w/an explicit comment that the reason it's doing this is because the connection pool on the existing CloudSolrClient might have a stale connection to the old (Ie: dead) instance of the restarted jetty...
...
...doing this ensures that the cloudClient doesn't try to query the "dead" server directly (on a stale connection) but IIUC this issue of stale connections to the dead server instance is still problematic - and the root cause of this failure - because after the CloudSolrClient picks a random node to send the request to, _on the remote solr side, that node then has to dispatch a request to each and every node, and at that point the node doing the distributed dispatch may also have a stale connection pool pointing at a server instance that's no longer listening._
{quote}

*The point of this issue is to explore, if/how we can -- in general -- better deal with pooled connections in situations where the cluster state knows that an existing node has gone down, or been restarted.*

SOLR-13028 is a particular example of when/how stale pooled connection info can cause test problems -- and the bulk of the discussion in that issue is about how that specific code path (dealing with an intra-cluster autoscaling handler command dispatch) can be improved to do a retry in the event of NoHttpResponseException -- but not every place where solr nodes need to talk to each other can blindly retry on every possible connection exception; and even when we can, it would be better if we could minimize the risk of the request failing in a way that would require a retry.

*So why not improve our HTTP connection pool to be aware of our clusterstate and purge connections when we know nodes have been shutdown/lost?*
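To make the proposal a bit more concrete, here's a rough sketch (hypothetical -- not existing Solr code) of what a cluster-state-aware eviction hook might look like if the intra-cluster clients are backed by Apache HttpClient's PoolingHttpClientConnectionManager. The class/method names and the live-nodes callback wiring are illustrative assumptions; the stock 4.x pool API only lets us evict expired/idle connections wholesale, so purging only the routes that point at the lost node would need deeper cooperation from the pool implementation.

{code:java}
import java.util.SortedSet;
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

/**
 * Hypothetical sketch of the idea in this issue: when the live-nodes set
 * shrinks, proactively evict pooled connections instead of waiting for a
 * request to fail with NoHttpResponseException.
 */
public class ClusterAwareConnectionEvictor {

  private final PoolingHttpClientConnectionManager poolMgr;

  public ClusterAwareConnectionEvictor(PoolingHttpClientConnectionManager poolMgr) {
    this.poolMgr = poolMgr;
  }

  /**
   * Intended to be called from whatever watches /live_nodes (wiring not shown).
   */
  public void onLiveNodesChanged(SortedSet<String> oldLiveNodes,
                                 SortedSet<String> newLiveNodes) {
    // did any previously-live node disappear?
    boolean nodeLost = oldLiveNodes.stream().anyMatch(n -> !newLiveNodes.contains(n));
    if (nodeLost) {
      // blunt but possible with the stock 4.x API: drop every expired/idle
      // pooled connection so the next dispatch opens a fresh socket instead
      // of reusing one pointed at a server instance that's no longer listening
      poolMgr.closeExpiredConnections();
      poolMgr.closeIdleConnections(0, TimeUnit.MILLISECONDS);
    }
  }
}
{code}

Even this blunt variant would mean the first intra-cluster dispatch after a node bounce builds fresh connections rather than failing on a half-closed socket and forcing a retry; a smarter pool could limit the eviction to just the routes for the node(s) that went away.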