Hi James,

This might not be 100% what you are looking for, but here are some ideas to explore:
1. Change the session timeout on the ZooKeeper client; this might help move an unresponsive node to the "down" state, and Solr Cloud will take the affected node out of rotation on its own. https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkSessions

2. Create your own HttpClient with more aggressive connection/socket timeout values and pass it to CloudSolrClient during construction; if the client times out, retry. You can also interrogate ZK for which nodes serve a given shard and send the request to the other node with the distrib=false flag; that might be more intrusive depending on your shards/data model/queries.

And best of all the suggestions: fix the infrastructure :)

Good luck!

--
Jaroslaw Rozanski

On Fri, 17 Nov 2017, at 00:42, kasinger, james wrote:
> Hi,
>
> We aren't seeing any exceptions happening for Solr during that time. When
> the disk freezes up, Solr waits (please refer to the attached GC image,
> which shows a period of about a minute where no new objects are created
> in memory). The node is still accepting and stacking requests, and when
> the disk is accessible again Solr resumes with those threads in a healthy
> state, albeit with increased latency.
>
> We've explored solutions for marking the node as unhealthy when an
> incident like this occurs, but have determined that the risk of taking it
> out of rotation and impacting the cluster outweighs the momentary
> latency that we are experiencing.
>
> Attached a thread dump to show the Jetty threads that pile up while
> solr/storage is in freeze, as well as a graph of total system threads
> increasing and CPU IO wait on the disk.
>
> It's a temporary storage outage, though it could be viewed as a performance
> issue, and perhaps we need to become aware of more creative ways of
> handling degraded performance… Any ideas?
>
> Thanks,
> James Kasinger
>
>
> On 11/15/17, 8:50 PM, "Jaroslaw Rozanski" <m...@jarekrozanski.eu> wrote:
>
> Hi,
>
> It is interesting that the node reports healthy despite the store access
> issue.
> That node should be marked down if it can't open the core backing the
> sharded collection.
>
> Maybe you could share the exceptions/errors that you see in the
> console/logs.
>
> I have experienced issues with a replica node not responding in a timely
> manner due to performance issues, but that does not seem to match your
> case.
>
>
> --
> Jaroslaw Rozanski
>
> On Wed, 15 Nov 2017, at 22:49, kasinger, james wrote:
> > Hello folks,
> >
> > To start, we have a sharded Solr Cloud configuration running Solr version
> > 5.1.0. During shard-to-shard communication there is a problem state
> > where queries are sent to a replica, and on that replica the storage is
> > inaccessible. The node is healthy, so it's still taking requests, which
> > get piled up waiting to read from disk, resulting in a latency increase.
> > We've tried resolving this storage inaccessibility, but it appears
> > related to AWS EBS issues. Has anyone encountered the same issue?
> >
> > thanks

Email had 1 attachment:
+ 23c0_threads_bad.zip
  24k (application/zip)
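For the archives, suggestion 2 above (aggressive client-side timeouts plus retry) could be sketched roughly as below. This is a minimal, hypothetical sketch in plain Java: the retry wrapper is not from Solr, the timeout values are illustrative, and the SolrJ calls mentioned in the comment (HttpClientUtil, CloudSolrClient) are from the 5.x API as I recall it.

```java
import java.util.concurrent.Callable;

// Hypothetical retry wrapper pairing aggressive timeouts with bounded retries.
// With SolrJ 5.x the client side would look roughly like (shown as a comment,
// since it needs a running cluster):
//
//   ModifiableSolrParams p = new ModifiableSolrParams();
//   p.set(HttpClientUtil.PROP_CONNECTION_TIMEOUT, 2000); // ms, illustrative
//   p.set(HttpClientUtil.PROP_SO_TIMEOUT, 5000);         // ms, illustrative
//   CloudSolrClient client =
//       new CloudSolrClient(zkHost, HttpClientUtil.createClient(p));
//
// and you would pass () -> client.query(query) as the Callable below.
public final class RetryingQuery {

    /**
     * Runs {@code call}, retrying up to {@code maxAttempts} times on any
     * exception (e.g. a socket timeout from an unresponsive replica),
     * sleeping {@code backoffMs} between attempts.
     */
    public static <T> T retryOnTimeout(Callable<T> call, int maxAttempts, long backoffMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // remember the failure; retry if attempts remain
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs);
                }
            }
        }
        throw last; // all attempts exhausted; surface the last failure
    }
}
```

On a timeout, the retry naturally lands on whichever replica the load balancer picks next; if you query a specific node with distrib=false (per suggestion 2), the retry is where you would switch to the other replica serving that shard.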