Hi James,

This might not be 100% what you are looking for, but here are some
ideas to explore:

1. Change the session timeout on the ZooKeeper client; this might help
move the unresponsive node to the "down" state, and SolrCloud will take
the affected node out of rotation on its own.
https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkSessions
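On the Solr side this timeout is usually controlled by zkClientTimeout, e.g. in solr.xml (exact placement can vary by version/setup, and 15000 here is just an illustrative value, not a recommendation):

```xml
<solrcloud>
  <!-- how long a node's ZK session may be unresponsive before it expires;
       a lower value means the node is marked down sooner -->
  <int name="zkClientTimeout">15000</int>
</solrcloud>
```

A shorter timeout gets a stalled node marked down faster, at the cost of more false positives during GC pauses, so test before lowering it.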

2. Create your own HttpClient with more aggressive connection/socket
timeout values and pass it to CloudSolrClient during construction; if
the client times out, retry. You can also ask ZooKeeper which nodes
serve a given shard and send the request to another node with the
distrib=false flag; that might be more intrusive depending on your
shards/data model/queries.
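The CloudSolrClient/HttpClient wiring is version-specific so I won't guess at exact constructors here, but the retry-on-timeout part can be as simple as the sketch below. `withRetry` and the attempt count are made-up names/values; in real code the Callable body would wrap `client.query(...)`:

```java
import java.util.concurrent.Callable;

public class RetryExample {

    // Retry a query-like call up to maxAttempts times (assumed >= 1).
    // With aggressive socket timeouts, a hung replica surfaces quickly as
    // a SocketTimeoutException and the retry can hit a healthier node.
    static <T> T withRetry(Callable<T> call, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) { // e.g. java.net.SocketTimeoutException
                last = e;
            }
        }
        throw last; // all attempts failed
    }

    public static void main(String[] args) throws Exception {
        // Simulate a call that times out twice, then succeeds.
        int[] calls = {0};
        String result = withRetry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new java.net.SocketTimeoutException("read timed out");
            }
            return "ok";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "ok after 3 attempts"
    }
}
```

This only pays off if the retry can actually land somewhere else (another replica, or distrib=false against a specific node), otherwise you just queue up more requests on the stuck disk.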

And above all the suggestions: fix the infrastructure :)

Good luck!

--
Jaroslaw Rozanski

On Fri, 17 Nov 2017, at 00:42, kasinger, james wrote:
> Hi,
> 
> We aren’t seeing any exceptions happening for solr during that time. When
> the disk freezes up, solr waits (please refer to the attached gc image
> which shows a period of about a minute where no new objects are created
> in memory). The node is still accepting and stacking requests, and when
> the disk is accessible solr resumes with those threads in healthy state
> albeit with increased latency.
> 
> We’ve explored solutions for marking the node as unhealthy when an
> incident like this occurs, but have determined that the risk of taking it
> out of rotation and impacting the cluster, outweighs the momentary
> latency that we are experiencing.  
> 
> Attached a thread dump to show the jetty threads that pile up while
> solr/storage is in freeze, as well as a graph of total system threads
> increasing and CPU IO wait on the disk.
> 
> It’s a temporary storage outage, though could be viewed as a performance
> issue, and perhaps we need to become aware of more creative ways of
> handling degraded performance… Any ideas?
> 
> Thanks,
> James Kasinger
> 
> 
> On 11/15/17, 8:50 PM, "Jaroslaw Rozanski" <m...@jarekrozanski.eu> wrote:
> 
>     Hi,
>     
>     It is interesting that the node reports healthy despite the store
>     access issue. That node should be marked down if it can't open the
>     core backing the sharded collection.
>     
>     Maybe you could share the exceptions/errors that you see in the
>     console/logs.
>     
>     I have experienced issues with replica node not responding in timely
>     manner due to performance issues but that does not seem to match your
>     case.
>     
>     
>     --
>     Jaroslaw Rozanski 
>     
>     On Wed, 15 Nov 2017, at 22:49, kasinger, james wrote:
>     > Hello folks,
>     > 
>     > 
>     > 
>     > To start, we have a sharded solr cloud configuration running solr
>     > version 5.1.0. During shard to shard communication there is a
>     > problem state where queries are sent to a replica, and on that
>     > replica the storage is inaccessible. The node is healthy so it’s
>     > still taking requests which get piled up waiting to read from disk,
>     > resulting in a latency increase. We’ve tried resolving this storage
>     > inaccessibility but it appears related to AWS EBS issues. Has
>     > anyone encountered the same issue?
>     > 
>     > thanks
>     
> 
> Email had 1 attachment:
> + 23c0_threads_bad.zip
>   24k (application/zip)
