I’ve been experimenting with SolrCloud configurations in AWS. One issue I’ve 
been plagued with is that during indexing, occasionally a node decides it can’t 
talk to ZK, and this disables updates in the pool. The node usually recovers 
within a second or two. It’s possible this happens when I’m not indexing too, 
but I’m much less likely to notice.

I’ve seen this with multiple sharding configurations and multiple cluster 
sizes. I’ve searched around, and I think I’ve addressed the usual resolutions 
when someone complains about ZK and Solr. I’m using:

  *   60-sec ZK connection timeout (although this seems like a pretty terrible 
requirement)
  *   Independent 3-node ZK cluster, also in AWS.
  *   Solr 4.6.1
  *   Optimized GC settings (and I’ve confirmed no GC pauses are occurring)
  *   5-min auto-hard-commit with openSearcher=false

I’m indexing some 10K docs/sec using CloudSolrServer, but CPU usage on the 
nodes never exceeds 20%; typically it’s around 5%.

Here is the relevant section of logs from one of the nodes when this happened:
http://pastebin.com/K0ZdKmL4

It looks like the node had a connection timeout and tried to re-establish the 
same session on a connection to a different ZK node, but the session had also 
expired. It then closed *that* connection, switched to read-only mode, and 
eventually created a new connection and a new session, which allowed writes again.
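
To make my reading of the log concrete, here is a rough sketch of the sequence as a tiny state machine. This is only a model of what I think I'm seeing, not ZooKeeper or Solr code: the class, the state names, and the event names are all labels I made up, and the rule that writes are allowed only while connected is my assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the sequence in the log; NOT ZooKeeper or Solr code.
public class ZkSessionSketch {

    // Client-side states (my own labels, not ZooKeeper's enum).
    enum State { CONNECTED, DISCONNECTED, EXPIRED }

    // Things the client observes (again, my own labels).
    enum Event {
        CONNECTION_TIMEOUT,        // lost the connection to the current ZK node
        RECONNECTED_SAME_SESSION,  // reattached the old session on another ZK node
        SESSION_EXPIRED,           // the ensemble says the old session is gone
        NEW_SESSION_CREATED        // fresh connection and fresh session
    }

    // Transition function: returns the state after applying one event.
    static State next(State current, Event event) {
        switch (event) {
            case CONNECTION_TIMEOUT:
                return current == State.CONNECTED ? State.DISCONNECTED : current;
            case RECONNECTED_SAME_SESSION:
                return current == State.DISCONNECTED ? State.CONNECTED : current;
            case SESSION_EXPIRED:
                return State.EXPIRED;
            case NEW_SESSION_CREATED:
                return current == State.EXPIRED ? State.CONNECTED : current;
            default:
                throw new IllegalArgumentException("unknown event: " + event);
        }
    }

    // Assumption: updates are accepted only while the session is live.
    static boolean writesAllowed(State state) {
        return state == State.CONNECTED;
    }

    // Applies the events in order and records the state after each one.
    static List<State> replay(State start, Event... events) {
        List<State> trace = new ArrayList<>();
        State state = start;
        for (Event event : events) {
            state = next(state, event);
            trace.add(state);
        }
        return trace;
    }

    public static void main(String[] args) {
        // The sequence I believe the log shows.
        List<State> trace = replay(State.CONNECTED,
                Event.CONNECTION_TIMEOUT,
                Event.SESSION_EXPIRED,
                Event.NEW_SESSION_CREATED);
        for (State state : trace) {
            System.out.println(state + " writesAllowed=" + writesAllowed(state));
        }
    }
}
```

What I would have expected is the RECONNECTED_SAME_SESSION path, where the node reattaches its old session and never leaves write mode; what the log seems to show is the path through EXPIRED.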

Can anyone familiar with the ZK connection/session stuff comment on whether 
this is a bug? I really know nothing about proper ZK client behaviour.

Thanks.
