Re: ZK connection problems

2014-02-22 Thread Mark Miller


On Feb 21, 2014, at 12:23 PM, Jeff Wartes jwar...@whitepages.com wrote:

 
 I’ve been experimenting with SolrCloud configurations in AWS. One issue I’ve 
 been plagued with is that during indexing, occasionally a node decides it 
 can’t talk to ZK, and this disables updates in the pool. The node usually 
 recovers within a second or two. It’s possible this happens when I’m not 
 indexing too, but I’m much less likely to notice.
 
 I’ve seen this with multiple sharding configurations and multiple cluster 
 sizes. I’ve searched around, and I think I’ve addressed the usual resolutions 
 when someone complains about ZK and Solr. I’m using:
 
  *   60-sec ZK connection timeout (although this seems like a pretty terrible 
 requirement)

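Be aware that the effective session timeout maxes out at 40 seconds with the 
default tickTime of 2000 ms: the ZooKeeper server caps it at 20 × tickTime 
unless maxSessionTimeout is raised.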
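
To make the cap concrete, here is a minimal sketch of how the server bounds a requested session timeout. It assumes the ZooKeeper defaults of minSessionTimeout = 2 × tickTime and maxSessionTimeout = 20 × tickTime. The function name and its parameters are made up for illustration and are not part of any ZooKeeper or Solr API.

```python
def negotiated_session_timeout_ms(requested_ms, tick_time_ms=2000,
                                  min_session_timeout_ms=None,
                                  max_session_timeout_ms=None):
    """Return the session timeout a server would grant (illustrative only).

    Assumes the ZooKeeper defaults: lower bound 2 * tickTime and
    upper bound 20 * tickTime, unless explicit bounds are passed in.
    """
    lower = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    upper = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    # Clamp the client's request into the [lower, upper] range.
    return max(lower, min(requested_ms, upper))


# A 60-second request against the default tickTime of 2000 ms.
print(negotiated_session_timeout_ms(60000))
```

So with default server settings, a 60-second request from Solr would behave like a 40-second one; raising maxSessionTimeout (or tickTime) on the ZooKeeper side is what would actually lift the ceiling.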

  *   Independent 3-node ZK cluster, also in AWS.
  *   Solr 4.6.1
  *   Optimized GC settings (and I’ve confirmed no GC pauses are occurring)
  *   5-min auto-hard-commit with openSearcher=false
 
 I’m indexing some 10K docs/sec using CloudSolrServer, but the CPU usage on 
 the nodes doesn’t exceed 20%; typically it’s around 5%.
 
 Here is the relevant section of logs from one of the nodes when this happened:
 http://pastebin.com/K0ZdKmL4
 
 It looks like it had a connection timeout, and tried to re-establish the same 
 session on a connection to a new ZK node, except the session had also 
 expired. It then closes *that* connection, changes to read-only mode, and 
 eventually creates a new connection and new session which allows writes again.
 
 Can anyone familiar with the ZK connection/session stuff comment on whether 
 this is a bug? I really know nothing about proper ZK client behaviour.
 
 Thanks.
 

You have to figure out why Solr is not able to talk to ZooKeeper for 40-60 
seconds. Perhaps it’s the network, perhaps it’s the…I’m not sure. But for some 
reason a very simple heartbeat cannot occur for a long time - and for Solr to 
receive updates, it has to maintain a connection with ZooKeeper. You can either 
raise the timeout, or dig into why the connection heartbeat cannot be 
maintained (it’s very lightweight). 
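
One rough way to start digging is to run a probe from the Solr host that measures how long a ZooKeeper server takes to answer the `ruok` four-letter command. The sketch below is not the client session heartbeat itself, only a stand-in reachability check. It assumes four-letter commands are enabled on the server (newer ZooKeeper releases require them to be whitelisted), and the function name and default values are made up for illustration.

```python
import socket
import time


def zk_ruok(host, port=2181, timeout_s=2.0):
    """Send 'ruok' to a ZooKeeper server and return (ok, elapsed_seconds).

    ok is True only if the server answered 'imok' within the timeout.
    """
    start = time.monotonic()
    reply = b""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.settimeout(timeout_s)
            sock.sendall(b"ruok")
            # The reply is 4 bytes; read until we have them or the peer closes.
            while len(reply) < 4:
                chunk = sock.recv(4 - len(reply))
                if not chunk:
                    break
                reply += chunk
    except OSError:
        # Covers refused connections, timeouts, and other network errors.
        return False, time.monotonic() - start
    return reply == b"imok", time.monotonic() - start
```

Calling this in a loop against each ZooKeeper node and logging the elapsed times next to the Solr logs could show whether the stalls line up with network latency spikes or something local to the Solr JVM.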

- Mark

http://about.me/markrmiller

ZK connection problems

2014-02-21 Thread Jeff Wartes

I’ve been experimenting with SolrCloud configurations in AWS. One issue I’ve 
been plagued with is that during indexing, occasionally a node decides it can’t 
talk to ZK, and this disables updates in the pool. The node usually recovers 
within a second or two. It’s possible this happens when I’m not indexing too, 
but I’m much less likely to notice.

I’ve seen this with multiple sharding configurations and multiple cluster 
sizes. I’ve searched around, and I think I’ve addressed the usual resolutions 
when someone complains about ZK and Solr. I’m using:

  *   60-sec ZK connection timeout (although this seems like a pretty terrible 
requirement)
  *   Independent 3-node ZK cluster, also in AWS.
  *   Solr 4.6.1
  *   Optimized GC settings (and I’ve confirmed no GC pauses are occurring)
  *   5-min auto-hard-commit with openSearcher=false

I’m indexing some 10K docs/sec using CloudSolrServer, but the CPU usage on the 
nodes doesn’t exceed 20%; typically it’s around 5%.

Here is the relevant section of logs from one of the nodes when this happened:
http://pastebin.com/K0ZdKmL4

It looks like it had a connection timeout, and tried to re-establish the same 
session on a connection to a new ZK node, except the session had also expired. 
It then closes *that* connection, changes to read-only mode, and eventually 
creates a new connection and new session which allows writes again.

Can anyone familiar with the ZK connection/session stuff comment on whether 
this is a bug? I really know nothing about proper ZK client behaviour.

Thanks.