On Feb 21, 2014, at 12:23 PM, Jeff Wartes jwar...@whitepages.com wrote:
I’ve been experimenting with SolrCloud configurations in AWS. One issue I’ve
been plagued with is that during indexing, occasionally a node decides it
can’t talk to ZK, and this disables updates in the pool. The node usually
recovers within a second or two. It’s possible this happens when I’m not
indexing too, but I’m much less likely to notice.
I’ve seen this with multiple sharding configurations and multiple cluster
sizes. I’ve searched around, and I think I’ve addressed the usual resolutions
when someone complains about ZK and Solr. I’m using:
* 60-sec ZK connection timeout (although this seems like a pretty terrible
requirement)
Be aware that ZooKeeper caps the session timeout at 20 * tickTime, so it maxes
out at 40 seconds with the default tickTime of 2000 ms.
* Independent 3-node ZK cluster, also in AWS.
* Solr 4.6.1
* Optimized GC settings (and I’ve confirmed no GC pauses are occurring)
* 5-min auto-hard-commit with openSearcher=false
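
The timeout ceiling mentioned in the inline reply above can be illustrated with a small calculation. This is a minimal sketch, assuming the ZooKeeper server's minSessionTimeout and maxSessionTimeout are left at their defaults (2 and 20 times tickTime); the class and method names are hypothetical and are not part of Solr or ZooKeeper.

```java
public class SessionTimeoutSketch {

    /**
     * Approximates the session timeout a ZooKeeper server would grant,
     * assuming default bounds of 2 * tickTime and 20 * tickTime.
     * (Hypothetical helper for illustration only.)
     */
    static int negotiate(int requestedMs, int tickTimeMs) {
        int minMs = 2 * tickTimeMs;   // assumed default minSessionTimeout
        int maxMs = 20 * tickTimeMs;  // assumed default maxSessionTimeout
        return Math.max(minMs, Math.min(maxMs, requestedMs));
    }

    public static void main(String[] args) {
        // A 60-second request against the default tickTime of 2000 ms
        // is clamped to the 40-second ceiling.
        System.out.println(negotiate(60_000, 2_000)); // prints 40000
    }
}
```

Under these assumptions, a 60-second client timeout only takes full effect if tickTime or maxSessionTimeout is also raised on the ZooKeeper servers.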
I’m indexing some 10K docs/sec using CloudSolrServer, but the CPU usage on
the nodes doesn’t exceed 20%; typically it’s around 5%.
Here is the relevant section of logs from one of the nodes when this happened:
http://pastebin.com/K0ZdKmL4
It looks like it had a connection timeout, and tried to re-establish the same
session on a connection to a new ZK node, except the session had also
expired. It then closes *that* connection, changes to read-only mode, and
eventually creates a new connection and new session which allows writes again.
Can anyone familiar with the ZK connection/session stuff comment on whether
this is a bug? I really know nothing about proper ZK client behaviour.
Thanks.
You have to figure out why Solr is not able to talk to ZooKeeper for 40-60
seconds. Perhaps it’s the network, perhaps it’s the…I’m not sure. But for some
reason a very simple heartbeat cannot occur for a long time - and for Solr to
receive updates, it has to maintain a connection with ZooKeeper. You can either
raise the timeout, or dig into why the connection heartbeat cannot be
maintained (it's very lightweight).
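
One rough way to start digging is to probe a ZooKeeper node from the Solr host and log how long each round trip takes. The sketch below is not how Solr or the ZooKeeper client sends its heartbeat; it is a separate check that assumes the "ruok" four-letter command is enabled on the server and that the node listens on the default client port 2181. The class name, defaults, and probe count are hypothetical.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ZkProbeSketch {

    /** Sends "ruok" and returns the reply (a healthy server answers "imok"). */
    static String ruok(String host, int port, int timeoutMs) throws Exception {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs);
            OutputStream out = socket.getOutputStream();
            out.write("ruok".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            InputStream in = socket.getInputStream();
            byte[] buf = new byte[4];
            int n = 0;
            // Read until 4 bytes arrive or the server closes the connection.
            while (n < buf.length) {
                int r = in.read(buf, n, buf.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
            return new String(buf, 0, n, StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost"; // assumed default
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
        int probes = args.length > 2 ? Integer.parseInt(args[2]) : 5;
        for (int i = 0; i < probes; i++) {
            long start = System.nanoTime();
            String reply;
            try {
                reply = ruok(host, port, 5000);
            } catch (Exception e) {
                reply = "FAILED: " + e;
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(System.currentTimeMillis() + " " + reply + " " + elapsedMs + " ms");
            Thread.sleep(1000);
        }
    }
}
```

Running it with a large probe count during indexing and lining up slow or failed probes against the Solr log timestamps may help separate network trouble from problems inside the Solr JVM.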
- Mark
http://about.me/markrmiller