Hi Shawn,

To reiterate, this is the exception I get when zkcli is unable to connect to the ZooKeeper service:

E:\solr-5.3.0\server\scripts\cloud-scripts>zkcli.bat -z 10.0.0.4:2181 -cmd list
Exception in thread "main" org.apache.solr.common.SolrException: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper 10.0.0.4:2181 within 30000 ms
        at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:181)
        at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:115)
        at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:105)
        at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:181)
Caused by: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper 10.0.0.4:2181 within 30000 ms
        at org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:208)
        at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:173)
        ... 3 more

For example, if one of the ZooKeeper services goes down for a few minutes, it may be too late to bring that service back online into the ZooKeeper cluster because of the timeout shown above. In that case, all ZooKeeper services need to be restarted at the same time.

Please clarify whether there is a configuration that I missed, whether this is expected behaviour, or whether it is a bug.

Regards,
Adrian

-----Original Message-----
From: Adrian Liew [mailto:adrian.l...@avanade.com]
Sent: Wednesday, October 7, 2015 11:56 AM
To: solr-user@lucene.apache.org
Subject: RE: Cannot connect to a zookeeper 3.4.6 instance via zkCli.cmd

Hi Shawn,

Thanks for the reply. I understand your comments and will revert to the defaults. However, I raised this issue because I realized that ZooKeeper becomes impatient if it cannot heartbeat its other peers in time. For example, if one ZK server out of three goes down, that server stops pinging the other servers, and zkCli connections to its service fail with timeouts.

I will report back with an update.

Regards,
Adrian

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Tuesday, October 6, 2015 10:16 PM
To: solr-user@lucene.apache.org
Subject: Re: Cannot connect to a zookeeper 3.4.6 instance via zkCli.cmd

On 10/6/2015 3:38 AM, Adrian Liew wrote:
> Thanks for the reply. Looks like this has been resolved by manually starting
> the Zookeeper services on each server promptly so that the tickTime value
> does not timeout too quickly to heartbeat other peers. Hence, I increased the
> tickTime value to about 5 minutes to give some time for a node hosting
> Zookeeper to restart and autostart its service. This case seems fixed but I
> will double check again once more to be sure. I am using nssm
> (non-sucking-service-manager) to autostart Zookeeper. I will need to retest
> this once again using nssm to make sure zookeeper services are up and running.

That sounds like a very bad idea. A typical tickTime is two *seconds*. Zookeeper is designed around certain things happening very quickly. I don't think you can increase that to five *minutes* (multiplying it by 150) without the strong possibility of something going very wrong and processes hanging for minutes at a time waiting for a timeout that should happen very quickly.

I am reasonably certain that tickTime is used for zookeeper operation in several ways, so I believe that this much of an increase will cause fundamental problems with zookeeper's normal operation. I admit that I have not looked at the code, so I could be wrong ... but based on the following information from the Zookeeper docs, I don't think I am wrong:

tickTime
    the length of a single tick, which is the basic time unit used by ZooKeeper, as measured in milliseconds. It is used to regulate heartbeats, and timeouts. For example, the minimum session timeout will be two ticks.

Thanks,
Shawn
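
To put numbers on the tickTime point above, here is a minimal Python sketch of the arithmetic. It only multiplies values and does not read or change any ZooKeeper configuration. The two-tick minimum comes from the docs excerpt quoted above; the 20-tick upper bound is an assumption based on ZooKeeper's documented default for maxSessionTimeout, and the function and constant names are made up for illustration.

```python
# Illustrative arithmetic only; nothing here talks to ZooKeeper.

MIN_SESSION_TICKS = 2   # from the docs excerpt: minimum session timeout is two ticks
MAX_SESSION_TICKS = 20  # assumption: documented default for maxSessionTimeout


def session_timeout_bounds_ms(tick_time_ms):
    """Return (min, max) session timeout in milliseconds for a given tickTime."""
    return (MIN_SESSION_TICKS * tick_time_ms, MAX_SESSION_TICKS * tick_time_ms)


TYPICAL_TICK_MS = 2_000            # two seconds, the typical value mentioned above
PROPOSED_TICK_MS = 5 * 60 * 1000   # five minutes, the value that was tried

for label, tick in (("typical", TYPICAL_TICK_MS), ("proposed", PROPOSED_TICK_MS)):
    low, high = session_timeout_bounds_ms(tick)
    print(f"{label}: tickTime={tick} ms, "
          f"session timeout between {low // 1000} s and {high // 1000} s")

# How much larger the proposed tick is than the typical one.
print("scale factor:", PROPOSED_TICK_MS // TYPICAL_TICK_MS)
```

Under these assumptions, a five-minute tick pushes the smallest possible session timeout to ten minutes, so a dead client or peer could go unnoticed for at least that long.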