What do you have your ZK timeout set to (zkClientTimeout in solr.xml, or on the command line if you override it)?
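For reference, in the legacy Solr 4.x solr.xml format the timeout goes on the <cores> element; a minimal sketch (the values and core name here are illustrative, not a recommendation):

```xml
<solr persistent="true">
  <!-- zkClientTimeout is the ZooKeeper session timeout in milliseconds;
       the ${...:15000} syntax lets -DzkClientTimeout=NNNN on the command
       line override the 15000 ms fallback -->
  <cores adminPath="/admin/cores" defaultCoreName="collection1"
         host="${host:}" hostPort="${jetty.port:8983}"
         zkClientTimeout="${zkClientTimeout:15000}">
    <core name="collection1" instanceDir="collection1"/>
  </cores>
</solr>
```

If that timeout has been bumped to something very large, ZK will wait that long before it expires the dead node's session, which delays the leader election accordingly.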
A kill of the raw process is bad, but ZK should spot it via its heartbeat mechanism, so unless your timeout is very large it should detect that the node is no longer available and trigger a leader election. We (still) use 4.3.0 (with some patches) and we do have some issues with Solr shutdowns not causing an election quickly enough for us, but that's a known issue within Solr/Jetty, and it maybe causes 10-20s of outage, not 20 minutes!

You say you have 3 machines: how many shards and how many ZKs, and are they embedded ZK or external? I think we need more info about the scenario. If you are running embedded ZK, then you are losing both a shard/replica and a ZK at the same time, which isn't ideal (we moved to external ZKs quite quickly; embedded just caused too many issues) but shouldn't be that catastrophic.

Also, does it only happen with a kill -9? What about a normal kill, and/or a normal shutdown of Jetty?

On 9 July 2013 16:18, Shawn Heisey <s...@elyograg.org> wrote:

> > We are going to use solr in production. There are chances that the
> > machine itself might shutdown due to power failure or the network is
> > disconnected due to manual intervention. We need to address those
> > cases as well to build a robust system.
>
> The latest version of Solr is 4.3.1, and 4.4 is right around the corner.
> Any chance you can test a nightly 4.4 build or a checkout of the
> lucene_solr_4_4 branch, so we can know whether you are running into the
> same problems with what will be released soon? No sense in fixing a
> problem that no longer exists.
>
> Thanks,
> Shawn