What do you have your ZooKeeper timeout set to (zkClientTimeout in
solr.xml, or on the command line if you override it there)?
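For reference, this is roughly how it looks in a stock 4.x legacy-style solr.xml (the 15000 ms default and the system-property override are the usual shipped values, but check your own file):

```xml
<solr persistent="true">
  <!-- zkClientTimeout falls back to 15000 ms unless the
       zkClientTimeout system property is set at startup -->
  <cores adminPath="/admin/cores"
         host="${host:}" hostPort="${jetty.port:}"
         zkClientTimeout="${zkClientTimeout:15000}">
  </cores>
</solr>
```

You can then override it per-node with e.g. `-DzkClientTimeout=30000` on the Java command line.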

A hard kill of the raw process is abrupt, but ZK should spot it via its
session heartbeat mechanism, so unless your timeout is very large it
should detect that the node is no longer available and then trigger a
leader election.

We (still) use 4.3.0 (with some patches) and we do see issues where a
Solr shutdown doesn't trigger an election quickly enough for us, but
that's a known issue within Solr/Jetty, and it costs maybe 10-20 seconds
of outage, not 20 minutes!

You say you have 3 machines, how many shards and how many ZKs, and are they
embedded ZK or external? I think we need more info about the scenario.

If you are running embedded ZK, then you are losing both a shard/replica
and a ZK at the same time, which isn't ideal (we moved to external ZKs
quite quickly, embedded just caused too many issues) but shouldn't be that
catastrophic.
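In case it's useful, this is roughly how the two setups differ at startup (hostnames here are made up for the example; check the exact flags against your version):

```shell
# External ensemble: point Solr at all three ZK nodes via zkHost,
# and don't pass -DzkRun, so no ZK runs inside the Solr JVM
java -DzkHost=zk1:2181,zk2:2181,zk3:2181 -jar start.jar

# Embedded ZK (what we moved away from): -DzkRun starts a ZK server
# inside the Solr process, so killing Solr also kills that ZK node
java -DzkRun -jar start.jar
```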

Also, does it only happen with a kill -9? What about a normal kill,
and/or a normal shutdown of Jetty?
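To be explicit about the three variants I mean (the STOP.PORT/STOP.KEY values below are just example settings, assuming Solr was started with the matching flags):

```shell
# kill -9 (SIGKILL): no shutdown hooks run; ZK only notices once the
# TCP connection drops or the session times out
kill -9 <solr-pid>

# normal kill (SIGTERM): Jetty's shutdown hooks run, so Solr gets a
# chance to close its ZK session cleanly
kill <solr-pid>

# graceful Jetty stop: asks the running instance to shut down
# (requires it was started with the same -DSTOP.PORT and -DSTOP.KEY)
java -DSTOP.PORT=8079 -DSTOP.KEY=secret -jar start.jar --stop
```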



On 9 July 2013 16:18, Shawn Heisey <s...@elyograg.org> wrote:

> > We are going to use solr in production. There are chances that the
> machine
> > itself might shutdown due to power failure or the network is disconnected
> > due to manual intervention. We need to address those cases as well to
> > build
> > a robust system..
>
> The latest version of Solr is 4.3.1, and 4.4 is right around the corner.
> Any chance you can test a nightly 4.4 build or a checkout of the
> lucene_solr_4_4 branch, so we can know whether you are running into the
> same problems with what will be released soon? No sense in fixing a
> problem that no longer exists.
>
> Thanks,
> Shawn
>
>
>
