First of all, killing with -9 is A Very Bad Idea. You can leave write lock files lying around. You can leave the state in an "interesting" place. You haven't given Solr a chance to tell ZooKeeper that it's going away (which would set the state to "down"). In short, when you do this you have to deal with the consequences yourself, one of which is this mismatch between cluster state and live_nodes.
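To make that mismatch concrete, here is a minimal Python sketch that treats a replica as down whenever its node is missing from /live_nodes, no matter what state.json says. It assumes the collection's entry from state.json and the children of /live_nodes have already been read into plain Python structures. The function names, host names and sample data are hypothetical, and the shards/replicas/node_name/state layout is an assumed reading of the Solr 5.x state.json format, not something confirmed in this thread.

```python
from typing import List, Set


def effective_state(replica: dict, live_nodes: Set[str]) -> str:
    """Return the replica's usable state.

    A replica whose node is not listed under /live_nodes is treated as
    "down", whatever state.json says (after a kill -9 the entry can
    remain "active").
    """
    if replica.get("node_name") not in live_nodes:
        return "down"
    return replica.get("state", "down")


def stale_active_replicas(collection_state: dict, live_nodes: Set[str]) -> List[str]:
    """List "shard/replica" names that state.json calls active but whose node is gone."""
    stale = []
    for shard_name, shard in collection_state.get("shards", {}).items():
        for replica_name, replica in shard.get("replicas", {}).items():
            if replica.get("state") == "active" and effective_state(replica, live_nodes) == "down":
                stale.append(f"{shard_name}/{replica_name}")
    return stale


# Hypothetical data: core_node4's host was killed with -9, so its node
# dropped out of /live_nodes but its state.json entry still says "active".
example_state = {
    "shards": {
        "shard0": {
            "replicas": {
                "core_node1": {"node_name": "host1:8983_solr", "state": "active"},
                "core_node4": {"node_name": "host3:8983_solr", "state": "active"},
            }
        }
    }
}
example_live_nodes = {"host1:8983_solr", "host2:8983_solr"}

print(stale_active_replicas(example_state, example_live_nodes))  # ['shard0/core_node4']
```

This is the same check the Replica.State Javadoc recommends (state plus live_nodes membership), just done on already-fetched data rather than through SolrJ.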
Now, that rant done, the bin/solr script tries to stop Solr gracefully but issues a kill if Solr doesn't stop nicely. Personally I think that timeout should be longer, but that's another story.

The onlyIfDown='true' option is there specifically as a safety valve. It was provided for those who want to guard against typos and the like, so just don't specify it and you should be fine.

Best,
Erick

On Mon, Jul 18, 2016 at 11:51 PM, Jerome Yang <jey...@pivotal.io> wrote:
> Hi all,
>
> Here's the situation.
> I'm using Solr 5.3 in cloud mode.
>
> I have 4 nodes.
>
> After using "kill -9 pid-solr-node" to kill 2 nodes, the replicas on
> those two nodes are still "ACTIVE" in ZooKeeper's state.json.
>
> The problem is, when I try to delete these down replicas with the
> parameter onlyIfDown='true', it says:
> "Delete replica failed: Attempted to remove replica :
> demo.public.tbl/shard0/core_node4 with onlyIfDown='true', but state is
> 'active'."
>
> From this link:
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.State.html#ACTIVE
>
> It says:
> *NOTE*: when the node the replica is hosted on crashes, the replica's state
> may remain ACTIVE in ZK. To determine if the replica is truly active, you
> must also verify that its node
> <http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.html#getNodeName-->
> is under /live_nodes in ZK (or use ClusterState.liveNodesContain(String)
> <http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/ClusterState.html#liveNodesContain-java.lang.String->).
>
> So, is this a bug?
>
> Regards,
> Jerome
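Erick's advice to simply leave out onlyIfDown amounts to sending a plain DELETEREPLICA request to the Collections API. The Python sketch below only builds the request URL; the base URL is hypothetical, the helper name delete_replica_url is made up for illustration, and actually sending the request (with curl, a browser, or any HTTP client) is left to the caller. Leaving only_if_down unset omits the parameter entirely, so the request is not held to the onlyIfDown='true' check that produced the error quoted above.

```python
from typing import Optional
from urllib.parse import urlencode


def delete_replica_url(base_url: str, collection: str, shard: str, replica: str,
                       only_if_down: Optional[bool] = None) -> str:
    """Build a Collections API DELETEREPLICA URL (hypothetical helper).

    only_if_down=None leaves the onlyIfDown parameter out of the query
    string entirely; True/False add it explicitly.
    """
    params = {
        "action": "DELETEREPLICA",
        "collection": collection,
        "shard": shard,
        "replica": replica,
    }
    if only_if_down is not None:
        params["onlyIfDown"] = "true" if only_if_down else "false"
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Hypothetical base URL; the collection/shard/replica names come from the error above.
print(delete_replica_url("http://localhost:8983/solr",
                         "demo.public.tbl", "shard0", "core_node4"))
# http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=demo.public.tbl&shard=shard0&replica=core_node4
```

Keeping onlyIfDown as an opt-in argument mirrors its role as a safety valve: you pass it when you want the guard, and leave it out when you already know (for example from live_nodes) that the node is gone.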