Re: Send kill -9 to a node and can not delete down replicas with onlyIfDown.

Jerome Yang Wed, 20 Jul 2016 19:15:54 -0700

Thanks a lot everyone!

With onlyIfDown=false, the replica was removed, but the call still returned a
failure message.
That confuses me.


Anyway, thanks Erick and Chris.

Regards,
Jerome

On Thu, Jul 21, 2016 at 5:47 AM, Chris Hostetter <hossman_luc...@fucit.org>
wrote:

>
> Maybe the problem here is some confusion/ambiguity about the meaning of
> "down" ?
>
> TL;DR: think of "onlyIfDown" as "onlyIfShutDownCleanly"
>
>
> IIUC, the purpose of 'onlyIfDown' is a safety valve so that (by default)
> the cluster will prevent you from removing a replica unless it was shut down
> *cleanly* and is officially in a "down" state -- as recorded in the
> ClusterState for the collection (either the collection's state.json or the
> global clusterstate.json if you have an older solr instance)
>
> When you kill -9 a solr node, the replicas that were hosted on that node
> will typically still be listed in the cluster state as "active" -- but the
> node will *not* be in live_nodes, which is how solr knows that replica can't
> currently be used (and leader recovery happens as needed, etc...).
>
> If, however, you shut the node down cleanly (or if -- for whatever reason
> -- the node is up, but the replica's SolrCore is not active) then the
> cluster state will record that replica as "down"
>
> Where things unfortunately get confusing is that the CLUSTERSTATUS api
> call -- apparently in an attempt to simplify things -- changes the
> recorded status of any replica to "down" if that replica is hosted on a
> node which is not in live_nodes.
>
> I suspect that since the UI uses the CLUSTERSTATUS api to get its state
> information, it doesn't display much difference between a replica shut down
> cleanly and a replica that is hosted on a node which died abruptly.
>
> I suspect that's where your confusion is coming from?
>
>
> Ultimately, what onlyIfDown is trying to do is help ensure that you don't
> accidentally delete a replica that you didn't mean to.  The operating
> assumption is that the only replicas you will (typically) delete are
> replicas that you shut down cleanly ... if a replica is down because of a
> hard crash, then that is an exceptional situation and presumably you will
> either: a) try to bring the replica back up; b) delete the replica using
> onlyIfDown=false to indicate that you know the replica you are deleting
> isn't 'down' intentionally, but you want to delete it anyway.
>
>
>
>
>
> On Wed, 20 Jul 2016, Erick Erickson wrote:
>
> : Date: Wed, 20 Jul 2016 08:26:32 -0700
> : From: Erick Erickson <erickerick...@gmail.com>
> : Reply-To: solr-user@lucene.apache.org
> : To: solr-user <solr-user@lucene.apache.org>
> : Subject: Re: Send kill -9 to a node and can not delete down replicas with
> :     onlyIfDown.
> :
> : Yes, it's the intended behavior. The whole point of the
> : onlyIfDown flag was as a safety valve for those
> : who wanted to be cautious and guard against typos
> : and the like.
> :
> : If you specify onlyIfDown=false and the node still
> : isn't removed from ZK, it's not right.
> :
> : Best,
> : Erick
> :
> : On Tue, Jul 19, 2016 at 10:41 PM, Jerome Yang <jey...@pivotal.io> wrote:
> : > What I'm doing is simulating a host crash.
> : >
> : > Consider this: a host is no longer connected to the cluster.
> : >
> : > So, if a host has crashed, I cannot delete its down replicas by using
> : > onlyIfDown='true'.
> : > But the Solr admin UI shows these replicas as down.
> : > And without "onlyIfDown", it still shows a failure:
> : > Delete replica failed: Attempted to remove replica :
> : > demo.public.tbl/shard0/core_node4 with onlyIfDown='true', but state is
> : > 'active'.
> : >
> : > Is this the right behavior? If a host is gone, can I not delete the
> : > replicas on this host?
> : >
> : > Regards,
> : > Jerome
> : >
> : > On Wed, Jul 20, 2016 at 1:58 AM, Justin Lee <lee.justi...@gmail.com> wrote:
> : >
> : >> Thanks for taking the time for the detailed response. I completely get what
> : >> you are saying. Makes sense.
> : >> On Tue, Jul 19, 2016 at 10:56 AM Erick Erickson <erickerick...@gmail.com>
> : >> wrote:
> : >>
> : >> > Justin:
> : >> >
> : >> > Well, "kill -9" just makes it harder. The original question
> : >> > was whether a replica being "active" was a bug, and it's
> : >> > not when you kill -9; the Solr node has no chance to
> : >> > tell Zookeeper it's going away. ZK does update
> : >> > live_nodes by itself, so whenever a replica's state is
> : >> > referenced there are checks, as necessary, that its node
> : >> > is also in live_nodes. And the overwhelming majority of
> : >> > the time this is OK; Solr recovers just fine.
> : >> >
> : >> > As far as the write locks are concerned, those are
> : >> > a Lucene level issue, so if you kill Solr at just the
> : >> > wrong time it's possible that one will be left over. The
> : >> > write locks are held for as short a period as possible
> : >> > by Lucene, but occasionally they can linger if you kill
> : >> > -9.
> : >> >
> : >> > When a replica comes up, if there is a write lock already, it
> : >> > doesn't just take over; it fails to load instead.
> : >> >
> : >> > A kill -9 won't bring the cluster down by itself except
> : >> > if there are several coincidences. Just don't make
> : >> > it a habit. For instance, consider if you kill -9 on
> : >> > two Solrs that happen to contain all of the replicas
> : >> > for a shard1 for collection1. And you _happen_ to
> : >> > kill them both at just the wrong time and they both
> : >> > leave Lucene write locks for those replicas. Now
> : >> > no replica will come up for shard1 and the collection
> : >> > is unusable.
> : >> >
> : >> > So the shorter form is that using "kill -9" is a poor practice
> : >> > that exposes you to some risk. The hard-core Solr
> : >> > guys work extremely hard to compensate for this kind
> : >> > of thing, but kill -9 is a harsh, last-resort option and
> : >> > shouldn't be part of your regular process. And you should
> : >> > expect some "interesting" states when you do. And
> : >> > you should use the bin/solr script to stop Solr
> : >> > gracefully.
> : >> >
> : >> > Best,
> : >> > Erick
> : >> >
> : >> >
> : >> > On Tue, Jul 19, 2016 at 9:29 AM, Justin Lee <lee.justi...@gmail.com>
> : >> > wrote:
> : >> > > Pardon me for hijacking the thread, but I'm curious about something you
> : >> > > said, Erick.  I always thought that the point (in part) of going through
> : >> > > the pain of using zookeeper and creating replicas was so that the system
> : >> > > could seamlessly recover from catastrophic failures.  Wouldn't an OOM
> : >> > > condition have a similar effect (or maybe java is better at cleanup on that
> : >> > > kind of error)?  The reason I ask is that I'm trying to set up a solr
> : >> > > system that is highly available and I'm a little bit surprised that a kill
> : >> > > -9 on one process on one machine could put the entire system in a bad
> : >> > > state.  Is it common to have to address problems like this with manual
> : >> > > intervention in production systems?  Ideally, I'd hope to be able to set up
> : >> > > a system where a single node dying a horrible death would never require
> : >> > > intervention.
> : >> > >
> : >> > > On Tue, Jul 19, 2016 at 8:54 AM Erick Erickson <erickerick...@gmail.com>
> : >> > > wrote:
> : >> > >
> : >> > >> First of all, killing with -9 is A Very Bad Idea. You can
> : >> > >> leave write lock files lying around. You can leave
> : >> > >> the state in an "interesting" place. You haven't given
> : >> > >> Solr a chance to tell Zookeeper that it's going away.
> : >> > >> (which would set the state to "down"). In short
> : >> > >> when you do this you have to deal with the consequences
> : >> > >> yourself, one of which is this mismatch between
> : >> > >> cluster state and live_nodes.
> : >> > >>
> : >> > >> Now, that rant done, the bin/solr script tries to stop Solr
> : >> > >> gracefully but issues a kill if solr doesn't stop nicely. Personally
> : >> > >> I think that timeout should be longer, but that's another story.
> : >> > >>
> : >> > >> The onlyIfDown='true' option is there specifically as a
> : >> > >> safety valve. It was provided for those who want to guard against
> : >> > >> typos and the like, so just don't specify it and you should be fine.
> : >> > >>
> : >> > >> Best,
> : >> > >> Erick
> : >> > >>
> : >> > >> On Mon, Jul 18, 2016 at 11:51 PM, Jerome Yang <jey...@pivotal.io> wrote:
> : >> > >> > Hi all,
> : >> > >> >
> : >> > >> > Here's the situation.
> : >> > >> > I'm using solr5.3 in cloud mode.
> : >> > >> >
> : >> > >> > I have 4 nodes.
> : >> > >> >
> : >> > >> > I used "kill -9 pid-solr-node" to kill 2 nodes.
> : >> > >> > The replicas on those two nodes are still "ACTIVE" in zookeeper's
> : >> > >> > state.json.
> : >> > >> >
> : >> > >> > The problem is that when I try to delete these down replicas with the
> : >> > >> > parameter onlyIfDown='true', it says,
> : >> > >> > "Delete replica failed: Attempted to remove replica :
> : >> > >> > demo.public.tbl/shard0/core_node4 with onlyIfDown='true', but state is
> : >> > >> > 'active'."
> : >> > >> >
> : >> > >> > From this link:
> : >> > >> > <http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.State.html#ACTIVE>
> : >> > >> >
> : >> > >> > It says:
> : >> > >> > *NOTE*: when the node the replica is hosted on crashes, the replica's
> : >> > >> > state may remain ACTIVE in ZK. To determine if the replica is truly
> : >> > >> > active, you must also verify that its node
> : >> > >> > <http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.html#getNodeName-->
> : >> > >> > is under /live_nodes in ZK (or use ClusterState.liveNodesContain(String)
> : >> > >> > <http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/ClusterState.html#liveNodesContain-java.lang.String->
> : >> > >> > ).
> : >> > >> >
> : >> > >> > So, is this a bug?
> : >> > >> >
> : >> > >> > Regards,
> : >> > >> > Jerome
> : >> > >>
> : >> >
> : >>
> :
>
> -Hoss
> http://www.lucidworks.com/
>
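
The difference between the recorded state and the state that CLUSTERSTATUS
reports, as Hoss describes it above, can be sketched in a few lines of Python.
This is not Solr code. The dictionary keys (state, node_name), the node names,
and both function names are assumptions made for illustration, and the
onlyIfDown check is modeled only on the error message quoted in this thread.

```python
def effective_state(replica, live_nodes):
    """State as a CLUSTERSTATUS-style view would report it (illustrative).

    A replica whose node is missing from live_nodes is reported as "down",
    whatever state is recorded for it.
    """
    if replica["node_name"] not in live_nodes:
        return "down"
    return replica["state"]


def passes_only_if_down(replica):
    """Hypothetical model of the onlyIfDown='true' check.

    Assumption: the check looks only at the recorded state, which is what
    the "but state is 'active'" error in this thread suggests.
    """
    return replica["state"] == "down"


# A node killed with -9: the recorded state is stale, the node is not live.
live_nodes = {"host1:8983_solr"}  # made-up node names
crashed = {"state": "active", "node_name": "host2:8983_solr"}

print(effective_state(crashed, live_nodes))  # what the UI would show
print(passes_only_if_down(crashed))          # why the delete is refused
```

Under these assumptions the crashed replica is shown as down but still fails
the onlyIfDown check, which is the mismatch that started this thread.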

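For reference, a small Python sketch of how the DELETEREPLICA request with
onlyIfDown=false could be assembled. It only builds the URL and sends nothing.
The base URL (localhost:8983) and the helper name delete_replica_url are
assumptions; the collection, shard and replica names are the ones from the
error message earlier in the thread.

```python
from urllib.parse import urlencode


def delete_replica_url(base_url, collection, shard, replica, only_if_down):
    """Build a Collections API DELETEREPLICA URL (hypothetical helper)."""
    params = {
        "action": "DELETEREPLICA",
        "collection": collection,
        "shard": shard,
        "replica": replica,
        # false means: delete even if the recorded state is not "down"
        "onlyIfDown": "true" if only_if_down else "false",
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


url = delete_replica_url(
    "http://localhost:8983/solr",  # assumed host and port
    "demo.public.tbl",
    "shard0",
    "core_node4",
    only_if_down=False,
)
print(url)
```

Passing onlyIfDown explicitly, instead of leaving it out, makes it obvious in
scripts and logs that the replica was removed on purpose after a crash.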