Thanks for taking the time to write such a detailed response. I completely
get what you are saying. Makes sense.
On Tue, Jul 19, 2016 at 10:56 AM Erick Erickson <erickerick...@gmail.com>
wrote:

> Justin:
>
> Well, "kill -9" just makes it harder. The original question
> was whether a replica being "active" was a bug, and it's
> not when you kill -9; the Solr node has no chance to
> tell Zookeeper it's going away. ZK does modify
> the live_nodes by itself, thus there are checks as
> necessary when a replica's state is referenced
> whether the node is also in live_nodes. And an
> overwhelming amount of the time this is OK, Solr
> recovers just fine.
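>
> As a quick illustration (the host, port, ZK address and collection name
> below are just examples for a default local setup), you can compare the
> two views yourself:
>
>   # what state.json claims about each replica, via the Collections API
>   curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=collection1'
>
>   # which nodes are actually alive right now, via the ZooKeeper CLI
>   zkCli.sh -server localhost:2181 ls /live_nodes
>
> A replica marked "active" is only really usable if its node also shows
> up under /live_nodes.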
>
> As far as the write locks are concerned, those are
> a Lucene-level issue, so if you kill Solr at just the
> wrong time it's possible one will be left over. The
> write locks are held for as short a period as possible
> by Lucene, but occasionally they can linger after a
> kill -9.
>
> When a replica comes up, if there is a write lock already, it
> doesn't just take over; it fails to load instead.
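>
> For what it's worth, a leftover lock is just a file named "write.lock"
> inside the core's index directory, so (assuming a default-style install;
> adjust the data path for yours) you can spot strays with something like:
>
>   # list stale Lucene lock files; only clean them up while Solr is stopped
>   find /var/solr/data -name write.lock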
>
> A kill -9 won't bring the cluster down by itself unless
> several coincidences line up. Just don't make
> it a habit. For instance, consider if you kill -9 on
> two Solrs that happen to contain all of the replicas
> for a shard1 for collection1. And you _happen_ to
> kill them both at just the wrong time and they both
> leave Lucene write locks for those replicas. Now
> no replica will come up for shard1 and the collection
> is unusable.
>
> So the shorter form is that using "kill -9" is a poor practice
> that exposes you to some risk. The hard-core Solr
> guys work extremely hard to compensate for this kind
> of thing, but kill -9 is a harsh, last-resort option and
> shouldn't be part of your regular process; expect some
> "interesting" states when you use it. Use the bin/solr
> script to stop Solr gracefully instead.
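>
> Something like this (the port is just an example for a default install):
>
>   # ask the node on port 8983 to shut down cleanly; bin/solr only resorts
>   # to killing the process if it doesn't stop within the script's timeout
>   bin/solr stop -p 8983
>
>   # or stop every Solr node started from this installation
>   bin/solr stop -all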
>
> Best,
> Erick
>
>
> On Tue, Jul 19, 2016 at 9:29 AM, Justin Lee <lee.justi...@gmail.com>
> wrote:
> > Pardon me for hijacking the thread, but I'm curious about something you
> > said, Erick.  I always thought that the point (in part) of going through
> > the pain of using zookeeper and creating replicas was so that the system
> > could seamlessly recover from catastrophic failures.  Wouldn't an OOM
> > condition have a similar effect (or maybe java is better at cleanup on
> > that kind of error)?  The reason I ask is that I'm trying to set up a solr
> > system that is highly available and I'm a little bit surprised that a
> > kill -9 on one process on one machine could put the entire system in a bad
> > state.  Is it common to have to address problems like this with manual
> > intervention in production systems?  Ideally, I'd hope to be able to set
> > up a system where a single node dying a horrible death would never require
> > intervention.
> >
> > On Tue, Jul 19, 2016 at 8:54 AM Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> First of all, killing with -9 is A Very Bad Idea. You can
> >> leave write lock files lying around. You can leave
> >> the state in an "interesting" place. You haven't given
> >> Solr a chance to tell Zookeeper that it's going away.
> >> (which would set the state to "down"). In short
> >> when you do this you have to deal with the consequences
> >> yourself, one of which is this mismatch between
> >> cluster state and live_nodes.
> >>
> >> Now, that rant done, the bin/solr script tries to stop Solr
> >> gracefully but issues a kill if Solr doesn't stop nicely. Personally
> >> I think that timeout should be longer, but that's another story.
> >>
> >> The onlyIfDown='true' option is there specifically as a
> >> safety valve. It was provided for those who want to guard against
> >> typos and the like, so just don't specify it and you should be fine.
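> >>
> >> To make that concrete (the host and port are assumptions; the
> >> collection, shard and replica names are the ones from your error
> >> message):
> >>
> >>   # refuses to remove the replica unless it is already marked down
> >>   curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=demo.public.tbl&shard=shard0&replica=core_node4&onlyIfDown=true'
> >>
> >>   # without onlyIfDown, the replica is removed regardless of its state
> >>   curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=demo.public.tbl&shard=shard0&replica=core_node4'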
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Jul 18, 2016 at 11:51 PM, Jerome Yang <jey...@pivotal.io>
> >> wrote:
> >> > Hi all,
> >> >
> >> > Here's the situation.
> >> > I'm using solr5.3 in cloud mode.
> >> >
> >> > I have 4 nodes.
> >> >
> >> > After using "kill -9 pid-solr-node" to kill 2 nodes,
> >> > the replicas on those two nodes are still "ACTIVE" in zookeeper's
> >> > state.json.
> >> >
> >> > The problem is, when I try to delete these down replicas with the
> >> > parameter onlyIfDown='true', it says:
> >> > "Delete replica failed: Attempted to remove replica :
> >> > demo.public.tbl/shard0/core_node4 with onlyIfDown='true', but state is
> >> > 'active'."
> >> >
> >> > From this link:
> >> > http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.State.html#ACTIVE
> >> > It says:
> >> > *NOTE*: when the node the replica is hosted on crashes, the replica's
> >> > state may remain ACTIVE in ZK. To determine if the replica is truly
> >> > active, you must also verify that its node
> >> > (http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.html#getNodeName--)
> >> > is under /live_nodes in ZK (or use ClusterState.liveNodesContain(String),
> >> > http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/ClusterState.html#liveNodesContain-java.lang.String-).
> >> >
> >> > So, is this a bug?
> >> >
> >> > Regards,
> >> > Jerome
> >>
>
