Thanks, Mark!

The supervised process sounds very promising but complicated to get right.
For example: where does the supervisor run, where do nodes report their
status, and are the checks active or passive?

Having each node perform a regular background self-check and remove itself
from the cluster if that healthcheck doesn't pass seems like a great first
step, though. The most common failure we've seen has been disk failure and
a self-check should usually detect that. (JIRA:
https://issues.apache.org/jira/browse/SOLR-5805)
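
To make the self-check idea concrete, here is a rough sketch of what I have
in mind. It is not Solr code: DiskSelfCheck, probe, runOnce and the
onUnhealthy callback are all hypothetical names, and the callback is only a
stand-in for whatever "leave the cluster" ends up meaning (e.g. closing the
ZooKeeper connection). The check writes a small file to the data directory,
reads it back, and fires the callback once after a configurable number of
consecutive failures.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names exist in Solr.
class DiskSelfCheck {
    private final Path dataDir;
    private final int maxFailures;
    private final Runnable onUnhealthy; // assumed hook, e.g. "close the zk connection"
    private int consecutiveFailures = 0;
    private boolean fired = false;

    DiskSelfCheck(Path dataDir, int maxFailures, Runnable onUnhealthy) {
        this.dataDir = dataDir;
        this.maxFailures = maxFailures;
        this.onUnhealthy = onUnhealthy;
    }

    // Write a small file into dir, read it back, and compare the bytes.
    // Any IOException (missing directory, read-only disk, ...) counts as a failure.
    static boolean probe(Path dir) {
        byte[] payload = "selfcheck".getBytes(StandardCharsets.UTF_8);
        Path tmp = null;
        try {
            tmp = Files.createTempFile(dir, "selfcheck", ".tmp");
            Files.write(tmp, payload);
            return Arrays.equals(payload, Files.readAllBytes(tmp));
        } catch (IOException e) {
            return false;
        } finally {
            if (tmp != null) {
                try {
                    Files.deleteIfExists(tmp);
                } catch (IOException ignored) {
                    // Cleanup is best effort; the probe result is already decided.
                }
            }
        }
    }

    // Run one check. Returns true if the probe passed. The callback fires
    // at most once, after maxFailures consecutive failed probes.
    synchronized boolean runOnce() {
        if (probe(dataDir)) {
            consecutiveFailures = 0;
            return true;
        }
        consecutiveFailures++;
        if (consecutiveFailures >= maxFailures && !fired) {
            fired = true;
            onUnhealthy.run();
        }
        return false;
    }

    // Schedule the check on a background thread owned by the caller.
    void start(ScheduledExecutorService scheduler, long periodSeconds) {
        scheduler.scheduleWithFixedDelay(this::runOnce, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }
}
```

Two caveats: a read-back like this may be served from the OS page cache and
miss some failures, and a disk that hangs rather than erroring would block
the check thread, so a real version would probably need a timeout around the
probe as well.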

It would also be nice, as a cluster operator, to have an easy way to remove
a failing node from the cluster. Ideally, right from the Solr UI, but even
from a command-line script would be great. In cases of disk failure, we
often cannot SSH into the node to shut down the VM that's still connected
to ZooKeeper. We have to physically power it down. Having something quicker
would be great. (JIRA: https://issues.apache.org/jira/browse/SOLR-5806)




On Sun, Mar 2, 2014 at 9:36 PM, Mark Miller <markrmil...@gmail.com> wrote:

> The heartbeat that keeps the node alive is the connection it maintains
> with ZooKeeper.
>
> We don't currently have anything built in that will actively make sure
> each node can serve queries and remove it from clusterstate.json if it
> cannot. If a replica is maintaining its connection with ZooKeeper and, in
> most cases, if it is accepting updates, it will appear up. Load balancing
> should handle the failures, but I guess it depends on how sticky the
> request failures are.
>
> In the past, I've seen this handled on a different search engine by having
> a variety of external agent scripts that would occasionally attempt to do a
> query, and if things did not go right, they killed the process to cause it
> to try to start up again (supervised process).
>
> I'm not sure what the right long term feature for Solr is here, but feel
> free to start a JIRA issue around it.
>
> One simple improvement might even be a background thread that periodically
> checks some local readings and depending on the results, pulls itself out
> of the mix as best it can (removes itself from clusterstate.json or simply
> closes its zk connection).
>
> - Mark
>
> http://about.me/markrmiller
>
> On Mar 2, 2014, at 3:42 PM, Gregg Donovan <gregg...@gmail.com> wrote:
>
> > We had a brief SolrCloud outage this weekend when a node's SSD began to
> > fail but the node still appeared to be up to the rest of the SolrCloud
> > cluster (i.e. still green in clusterstate.json). Distributed queries that
> > reached this node would fail but whatever heartbeat keeps the node in the
> > clusterstate.json must have continued to succeed.
> >
> > We eventually had to power the node down to get it to be removed from
> > clusterstate.json.
> >
> > This is our first foray into SolrCloud, so I'm still somewhat fuzzy on what
> > the default heartbeat mechanism is and how we may augment it to be sure
> > that the disk is checked as part of the heartbeat and/or we verify that it
> > can serve queries.
> >
> > Any pointers would be appreciated.
> >
> > Thanks!
> >
> > --Gregg
>
>
