We had a brief SolrCloud outage this weekend when a node's SSD began to
fail but the node still appeared to be up to the rest of the SolrCloud
cluster (i.e. still green in clusterstate.json). Distributed queries that
reached this node would fail, but whatever heartbeat keeps the node in
clusterstate.json must have continued to succeed.

We eventually had to power the node down to get it removed from
clusterstate.json.

This is our first foray into SolrCloud, so I'm still somewhat fuzzy on what
the default heartbeat mechanism is and how we might augment it so that the
heartbeat also checks the disk and/or verifies that the node can actually
serve queries.
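
For concreteness, here is a minimal sketch of the kind of external check
described above, in Python with only the standard library. It assumes the
core exposes Solr's ping handler at /admin/ping and that distrib=false keeps
the ping local to that node. The function names, base URL, core name, and
index directory are hypothetical placeholders. It runs outside SolrCloud and
does not change what is recorded in clusterstate.json; it only reports
whether a node looks healthy.

```python
import http.client
import json
import os
import tempfile
import urllib.request


def disk_is_writable(directory):
    """Return True if a small file can be written, synced, and read back."""
    payload = b"solr-health-probe"
    try:
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            os.write(fd, payload)
            os.fsync(fd)  # push the write through to the device
            os.lseek(fd, 0, os.SEEK_SET)
            return os.read(fd, len(payload)) == payload
        finally:
            os.close(fd)
            os.remove(path)
    except OSError:
        return False


def ping_reports_ok(body):
    """Return True if a JSON ping response body carries status "OK"."""
    try:
        return json.loads(body).get("status") == "OK"
    except (ValueError, AttributeError):
        return False


def node_can_serve_queries(base_url, core, timeout=5.0):
    """Ping one core on one node; any error or timeout counts as unhealthy."""
    # Assumption: the ping handler is enabled for this core.
    url = f"{base_url}/{core}/admin/ping?wt=json&distrib=false"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return ping_reports_ok(resp.read())
    except (OSError, http.client.HTTPException):
        return False


def node_is_healthy(base_url, core, index_dir):
    """Combine the disk check and the query check (hypothetical helper)."""
    return disk_is_writable(index_dir) and node_can_serve_queries(base_url, core)
```

Something like node_is_healthy("http://localhost:8983/solr", "collection1",
"/var/solr/data") could then be run from cron or a monitoring agent on each
node, with a failing result used to alert or to stop Solr on that node.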

Any pointers would be appreciated.

Thanks!

--Gregg
