bq: This means that technically the replica nodes should not fall behind and do
not have to go into recovery mode

Well, true if nothing weird happens. By "weird" I mean anything that
results in the leader getting anything other than a success code
back from a follower it sends a document to.

bq: Is this the only scenario in which a node can go into recovery status?

No, there are others. One for-instance: Leader sends a doc to the
follower and the request times out (huge GC pauses, the doc takes too
long to index for whatever reason, etc.). The leader then sends a
message to the follower to go directly into the recovery state since
the leader has no way of knowing whether the follower successfully
wrote the document to its transaction log. You'll see messages about
"leader initiated recovery" in the follower's solr log in this case.
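
A rough sketch of that decision, in Python purely for illustration. This
is a toy model, not Solr's actual code: Follower, ReplicaState and
forward_update are made-up names, and the 200/500 codes just stand in for
"success" and "anything else".

```python
from enum import Enum


class ReplicaState(Enum):
    ACTIVE = "active"
    RECOVERING = "recovering"


class Follower:
    """Hypothetical stand-in for a follower replica (not a Solr class)."""

    def __init__(self, name, behavior="ok"):
        self.name = name
        self.behavior = behavior  # "ok", "timeout" or "error"
        self.state = ReplicaState.ACTIVE

    def index(self, doc):
        # Simulate the three outcomes the leader can observe.
        if self.behavior == "timeout":
            raise TimeoutError(f"{self.name} did not answer in time")
        if self.behavior == "error":
            return 500
        return 200


def forward_update(followers, doc):
    """Leader forwards one doc; any non-success outcome triggers recovery."""
    sent_to_recovery = []
    for follower in followers:
        try:
            ok = follower.index(doc) == 200
        except TimeoutError:
            # The leader cannot tell whether the doc reached the follower's
            # transaction log, so it treats the timeout as a failure.
            ok = False
        if not ok:
            follower.state = ReplicaState.RECOVERING
            sent_to_recovery.append(follower.name)
    return sent_to_recovery
```

The point of the sketch is that the follower never has to be marked down
first: a single timed-out update is enough for the leader to put it
straight into recovery.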

Two bits of pedantry:

bq: Down by the other replicas

Almost. We're talking indexing here and, IIUC, only the leader can send
another node into recovery as all updates go through the leader.

If I'm going to be nit-picky, ZooKeeper can _also_ cause a node to be
marked as down if its periodic ping of the node fails to return.
Actually I think this is done through another Solr node that ZK
notifies....

bq: It goes into a recovery mode and tries to recover all the
documents from the leader of shard1.

Also nit-picky, but if the follower isn't "too far" behind it can be
brought back into sync via "peer sync", where it gets the missed
docs sent to it from the tlog of a healthy replica. "Too far" is 100
docs by default, but can be set in solrconfig.xml if necessary. If
that limit is exceeded, then indeed the entire index is copied from
the leader.
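
To make the threshold concrete, here is a minimal sketch in Python. It
only models the choice described above; choose_recovery and its
parameters are hypothetical names, not anything in Solr. I believe the
solrconfig.xml setting in question is numRecordsToKeep in the updateLog
section, but check the reference guide for your version.

```python
DEFAULT_MAX_MISSED_DOCS = 100  # the "too far" default mentioned above


def choose_recovery(missed_docs, max_missed_docs=DEFAULT_MAX_MISSED_DOCS):
    """Return which recovery path this toy model would pick."""
    if missed_docs < 0:
        raise ValueError("missed_docs cannot be negative")
    if missed_docs <= max_missed_docs:
        # Close enough: replay the missed docs from a healthy replica's tlog.
        return "peer sync"
    # Limit exceeded: copy the entire index from the leader.
    return "full replication"
```

For example, choose_recovery(40) picks peer sync and choose_recovery(5000)
picks full replication, while raising the limit, as in
choose_recovery(250, max_missed_docs=500), keeps a follower on the cheaper
peer sync path.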

Best,
Erick



On Mon, Jun 5, 2017 at 5:18 PM, suresh pendap <sureshfors...@gmail.com> wrote:
> Hi,
>
> Why and in what scenarios do Solr nodes go into recovery status?
>
> Given that Solr is a CP system it means that the writes for a Document
> index are acknowledged only after they are propagated and acknowledged by
> all the replicas of the Shard.
>
> This means that technically the replica nodes should not fall behind and do
> not have to go into recovery mode.
>
> Is my above understanding correct?
>
> Can a below scenario happen?
>
> 1. Assume that we have 3 replicas for Shard shard1 with the names
> shard1_replica1, shard1_replica2 and shard1_replica3.
>
> 2. Due to some reason, network issue or something else, the shard1_replica2
> is not reachable by the other replicas and it is marked as Down by the
> other replicas (shard1_replica1 and shard1_replica3 in this case)
>
> 3. The network issue is restored and the shard1_replica2 is reachable
> again. It goes into a recovery mode and tries to recover all the documents
> from the leader of shard1.
>
> Is this the only scenario in which a node can go into recovery status?
>
> In other words, does the node has to go into a Down status before getting
> back into a recovery status?
>
>
> Regards
