I have a three-node, one-shard SolrCloud cluster.

Last week one of the nodes went out of sync with the other two and I'm
trying to understand why that happened.

After poking through my logs and the Solr code, here's what I've pieced
together:

1. Leader gets an update request for a batch delete of 306 items. It sends
this update along to Replica A and Replica B.
2. On Replica A all is well. It receives the update request and logs that
306 documents were deleted.
3. Replica B also receives the update request, but at some point during the
request something kills the connection. Leader logs a "connection reset"
socket error. Replica B doesn't log any errors, but it does log that it
only deleted 95 documents as a result of the update call.
4. Because of the socket error, Leader starts leader-initiated-recovery for
Replica B. It sets Replica B to the "down" state in ZK.
5. Replica B gets the leader-initiated-recovery request, updates its ZK
state to "recovering", and starts the PeerSync process.
6. Replica B's PeerSync reports that it has gotten "100 versions" from the
leader but then declares that "Our versions are newer" and finishes
successfully.
7. Replica B puts itself back in the active state, but it is now out of
sync with the Leader and Replica A. It is left with 211 documents that
should have been deleted (see the per-replica count check sketched right
after this list).

I am curious if anyone has any thoughts on why Replica B failed to detect
that it was behind the leader in this scenario.

I'm not really clear on how the update version numbers are assigned, but is
it possible that the 95 documents that did make it to Replica B had later
version numbers than the 211 that didn't? I don't have a perfect
understanding of the PeerSync code, but looking through it, in particular
at the logic that prints the "Our versions are newer" message, it seems
like if 95 of the 100 documents fetched from the leader during PeerSync
matched what the replica already had, it might declare itself up to date
without looking at the last few.
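
To make that hypothesis concrete, here is a heavily simplified sketch. It is
not the actual PeerSync implementation; the class, method, and version
numbers are made up purely to illustrate how a comparison based on the
newest versions could report "Our versions are newer" while older deletes
are still missing:

    import java.util.Collections;
    import java.util.List;

    // Hypothetical, simplified illustration of a PeerSync-style check.
    // Real Solr PeerSync is more involved; this only shows how comparing
    // recent version ranges rather than individual updates could miss gaps.
    public class PeerSyncSketch {

        // ourRecentVersions:    newest update versions in the replica's update log
        // leaderRecentVersions: newest versions fetched from the leader (e.g. 100)
        static boolean looksUpToDate(List<Long> ourRecentVersions,
                                     List<Long> leaderRecentVersions) {
            long ourHighest = Collections.max(ourRecentVersions);
            long leaderHighest = Collections.max(leaderRecentVersions);

            // If the replica's newest version is at least as new as the leader's,
            // a range-based check can conclude "our versions are newer" without
            // verifying that every version in the leader's list is present locally.
            return ourHighest >= leaderHighest;
        }

        public static void main(String[] args) {
            // Suppose the 95 deletes that reached Replica B carry the latest
            // version numbers of the batch, and the missing ones are older.
            List<Long> replicaB = List.of(1005L, 1006L, 1007L);
            List<Long> leader   = List.of(1001L, 1002L, 1003L, 1004L,
                                          1005L, 1006L, 1007L);

            // Prints true even though versions 1001-1004 never reached the replica.
            System.out.println(looksUpToDate(replicaB, leader));
        }
    }

If the real check behaves anything like this, it would explain how Replica B
came back "active" while still holding documents the leader had deleted.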
