I have a three-node, one-shard SolrCloud cluster. Last week one of the nodes went out of sync with the other two, and I'm trying to understand why that happened.
After poking through my logs and the Solr code, here's what I've pieced together:

1. Leader gets an update request for a batch delete of 306 items. It sends this update along to Replica A and Replica B.
2. On Replica A all is well. It receives the update request and logs that 306 documents were deleted.
3. Replica B also receives the update request, but at some point during the request something kills the connection. Leader logs a "connection reset" socket error. Replica B doesn't log any errors, but it does log that it only deleted 95 documents as a result of the update call.
4. Because of the socket error, Leader starts leader-initiated-recovery for Replica B. It sets Replica B to the "down" state in ZK.
5. Replica B gets the leader-initiated-recovery request, updates its ZK state to "recovering", and starts the PeerSync process.
6. Replica B's PeerSync reports that it has gotten "100 versions" from the leader, but then declares that "Our versions are newer" and finishes successfully.
7. Replica B puts itself back in the active state, but it is now out of sync with the Leader and Replica A. It is left with 211 documents in it that should have been deleted.

I am curious if anyone has any thoughts on why Replica B failed to detect that it was behind the leader in this scenario. I'm not really clear on how the update version numbers are assigned, but is it possible that the 95 documents that did make it to Replica B had later version numbers than the 211 that didn't? I don't have a perfect understanding of the PeerSync code, but looking through it, in particular at the logic that prints the "Our versions are newer" message, it seems like if 95 of the 100 documents fetched from the leader during PeerSync matched what the replica already has, it might declare itself up-to-date without looking at the last few.
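
To make the question concrete, here is a toy model of my hypothesis in Python. It is not Solr's PeerSync code: the function names are made up, versions are simplified to plain integers 1..306, and the ordering (the 95 deletes that reached Replica B being the highest-numbered ones) is purely an assumption. It only shows how much of the gap a comparison limited to the 100 most recent versions could see.

```python
# Toy model only: all names here are hypothetical; this is NOT Solr's PeerSync code.
WINDOW = 100  # PeerSync logged that it got "100 versions" from the leader


def recent_window(versions, size=WINDOW):
    """Return the `size` highest version numbers, newest first."""
    return sorted(versions, reverse=True)[:size]


def missing_in_window(leader_versions, replica_versions, size=WINDOW):
    """Leader versions inside the recent window that the replica lacks."""
    have = set(replica_versions)
    return [v for v in recent_window(leader_versions, size) if v not in have]


def missing_overall(leader_versions, replica_versions):
    """Every leader version the replica lacks, regardless of any window."""
    have = set(replica_versions)
    return [v for v in leader_versions if v not in have]


# Assumption: the 306 deletes were assigned consecutive versions 1..306,
# and the 95 that reached Replica B were the highest-numbered ones.
leader = list(range(1, 307))
replica_b = list(range(212, 307))  # 95 versions: 212..306

in_window = missing_in_window(leader, replica_b)
overall = missing_overall(leader, replica_b)

print(len(in_window))  # 5   -> gaps visible inside the 100-version window
print(len(overall))    # 211 -> gaps that actually exist
```

If something like this is what happened, only 5 of the 211 missing deletes would even be visible to a check that looks at the newest 100 versions, and a check that tolerates a small mismatch near the edge of the window might not fetch any of them. Whether the real PeerSync logic behaves this way is exactly what I am unsure about.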