I should add that this is on Solr 5.1.0. I've also put a toy sketch of the version check I'm suspicious of below the quoted message, in case it helps make the question concrete.

On Thu, Apr 28, 2016 at 2:42 PM, Mike Wartes <zik...@gmail.com> wrote:
> I have a three-node, one-shard SolrCloud cluster.
>
> Last week one of the nodes went out of sync with the other two, and I'm
> trying to understand why that happened.
>
> After poking through my logs and the Solr code, here's what I've pieced
> together:
>
> 1. The leader gets an update request for a batch delete of 306 items.
> It sends this update along to Replica A and Replica B.
> 2. On Replica A all is well. It receives the update request and logs
> that 306 documents were deleted.
> 3. Replica B also receives the update request, but at some point during
> the request something kills the connection. The leader logs a
> "connection reset" socket error. Replica B doesn't log any errors, but
> it does log that it deleted only 95 documents as a result of the update
> call.
> 4. Because of the socket error, the leader starts leader-initiated
> recovery for Replica B. It sets Replica B to the "down" state in ZK.
> 5. Replica B gets the leader-initiated-recovery request, updates its ZK
> state to "recovering", and starts the PeerSync process.
> 6. Replica B's PeerSync reports that it has gotten "100 versions" from
> the leader, but then declares that "Our versions are newer" and
> finishes successfully.
> 7. Replica B puts itself back in the active state, but it is now out of
> sync with the leader and Replica A. It is left with 211 documents that
> should have been deleted.
>
> I'm curious whether anyone has thoughts on why Replica B failed to
> detect that it was behind the leader in this scenario.
>
> I'm not really clear on how the update version numbers are assigned,
> but is it possible that the 95 documents that did make it to Replica B
> had later version numbers than the 211 that didn't? I don't have a
> perfect understanding of the PeerSync code, but looking through it, in
> particular at the logic that prints the "Our versions are newer"
> message, it seems like if 95 of the 100 versions fetched from the
> leader during PeerSync matched what the replica already had, it might
> declare itself up to date without looking at the last few.
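To make that last question concrete, here's a toy sketch of the failure
mode I'm imagining. This is not the actual PeerSync code: it's a minimal
standalone example, assuming the check effectively compares only the high
ends of the two recent-version lists, and all of the names in it
(PeerSyncSketch, looksUpToDate) are mine, not Solr's.

import java.util.ArrayList;
import java.util.List;

public class PeerSyncSketch {

    // Toy stand-in for the version comparison. Each list holds a node's
    // most recent update versions, sorted newest first. Versions are
    // longs where a higher value means a more recent update.
    static boolean looksUpToDate(List<Long> ourVersions, List<Long> otherVersions) {
        long ourHighest = ourVersions.isEmpty() ? 0L : ourVersions.get(0);
        long otherHighest = otherVersions.isEmpty() ? 0L : otherVersions.get(0);
        // A check that compares only the high ends reports "our versions
        // are newer" whenever our newest update is at least as recent as
        // the leader's newest, even if older entries in the leader's
        // list are missing from ours.
        return ourHighest >= otherHighest;
    }

    public static void main(String[] args) {
        // The leader applied all 306 deletes: versions 1..306.
        List<Long> leader = new ArrayList<>();
        for (long v = 306; v >= 1; v--) leader.add(v);

        // Suppose the 95 deletes that reached Replica B carried the
        // highest versions (212..306) and the 211 it missed carried
        // lower ones -- that's exactly the ordering I'm asking about.
        List<Long> replicaB = new ArrayList<>();
        for (long v = 306; v >= 212; v--) replicaB.add(v);

        // Prints true: the replica declares itself up to date despite
        // missing 211 updates.
        System.out.println(looksUpToDate(replicaB, leader));
    }
}

If versions can't actually be assigned in an order where the 95 applied
deletes end up newer than the 211 missed ones, then this sketch doesn't
apply, and my other guess (declaring success after matching 95 of the
100 fetched versions) would be the thing to look at.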