Ah, perhaps you fell into something like this then? 
https://issues.apache.org/jira/browse/SOLR-7844

That says it’s fixed in 5.4, but that would be an example of a split-brain 
incident, where different documents were accepted by different replicas that 
each thought they were the leader. If that’s what happened, and you really do 
have different data on each replica, I’m not aware of any way to fix the 
problem short of reindexing those documents. Before that, you’ll probably need 
to pick one replica and force the others to get in sync with it. I’d pick the 
current leader, since that’s slightly easier.

Typically, a leader writes an update to its transaction log, then sends the 
request to all replicas, and acknowledges the update once they all finish. If 
a replica gets restarted and is fewer than N documents behind, it only has to 
replay the missed updates from the leader’s transaction log, where N is the 
numRecordsToKeep configured in the updateLog section of solrconfig.xml.
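
For reference, a minimal sketch of that config (the dir value and the 500 are 
illustrative, not a recommendation; the default for numRecordsToKeep is 100):

  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <!-- illustrative value: keep more update records so a restarted
         replica can catch up from the log instead of a full index copy -->
    <int name="numRecordsToKeep">500</int>
  </updateLog>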

What you want is to provoke the heavy-duty recovery normally invoked when a 
replica has missed more than N docs, which essentially does a checksum and 
file copy of all the raw index files. The fetchindex command would probably 
work, but it’s a replication handler API originally designed for master/slave 
replication, so take care: https://wiki.apache.org/solr/SolrReplication#HTTP_API
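
Something like this, assuming replica1 is the current leader you want to copy 
from and replica2 is the one being overwritten (the host names and core names 
below are made up; substitute your own):

  http://host2:8983/solr/blah_blah_shard1_replica2/replication?command=fetchindex&masterUrl=http://host1:8983/solr/blah_blah_shard1_replica1/replication
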
Probably a lot easier would be to just delete the replica and re-create it. 
That will also trigger a full file copy of the index from the leader onto the 
new replica.
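
If you go that route, the Collections API calls would look roughly like this 
(the collection, shard, core_node, and node names are placeholders; 
action=CLUSTERSTATUS will show you the real ones):

  http://host:8983/solr/admin/collections?action=DELETEREPLICA&collection=blah_blah&shard=shard1&replica=core_node2
  http://host:8983/solr/admin/collections?action=ADDREPLICA&collection=blah_blah&shard=shard1&node=host2:8983_solr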

I think design decisions around Solr generally use CP as a goal, in CAP terms: 
prefer consistency over availability when the network partitions. (I sometimes 
wish I could get more AP behavior!) See posts like this: 
http://lucidworks.com/blog/2014/12/10/call-maybe-solrcloud-jepsen-flaky-networks/
 
So the fact that you encountered this sounds like a bug to me.
That said, another general recommendation (of mine) is that you not use Solr as 
your primary data source, so you can rebuild your index from scratch if you 
really need to. 

On 1/26/16, 1:10 PM, "David Smith" <dsmiths...@yahoo.com.INVALID> wrote:

>Thanks Jeff!  A few comments
>
>>>
>>> Although you could probably bounce a node and get your document counts back 
>>> in sync (by provoking a check)
>>>
> 
>
>If the check is a simple doc count, that will not work. We have found that 
>replica1 and replica3, although they contain the same doc count, don’t have 
>the SAME docs.  They each missed at least one update, but for different docs.  
>This also means none of our three replicas is complete.
>
>>>
>>>it’s interesting that you’re in this situation. It implies to me that at 
>>>some point the leader couldn’t write a doc to one of the replicas,
>>>
>
>That is our belief as well. We experienced a datacenter-wide network 
>disruption of a few seconds, and user complaints started the first workday 
>after that event.  
>
>The most interesting log entry during the outage is this:
>
>"1/19/2016, 5:08:07 PM ERROR null DistributedUpdateProcessorRequest says it is 
>coming from leader,​ but we are the leader: 
>update.distrib=FROMLEADER&distrib.from=http://dot.dot.dot.dot:8983/solr/blah_blah_shard1_replica3/&wt=javabin&version=2";
>
>>>
>>> You might watch the achieved replication factor of your updates and see if 
>>> it ever changes
>>>
>
>This is a good tip. I’m not sure I like the implication that any failure to 
>write all 3 of our replicas must be retried at the app layer.  Is this really 
>how SolrCloud applications must be built to survive network partitions without 
>data loss? 
>
>Regards,
>
>David
>
>
>On 1/26/16, 12:20 PM, "Jeff Wartes" <jwar...@whitepages.com> wrote:
>
>>
>>My understanding is that the "version" represents the timestamp at which the 
>>searcher was opened, so it doesn’t really offer any assurances about your data.
>>
>>Although you could probably bounce a node and get your document counts back 
>>in sync (by provoking a check), it’s interesting that you’re in this 
>>situation. It implies to me that at some point the leader couldn’t write a 
>>doc to one of the replicas, but that the replica didn’t consider itself down 
>>enough to check itself.
>>
>>You might watch the achieved replication factor of your updates and see if it 
>>ever changes:
>>https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
>> (See Achieved Replication Factor/min_rf)
>>
>>If it does, that might give you clues about how this is happening. Also, it 
>>might allow you to work around the issue by trying the write again.
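>>
>>A sketch of what that looks like (the host, collection, and document below 
>>are made up; min_rf is the real request parameter):
>>
>>  # placeholder host/collection: send one doc and ask for the achieved rf
>>  curl "http://host:8983/solr/blah_blah/update?min_rf=3&wt=json" \
>>    -H "Content-Type: application/json" \
>>    -d '[{"id":"doc1"}]'
>>
>>The response header includes the achieved "rf"; if it ever comes back less 
>>than 3, at least one replica didn’t acknowledge that update.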
>>
>>On 1/22/16, 10:52 AM, "David Smith" <dsmiths...@yahoo.com.INVALID> wrote:
>>
>>>I have a SolrCloud v5.4 collection with 3 replicas that appear to have 
>>>fallen permanently out of sync.  Users started to complain that the same 
>>>search, executed twice, sometimes returned different result counts.  Sure 
>>>enough, our replicas are not identical:
>>>
>>>>> shard1_replica1:  89867 documents / version 1453479763194
>>>>> shard1_replica2:  89866 documents / version 1453479763194
>>>>> shard1_replica3:  89867 documents / version 1453479763191
>>>
>>>I do not think this discrepancy is going to resolve itself.  The Solr Admin 
>>>screen reports all 3 replicas as “Current”.  The last modification to this 
>>>collection was 2 hours before I captured this information, and our auto 
>>>commit time is 60 seconds.  
>>>
>>>I have a lot of concerns here, but my first question is if anyone else has 
>>>had problems with out of sync replicas, and if so, what they have done to 
>>>correct this?
>>>
>>>Kind Regards,
>>>
>>>David
>>>
>
