Jeff, again, very much appreciate your feedback.  

It is interesting — the article you linked to by Shalin is exactly why we 
picked SolrCloud over ES: (eventual) consistency is critical for our 
application, and we will sacrifice availability for it.  To be clear, after the 
outage, NONE of our three replicas is correct or complete.

So we definitely don’t have CP yet — our very first network outage resulted in 
multiple overlapping lost updates.  As a result, I can’t pick one replica and 
make it the new “master”; I must rebuild this collection from scratch.  I can 
do that, but it requires downtime, which is a problem for our app (24/7 high 
availability with few maintenance windows).


So, I definitely need to “fix” this somehow.  I wish I could outline a 
reproducible test case, but since the root cause is likely very tight timing 
issues and complicated interactions with ZooKeeper, that is not really an 
option.  I’m happy to share the full logs of all 3 replicas if that helps, 
though.

I am curious, though, whether thinking has changed since 
https://issues.apache.org/jira/browse/SOLR-5468 about seriously considering a 
“majority quorum” model with rollback.  Done properly, that should be free of 
all lost-update problems, at the cost of availability.  Some SolrCloud users 
(like us!!!) would gladly accept that tradeoff.

Regards

David


On 1/26/16, 4:32 PM, "Jeff Wartes" <jwar...@whitepages.com> wrote:

>
>Ah, perhaps you fell into something like this then? 
>https://issues.apache.org/jira/browse/SOLR-7844
>
>That says it’s fixed in 5.4, but that would be an example of a split-brain 
>type incident, where different documents were accepted by different replicas, 
>each of which thought it was the leader. If this is the case, and you actually 
>have different data on each replica, I’m not aware of any way to fix the 
>problem short of reindexing those documents. Before that, you’ll probably need 
>to choose a replica and just force the others to get in sync with it. I’d 
>choose the current leader, since that’s slightly easier.
>
>Typically, a leader writes an update to its transaction log, then sends the 
>request to all replicas, and acknowledges the update once they all finish. 
>If a replica gets restarted and is less than N documents behind, recovery only 
>replays that transaction log. (N is the numRecordsToKeep configured in the 
>updateLog section of solrconfig.xml.)
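>
>For reference, that section of solrconfig.xml might look something like this 
>(the numRecordsToKeep value here is just an illustration, not a recommendation):
>
>  <updateLog>
>    <str name="dir">${solr.ulog.dir:}</str>
>    <!-- keep more update records in the tlog so a briefly-down replica can
>         catch up by log replay instead of a full index copy (default is 100) -->
>    <int name="numRecordsToKeep">500</int>
>  </updateLog>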
>
>What you want is to provoke the heavy-duty process normally invoked if a 
>replica has missed more than N docs, which essentially does a checksum and 
>file copy on all the raw index files. FetchIndex would probably work, but it’s 
>a replication handler API originally designed for master/slave replication, so 
>take care: https://wiki.apache.org/solr/SolrReplication#HTTP_API
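>
>Assuming the standard /replication handler, the call would look roughly like 
>this (host, port, and core names are placeholders for your own):
>
>  # ask the out-of-sync replica's core to pull a full copy of the index from
>  # the leader's core
>  curl "http://replica-host:8983/solr/blah_blah_shard1_replica1/replication?command=fetchindex&masterUrl=http://leader-host:8983/solr/blah_blah_shard1_replica3/replication"
>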
>Probably a lot easier would be to just delete the replica and re-create it. 
>That will also trigger a full file copy of the index from the leader onto the 
>new replica.
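>
>With the Collections API, that would be something along these lines (collection, 
>shard, replica, and node names are made up; check CLUSTERSTATUS for the real 
>core_node names):
>
>  # drop the out-of-sync replica...
>  curl "http://any-node:8983/solr/admin/collections?action=DELETEREPLICA&collection=blah_blah&shard=shard1&replica=core_node2"
>
>  # ...then add a fresh one, which recovers via a full index copy from the leader
>  curl "http://any-node:8983/solr/admin/collections?action=ADDREPLICA&collection=blah_blah&shard=shard1&node=replica-host:8983_solr"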
>
>I think design decisions around Solr generally use CP as a goal. (I sometimes 
>wish I could get more AP behavior!) See posts like this: 
>http://lucidworks.com/blog/2014/12/10/call-maybe-solrcloud-jepsen-flaky-networks/
> 
>So the fact that you encountered this sounds like a bug to me.
>That said, another general recommendation (of mine) is that you not use Solr 
>as your primary data source, so you can rebuild your index from scratch if you 
>really need to. 
>
>
>
>
>
>
>On 1/26/16, 1:10 PM, "David Smith" <dsmiths...@yahoo.com.INVALID> wrote:
>
>>Thanks Jeff!  A few comments
>>
>>>>
>>>> Although you could probably bounce a node and get your document counts 
>>>> back in sync (by provoking a check)
>>>>
>> 
>>
>>If the check is a simple doc count, that will not work. We have found that 
>>replica1 and replica3, although they contain the same doc count, don’t have 
>>the SAME docs.  They each missed at least one update, just for different docs.  
>>This also means none of our three replicas is complete.
>>
>>>>
>>>>it’s interesting that you’re in this situation. It implies to me that at 
>>>>some point the leader couldn’t write a doc to one of the replicas,
>>>>
>>
>>That is our belief as well. We experienced a datacenter-wide network 
>>disruption of a few seconds, and user complaints started the first workday 
>>after that event.  
>>
>>The most interesting log entry during the outage is this:
>>
>>"1/19/2016, 5:08:07 PM ERROR null DistributedUpdateProcessorRequest says it 
>>is coming from leader,​ but we are the leader: 
>>update.distrib=FROMLEADER&distrib.from=http://dot.dot.dot.dot:8983/solr/blah_blah_shard1_replica3/&wt=javabin&version=2";
>>
>>>>
>>>> You might watch the achieved replication factor of your updates and see if 
>>>> it ever changes
>>>>
>>
>>This is a good tip. I’m not sure I like the implication that any failure to 
>>write all 3 of our replicas must be retried at the app layer.  Is this really 
>>how SolrCloud applications must be built to survive network partitions 
>>without data loss? 
>>
>>Regards,
>>
>>David
>>
>>
>>On 1/26/16, 12:20 PM, "Jeff Wartes" <jwar...@whitepages.com> wrote:
>>
>>>
>>>My understanding is that the "version" represents the timestamp the searcher 
>>>was opened, so it doesn’t really offer any assurances about your data.
>>>
>>>Although you could probably bounce a node and get your document counts back 
>>>in sync (by provoking a check), it’s interesting that you’re in this 
>>>situation. It implies to me that at some point the leader couldn’t write a 
>>>doc to one of the replicas, but that the replica didn’t consider itself down 
>>>enough to check itself.
>>>
>>>You might watch the achieved replication factor of your updates and see if 
>>>it ever changes:
>>>https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
>>> (See Achieved Replication Factor/min_rf)
>>>
>>>If it does, that might give you clues about how this is happening. Also, it 
>>>might allow you to work around the issue by trying the write again.
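>>>
>>>As a rough sketch, the client-side pattern is something like this (collection 
>>>name and document are placeholders; min_rf just asks Solr to report back, it 
>>>does not reject the update):
>>>
>>>  # send an update and ask Solr to report the achieved replication factor
>>>  curl "http://any-node:8983/solr/blah_blah/update?min_rf=2&wt=json" \
>>>    -H 'Content-Type: application/json' \
>>>    -d '[{"id":"doc-1","title_s":"example"}]'
>>>
>>>  # the response carries an "rf" value; if rf < min_rf the update still
>>>  # succeeded on fewer replicas, so the client must notice and retry/repair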
>>>
>>>
>>>
>>>
>>>
>>>
>>>On 1/22/16, 10:52 AM, "David Smith" <dsmiths...@yahoo.com.INVALID> wrote:
>>>
>>>>I have a SolrCloud v5.4 collection with 3 replicas that appear to have 
>>>>fallen permanently out of sync.  Users started to complain that the same 
>>>>search, executed twice, sometimes returned different result counts.  Sure 
>>>>enough, our replicas are not identical:
>>>>
>>>>>> shard1_replica1:  89867 documents / version 1453479763194
>>>>>> shard1_replica2:  89866 documents / version 1453479763194
>>>>>> shard1_replica3:  89867 documents / version 1453479763191
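>>>>
>>>>(Per-replica counts like these can be pulled by querying each core directly, 
>>>>bypassing the distributed query path; hosts and core names below are examples:)
>>>>
>>>>  curl "http://host1:8983/solr/blah_blah_shard1_replica1/select?q=*:*&rows=0&distrib=false"
>>>>  curl "http://host2:8983/solr/blah_blah_shard1_replica2/select?q=*:*&rows=0&distrib=false"
>>>>  curl "http://host3:8983/solr/blah_blah_shard1_replica3/select?q=*:*&rows=0&distrib=false"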
>>>>
>>>>I do not think this discrepancy is going to resolve itself.  The Solr Admin 
>>>>screen reports all 3 replicas as “Current”.  The last modification to this 
>>>>collection was 2 hours before I captured this information, and our auto 
>>>>commit time is 60 seconds.  
>>>>
>>>>I have a lot of concerns here, but my first question is if anyone else has 
>>>>had problems with out of sync replicas, and if so, what they have done to 
>>>>correct this?
>>>>
>>>>Kind Regards,
>>>>
>>>>David
>>>>
>>
