If you can identify the problem documents, you can re-index just those after 
forcing a sync. That might save you a full rebuild and the downtime that goes 
with it.
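
For illustration only, here is a rough sketch of that approach in Python, 
assuming you know the affected IDs and have a system of record to re-index 
from. Hostnames, core names, document IDs, and the load_from_source_of_truth() 
helper are placeholders, not anything from your setup:

# Compare the _version_ each replica reports for the suspect IDs
# (distrib=false keeps the query on that one core), then re-index the
# divergent documents from your system of record.
import requests

SUSPECT_IDS = ["doc1", "doc2"]           # assumed: you know the affected IDs
REPLICA_CORES = [                        # assumed: one base URL per replica core
    "http://solr1:8983/solr/mycoll_shard1_replica1",
    "http://solr2:8983/solr/mycoll_shard1_replica2",
    "http://solr3:8983/solr/mycoll_shard1_replica3",
]

def versions_for(doc_id):
    """Return the _version_ value each replica reports for one document id."""
    seen = {}
    for core in REPLICA_CORES:
        resp = requests.get(core + "/select", params={
            "q": 'id:"%s"' % doc_id,
            "fl": "id,_version_",
            "distrib": "false",          # ask only this replica, no fan-out
            "wt": "json",
        })
        docs = resp.json()["response"]["docs"]
        seen[core] = docs[0]["_version_"] if docs else None
    return seen

# To force a replica to sync/recover from its leader first, the Core Admin
# REQUESTRECOVERY action can be used, e.g.:
#   requests.get("http://solr2:8983/solr/admin/cores",
#                params={"action": "REQUESTRECOVERY",
#                        "core": "mycoll_shard1_replica2"})

divergent = [i for i in SUSPECT_IDS if len(set(versions_for(i).values())) > 1]
corrected = load_from_source_of_truth(divergent)   # placeholder for your data
requests.post("http://solr1:8983/solr/mycoll/update?commit=true",
              json=corrected)                      # standard JSON doc-array update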

You might describe your cluster setup, including ZK. It sounds like you’ve done 
your research, but improper ZK node distribution could certainly invalidate 
some of Solr’s assumptions.
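
If it helps, one quick way to sanity-check the ensemble is ZooKeeper’s 
four-letter "srvr" command: you should see exactly one node report "leader" 
and the rest "follower". A small sketch, with placeholder hostnames (on 
ZooKeeper 3.5+ the command must be allowed via 4lw.commands.whitelist):

import socket

ZK_NODES = [("zk1", 2181), ("zk2", 2181), ("zk3", 2181)]   # assumed ensemble

def zk_mode(host, port):
    """Return the Mode line (leader/follower/standalone) reported by srvr."""
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(b"srvr")
        chunks = []
        while True:                      # ZK closes the socket when done
            chunk = s.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    data = b"".join(chunks).decode(errors="replace")
    for line in data.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return "unknown (node down or command not whitelisted?)"

for host, port in ZK_NODES:
    print(host, zk_mode(host, port))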

On 1/27/16, 7:59 AM, "David Smith" <dsmiths...@yahoo.com.INVALID> wrote:

>Jeff, again, very much appreciate your feedback.  
>
>It is interesting — the article you linked to by Shalin is exactly why we 
>picked SolrCloud over ES, because (eventual) consistency is critical for our 
>application and we will sacrifice availability for it.  To be clear, after the 
>outage, NONE of our three replicas are correct or complete.
>
>So we definitely don’t have CP yet — our very first network outage resulted in 
>multiple overlapped lost updates.  As a result, I can’t pick one replica and 
>make it the new “master”.  I must rebuild this collection from scratch, which 
>I can do, but that requires downtime which is a problem in our app (24/7 High 
>Availability with few maintenance windows).
>
>
>So, I definitely need to “fix” this somehow.  I wish I could outline a 
>reproducible test case, but as the root cause is likely very tight timing 
>issues and complicated interactions with Zookeeper, that is not really an 
>option.  I’m happy to share the full logs of all 3 replicas though if that 
>helps.
>
>I am curious, though, whether the thinking has changed since 
>https://issues.apache.org/jira/browse/SOLR-5468 about seriously considering a 
>“majority quorum” model, with rollback.  Done properly, this should be free of 
>all lost-update problems, at the cost of availability.  Some SolrCloud users 
>(like us!!!) would gladly accept that tradeoff.  
>
>Regards
>
>David
>
>

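On the majority-quorum question above: the reason a strict-majority write rule 
with rollback avoids lost updates is that any two majorities of a replica set 
intersect, so every later majority read (or leader election) sees at least one 
copy of the last accepted write. A toy sketch of that rule follows; it is not 
SolrCloud’s current behaviour, and apply_write/rollback are hypothetical 
stand-ins for real replica operations:

# Toy model of a strict-majority write with rollback.  Purely illustrative:
# it ignores leader election, ordering, and real failure handling.
def quorum_write(replicas, doc, apply_write, rollback):
    """Return True only if a strict majority of replicas acknowledged the
    write; otherwise undo it on the minority that did and report failure."""
    acked = [r for r in replicas if apply_write(r, doc)]
    if 2 * len(acked) > len(replicas):
        return True                    # durable: any later majority overlaps acked
    for r in acked:
        rollback(r, doc)               # minority only: roll back, nothing is lost
    return False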