Maybe you are hitting the reordering issue described in SOLR-8129?

Tomás
On Wed, Jan 27, 2016 at 11:32 AM, David Smith
<dsmiths...@yahoo.com.invalid> wrote:

> Sure. Here is our SolrCloud cluster:
>
> + Three (3) instances of Zookeeper on three separate (physical) servers.
> The ZK servers are beefy and fairly recently built, with 2x10 GigE
> (bonded) Ethernet connectivity to the rest of the data center. We
> recognize the importance of the stability and responsiveness of ZK to
> the stability of SolrCloud as a whole.
>
> + 364 collections, all with single shards and a replication factor of 3.
> Currently housing only 100,000,000 documents in aggregate. Expected to
> grow to 25 billion+. The size of a single document would be considered
> “large”, by the standards of what I’ve seen posted elsewhere on this
> mailing list.
>
> We are always open to ZK recommendations from you or anyone else,
> particularly for running a SolrCloud cluster of this size.
>
> Kind Regards,
>
> David
>
>
> On 1/27/16, 12:46 PM, "Jeff Wartes" <jwar...@whitepages.com> wrote:
>
> >If you can identify the problem documents, you can just re-index those
> >after forcing a sync. Might save a full rebuild and downtime.
> >
> >You might describe your cluster setup, including ZK. It sounds like
> >you’ve done your research, but improper ZK node distribution could
> >certainly invalidate some of Solr’s assumptions.
> >
> >On 1/27/16, 7:59 AM, "David Smith" <dsmiths...@yahoo.com.INVALID> wrote:
> >
> >>Jeff, again, very much appreciate your feedback.
> >>
> >>It is interesting — the article you linked to by Shalin is exactly why
> >>we picked SolrCloud over ES, because (eventual) consistency is critical
> >>for our application and we will sacrifice availability for it. To be
> >>clear, after the outage, NONE of our three replicas are correct or
> >>complete.
> >>
> >>So we definitely don’t have CP yet — our very first network outage
> >>resulted in multiple overlapping lost updates. As a result, I can’t
> >>pick one replica and make it the new “master”. I must rebuild this
> >>collection from scratch, which I can do, but that requires downtime,
> >>which is a problem in our app (24/7 high availability with few
> >>maintenance windows).
> >>
> >>So, I definitely need to “fix” this somehow. I wish I could outline a
> >>reproducible test case, but as the root cause is likely very tight
> >>timing issues and complicated interactions with Zookeeper, that is not
> >>really an option. I’m happy to share the full logs of all 3 replicas,
> >>though, if that helps.
> >>
> >>I am curious, though, whether the thinking has changed since
> >>https://issues.apache.org/jira/browse/SOLR-5468 about seriously
> >>considering a “majority quorum” model, with rollback. Done properly,
> >>this should be free of all lost-update problems, at the cost of
> >>availability. Some SolrCloud users (like us!!!) would gladly accept
> >>that tradeoff.
> >>
> >>Regards
> >>
> >>David
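
A note on the SOLR-5468 question at the end of the quoted thread: as far
as I know, what came out of that ticket was the min_rf update parameter
rather than a true majority quorum with rollback. The leader still
accepts the update even if it cannot reach the requested number of
replicas, but the client can read back the achieved replication factor
and decide to retry or flag the document. A minimal SolrJ sketch of that
check, assuming the min_rf/rf parameters, the 5.x-era CloudSolrClient
constructor, and placeholder ZK and collection names:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.NamedList;

public class MinRfCheck {
  public static void main(String[] args) throws Exception {
    // Placeholder ZK ensemble and collection name -- adjust to your cluster.
    try (CloudSolrClient client =
             new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
      client.setDefaultCollection("my_collection");

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");

      UpdateRequest req = new UpdateRequest();
      req.add(doc);
      // Ask Solr to report whether the update reached at least 2 replicas.
      req.setParam("min_rf", "2");

      NamedList<Object> rsp = client.request(req);

      // Extract the achieved replication factor ("rf") from the response.
      // If it is below the requested minimum, the update was still accepted
      // (no rollback), so the client must retry or flag the document.
      int achievedRf =
          client.getMinAchievedReplicationFactor("my_collection", rsp);
      if (achievedRf < 2) {
        System.err.println("doc-1 only reached rf=" + achievedRf);
      }
      client.commit();
    }
  }
}

That is not the rollback semantics asked about above, but it at least
gives the indexing client a hook to detect under-replicated updates
during a partition.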
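
And on Jeff's suggestion of re-indexing only the problem documents after
forcing a sync (e.g. via the core admin REQUESTRECOVERY command): a rough
sketch of the re-index step, assuming the affected IDs are known and the
authoritative field values can be rebuilt from your system of record. The
fetchFromSystemOfRecord helper, collection name, and ZK string are
placeholders:

import java.util.Arrays;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ReindexProblemDocs {
  public static void main(String[] args) throws Exception {
    // Placeholder ZK ensemble and collection -- adjust to your cluster.
    try (CloudSolrClient client =
             new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
      client.setDefaultCollection("my_collection");

      // IDs of documents known to be missing or stale after the outage.
      List<String> problemIds = Arrays.asList("doc-101", "doc-202");

      for (String id : problemIds) {
        // Rebuild the authoritative version of the document and overwrite
        // whatever the replicas currently hold for that id.
        client.add(fetchFromSystemOfRecord(id));
      }
      client.commit();
    }
  }

  // Placeholder: however your application reconstructs a document from
  // the source of truth.
  private static SolrInputDocument fetchFromSystemOfRecord(String id) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", id);
    // ... populate the remaining fields ...
    return doc;
  }
}

Whether this avoids a full rebuild of course depends on being able to
identify the problem documents at all, which the thread suggests is the
sticking point here.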