Maybe you are hitting the reordering issue described in SOLR-8129?

Tomás

On Wed, Jan 27, 2016 at 11:32 AM, David Smith <dsmiths...@yahoo.com.invalid>
wrote:

> Sure.  Here is our SolrCloud cluster:
>
>    + Three (3) instances of Zookeeper on three separate (physical)
> servers.  The ZK servers are beefy and fairly recently built, with 2x10
> GigE (bonded) Ethernet connectivity to the rest of the data center.  We
> recognize the importance of the stability and responsiveness of ZK to the
> stability of SolrCloud as a whole.
>
>    + 364 collections, all with single shards and a replication factor of
> 3.  Currently housing only 100,000,000 documents in aggregate.  Expected to
> grow to 25 billion+.  The size of a single document would be considered
> “large”, by the standards of what I’ve seen posted elsewhere on this
> mailing list.
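>
> For concreteness, every collection is created the same way.  A minimal
> sketch follows (collection, configset, and host names are placeholders; the
> numShards and replicationFactor values are our real settings):
>
>     import requests
>
>     # Create one single-shard, RF=3 collection via the Collections API.
>     # "example_collection", "example_config", and the host are placeholders.
>     params = {
>         "action": "CREATE",
>         "name": "example_collection",
>         "numShards": 1,
>         "replicationFactor": 3,
>         "collection.configName": "example_config",
>     }
>     resp = requests.get("http://solrhost:8983/solr/admin/collections",
>                         params=params)
>     resp.raise_for_status()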
>
> We are always open to ZK recommendations from you or anyone else,
> particularly for running a SolrCloud cluster of this size.
>
> Kind Regards,
>
> David
>
>
>
> On 1/27/16, 12:46 PM, "Jeff Wartes" <jwar...@whitepages.com> wrote:
>
> >
> >If you can identify the problem documents, you can just re-index those
> after forcing a sync. Might save a full rebuild and downtime.
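> >
> >Roughly something like this (a sketch only: the host, core name, and
> >document ids are placeholders, and it assumes CoreAdmin's REQUESTRECOVERY
> >action is an acceptable way to force the re-sync):
> >
> >    import requests
> >
> >    solr = "http://solrhost:8983/solr"   # placeholder host
> >
> >    # Ask the suspect replica's core to recover (re-sync) from its leader.
> >    # The core name here is a placeholder.
> >    requests.get(solr + "/admin/cores",
> >                 params={"action": "REQUESTRECOVERY",
> >                         "core": "collection1_shard1_replica2"})
> >
> >    # Re-send the known-bad documents from your source of truth, then commit.
> >    problem_docs = [{"id": "doc-123", "field_s": "corrected value"}]
> >    requests.post(solr + "/collection1/update", params={"commit": "true"},
> >                  json=problem_docs)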
> >
> >You might describe your cluster setup, including ZK. It sounds like
> you’ve done your research, but improper ZK node distribution could
> certainly invalidate some of Solr’s assumptions.
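> >
> >By distribution I mean each ZK node on its own host, all three members of
> >one ensemble, i.e. a zoo.cfg roughly along these lines (hostnames and paths
> >are placeholders):
> >
> >    tickTime=2000
> >    initLimit=10
> >    syncLimit=5
> >    dataDir=/var/lib/zookeeper
> >    clientPort=2181
> >    server.1=zk1.example.com:2888:3888
> >    server.2=zk2.example.com:2888:3888
> >    server.3=zk3.example.com:2888:3888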
> >
> >
> >
> >
> >On 1/27/16, 7:59 AM, "David Smith" <dsmiths...@yahoo.com.INVALID> wrote:
> >
> >>Jeff, again, very much appreciate your feedback.
> >>
> >>It is interesting — the article you linked to by Shalin is exactly why
> we picked SolrCloud over ES, because (eventual) consistency is critical for
> our application and we will sacrifice availability for it.  To be clear,
> after the outage, NONE of our three replicas are correct or complete.
> >>
> >>So we definitely don’t have CP yet — our very first network outage
> resulted in multiple overlapping lost updates.  As a result, I can’t pick
> one replica and make it the new “master”.  I must rebuild this collection
> from scratch, which I can do, but that requires downtime, which is a problem
> for our app (24/7 high availability with few maintenance windows).
> >>
> >>
> >>So, I definitely need to “fix” this somehow.  I wish I could outline a
> reproducible test case, but as the root cause is likely very tight timing
> issues and complicated interactions with Zookeeper, that is not really an
> option.  I’m happy to share the full logs of all 3 replicas though if that
> helps.
> >>
> >>I am curious, though, whether thinking has changed since
> https://issues.apache.org/jira/browse/SOLR-5468 about seriously considering
> a “majority quorum” model with rollback.  Done properly, this should be
> free of all lost update problems, at the cost of availability.  Some
> SolrCloud users (like us!!!) would gladly accept that tradeoff.
> >>
> >>Regards
> >>
> >>David
> >>
> >>
>
>
