I got to thinking about this particular question while watching this
presentation, which is well worth 45 minutes if you can spare it:

http://www.infoq.com/presentations/partitioning-comparison

I created SOLR-5468 for this.


On Tue, Nov 19, 2013 at 11:58 AM, Mark Miller <markrmil...@gmail.com> wrote:

> Mostly, a lot of other systems already offer these types of things, so they
> were hard not to think about while building :) It's just hard to get back to
> a lot of those things, even though many of them are fairly low-hanging
> fruit. Hardening takes priority :(
>
> - Mark
>
> On Nov 19, 2013, at 12:42 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>
> > Your thinking is always one step ahead of me! I'll file the JIRA.
> >
> > Thanks.
> > Tim
> >
> >
> > On Tue, Nov 19, 2013 at 10:38 AM, Mark Miller <markrmil...@gmail.com> wrote:
> >
> >> Yeah, this is kind of like one of many little features that we have just
> >> not gotten to yet. I’ve always planned for a param that lets you say how
> >> many replicas an update must be verified on before responding success.
> >> Seems to make sense to fail that type of request early if you notice
> >> there are not enough replicas up to satisfy the param to begin with.
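
To make that concrete, here's a rough sketch of the kind of early check the
leader could do. Everything below is made up for illustration (the param name,
the types); none of it is an existing Solr API:

  import java.util.List;

  // Hypothetical pre-check for the proposed "min replicas" param: fail the
  // request up front if the shard cannot possibly verify the update on the
  // number of copies the client asked for.
  public class MinReplicaPreCheck {

    /** Minimal stand-in for a replica's cluster state. */
    public interface ReplicaState {
      boolean isActive();
    }

    public static void check(int minReplicas, List<? extends ReplicaState> replicas) {
      int activeCopies = 1; // the leader itself counts as one copy
      for (ReplicaState r : replicas) {
        if (r.isActive()) {
          activeCopies++;
        }
      }
      if (activeCopies < minReplicas) {
        throw new IllegalStateException("Rejecting update early: only "
            + activeCopies + " active copies, but the request asked for "
            + minReplicas);
      }
    }
  }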
> >>
> >> I don’t think there is a JIRA issue yet; fire away if you want.
> >>
> >> - Mark
> >>
> >> On Nov 19, 2013, at 12:14 PM, Timothy Potter <thelabd...@gmail.com> wrote:
> >>
> >>> I've been thinking about how SolrCloud deals with write-availability
> >>> using in-sync replica sets, in which writes will continue to be accepted
> >>> so long as there is at least one healthy node per shard.
> >>>
> >>> For a little background (and to verify my understanding of the process
> >>> is correct), SolrCloud only considers active/healthy replicas when
> >>> acknowledging a write. Specifically, when a shard leader accepts an
> >>> update request, it forwards the request to all active/healthy replicas
> >>> and only considers the write successful if all active/healthy replicas
> >>> ack the write. Any down/gone replicas are not considered and will sync
> >>> up with the leader when they come back online using peer sync or
> >>> snapshot replication. For instance, if a shard has 3 nodes, A, B, C,
> >>> with A being the current leader, then writes to the shard will continue
> >>> to succeed even if B & C are down.
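
If my understanding above is right, the behavior boils down to something like
the following (a heavily simplified sketch with made-up types; the real code
in Solr's DistributedUpdateProcessor is of course more involved and forwards
concurrently):

  import java.util.ArrayList;
  import java.util.List;

  // Sketch of the current ack behavior: the leader only fans out to replicas
  // that are active/healthy right now, and the write succeeds if all of
  // *those* ack. Down replicas are simply not part of the request, so in the
  // worst case the ack reflects a single node.
  public class LeaderFanOutSketch {

    public interface Replica {
      boolean isActive();
      boolean sendUpdate(byte[] update); // true if the replica acks the update
    }

    /** Returns true if the update is considered successful. */
    public static boolean distribute(byte[] update, List<Replica> replicas) {
      List<Replica> targets = new ArrayList<Replica>();
      for (Replica r : replicas) {
        if (r.isActive()) {      // down/gone replicas are skipped entirely
          targets.add(r);
        }
      }
      for (Replica r : targets) {
        if (!r.sendUpdate(update)) {
          return false;          // an active replica failed to ack
        }
      }
      return true;               // possibly acked by the leader alone
    }
  }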
> >>>
> >>> The issue is that if a shard leader continues to accept updates even if
> >>> it loses all of its replicas, then we have acknowledged updates on only
> >>> one node. If that node, call it A, then fails and one of the previous
> >>> replicas, call it B, comes back online before A does, then any writes
> >>> that A accepted while the other replicas were offline are at risk of
> >>> being lost.
> >>>
> >>> SolrCloud does provide a safeguard mechanism for this problem with the
> >>> leaderVoteWait setting, which puts any replicas that come back online
> >>> before node A into a temporary wait state. If A comes back online within
> >>> the wait period, then all is well, as it will become the leader again and
> >>> no writes will be lost. As a side note, sys admins definitely need to be
> >>> made more aware of this situation; when I first encountered it in my
> >>> cluster, I had no idea what it meant.
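
For anyone else who trips over this, leaderVoteWait is set in solr.xml;
something along these lines (the value and exact placement are just an
example; check the docs for your Solr version):

  <!-- solr.xml: example only -->
  <solr>
    <solrcloud>
      <!-- How long a node that comes back up waits to see the other known
           replicas of its shards before going ahead with leader election
           without them (milliseconds). -->
      <int name="leaderVoteWait">180000</int>
    </solrcloud>
  </solr>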
> >>>
> >>> My question is whether we want to consider an approach where SolrCloud
> >>> will not accept writes unless a majority of replicas is available to
> >>> accept the write. For my example, under this approach, we wouldn't accept
> >>> writes if both B & C failed, but would if only C did, leaving A & B
> >>> online. Admittedly, this lowers the write-availability of the system, so
> >>> it may be something that should be tunable. Just wanted to put this out
> >>> there as something I've been thinking about lately ...
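
For the A/B/C example (replicationFactor=3), the majority rule is just one
particular choice of threshold for the same early check sketched above; a
quick sketch of the arithmetic:

  // Majority quorum for a 3-replica shard (the A/B/C example above).
  public class MajorityQuorumExample {
    public static void main(String[] args) {
      int replicationFactor = 3;
      int majority = replicationFactor / 2 + 1;             // = 2

      // B and C down -> only A is up -> 1 < 2 -> reject the write.
      System.out.println("B and C down, accept? " + (1 >= majority));
      // Only C down -> A and B are up -> 2 >= 2 -> accept the write.
      System.out.println("only C down, accept?  " + (2 >= majority));
    }
  }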
> >>>
> >>> Cheers,
> >>> Tim
> >>
> >>
>
>
