Walter, yes, we did consider this (and we might end up with a 3rd DC for other reasons anyway), but 3 DCs also raise the possibility of running with 2 down and 1 up, which ZK still can't handle :)
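To spell out the quorum arithmetic behind that, here is a rough sketch (the node counts below are illustrative only, not our actual layout): ZK can only serve requests while a strict majority of the ensemble is up, so with one ZK node per DC, losing a majority of the DCs always loses the ensemble, however many DCs you add.

```java
// Illustrative sketch of ZooKeeper's majority-quorum rule.
// The DC/node splits in main() are hypothetical examples.
public class QuorumSketch {

    // ZK needs a strict majority of voting members to stay up.
    static boolean hasQuorum(int ensembleSize, int nodesUp) {
        return nodesUp > ensembleSize / 2;
    }

    public static void main(String[] args) {
        System.out.println(hasQuorum(3, 2)); // 3 DCs, 1 node each, 1 DC down  -> true
        System.out.println(hasQuorum(3, 1)); // 3 DCs, 2 DCs down              -> false
        System.out.println(hasQuorum(5, 2)); // 2 DCs split 3/2, big DC down   -> false
    }
}
```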
There is also a second advantage to keeping our clouds separate: they are independent, which means that if we have a problem and the collection gets corrupted, we have a live backup we can flip to on the other DC. If they were all part of 1 cloud, Solr would keep them all consistent, which is good until we hit a bug in the cloud and it breaks everything in one fell swoop!

On 29 August 2013 17:35, Michael Della Bitta
<michael.della.bi...@appinions.com> wrote:

> Someone really needs to test this with EC2 availability zones. I haven't
> had the time, but I know other clustered NoSQL solutions like HBase and
> Cassandra can deal with it.
>
> Michael Della Bitta
> Applications Developer
> o: +1 646 532 3062 | c: +1 917 477 7906
> appinions inc.
> “The Science of Influence Marketing”
> 18 East 41st Street
> New York, NY 10017
> t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions
> <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
> w: appinions.com <http://www.appinions.com/>
>
> On Thu, Aug 29, 2013 at 12:20 PM, Walter Underwood <wun...@wunderwood.org>
> wrote:
>
> > Here is a really different approach.
> >
> > Make the two data centers one Solr Cloud cluster and use a third data
> > center (or EC2 region) for one additional Zookeeper node. When you lose
> > a DC, Zookeeper still functions.
> >
> > There would be more traffic between datacenters.
> >
> > wunder
> >
> > On Aug 29, 2013, at 4:11 AM, Erick Erickson wrote:
> >
> > > Yeah, reality gets in the way of simple solutions a lot.....
> > >
> > > And making it even more fun, you'd really want to only bring up one
> > > node for each shard in the broken DC and let that one be fully
> > > synched. Then bring up the replicas in a controlled fashion so you
> > > didn't saturate the local network with replications. And then you'd.....
> > >
> > > But as Shawn says, this is certainly functionality that would be waaay
> > > cool; there's just been no time to make it all work. The main folks
> > > who've been working in this area all have a mountain of
> > > higher-priority stuff to get done first....
> > >
> > > There's been talk of making SolrCloud "rack aware", which could extend
> > > into some kind of work in this area, but that's also on the "future"
> > > plate. As you're well aware, it's not a trivial problem!
> > >
> > > Hmmm, what you really want here is the ability to say to a recovering
> > > cluster "do your initial synch using nodes that the ZK ensemble
> > > located at XXX knows about, then switch to your very own ensemble".
> > > Something like a "remote recovery" option..... Which is _still_ kind
> > > of tricky; I sure hope you have identical sharding schemes.....
> > >
> > > FWIW,
> > > Erick
> > >
> > > On Wed, Aug 28, 2013 at 1:12 PM, Shawn Heisey <s...@elyograg.org> wrote:
> > >
> > >> On 8/28/2013 10:48 AM, Daniel Collins wrote:
> > >>
> > >>> What ideally I would like to do is, at the point that I kick off
> > >>> recovery, divert the indexing feed for the "broken" DC into a
> > >>> transaction log on those machines, run the replication and swap the
> > >>> index in, then replay the transaction log to bring it all up to
> > >>> date. That process (conceptually) is the same as the
> > >>> org.apache.solr.cloud.RecoveryStrategy code.
> > >>
> > >> I don't think any such mechanism exists currently. It would be
> > >> extremely awesome if it did. If there's not an existing Jira issue, I
> > >> recommend that you file one. Being able to set up a multi-datacenter
> > >> cloud with automatic recovery would be awesome. Even if it took a
> > >> long time, having it be fully automated would be exceptionally
> > >> useful.
> > >>
> > >>> Yes, if I could divert that feed at the application level, then I
> > >>> can do what you suggest, but it feels like more work to do that (and
> > >>> build an external transaction log), whereas the code seems to
> > >>> already be in Solr itself, I just need to hook it all up (famous
> > >>> last words!). Our indexing pipeline does a lot of pre-processing
> > >>> work (it's not just pulling data from a database), and since we are
> > >>> only talking about the time taken to do the replication (should be
> > >>> an hour or less), it feels like we ought to be able to store that in
> > >>> a Solr transaction log (i.e. the last point in the indexing
> > >>> pipeline).
> > >>
> > >> I think it would have to be a separate transaction log. One problem
> > >> with really big regular tlogs is that when Solr gets restarted, the
> > >> entire transaction log that's currently on the disk gets replayed. If
> > >> it were big enough to recover the last several hours to a duplicate
> > >> cloud, it would take forever to replay on Solr restart. If the
> > >> regular tlog were kept small but a second log with the last 24 hours
> > >> were available, it could replay updates when the second cloud came
> > >> back up.
> > >>
> > >> I do import from a database, so the application-level tracking works
> > >> really well for me.
> > >>
> > >> Thanks,
> > >> Shawn
> >
> > --
> > Walter Underwood
> > wun...@wunderwood.org
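For completeness, this is roughly the shape of the application-level diversion Shawn describes above: dual-feed the two independent clouds, buffer updates for whichever DC is recovering, and replay them once the replication has finished. This is a minimal sketch only, assuming SolrJ's CloudSolrClient (CloudSolrServer back on 4.x); the class, the field names, and the failure handling are hypothetical, not code we actually run.

```java
// Hypothetical sketch of an application-level dual feed with an external
// "transaction log" (an in-memory buffer here) for the DC that is recovering.
import java.io.IOException;
import java.util.List;
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DualDcFeeder {
    private final CloudSolrClient dcA;
    private final CloudSolrClient dcB;
    // Stand-in for a durable external log of updates destined for DC B.
    private final Queue<SolrInputDocument> dcBReplayBuffer = new ConcurrentLinkedQueue<>();
    private volatile boolean dcBHealthy = true;

    public DualDcFeeder(List<String> zkHostsA, List<String> zkHostsB) {
        this.dcA = new CloudSolrClient.Builder(zkHostsA, Optional.empty()).build();
        this.dcB = new CloudSolrClient.Builder(zkHostsB, Optional.empty()).build();
    }

    /** Index one document into both clouds; divert DC B's copy if it is down/recovering. */
    public void index(String collection, SolrInputDocument doc) throws Exception {
        dcA.add(collection, doc);
        if (dcBHealthy) {
            try {
                dcB.add(collection, doc);
            } catch (SolrServerException | IOException e) {
                dcBHealthy = false;          // start diverting from here on
                dcBReplayBuffer.add(doc);
            }
        } else {
            dcBReplayBuffer.add(doc);        // external "transaction log"
        }
    }

    /** Called once DC B has finished its replication: replay the buffered updates. */
    public void replayIntoDcB(String collection) throws Exception {
        SolrInputDocument doc;
        while ((doc = dcBReplayBuffer.poll()) != null) {
            dcB.add(collection, doc);
        }
        dcB.commit(collection);
        dcBHealthy = true;
    }
}
```

The attraction of doing it inside Solr instead, as discussed above, is that the buffer would live at the last point in the indexing pipeline rather than having every feeder application reimplement this.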