Thanks Shawn/Erick for the suggestions. Unfortunately, stopping indexing
whilst we recover isn't a viable option: we are using Solr as an NRT search
platform, so indexing must continue, at least on the DC that is still healthy.
If we could stop indexing on the "broken" DC, then recovery would be
relatively straightforward: an rsync/copy of a snapshot from the other data
center, followed by restarting indexing.

The million-dollar question is how to start up our existing Solr instances
(once the data center has recovered from whatever broke it), realize that we
have a gap in indexing (using a checkpointing mechanism similar to what Shawn
describes), and recover from that gap (that's the tricky bit!) without having
to interrupt indexing...  I know that replication takes up to an hour (it's a
rather large collection, though it is currently split into 8 shards and we
can replicate the shards in parallel).  Ideally, at the point I kick off
recovery, I would like to divert the indexing feed for the "broken" DC into a
transaction log on those machines, run the replication and swap the new index
in, then replay the transaction log to bring everything up to date.
Conceptually, that is the same process as the
org.apache.solr.cloud.RecoveryStrategy code.
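
(For reference, a rough sketch of that buffer -> replicate -> replay sequence,
using the same UpdateLog calls that RecoveryStrategy drives internally. This
is conceptual only: the replication step is hand-waved into a hypothetical
helper, and getting hold of the SolrCore is left out.)

import java.util.concurrent.Future;

import org.apache.solr.core.SolrCore;
import org.apache.solr.update.UpdateLog;

// Conceptual sketch: mirrors what org.apache.solr.cloud.RecoveryStrategy does
// internally (buffer updates -> replicate -> replay the buffered tlog).
public class BufferAndReplaySketch {

    public static void recover(SolrCore core) throws Exception {
        UpdateLog ulog = core.getUpdateHandler().getUpdateLog();

        // 1. Buffer: incoming updates land in the transaction log but are
        //    not applied to the index yet.
        ulog.bufferUpdates();

        // 2. Pull a full copy of the index from the healthy data center
        //    (e.g. via the /replication handler's fetchindex command,
        //    sketched further below).
        replicateFromOtherDc(core);

        // 3. Replay everything that was buffered while the copy ran.
        Future<?> replay = ulog.applyBufferedUpdates();
        if (replay != null) {
            replay.get();   // block until the tlog has been replayed
        }
    }

    // Hypothetical placeholder for the actual index copy.
    private static void replicateFromOtherDc(SolrCore core) {
    }
}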

Yes, if I could divert that feed at the application level, then I could do
what you suggest, but it feels like more work to do that (and to build an
external transaction log), whereas the code seems to already be in Solr
itself; I just need to hook it all up (famous last words!). Our indexing
pipeline does a lot of pre-processing work (it's not just pulling data from a
database), and since we are only talking about the time taken to do the
replication (an hour or less), it feels like we ought to be able to store
those updates in a Solr transaction log (i.e. at the last point in the
indexing pipeline).
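
(As for the replication step itself, the /replication handler's fetchindex
command can pull a full index from a named masterUrl, so in theory each
recovering leader could fetch from its counterpart in the good DC. A minimal
SolrJ sketch; the host, port and core names below are made up, and the exact
masterUrl form may vary by Solr version.)

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

// Minimal sketch: host names, port and core name are hypothetical.
public class FetchIndexFromGoodDc {

    public static void main(String[] args) throws Exception {
        // The shard leader in the recovering ("broken") DC.
        HttpSolrServer broken =
            new HttpSolrServer("http://dc2-solr1:8983/solr/collection1_shard1_replica1");

        // Ask its /replication handler to fetch the index from the
        // corresponding core in the healthy DC.
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("qt", "/replication");
        params.set("command", "fetchindex");
        params.set("masterUrl",
            "http://dc1-solr1:8983/solr/collection1_shard1_replica1/replication");

        broken.request(new QueryRequest(params));
        broken.shutdown();
    }
}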

The plan would be to recover the leaders (one per shard) this way, and then
use conventional replication/recovery to deal with the local replicas (blank
their data directories, and they will automatically sync from the local
leader).
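
(If blanking the data directories by hand gets tedious, I believe the core
admin REQUESTRECOVERY action asks a core to go back through the normal
peer-recovery path against its leader, which should amount to the same thing;
something like the following, again with hypothetical names.)

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

// Sketch: asks one replica core to re-run recovery against its local leader.
// Host, port and core name are hypothetical.
public class ForceReplicaRecovery {

    public static void main(String[] args) throws Exception {
        HttpSolrServer adminServer = new HttpSolrServer("http://dc2-solr2:8983/solr");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("qt", "/admin/cores");          // route to the core admin handler
        params.set("action", "REQUESTRECOVERY");
        params.set("core", "collection1_shard1_replica2");

        adminServer.request(new QueryRequest(params));
        adminServer.shutdown();
    }
}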


On 28 August 2013 15:26, Shawn Heisey <s...@elyograg.org> wrote:

> On 8/28/2013 6:13 AM, Daniel Collins wrote:
> > We have 2 separate data centers in our organisation, and in order to
> > maintain the ZK quorum during any DC outage, we have 2 separate Solr
> > clouds, one in each DC with separate ZK ensembles but both are fed with
> the
> > same indexing data.
> >
> > Now in the event of a DC outage, all our Solr instances go down, and when
> > they come back up, we need some way to recover the "lost" data.
> >
> > Our thought was to replicate from the working DC, but is there a way to
> do
> > that whilst still maintaining an "online" presence for indexing purposes?
>
> One way which would work (if your core name structures were identical
> between the two clouds) would be to shut down your indexing process,
> shut down the cloud that went down and has now come back up, and rsync
> from the good cloud.  Depending on the index size, that could take a
> long time, and the index updates would be turned off while it's
> happening.  That makes this idea less than ideal.
>
> I have a similar setup on a sharded index that's NOT using SolrCloud,
> and both copies are in one location instead of two separate data
> centers.  My general indexing method would work for your setup, though.
>
> The way that I handle this is that my indexing program tracks its update
> position for each copy of the index independently.  If one copy is down,
> the tracked position for that index won't get updated, so the next time
> it comes up, all missed updates will get done for that copy.  In the
> meantime, the program (Java, using SolrJ) is happily using a separate
> thread to continue updating the index copy that's still up.
>
> Thanks,
> Shawn
>
>
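
PS: for anyone else following along, a very rough sketch of the per-index
checkpoint tracking Shawn describes above might look like the following (the
cluster URLs are made up, persisting the checkpoints durably is left out, and
Shawn uses one thread per copy where this does it sequentially to keep the
sketch short).

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Rough sketch: track a checkpoint per cluster and only advance it for
// clusters that actually accepted the batch.
public class DualDcIndexer {

    private final Map<String, SolrServer> clusters = new HashMap<String, SolrServer>();
    private final Map<String, Long> checkpoints = new HashMap<String, Long>();

    public DualDcIndexer() {
        clusters.put("dc1", new HttpSolrServer("http://dc1-solr1:8983/solr/collection1"));
        clusters.put("dc2", new HttpSolrServer("http://dc2-solr1:8983/solr/collection1"));
    }

    /** Push one batch; a cluster that is down keeps its old checkpoint. */
    public void indexBatch(List<SolrInputDocument> batch, long batchPosition) {
        for (Map.Entry<String, SolrServer> entry : clusters.entrySet()) {
            try {
                entry.getValue().add(batch);
                checkpoints.put(entry.getKey(), batchPosition);
            } catch (Exception e) {
                // This cluster is down: leave its checkpoint alone so the
                // missed batches get replayed when it comes back up.
            }
        }
    }
}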
