Thanks Shawn/Erick for the suggestions. Unfortunately, stopping indexing whilst we recover isn't a viable option: we are using Solr as an NRT search platform, so indexing must continue, at least in the DC that is fine. If we could stop indexing in the "broken" DC, then recovery would be relatively straightforward: it's an rsync/copy of a snapshot from the other data center, followed by restarting indexing.
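For context, that stop-indexing recovery is essentially a snapshot copy per shard, which can run in parallel. A rough sketch of it (the hostname and index paths here are invented for illustration, not our real layout):

```python
# Sketch of the "stop indexing and copy" recovery that works when we CAN
# pause indexing in the broken DC. Hostname and index root are invented
# for illustration only.

def rsync_commands(shards, good_dc_host, index_root="/var/solr/data"):
    """Build one rsync command per shard so the copies can run in parallel."""
    cmds = []
    for shard in shards:
        src = f"{good_dc_host}:{index_root}/{shard}/data/index/"
        dst = f"{index_root}/{shard}/data/index/"
        cmds.append(["rsync", "-a", "--delete", src, dst])
    return cmds

for cmd in rsync_commands(["collection1_shard1_replica1",
                           "collection1_shard2_replica1"], "solr-dc1"):
    print(" ".join(cmd))
```

Since each shard is independent, all 8 copies can run at once, which is what keeps the total time to roughly one shard's copy time.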
The million-dollar question is how to start up our existing Solr instances (once the data center has recovered from whatever broke it), realize that we have a gap in indexing (using a checkpointing mechanism similar to what Shawn describes), and recover from that (that's the tricky bit!) without having to interrupt indexing. I know that replication takes up to an hour (it's a rather large collection, but it is currently split into 8 shards, and we can replicate each shard in parallel). Ideally, at the point that I kick off recovery, I would like to divert the indexing feed for the "broken" DC into a transaction log on those machines, run the replication and swap the index in, then replay the transaction log to bring it all up to date. That process is conceptually the same as the org.apache.solr.cloud.RecoveryStrategy code.

Yes, if I could divert that feed at the application level, then I could do what you suggest, but it feels like more work to do that (and to build an external transaction log) when the code already seems to be in Solr itself; I just need to hook it all up (famous last words!). Our indexing pipeline does a lot of pre-processing work (it's not just pulling data from a database), and since we are only talking about the time taken to do the replication (an hour or less), it feels like we ought to be able to store that in a Solr transaction log (i.e. at the last point in the indexing pipeline).

The plan would be to recover the leaders (1 per shard) this way, and then use conventional replication/recovery to deal with the local replicas (blank their data areas and they will automatically sync from the local leader).
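To make that concrete, the per-leader sequence I have in mind looks roughly like the sketch below. The fetchindex call is Solr's stock replication handler command; the hostnames and the "buffer"/"replay" steps are just placeholders for whatever transaction-log hook we end up wiring in:

```python
# Sketch of the recovery sequence for one DC: divert updates into a buffer
# (stand-in for a transaction log), pull a full index for each shard leader
# from the peer DC via the stock replication handler's fetchindex command,
# then replay the buffered updates. Hostnames are invented.

from urllib.parse import urlencode

def fetchindex_url(local_host, peer_host, core):
    """URL for Solr's replication handler 'fetchindex' command."""
    params = urlencode({
        "command": "fetchindex",
        "masterUrl": f"http://{peer_host}/solr/{core}/replication",
    })
    return f"http://{local_host}/solr/{core}/replication?{params}"

def recovery_plan(cores, local_host, peer_host):
    """Ordered steps: buffer first, fetch every shard leader (these can run
    in parallel), replay the buffer last."""
    steps = ["buffer-updates"]
    steps += [fetchindex_url(local_host, peer_host, c) for c in cores]
    steps.append("replay-buffer")
    return steps
```

The ordering is the important part: nothing may be replayed until the fetched index has been swapped in, otherwise the replayed updates get clobbered by the copy.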
On 28 August 2013 15:26, Shawn Heisey <s...@elyograg.org> wrote:

> On 8/28/2013 6:13 AM, Daniel Collins wrote:
> > We have 2 separate data centers in our organisation, and in order to
> > maintain the ZK quorum during any DC outage, we have 2 separate Solr
> > clouds, one in each DC with separate ZK ensembles but both are fed with
> > the same indexing data.
> >
> > Now in the event of a DC outage, all our Solr instances go down, and when
> > they come back up, we need some way to recover the "lost" data.
> >
> > Our thought was to replicate from the working DC, but is there a way to
> > do that whilst still maintaining an "online" presence for indexing purposes?
>
> One way which would work (if your core name structures were identical
> between the two clouds) would be to shut down your indexing process,
> shut down the cloud that went down and has now come back up, and rsync
> from the good cloud. Depending on the index size, that could take a
> long time, and the index updates would be turned off while it's
> happening. That makes this idea less than ideal.
>
> I have a similar setup on a sharded index that's NOT using SolrCloud,
> and both copies are in one location instead of two separate data
> centers. My general indexing method would work for your setup, though.
>
> The way that I handle this is that my indexing program tracks its update
> position for each copy of the index independently. If one copy is down,
> the tracked position for that index won't get updated, so the next time
> it comes up, all missed updates will get done for that copy. In the
> meantime, the program (Java, using SolrJ) is happily using a separate
> thread to continue updating the index copy that's still up.
>
> Thanks,
> Shawn
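P.S. For anyone skimming the archives, the per-copy position tracking Shawn describes might be sketched roughly like this (a plain Python stand-in with invented names; the real thing would be the SolrJ indexer sending the missed updates to each cloud):

```python
# Stand-in sketch of per-copy checkpointing: the indexer records its last
# applied update independently for each index copy, so a copy that was down
# keeps an older position and simply catches up when it returns.

class CheckpointedIndexer:
    def __init__(self, copies):
        self.position = {c: 0 for c in copies}  # last update applied per copy
        self.up = {c: True for c in copies}

    def apply(self, update_id, send=lambda copy, uid: None):
        """Push one update to every copy that is up; a down copy keeps its
        old position, marking exactly where it needs to resume."""
        for copy, is_up in self.up.items():
            if is_up:
                send(copy, update_id)
                self.position[copy] = update_id

    def catch_up(self, copy, latest_id, send=lambda copy, uid: None):
        """Replay everything the copy missed while it was down."""
        for uid in range(self.position[copy] + 1, latest_id + 1):
            send(copy, uid)
        self.position[copy] = latest_id
        self.up[copy] = True
```

The nice property is that the healthy copy never waits for the broken one; each copy's checkpoint advances on its own schedule.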