We have two separate data centers in our organisation, and in order to
maintain a ZK quorum during any DC outage, we run two separate Solr
clouds, one in each DC with its own ZK ensemble, but both are fed the
same indexing data.

Now in the event of a DC outage, all the Solr instances in the affected
DC go down, and when they come back up, we need some way to recover the
"lost" data.

Our thought was to replicate from the working DC, but is there a way to do
that whilst still maintaining an "online" presence for indexing purposes?

In essence, we want to do what happens within Solr cloud's recovery. As
I understand cloud recovery, a node starts up and (assuming the worst
case, where peer sync has failed) buffers all incoming updates into its
transaction log, replicates the full index from the leader, and then
replays the transaction log to bring everything into sync.
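The three steps above can be sketched as a small state machine (a toy
model with assumed names, not Solr code): while the node is buffering,
updates are still accepted but land only in the transaction log; after
the bulk copy, the buffer is replayed on top of the snapshot, so nothing
indexed mid-recovery is lost.

```python
class RecoveringNode:
    """Toy model of SolrCloud-style recovery: buffer -> replicate -> replay."""

    def __init__(self):
        self.index = {}        # stands in for the Lucene index
        self.tlog_buffer = []  # transaction-log buffer
        self.buffering = False

    def begin_recovery(self):
        # Step 1: start buffering -- updates keep being accepted,
        # but go only into the transaction log.
        self.buffering = True

    def update(self, doc_id, doc):
        if self.buffering:
            self.tlog_buffer.append((doc_id, doc))
        else:
            self.index[doc_id] = doc

    def finish_recovery(self, leader_snapshot):
        # Step 2: replicate -- replace the local index with the
        # leader's snapshot (a full index fetch in real Solr).
        self.index = dict(leader_snapshot)
        # Step 3: replay -- apply buffered updates on top of the
        # snapshot, so anything indexed during the copy wins.
        for doc_id, doc in self.tlog_buffer:
            self.index[doc_id] = doc
        self.tlog_buffer.clear()
        self.buffering = False
```

The ordering matters: the buffered update for a given document id is
applied after the snapshot, so a document updated during the copy ends
up at its newest version rather than the leader's older one.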

Is it conceivable to do the same by extending Solr, so that on the
activation of some handler (user triggered) we initiate a "replicate
from other DC", which puts all the leaders into buffering updates,
replicates from some other set of servers, and then replays?
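One way to picture the proposed handler's orchestration: the
`/replication?command=fetchindex&masterUrl=...` request is Solr's
standard ReplicationHandler API for pulling an index from another core,
while the "buffer" and "replay" endpoints below are purely hypothetical
placeholders for what a custom handler wrapping the update log's
buffering support might expose. Hostnames are made up.

```python
from urllib.parse import urlencode

def replication_request(local_core_url, remote_core_url):
    """Build the request that tells a local core to pull its index
    from its counterpart in the other DC (real Solr API)."""
    params = urlencode({
        "command": "fetchindex",
        "masterUrl": remote_core_url + "/replication",
    })
    return f"{local_core_url}/replication?{params}"

def recovery_plan(leaders):
    """One request sequence per leader: buffer, fetch, replay.
    `leaders` maps local core URLs to their twins in the other DC."""
    plan = []
    for local, remote in leaders.items():
        plan.append(local + "/recovery?action=buffer")  # hypothetical endpoint
        plan.append(replication_request(local, remote))
        plan.append(local + "/recovery?action=replay")  # hypothetical endpoint
    return plan
```

The sketch only builds the request URLs; a real handler would issue them
per shard leader and would need to hold off replay until the fetch
completes.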

Our goal is to minimize the downtime (beyond the initial outage), so we
would ideally like to be able to resume indexing before this
"replicate/clone" has finished; that's why I thought of enabling
buffering on the transaction log. Searches shouldn't be sent here, but
if they are, we have a valid (albeit old) index to serve them until the
new one swaps in.

Just curious how other DC-aware setups handle this kind of scenario, or
what concerns or issues there might be with this type of approach.
