We have two separate data centers in our organisation, and in order to maintain the ZK quorum during any DC outage, we run two separate SolrCloud clusters, one in each DC, each with its own ZK ensemble, but both fed with the same indexing data.
Now, in the event of a DC outage, all the Solr instances in that DC go down, and when they come back up we need some way to recover the "lost" data. Our thought was to replicate from the working DC, but is there a way to do that while still maintaining an "online" presence for indexing purposes?

In essence, we want to do what happens within SolrCloud's own recovery. As I understand cloud recovery (assuming the worst case where peer sync has failed): a node starts up, buffers all incoming updates into the transaction log, replicates the index from the leader, and then replays the buffered transaction log to bring everything in sync.

Is it conceivable to do the same by extending Solr, so that on the activation of some user-triggered handler we initiate a "replicate from other DC" operation, which puts all the leaders into buffering mode, replicates from some other set of servers, and then replays? Our goal is to minimise downtime (beyond the initial outage), so ideally we would like to resume indexing before this replicate/clone has finished; that's why I thought of enabling buffering on the transaction log. Searches shouldn't be sent to the recovering DC, but if they are, we have a valid (albeit stale) index to serve them until the new one swaps in.

I'm curious how other DC-aware setups handle this kind of scenario, and what concerns or issues there might be with this type of approach.
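For the "replicate from the other DC" step specifically, a rough sketch of what I had in mind, assuming the stock ReplicationHandler (`/replication?command=fetchindex&masterUrl=...`) is available on each core. The host names and core names below are hypothetical placeholders, and the buffering/replay of the transaction log around the pull would still need custom handler work; this only covers the index copy itself:

```python
def fetchindex_url(local_base, core, remote_base):
    """Build the ReplicationHandler command that asks the local core to
    pull its index from the same core in the surviving DC."""
    return (
        f"{local_base}/solr/{core}/replication"
        f"?command=fetchindex"
        f"&masterUrl={remote_base}/solr/{core}/replication"
    )

# One pull per shard leader in the recovering DC (names are made up):
leaders = ["collection1_shard1_replica1", "collection1_shard2_replica1"]
for core in leaders:
    url = fetchindex_url("http://dc1-solr:8983", core, "http://dc2-solr:8983")
    print(url)
    # an HTTP GET against `url` would trigger the pull, e.g. via curl or urllib
```

The open question is whether the leaders can be put into tlog-buffering mode around these pulls (the way cloud recovery does internally) without writing a custom handler.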