Walter, yes, we did consider this (and we might end up with a 3rd DC for other reasons anyway), but 3 DCs also raise the possibility of running with 2 down and 1 up, which ZK still can't handle :)
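To spell out the quorum arithmetic behind that, here is a rough sketch (the node counts below are illustrative only, not our actual layout): ZK can only serve requests while a strict majority of the ensemble is up, so with one ZK node per DC, losing a majority of the DCs always loses the ensemble, however many DCs you add.

```java
// Illustrative sketch of ZooKeeper's majority-quorum rule.
// The DC/node splits in main() are hypothetical examples.
public class QuorumSketch {

    // ZK needs a strict majority of voting members to stay up.
    static boolean hasQuorum(int ensembleSize, int nodesUp) {
        return nodesUp > ensembleSize / 2;
    }

    public static void main(String[] args) {
        System.out.println(hasQuorum(3, 2)); // 3 DCs, 1 node each, 1 DC down  -> true
        System.out.println(hasQuorum(3, 1)); // 3 DCs, 2 DCs down              -> false
        System.out.println(hasQuorum(5, 2)); // 2 DCs split 3/2, big DC down   -> false
    }
}
```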
There is also a second advantage to keeping our clouds separate: they are independent, which means that if we have a problem and the collection gets corrupted, we have a live backup we can flip to on the other DC. If they were all part of 1 cloud, Solr would keep them all consistent, which is good until we hit a bug in the cloud and it breaks everything in one fell swoop!

On 29 August 2013 17:35, Michael Della Bitta
<michael.della.bi...@appinions.com> wrote:

> Someone really needs to test this with EC2 availability zones. I haven't
> had the time, but I know other clustered NoSQL solutions like HBase and
> Cassandra can deal with it.
>
> Michael Della Bitta
> Applications Developer
> o: +1 646 532 3062 | c: +1 917 477 7906
> appinions inc.
> “The Science of Influence Marketing”
> 18 East 41st Street
> New York, NY 10017
> t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions
> <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
> w: appinions.com <http://www.appinions.com/>
>
> On Thu, Aug 29, 2013 at 12:20 PM, Walter Underwood <wun...@wunderwood.org>
> wrote:
>
> > Here is a really different approach.
> >
> > Make the two data centers one Solr Cloud cluster and use a third data
> > center (or EC2 region) for one additional Zookeeper node. When you lose
> > a DC, Zookeeper still functions.
> >
> > There would be more traffic between datacenters.
> >
> > wunder
> >
> > On Aug 29, 2013, at 4:11 AM, Erick Erickson wrote:
> >
> > > Yeah, reality gets in the way of simple solutions a lot.....
> > >
> > > And making it even more fun, you'd really want to only bring up one
> > > node for each shard in the broken DC and let that one be fully
> > > synched. Then bring up the replicas in a controlled fashion so you
> > > didn't saturate the local network with replications. And then you'd.....
> > >
> > > But as Shawn says, this is certainly functionality that would be waaay
> > > cool; there's just been no time to make it all work. The main folks
> > > who've been working in this area all have a mountain of
> > > higher-priority stuff to get done first....
> > >
> > > There's been talk of making SolrCloud "rack aware", which could extend
> > > into some kind of work in this area, but that's also on the "future"
> > > plate. As you're well aware, it's not a trivial problem!
> > >
> > > Hmmm, what you really want here is the ability to say to a recovering
> > > cluster "do your initial synch using nodes that the ZK ensemble
> > > located at XXX knows about, then switch to your very own ensemble".
> > > Something like a "remote recovery" option..... Which is _still_ kind
> > > of tricky; I sure hope you have identical sharding schemes.....
> > >
> > > FWIW,
> > > Erick
> > >
> > > On Wed, Aug 28, 2013 at 1:12 PM, Shawn Heisey <s...@elyograg.org> wrote:
> > >
> > >> On 8/28/2013 10:48 AM, Daniel Collins wrote:
> > >>
> > >>> What ideally I would like to do is, at the point that I kick off
> > >>> recovery, divert the indexing feed for the "broken" DC into a
> > >>> transaction log on those machines, run the replication and swap the
> > >>> index in, then replay the transaction log to bring it all up to
> > >>> date. That process (conceptually) is the same as the
> > >>> org.apache.solr.cloud.RecoveryStrategy code.
> > >>
> > >> I don't think any such mechanism exists currently. It would be
> > >> extremely awesome if it did. If there's not an existing Jira issue, I
> > >> recommend that you file one. Being able to set up a multi-datacenter
> > >> cloud with automatic recovery would be awesome. Even if it took a
> > >> long time, having it be fully automated would be exceptionally
> > >> useful.
> > >>
> > >>> Yes, if I could divert that feed at the application level, then I
> > >>> can do what you suggest, but it feels like more work to do that (and
> > >>> build an external transaction log), whereas the code seems to
> > >>> already be in Solr itself, I just need to hook it all up (famous
> > >>> last words!). Our indexing pipeline does a lot of pre-processing
> > >>> work (it's not just pulling data from a database), and since we are
> > >>> only talking about the time taken to do the replication (should be
> > >>> an hour or less), it feels like we ought to be able to store that in
> > >>> a Solr transaction log (i.e. the last point in the indexing
> > >>> pipeline).
> > >>
> > >> I think it would have to be a separate transaction log. One problem
> > >> with really big regular tlogs is that when Solr gets restarted, the
> > >> entire transaction log that's currently on the disk gets replayed. If
> > >> it were big enough to recover the last several hours to a duplicate
> > >> cloud, it would take forever to replay on Solr restart. If the
> > >> regular tlog were kept small but a second log with the last 24 hours
> > >> were available, it could replay updates when the second cloud came
> > >> back up.
> > >>
> > >> I do import from a database, so the application-level tracking works
> > >> really well for me.
> > >>
> > >> Thanks,
> > >> Shawn
> >
> > --
> > Walter Underwood
> > wun...@wunderwood.org
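For completeness, this is roughly the shape of the application-level diversion Shawn describes above: dual-feed the two independent clouds, buffer updates for whichever DC is recovering, and replay them once the replication has finished. This is a minimal sketch only, assuming SolrJ's CloudSolrClient (CloudSolrServer back on 4.x); the class, the field names, and the failure handling are hypothetical, not code we actually run.

```java
// Hypothetical sketch of an application-level dual feed with an external
// "transaction log" (an in-memory buffer here) for the DC that is recovering.
import java.io.IOException;
import java.util.List;
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DualDcFeeder {
    private final CloudSolrClient dcA;
    private final CloudSolrClient dcB;
    // Stand-in for a durable external log of updates destined for DC B.
    private final Queue<SolrInputDocument> dcBReplayBuffer = new ConcurrentLinkedQueue<>();
    private volatile boolean dcBHealthy = true;

    public DualDcFeeder(List<String> zkHostsA, List<String> zkHostsB) {
        this.dcA = new CloudSolrClient.Builder(zkHostsA, Optional.empty()).build();
        this.dcB = new CloudSolrClient.Builder(zkHostsB, Optional.empty()).build();
    }

    /** Index one document into both clouds; divert DC B's copy if it is down/recovering. */
    public void index(String collection, SolrInputDocument doc) throws Exception {
        dcA.add(collection, doc);
        if (dcBHealthy) {
            try {
                dcB.add(collection, doc);
            } catch (SolrServerException | IOException e) {
                dcBHealthy = false;          // start diverting from here on
                dcBReplayBuffer.add(doc);
            }
        } else {
            dcBReplayBuffer.add(doc);        // external "transaction log"
        }
    }

    /** Called once DC B has finished its replication: replay the buffered updates. */
    public void replayIntoDcB(String collection) throws Exception {
        SolrInputDocument doc;
        while ((doc = dcBReplayBuffer.poll()) != null) {
            dcB.add(collection, doc);
        }
        dcB.commit(collection);
        dcBHealthy = true;
    }
}
```

The attraction of doing it inside Solr instead, as discussed above, is that the buffer would live at the last point in the indexing pipeline rather than having every feeder application reimplement this.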