Hmm, maybe that was too long -- I'll keep this one shorter, I swear: Would it make sense to build indexes with two Hadoop/HBase clusters by simply pointing client traffic at whichever cluster is currently NOT building indexes via M/R jobs? Basically, has anyone ever tried switching back and forth between clusters instead of building indexes on one cluster and copying them to another?
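
To make the switching idea concrete, here is a rough sketch of the routing decision only. It is plain Python with no Hadoop/HBase calls. `ClusterSwitch` and its methods are names made up for illustration, and it assumes that something (the job scheduler, say) reports when an index build starts and finishes. It also assumes that, when nothing is building, clients should read from the cluster that finished a build most recently, since its indexes are the freshest.

```python
import threading


class ClusterSwitch:
    """Hypothetical helper: routes clients to the cluster that is NOT building indexes."""

    def __init__(self, clusters=("CLUSTER_1", "CLUSTER_2")):
        if len(clusters) != 2:
            raise ValueError("this sketch assumes exactly two clusters")
        self._clusters = tuple(clusters)
        self._building = None    # cluster currently running the M/R index build
        self._last_built = None  # cluster that most recently finished a build
        self._lock = threading.Lock()

    def _other(self, cluster):
        first, second = self._clusters
        return second if cluster == first else first

    def start_build(self, cluster):
        """Record that an index build has started on `cluster`."""
        with self._lock:
            if cluster not in self._clusters:
                raise ValueError("unknown cluster: " + cluster)
            if self._building is not None:
                raise RuntimeError("a build is already running on " + self._building)
            self._building = cluster

    def finish_build(self, cluster):
        """Record that the index build on `cluster` has completed."""
        with self._lock:
            if self._building != cluster:
                raise RuntimeError("no build is running on " + cluster)
            self._building = None
            self._last_built = cluster

    def serving_cluster(self):
        """Return the cluster that application traffic should use right now."""
        with self._lock:
            if self._building is not None:
                return self._other(self._building)
            if self._last_built is not None:
                return self._last_built  # assumption: freshest indexes win
            return self._clusters[0]     # arbitrary default before any build


switch = ClusterSwitch()
switch.start_build("CLUSTER_1")
print(switch.serving_cluster())   # traffic goes to the idle cluster
switch.finish_build("CLUSTER_1")
print(switch.serving_cluster())   # traffic moves to the freshly indexed cluster
```

In practice the state would have to live somewhere every client can see, and in-flight requests would need to drain before a build starts. The sketch leaves both out.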

On Thu, Jul 12, 2012 at 1:26 AM, Eric Czech <[email protected]> wrote:
> Hi everyone,
>
> I have a general design question (apologies in advance if this has
> been asked before).
>
> I'd like to build indexes off of a raw data store and I'm trying to
> think of the best way to control processing so some part of my cluster
> can still serve reads and writes without being affected heavily by the
> index building process.
>
> I get the sense that the typical process for this involves something
> like the following:
>
> 1. Dedicate one cluster for index building (let's call it the INDEX
> cluster) and one for serving application reads on the indexes as well
> as writes/reads on the raw data set (let's call it the MAIN cluster).
> 2. Have the raw data set replicated from the MAIN cluster to the INDEX
> cluster.
> 3. On the INDEX cluster, use the replicated raw data to constantly
> rebuild indexes and copy the new versions to the MAIN cluster,
> overwriting the old versions if necessary.
>
> While conceptually simple, I can't help but wonder if it doesn't make
> more sense to simply switch application reads / writes from one
> cluster to another based on which one is NOT currently building
> indexes (but still have the raw data set replicate master-master
> between them).
>
> To be more clear, I'm proposing doing this:
>
> 1. Have two clusters, call them CLUSTER_1 and CLUSTER_2, and have the
> raw data set replicated master-master between them.
> 2. If CLUSTER_1 is currently rebuilding indexes, redirect all
> application traffic to CLUSTER_2 including reads from the indexes as
> well as writes to the raw data set (and vice versa).
>
> I know I'm not addressing a lot of details here but I'm just curious
> if anyone has ever implemented something along these lines.
>
> The main advantage to what I'm proposing would be not having to copy
> potentially massive indexes across the network but at the cost of
> having to deal with having clients not always read from the same
> cluster (seems doable though).
>
> Any advice would be much appreciated!
>
> Thanks
