Re: Index building process design

Eric Czech Mon, 23 Jul 2012 13:06:52 -0700

Hmm, maybe that was too long -- I'll keep this one shorter I swear:

Would it make sense to build indexes with two Hadoop/HBase clusters by
simply pointing client traffic at the cluster that is currently NOT
building indexes via M/R jobs?  Basically, has anyone ever tried switching
back and forth between clusters instead of building indexes on one cluster
and copying them to another?
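
To make the switching idea concrete, here is a minimal sketch in Python. It is only an illustration: `Cluster`, `ClusterRouter`, and the method names are hypothetical, nothing here talks to Hadoop or HBase, and it assumes the raw data is already replicated between the clusters so that either one can serve traffic.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    # Hypothetical stand-in for a real cluster handle.
    name: str
    building_indexes: bool = False


class ClusterRouter:
    """Points client traffic at a cluster that is NOT building indexes."""

    def __init__(self, clusters):
        self._clusters = {c.name: c for c in clusters}

    def serving_cluster(self):
        # Return the first cluster that is free to serve reads and writes.
        for cluster in self._clusters.values():
            if not cluster.building_indexes:
                return cluster
        raise RuntimeError("no cluster is free to serve traffic")

    def start_index_build(self, name):
        # Refuse to start a build if it would leave no cluster serving.
        others_free = any(
            not c.building_indexes
            for c in self._clusters.values()
            if c.name != name
        )
        if not others_free:
            raise RuntimeError("no other cluster could serve traffic")
        self._clusters[name].building_indexes = True

    def finish_index_build(self, name):
        self._clusters[name].building_indexes = False


router = ClusterRouter([Cluster("CLUSTER_1"), Cluster("CLUSTER_2")])
router.start_index_build("CLUSTER_1")
print(router.serving_cluster().name)  # CLUSTER_2
router.finish_index_build("CLUSTER_1")
router.start_index_build("CLUSTER_2")
print(router.serving_cluster().name)  # CLUSTER_1
```

In a real deployment the "which cluster is building" flag would presumably live somewhere shared (a coordination service or a config store), and clients would need to handle a switch that happens mid-request; the sketch ignores both.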



On Thu, Jul 12, 2012 at 1:26 AM, Eric Czech <[email protected]> wrote:

> Hi everyone,
>
> I have a general design question (apologies in advance if this has
> been asked before).
>
> I'd like to build indexes off of a raw data store and I'm trying to
> think of the best way to control processing so some part of my cluster
> can still serve reads and writes without being affected heavily by the
> index building process.
>
> I get the sense that the typical process for this involves something
> like the following:
>
> 1.  Dedicate one cluster for index building (let's call it the INDEX
> cluster) and one for serving application reads on the indexes as well
> as writes/reads on the raw data set (let's call it the MAIN cluster).
> 2.  Have the raw data set replicated from the MAIN cluster to the INDEX
> cluster.
> 3.  On the INDEX cluster, use the replicated raw data to constantly
> rebuild indexes and copy the new versions to the MAIN cluster,
> overwriting the old versions if necessary.
>
> While conceptually simple, I can't help but wonder if it doesn't make
> more sense to simply switch application reads / writes from one
> cluster to another based on which one is NOT currently building
> indexes (but still have the raw data set replicate master-master
> between them).
>
> To be more clear, I'm proposing doing this:
>
> 1.  Have two clusters, call them CLUSTER_1 and CLUSTER_2, and have the
> raw data set replicated master-master between them.
> 2.  If CLUSTER_1 is currently rebuilding indexes, redirect all
> application traffic to CLUSTER_2, including reads from the indexes as
> well as writes to the raw data set (and vice versa).
>
> I know I'm not addressing a lot of details here but I'm just curious
> if anyone has ever implemented something along these lines.
>
> The main advantage of what I'm proposing is that potentially massive
> indexes never have to be copied across the network. The cost is that
> clients would not always read from the same cluster (which seems
> doable, though).
>
> Any advice would be much appreciated!
>
> Thanks
>
