You have an interesting question. I haven't come across anyone trying this 
approach yet. Replication is a relatively new feature, so that might be one 
reason. But I think you are over-engineering the solution here. I'll try to 
break it down to the best of my abilities.

1. First question to answer is - where are the indices going to be stored? Are 
they going to be other HBase tables (something like secondary indices)? Or are 
they going to be in an independent indexing system like Lucene/Solr/
Elasticsearch/Blur? The number of fields you want to index on will determine 
that answer IMO.
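To make the "other HBase tables" option concrete, here's a minimal sketch of the usual secondary-index layout, where the index table's row key is the indexed value concatenated with the main-table row key so a prefix scan finds all matching rows. The field names and separator are just illustrative assumptions, not anything HBase mandates:

```python
# Secondary-index row key sketch: <indexed value><sep><main row key>.
# A prefix scan on <indexed value><sep> in the index table returns every
# main-table row with that value. The \x00 separator is an assumption
# that works as long as it can't appear in the indexed value.

def index_row_key(field_value, main_row_key, sep=b"\x00"):
    """Build an index-table row key from an indexed value."""
    return field_value + sep + main_row_key

def index_scan_prefix(field_value, sep=b"\x00"):
    """Prefix to scan in the index table for all rows with this value."""
    return field_value + sep

# Example: indexing user rows by email.
key = index_row_key(b"alice@example.com", b"user#0042")
assert key.startswith(index_scan_prefix(b"alice@example.com"))
```

One index table per indexed field keeps scans cheap, which is why the number of fields you index on matters: past a handful, a dedicated search system starts to look better.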

2. Does the index building happen synchronously as the writes come in, or is 
it an asynchronous process? The answer will depend on the expected throughput 
and the cluster size, and it will also determine your indexing process.

a) If you are going to build indices synchronously, you have two options. The 
first is to have your client write to HBase and also create the index. The 
second is to use coprocessors. Coprocs are also a new feature and I'm not 
certain yet that they'll solve your problem. Try them out though, they are 
pretty neat.
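The client-side variant of (a) could be sketched like this. The dict-backed "tables" and the `put` interface are stand-ins for the real HBase client, and the field name is hypothetical - the point is just that every write updates both tables in one call:

```python
# Synchronous client-side indexing sketch: each put goes to the main
# table and, when the indexed field is present, to the index table too.
# Dicts stand in for HBase tables; this is not the real client API.

class DualWriteClient:
    def __init__(self, main_table, index_table, indexed_field):
        self.main = main_table
        self.index = index_table
        self.field = indexed_field

    def put(self, row_key, columns):
        self.main[row_key] = columns
        value = columns.get(self.field)
        if value is not None:
            # Index row key: <value>\x00<main row key>, prefix-scannable.
            self.index[value + b"\x00" + row_key] = b""

main, idx = {}, {}
client = DualWriteClient(main, idx, b"email")
client.put(b"user#1", {b"email": b"bob@example.com"})
```

The hard part this sketch glosses over is failure handling - if the index write fails after the main write succeeds, the client has to detect and repair that - which is exactly the bookkeeping a server-side coprocessor would take off your hands.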

b) If you are going to build indices asynchronously, you'll likely be using MR 
for it. MR can run on the same cluster, but if you pound your cluster with 
full table scans, your real-time serving latencies will suffer. Now, if you 
have relatively low throughput for data ingest, you might be able to get away 
with running fewer tasks in your MR job, so data serving plus index creation 
can happen on the same cluster. If your throughput is high and you need MR 
jobs to run full force, separate the clusters out: have one cluster to run 
the MR jobs and a separate cluster to serve the data. My approach in the 
second case would be to dump new data into HDFS directly as flat files, run 
MR over them to create the index, and also put them into HBase for serving.
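The map side of that job could look something like this sketch - a mapper over the raw records (table scan or flat files) that emits (index key, main row key) pairs for the reduce/output phase to load into the index table. The record layout and field name are assumptions for illustration:

```python
# Asynchronous indexing sketch: the map phase of an MR-style job.
# It walks (row key, columns) records and yields (index key, row key)
# pairs; a later phase would bulk-load these into the index table.

def index_mapper(records, field):
    for row_key, columns in records:
        value = columns.get(field)
        if value is not None:
            yield (value + b"\x00" + row_key, row_key)

records = [
    (b"user#1", {b"email": b"bob@example.com"}),
    (b"user#2", {}),  # no email column -> row is not indexed
]
pairs = list(index_mapper(records, b"email"))
```

In a real Hadoop job the equivalent logic lives in a mapper class and the output is written as HFiles for bulk load rather than collected in memory, but the shape of the computation is the same.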

Hope that gives you some more ideas to think about.

-ak 


On Wednesday, July 11, 2012 at 10:26 PM, Eric Czech wrote:

> Hi everyone,
> 
> I have a general design question (apologies in advance if this has
> been asked before).
> 
> I'd like to build indexes off of a raw data store and I'm trying to
> think of the best way to control processing so some part of my cluster
> can still serve reads and writes without being affected heavily by the
> index building process.
> 
> I get the sense that the typical process for this involves something
> like the following:
> 
> 1. Dedicate one cluster for index building (let's call it the INDEX
> cluster) and one for serving application reads on the indexes as well
> as writes/reads on the raw data set (let's call it the MAIN cluster).
> 2. Have the raw data set replicated from the MAIN cluster to the INDEX 
> cluster.
> 3. On the INDEX cluster, use the replicated raw data to constantly
> rebuild indexes and copy the new versions to the MAIN cluster,
> overwriting the old versions if necessary.
> 
> While conceptually simple, I can't help but wonder if it doesn't make
> more sense to simply switch application reads / writes from one
> cluster to another based on which one is NOT currently building
> indexes (but still have the raw data set replicate master-master
> between them).
> 
> To be more clear, I'm proposing doing this:
> 
> 1. Have two clusters, call them CLUSTER_1 and CLUSTER_2, and have the
> raw data set replicated master-master between them.
> 2. if CLUSTER_1 is currently rebuilding indexes, redirect all
> application traffic to CLUSTER_2 including reads from the indexes as
> well as writes to the raw data set (and vice versa).
> 
> I know I'm not addressing a lot of details here but I'm just curious
> if anyone has ever implemented something along these lines.
> 
> The main advantage to what I'm proposing would be not having to copy
> potentially massive indexes across the network but at the cost of
> having to deal with having clients not always read from the same
> cluster (seems doable though).
> 
> Any advice would be much appreciated!
> 
> Thanks 
