RE: Using Helix for HDFS serving

Kanak Biscuitwala Mon, 16 Jun 2014 16:30:29 -0700

Hi Varun,

If you would like to do this using Helix today, there are ways to support it,
but you will have to write your own rebalancer. Essentially, you need to put
your resource's ideal state in CUSTOMIZED rebalance mode, and then have some
code that listens for HDFS changes, computes the logical partition assignment,
and writes the ideal state based on what HDFS looks like at that moment.
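
As a rough illustration of the "compute the assignment from what HDFS looks
like" step, here is a minimal sketch in plain Java. It is not Helix code: the
class name, the method name, the input maps, and the "ONLINE" state name (which
assumes an OnlineOffline-style state model) are all hypothetical. It only shows
the mapping logic: a partition is placed on the live nodes that HDFS currently
reports as holding its block, so the logical placement simply trails the
physical one and no replacement replica is assigned until HDFS has
re-replicated.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of Helix: derives a partition -> (node -> state)
// map from the block locations reported by HDFS.
class BlockAffinityAssignment {
    // Assumed state name; the real name depends on the resource's state model.
    static final String SERVING_STATE = "ONLINE";

    // blockHosts: partition name -> nodes HDFS currently lists for its block.
    // liveNodes:  nodes currently alive in the serving cluster.
    // replicas:   maximum number of logical replicas per partition.
    static Map<String, Map<String, String>> computeAssignment(
            Map<String, List<String>> blockHosts, Set<String> liveNodes, int replicas) {
        Map<String, Map<String, String>> assignment = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : blockHosts.entrySet()) {
            Map<String, String> stateByNode = new TreeMap<>();
            for (String host : entry.getValue()) {
                if (stateByNode.size() == replicas) {
                    break; // never exceed the requested replica count
                }
                if (liveNodes.contains(host)) {
                    // Only nodes that physically hold the block are used.
                    stateByNode.put(host, SERVING_STATE);
                }
            }
            // An under-replicated block yields fewer entries; nothing is moved
            // to a node that does not hold the data.
            assignment.put(entry.getKey(), stateByNode);
        }
        return assignment;
    }

    public static void main(String[] args) {
        Map<String, List<String>> blockHosts =
                Map.of("part_0", List.of("node1", "node2", "node3"));
        // node3 has crashed, so it is missing from the live set.
        Set<String> liveNodes = Set.of("node1", "node2");
        System.out.println(computeAssignment(blockHosts, liveNodes, 3));
    }
}
```

The resulting map has the same shape as a per-partition state mapping, so the
listening code could write it into the resource's ideal state each time the
HDFS block locations change.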


Eventually we want to define affinity-based assignments in our FULL_AUTO mode,
but the challenge here is representing that 30-minute delay and changing that
affinity through some configuration.

Does that make sense? Perhaps others on this list have more ideas.

Kanak
________________________________
> Date: Mon, 16 Jun 2014 15:55:43 -0700 
> Subject: Using Helix for HDFS serving 
> From: [email protected] 
> To: [email protected] 
> 
> Hi folks, 
> 
> We are looking at Helix for building a serving solution on top of HDFS 
> for data generated from MapReduce jobs. The files will be smaller than 
> the HDFS block size, and hence each file will be on 3 replicas, with each 
> replica holding the file in its entirety. A set of files output by MR 
> would be the resource and each file (or group of X files) would be a 
> partition. 
> 
> We can assume that there is a container which can serve these immutable 
> files for lookups. Since we have 3 replicas, we were wondering if we 
> could use Helix for serving these files with 3 logically equivalent 
> replicas. We need a few things: 
> 
> a) In the steady state, when HDFS blocks are all triplicated, the 
> logical assignment of the 3 replicas should respect block affinity. 
> 
> b) When a node crashes, some blocks become under-replicated both 
> physically and logically (from Helix's point of view). In such a case, we 
> don't want to carry out any transitions. Finally, over time (~ 20 
> minutes), HDFS will re-replicate blocks so that a physical replication 
> factor of 3 is attained. Once this happens, we want the logical 
> replication to catch up to 3 and also respect HDFS block placement. 
> 
> So there are two aspects. The first is to retain block locality by doing 
> logical assignment in a way that the logical partition comes up on the 
> same nodes hosting the physical partition. Secondly, we want the 
> logical placement to trail the physical placement (as determined by 
> HDFS). So we could have the cluster in a non-ideal state for a long 
> period of time - say 20-30 minutes. 
> 
> Please let us know if these are feasible with Helix and, if yes, what 
> the recommended practices would be. 
> 
> Thanks 
> Varun 
> 
>
