Hi Varun,

As Kanak mentioned, we will need to write a rebalancer that monitors HDFS changes.
I have a few questions on the design:

- Are you planning to use YARN or a standalone Helix application?
- Is affinity a must, or is it best effort?
- If affinity is a must:
--- then in the worst case we will need as many servers as the number of partitions.
--- another problem would be uneven distribution of partitions to servers.
- Apart from failures, what is the expected behavior when a rebalance is done on HDFS? Do you want the servers to move?

Once we get more info, we can quickly write a simple recipe to demonstrate this (excluding the serving part).

thanks,
Kishore G

On Mon, Jun 16, 2014 at 4:29 PM, Kanak Biscuitwala <[email protected]> wrote:
> Hi Varun,
>
> If you would like to do this using Helix today, there are ways to support
> this, but you will have to write your own rebalancer. Essentially what you
> need is to put your resource ideal state in CUSTOMIZED rebalance mode, and
> then have some code that listens on HDFS changes, computes the
> logical partition assignment, and writes the ideal state based on what HDFS
> looks like at that moment.
>
> Eventually we want to define affinity-based assignments in our FULL_AUTO
> mode, but the challenge here is being able to represent that 30-minute
> delay, and changing that affinity through some configuration.
>
> Does that make sense? Perhaps others on this list have more ideas.
>
> Kanak
> ________________________________
> > Date: Mon, 16 Jun 2014 15:55:43 -0700
> > Subject: Using Helix for HDFS serving
> > From: [email protected]
> > To: [email protected]
> >
> > Hi folks,
> >
> > We are looking at Helix for building a serving solution on top of HDFS
> > for data generated from MapReduce jobs. The files will be smaller than
> > the HDFS block size, so each file will be on 3 replicas, with each
> > replica holding the file in its entirety. The set of files output by an
> > MR job would be the resource, and each file (or group of X files) would
> > be a partition.
> >
> > We can assume that there is a container which can serve these immutable
> > files for lookups. Since we have 3 replicas, we were wondering if we
> > could use Helix for serving these files with 3 logically equivalent
> > replicas. We need a few things:
> >
> > a) In the steady state, when HDFS blocks are all triplicated, the
> > logical assignment of the 3 replicas should respect block affinity.
> >
> > b) When a node crashes, some blocks become under-replicated both
> > physically and logically (from Helix's point of view). In such a case,
> > we don't want to carry out any transitions. Eventually, over time (~20
> > minutes), HDFS will re-replicate blocks so that a physical replication
> > factor of 3 is attained. Once this happens, we want the logical
> > replication to catch up to 3 and again respect HDFS block placement.
> >
> > So there are two aspects: one is to retain block locality by doing the
> > logical assignment in such a way that the logical partition comes up on
> > the same nodes hosting the physical partition. Secondly, we want the
> > logical placement to trail the physical placement (as determined by
> > HDFS). So we could have the cluster in a non-ideal state for a long
> > period of time - say 20-30 minutes.
> >
> > Please let us know if these are feasible with Helix and, if yes, what
> > would be the recommended practices.
> >
> > Thanks
> > Varun
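
For reference, the "compute the logical partition assignment" step that Kanak describes could be sketched roughly as below. This is an illustrative Python sketch only, not the Helix or HDFS API: all function and node names are hypothetical, and a real rebalancer would fetch block locations through the HDFS client and write the resulting map into the resource's CUSTOMIZED-mode ideal state via the Helix Java API.

```python
# Hypothetical sketch of a block-affinity assignment for CUSTOMIZED mode.
# block_locations: partition -> list of datanodes holding that partition's
# blocks (as reported by HDFS). live_instances: currently live participants.
# Returns partition -> {instance: state}, the shape of a CUSTOMIZED map field.

REPLICATION_FACTOR = 3

def compute_customized_assignment(block_locations, live_instances):
    assignment = {}
    for partition, locations in block_locations.items():
        # Keep only locations that are live participants, preserving HDFS's
        # placement order; this is the block-affinity constraint.
        hosts = [h for h in locations if h in live_instances]
        # If a partition is physically under-replicated, we simply assign
        # fewer logical replicas (no transitions); the logical placement
        # trails HDFS and catches up once HDFS re-replicates the blocks.
        assignment[partition] = {h: "ONLINE" for h in hosts[:REPLICATION_FACTOR]}
    return assignment

# Example: node3 has crashed, so both partitions stay logically
# under-replicated until HDFS re-replicates their blocks elsewhere.
blocks = {
    "p0": ["node1", "node2", "node3"],
    "p1": ["node2", "node3", "node4"],
}
live = {"node1", "node2", "node4"}
print(compute_customized_assignment(blocks, live))
# → {'p0': {'node1': 'ONLINE', 'node2': 'ONLINE'},
#    'p1': {'node2': 'ONLINE', 'node4': 'ONLINE'}}
```

Re-running this computation on every HDFS change (and on participant membership changes) and writing the result back is essentially the custom rebalancer loop; the ~20-30 minute trailing behavior falls out naturally, since the assignment only moves after HDFS itself has moved the blocks.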
