Rahul,

This is a very good question, and one we are currently grappling with in our 
application port.  I think there are a lot of legacy data-processing 
applications like ours that would benefit from a port to Hadoop.  However, 
because we have a large body of C++ code, it is not necessarily a good fit for 
MR.  There seem to be two main choices:

· Run under Hadoop Streaming

· Run as a custom YARN ApplicationMaster

One of the selling points of our application is its performance and single-code 
efficiency.  I have several concerns about Streaming:

· We will lose performance because of the extra layers of translation and I/O, 
and because the streamed data is uncompressed text

· The Streaming model is limited to a single input and a single output per task 
(a minimal sketch of that interface follows this list)

· We have a very large number of files, many of them large, to make available 
locally, and it is unclear whether the -files option will recursively copy and 
cache all of them
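
To make the second concern concrete, here is roughly the contract we would be 
coding to under Streaming.  This is only a minimal sketch (the key/value 
handling is a placeholder, not our real processing), but it shows why I 
describe the model as one line-oriented text stream in and one out per task:

#include <iostream>
#include <string>

// Minimal Streaming-style mapper: records arrive one per line on stdin,
// and results are emitted as tab-separated key/value lines on stdout.
// Everything crosses the pipe as plain text, so binary structures would
// have to be serialized and parsed on every record.
int main() {
    std::string line;
    while (std::getline(std::cin, line)) {
        // Placeholder processing step standing in for our real C++ logic.
        std::size_t tab = line.find('\t');
        std::string key = (tab == std::string::npos) ? line : line.substr(0, tab);
        std::string value = (tab == std::string::npos) ? "" : line.substr(tab + 1);
        std::cout << key << '\t' << value << '\n';
    }
    return 0;
}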

In contrast, porting our application to run as a custom YARN ApplicationMaster 
appears to offer several benefits (which come at the expense of extra 
complexity):

· Negotiation for container resources and scheduling.  Some of our operations 
are very heavy (in load time and memory use), so they need larger containers 
and will benefit from larger data splits.

· Direct access to HDFS via JNI, without translation layers (see the libhdfs 
sketch after this list).

· Algorithms that are not well suited to the MR model, such as transitive 
closure, which are more naturally expressed as MPI-like algorithms.

· If warranted, the ability to replace the MR shuffle with a C++ data partition 
(this could be a discussion thread in its own right).
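
To make the HDFS-access point concrete, here is a rough sketch of what direct 
reads from C++ look like through libhdfs, the JNI-backed C API that ships with 
Hadoop.  The NameNode host, port, and path below are placeholders, and error 
handling is minimal:

#include <hdfs.h>
#include <fcntl.h>
#include <cstdio>
#include <vector>

// Reads an HDFS file from C++ via libhdfs (which drives the Java HDFS
// client through JNI).  Build against hdfs.h and link with -lhdfs; a JVM
// and the Hadoop jars must be available on the classpath at run time.
int main() {
    hdfsFS fs = hdfsConnect("namenode.example.com", 8020);  // placeholder host/port
    if (fs == NULL) {
        std::fprintf(stderr, "failed to connect to HDFS\n");
        return 1;
    }

    // Placeholder path; default buffer size, replication, and block size.
    hdfsFile in = hdfsOpenFile(fs, "/data/input/part-00000", O_RDONLY, 0, 0, 0);
    if (in == NULL) {
        std::fprintf(stderr, "failed to open file\n");
        hdfsDisconnect(fs);
        return 1;
    }

    std::vector<char> buf(1 << 20);
    tSize n;
    while ((n = hdfsRead(fs, in, buf.data(), static_cast<tSize>(buf.size()))) > 0) {
        // Hand the raw bytes to our existing C++ processing code here.
    }

    hdfsCloseFile(fs, in);
    hdfsDisconnect(fs);
    return 0;
}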

Rewriting our processing in Java for a more seamless MR integration is not an 
option due to the size and complexity of the code base.

It may be that I am completely wrong about the limitations of the Streaming 
interface; if so, please tell me why.

john

From: Rahul Bhattacharjee [mailto:[email protected]]
Sent: Wednesday, May 29, 2013 8:34 AM
To: [email protected]
Subject: What else can be built on top of YARN.

Hi all,
I was going through the motivation behind YARN. Splitting the responsibilities 
of the JT is the major concern. Ultimately the base (YARN) was built in a 
generic way so that other generic distributed applications can be built on top 
of it too.
I am not able to think of any other parallel-processing use case that would be 
useful to build on top of YARN. I thought of a lot of use cases that would be 
beneficial when run in parallel, but again, we can do those using map-only jobs 
in MR.
Can someone tell me a scenario where an application can utilize YARN features, 
or can be built on top of YARN, and at the same time cannot be done efficiently 
using MRv2 jobs?
thanks,
Rahul
