Rahul,

This is a very good question, and one we are currently grappling with in our application port. I think there are a lot of legacy data-processing applications like ours that would benefit from a port to Hadoop. However, because we have a great deal of C++, it is not necessarily a good fit for MR. There seem to be two main choices:
· Run under Hadoop Streaming (“streams”)
· Run as a custom ApplicationMaster

One of the selling points of our application is its performance and single-code efficiency. I have concerns about streams:

· We will lose performance, because of the extra layers of translation and I/O and because streams data is uncompressed
· The streams model is limited to single-in, single-out
· We have a very large number and size of files to make available locally, and it is unclear whether the -files option will recursively copy and cache all of them

In contrast, porting our application as a YARN ApplicationMaster appears to offer several benefits (which come at the expense of extra complexity):

· Negotiation for container resources and scheduling. Some of our operations are very heavy (load time and memory use), so they need larger containers and will benefit from larger data splits.
· Direct access to HDFS via JNI without translation layers.
· Support for algorithms that are not well suited to the MR model, such as transitive closure. They are more naturally expressed as MPI-like algorithms.
· If warranted, the ability to replace the MR shuffle with a C++ data partition (this could be a discussion thread in its own right).

Moving our processing into native Java for a more seamless MR integration is not an option due to the size and complexity of the code base.

It may be that I am completely wrong about the limitations of the streams interface; if so, please tell me why.

john

From: Rahul Bhattacharjee [mailto:[email protected]]
Sent: Wednesday, May 29, 2013 8:34 AM
To: [email protected]
Subject: What else can be built on top of YARN.

Hi all,

I was going through the motivation behind YARN. Splitting the responsibility of the JT is the major concern. Ultimately, the base (YARN) was built in a generic way so that other generic distributed applications can be built on it too. I am not able to think of any other parallel processing use case that would be useful to build on top of YARN.
I thought of a lot of use cases that would benefit from running in parallel, but again, we can do those using map-only jobs in MR. Can someone tell me a scenario where an application can utilize YARN features, or can be built on top of YARN, and at the same time cannot be done efficiently using MRv2 jobs?

thanks,
Rahul
