YARN has a ShuffleHandler plugin used for MR purposes, but the APIs it uses aren't general or public, so you'd have to build your own utilities to do that. It's not too difficult to achieve, but a general API would certainly be nice.
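To give a rough idea of what "your own utilities" could look like, here is a minimal sketch that serves a task's output directory over HTTP and deletes it when you are done. It is not how ShuffleHandler is implemented and it uses no YARN API: it relies only on the JDK's built-in com.sun.net.httpserver. The class name PartitionServer, the /partition/ URL prefix, and the thread-count parameter are all made up for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Stream;

/** Hypothetical helper: serves one task's output files over HTTP. */
public class PartitionServer {
    private static final String PREFIX = "/partition/"; // assumed URL layout
    private final HttpServer server;
    private final ExecutorService workers;
    private final Path root;

    public PartitionServer(Path outputDir, int port, int threads) throws IOException {
        this.root = outputDir.toRealPath();
        this.workers = Executors.newFixedThreadPool(threads);
        // Port 0 asks the OS for any free port.
        this.server = HttpServer.create(new InetSocketAddress(port), 0);
        server.setExecutor(workers);
        server.createContext(PREFIX, exchange -> {
            try {
                String name = exchange.getRequestURI().getPath().substring(PREFIX.length());
                Path file = resolve(name);
                if (file == null) {
                    exchange.sendResponseHeaders(404, -1); // no response body
                    return;
                }
                // A length of 0 makes the JDK server fall back to chunked encoding.
                exchange.sendResponseHeaders(200, Files.size(file));
                try (OutputStream out = exchange.getResponseBody()) {
                    Files.copy(file, out);
                }
            } finally {
                exchange.close();
            }
        });
    }

    /** Maps a requested name to a regular file under the root, or null if not servable. */
    Path resolve(String name) {
        Path file = root.resolve(name).normalize();
        // Reject anything that escapes the output directory (e.g. "../secret").
        return file.startsWith(root) && Files.isRegularFile(file) ? file : null;
    }

    /** Starts serving and returns the port actually bound. */
    public int start() {
        server.start();
        return server.getAddress().getPort();
    }

    /** Stops the server and removes the temporary output directory. */
    public void stopAndDelete() throws IOException {
        server.stop(0);
        workers.shutdown();
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse order deletes children before their parent directories.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        }
    }
}
```

In this sketch a task would create the server on its local output directory, report the host and port returned by start() to whoever needs to fetch the data (that reporting is up to your application), and call stopAndDelete() once all consumers have finished. There is no authentication here, so treat it as a starting point only.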
Tez (Incubating) aims to solve some of this in a general way for users writing YARN apps, but it isn't consumable yet. You can follow Tez on the Apache Incubator at http://incubator.apache.org/projects/tez.html.

P.S. As mentioned, YARN-based MR2 no longer uses Jetty to serve map output over HTTP. It uses Netty.

On Fri, May 24, 2013 at 3:14 AM, John Lilley <[email protected]> wrote:
> Thanks to previous kind answers and more reading in the elephant book, I now
> understand that mapper tasks place partitioned results into local files that
> are served up to reducers via HTTP:
>
> “The output file’s partitions are made available to the reducers over HTTP.
> The maximum number of worker threads used to serve the file partitions is
> controlled by the tasktracker.http.threads property; this setting is per
> tasktracker, not per map task slot. The default of 40 may need to be
> increased for large clusters running large jobs. In MapReduce 2, this
> property is not applicable because the maximum number of threads used is set
> automatically based on the number of processors on the machine. (MapReduce 2
> uses Netty, which by default allows up to twice as many threads as there are
> processors.)”
>
> My question is, for a custom (non-MR) application under YARN, how would I
> set up my application tasks’ output data to be served over HTTP? Is there
> an API to control this, or are there predefined local folders that will be
> served up? Once I am finished with the temporary data, how do I request
> that the files are removed?
>
> Thanks
> John

--
Harsh J
