Re: HTTP file server, map output, and other files

Harsh J Thu, 23 May 2013 23:44:06 -0700

YARN has a ShuffleHandler plugin used for MR purposes, but the APIs
used there aren't "general"/public, so you'd have to build your own
utilities to do that. It's not too difficult to achieve, but a general
API would certainly be nice.
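
For illustration only, here is a rough sketch of what such a home-grown
utility could look like, using the JDK's built-in
com.sun.net.httpserver.HttpServer rather than anything from YARN or the
ShuffleHandler. The class name LocalOutputServer and its methods are
hypothetical, and choosing the directory and advertising the port to
consumers is left to your application:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Hypothetical helper: serves files from one local directory over HTTP. */
class LocalOutputServer {
    private final HttpServer server;
    private final Path root;

    /** Pass port 0 to let the OS pick a free port. */
    LocalOutputServer(Path dir, int port) throws IOException {
        final Path base = dir.toAbsolutePath().normalize();
        this.root = base;
        this.server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/", exchange -> {
            // Map "/name" to a file under the served directory.
            String rel = exchange.getRequestURI().getPath().substring(1);
            Path file = base.resolve(rel).normalize();
            // Reject anything that escapes the directory or is not a plain file.
            if (!file.startsWith(base) || !Files.isRegularFile(file)) {
                exchange.sendResponseHeaders(404, -1);
                exchange.close();
                return;
            }
            exchange.sendResponseHeaders(200, Files.size(file));
            try (OutputStream out = exchange.getResponseBody()) {
                Files.copy(file, out);
            }
        });
    }

    /** Starts serving and returns the port actually bound. */
    int start() {
        server.start();
        return server.getAddress().getPort();
    }

    /** Stops the server and deletes the served directory and its contents. */
    void stopAndCleanup() throws IOException {
        server.stop(0);
        try (Stream<Path> paths = Files.walk(root)) {
            // Delete children before their parent directories.
            for (Path p : (Iterable<Path>) paths.sorted(Comparator.reverseOrder())::iterator) {
                Files.delete(p);
            }
        }
    }
}
```

Each task would write its output into the directory, start the server,
and report host:port to whoever needs the data (for example via your
ApplicationMaster). Calling stopAndCleanup() once the consumers are done
covers the "remove the files" part. This has no authentication or
encryption, so treat it as a starting point only.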


Tez (Incubating) aims to solve some of this for users writing YARN
apps in a general way, but it isn't consumable yet. You can follow Tez
on the Apache Incubator at
http://incubator.apache.org/projects/tez.html.

P.S. As mentioned, YARN-based MR2 no longer uses Jetty for this. Map
output is still fetched over HTTP, but it is served by Netty.

On Fri, May 24, 2013 at 3:14 AM, John Lilley <[email protected]> wrote:
> Thanks to previous kind answers and more reading in the elephant book, I now
> understand that mapper tasks place partitioned results into local files that
> are served up to reducers via HTTP:
>
>
>
> “The output file’s partitions are made available to the reducers over HTTP.
> The maximum number of worker threads used to serve the file partitions is
> controlled by the tasktracker.http.threads property; this setting is per
> tasktracker, not per map task slot. The default of 40 may need to be
> increased for large clusters running large jobs. In MapReduce 2, this
> property is not applicable because the maximum number of threads used is set
> automatically based on the number of processors on the machine. (MapReduce 2
> uses Netty, which by default allows up to twice as many threads as there are
> processors.)”
>
>
>
> My question is, for a custom (non-MR) application under YARN, how would I
> set up my application tasks’ output data to be served over HTTP?  Is there
> an API to control this, or are there predefined local folders that will be
> served up?  Once I am finished with the temporary data, how do I request
> that the files are removed?
>
>
>
> Thanks
>
> John
>
>



--
Harsh J
