Answers inline: On Jul 9, 2014, at 10:02 AM, Grandl Robert <[email protected]> wrote:
> Thanks a lot for your detailed answer. > > When you say "it is now decided based on a combination of available cluster > resources as well as the input data": what do you mean by available cluster > resources ? Total resources in the cluster(like number nodes * capability of > each node), or instantaneous available resources based on current workload on > each such node. > YARN provides an application a way to access its available headroom/resources. Per node resource available is not considered. Just full cluster availability. For example, if we know we can run a maximum of 100 containers in parallel but the data set consists of 10000 blocks, there is no point running 10000 tasks ( assuming each block is quite small and quick to process). Tez will try and fit the no. of tasks in a multiple of max containers ( default I believe is 1.7 waves ) assuming the amount of data processed per task is within the configured limits. Too small a task and it is just overhead. Oversized tasks are slower to recover hence an upper bound is needed. There are obviously edge cases where the above may create problems but in general, this has proved more beneficial than the static task count approach used in MapReduce. > So to be clear: even for the same job and same total input size, the number > of tasks and such input size per task will differ for consecutive runs ? > Yes. Depending on the cluster load at that point in time, it could change. Also, the grouping is not exactly predictable as it depends on what order the files' info such as block locations is obtained from HDFS. > Also, it seems the number of tasks for a vertex can change at runtime(not > about duplicates, but distinct tasks) ? > Yes again. A VertexManager has the capability to modify the parallelism at run-time. For example, the auto-reduce functionality can change the parallelism of a reduce stage based on extrapolation of the amount of the data being generated by tasks of the previous map stage. — Hitesh
