Thanks Jeff and Hitesh. Wonderful pointers / summary to get me started.

Raajay


> On Aug 19, 2015, at 4:20 PM, Hitesh Shah <[email protected]> wrote:
> 
> If you are mainly looking at this from a “mapper” and “reducer” perspective, 
> there are 2 main ways in which Tez affects parallelism: 
> 
> i) For the mapper, the no. of maps is effectively governed by how many splits 
> are created and how are splits assigned to tasks. The initial split creation 
> is controlled by the InputSplitFormat. Tez then looks at the cluster capacity 
> to decide whether the no. of splits are optimal. For example, the InputFormat 
> could create 1000 splits which would imply a 1000 tasks. However, if the max 
> no. of containers that can be launched is only 50, this would imply 20 waves 
> of tasks for the full stage/vertex to complete. So in this case, if grouping 
> is enabled, Tez tries and re-shuffles the splits to around 1.5 waves ( 1.5 is 
> configurable ) but also ensures that splits are kept within a bounded range 
> with respect to the amount of data being processed. For example, if each 
> split was processing 1 TB of data, there is no need to group splits as 1 TB 
> is already too large. Whereas if each split was 1 MB, then Tez will likely 
> end bundling up more and more splits to be processed by a single task until a 
> min limit is reached on a per task basis. Disabling grouping would allow you 
> to control no. of tasks by configuring the InputSplit as needed to a more 
> static nature.
> 
> 2) For reducers, the main problem everyone has is that today you could 
> configure a 1000 reducers but the mappers generated only 10 MB of data in 
> total. This could be processed by a single reducer. Tez monitors the data 
> being generated from the mappers and dynamically tries to decide how many 
> reducers it can scale down to. This is the auto-reduce parallelism that Jeff 
> referred to in his email earlier. Again, a static configuration for the 
> reducer stage i.e setParallelism() and disabling auto-reduce would stop Tez 
> from making any runtime changes. 
> 
> To answer your first question, each stage/vertex is governed by its own 
> VertexManager which in the end is user-code ( most of the common ones are 
> implemented within the framework but can be overridden if needed). These VMs 
> can be as dumb or as complex as possible and can make runtime decisions. 
> There are certain points in the processing up to which the VM can tweak 
> parallelism. After a certain point, parallelism becomes final ( due to the 
> requirement that failures need to be handled in a deterministic manner such 
> that re-running a task results in the same data output each time around ).
> 
> thanks
> — Hitesh
> 
> On Aug 19, 2015, at 3:24 AM, Raajay Viswanathan <[email protected]> wrote:
> 
>> Hello,
>> 
>> I am just getting started with Tez and need some clarification about the 
>> logical to physical DAG conversion in Tez. Does Tez convert a Logical DAG to 
>> physical DAG at one go or are the number of mappers and reducers for a stage 
>> determined only when the stage is ready to run ?
>> 
>> Also, what are some rules of thumb regarding the number of mappers/reducers 
>> for a stage ? I would ideally like to fix the #mappers / #reducers in the 
>> application itself rather than let Tez determine it. What are the common 
>> pitfalls in doing so ?
>> 
>> Thanks,
>> Raajay
> 

Reply via email to