You can specify the number of tasks (number of mappers/reducers> of a
stage when you build the DAG before you run it. Tez also allow user to
change the number of tasks dynamically at runtime. ShuffleVertexManager is
the user plugin to change the parallelism based on the outputs of its
upstream stage.
Of course, you can make your own VertexManager plugin for a different
strategy for determining the parallelism. But I would suggest you to use
ShuffleVertexManager because it is not easy to write your own
VertexManager when you are not familiar with the Tez internals. And
ShuffleVertexManager can only be used for
ScatterGather edge (similar as shuffle of map reduce)

Here¹s the parameter you can set for ShuffleVertexManaget to control how
to determine the parallelism (copy from the source code), let me know if
you have any future question.


 /**
   * In case of a ScatterGather connection, the fraction of source tasks
which
   * should complete before tasks for the current vertex are scheduled
   */
  public static final String TEZ_SHUFFLE_VERTEX_MANAGER_MIN_SRC_FRACTION =
                   
"tez.shuffle-vertex-manager.min-src-fraction";

  /**
   * In case of a ScatterGather connection, once this fraction of source
tasks
   * have completed, all tasks on the current vertex can be scheduled.
Number of
   * tasks ready for scheduling on the current vertex scales linearly
between
   * min-fraction and max-fraction
   */
  public static final String TEZ_SHUFFLE_VERTEX_MANAGER_MAX_SRC_FRACTION =
                   
"tez.shuffle-vertex-manager.max-src-fraction";

  
  /**
   * Enables automatic parallelism determination for the vertex. Based on
input data
   * statisitics the parallelism is decreased to a desired level.
   */
  public static final String
TEZ_SHUFFLE_VERTEX_MANAGER_ENABLE_AUTO_PARALLEL =
                   
"tez.shuffle-vertex-manager.enable.auto-parallel";

  
  /**
   * The desired size of input per task. Parallelism will be changed to
meet this criteria
   */
  public static final String
TEZ_SHUFFLE_VERTEX_MANAGER_DESIRED_TASK_INPUT_SIZE =
                   
"tez.shuffle-vertex-manager.desired-task-input-size";


  /**
   * Automatic parallelism determination will not decrease parallelism
below this value
   */
  public static final String
TEZ_SHUFFLE_VERTEX_MANAGER_MIN_TASK_PARALLELISM =
                   
"tez.shuffle-vertex-manager.min-task-parallelism";








Best Regard,
Jeff Zhang





On 8/19/15, 6:24 PM, "Raajay Viswanathan" <[email protected]> wrote:

>Hello,
>
>I am just getting started with Tez and need some clarification about the
>logical to physical DAG conversion in Tez. Does Tez convert a Logical DAG
>to physical DAG at one go or are the number of mappers and reducers for a
>stage determined only when the stage is ready to run ?
>
>Also, what are some rules of thumb regarding the number of
>mappers/reducers for a stage ? I would ideally like to fix the #mappers /
>#reducers in the application itself rather than let Tez determine it.
>What are the common pitfalls in doing so ?
>
>Thanks,
>Raajay

Reply via email to