I suggest writing a custom InputFormat, or modifying your existing one, so
that it generates more splits. At the same time, disable split grouping for
the vertex in question; otherwise the small splits may be grouped back
together and you will not get the level of parallelism you want.
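
As a rough illustration of the "generate more splits" part, here is a minimal
sketch in plain Java of the arithmetic a custom InputFormat's getSplits() could
use to carve a small file into a fixed number of byte ranges. The names
SplitPlanner, planSplits and desiredSplits are hypothetical and not part of Tez
or Hadoop; the sketch deliberately leaves out the Hadoop types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Tez or Hadoop class: it only plans byte ranges.
public class SplitPlanner {

    /**
     * Returns {offset, length} pairs that cover [0, fileLength) with at most
     * desiredSplits ranges, ignoring any minimum split size.
     */
    static List<long[]> planSplits(long fileLength, int desiredSplits) {
        if (fileLength < 0 || desiredSplits < 1) {
            throw new IllegalArgumentException("invalid length or split count");
        }
        List<long[]> splits = new ArrayList<>();
        if (fileLength == 0) {
            return splits; // nothing to read, so no splits
        }
        // Never create more splits than there are bytes.
        int n = (int) Math.min((long) desiredSplits, fileLength);
        long base = fileLength / n;
        long remainder = fileLength % n;
        long offset = 0;
        for (int i = 0; i < n; i++) {
            // Spread the leftover bytes over the first `remainder` splits.
            long length = base + (i < remainder ? 1 : 0);
            splits.add(new long[] {offset, length});
            offset += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        for (long[] s : planSplits(1000, 8)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

In a real InputFormat each pair would become a FileSplit, and the record
reader would still have to deal with records that straddle a boundary. For a
line-oriented list of urls, Hadoop's NLineInputFormat may already do what you
need. Split grouping also has to be off for the vertex (I believe this is
groupSplits(false) on the MRInput config builder, but please check that
against your Tez version).
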
The log snippet just indicates that the vertex was set up with -1 tasks
because the splits are being calculated in the AM, and that the vertex
parallelism will be set by the initializer/controller (based on the splits
from the InputFormat).
— Hitesh
> On Oct 31, 2016, at 3:33 PM, Madhusudan Ramanna wrote:
>
> Hello Tez team,
>
> We have a native Tez application. The first vertex in the graph is a
> downloader. This vertex takes a CSV or sequence file that contains the
> "urls" as input, downloads the content and passes it on to the next
> vertex. The input to this vertex is smaller than the min split size.
> However, we'd like to have more than one task running for this vertex to
> help throughput. How do we set the number of tasks on this particular
> vertex to be greater than one? Of course, for the other vertices in the
> graph, the number of tasks computed from the data size works perfectly fine.
>
> Currently, we're seeing this in the logs:
>
> >
>
> Root Inputs exist for Vertex: download : {_initial={InputName=_initial},
> {Descriptor=ClassName=org.apache.tez.mapreduce.input.MRInput,
> hasPayload=true},
> {ControllerDescriptor=ClassName=org.apache.tez.mapreduce.common.MRInputAMSplitGenerator,
> hasPayload=false}}
> Num tasks is -1. Expecting VertexManager/InputInitializers/1-1 split to set
> #tasks for the vertex vertex_1477944280627_0004_1_00 [download]
> Vertex will initialize from input initializer. vertex_1477944280627_0004_1_00
> [download]
> <
>
> Thanks for your help!
>
> Madhu