Re: Vertex Parallelism
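Hi Madhu,

I tried to write an InputFormat that controls the split size through a "parallel" setting. This is for your reference. :)

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class SizeControlInputFormat<K, V> extends InputFormat<K, V> {
        // Desired number of splits, and therefore of tasks.
        private final int parallel;

        public SizeControlInputFormat(int parallel) {
            this.parallel = parallel;
        }

        @Override
        public List<InputSplit> getSplits(JobContext jobContext) throws IOException, InterruptedException {
            List<InputSplit> splits = new ArrayList<>(parallel);
            Path inputPath = FileInputFormat.getInputPaths(jobContext)[0];
            Configuration config = jobContext.getConfiguration();
            long fileSize = FileSystem.get(config).getContentSummary(inputPath).getLength();
            long splitSize = fileSize / parallel;
            for (int index = 0; index < parallel; index++) {
                long start = index * splitSize;
                // The last split absorbs the remainder of the integer division.
                long length = (index == parallel - 1) ? fileSize - start : splitSize;
                splits.add(new FileSplit(inputPath, start, length, (String[]) null));
            }
            return splits;
        }

        @Override
        public RecordReader<K, V> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
            // Placeholder: a reader for the actual file format still has to be returned here.
            return null;
        }
    }

--
long is the way and hard that out of Hell leads up to light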
Re: Vertex Parallelism
I suggest writing a custom InputFormat, or modifying your existing InputFormat, so that it generates more splits. At the same time, disable split grouping for the vertex in question to ensure that you get the high level of parallelism you want.

The log snippet just indicates that the vertex has been set up with -1 tasks because the splits are being calculated in the AM, and that the vertex parallelism will be set via the initializer/controller (based on the splits from the InputFormat).

-- Hitesh

> On Oct 31, 2016, at 3:33 PM, Madhusudan Ramanna <m.rama...@ymail.com> wrote:
>
> Hello Tez team,
>
> We have a native Tez application. The first vertex in the graph is a
> downloader. This vertex takes a CSV or sequence file that contains the
> "urls" as input, downloads the content and passes it on to the next
> vertex. The input to this vertex is smaller than the min split size. However,
> we'd like to have more than one task running for this vertex to help
> throughput. How do we set the number of tasks on this particular vertex to be
> greater than one? For the other vertices in the graph, the number of tasks
> as computed by data size fits perfectly fine.
>
> Currently, we're seeing this in the logs:
>
> >>>>>
>
> Root Inputs exist for Vertex: download : {_initial={InputName=_initial},
> {Descriptor=ClassName=org.apache.tez.mapreduce.input.MRInput,
> hasPayload=true},
> {ControllerDescriptor=ClassName=org.apache.tez.mapreduce.common.MRInputAMSplitGenerator,
> hasPayload=false}}
> Num tasks is -1. Expecting VertexManager/InputInitializers/1-1 split to set
> #tasks for the vertex vertex_1477944280627_0004_1_00 [download]
> Vertex will initialize from input initializer. vertex_1477944280627_0004_1_00
> [download]
> <<<<<
>
> Thanks for your help!
>
> Madhu
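A minimal sketch of the split arithmetic behind the "generate more splits" suggestion, assuming the input is a single file that is cut into a fixed number of byte ranges. The names SplitPlanner, Range and plan are hypothetical and are not part of Tez or Hadoop; a real InputFormat would wrap each range in a FileSplit, and its RecordReader would still have to align ranges to record boundaries. Split grouping itself is configured on the Tez side (to my knowledge through groupSplits(false) on the MRInput config builder), which this sketch does not cover.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Tez or Hadoop. */
final class SplitPlanner {

    /** One byte range covering [start, start + length). */
    static final class Range {
        final long start;
        final long length;

        Range(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Cuts fileSize bytes into exactly `parallel` contiguous ranges.
     * Every range gets fileSize / parallel bytes, and the last range
     * additionally takes the remainder so that no byte is dropped.
     */
    static List<Range> plan(long fileSize, int parallel) {
        if (fileSize < 0 || parallel <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and parallel must be > 0");
        }
        List<Range> ranges = new ArrayList<>(parallel);
        long base = fileSize / parallel;
        for (int i = 0; i < parallel; i++) {
            long start = i * base;
            long length = (i == parallel - 1) ? fileSize - start : base;
            ranges.add(new Range(start, length));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Print the ranges for a 1000-byte input cut into 3 splits.
        for (Range r : plan(1000, 3)) {
            System.out.println("start=" + r.start + " length=" + r.length);
        }
    }
}
```

Note that when the file has fewer bytes than the requested number of splits, all ranges except the last come out empty, so for very small inputs a per-record split strategy may be a better fit.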
Vertex Parallelism
Hello Tez team,

We have a native Tez application. The first vertex in the graph is a downloader. This vertex takes a CSV or sequence file that contains the "urls" as input, downloads the content and passes it on to the next vertex. The input to this vertex is smaller than the min split size. However, we'd like to have more than one task running for this vertex to help throughput. How do we set the number of tasks on this particular vertex to be greater than one? For the other vertices in the graph, the number of tasks as computed by data size fits perfectly fine.

Currently, we're seeing this in the logs:

>>>>>
Root Inputs exist for Vertex: download : {_initial={InputName=_initial}, {Descriptor=ClassName=org.apache.tez.mapreduce.input.MRInput, hasPayload=true}, {ControllerDescriptor=ClassName=org.apache.tez.mapreduce.common.MRInputAMSplitGenerator, hasPayload=false}}
Num tasks is -1. Expecting VertexManager/InputInitializers/1-1 split to set #tasks for the vertex vertex_1477944280627_0004_1_00 [download]
Vertex will initialize from input initializer. vertex_1477944280627_0004_1_00 [download]
<<<<<

Thanks for your help!

Madhu
RE: Setting vertex parallelism
Setting parallelism is required, not optional. That is why it must be specified at creation time. Otherwise it would effectively be optional, and users might end up not specifying it.

From: Raajay [mailto:raaja...@gmail.com]
Sent: Saturday, September 12, 2015 1:21 AM
To: user@tez.apache.org
Subject: Setting vertex parallelism

The Vertex.java API does not allow parallelism to be changed after a vertex is created; there is no setParallelism() API exposed. Is there a specific reason? Will changing the parallelism affect the execution?

Thanks,
Raajay
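To illustrate the design argument above, here is a small sketch of a type whose task count can only be supplied at creation time and has no setter. VertexSpec is a hypothetical stand-in and not Tez's Vertex class. The validation rule (either at least 1, or -1 meaning "decided at runtime", mirroring the "Num tasks is -1" log message quoted elsewhere in this archive) is an assumption of this sketch rather than documented Tez behaviour.

```java
/** Hypothetical stand-in used only for illustration; this is not Tez's Vertex class. */
final class VertexSpec {
    /** Assumed marker for "let the framework decide the task count at runtime". */
    static final int RUNTIME_DETERMINED = -1;

    private final String name;
    private final int parallelism;

    private VertexSpec(String name, int parallelism) {
        this.name = name;
        this.parallelism = parallelism;
    }

    /**
     * The only way to obtain a VertexSpec. Because parallelism is a factory
     * argument and there is no setter, a caller cannot forget to state it
     * and cannot change it after creation.
     */
    static VertexSpec create(String name, int parallelism) {
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("name must not be empty");
        }
        if (parallelism != RUNTIME_DETERMINED && parallelism < 1) {
            throw new IllegalArgumentException("parallelism must be >= 1 or RUNTIME_DETERMINED");
        }
        return new VertexSpec(name, parallelism);
    }

    String name() {
        return name;
    }

    int parallelism() {
        return parallelism;
    }

    public static void main(String[] args) {
        VertexSpec download = VertexSpec.create("download", 8);
        System.out.println(download.name() + " is configured with " + download.parallelism() + " tasks");
    }
}
```

Making the field final and validating it in the factory keeps every instance in a known state, which is the trade-off the reply describes: less flexibility after creation in exchange for never having an unspecified task count.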
Setting vertex parallelism
The Vertex.java API does not allow parallelism to be changed after a vertex is created; there is no setParallelism() API exposed. Is there a specific reason? Will changing the parallelism affect the execution?

Thanks,
Raajay