Hi Madhu :<br/><br/>I try to create a InputFormat which could control split
size by parallel . This is for your reference . :)<br/><br/>public class
SizeControlInputFormat extends InputFormat<Text, Text> {<br/><br/>
private int parallel;<br/><br/> public SizeControlInputFormat(int parallel)
{<br/> this.parallel = parallel;<br/> }<br/><br/> @Override<br/>
public List<InputSplit> getSplits(JobContext jobContext)<br/>
throws IOException, InterruptedException {<br/> List<InputSplit>
splits = new ArrayList<>(parallel);<br/> Path inputPath =
FileInputFormat.getInputPaths(jobContext)[0];<br/> Configuration config
= jobContext.getConfiguration();<br/> long fileSize =
FileSystem.get(config).getContentSummary(inputPath).getLength();<br/>
long splitSize = fileSize / parallel;<br/> for (int index = 0; index
< parallel; index++) {<br/> splits.add(new FileSplit(inputPath,
index * splitSize, splitSize,<br/> (String[]) null));<br/>
}<br/><br/> return null;<br/> }<br/><br/> @Override<br/>
public RecordReader<Text, Text> createRecordReader(InputSplit
inputSplit,<br/>
TaskAttemptContext taskAttemptContext)<br/> throws IOException,
InterruptedException {<br/><br/> return null;<br/> }<br/>}
在 2016-11-01 07:05:56,"Hitesh Shah" <hit...@apache.org> 写道:
>I suggest writing a custom InputFormat or modifying your existing InputFormat
>to generate more splits and at the same time, disable splits grouping for the
>vertex in question to ensure that you get the high level of parallelism that
>you want to achieve.
>
>The log snippet is just indicating that vertex had been setup with -1 tasks as
>the splits are being calculated in the AM and that the vertex parallelism will
>be set via the initializer/controller (based on the splits from the Input
>Format).
>
>— Hitesh
>
>> On Oct 31, 2016, at 3:33 PM, Madhusudan Ramanna <m.rama...@ymail.com> wrote:
>>
>> Hello Tez team,
>>
>> We have a native Tez application. The first vertex in the graph is a
>> downloader. This vertex takes a CSV or sequence file that contains the
>> "urls" as input, downloads content and passes the content on to the next
>> vertex. This input to vertex is smaller than the min split size. However,
>> we'd like to have more than one task for running for this vertex to help
>> throughput. How do we set the tasks on this particular vertex to be greater
>> than one ? Of course for other vertices in the graph, number of tasks as
>> computed by data size fits perfectly fine.
>>
>> Currently, we're seeing this in the logs:
>>
>> >>>>>
>>
>> Root Inputs exist for Vertex: download : {_initial={InputName=_initial},
>> {Descriptor=ClassName=org.apache.tez.mapreduce.input.MRInput,
>> hasPayload=true},
>> {ControllerDescriptor=ClassName=org.apache.tez.mapreduce.common.MRInputAMSplitGenerator,
>> hasPayload=false}}
>> Num tasks is -1. Expecting VertexManager/InputInitializers/1-1 split to set
>> #tasks for the vertex vertex_1477944280627_0004_1_00 [download]
>> Vertex will initialize from input initializer.
>> vertex_1477944280627_0004_1_00 [download]
>> <<<<<
>>
>>
>>
>> Thanks for your help !
>>
>> Madhu
>>
>>
>>
>