Re:Re: Vertex Parallelism

darion.yaphet Tue, 01 Nov 2016 07:20:04 -0700
Hi Madhu :<br/><br/>I try to create a InputFormat which could control split 
size by parallel . This is for your reference . :)<br/><br/>public class 
SizeControlInputFormat extends InputFormat&lt;Text, Text&gt; {<br/><br/>    
private int parallel;<br/><br/>    public SizeControlInputFormat(int parallel) 
{<br/>        this.parallel = parallel;<br/>    }<br/><br/>    @Override<br/>   
 public List&lt;InputSplit&gt; getSplits(JobContext jobContext)<br/>            
throws IOException, InterruptedException {<br/>        List&lt;InputSplit&gt; 
splits = new ArrayList&lt;&gt;(parallel);<br/>        Path inputPath = 
FileInputFormat.getInputPaths(jobContext)[0];<br/>        Configuration config 
= jobContext.getConfiguration();<br/>        long fileSize = 
FileSystem.get(config).getContentSummary(inputPath).getLength();<br/>        
long splitSize = fileSize / parallel;<br/>        for (int index = 0; index 
&lt; parallel; index++) {<br/>            splits.add(new FileSplit(inputPath, 
index * splitSize, splitSize,<br/>                    (String[]) null));<br/>   
     }<br/><br/>        return null;<br/>    }<br/><br/>    @Override<br/>    
public RecordReader&lt;Text, Text&gt; createRecordReader(InputSplit 
inputSplit,<br/>                                                       
TaskAttemptContext taskAttemptContext)<br/>            throws IOException, 
InterruptedException {<br/><br/>        return null;<br/>    }<br/>}
在 2016-11-01 07:05:56，"Hitesh Shah" <hit...@apache.org> 写道：
>I suggest writing a custom InputFormat or modifying your existing InputFormat 
>to generate more splits and at the same time, disable splits grouping for the 
>vertex in question to ensure that you get the high level of parallelism that 
>you want to achieve.
>
>The log snippet is just indicating that vertex had been setup with -1 tasks as 
>the splits are being calculated in the AM and that the vertex parallelism will 
>be set via the initializer/controller (based on the splits from the Input 
>Format).
>
>— Hitesh
>
>> On Oct 31, 2016, at 3:33 PM, Madhusudan Ramanna <m.rama...@ymail.com> wrote:
>> 
>> Hello Tez team,
>> 
>> We have a native Tez application.  The first vertex in the graph is a 
>> downloader.  This vertex takes a CSV or sequence file that contains the 
>> "urls" as input, downloads content and passes the content on to the next 
>> vertex.  This input to vertex is smaller than the min split size.   However, 
>> we'd like to have more than one task for running for this vertex to help 
>> throughput. How do we set the tasks on this particular vertex to be greater 
>> than one ?  Of course for other vertices in the graph,  number of tasks as 
>> computed by data size fits perfectly fine. 
>> 
>> Currently, we're seeing this in the logs:
>> 
>> >>>>>
>> 
>> Root Inputs exist for Vertex: download : {_initial={InputName=_initial}, 
>> {Descriptor=ClassName=org.apache.tez.mapreduce.input.MRInput, 
>> hasPayload=true}, 
>> {ControllerDescriptor=ClassName=org.apache.tez.mapreduce.common.MRInputAMSplitGenerator,
>>  hasPayload=false}}
>> Num tasks is -1. Expecting VertexManager/InputInitializers/1-1 split to set 
>> #tasks for the vertex vertex_1477944280627_0004_1_00 [download]
>> Vertex will initialize from input initializer. 
>> vertex_1477944280627_0004_1_00 [download]
>> <<<<<
>> 
>> 
>> 
>> Thanks for your help !
>> 
>> Madhu
>> 
>> 
>> 
>
Re:Re: Vertex Parallelism

Reply via email to