Re: Vertex Parallelism

2016-11-01 Thread Darion Yaphet
Hi Madhu,

I tried to create an InputFormat that controls the split size through a parallelism setting.

This is for your reference. :)

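import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class SizeControlInputFormat extends InputFormat<LongWritable, Text> {

    // Number of splits (and therefore tasks) to generate.
    private final int parallel;

    // Note: Hadoop instantiates InputFormat classes reflectively through a
    // no-argument constructor, so in a real job this value would have to be
    // read from the Configuration instead of being passed in here.
    public SizeControlInputFormat(int parallel) {
        this.parallel = parallel;
    }

    @Override
    public List<InputSplit> getSplits(JobContext jobContext)
            throws IOException, InterruptedException {
        List<InputSplit> splits = new ArrayList<>(parallel);
        // Assumes the first input path is a single file, not a directory.
        Path inputPath = FileInputFormat.getInputPaths(jobContext)[0];
        Configuration config = jobContext.getConfiguration();
        long fileSize =
                FileSystem.get(config).getContentSummary(inputPath).getLength();
        long splitSize = fileSize / parallel;
        for (int index = 0; index < parallel; index++) {
            long start = index * splitSize;
            // The last split absorbs the remainder left by integer division.
            long length = (index == parallel - 1) ? fileSize - start : splitSize;
            splits.add(new FileSplit(inputPath, start, length, (String[]) null));
        }

        // Return the generated splits (the original draft returned null).
        return splits;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
            throws IOException, InterruptedException {
        // Assumption: the input is line-oriented text such as CSV. A sequence
        // file would need a different reader.
        return new LineRecordReader();
    }
}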
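
The split arithmetic can be checked without a cluster. The following is only a minimal sketch using the standard library; the class name SplitRanges and the method ranges are hypothetical and not part of Hadoop or Tez. It divides fileSize bytes into `parallel` contiguous (start, length) ranges and lets the last range take whatever integer division leaves over, so that no trailing bytes are dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only: not a Hadoop or Tez class.
public final class SplitRanges {

    /** Returns {start, length} pairs that together cover [0, fileSize). */
    public static List<long[]> ranges(long fileSize, int parallel) {
        if (parallel <= 0) {
            throw new IllegalArgumentException("parallel must be positive");
        }
        List<long[]> out = new ArrayList<>(parallel);
        long splitSize = fileSize / parallel;
        for (int i = 0; i < parallel; i++) {
            long start = i * splitSize;
            // The last range absorbs the remainder of the integer division.
            long length = (i == parallel - 1) ? fileSize - start : splitSize;
            out.add(new long[] {start, length});
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints one "start length" pair per line for a 10-byte file in 3 parts.
        for (long[] r : ranges(10, 3)) {
            System.out.println(r[0] + " " + r[1]);
        }
    }
}
```

If fileSize is smaller than parallel, all ranges except the last come out with length zero, so a real InputFormat would probably want to cap the number of splits.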


-- 

Long is the way and hard, that out of Hell leads up to light


Re: Vertex Parallelism

2016-10-31 Thread Hitesh Shah
I suggest writing a custom InputFormat, or modifying your existing InputFormat, 
to generate more splits and, at the same time, disabling split grouping for the 
vertex in question so that you get the high level of parallelism that you want 
to achieve.

The log snippet just indicates that the vertex was set up with -1 tasks because 
the splits are being calculated in the AM, and that the vertex parallelism will 
be set via the initializer/controller (based on the splits from the 
InputFormat).
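
As a rough illustration of the suggestion above, split grouping can be switched off when the MRInput data source is built. This is only a sketch under assumptions: the class name DownloadVertexSketch, the processorClassName parameter, and the use of TextInputFormat are placeholders, and a custom InputFormat that generates more splits would take TextInputFormat's place. The vertex name "download" and the input name "_initial" are taken from the log in the original message.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.tez.dag.api.DataSourceDescriptor;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.mapreduce.input.MRInput;

// Hypothetical helper class, for illustration only.
public final class DownloadVertexSketch {

    public static Vertex build(Configuration conf, String inputPath,
                               String processorClassName) {
        // Build the MRInput data source with split grouping disabled, so the
        // splits returned by the InputFormat are not merged into fewer tasks.
        DataSourceDescriptor source = MRInput
                .createConfigBuilder(conf, TextInputFormat.class, inputPath)
                .groupSplits(false)
                .build();

        // No parallelism is passed here, so the task count is left to the
        // input initializer, as the "Num tasks is -1" log line describes.
        return Vertex.create("download", ProcessorDescriptor.create(processorClassName))
                .addDataSource("_initial", source);
    }
}
```

With grouping off, the number of tasks should follow the number of splits the InputFormat returns, so the InputFormat itself still has to produce more than one split for a small file.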

— Hitesh

> On Oct 31, 2016, at 3:33 PM, Madhusudan Ramanna <m.rama...@ymail.com> wrote:
> 
> Hello Tez team,
> 
> We have a native Tez application. The first vertex in the graph is a 
> downloader. This vertex takes a CSV or sequence file that contains the 
> "urls" as input, downloads the content, and passes it on to the next 
> vertex. The input to this vertex is smaller than the min split size. However, 
> we'd like to run more than one task for this vertex to improve throughput. 
> How do we set the number of tasks on this particular vertex to be greater 
> than one? Of course, for the other vertices in the graph, the number of 
> tasks computed from the data size works perfectly fine.
> 
> Currently, we're seeing this in the logs:
> 
> >>>>>
> 
> Root Inputs exist for Vertex: download : {_initial={InputName=_initial}, 
> {Descriptor=ClassName=org.apache.tez.mapreduce.input.MRInput, 
> hasPayload=true}, 
> {ControllerDescriptor=ClassName=org.apache.tez.mapreduce.common.MRInputAMSplitGenerator,
>  hasPayload=false}}
> Num tasks is -1. Expecting VertexManager/InputInitializers/1-1 split to set 
> #tasks for the vertex vertex_1477944280627_0004_1_00 [download]
> Vertex will initialize from input initializer. vertex_1477944280627_0004_1_00 
> [download]
> <<<<<
> 
> 
> 
> Thanks for your help !
> 
> Madhu
> 
> 
> 



Vertex Parallelism

2016-10-31 Thread Madhusudan Ramanna
Hello Tez team,
We have a native Tez application. The first vertex in the graph is a 
downloader. This vertex takes a CSV or sequence file that contains the "urls" 
as input, downloads the content, and passes it on to the next vertex. The 
input to this vertex is smaller than the min split size. However, we'd like to 
run more than one task for this vertex to improve throughput. How do we set 
the number of tasks on this particular vertex to be greater than one? Of 
course, for the other vertices in the graph, the number of tasks computed from 
the data size works perfectly fine.
Currently, we're seeing this in the logs:

>>>>>

Root Inputs exist for Vertex: download : {_initial={InputName=_initial}, 
{Descriptor=ClassName=org.apache.tez.mapreduce.input.MRInput, hasPayload=true}, 
{ControllerDescriptor=ClassName=org.apache.tez.mapreduce.common.MRInputAMSplitGenerator,
 hasPayload=false}}
Num tasks is -1. Expecting VertexManager/InputInitializers/1-1 split to set 
#tasks for the vertex vertex_1477944280627_0004_1_00 [download]
Vertex will initialize from input initializer. vertex_1477944280627_0004_1_00 
[download]
<<<<<


Thanks for your help !
Madhu




RE: Setting vertex parallelism

2015-09-13 Thread Bikas Saha
Setting parallelism is required, not optional. That is why it must be 
specified at creation time. Otherwise it would become optional and users 
might end up not specifying it.

From: Raajay [mailto:raaja...@gmail.com]
Sent: Saturday, September 12, 2015 1:21 AM
To: user@tez.apache.org
Subject: Setting vertex parallelism

The Vertex.java API does not allow parallelism to be changed after a vertex is 
created; there is no setParallelism() API exposed.

Any specific reason ? Will changing the parallelism affect the execution ?
Thanks
Raajay


Setting vertex parallelism

2015-09-12 Thread Raajay
The Vertex.java API does not allow parallelism to be changed after a
vertex is created; there is no setParallelism() API exposed.

Any specific reason ? Will changing the parallelism affect the execution ?

Thanks
Raajay