Re: Vertex Parallelism

2016-10-31 Thread Hitesh Shah
I suggest writing a custom InputFormat, or modifying your existing one, so 
that it generates more splits. At the same time, disable split grouping for 
the vertex in question; otherwise the extra splits can be merged back 
together and you will not get the level of parallelism that you want.
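
A minimal sketch of the split-planning step such an InputFormat would need, 
assuming the URL file is small enough to count its records up front. It is 
plain Java with no Hadoop or Tez dependency, and the names UrlSplitPlanner, 
Range, planSplits and desiredTasks are hypothetical, chosen only for 
illustration. It divides the records into at most desiredTasks contiguous 
ranges, one per task:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Tez or Hadoop. */
class UrlSplitPlanner {

    /** One planned split covering records [start, start + length). */
    static final class Range {
        final int start;
        final int length;

        Range(int start, int length) {
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Divides totalRecords into at most desiredTasks contiguous ranges.
     * Sizes differ by at most one record, and no range is ever empty.
     */
    static List<Range> planSplits(int totalRecords, int desiredTasks) {
        if (totalRecords < 0 || desiredTasks < 1) {
            throw new IllegalArgumentException(
                "totalRecords must be >= 0 and desiredTasks must be >= 1");
        }
        // Never plan more splits than there are records.
        int splits = Math.min(totalRecords, desiredTasks);
        List<Range> ranges = new ArrayList<>(splits);
        if (splits == 0) {
            return ranges;
        }
        int base = totalRecords / splits;
        int remainder = totalRecords % splits;
        int start = 0;
        for (int i = 0; i < splits; i++) {
            // The first 'remainder' ranges take one extra record each.
            int length = base + (i < remainder ? 1 : 0);
            ranges.add(new Range(start, length));
            start += length;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Example: 10 URLs spread over 4 download tasks.
        for (Range r : planSplits(10, 4)) {
            System.out.println("start=" + r.start + " length=" + r.length);
        }
    }
}
```

In a real job each range would become one InputSplit returned from the 
InputFormat's getSplits(). Hadoop's NLineInputFormat, which creates one split 
per N input lines, may already be enough for a line-oriented URL file. For 
the grouping part, MRInput's config builder appears to expose 
groupSplits(boolean); treat that as something to verify against your Tez 
version rather than a confirmed recipe.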

The log snippet just indicates that the vertex was set up with -1 tasks 
because the splits are being calculated in the AM. The vertex parallelism 
will then be set by the initializer/controller, based on the splits from the 
InputFormat.

— Hitesh

> On Oct 31, 2016, at 3:33 PM, Madhusudan Ramanna wrote:
> 
> Hello Tez team,
> 
> We have a native Tez application. The first vertex in the graph is a 
> downloader. This vertex takes a CSV or sequence file that contains the 
> "urls" as input, downloads the content and passes it on to the next 
> vertex. The input to this vertex is smaller than the min split size. 
> However, we'd like to have more than one task running for this vertex to 
> help throughput. How do we set the number of tasks for this particular 
> vertex to be greater than one? Of course, for the other vertices in the 
> graph, the number of tasks computed from the data size works perfectly fine.
> 
> Currently, we're seeing this in the logs:
> 
> >
> 
> Root Inputs exist for Vertex: download : {_initial={InputName=_initial}, 
> {Descriptor=ClassName=org.apache.tez.mapreduce.input.MRInput, 
> hasPayload=true}, 
> {ControllerDescriptor=ClassName=org.apache.tez.mapreduce.common.MRInputAMSplitGenerator,
>  hasPayload=false}}
> Num tasks is -1. Expecting VertexManager/InputInitializers/1-1 split to set 
> #tasks for the vertex vertex_1477944280627_0004_1_00 [download]
> Vertex will initialize from input initializer. vertex_1477944280627_0004_1_00 
> [download]
> <
> 
> Thanks for your help !
> 
> Madhu



Vertex Parallelism

2016-10-31 Thread Madhusudan Ramanna
Hello Tez team,

We have a native Tez application. The first vertex in the graph is a 
downloader. This vertex takes a CSV or sequence file that contains the "urls" 
as input, downloads the content and passes it on to the next vertex. The 
input to this vertex is smaller than the min split size. However, we'd like 
to have more than one task running for this vertex to help throughput. How do 
we set the number of tasks for this particular vertex to be greater than one? 
Of course, for the other vertices in the graph, the number of tasks computed 
from the data size works perfectly fine.

Currently, we're seeing this in the logs:
>
Root Inputs exist for Vertex: download : {_initial={InputName=_initial}, 
{Descriptor=ClassName=org.apache.tez.mapreduce.input.MRInput, hasPayload=true}, 
{ControllerDescriptor=ClassName=org.apache.tez.mapreduce.common.MRInputAMSplitGenerator,
 hasPayload=false}}
Num tasks is -1. Expecting VertexManager/InputInitializers/1-1 split to set 
#tasks for the vertex vertex_1477944280627_0004_1_00 [download]
Vertex will initialize from input initializer. vertex_1477944280627_0004_1_00 
[download]
<


Thanks for your help !
Madhu




Hive+Tez staging dir and scratch dir

2016-10-31 Thread Dharmesh Kakadia
Hi,

I am trying to understand the meaning of, and the relationship between, the
following configuration properties when running Hive on Tez:

hive.exec.stagingdir
tez.staging-dir
hive.exec.scratchdir

Thanks,
Dharmesh