Re: how to allocate more containers?

2015-09-10 Thread Jianfeng (Jeff) Zhang

Do you have the DAG plan? I mean the DAG topology, i.e. the dot file in the AM 
container.




Best Regard,
Jeff Zhang





RE: how to allocate more containers?

2015-09-10 Thread Xiaoyong Zhu
Thanks for the information. Here's my understanding of the resource allocation 
(please correct me if I am wrong) and my scenario:

1.   Assume the cluster is dedicated to a single Tez application, so I want 
that application to use as much of the cluster's memory/CPU as possible.

2.   Assume I have changed all the configurations on the YARN side so that the 
memory/CPU allocation of each node is maximized (meaning each node can 
theoretically be fully utilized). The input is around 500GB~1TB.

3.   Then I launch a Tez application (Hive on Tez). Tez chooses the number of 
tasks (in my case there are usually 3K tasks), and each task usually runs for 
about 10~20 seconds.

In this case, I don't think the number of Tez tasks should be increased (each 
of them runs for just a couple of seconds, so I think each task is able to 
process its data). The swimlane picture is attached (for a smaller data size, 
but the DAG plans are the same). The container reuse switch is also on.

To maximize utilization, I would rather increase the number of containers so 
that more tasks can run in parallel, but I am not sure what the Tez AM bases 
the number of containers it requests from the RM on. Can I change the number 
of containers Tez asks for so that the job runs faster?

Xiaoyong




Re: how to allocate more containers?

2015-09-10 Thread Jianfeng (Jeff) Zhang

I think container reuse is enabled by default. You may disable it to get more 
containers, but that is a trade-off: resources will not be used as efficiently.

Set tez.am.container.reuse.enabled = false


Best Regard,
Jeff Zhang





Re: how to allocate more containers?

2015-09-10 Thread Jianfeng (Jeff) Zhang
Resource usage depends mostly on your cluster configuration (the resource 
scheduler configuration).
Do you intend to increase parallelism (more tasks) to get more containers?
There are also some configurations you can use to get containers more quickly, 
at the cost of some other trade-offs, but they would not give you more containers.
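
As a rough illustration of why the cluster configuration is the limit here: 
YARN can only hand out as many containers as fit into the memory and vcores 
each node offers, so when there are far more pending tasks than slots, the 
number of containers running at once is bounded by capacity divided by 
container size. The sketch below is plain arithmetic, not Tez or YARN code. The 
class and method names and the sample numbers (20 nodes, 96 GB and 16 vcores 
per node, 4 GB / 1 vcore containers) are assumptions chosen for illustration; a 
real scheduler also applies queue limits, reserves a container for the AM, and 
may ignore vcores depending on its resource calculator.

```java
/** Back-of-the-envelope container math; all names and numbers are illustrative. */
final class ContainerEstimate {

    /** Containers that fit on one node, limited by whichever resource runs out first. */
    static int containersPerNode(long nodeMemMb, int nodeVcores,
                                 long containerMemMb, int containerVcores) {
        long byMemory = nodeMemMb / containerMemMb;
        long byCpu = nodeVcores / containerVcores;
        return (int) Math.min(byMemory, byCpu);
    }

    /** Upper bound on tasks running at once: never more than there are tasks. */
    static int maxConcurrentTasks(int nodes, int perNode, int pendingTasks) {
        return Math.min(nodes * perNode, pendingTasks);
    }

    /** Number of "waves" needed to push all tasks through the available slots. */
    static int waves(int totalTasks, int concurrentTasks) {
        return (totalTasks + concurrentTasks - 1) / concurrentTasks; // ceiling division
    }

    public static void main(String[] args) {
        int perNode = containersPerNode(96 * 1024, 16, 4 * 1024, 1);
        int concurrent = maxConcurrentTasks(20, perNode, 3000);
        System.out.println("containers per node:  " + perNode);
        System.out.println("concurrent tasks:     " + concurrent);
        System.out.println("waves for 3000 tasks: " + waves(3000, concurrent));
    }
}
```

With these made-up numbers CPU is the binding resource (16 containers per node 
rather than the 24 that memory alone would allow), so shrinking the container 
memory would not add parallelism. In Hive on Tez the per-container memory is 
typically controlled by hive.tez.container.size.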


Best Regard,
Jeff Zhang





how to allocate more containers?

2015-09-10 Thread Xiaoyong Zhu
Hi

I am wondering if there is a configuration I can change to allocate more 
containers for a certain Tez application? I am using Hive on Tez.

Thanks!

Xiaoyong



Re: Creating TaskLocationHints

2015-09-10 Thread Hitesh Shah
In almost all cases, these are hostnames. The general flow is: find the block 
locations for the data source, extract the hostnames from them, and provide 
them to YARN so that it can allocate a container on the same host as the 
datanode holding the data. As long as YARN is using hostnames, the container 
locality matching should work correctly. I will need to check the YARN 
codebase to see whether it does additional reverse DNS lookups so that IPs 
also work, but to be safe, use hostnames.

I don't believe Tez has yet introduced support for working with 
application-level YARN node labels.

thanks
-- Hitesh

On Sep 10, 2015, at 12:43 AM, Raajay wrote:

> While creating TaskLocationHints, using the static function
> 
> TaskLocationHint.createTaskLocationHint(Set<String> nodes, Set<String> racks)
> 
> what should the Strings be? IP addresses of the nodes? Node labels? Or 
> hostnames?
> 
> Thanks
> Raajay



Re: Error of setting vertex location hints

2015-09-10 Thread Hitesh Shah
There are 2 aspects to using Vertex Location Hints and parallelism. All of this 
depends on how you define the work that needs to be done by a particular task.

I will take the MR approach and compare it to the more dynamic approach that 
Jeff has been explaining.

For MR, all the work is decided upfront on the client side, i.e. how many 
tasks are needed and which task will process which split. From a Tez point of 
view, this means that you can configure the vertex with a fixed parallelism 
(i.e. not -1) and set up the Vertex location hints as needed. This also implies 
that you need to configure the Input for that vertex, via its user payload, 
with all the necessary information on what work it needs to do.

tez-tests/src/main/java/org/apache/tez/mapreduce/examples/FilterLinesByWord.java
has an option to generate the splits on the client. You can follow this code 
path to see how the DAG is set up. The same approach is also used for running 
any MapReduce job via Tez using the yarn-tez config knob (MR always generates 
splits on the client).

In the dynamic approach that Tez follows, vertices that take input from HDFS 
(or any other source, for that matter) have parallelism set to -1 (and no 
location hints defined at DAG plan creation time). The Input has an 
Initializer attached to it, which runs in the ApplicationMaster, looks at the 
data to be processed, and figures out how many tasks to run, where to run them, 
and what shard/partition of work to assign to each task. There are multiple 
facets to this, most of which Jeff has covered in his earlier replies.

thanks
-- Hitesh



Re: Error of setting vertex location hints

2015-09-10 Thread Jianfeng (Jeff) Zhang
>>> I am trying to create a scenario where the mappers (root tasks) are 
>>> necessarily not executed at the data location
I'm not sure what your purpose is. Usually data locality improves performance.


>>> Can the number of tasks for the tokenizer be a value *NOT* equal to the 
>>> number of HDFS blocks of the file ?
Yes, it can. There are two ways:
* MRInput internally uses InputFormat to determine how to split the input, so 
everything that applies to InputFormat applies to MRInput too, for example 
mapreduce.input.fileinputformat.split.minsize and 
mapreduce.input.fileinputformat.split.maxsize.

* Another way is to use TezGroupedSplitsInputFormat, which is provided by Tez. 
This InputFormat groups several splits together into a new split to be 
consumed by one mapper.
  You can use the following parameters to tune that; please refer to 
MRInputConfigBuilder.groupSplits:

  * tez.grouping.split-waves
  * tez.grouping.max-size
  * tez.grouping.min-size
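
To make the two options above concrete, here is a rough sketch of the 
arithmetic behind them. The first method mirrors the split-size rule used by 
Hadoop's FileInputFormat, max(minSize, min(maxSize, blockSize)). The second is 
a simplified approximation of how the Tez grouping parameters interact: a 
desired group count derived from the available slots times 
tez.grouping.split-waves, then adjusted so that the average group size stays 
between tez.grouping.min-size and tez.grouping.max-size. The class name, method 
names and sample values are assumptions for illustration, and the real grouping 
code also takes locality into account, so treat the results as estimates only.

```java
/** Rough split-count arithmetic; names and sample values are illustrative only. */
final class SplitEstimate {

    /** Split size rule of Hadoop's FileInputFormat: clamp the block size. */
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Approximate number of splits for one file (ignores the last-split slop). */
    static long numSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize; // ceiling division
    }

    /**
     * Simplified view of Tez split grouping: aim for slots * waves groups,
     * then adjust so the average group size stays within [minSize, maxSize].
     */
    static long numGroups(long totalLength, int availableSlots, double waves,
                          long minSize, long maxSize) {
        long desired = Math.max(1, (long) (availableSlots * waves));
        long lengthPerGroup = totalLength / desired;
        if (lengthPerGroup > maxSize) {
            desired = totalLength / maxSize + 1; // groups too big: use more groups
        } else if (lengthPerGroup < minSize) {
            desired = totalLength / minSize + 1; // groups too small: use fewer groups
        }
        return desired;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long gb = 1024L * mb;
        // 128 MB blocks with a 256 MB minimum split size: splits span two blocks.
        long size = splitSize(128 * mb, 256 * mb, Long.MAX_VALUE);
        System.out.println("splits: " + numSplits(10 * gb, size));
        // 100 slots, 1.5 waves, groups kept between 16 MB and 1 GB.
        System.out.println("groups: " + numGroups(10 * gb, 100, 1.5, 16 * mb, gb));
    }
}
```

Either way, the number of tokenizer tasks ends up decoupled from the number of 
HDFS blocks: raising the minimum sizes yields fewer, larger tasks, and lowering 
the maximum sizes yields more, smaller ones.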

>>>  Can a mapper be scheduled at a location different than the location of its 
>>> input block ? If yes, how ?
Yes, it is possible. Tez always uses the split info; there's no option to 
disable it. If you really want to, you need to create a new InputInitializer. 
I think you just need to make a small change to MRInputAMSplitGenerator:

https://github.com/zjffdu/tez/blob/a3a7700dea0a315ad613aa2d8a7223eb73878cb5/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/common/MRInputAMSplitGenerator.java


Specifically, the change goes in the following code snippet:


InputConfigureVertexTasksEvent configureVertexEvent =
    InputConfigureVertexTasksEvent.create(
        inputSplitInfo.getNumTasks(),
        // make code changes here
        VertexLocationHint.create(inputSplitInfo.getTaskLocationHints()),
        InputSpecUpdate.getDefaultSinglePhysicalInputSpecUpdate());
events.add(configureVertexEvent);
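
A sketch of what the changed part might compute: one randomly chosen host per 
task, ignoring the split locations. To keep it runnable without Tez on the 
classpath it returns plain sets of hostnames. In a real InputInitializer each 
set would presumably be wrapped with 
TaskLocationHint.createTaskLocationHint(hosts, null), and the resulting list 
passed to VertexLocationHint.create(...) in place of 
inputSplitInfo.getTaskLocationHints(). The class name, method name and host 
names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Picks an arbitrary host for every task; all names here are hypothetical. */
final class RandomHostHints {

    /**
     * Returns exactly numTasks entries, one single-host set per task.
     * The list size has to match the parallelism of the vertex.
     */
    static List<Set<String>> randomHosts(int numTasks, List<String> clusterHosts, Random rng) {
        if (clusterHosts.isEmpty()) {
            throw new IllegalArgumentException("need at least one host");
        }
        List<Set<String>> hints = new ArrayList<>(numTasks);
        for (int i = 0; i < numTasks; i++) {
            String host = clusterHosts.get(rng.nextInt(clusterHosts.size()));
            hints.add(Collections.singleton(host));
        }
        return hints;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList(
            "node1.example.com", "node2.example.com", "node3.example.com");
        List<Set<String>> hints = randomHosts(5, hosts, new Random(42));
        for (int i = 0; i < hints.size(); i++) {
            System.out.println("task " + i + " -> " + hints.get(i));
        }
        // With Tez available, each set would be wrapped roughly like this (assumption):
        //   TaskLocationHint.createTaskLocationHint(hints.get(i), null)
    }
}
```

The list is built from the task count, so the number of hints and the number 
of tasks cannot drift apart. Keep in mind that these are only hints: YARN may 
still place a container on another node if the requested host has no free 
capacity.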



Best Regard,
Jeff Zhang


From: Raajay <raaja...@gmail.com>
Reply-To: "user@tez.apache.org" <user@tez.apache.org>
Date: Thursday, September 10, 2015 at 2:07 PM
To: "user@tez.apache.org" <user@tez.apache.org>
Subject: Re: Error of setting vertex location hints

The input is an HDFS file. I am trying to create a scenario where the mappers 
(root tasks) are necessarily not executed at the data location. So for now, I 
chose the location hints for the tasks at random. I figured that by populating 
the VertexLocationHint with the addresses of random nodes, I could achieve 
it.

This requires setting the parallelism to the number of elements in the 
VertexLocationHint, which led to the errors.

Summarizing, for the word count example:

1. Can the number of tasks for the tokenizer be a value *NOT* equal to the 
number of HDFS blocks of the file?

2. Can a mapper be scheduled at a location different from the location of its 
input block? If yes, how?

Raajay




On Thu, Sep 10, 2015 at 12:30 AM, Jianfeng (Jeff) Zhang 
<jzh...@hortonworks.com> wrote:
>>> In the WordCount example, while creating the Tokenizer Vertex, neither the 
>>> parallelism or VertexLocation hints is specified. My guess is that at 
>>> runtime, based on InputInitializer, these values are populated.
Correct, the parallelism and VertexLocation are specified at runtime by the 
InputInitializer.

>>> What should I do such that location of the tasks for the Tokenizer vertex 
>>> are not based on HDFS splits but can be arbitrarily configured while 
>>> creation ?
Do you mean your input is not an HDFS file? In that case I think you need to 
create your own DataSourceDescriptor. You can refer to the DataSourceDescriptor 
used by the WordCount example, shown below. If possible, let us know more 
about your context. What kind of data is your input? And how would you specify 
the VertexLocation for your input?


DataSourceDescriptor dataSource = MRInput.createConfigBuilder(new Configuration(tezConf),
    TextInputFormat.class, inputPath).groupSplits(!isDisableSplitGrouping()).build();



Best Regard,
Jeff Zhang


From: Raajay <raaja...@gmail.com>
Reply-To: "user@tez.apache.org" <user@tez.apache.org>
Date: Thursday, September 10, 2015 at 1:10 PM
To: "user@tez.apache.org" <user@tez.apache.org>
Subject: Re: Error of setting vertex location hints

I am just getting started with understanding the Tez code, so bear with me; I 
might be wrong here.

In the WordCount example, while creating the Tokenizer Vertex, neither the 
parallelism nor the VertexLocation hints are specified. My guess is that at 
runtime, based on the InputInitializer, these values are populated.

However, I do not want them to be populated at runtime, but rather want them 
specified while creating the DAG itself. When I do that, I get the exception 
mentioned in the previous mail.

What should I do such that location of the tasks for the Tokenizer vertex are 
not based on HDFS splits but can be

Creating TaskLocationHints

2015-09-10 Thread Raajay
While creating TaskLocationHints, using the static function

TaskLocationHint.createTaskLocationHint(Set<String> nodes, Set<String> racks)

what should the Strings be? IP addresses of the nodes? Node labels? Or
hostnames?

Thanks
Raajay