Parallel queries/dags running in same AM?

2015-03-09 Thread Fabio C.
Hi all,
I've been using Tez with Hive, and I overheard a conversation that doesn't
match my current understanding. Can anyone confirm the following
statements?
(1)- Every Tez AM can run just a single query/DAG at a time. So within a
given AM, several DAGs can be executed only in sequential order (a.k.a. a
session), not in parallel. To execute DAGs in parallel we always need
several AMs.
(2)- The AM is user-specific, and each user is expected to run queries
through their own AM (or on multiple AMs if there is a need for parallelism).
(3)- Several users can submit their DAGs as the same user (e.g. through
HiveServer2), but in this case we will still have several AMs.

Thanks in advance

Fabio


RE: Parallel queries/dags running in same AM?

2015-03-09 Thread Bikas Saha
(1)- For every TEZ AM it is possible to launch just a single query/DAG at a 
time. So within a given AM several DAGs can be executed only in sequential 
order (a.k.a. a session), not in parallel. To execute DAGs in parallel we 
always need several AMs.

Correct. Today a single AM accepts and runs a new DAG only when it is idle,
that is, when no DAG is currently running on it.

(2)- The AM is user-specific, and each user is expected to run queries 
through its own AM (or on multiple AMs if there is a need for parallelism).

Correct in a secure cluster. In a non-secure cluster an AM runs as the yarn
user, which is common to all AMs. In a secure cluster, any entity that has been
given a client token (for that app attempt) by the RM can communicate with the
AM. In a non-secure cluster, any entity that has obtained the AM's connection
information from the RM can communicate with the AM. The AM also has an
additional set of ACLs that determine who can submit, view, or modify DAGs.

(3)- Several users can submit their DAGs as the same user (e.g.: through
hiveserver2), but in this case we will still have several AMs.

Correct. However, the number of AMs will be determined by the policy of the
mediating server. It may choose to launch a new AM for every new DAG, or to
queue DAGs and round-robin them across a limited set of AMs, etc.
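
As a rough illustration of the second policy, here is a minimal Python sketch
of a mediating server that round-robins DAGs over a fixed set of AMs, where
each AM accepts a DAG only while idle. This is a toy model, not Tez or
HiveServer2 code: AmSession, SessionPool, and the "am-N" / "dag-N" names are
all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AmSession:
    """Toy stand-in for one AM: runs at most one DAG at a time."""
    name: str
    running: Optional[str] = None
    completed: list = field(default_factory=list)

    def is_idle(self) -> bool:
        return self.running is None

    def submit(self, dag: str) -> None:
        # An AM only accepts a new DAG when no DAG is running on it.
        if not self.is_idle():
            raise RuntimeError(f"{self.name} is busy with {self.running}")
        self.running = dag

    def finish(self) -> None:
        if self.running is None:
            raise RuntimeError(f"{self.name} has no running DAG")
        self.completed.append(self.running)
        self.running = None


class SessionPool:
    """Hypothetical mediating server with a limited set of AMs."""

    def __init__(self, size: int) -> None:
        self.sessions = [AmSession(f"am-{i}") for i in range(size)]
        self.waiting = deque()  # DAGs queued while every AM is busy
        self._next = 0          # round-robin cursor

    def submit(self, dag: str) -> Optional[str]:
        """Place the DAG on the next idle AM, or queue it if all are busy."""
        n = len(self.sessions)
        for step in range(n):
            idx = (self._next + step) % n
            session = self.sessions[idx]
            if session.is_idle():
                session.submit(dag)
                self._next = (idx + 1) % n
                return session.name
        self.waiting.append(dag)
        return None

    def finish(self, name: str) -> None:
        """Mark the AM's DAG as done and hand it the oldest queued DAG."""
        session = next(s for s in self.sessions if s.name == name)
        session.finish()
        if self.waiting:
            session.submit(self.waiting.popleft())


pool = SessionPool(size=2)
placements = [pool.submit(d) for d in ("dag-1", "dag-2", "dag-3")]
print(placements)                # the third DAG has to wait in the queue
pool.finish("am-0")
print(pool.sessions[0].running)  # the queued DAG now runs on the freed AM
```

With two AMs, at most two DAGs run in parallel; a third DAG waits until one
AM becomes idle. Launching a new AM per DAG would correspond to growing the
pool instead of queueing.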

Bikas

From: Fabio C. [mailto:anyte...@gmail.com]
Sent: Monday, March 09, 2015 4:31 AM
To: user@tez.apache.org; u...@hive.apache.org
Subject: Parallel queries/dags running in same AM?



Re: Parallel queries/dags running in same AM?

2015-03-09 Thread Hitesh Shah
A clarification for (2): you can share an AM across multiple users by using
some form of proxy users and passing in the required delegation tokens to talk
to various services such as HDFS. Also, when its doAs mode is set to false,
HiveServer2 runs all AMs as user hive but can effectively run queries for
various different users by doing its security check at the “perimeter”.

— Hitesh

On Mar 9, 2015, at 10:30 AM, Bikas Saha bi...@hortonworks.com wrote:




Re: What is recommended memory setting for tez.am and tez task?

2015-03-09 Thread Hitesh Shah
Hello Alexander,

Are you using Tez natively or via Hive/Pig/Cascading, etc? 

Most users I have encountered tend to have tez.am.resource.memory.mb sized
between 4 and 8 GB, though in some cases (until TEZ-776 is addressed) this
might need to be increased for DAGs which have very high parallelism and
large scatter-gather edges. (4 GB is not a minimum requirement, but in
general most YARN clusters end up having their minimum allocation configured
to 4 GB or so in any case.)

As for the task memory, it depends on the kind of workload, and there are no
standard guidelines from a general Tez perspective. A general rule of thumb on
a YARN cluster is to set it to at least the configured minimum size of a YARN
container (the minimum-allocation setting). Hive does not use this value and
overrides it directly via its hive.tez.container.size setting. I am not sure
whether Pig has its own override configuration property or treats the Tez
task memory property as a passthrough.

For both of the above, Tez automatically sets the Xmx value for the JVM to
around 0.8 of the container size if it has not been set by the user (a general
recommendation is to not configure -Xmx in the java opts for this reason).
Furthermore, most of the buffers used by the built-in inputs/outputs usually
get auto-scaled down based on the size of the available JVM heap.
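
To make the 0.8 figure concrete, here is a small Python sketch of the
arithmetic. It only assumes the approximate 0.8 fraction mentioned above; the
function name default_heap_mb and the rounding down to whole megabytes are
illustrative assumptions, not what Tez necessarily does internally.

```python
def default_heap_mb(container_mb: int, heap_fraction: float = 0.8) -> int:
    """Approximate JVM heap (MB) for a container when -Xmx is not set.

    heap_fraction=0.8 follows the "around 0.8" figure from this thread;
    truncating to whole megabytes is an assumption of this sketch.
    """
    if container_mb <= 0:
        raise ValueError("container size must be positive")
    return int(container_mb * heap_fraction)


for container_mb in (1024, 4096, 8192):
    heap = default_heap_mb(container_mb)
    # Whatever is left over is headroom for non-heap JVM memory.
    print(f"container={container_mb} MB -> approx -Xmx{heap}m, "
          f"headroom={container_mb - heap} MB")
```

The remaining roughly 20% of the container is what keeps the process within
the YARN container limit once non-heap memory is counted, which is why
setting -Xmx by hand is discouraged above.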

thanks
— Hitesh


On Mar 9, 2015, at 4:04 PM, Alexander Pivovarov apivova...@gmail.com wrote:

 Hi Everyone
 
 What is recommended value for
 
 tez.am.resource.memory.mb
 
 tez.task.resource.memory.mb
 
 
 Thank you