Parallel queries/dags running in same AM?
Hi all,

I've been using Tez on Hive, and I overheard a conversation that conflicts with my current understanding. Can anyone confirm the following statements?

(1) A Tez AM can run only a single query/DAG at a time. So within a given AM, several DAGs can be executed only sequentially (a.k.a. a session), not in parallel. To execute DAGs in parallel we always need several AMs.

(2) The AM is user-specific, and each user is expected to run queries through their own AM (or on multiple AMs if parallelism is needed).

(3) Several users can submit their DAGs as the same user (e.g. through HiveServer2), but in this case we will still have several AMs.

Thanks in advance,
Fabio
RE: Parallel queries/dags running in same AM?
(1)- For every TEZ AM it is possible to launch just a single query/DAG at a time. So within a given AM several DAGs can be executed only in sequential order (a.k.a. a session), not in parallel. To execute DAGs in parallel we always need several AMs.

Correct. Today a single AM accepts and runs a new DAG only when it is idle. An AM is idle when no DAG is running in it.

(2)- The AM is user-specific, and each user is expected to run queries through its own AM (or on multiple AMs if there is a need for parallelism).

Correct in a secure cluster. In a non-secure cluster an AM runs as the yarn user, which is common to all AMs. In a secure cluster, any entity that has been given a client token (for that app attempt) by the RM can communicate with the AM. In a non-secure cluster, any entity that has obtained the AM's connection information from the RM can communicate with the AM. The AM has an additional set of ACLs that determine who can submit, view, and modify DAGs.

(3)- Several users can submit their DAGs as the same user (e.g.: through hiveserver2), but in this case we will still have several AM.

Correct. However, the number of AMs is determined by the policy of the mediating server. It may choose to launch a new AM for every new DAG, or to queue DAGs up and round-robin through a limited set of AMs, etc.

Bikas

From: Fabio C. [mailto:anyte...@gmail.com]
Sent: Monday, March 09, 2015 4:31 AM
To: user@tez.apache.org; u...@hive.apache.org
Subject: Parallel queries/dags running in same AM?
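The "queue up and round robin through a limited set of AMs" policy mentioned in the reply can be sketched as a small simulation. This is only an illustration, not Tez or HiveServer2 code: the `AmSession` and `RoundRobinPool` classes and their methods are hypothetical names, and the only behaviour taken from the thread is that an AM runs one DAG at a time and accepts the next one when idle.

```python
from collections import deque
from itertools import cycle


class AmSession:
    """Hypothetical stand-in for one Tez AM session: runs one DAG at a time."""

    def __init__(self, name):
        self.name = name
        self.running = None      # DAG currently executing, or None when idle
        self.pending = deque()   # DAGs waiting for this AM to become idle

    def is_idle(self):
        return self.running is None

    def submit(self, dag):
        # An idle AM starts the DAG immediately; a busy one queues it.
        if self.is_idle():
            self.running = dag
        else:
            self.pending.append(dag)

    def finish_current(self):
        # Mark the running DAG as done and start the next queued DAG, if any.
        done = self.running
        self.running = self.pending.popleft() if self.pending else None
        return done


class RoundRobinPool:
    """Hypothetical mediating-server policy: rotate over a fixed set of AMs."""

    def __init__(self, size):
        self.sessions = [AmSession(f"am-{i}") for i in range(size)]
        self._next = cycle(self.sessions)

    def submit(self, dag):
        session = next(self._next)
        session.submit(dag)
        return session.name

    def running(self):
        return {s.name: s.running for s in self.sessions}


pool = RoundRobinPool(size=2)
placement = [pool.submit(dag) for dag in ["q1", "q2", "q3"]]
```

With two AMs and three DAGs, two DAGs run in parallel (one per AM) and the third waits behind the first on the same AM, which matches the "several AMs for parallelism" point in (1).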
Re: Parallel queries/dags running in same AM?
A clarification for (2): you can share an AM across multiple users by using some form of proxy users and passing in the delegation tokens required to talk to services such as HDFS. Also, when its doAs mode is set to false, HiveServer2 runs all AMs as user hive but can still run queries on behalf of different users by doing its security check at the “perimeter”.

-- Hitesh

On Mar 9, 2015, at 10:30 AM, Bikas Saha bi...@hortonworks.com wrote:
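The "security check at the perimeter" idea can be illustrated with a minimal sketch. Everything here is hypothetical (the `ALLOWED` table, the `submit_query` function, and the returned dictionary); it does not reflect HiveServer2's actual authorization code. The only fact taken from the thread is that, with doAs set to false, the AMs run as user hive while queries are executed on behalf of different end users.

```python
SERVICE_USER = "hive"  # per the thread: with doAs false, AMs run as user hive

# Hypothetical permission table; a real deployment would consult its own authorizer.
ALLOWED = {"alice": {"sales"}, "bob": {"sales", "hr"}}


def submit_query(end_user, database, query):
    """Authorize the end user at the perimeter, then submit as the service user."""
    if database not in ALLOWED.get(end_user, set()):
        raise PermissionError(f"{end_user} may not query {database}")
    # The shared AM only ever sees the service user; the end user is kept for auditing.
    return {"am_user": SERVICE_USER, "requested_by": end_user, "query": query}
```

The design point is that the AM itself never needs to distinguish end users: the mediating server rejects unauthorized requests before anything reaches the AM.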
Re: What is recommended memory setting for tez.am and tez task?
Hello Alexander,

Are you using Tez natively or via Hive/Pig/Cascading, etc.?

Most users I have encountered size tez.am.resource.memory.mb between 4 and 8 GB, though in some cases (until TEZ-776 is addressed) this might need to be increased for DAGs that have very high parallelism and large scatter-gather edges. (4 GB is not a minimum requirement, but in general most YARN clusters end up having their minimum allocation configured to 4 GB or so in any case.)

As for the task memory, it depends on the kind of workload, and there are no standard guidelines from a general Tez perspective. A general rule of thumb on a YARN cluster is to set it to at least the configured minimum size of a YARN container (the minimum-allocation setting). Hive does not use this value and overrides it directly via its hive.tez.container.size setting. I am not sure whether Pig has its own override configuration property or treats the Tez task memory property as a passthrough.

For both of the above, Tez automatically sets the Xmx value for the JVM to around 0.8 of the container size if it has not been set by the user (for this reason, a general recommendation is not to configure -Xmx in the Java opts). Furthermore, most of the buffers used by the built-in inputs/outputs are auto-scaled down based on the size of the available JVM heap.

thanks
-- Hitesh

On Mar 9, 2015, at 4:04 PM, Alexander Pivovarov apivova...@gmail.com wrote:

Hi Everyone
What is recommended value for
tez.am.resource.memory.mb
tez.task.resource.memory.mb
Thank you
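To make the sizing arithmetic concrete, here is a rough sketch of the two rules of thumb from the reply: heap at about 0.8 of the container size, and task memory at least the YARN minimum allocation. The function names are hypothetical, and the rounding is an assumption; Tez's own calculation may differ slightly, so treat the numbers as approximate.

```python
def approx_heap_mb(container_mb, heap_fraction=0.8):
    """Approximate -Xmx (in MB) when the user has not set it explicitly.

    heap_fraction=0.8 follows the "around 0.8 of the container size" figure
    from the reply; truncating to an integer is an assumption of this sketch.
    """
    return int(container_mb * heap_fraction)


def task_memory_mb(requested_mb, yarn_min_allocation_mb):
    """Rule of thumb: use at least the YARN minimum container allocation."""
    return max(requested_mb, yarn_min_allocation_mb)


am_heap = approx_heap_mb(4096)        # 4 GB AM container
task_mb = task_memory_mb(1024, 4096)  # request below the cluster minimum
task_heap = approx_heap_mb(task_mb)
```

For a 4 GB container this gives a heap of roughly 3.2 GB, leaving the remainder for non-heap JVM overhead, which is why setting -Xmx by hand is usually unnecessary.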