RE: Question Tez under the hood

Bikas Saha Tue, 04 Nov 2014 11:47:23 -0800

To be clear, HiveServer2 and Hive CLI use TezSessions and try to reuse the
session across queries. Hive CLI will typically end up using only 1
session since the CLI blocks until the current query completes.
HiveServer2 has concurrent query support and has its own logic about when
a Tez session can be re-used.

On running multiple queries in the AM: I believe there was a JIRA for
that; if not, we should file one to track it. Queries in Hive typically
follow a V shape, with high parallelism at the beginning that tapers off.
It may be possible to get significant gains by pipelining queries
sequentially, so that the next query fills the capacity left unused by the
current query as it winds down.
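
As a rough illustration of the pipelining idea, here is a toy Python model.
It is not Tez or Hive code: the function name, the one-tick-per-task
assumption and the stage sizes are all made up for the example. Each query
is a list of stage sizes that shrinks over time (the V shape), and a later
query may only use slots the earlier ones leave free.

```python
def makespan(queries, capacity, pipelined):
    """Ticks needed to finish all queries on `capacity` slots (toy model).

    Each query is a list of positive stage sizes (tasks per stage). Every
    task takes one tick, and a stage starts only after the previous stage
    of the same query has finished.
    """
    pending = [list(q) for q in queries if q]
    ticks = 0
    while pending:
        free = capacity
        # Without pipelining, only the oldest query may use the slots.
        runnable = pending if pipelined else pending[:1]
        for stages in runnable:
            # Older queries are served first; later ones get the leftovers.
            used = min(free, stages[0])
            stages[0] -= used
            free -= used
        for stages in runnable:
            if stages[0] == 0:
                stages.pop(0)  # stage finished; the next one starts next tick
        pending = [s for s in pending if s]
        ticks += 1
    return ticks


v_shaped = [[8, 4, 2, 1]] * 3  # hypothetical stage sizes
one_at_a_time = makespan(v_shaped, capacity=8, pipelined=False)
overlapped = makespan(v_shaped, capacity=8, pipelined=True)
```

With these made-up numbers, running the three queries one at a time takes
12 ticks, while letting each query fill the slots freed by its predecessor
takes 8.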

Bikas

-----Original Message-----
From: Hitesh Shah [mailto:[email protected]]
Sent: Monday, November 03, 2014 7:46 PM
To: [email protected]
Subject: Re: Question Tez under the hood

Hi

For the most part, each Hive CLI session or JDBC/ODBC connection to
HiveServer2 maps to a single Application Master. HiveServer2 does have
some optimizations, though (to avoid the overhead of launching a new AM):
it tries to keep a pool of ApplicationMasters around and does some
scheduling across them. When the number of queries is high, I am not sure
whether it starts spawning new AMs or queues up queries. That question is
probably best asked on the Hive mailing lists.
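
To make the pooling idea concrete, here is a minimal Python sketch of a
fixed-size pool that queues queries when every AM is busy. This is only one
of the two possible behaviours mentioned above (the other being to spawn
more AMs), and it is not HiveServer2's actual implementation; the class and
method names are hypothetical.

```python
from collections import deque


class AmPool:
    """Hypothetical pool of pre-launched AM sessions (not a Hive class)."""

    def __init__(self, size):
        self.idle = deque(f"am-{i}" for i in range(size))
        self.waiting = deque()

    def submit(self, query):
        """Return the session assigned to `query`, or None if it was queued."""
        if self.idle:
            return self.idle.popleft()
        self.waiting.append(query)
        return None

    def release(self, session):
        """Give a finished session to the next waiting query, else park it.

        Returns the (query, session) pair that was started, or None.
        """
        if self.waiting:
            return self.waiting.popleft(), session
        self.idle.append(session)
        return None
```

Reusing an idle session avoids paying the AM launch cost again, at the
price of making queries wait when the pool is exhausted.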

As for making the AM able to handle multiple DAGs concurrently, the
difficulty lies less in implementing that and more in whether the cluster
has enough capacity to run that many queries/DAGs concurrently. The saving
from running multiple queries in a single AM is only the resources that
each additional AM would have used. In the end, throughput may not
increase by much if there are not enough resources to run the containers
needed by all the tasks of each of these queries.
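
A back-of-the-envelope version of that argument, in Python. The cluster
size, AM size and container size below are invented for illustration, and
the helper name is hypothetical; the point is only that sharing an AM frees
the memory the extra AMs would have held and nothing more.

```python
def task_containers(cluster_mb, am_mb, task_mb, num_ams):
    """Task containers that fit after reserving memory for `num_ams` AMs."""
    remaining = cluster_mb - num_ams * am_mb
    return max(remaining, 0) // task_mb


# Hypothetical cluster: 100 GB total, 2 GB per AM, 1 GB per task container.
one_am_per_query = task_containers(102400, 2048, 1024, num_ams=10)
single_shared_am = task_containers(102400, 2048, 1024, num_ams=1)
```

With these numbers, ten queries with one AM each leave room for 80 task
containers, and a single shared AM leaves room for 98. If the ten queries
together need far more than that, they are bottlenecked on task capacity
either way.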

On the other hand, there have been some discussions about supporting
concurrent DAGs within a single AM. This raises problems similar to those
of the JobTracker in Hadoop 1.x, i.e., the Tez AM now has to decide
priorities across different DAGs and how to allocate containers to
complete the tasks of each DAG. From a YARN point of view, the Tez AM is a
single application, so all resource management, prioritization and
preemption across the multiple queries falls to the Tez AM, unlike the
case where each query has its own AM.
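
As a sketch of the kind of decision the AM would have to make itself, here
is a strict-priority allocator in Python. This is an assumed policy chosen
for illustration, not something Tez implements; the function name and the
(name, priority, pending_tasks) tuple layout are hypothetical.

```python
def allocate(free_containers, dags):
    """Split free containers across DAGs in strict priority order.

    `dags` is a list of (name, priority, pending_tasks) tuples; a lower
    priority number is served first. Returns {name: containers_granted}.
    """
    grants = {}
    for name, _priority, pending in sorted(dags, key=lambda d: d[1]):
        granted = min(free_containers, pending)
        grants[name] = granted
        free_containers -= granted
    return grants


grants = allocate(10, [("etl", 2, 8), ("report", 1, 6)])
```

Here the higher-priority "report" DAG gets all 6 containers it asks for and
"etl" gets the remaining 4. A real scheduler would also need fairness and
preemption, which is the part YARN otherwise handles when each query has
its own AM.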

- Hitesh

On Nov 3, 2014, at 7:26 PM, VJ Anand <[email protected]> wrote:

> I have a follow-up question -- Bikas mentioned that the Tez App Master
> submits one DAG at a time -- Now, for a Query engine like Hive, where
> there would be multiple requests, how is this handled? Are we creating
> multiple App Masters that round robins between them? Even then, when large
> number of requests are submitted to the Hive server, if the App master can
> submit only one DAG at a time, we would have situations where there would
> be many outstanding requests. Is there a way we can make the App Master
> multi-threaded?
>
> --
> VJ Anand
>

