Re: Question Tez under the hood

VJ Anand Tue, 04 Nov 2014 15:38:35 -0800

I agree, if the cluster resources are fully utilized then it is a moot
question. But consider the case where I can support a dedicated
cluster that is not shared with any other workloads or workload types. I
was concerned about the lack of multi-threading support within the Tez
AM -- the current requirement is in the range of ~200 queries per minute
(these are concurrent requests). Do you still suggest/advise that I build
on top of this framework?
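
As a rough sizing aid for the ~200 queries per minute figure, here is a
back-of-the-envelope sketch using Little's law (average number in flight =
arrival rate x average time in the system). The average query durations
below are assumptions for illustration only, and `sessions_needed` is a
hypothetical helper, not part of Tez. It estimates how many
one-DAG-at-a-time sessions would be busy on average; it says nothing about
whether the cluster has the capacity to run them.

```python
import math


def sessions_needed(queries_per_min: float, avg_query_seconds: float) -> int:
    """Estimate average concurrent sessions via Little's law.

    Each session is assumed to run exactly one query (DAG) at a time,
    so the number of queries in flight equals the number of busy sessions.
    """
    # in_flight = (queries per second) * (seconds per query)
    in_flight = queries_per_min * avg_query_seconds / 60
    return math.ceil(in_flight)


# Assumed durations, not taken from the thread.
print(sessions_needed(200, 3.0))   # 10
print(sessions_needed(200, 30.0))  # 100
```

The estimate is an average; bursts above it would either queue or need
spare sessions, depending on the policy of the service in front.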


-VJ


On Tue, Nov 4, 2014 at 2:13 PM, Bikas Saha <[email protected]> wrote:

> What kind of concurrency load are we talking about here?
>
>
>
> Note that HiveServer2 and similar systems are currently being built using
> Tez and support concurrency using multiple Tez sessions. If the system is
> being fully used then it is orthogonal that a single Tez AM cannot support
> concurrent DAGs because the system capacity is already fully utilized. The
> service can accept high concurrency but can execute only as much as the
> cluster capacity allows. Sharing the cluster capacity between the queries
> depends on that service's policy, e.g. FIFO, fair-share, etc.
>
>
>
> Bikas
>
>
>
> *From:* VJ Anand [mailto:[email protected]]
> *Sent:* Tuesday, November 04, 2014 1:24 PM
>
> *To:* [email protected]
> *Subject:* Re: Question Tez under the hood
>
>
>
> Thanks for the info and response. The purpose of asking about Hive
> was to see whether a query engine that I have been working on could be
> moved to use Tez as its execution layer. Currently, this query engine
> supports a large number of concurrent user requests, and needs to do so
> going forward. Given the current Tez AM limitations, would it be possible
> for me to purpose-build a YARN AM that can leverage the Tez DAG execution
> framework? In other words, the AM would support the concurrent use cases,
> etc., needed, but at the same time leverage the DAG APIs and the
> framework? If so, any pointers?
>
>
>
> -VJ
>
>
>
> On Tue, Nov 4, 2014 at 11:44 AM, Bikas Saha <[email protected]> wrote:
>
> To be clear, HiveServer2 and Hive CLI use TezSessions and try to reuse the
> session across queries. Hive CLI will typically end up using only 1
> session since the CLI blocks until the current query completes.
> HiveServer2 has concurrent query support and has its own logic about when
> a Tez session can be re-used.
>
> About running multiple queries in the AM: I believe there was a JIRA for
> that. If not, we should track that. Queries in Hive typically follow a V
> shape with a large parallelism in the beginning that tapers off. It may be
> possible to get significant gains by pipelining queries sequentially where
> the next query fills up the unused space left behind by the current query
> as it winds down.
>
> Bikas
>
>
> -----Original Message-----
> From: Hitesh Shah [mailto:[email protected]]
> Sent: Monday, November 03, 2014 7:46 PM
> To: [email protected]
> Subject: Re: Question Tez under the hood
>
> Hi
>
> For the most part, each Hive CLI session or JDBC/ODBC connection to
> HiveServer2 would map to a single Application Master. HiveServer does have
> some optimizations though (to avoid the overhead cost of launching a new
> AM) where it tries to keep a pool of ApplicationMasters around and does
> some scheduling around them. In cases where the number of queries is high,
> I am not sure whether it starts spawning new AMs or queues up queries.
> That is probably best asked on the Hive mailing lists.
>
> As for making the AM able to handle multiple DAGs concurrently, the
> problem lies less in fixing that and more in whether a cluster
> has enough capacity to handle that many queries/DAGs concurrently. The
> savings from running multiple queries in a single AM amount to the
> resources utilized per AM. In the end, the level of throughput may not
> increase by much if there are not enough resources to run the containers
> needed by all the tasks of each of these queries.
>
> On the other hand, there have been some discussions around looking at
> supporting concurrent DAGs within a single AM. This has interesting
> problems similar to those of the JobTracker in Hadoop 1.x, i.e. the Tez AM
> now has to decide priorities across different DAGs and decide how to
> allocate containers to complete the tasks for each DAG. From a YARN point
> of view, the Tez AM is a single application and therefore all resource
> management/prioritization/preemption now falls onto the Tez AM to manage
> the multiple queries unlike in the case where each query has its own AM.
>
> - Hitesh
>
> On Nov 3, 2014, at 7:26 PM, VJ Anand <[email protected]> wrote:
>
> > I have a follow-up question -- Bikas mentioned that the Tez App Master
> > submits one DAG at a time. For a query engine like Hive, where there
> > would be multiple requests, how is this handled? Are we creating
> > multiple App Masters and round-robining between them? Even then, when a
> > large number of requests are submitted to the Hive server, if the App
> > Master can submit only one DAG at a time, we would have situations where
> > there would be many outstanding requests. Is there a way we can make the
> > App Master multi-threaded?
> >
> > --
> > VJ Anand
> >
>
> --
>
> *VJ Anand*
>
>
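
The pooling discussed in this thread (a fixed set of sessions, each running
one DAG at a time, with additional queries waiting for a free session) can
be modeled roughly as below. This is a plain Python simulation, not the Tez
or HiveServer2 API: `SessionPool`, `run`, and `demo` are hypothetical names,
the sleep stands in for DAG execution, and no particular fairness policy is
modeled. Sessions are simply handed out as they become free.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SessionPool:
    """Hypothetical stand-in for a pool of AM sessions (one DAG each)."""

    def __init__(self, size):
        self._idle = queue.Queue()
        for i in range(size):
            self._idle.put(f"session-{i}")

    def run(self, dag, execute):
        session = self._idle.get()   # blocks until some session is free
        try:
            return execute(session, dag)
        finally:
            self._idle.put(session)  # hand the session back for reuse


def demo(num_queries=8, pool_size=3):
    """Submit num_queries at once; return results and peak concurrency."""
    pool = SessionPool(pool_size)
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def execute(session, dag):
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.01)             # placeholder for running the DAG
        with lock:
            in_flight -= 1
        return (dag, session)

    # One client thread per query, so all queries arrive concurrently.
    with ThreadPoolExecutor(max_workers=num_queries) as clients:
        results = list(clients.map(lambda d: pool.run(d, execute),
                                   range(num_queries)))
    return results, peak


results, peak = demo()
print(len(results), "queries completed; peak concurrent DAGs:", peak)
```

The service accepts all eight requests at once, but the peak number of DAGs
executing never exceeds the pool size, which mirrors the point that accepted
concurrency and executed concurrency are separate concerns.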
