I agree: if the cluster resources are fully utilized, then it is a moot question. But consider the case where I can support a dedicated cluster that is not shared with any other workloads or workload types. There, I am concerned about the lack of multi-threading support within the Tez AM. The current requirements are in the range of ~200 queries per minute (these are concurrent requests). Would you still advise that I build on top of this framework?
-VJ

On Tue, Nov 4, 2014 at 2:13 PM, Bikas Saha <[email protected]> wrote:

> What kind of concurrency load are we talking about here?
>
> Note that HiveServer2 and similar systems are currently being built
> using Tez and support concurrency using multiple Tez sessions. If the
> system is being fully used, then it's orthogonal that a single Tez AM
> cannot support concurrent DAGs, because the system capacity is already
> fully utilized. The service can accept high concurrency but can execute
> only as much as the cluster capacity allows. Sharing the cluster
> capacity between the queries depends on that service's policy, e.g.
> FIFO, fair-share, etc.
>
> Bikas
>
> *From:* VJ Anand [mailto:[email protected]]
> *Sent:* Tuesday, November 04, 2014 1:24 PM
> *To:* [email protected]
> *Subject:* Re: Question Tez under the hood
>
> Thanks for the info and response. The purpose of asking regarding Hive
> was to see whether a query engine that I have been working on could be
> moved to use Tez as its execution layer. Currently, this query engine
> supports a large number of concurrent user requests, and needs to do so
> going forward. Given the current Tez AM limitations, would it be
> possible for me to purpose-build a YARN AM that can leverage the Tez
> DAG execution framework? In other words, the AM would support the
> concurrent use cases, etc., needed, but at the same time leverage the
> DAG APIs and the framework? If so, any pointers?
>
> -VJ
>
> On Tue, Nov 4, 2014 at 11:44 AM, Bikas Saha <[email protected]> wrote:
>
> To be clear, HiveServer2 and Hive CLI use TezSessions and try to reuse
> the session across queries. Hive CLI will typically end up using only 1
> session since the CLI blocks until the current query completes.
> HiveServer2 has concurrent query support and has its own logic about
> when a Tez session can be re-used.
>
> About running multiple queries in the AM: I believe there was a jira
> for that. If not, we should track that.
> Queries in Hive typically follow a V shape, with a large parallelism in
> the beginning that tapers off. It may be possible to get significant
> gains by pipelining queries sequentially, where the next query fills up
> the unused space left behind by the current query as it winds down.
>
> Bikas
>
> -----Original Message-----
> From: Hitesh Shah [mailto:[email protected]]
> Sent: Monday, November 03, 2014 7:46 PM
> To: [email protected]
> Subject: Re: Question Tez under the hood
>
> Hi
>
> For the most part, each Hive CLI session or JDBC/ODBC connection to
> HiveServer2 would map to a single Application Master. HiveServer does
> have some optimizations though (to avoid the overhead cost of launching
> a new AM), where it tries to keep a pool of ApplicationMasters around
> and does some scheduling around them. In cases where the number of
> queries is high, I am not sure whether it starts spawning new AMs or
> queues up queries. That is probably best asked on the Hive mailing
> lists.
>
> As for making the AM able to handle multiple DAGs concurrently, the
> problem does not lie in fixing that but more in terms of whether a
> cluster has enough capacity to handle that many queries/DAGs
> concurrently. The amount of savings in running multiple queries in a
> single AM is the resources utilized per AM. In the end, the level of
> throughput may not increase by much if there are not enough resources
> to run the containers needed by all the tasks of each of these queries.
>
> On the other hand, there have been some discussions around looking at
> supporting concurrent DAGs within a single AM. This has interesting
> problems similar to that of the JobTracker in Hadoop 1.x, i.e. the Tez
> AM now has to decide priorities across different DAGs and decide how to
> allocate containers to complete the tasks for each DAG.
> From a YARN point of view, the Tez AM is a single application, and
> therefore all resource management/prioritization/preemption now falls
> onto the Tez AM to manage the multiple queries, unlike in the case
> where each query has its own AM.
>
> - Hitesh
>
> On Nov 3, 2014, at 7:26 PM, VJ Anand <[email protected]> wrote:
>
> > I have a follow-up question. Bikas mentioned that the Tez App Master
> > submits one DAG at a time. Now, for a query engine like Hive, where
> > there would be multiple requests, how is this handled? Are we
> > creating multiple App Masters and round-robining between them? Even
> > then, when a large number of requests are submitted to the Hive
> > server, if the App Master can submit only one DAG at a time, we would
> > have situations where there would be many outstanding requests. Is
> > there a way we can make the App Master multi-threaded?
> >
> > --
> > VJ Anand
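
To make the sizing question above concrete, here is a minimal sketch in plain Python (deliberately not written against the Tez or Hive APIs). It shows two ideas from the thread: estimating how many single-DAG sessions a given query rate keeps busy (Little's law: mean work in flight = arrival rate x mean latency), and serving concurrent requests from a fixed pool in which each session runs one DAG at a time and extra requests wait. The names `sessions_needed` and `SessionPool`, the session labels, and the 1.5 second mean query time are all hypothetical and chosen only for illustration; this is not a description of how HiveServer2 actually manages its AM pool.

```python
import math
import queue
import time
from concurrent.futures import ThreadPoolExecutor


def sessions_needed(queries_per_min: float, mean_query_secs: float) -> int:
    """Little's law estimate of sessions kept busy on average.

    Assumes each session runs exactly one query (DAG) at a time.
    """
    return math.ceil(queries_per_min * mean_query_secs / 60.0)


class SessionPool:
    """Hypothetical pool; each string stands in for one AM/session."""

    def __init__(self, size: int):
        self._idle = queue.Queue()
        for i in range(size):
            self._idle.put(f"session-{i}")

    def run(self, dag, timeout=None):
        # Blocks while every session is busy, so excess queries queue up
        # instead of running concurrently inside one session.
        session = self._idle.get(timeout=timeout)
        try:
            return dag(session)
        finally:
            # Always hand the session back, even if the DAG failed.
            self._idle.put(session)


if __name__ == "__main__":
    # 200 queries/min is the figure from the thread; 1.5 s is an assumption.
    size = sessions_needed(queries_per_min=200, mean_query_secs=1.5)
    pool = SessionPool(size)

    def fake_dag(session):
        time.sleep(0.01)  # stand-in for real DAG execution
        return session

    with ThreadPoolExecutor(max_workers=20) as clients:
        used = set(clients.map(lambda _: pool.run(fake_dag), range(20)))
    print(size, len(used) <= size)
```

The estimate only covers how many sessions are busy on average. As Hitesh and Bikas point out, whether the cluster has enough container capacity to run that many DAGs at once is a separate question that a pool like this does not answer.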
