Re: Tez performance on Hive

Hitesh Shah Fri, 20 Jun 2014 11:42:41 -0700

The main config that controls how long containers are kept is 
"tez.am.container.session.delay-allocation-millis". Setting it to a higher 
value tells the AM to retain idle containers for a longer period. Increasing 
it, though, will have a negative effect on other users of the cluster, as idle 
resources will be retained by the Tez application.


— Hitesh
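
For reference, the setting named above can be written out as a Hadoop-style
property entry. The sketch below is only an illustration: it assumes the
entry is placed in tez-site.xml, and the 60000 ms value is a hypothetical
example, not a recommended setting.

```python
import xml.etree.ElementTree as ET

# Property name taken from the message above.
PROPERTY_NAME = "tez.am.container.session.delay-allocation-millis"


def render_property(name: str, value_millis: int) -> str:
    """Render one Hadoop-style <property> element as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = str(value_millis)
    return ET.tostring(prop, encoding="unicode")


# 60000 ms (one minute) is a hypothetical value chosen for illustration only.
snippet = render_property(PROPERTY_NAME, 60_000)
print(snippet)
```

A larger value keeps idle containers around for longer, with the cost to
other cluster users described above.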


On Jun 20, 2014, at 11:27 AM, Lars Selsaas <[email protected]> 
wrote:

> I'm also wondering which settings I can play around with to affect this? Say 
> I want to make my jobs keep stuff longer.
> 
> Thanks,
> Lars
> 
> 
> On Fri, Jun 20, 2014 at 11:08 AM, Lars Selsaas 
> <[email protected]> wrote:
> Thanks!
> 
> Hopefully I'm getting the correct logs here:
> 
> It seems the same application manager keeps on taking the requests.
> 
> They both get the same application ID: application_1403285786962_0002
> dag_1403285786962_0004_1.dot : Total file length is 2179 bytes.
> dag_1403285786962_0004_2.dot : Total file length is 2179 bytes.
> dag_1403285786962_0004_3.dot : Total file length is 2179 bytes.
> dag_1403285786962_0004_4.dot : Total file length is 2179 bytes.
> stderr : Total file length is 0 bytes.
> stderr_dag_1403285786962_0004_1 : Total file length is 0 bytes.
> stderr_dag_1403285786962_0004_1_post : Total file length is 0 bytes.
> stderr_dag_1403285786962_0004_2 : Total file length is 0 bytes.
> stderr_dag_1403285786962_0004_2_post : Total file length is 0 bytes.
> stderr_dag_1403285786962_0004_3 : Total file length is 0 bytes.
> stderr_dag_1403285786962_0004_3_post : Total file length is 0 bytes.
> stderr_dag_1403285786962_0004_4 : Total file length is 0 bytes.
> stderr_dag_1403285786962_0004_4_post : Total file length is 0 bytes.
> stdout : Total file length is 0 bytes.
> stdout_dag_1403285786962_0004_1 : Total file length is 0 bytes.
> stdout_dag_1403285786962_0004_1_post : Total file length is 0 bytes.
> stdout_dag_1403285786962_0004_2 : Total file length is 0 bytes.
> stdout_dag_1403285786962_0004_2_post : Total file length is 0 bytes.
> stdout_dag_1403285786962_0004_3 : Total file length is 0 bytes.
> stdout_dag_1403285786962_0004_3_post : Total file length is 0 bytes.
> stdout_dag_1403285786962_0004_4 : Total file length is 0 bytes.
> stdout_dag_1403285786962_0004_4_post : Total file length is 0 bytes.
> syslog : Total file length is 7577 bytes.
> syslog_dag_1403285786962_0004_1 : Total file length is 57034 bytes.
> syslog_dag_1403285786962_0004_1_post : Total file length is 4775 bytes.
> syslog_dag_1403285786962_0004_2 : Total file length is 56104 bytes.
> syslog_dag_1403285786962_0004_2_post : Total file length is 707 bytes.
> syslog_dag_1403285786962_0004_3 : Total file length is 53187 bytes.
> syslog_dag_1403285786962_0004_3_post : Total file length is 5003 bytes.
> syslog_dag_1403285786962_0004_4 : Total file length is 56111 bytes.
> syslog_dag_1403285786962_0004_4_post : Total file length is 4204 bytes.
> 
> fast run
> 
> Map 1      1   734 Bytes   438 Bytes   639 ms
> Map 2      1   245 KB      478 Bytes   1.34 secs
> Reducer 3  1   446 Bytes   557 Bytes   3.63 secs
> 
> 
> 
> slow run
> 
> Map 1      1   734 Bytes   438 Bytes   12.62 secs
> Map 2      1   245 KB      478 Bytes   14.37 secs
> Reducer 3  1   446 Bytes   557 Bytes   15.67 secs
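
One rough way to compare the two runs quoted above is to convert the last
column to seconds and subtract. The sketch below assumes that column is a
time value per vertex; the listing does not say whether it is a per-vertex
duration or elapsed time since the query started, so treat the result as
indicative only. The helper names are made up for this illustration.

```python
# Time values copied from the "fast run" and "slow run" listings above.
FAST = {"Map 1": "639 ms", "Map 2": "1.34 secs", "Reducer 3": "3.63 secs"}
SLOW = {"Map 1": "12.62 secs", "Map 2": "14.37 secs", "Reducer 3": "15.67 secs"}


def to_seconds(text: str) -> float:
    """Convert strings such as '639 ms' or '1.34 secs' to seconds."""
    value, unit = text.split()
    factor = {"ms": 0.001, "secs": 1.0}[unit]
    return float(value) * factor


def extra_seconds(fast: dict, slow: dict) -> dict:
    """Return, per vertex, how many seconds the slow run added."""
    return {
        vertex: round(to_seconds(slow[vertex]) - to_seconds(fast[vertex]), 2)
        for vertex in fast
    }


delta = extra_seconds(FAST, SLOW)
for vertex, secs in delta.items():
    print(f"{vertex}: +{secs:.2f} s")
```

If the extra time comes out roughly the same for every vertex, that would be
consistent with a fixed cost paid once at the start of the slow run rather
than with each vertex doing more work, though the numbers alone cannot
confirm the cause.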
> 
> 
> 
> On Fri, Jun 20, 2014 at 10:31 AM, Hitesh Shah <[email protected]> wrote:
> Hello Lars,
> 
> Just to be very clear - there is no caching of results/data across queries 
> except for some minimal meta-data caching for ORC. If you can send across the 
> logs generated by “yarn logs -applicationId <appId>”, we can try and help you 
> get a better understanding of where the speed difference is stemming from.
> 
> -- Hitesh
> 
> On Jun 20, 2014, at 10:13 AM, Bikas Saha <[email protected]> wrote:
> 
> > Hi,
> >
> > Thanks for your interest in trying out Hive on Tez. There are multiple 
> > reasons for the observations you see below.
> > 1)      Containers are warmed up the longer they get used. So if you 
> > repeatedly run queries then the JVM has all classes loaded and ready and 
> > may have JIT-ed the frequently run code path. As it learns more about your 
> > execution pattern, the JIT can do a better job. This will help you across 
> > different queries.
> > 2)      As you repeatedly access the same data, the chances increase of 
> > finding that data in the OS buffer cache, so you get the benefits of 
> > in-memory data. This will help repeated runs of queries on the same 
> > data.
> > 3)      Hive is smart about explicitly caching de-serialized data (Java 
> > objects) within a query in order to avoid re-computing work that has 
> > already been done. This will help within a query.
> > 4)      If you are using the ORC file then Hive will try to cache ORC file 
> > metadata like locations/sizes etc. and this helps different queries that 
> > access the same data.
> > 5)      If your Tez query session has been idle for some time, then the 
> > system starts pro-actively releasing resources back to the cluster so that 
> > they may be used by other applications (good for multi-tenancy). So if you 
> > fire a query after some delay then a slowdown will be observed in case we 
> > need to reclaim some of the released resources. This delay is configurable.
> >
> > Hope this helps and you have a positive experience experimenting with Hive 
> > on Tez.
> > Please let us know how we can help!
> > Bikas
> >
> > From: Lars Selsaas [mailto:[email protected]]
> > Sent: Friday, June 20, 2014 8:50 AM
> > To: user
> > Subject: Tez performance on Hive
> >
> > Hi,
> >
> > So when you set Tez as the execution engine for Hive, a query takes about 
> > half the time to finish the second time you run it, going from say 24 
> > seconds to 12 seconds. But if I keep re-running it, it gets down to about 2 
> > seconds on that same query. The time goes back up to 12 seconds if I wait 
> > too long before the next rerun or if I make large enough adjustments to 
> > the query.
> >
> >
> > So I'm working on a blog post about Tez and need to find out why this is 
> > happening. The first speedup seems to be mainly because of hot containers 
> > that store the information about where to find your data, while the 
> > further reduction down to about 2 seconds seems to come from some 
> > in-memory storage of the data. Does it store the results in memory and 
> > keep them ready for next time?
> >
> >
> >
> > --
> > Lars Selsaas
> > Data Engineer
> > Think Big Analytics
> > [email protected]
> > 650-537-5321
> >
> >
> 
> 
> 
> 
>
