Re: Tez Sessions

2016-10-20 Thread Hitesh Shah
HiveServer2 today maintains Tez sessions (when running with perimeter security, 
i.e. Ranger/Sentry) and reuses a session across queries. 

Tez AM recovery works for the most part. It will try to recover the completed 
tasks of the last running DAG and then run the ones that did not complete or 
were still running. It does not handle cases where the committer was in the 
middle of a commit, though, so those DAGs will abort when trying to recover. 
Given the complexity of recovery, there are probably bugs that we have not 
discovered yet, but for the most part it does function well.

There are a few issues you should consider when trying to use a single AM:
   - On secure clusters, the delegation token max lifetime is 7 days, so you 
will need to recycle apps on a weekly basis. 
   - YARN does not clean up data/logs for an app until the app completes, so 
this can add space pressure on the YARN local dirs. That said, there is some 
work happening as part of TEZ-3334 to help clean up intermediate data on a 
regular basis. A couple of other JIRAs have been filed recently too, to look 
at cleaning up data more frequently.
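
As a rough illustration of the weekly recycle point above, the sketch below computes the latest time at which a long-running AM would need to be restarted. It assumes the 7-day figure quoted in this message; the real limit is a cluster setting. The class name, the helper methods, and the 12-hour safety margin are hypothetical and not part of Tez or YARN.

```java
import java.time.Duration;
import java.time.Instant;

class SessionRecycle {
    // Assumption: the 7-day delegation token max lifetime mentioned above.
    static final Duration TOKEN_MAX_LIFETIME = Duration.ofDays(7);

    // Hypothetical helper: latest restart time for an app started at appStart,
    // leaving safetyMargin before the tokens reach their max lifetime.
    static Instant recycleDeadline(Instant appStart, Duration safetyMargin) {
        return appStart.plus(TOKEN_MAX_LIFETIME).minus(safetyMargin);
    }

    // True once "now" has reached or passed the recycle deadline.
    static boolean shouldRecycle(Instant appStart, Instant now, Duration safetyMargin) {
        return !now.isBefore(recycleDeadline(appStart, safetyMargin));
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2016-10-20T00:00:00Z");
        Duration margin = Duration.ofHours(12); // illustrative margin only
        System.out.println("Recycle by: " + recycleDeadline(start, margin));
    }
}
```

A scheduler or cron-style job could call shouldRecycle() periodically and restart the session once it returns true.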

thanks
— Hitesh   


> On Oct 20, 2016, at 2:35 PM, Madhusudan Ramanna  wrote:
> 
> Ok, no worries. I agree that this single AM model would be very close to a 
> mini-job tracker. One of the options we're investigating is having one YARN 
> Tez AM run all our DAGs. Given that this AM already has all the 
> resources/containers, we were thinking this could save on the cost of AM and 
> container initialization.
> 
> We haven't looked into Tez recovery yet. Durability is also one of our big 
> concerns.
> 
> 
> On Thursday, October 20, 2016 12:44 PM, Hitesh Shah  wrote:
> 
> 
> Not supported as of now. There are multiple aspects to supporting this 
> properly. One of the most important issues to address would be to do proper 
> QoS across various DAGs i.e. what kind of policies would need to be built out 
> to run multiple DAGs to completion within a limited amount of resources. The 
> model would become close to a mini-jobtracker or a spark-standalone cluster.
> 
> Could you provide more details on what you are trying to achieve? We could 
> try and provide different viewpoints on trying to get you to a viable 
> solution.
> 
> — Hitesh
> 
> > On Oct 20, 2016, at 10:52 AM, Madhusudan Ramanna wrote:
> > 
> > Hello Folks,
> > 
> > http://hortonworks.com/blog/introducing-tez-sessions/
> > 
> > From the above post it seems like DAGs can only be executed serially. 
> > Could DAGs be executed in parallel on one Tez AM? 
> > 
> > thanks,
> > Madhu
> 
> 



Re: Tez Sessions

2016-10-20 Thread Madhusudan Ramanna
Ok, no worries. I agree that this single AM model would be very close to a 
mini-job tracker. One of the options we're investigating is having one YARN 
Tez AM run all our DAGs. Given that this AM already has all the 
resources/containers, we were thinking this could save on the cost of AM and 
container initialization.

We haven't looked into Tez recovery yet. Durability is also one of our big 
concerns.



   

Re: Tez Sessions

2016-10-20 Thread Hitesh Shah
Not supported as of now. There are multiple aspects to supporting this 
properly. One of the most important issues to address would be to do proper QoS 
across various DAGs i.e. what kind of policies would need to be built out to 
run multiple DAGs to completion within a limited amount of resources. The model 
would become close to a mini-jobtracker or a spark-standalone cluster.

Could you provide more details on what you are trying to achieve? We could try 
and provide different viewpoints on trying to get you to a viable solution.

— Hitesh




Re: Container settings at vertex level

2016-10-20 Thread Hitesh Shah
Hello Madhu,

If you are using Tez via Hive, then this would need a fix in Hive. I don’t 
believe Hive supports different settings for each vertex in a given query today.

However, for native jobs, Tez already supports different specs for each vertex:

Vertex::setTaskResource() (configures YARN resources, i.e. memory/CPU)
Vertex::setTaskLaunchCmdOpts() (Java opts, etc.)
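
To make the benefit of per-vertex specs concrete, here is a small self-contained sketch of the arithmetic. It does not call the Tez API: VertexSpec, the vertex names, the task counts, and the memory sizes are all hypothetical, and the comparison assumes every task of the DAG is scheduled at once. In a real native Tez job, the per-vertex memory figure would be the value passed through Vertex::setTaskResource().

```java
import java.util.Arrays;
import java.util.List;

class VertexSizing {
    // Hypothetical description of one vertex: task count and per-task memory need.
    static final class VertexSpec {
        final String name;
        final int tasks;
        final int memoryMb;

        VertexSpec(String name, int tasks, int memoryMb) {
            this.name = name;
            this.tasks = tasks;
            this.memoryMb = memoryMb;
        }
    }

    // Footprint when every task is sized for the most memory-hungry vertex.
    static long uniformFootprintMb(List<VertexSpec> dag) {
        int maxMemoryMb = 0;
        long totalTasks = 0;
        for (VertexSpec v : dag) {
            maxMemoryMb = Math.max(maxMemoryMb, v.memoryMb);
            totalTasks += v.tasks;
        }
        return maxMemoryMb * totalTasks;
    }

    // Footprint when each vertex requests only what its own tasks need.
    static long perVertexFootprintMb(List<VertexSpec> dag) {
        long total = 0;
        for (VertexSpec v : dag) {
            total += (long) v.tasks * v.memoryMb;
        }
        return total;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: one memory-intensive vertex among lighter ones.
        List<VertexSpec> dag = Arrays.asList(
                new VertexSpec("scan", 40, 1024),
                new VertexSpec("join", 4, 8192),
                new VertexSpec("aggregate", 10, 2048));
        System.out.println("uniform MB:    " + uniformFootprintMb(dag));
        System.out.println("per-vertex MB: " + perVertexFootprintMb(dag));
    }
}
```

With these made-up numbers, sizing only the memory-intensive vertex at 8 GB leaves far more room for other containers than sizing every task for the worst case.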

Does the above help? Or are you looking for something different? 

thanks
— Hitesh


> On Oct 20, 2016, at 10:44 AM, Madhusudan Ramanna  wrote:
> 
> Hello Folks,
> 
> Some vertices are memory intensive and require more memory than others. The 
> graph, in general, takes a long(ish) time to complete. Allocating a huge 
> chunk of memory by default to this one DAG/application severely limits the 
> number of concurrent YARN containers that can be run. How can we influence 
> the Tez runtime to request and execute some vertices in specialized 
> containers? What is a good solution to this problem?
> 
> thanks,
> Madhu



Tez Sessions

2016-10-20 Thread Madhusudan Ramanna
Hello Folks,

http://hortonworks.com/blog/introducing-tez-sessions/

From the above post it seems like DAGs can only be executed serially. Could 
DAGs be executed in parallel on one Tez AM? 

thanks,
Madhu

Container settings at vertex level

2016-10-20 Thread Madhusudan Ramanna
Hello Folks,

Some vertices are memory intensive and require more memory than others. The 
graph, in general, takes a long(ish) time to complete. Allocating a huge 
chunk of memory by default to this one DAG/application severely limits the 
number of concurrent YARN containers that can be run. How can we influence 
the Tez runtime to request and execute some vertices in specialized 
containers? What is a good solution to this problem?

thanks,
Madhu