> For example, a Hive job may start Tez containers, which then retrieve data 
> from LLAP running concurrently. In the current implementation, this is 
> unrealistic

That is how LLAP was built - to push work from Tez into LLAP vertex by vertex, 
instead of as an all-or-nothing implementation.

Here are the slides from Hadoop Summit describing how that is plugged into LLAP 


The flag in question is hive.llap.execution.mode - the most common use-case 
imagined for it was something like mode=map, where only the table-scan plus all 
secure operators (i.e. no temporary UDFs) run inside LLAP, to take advantage 
of the cache.
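For concreteness, here is a minimal sketch of setting that flag per session. 
The value names below are the ones I understand Hive to accept; treat them as 
assumptions and check your HiveConf docs for your release:

```sql
-- Run only the table-scan side of the plan inside LLAP daemons,
-- leaving the remaining operators in plain Tez containers:
SET hive.llap.execution.mode=map;

-- Other accepted values are none, all, only and auto
-- ("auto" lets the planner decide per vertex).
```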

LLAP can shuffle data to a Tez container, but it cannot shuffle data from a Tez 
container back into the daemon (& that's not very useful anyway, since that 
data won't benefit from the cache).

Here's the class that decides the hybrid execution tree & plans the split 
between LLAP and Tez within the same query DAG.


If you want to consume the LLAP cached rows from something like GPUs running 
Caffe, you can access the LLAP cache via the SparkSQL data-source APIs.
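A rough sketch of what that looks like from the Spark SQL side. The connector 
class name, database/table names and column names here are all illustrative 
assumptions, not something from this thread - substitute whatever your 
spark-llap build registers:

```sql
-- Hedged sketch (Spark SQL): expose an LLAP-backed Hive table to Spark,
-- assuming a spark-llap style data source is on the classpath.
CREATE TEMPORARY VIEW geo_events
USING org.apache.spark.sql.hive.llap      -- hypothetical connector name
OPTIONS (table "geo.events");

-- Only the projected, non-PII columns ever reach the Spark side;
-- the reads are served out of the LLAP daemons (and their cache).
SELECT lat, long FROM geo_events;
```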


This is faster than reading directly off cloud filesystems (because of LLAP's 
SSD cache). Even with a perf penalty on-prem, it is very useful for restricting 
Spark ML's access[1] to certain columns (e.g. you can extract lat/long from a 
table that also holds other PII data) without having to make a complete, 
post-projection copy of the data to share from the EDW end of the shop to the 
ML side of it, even if the entire data-set is HDFS encrypted.

[1] - https://hortonworks.com/blog/row-column-level-control-apache-spark/
