> For example, a Hive job may start Tez containers, which then retrieve data
> from LLAP running concurrently. In the current implementation, this is
That is how LLAP was built: to push work from Tez to LLAP vertex by vertex,
rather than as an all-or-nothing implementation.
Here are the slides from Hadoop Summit describing how that is plugged into LLAP
The flag in question is hive.llap.execution.mode - the most common use-case
imagined for it was mode=map, where only the table-scan plus all
secure operators (i.e. no temporary UDFs) run inside LLAP (to take advantage
of the cache).
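For reference, the mode is a per-session setting; a sketch of the values it accepts (descriptions are my paraphrase, not the official docs):

```sql
-- hive.llap.execution.mode controls which vertices run inside LLAP daemons:
--   none : run everything in plain Tez containers
--   map  : run table scans / map-side work in LLAP, the rest in Tez
--   all  : run everything in LLAP where possible
--   only : fail the query if it cannot run fully inside LLAP
--   auto : let the planner decide vertex by vertex
SET hive.llap.execution.mode=map;
```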
LLAP can shuffle data to a Tez container, but it cannot shuffle data from a Tez
container back into the daemon (& that's not very useful, since it won't be
Here's the class that decides the hybrid execution tree and plans the split
between LLAP and Tez within the same query DAG.
If you want to consume LLAP-cached rows from something like GPUs running
Caffe, you can access the LLAP cache via the SparkSQL data-source APIs.
This is faster than reading directly off cloud filesystems (because of LLAP's
SSD cache), but even with a perf penalty on-prem it is very useful for
restricting the Spark ML side's access to certain columns (e.g. you can expose
lat/long from a table that also holds other PII data) without having to make a
complete copy of the projected data to share from the EDW end of the shop to
the ML side of it, even if the entire data-set is HDFS-encrypted.
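A sketch of what that access path looks like from the Spark side, using the Hortonworks spark-llap (HiveWarehouseConnector) package; this needs a cluster with LLAP running and the connector on the Spark classpath, and the table/column names here are hypothetical:

```python
# Sketch only: assumes an existing SparkSession `spark` configured
# with the spark-llap / HiveWarehouseConnector jars.
from pyspark_llap import HiveWarehouseSession

hive = HiveWarehouseSession.session(spark).build()

# The query is executed through LLAP, so the daemons' cache is used and
# column-level security policies apply: this job can be granted lat/long
# only, while the rest of the (PII) table stays inaccessible to it.
df = hive.executeQuery("SELECT lat, long FROM geo.events")
df.show()
```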
 - https://hortonworks.com/blog/row-column-level-control-apache-spark/