Ádám Szita created HIVE-23947:
---------------------------------

             Summary: Cache affinity is unset for text files read by LLAP
                 Key: HIVE-23947
                 URL: https://issues.apache.org/jira/browse/HIVE-23947
             Project: Hive
          Issue Type: Bug
            Reporter: Ádám Szita
            Assignee: Ádám Szita


LLAP relies on HostAffinitySplitLocationProvider to always route the same 
splits to the same LLAP daemons. Such a consistent distribution of data among 
the nodes yields a good cache hit ratio and thus good performance.
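The idea of affinity-based routing can be illustrated with a minimal sketch. 
This is not the actual HostAffinitySplitLocationProvider code: the class name, 
the pickDaemon helper, the CRC32 hash, and the daemon names are all 
hypothetical and chosen only for illustration. The point is that the target 
daemon is derived purely from the split's identity (path and start offset), so 
repeated runs land on the same node.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

public class AffinityRoutingSketch {

    // Hypothetical helper: maps a split (path + start offset) to one daemon.
    // The result depends only on the split identity and the daemon list,
    // so the same split is always routed to the same daemon.
    static String pickDaemon(String path, long startOffset, List<String> daemons) {
        CRC32 crc = new CRC32();
        crc.update((path + ":" + startOffset).getBytes(StandardCharsets.UTF_8));
        // CRC32.getValue() is a non-negative long, so the modulo is safe.
        int index = (int) (crc.getValue() % daemons.size());
        return daemons.get(index);
    }

    public static void main(String[] args) {
        // Assumed daemon names, for illustration only.
        List<String> daemons = List.of("llap-0", "llap-1", "llap-2");
        String first = pickDaemon("/warehouse/t/part-0000.txt", 0L, daemons);
        String second = pickDaemon("/warehouse/t/part-0000.txt", 0L, daemons);
        // Same split, same daemon on every run.
        System.out.println(first.equals(second)); // prints "true"
    }
}
```

A plain modulo like this reshuffles most splits whenever the number of daemons 
changes; the sketch ignores that concern and only shows the determinism.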

For text files this consistency is almost never achieved: 
HostAffinitySplitLocationProvider is never used, because HS2 does not set the 
cache affinity flag in the job conf for text input format content during 
compilation. The launched Tez AM then has to rely on HDFS location information 
to route the splits (and therefore the tasks) to the executor nodes. This 
location information might not overlap well with where the daemons actually 
run, and in the S3 case the Tez AM will mostly choose executors at random.

As a result, the hit ratio hardly ever reaches 100%: each time we re-run the 
same query, some disk/S3 reads still occur, causing poor performance. This 
lasts until the same content has been populated into all the daemons (after 
running the query tens or hundreds of times).
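The effect described above can be reproduced with a toy simulation. This is a 
sketch under simplifying assumptions, not a model of the real Tez AM scheduler: 
each daemon has an unbounded cache, random routing is uniform, and the class 
and method names are hypothetical. With affinity routing every re-run is served 
fully from cache; with random routing the hit ratio only creeps up as more 
daemons happen to have seen each split.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class CacheHitSimulation {

    // Returns the cache hit ratio observed in each run of the same query.
    // affinity = true : split i always goes to daemon (i % daemons).
    // affinity = false: each split goes to a uniformly random daemon.
    static double[] simulate(int splits, int daemons, int runs,
                             boolean affinity, long seed) {
        Random random = new Random(seed);
        List<Set<Integer>> caches = new ArrayList<>();
        for (int d = 0; d < daemons; d++) {
            caches.add(new HashSet<>());
        }
        double[] hitRatios = new double[runs];
        for (int run = 0; run < runs; run++) {
            int hits = 0;
            for (int split = 0; split < splits; split++) {
                int daemon = affinity ? split % daemons : random.nextInt(daemons);
                // Set.add returns false if the split was already cached: a hit.
                if (!caches.get(daemon).add(split)) {
                    hits++;
                }
            }
            hitRatios[run] = (double) hits / splits;
        }
        return hitRatios;
    }

    public static void main(String[] args) {
        double[] withAffinity = simulate(1000, 10, 20, true, 42L);
        double[] randomRouting = simulate(1000, 10, 20, false, 42L);
        for (int run = 0; run < withAffinity.length; run++) {
            System.out.printf("run %2d  affinity=%.2f  random=%.2f%n",
                    run + 1, withAffinity[run], randomRouting[run]);
        }
    }
}
```

In the affinity case the first run is all misses and every later run is all 
hits. In the random case the first run is also all misses, and later runs keep 
missing whenever a split lands on a daemon that has not read it yet.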



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
