Ádám Szita created HIVE-23947:
---------------------------------

             Summary: Cache affinity is unset for text files read by LLAP
                 Key: HIVE-23947
                 URL: https://issues.apache.org/jira/browse/HIVE-23947
             Project: Hive
          Issue Type: Bug
            Reporter: Ádám Szita
            Assignee: Ádám Szita

LLAP relies on HostAffinitySplitLocationProvider to always route the same splits to the same LLAP daemons. Distributing the data consistently among the nodes in this way yields a good cache hit ratio and thus good performance.

For text files this is almost never the case: HostAffinitySplitLocationProvider is never used, because HS2 does not set the cache affinity flag in the job conf for text input format content during compilation. The launched Tez AM then has to rely on HDFS location information to route the splits (and therefore the tasks) to the executor nodes. This location information might not overlap well with where the daemons actually run, and in the S3 case the Tez AM will choose executors mostly at random.

As a result, the hit ratio hardly ever reaches 100%: each time the same query is re-run, some disk or S3 reads still occur. This continues until the same content has been populated into all the daemons (after running the query tens or hundreds of times), and performance is poor in the meantime.
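
To illustrate why affinity-based routing matters for the cache, here is a minimal sketch of the general idea: a split is mapped to a daemon purely from its own identity (file path and start offset), so re-running the same query sends each split to the daemon that already cached it. This is not Hive's HostAffinitySplitLocationProvider code; the class `AffinitySketch`, the method `pickDaemon`, the daemon names, and the simple hash-modulo scheme are all assumptions made for illustration.

```java
import java.util.List;

// Hypothetical illustration only; not part of Hive.
final class AffinitySketch {
    private AffinitySketch() {}

    /**
     * Picks a daemon for a split using only the split's identity.
     * The same (path, startOffset) pair always maps to the same entry
     * of the daemon list, as long as the list itself does not change.
     */
    static String pickDaemon(String path, long startOffset, List<String> daemons) {
        if (daemons.isEmpty()) {
            throw new IllegalArgumentException("no daemons available");
        }
        // Combine the path and offset into one deterministic hash value.
        int hash = 31 * path.hashCode() + Long.hashCode(startOffset);
        // floorMod keeps the index non-negative even for negative hashes.
        return daemons.get(Math.floorMod(hash, daemons.size()));
    }
}
```

Calling `pickDaemon` twice with the same path, offset, and daemon list returns the same daemon, which is the property that lets the cache hit ratio climb after the first run. Routing by HDFS block locations, or at random in the S3 case, gives no such guarantee. Note that a plain modulo scheme like this one remaps most splits whenever the daemon list changes, so it should be read as a sketch of the principle only.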