Joe McDonnell created IMPALA-7827:
-------------------------------------

             Summary: Investigate increasing disk utilization by overlapping file open with reads
                 Key: IMPALA-7827
                 URL: https://issues.apache.org/jira/browse/IMPALA-7827
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
    Affects Versions: Impala 3.2.0
            Reporter: Joe McDonnell


Disk IO threads perform both the HDFS file open and the reads for ScanRanges. 
Most HDFS file opens are served from the file handle cache. On a cache miss, 
however, the Disk IO thread is tied up waiting on a round trip to the NameNode. 
Depending on the number of Disk IO threads and the speed of the NameNode, all 
of the Disk IO threads could be blocked in HDFS file open calls, even while 
other ScanRanges have file handles available in the cache. Spinning disks are 
the worst case: each disk has a single Disk IO thread, so if that thread gets 
tied up in an open call, the disk goes idle.

It might make sense for the open call to be serviced by a separate thread pool. 
The ScanRange would go through a separate state transition that opens the file 
handle. The Disk IO thread can process ScanRanges that already have an open 
file handle (cached or otherwise) while the open call is in progress.
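
The idea above can be sketched roughly as follows. This is a toy model in 
Python, not Impala code: the Scheduler class, the State names, and the 
open_fn/read_fn callbacks are all hypothetical, and a threading.Event stands in 
for the NameNode round trip. It only shows the shape of the state transition: 
ranges without a handle go to an open pool first, and the single disk thread 
pulls from a queue of ranges that are ready to read.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from enum import Enum, auto


class State(Enum):
    NEEDS_OPEN = auto()  # cache miss: no file handle yet
    OPENING = auto()     # open call in flight on the open pool
    READY = auto()       # handle available, disk thread may read
    DONE = auto()


class ScanRange:
    def __init__(self, name, cached_handle=None):
        self.name = name
        self.handle = cached_handle
        self.state = State.NEEDS_OPEN if cached_handle is None else State.READY


class Scheduler:
    """Hypothetical scheduler: opens run on a pool, reads on one disk thread."""

    def __init__(self, open_fn, read_fn, num_open_threads=2):
        self._open_fn = open_fn
        self._read_fn = read_fn
        self._ready = queue.Queue()
        self._open_pool = ThreadPoolExecutor(max_workers=num_open_threads)
        self.completed = []  # names in the order their reads finished

    def submit(self, rng):
        if rng.state is State.READY:
            self._ready.put(rng)
        else:
            rng.state = State.OPENING
            self._open_pool.submit(self._open, rng)

    def _open(self, rng):
        rng.handle = self._open_fn(rng.name)  # may block for a long time
        rng.state = State.READY
        self._ready.put(rng)

    def run_disk_thread(self, expected):
        # The disk thread only waits for READY ranges; it never calls open_fn.
        for _ in range(expected):
            rng = self._ready.get()
            self._read_fn(rng.handle)
            rng.state = State.DONE
            self.completed.append(rng.name)
        self._open_pool.shutdown()
        return self.completed


def demo():
    namenode_replied = threading.Event()

    def slow_open(name):
        # Simulated NameNode: replies only after the disk thread has done a
        # read (or after a 5 second timeout if the disk thread were stuck).
        namenode_replied.wait(timeout=5)
        return "handle:" + name

    def read(handle):
        namenode_replied.set()

    sched = Scheduler(slow_open, read)
    sched.submit(ScanRange("miss"))  # submitted first, needs an open
    sched.submit(ScanRange("cached", cached_handle="handle:cached"))
    return sched.run_disk_thread(expected=2)


if __name__ == "__main__":
    print(demo())
```

In this model the range with a cached handle is read while the open for the 
other range is still pending, which is the overlap the proposal is after. A 
real implementation would also need error handling for failed opens and 
cancellation, which the sketch leaves out.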

This is complicated by the fact that a file handle can't be used by multiple 
threads simultaneously. To do the state transition properly, it must be clear 
up front whether a ScanRange needs a new file handle. Keeping a file handle 
cache at the RequestContext level and using preads (see IMPALA-6403) might make 
this clear.
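
One possible reading of that last suggestion, as a sketch only: with positional 
reads, several threads can share one handle per file, so "is a new handle 
necessary" reduces to "is the file already in this request's cache". The code 
below uses local POSIX files and os.pread as a stand-in for HDFS positional 
reads; the RequestContextHandleCache class is hypothetical and is not how 
Impala structures its cache.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor


class RequestContextHandleCache:
    """Hypothetical per-request cache: at most one descriptor per path."""

    def __init__(self):
        self._lock = threading.Lock()
        self._fds = {}
        self.opens = 0  # number of real open calls, for illustration

    def get(self, path):
        # Simplification: the open happens under the lock. A real design
        # would not hold a lock across a slow open call.
        with self._lock:
            fd = self._fds.get(path)
            if fd is None:
                fd = os.open(path, os.O_RDONLY)
                self._fds[path] = fd
                self.opens += 1
            return fd

    def read(self, path, offset, length):
        # os.pread takes an explicit offset and does not move the file
        # position, so concurrent readers can share the descriptor.
        return os.pread(self.get(path), length, offset)

    def close(self):
        with self._lock:
            for fd in self._fds.values():
                os.close(fd)
            self._fds.clear()


def demo():
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"0123456789" * 4)
        path = f.name
    cache = RequestContextHandleCache()
    try:
        ranges = [(0, 10), (10, 10), (20, 10), (30, 10)]
        with ThreadPoolExecutor(max_workers=4) as pool:
            chunks = list(pool.map(lambda r: cache.read(path, *r), ranges))
        return b"".join(chunks), cache.opens
    finally:
        cache.close()
        os.unlink(path)


if __name__ == "__main__":
    data, opens = demo()
    print(len(data), opens)
```

Four ranges of the same file are read from four threads through a single 
descriptor, so only one open is ever issued for that file within the request.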


