Stephan Ewen created FLINK-19221:
------------------------------------

             Summary: Exploit LocatableFileStatus from Hadoop
                 Key: FLINK-19221
                 URL: https://issues.apache.org/jira/browse/FLINK-19221
             Project: Flink
          Issue Type: Improvement
          Components: Connectors / Hadoop Compatibility
    Affects Versions: 1.11.1
            Reporter: Stephan Ewen
            Assignee: Stephan Ewen
             Fix For: 1.12.0


When the HDFS client returns a {{FileStatus}} (the description of a file), it 
sometimes returns a {{LocatedFileStatus}}, which already contains all the 
{{BlockLocation}} information.

We should expose this on the Flink side, because it may save us a lot of RPC 
calls to the name node. The file enumerators often request block locations for 
all files, currently doing one RPC call per file.

When the FileStatus obtained from listing the directory (or getting details for 
a file) already has all the block locations, we can save the extra RPC call per 
file.
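To make the saving concrete, here is a minimal sketch that counts simulated name node round trips for both enumeration patterns. Everything in it is hypothetical: {{StubNameNode}} and its methods are stand-ins written for this illustration, not Flink or Hadoop classes, and treating a located listing as exactly one round trip is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; none of these types are Flink or Hadoop classes. */
public class RpcCountSketch {

    /** Stub name node that counts every simulated round trip. */
    public static final class StubNameNode {
        private final List<String> files;
        public int rpcCalls = 0;

        public StubNameNode(List<String> files) {
            this.files = files;
        }

        /** One round trip: returns file names only. */
        public List<String> listStatus() {
            rpcCalls++;
            return new ArrayList<>(files);
        }

        /** One round trip per file: returns the host storing that file's block. */
        public String[] getBlockLocations(String file) {
            rpcCalls++;
            return new String[] {"host-for-" + file};
        }

        /** One round trip (assumed): returns each file name together with its block host. */
        public List<String[]> listLocatedStatus() {
            rpcCalls++;
            List<String[]> result = new ArrayList<>();
            for (String f : files) {
                result.add(new String[] {f, "host-for-" + f});
            }
            return result;
        }
    }

    /** Current pattern: list the directory, then ask for locations file by file. */
    public static int enumerateWithPerFileLookups(StubNameNode nn) {
        for (String file : nn.listStatus()) {
            nn.getBlockLocations(file);
        }
        return nn.rpcCalls;
    }

    /** Proposed pattern: the listing already carries the block locations. */
    public static int enumerateWithLocatedListing(StubNameNode nn) {
        nn.listLocatedStatus();
        return nn.rpcCalls;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a", "b", "c", "d");
        System.out.println(enumerateWithPerFileLookups(new StubNameNode(files))); // prints 5
        System.out.println(enumerateWithLocatedListing(new StubNameNode(files))); // prints 1
    }
}
```

In this model, N files cost 1 + N round trips with per-file lookups and a single round trip when the listing already carries the locations.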

The suggested implementation is as follows:

  1. We introduce a {{LocatedFileStatus}} in Flink that we integrate with the 
built-in LocalFileSystem
  2. We integrate this with the HadoopFileSystems by creating a Flink 
{{LocatedFileStatus}} whenever the underlying file system created a Hadoop 
{{LocatedFileStatus}}
  3. As a safety net, the FS methods to access block information check whether 
the presented file status already contains the block information and return 
that information directly.

Steps one and two make things simpler for FileSystem users (no need to ask 
for extra info if it is already available).

Step three is the transparent shortcut that all applications get even if they 
do not explicitly use the {{LocatedFileStatus}} and keep calling 
{{FileSystem.getFileBlockLocations()}}.
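The safety net in step three could look roughly like the sketch below. It is not the proposed patch: the nested {{FileStatus}}, {{LocatedFileStatus}}, {{BlockLocation}} and {{SketchFileSystem}} types are simplified, hypothetical stand-ins defined only for this example, and the fallback branch merely simulates a name node call by incrementing a counter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-ins; not the real Flink or Hadoop types. */
public class LocatedStatusShortcutSketch {

    /** Byte range of one block and the host that stores it. */
    public static final class BlockLocation {
        public final long offset;
        public final long length;
        public final String host;

        public BlockLocation(long offset, long length, String host) {
            this.offset = offset;
            this.length = length;
            this.host = host;
        }
    }

    /** Plain file description without block information. */
    public static class FileStatus {
        public final String path;
        public final long len;

        public FileStatus(String path, long len) {
            this.path = path;
            this.len = len;
        }
    }

    /** File description that already carries its block locations. */
    public static final class LocatedFileStatus extends FileStatus {
        public final BlockLocation[] blocks;

        public LocatedFileStatus(String path, long len, BlockLocation... blocks) {
            super(path, len);
            this.blocks = blocks;
        }
    }

    public static final class SketchFileSystem {
        /** Number of simulated name node round trips. */
        public int rpcCalls = 0;

        public BlockLocation[] getFileBlockLocations(FileStatus file, long start, long len) {
            if (file instanceof LocatedFileStatus) {
                // Shortcut: answer from the locations attached to the status, no RPC.
                return overlapping(((LocatedFileStatus) file).blocks, start, len);
            }
            // Fallback: simulated remote lookup returning one block for the whole file
            // (the requested range is ignored in this stub).
            rpcCalls++;
            return new BlockLocation[] {new BlockLocation(0, file.len, "remote-host")};
        }

        /** Keeps only the blocks that intersect the range [start, start + len). */
        private static BlockLocation[] overlapping(BlockLocation[] blocks, long start, long len) {
            List<BlockLocation> hits = new ArrayList<>();
            for (BlockLocation b : blocks) {
                if (b.offset < start + len && b.offset + b.length > start) {
                    hits.add(b);
                }
            }
            return hits.toArray(new BlockLocation[0]);
        }
    }

    public static void main(String[] args) {
        SketchFileSystem fs = new SketchFileSystem();
        FileStatus located = new LocatedFileStatus("/data/part-0", 256,
                new BlockLocation(0, 128, "host-a"), new BlockLocation(128, 128, "host-b"));
        FileStatus plain = new FileStatus("/data/part-1", 256);

        System.out.println(fs.getFileBlockLocations(located, 0, 100).length); // prints 1
        System.out.println(fs.rpcCalls); // prints 0
        fs.getFileBlockLocations(plain, 0, 256);
        System.out.println(fs.rpcCalls); // prints 1
    }
}
```

Callers keep using the same method signature in this sketch; only the type of the status they pass in decides whether a remote call happens.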



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
