Ahmed Hussein created HADOOP-17362:
--------------------------------------

             Summary: Doing hadoop ls on Har file triggers too many RPC calls
                 Key: HADOOP-17362
                 URL: https://issues.apache.org/jira/browse/HADOOP-17362
             Project: Hadoop Common
          Issue Type: Bug
          Components: fs
            Reporter: Ahmed Hussein
            Assignee: Ahmed Hussein


[~daryn] noticed that invoking hadoop ls on a HAR takes too much time.

The har system has multiple deficiencies that significantly impact 
performance:

# Parsing the master index references ranges within the archive index. Each 
range requires re-opening the hdfs input stream and seeking to the same 
location where it previously stopped.
# Listing a har stats the archive index for every "directory". The per-call 
cache uses a unique key for each stat, rendering the cache useless and 
significantly increasing memory pressure.
# Determining the children of a directory scans the entire archive contents and 
filters out the children, even though the cached metadata already stores the 
exact child list.
# Globbing a har's contents results in unnecessary stats for every leaf path.
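
To make the third point concrete, here is a minimal, self-contained sketch comparing a full scan of the archive entries with a direct lookup in a stored child list. It is an illustrative model only: the class HarIndexSketch, its fields, and its methods are hypothetical names and do not come from HarFileSystem or any Hadoop API, and the counter merely approximates the work done per listing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of a har index. Nothing here is Hadoop code.
class HarIndexSketch {
    // Directory path -> names of its direct children (assumed to mirror the
    // "exact child list" the cached metadata is said to store).
    private final Map<String, List<String>> childrenByDir = new HashMap<>();
    // Every entry path in the archive, in insertion order.
    private final List<String> allPaths = new ArrayList<>();
    // Counts how many entries (or map lookups) a listing touched.
    int entriesVisited = 0;

    void addEntry(String parent, String name) {
        String path = parent.equals("/") ? "/" + name : parent + "/" + name;
        allPaths.add(path);
        childrenByDir.computeIfAbsent(parent, k -> new ArrayList<>()).add(name);
    }

    // Scan approach: walk every entry and keep only direct children of dir.
    List<String> childrenByScan(String dir) {
        String prefix = dir.equals("/") ? "/" : dir + "/";
        List<String> out = new ArrayList<>();
        for (String p : allPaths) {
            entriesVisited++;
            // A direct child has the prefix and no further '/' after it.
            if (p.startsWith(prefix) && p.indexOf('/', prefix.length()) < 0) {
                out.add(p.substring(prefix.length()));
            }
        }
        return out;
    }

    // Lookup approach: read the stored child list with a single map access.
    List<String> childrenByLookup(String dir) {
        entriesVisited++;
        return childrenByDir.getOrDefault(dir, new ArrayList<>());
    }

    public static void main(String[] args) {
        HarIndexSketch idx = new HarIndexSketch();
        idx.addEntry("/", "a");
        idx.addEntry("/", "b");
        idx.addEntry("/a", "x");
        idx.addEntry("/a", "y");
        idx.addEntry("/b", "z");

        System.out.println(idx.childrenByScan("/a") + " visited=" + idx.entriesVisited);
        idx.entriesVisited = 0;
        System.out.println(idx.childrenByLookup("/a") + " visited=" + idx.entriesVisited);
    }
}
```

In this toy model both methods return the same children, but the scan touches every entry in the archive while the lookup performs one map access, so the gap grows with archive size.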

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

