[jira] [Commented] (HADOOP-12876) [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent opeations
[ https://issues.apache.org/jira/browse/HADOOP-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276923#comment-16276923 ]

Steve Loughran commented on HADOOP-12876:
-----------------------------------------

This could be done with the code in HADOOP-15038 once it is made common.

> [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent opeations
> ---------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-12876
> URL: https://issues.apache.org/jira/browse/HADOOP-12876
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs, fs/adl, tools
> Reporter: Vishwajeet Dusane
> Assignee: Vishwajeet Dusane
>
> Add support to cache GetFileStatus and ListStatus responses locally for a
> limited period of time. A local cache with a limited lifetime would reduce
> the number of calls made for the GetFileStatus operation.
> One example where a local, limited-period cache would be useful: in terasort,
> ListStatus on the input directory is followed by a GetFileStatus operation
> on each file within the directory. For 2048 input files in a directory, this
> would save 2048 GetFileStatus calls during startup (using the ListStatus
> response to cache FileStatus instances).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
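The terasort example in the issue description can be illustrated with a small call-counting sketch. It is only an illustration under stated assumptions: `CountingStore` and `ListingPrimedClient` are hypothetical names, plain strings stand in for Hadoop `FileStatus` objects, and expiry is left out entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a remote store; it only counts round trips. */
class CountingStore {
  int listCalls = 0;
  int getCalls = 0;
  private final int fileCount;

  CountingStore(int fileCount) {
    this.fileCount = fileCount;
  }

  /** One remote call that returns every path in the directory. */
  List<String> listStatus(String dir) {
    listCalls++;
    List<String> paths = new ArrayList<>();
    for (int i = 0; i < fileCount; i++) {
      paths.add(dir + "/part-" + i);
    }
    return paths;
  }

  /** One remote call per path; the string stands in for a FileStatus. */
  String getFileStatus(String path) {
    getCalls++;
    return "status:" + path;
  }
}

/** Hypothetical client that primes a local cache from listing responses. */
class ListingPrimedClient {
  private final CountingStore store;
  private final Map<String, String> cache = new HashMap<>();

  ListingPrimedClient(CountingStore store) {
    this.store = store;
  }

  List<String> listStatus(String dir) {
    List<String> paths = store.listStatus(dir);
    for (String path : paths) {
      // Assumption: the listing response already carries each entry's status.
      cache.put(path, "status:" + path);
    }
    return paths;
  }

  String getFileStatus(String path) {
    // Only go to the remote store on a cache miss.
    return cache.computeIfAbsent(path, store::getFileStatus);
  }
}
```

Listing a directory of 2048 files through `ListingPrimedClient` and then asking for the status of each file costs one remote list call and no remote get calls in this sketch; the same loop against `CountingStore` directly costs 2048 get calls.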
[jira] [Commented] (HADOOP-12876) [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent opeations
[ https://issues.apache.org/jira/browse/HADOOP-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15501914#comment-15501914 ]

Kai Zheng commented on HADOOP-12876:
------------------------------------

Thanks [~fabbri] for the pointer; it helps a lot.
[jira] [Commented] (HADOOP-12876) [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent opeations
[ https://issues.apache.org/jira/browse/HADOOP-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15501716#comment-15501716 ]

Aaron Fabbri commented on HADOOP-12876:
---------------------------------------

FYI we're working on something similar for S3. The umbrella JIRA is HADOOP-13448 and the in-memory implementation is HADOOP-13452. Once this work is complete, it should be easy to refactor so you can access the LocalMetadataStore from the ADLS client.
[jira] [Commented] (HADOOP-12876) [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent opeations
[ https://issues.apache.org/jira/browse/HADOOP-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15499988#comment-15499988 ]

Kai Zheng commented on HADOOP-12876:
------------------------------------

A very interesting idea to cache FileStatus objects for later use! Good thoughts, [~ste...@apache.org]!

bq. Would this be per FS instance, or across all instances? ...

I would suggest we start with the easy case and make the cache per instance. Sharing the cache across multiple instances may cause more trouble than it is worth: besides the security concern you mentioned, it would also be hard to enforce cache operations or keep the instances in sync.

bq. Making it generic to filesystems would allow use elsewhere (Swift, S3a), as well as isolated testing

This is a very good idea! The optimization looks promising for all file systems backed by remote stores, and perhaps even HDFS. Should we have a base issue to lay the foundation first, not specific to any cloud? Once that is done, this issue becomes much simpler, since most of the work would already be in the common base. Maybe an abstract class like {{CloudFileSystem extends FileSystem}} for all cloud-backed file systems to inherit from? That way we can share much of the code among the specific implementations.

bq. would there be cache invalidation after file/directory operations?

Right. We would need to consider how and when cached FileStatus objects are invalidated or cleaned up. Consider a file that is listed but deleted shortly afterwards ...

Does this need help? If so, it would be good to break this down, and I can take some of the tasks. Thanks.
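One possible answer to the invalidation question in the comment above, sketched under assumptions: a per-instance cache drops a path and everything beneath it whenever the owning filesystem performs a delete or rename. `InvalidatingStatusCache` and `invalidateTree` are hypothetical names, and the type parameter `V` stands in for Hadoop's `FileStatus`; nothing here is taken from an actual patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-instance cache; V stands in for FileStatus. */
class InvalidatingStatusCache<V> {
  private final Map<String, V> entries = new ConcurrentHashMap<>();

  void put(String path, V status) {
    entries.put(path, status);
  }

  /** Returns the cached status, or null if the path is not cached. */
  V get(String path) {
    return entries.get(path);
  }

  /**
   * Drops the path itself and every cached descendant. Intended to be called
   * after a successful delete or rename of that path.
   */
  void invalidateTree(String path) {
    // The trailing slash keeps "/data/in" from matching "/data/input2".
    String prefix = path.endsWith("/") ? path : path + "/";
    entries.keySet().removeIf(key -> key.equals(path) || key.startsWith(prefix));
  }
}
```

This only covers changes made through the same filesystem instance. A file deleted by another process stays visible until its entry expires, which is why a short time to live would still be needed alongside explicit invalidation.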
[jira] [Commented] (HADOOP-12876) [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent opeations
[ https://issues.apache.org/jira/browse/HADOOP-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324265#comment-15324265 ]

Steve Loughran commented on HADOOP-12876:
-----------------------------------------

# Would this be per FS instance, or across all instances? Because the latter, even though nominally more efficient if there is >1 FS against the same URI, has some security implications on any service fielding requests from >1 principal.
# Guava has a cache with eviction policies; I'd go with that unless it was fundamentally broken
# Making it generic to filesystems would allow use elsewhere (Swift, S3a), as well as isolated testing
# would there be cache invalidation after file/directory operations?
[jira] [Commented] (HADOOP-12876) [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent opeations
[ https://issues.apache.org/jira/browse/HADOOP-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15204575#comment-15204575 ]

Vishwajeet Dusane commented on HADOOP-12876:
--------------------------------------------

[~twu] Once HADOOP-12666 is resolved, I will incorporate the FileStatus cache related changes into the latest ASF code and raise a patch here.
[jira] [Commented] (HADOOP-12876) [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent opeations
[ https://issues.apache.org/jira/browse/HADOOP-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15204524#comment-15204524 ]

Tony Wu commented on HADOOP-12876:
----------------------------------

Hi [~vishwajeet.dusane],

Thanks a lot for creating a separate JIRA to discuss the file status cache. I noticed you have removed the relevant code (i.e. {{FileStatusCacheManager}}) from the latest patch in HADOOP-12666. Do you mind reposting the cache implementation here? I think you can post a patch for this JIRA based off the latest patch for HADOOP-12666.
[jira] [Commented] (HADOOP-12876) [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent opeations
[ https://issues.apache.org/jira/browse/HADOOP-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177700#comment-15177700 ]

Vishwajeet Dusane commented on HADOOP-12876:
--------------------------------------------

* The FileStatus cache is a simple process-level cache which mirrors the FileStatus objects of the backend storage.
* The time to live of a cached FileStatus object is limited: 5 seconds by default, configurable through core-site.xml.
* FileStatus objects are stored in a synchronized LinkedHashMap, where the key is the fully qualified file path and the value is the FileStatus Java object along with its time-to-live information.
* The FileStatus cache is built from successful responses to GetFileStatus and ListStatus calls for existing files/folders. Non-existent files/folders are not kept in the cache.
* The motivation for the FileStatus cache is to avoid multiple GetFileStatus calls to the ADL backend and, as a result, gain better performance at job startup and during execution.
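The design in the bullets above could look roughly like the following. This is a minimal sketch, not the removed {{FileStatusCacheManager}} code: the class name `TtlStatusCache`, the injectable clock, and the generic type `V` (standing in for Hadoop's `FileStatus`) are all assumptions made for illustration, and reading the time to live from core-site.xml is left out.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a TTL-limited status cache; V stands in for FileStatus. */
class TtlStatusCache<V> {

  /** Cached value plus the absolute time (in ms) at which it expires. */
  private static final class Entry<T> {
    final T status;
    final long expiresAtMillis;

    Entry(T status, long expiresAtMillis) {
      this.status = status;
      this.expiresAtMillis = expiresAtMillis;
    }
  }

  // Synchronized LinkedHashMap keyed by fully qualified path, as described above.
  private final Map<String, Entry<V>> entries =
      Collections.synchronizedMap(new LinkedHashMap<String, Entry<V>>());
  private final long ttlMillis;
  private final LongSupplier clock;

  /** The clock is injectable only so that expiry can be tested deterministically. */
  TtlStatusCache(long ttlMillis, LongSupplier clock) {
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  /** Records a successful GetFileStatus response for an existing path. */
  void put(String qualifiedPath, V status) {
    entries.put(qualifiedPath, new Entry<>(status, clock.getAsLong() + ttlMillis));
  }

  /** Records every entry of a successful ListStatus response. */
  void putAll(Map<String, V> listing) {
    listing.forEach(this::put);
  }

  /** Returns the cached status, or null if the path is absent or has expired. */
  V get(String qualifiedPath) {
    Entry<V> entry = entries.get(qualifiedPath);
    if (entry == null) {
      return null;
    }
    if (clock.getAsLong() >= entry.expiresAtMillis) {
      // Remove only if the mapping is still the expired entry we just read.
      entries.remove(qualifiedPath, entry);
      return null;
    }
    return entry.status;
  }
}
```

With the 5 second default from the comment, a caller would construct it as `new TtlStatusCache<>(5000L, System::currentTimeMillis)`. Expired entries are only dropped when they are looked up again, so a real implementation would also need a size bound or periodic cleanup.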