[jira] [Commented] (HADOOP-12876) [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent operations

2017-12-04 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276923#comment-16276923
 ] 

Steve Loughran commented on HADOOP-12876:
-

This could be done with the code in HADOOP-15038 once it is made common.

> [Azure Data Lake] Support for process level FileStatus cache to optimize 
> GetFileStatus frequent operations
> -
>
> Key: HADOOP-12876
> URL: https://issues.apache.org/jira/browse/HADOOP-12876
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs, fs/adl, tools
> Reporter: Vishwajeet Dusane
> Assignee: Vishwajeet Dusane
>
> Add support for caching GetFileStatus and ListStatus responses locally for a 
> limited period of time. A short-lived local cache would reduce the number of 
> GetFileStatus calls.
> One example where such a cache would be useful: in terasort, a ListStatus on 
> the input directory is followed by a GetFileStatus operation on each file 
> within that directory. For a directory with 2048 input files, this would save 
> 2048 GetFileStatus calls during startup (by using the ListStatus response to 
> cache FileStatus instances).
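
The terasort scenario in the description can be illustrated with a small sketch. Everything here is hypothetical and exists only for illustration: `CountingRemoteStore` is a stand-in for the remote backend (not the ADL client), `ListPrimedClient` is an invented wrapper, and a status is reduced to its path string instead of a real `FileStatus`. Expiry and thread safety are left out to keep the call counting visible.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical remote store that counts round trips; not the ADL client. */
final class CountingRemoteStore {
    int listCalls = 0;
    int getCalls = 0;
    private final List<String> files = new ArrayList<>();

    CountingRemoteStore(String dir, int fileCount) {
        for (int i = 0; i < fileCount; i++) {
            files.add(dir + "/part-" + i);
        }
    }

    /** One round trip returns the status (here: the path) of every file. */
    List<String> listStatus() {
        listCalls++;
        return new ArrayList<>(files);
    }

    /** One round trip per file; returns null for a nonexistent path. */
    String getFileStatus(String path) {
        getCalls++;
        return files.contains(path) ? path : null;
    }
}

/** Hypothetical client wrapper that primes a local cache from ListStatus. */
final class ListPrimedClient {
    private final CountingRemoteStore remote;
    private final Map<String, String> cache = new HashMap<>();

    ListPrimedClient(CountingRemoteStore remote) {
        this.remote = remote;
    }

    List<String> listStatus() {
        List<String> result = remote.listStatus();
        for (String status : result) {
            cache.put(status, status);   // remember each entry of the listing
        }
        return result;
    }

    String getFileStatus(String path) {
        String cached = cache.get(path);
        if (cached != null) {
            return cached;               // served locally, no round trip
        }
        String fetched = remote.getFileStatus(path);
        if (fetched != null) {
            cache.put(path, fetched);    // only existing paths are cached
        }
        return fetched;
    }
}

final class ListPrimingDemo {
    public static void main(String[] args) {
        CountingRemoteStore remote = new CountingRemoteStore("/terasort/input", 2048);
        ListPrimedClient client = new ListPrimedClient(remote);
        for (String path : client.listStatus()) {
            client.getFileStatus(path);
        }
        System.out.println("ListStatus round trips: " + remote.listCalls);
        System.out.println("GetFileStatus round trips: " + remote.getCalls);
    }
}
```

With 2048 files, the demo ends with one ListStatus round trip and zero GetFileStatus round trips, which is the saving the description refers to. Lookups for paths that were not in the listing still go to the remote store every time.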



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-12876) [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent operations

2016-09-18 Thread Kai Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15501914#comment-15501914
 ] 

Kai Zheng commented on HADOOP-12876:


Thanks [~fabbri] for the pointer; it helps a lot.




[jira] [Commented] (HADOOP-12876) [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent operations

2016-09-18 Thread Aaron Fabbri (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15501716#comment-15501716
 ] 

Aaron Fabbri commented on HADOOP-12876:
---

FYI we're working on something similar for S3.  The umbrella JIRA is 
HADOOP-13448 and the in-memory implementation is HADOOP-13452.  Once this work 
is complete, it should be easy to refactor so you can access the 
LocalMetadataStore from the ADLS client.




[jira] [Commented] (HADOOP-12876) [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent operations

2016-09-17 Thread Kai Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15499988#comment-15499988
 ] 

Kai Zheng commented on HADOOP-12876:


A very interesting idea to cache FileStatus objects for later use!

Good thoughts, [~ste...@apache.org]! 
bq. Would this be per FS instance, or across all instances? ...
I would suggest we start with the easy case and make the cache per instance. 
Sharing the cache across multiple instances may cause more trouble than it is 
worth: besides the security concern you raised, it would also be hard to 
enforce cache operations or keep the instances in sync.

bq. Making it generic to filesystems would allow use elsewhere (Swift, S3a), as 
well as isolated testing
This is a very good idea. The optimization looks promising for all file 
systems backed by remote stores, and perhaps even for HDFS. Should we open a 
base issue to lay the foundation first, not specific to any cloud? Once that 
is done, this issue becomes simple, since most of the work would already live 
in the common base. Maybe an abstract class like 
{{CloudFileSystem extends FileSystem}} that all cloud-backed file systems 
inherit from? That way we can share much of the code among the specific 
implementations.

bq. would there be cache invalidation after file/directory operations?
Right. We would need to consider how and when cached FileStatus objects are 
invalidated or cleaned up. Consider a file that is listed and then deleted 
shortly afterwards ...

Does this need help? If so, it would be good to break this down, and I can 
take some of the tasks. Thanks.
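
The invalidation question can be made concrete with a small sketch of a per-instance cache that evicts entries when the owning file system instance deletes or renames a path. This is only an illustration under stated assumptions: the class `InvalidatingStatusCache` and its method names are hypothetical, and a file length stands in for a full `FileStatus`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical per-instance cache keyed by path. The value is a file length,
 * used here as a stand-in for a FileStatus object.
 */
final class InvalidatingStatusCache {
    private final Map<String, Long> lengths = new ConcurrentHashMap<>();

    void put(String path, long length) {
        lengths.put(path, length);
    }

    /** Returns the cached length, or null if the path is not cached. */
    Long get(String path) {
        return lengths.get(path);
    }

    /** Evicts the path itself and every cached entry beneath it. */
    void invalidateTree(String path) {
        String prefix = path.endsWith("/") ? path : path + "/";
        lengths.keySet().removeIf(k -> k.equals(path) || k.startsWith(prefix));
    }

    /** To be called by the owning instance after a successful delete. */
    void onDelete(String path) {
        invalidateTree(path);
    }

    /** To be called after a successful rename; both ends become stale. */
    void onRename(String src, String dst) {
        invalidateTree(src);
        invalidateTree(dst);
    }
}

final class InvalidationDemo {
    public static void main(String[] args) {
        InvalidatingStatusCache cache = new InvalidatingStatusCache();
        cache.put("/data/in/a", 10L);
        cache.put("/data/in/b", 20L);
        cache.onDelete("/data/in");
        System.out.println("a cached after delete: " + (cache.get("/data/in/a") != null));
    }
}
```

Hooks like these only cover changes made through the same instance. A file deleted by another process would still be served from the cache until the entry expires, which is the case raised above.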




[jira] [Commented] (HADOOP-12876) [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent operations

2016-06-10 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324265#comment-15324265
 ] 

Steve Loughran commented on HADOOP-12876:
-

# Would this be per FS instance, or across all instances? The latter, 
even though nominally more efficient if there is >1 FS against the same URI, 
has some security implications for any service fielding requests from >1 
principal.
# Guava has a cache with eviction policies; I'd go with that unless it is 
fundamentally broken.
# Making it generic to filesystems would allow use elsewhere (Swift, S3a), as 
well as isolated testing.
# Would there be cache invalidation after file/directory operations?




[jira] [Commented] (HADOOP-12876) [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent operations

2016-03-21 Thread Vishwajeet Dusane (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15204575#comment-15204575
 ] 

Vishwajeet Dusane commented on HADOOP-12876:


[~twu] Once HADOOP-12666 is resolved, I will incorporate the FileStatus cache 
changes into the latest ASF code and post a patch here.



[jira] [Commented] (HADOOP-12876) [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent operations

2016-03-21 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15204524#comment-15204524
 ] 

Tony Wu commented on HADOOP-12876:
--

Hi [~vishwajeet.dusane], Thanks a lot for creating a separate JIRA to discuss 
the file status cache. I noticed you have removed the relevant code (i.e. 
{{FileStatusCacheManager}}) from the latest patch in HADOOP-12666. Do you mind 
reposting the cache implementation here?

I think you can post a patch for this JIRA based off the latest patch for 
HADOOP-12666.




[jira] [Commented] (HADOOP-12876) [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent operations

2016-03-03 Thread Vishwajeet Dusane (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177700#comment-15177700
 ] 

Vishwajeet Dusane commented on HADOOP-12876:


 * The FileStatus cache is a simple process-level cache that mirrors the 
FileStatus objects of the backend storage.
 * The time to live of a cached FileStatus object is limited: 5 seconds by 
default, configurable through core-site.xml.
 * FileStatus objects are stored in a synchronized LinkedHashMap, where the key 
is the fully qualified file path and the value is the FileStatus Java object 
along with its time to live information.
 * The cache is populated from successful responses to GetFileStatus and 
ListStatus calls for existing files/folders. Nonexistent files/folders are not 
kept in the cache.
 * The motivation is to avoid repeated GetFileStatus calls to the ADL backend 
and thereby improve performance at job startup and during execution.
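
A minimal sketch of the design in the list above, assuming nothing beyond what the bullets state. It is not the `FileStatusCacheManager` from the HADOOP-12666 patch: the names `TtlFileStatusCache` and `StatusStub` are hypothetical, `StatusStub` stands in for `org.apache.hadoop.fs.FileStatus` so the example compiles without Hadoop, and the time to live is passed to the constructor because the thread does not name the core-site.xml property.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for org.apache.hadoop.fs.FileStatus. */
final class StatusStub {
    final String path;
    final long length;

    StatusStub(String path, long length) {
        this.path = path;
        this.length = length;
    }
}

/** Process-level cache: fully qualified path -> status plus expiry time. */
final class TtlFileStatusCache {
    /** Cached status together with its time-to-live information. */
    private static final class Entry {
        final StatusStub status;
        final long expiresAtMillis;

        Entry(StatusStub status, long expiresAtMillis) {
            this.status = status;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> entries =
            Collections.synchronizedMap(new LinkedHashMap<String, Entry>());
    private final long ttlMillis;
    private final LongSupplier clock;   // injectable so expiry can be tested

    TtlFileStatusCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Records a successful GetFileStatus or ListStatus result. */
    void put(StatusStub status) {
        entries.put(status.path, new Entry(status, clock.getAsLong() + ttlMillis));
    }

    /** Returns the cached status, or null if it is absent or has expired. */
    StatusStub get(String path) {
        Entry entry = entries.get(path);
        if (entry == null) {
            return null;
        }
        if (clock.getAsLong() >= entry.expiresAtMillis) {
            entries.remove(path, entry);   // drop the stale entry
            return null;
        }
        return entry.status;
    }

    int size() {
        return entries.size();
    }
}

final class TtlCacheDemo {
    public static void main(String[] args) {
        // 5 seconds, matching the default mentioned in the list above.
        TtlFileStatusCache cache = new TtlFileStatusCache(5000L, System::currentTimeMillis);
        cache.put(new StatusStub("adl://example/dir/file", 42L));
        System.out.println("hit: " + (cache.get("adl://example/dir/file") != null));
    }
}
```

A caller would consult `get` before issuing a remote GetFileStatus and call `put` after each successful response. The sketch never bounds the map size; a production cache would also need an eviction limit.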


> [Azure Data Lake] Support for process level FileStatus cache to optimize 
> GetFileStatus frequent operations
> -
>
> Key: HADOOP-12876
> URL: https://issues.apache.org/jira/browse/HADOOP-12876
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs, fs/azure, tools
> Reporter: Vishwajeet Dusane
> Assignee: Vishwajeet Dusane
>
> Add support for caching GetFileStatus and ListStatus responses locally for a 
> limited period of time. A short-lived local cache would reduce the number of 
> GetFileStatus calls.
> One example where such a cache would be useful: in terasort, a ListStatus on 
> the input directory is followed by a GetFileStatus operation on each file 
> within that directory. For a directory with 2048 input files, this would save 
> 2048 GetFileStatus calls during startup (by using the ListStatus response to 
> cache FileStatus instances).


