[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-10-10 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HADOOP-13208:
---
Affects Version/s: (was: 2.8.0)
 Target Version/s: 2.8.0  (was: 2.9.0)
Fix Version/s: (was: 2.9.0)
   2.8.0

I cherry-picked this to branch-2.8 for inclusion 2.8.0.

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch, 
> HADOOP-13208-branch-2-017.patch, HADOOP-13208-branch-2-018.patch, 
> HADOOP-13208-branch-2-019.patch, HADOOP-13208-branch-2-020.patch, 
> HADOOP-13208-branch-2-021.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-08-17 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HADOOP-13208:
---
Release Note: S3A has optimized the listFiles method by doing a bulk 
listing of all entries under a path in a single S3 operation instead of 
recursively walking the directory tree.  The listLocatedStatus method has been 
optimized by fetching results from S3 lazily as the caller traverses the 
returned iterator instead of doing an eager fetch of all possible results.

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 2.9.0
>
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch, 
> HADOOP-13208-branch-2-017.patch, HADOOP-13208-branch-2-018.patch, 
> HADOOP-13208-branch-2-019.patch, HADOOP-13208-branch-2-020.patch, 
> HADOOP-13208-branch-2-021.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-08-17 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HADOOP-13208:
---
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.9.0
   Status: Resolved  (was: Patch Available)

+1 for patch 021.  I have committed this to trunk and branch-2.  
[~ste...@apache.org], thank you for the contribution.  [~fabbri], thank you for 
your code review.

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 2.9.0
>
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch, 
> HADOOP-13208-branch-2-017.patch, HADOOP-13208-branch-2-018.patch, 
> HADOOP-13208-branch-2-019.patch, HADOOP-13208-branch-2-020.patch, 
> HADOOP-13208-branch-2-021.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-08-17 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HADOOP-13208:
---
Attachment: HADOOP-13208-branch-2-021.patch

Steve, thank you for patch 020.  I verified all tests passing when applied to 
trunk, running against an S3 bucket in US-west-2.  Refactoring to the new 
{{Listing}} class is helpful.

I'm attaching patch 021 with some trivial changes to get us a clean pre-commit 
run:

* Correct JavaDoc parameter name that was causing a JavaDoc failure for Java 8.
* Wrap some lines that were longer than 80 characters in {{Listing}}.

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch, 
> HADOOP-13208-branch-2-017.patch, HADOOP-13208-branch-2-018.patch, 
> HADOOP-13208-branch-2-019.patch, HADOOP-13208-branch-2-020.patch, 
> HADOOP-13208-branch-2-021.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-08-17 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Patch Available  (was: Open)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch, 
> HADOOP-13208-branch-2-017.patch, HADOOP-13208-branch-2-018.patch, 
> HADOOP-13208-branch-2-019.patch, HADOOP-13208-branch-2-020.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-08-17 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Attachment: HADOOP-13208-branch-2-020.patch

Patch 020

* moves out all the new listing classes into a new class {{Listing}}, which is 
bonded to an S3a owner and makes calls on it to do the listing.
* tries to use package-private to reduce scope of member access changes and 
class visibility.
* addresses chris's javadoc comment

regarding the trunk merge; looks like theres an out-of-order import in trunk 
that 's been fixed in Branch-2; the new imports for listings hit this. Moving 
the code out will dodge the problem, though it still lurks

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch, 
> HADOOP-13208-branch-2-017.patch, HADOOP-13208-branch-2-018.patch, 
> HADOOP-13208-branch-2-019.patch, HADOOP-13208-branch-2-020.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-08-17 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Open  (was: Patch Available)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch, 
> HADOOP-13208-branch-2-017.patch, HADOOP-13208-branch-2-018.patch, 
> HADOOP-13208-branch-2-019.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-08-17 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Patch Available  (was: Open)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch, 
> HADOOP-13208-branch-2-017.patch, HADOOP-13208-branch-2-018.patch, 
> HADOOP-13208-branch-2-019.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-08-17 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Attachment: HADOOP-13208-branch-2-019.patch

Patch 019


@Aaron

Returning the array directly; also renamed the param 'f' to 'path' for clarity.

@cnauroth

Why split the two iterators up? (i) to make it suspiciously close to 
java-8-lang lambdas and (ii) hopefully to line things up better for a high 
performance globber.

2. I've pushed the list call down into the `ObjectListingIterator` but left 
delimiter calculation to the caller.

3. FileStatusGenerator. I had a plan there to maybe generate liststatus values 
directly, but I'll postpone that plan & have inlined thaat code.

nits: all addressed



> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch, 
> HADOOP-13208-branch-2-017.patch, HADOOP-13208-branch-2-018.patch, 
> HADOOP-13208-branch-2-019.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-08-17 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Open  (was: Patch Available)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch, 
> HADOOP-13208-branch-2-017.patch, HADOOP-13208-branch-2-018.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-23 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Patch Available  (was: Open)

The no of requests is 5+(N/5000) for the curious, where N= # of files + mock 
directories under path {{P}}.

Contrast with the recursive walker, where time is something like 4* 
real-or-mock-directories + files/5000

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch, 
> HADOOP-13208-branch-2-017.patch, HADOOP-13208-branch-2-018.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-23 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Attachment: HADOOP-13208-branch-2-018.patch

Patch 018; in sync with branch-2; all glob code stripped

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch, 
> HADOOP-13208-branch-2-017.patch, HADOOP-13208-branch-2-018.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-23 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Open  (was: Patch Available)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch, 
> HADOOP-13208-branch-2-017.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-22 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Attachment: HADOOP-13208-branch-2-017.patch

This is the aws module in sync with HADOOP-13207 patch 017, for curious

this diff also contains something which isn't ready for review (and which I 
will move off to the side): the tests and initial implementation of an 
s3a-specific globber. Regression tests are all off the HDFS glob tests.

There's no attempt at optimising the s3a globber yet; that will be the real 
work for the globber patch and I haven't yet started that.

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch, 
> HADOOP-13208-branch-2-017.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-22 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Patch Available  (was: Open)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch, 
> HADOOP-13208-branch-2-017.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-22 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Open  (was: Patch Available)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-13 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Attachment: HADOOP-13208-branch-2-012.patch

HADOOP-13208 patch 012: javadoc failing now that some inner classes were 
private; converted @link to @code

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-13 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Patch Available  (was: Open)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-13 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Open  (was: Patch Available)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-13 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Patch Available  (was: Open)

tested agains s3 ireland. With the modified test, I'm showing a 20x speedup for 
listing the specific directory tree created for this test:
{code}
2016-07-13 14:56:42,370 [Thread-8] INFO  contract.ContractTestUtils 
(ContractTestUtils.java:end(1362)) - Duration of List status via treewalk of 48 
directories and 6 files: 5,964,773,672 nS
2016-07-13 14:56:42,370 [Thread-8] INFO  scale.TestS3ADirectoryPerformance 
(S3ATestUtils.java:print(212)) - object_metadata_requests starting=444 
current=542 diff=98
2016-07-13 14:56:42,370 [Thread-8] INFO  scale.TestS3ADirectoryPerformance 
(S3ATestUtils.java:print(212)) - object_list_requests starting=132 current=184 
diff=52
2016-07-13 14:56:42,370 [Thread-8] INFO  scale.TestS3ADirectoryPerformance 
(S3ATestUtils.java:print(212)) - object_continue_list_requests starting=0 
current=0 diff=0
2016-07-13 14:56:42,370 [Thread-8] INFO  scale.TestS3ADirectoryPerformance 
(S3ATestUtils.java:print(212)) - op_list_files starting=0 current=0 diff=0
2016-07-13 14:56:42,370 [Thread-8] INFO  scale.TestS3ADirectoryPerformance 
(S3ATestUtils.java:print(212)) - op_get_file_status starting=225 current=274 
diff=49
2016-07-13 14:56:42,371 [Thread-8] INFO  scale.S3AScaleTestBase 
(S3AScaleTestBase.java:describe(155)) - 

testListOperations: Listing files via listFiles(recursive=true)

2016-07-13 14:56:42,579 [Thread-8] INFO  contract.ContractTestUtils 
(ContractTestUtils.java:end(1362)) - Duration of listFiles(recursive=true) of 
48 directories and 6 files: 208,461,316 nS
2016-07-13 14:56:42,580 [Thread-8] INFO  scale.TestS3ADirectoryPerformance 
(S3ATestUtils.java:print(212)) - object_metadata_requests starting=542 
current=544 diff=2
2016-07-13 14:56:42,580 [Thread-8] INFO  scale.TestS3ADirectoryPerformance 
(S3ATestUtils.java:print(212)) - object_list_requests starting=184 current=186 
diff=2
2016-07-13 14:56:42,580 [Thread-8] INFO  scale.TestS3ADirectoryPerformance 
(S3ATestUtils.java:print(212)) - object_continue_list_requests starting=0 
current=0 diff=0
2016-07-13 14:56:42,580 [Thread-8] INFO  scale.TestS3ADirectoryPerformance 
(S3ATestUtils.java:print(212)) - op_list_files starting=0 current=1 diff=1
2016-07-13 14:56:42,580 [Thread-8] INFO  scale.TestS3ADirectoryPerformance 
(S3ATestUtils.java:print(212)) - op_get_file_status starting=274 current=275 
diff=1
2016-07-13 14:56:42,580 [Thread-8] INFO  scale.S3AScaleTestBase 
(S3AScaleTestBase.java:describe(155)) - 
{code}
Obviously, that could be un-representative, but it does show the speedup we can 
obtain

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-13 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Attachment: HADOOP-13208-branch-2-011.patch

Patch 011: fix up TestS3ADirectoryPerformance to correctly validate the number 
of requests made during the listStatus() call, changed the assert to only 
expect two object list requests (the initial is-directory probe and the 
followup listing). Split the counter of object list requests into one for the 
initial request and one for followup actions.

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-13 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Open  (was: Patch Available)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-13 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Patch Available  (was: Open)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-13 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Attachment: HADOOP-13208-branch-2-010.patch

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-13 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Attachment: (was: HADOOP-13208-branch-2-009.patch)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-13 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Open  (was: Patch Available)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-13 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Target Version/s: 2.9.0
  Status: Open  (was: Patch Available)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-009.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-13 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Patch Available  (was: Open)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-009.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-13 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Attachment: HADOOP-13208-branch-2-009.patch

Patch 010. Address checkstyle and findbugs warnings.

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-009.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-12 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Patch Available  (was: Open)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-12 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Attachment: HADOOP-13208-branch-2-009.patch

Patch 009.

# specify RemoteIterator in {{filesystem.md}}, specifically that 
implementations MUST return finite sequence, that a {{while(true) next();}} is 
a valid iteration, and cover concurrency issues.
# listing tests also grab the remote iterators returned and do the 
{{while(true)}} evaluation, verifying the results match.
# Fix S3AFileSystem iterators to correctly pass these new tests (the other 
filesystems, which don't override the default list* operations, all pass)
# javadoc all RemoteIterators in S3AFileSystem, so explaining chained operations


> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-11 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Attachment: HADOOP-13208-branch-2-008.patch

Patch 008; rebased to branch-2...some merge problems related to import 
declarations in {{S3AFileSystem}} fixed

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-11 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Open  (was: Patch Available)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-07-01 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Patch Available  (was: Open)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-06-30 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Attachment: HADOOP-13208-branch-2-007.patch

This is patch -007, which I'm numbering to keep in sync with the HADOOP-13207 
patch of the same version.

This patch has the filesystem.md doc changes of that patch, and fixes the 
javadoc merge error that had crept in during rebasing.

Test: S3 Ireland

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-06-30 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Open  (was: Patch Available)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-06-29 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Status: Patch Available  (was: Open)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-06-29 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Attachment: HADOOP-13208-branch-2-001.patch

Patch 001

incorporates HADOOP-13207 
* define what the various list* commands do
* adds tests in the get file status contract test suite, which are 
automatically picked up by all implementation subclasses classes.
* adds some root dir listing tests to {{AbstractContractRootDirectoryTest}}, 
not because they manipulate the root, but to guarantee that test runs which 
parallelize the tests (e.g. hadoop-aws) run the root listing tests in a suite 
which would have to be serialized already.

And in S3A

* pull out the {{listObject}} request construction
* add a {{RemoteIterator}} {{ObjectListingIterator}} to go through the results 
of a listing
* add the {{RemoteIterator}} implementations {{FileStatusListingIterator}} and 
{{LocatedFileStatusIterator}} to take an {{ObjectListingIterator}}, do 
filtering and file status creation when iterated over.
* changed the listing operations to use these iterators, where appropriate.

The result is that we have a fairly functional programming-ish model of working 
across the listed entities, and can do what is essentially O(1) listing of 
directory trees. Specifically, It's {{1+ O(files/max-keys)}}: the number of 
child directories is not a factor except in returning some extra paths to be 
discarded.

Testing: S3 frankfurt, also tested azure, openstack, localfs, hdfs


> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Attachments: HADOOP-13208-branch-2-001.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13208) S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

2016-06-14 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13208:

Summary: S3A listFiles(recursive=true) to do a bulk listObjects instead of 
walking the pseudo-tree of directories  (was: listFiles(recursive=true) to do a 
bulk listObjects instead of walking the pseudo-tree of directories)

> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> 
>
> Key: HADOOP-13208
> URL: https://issues.apache.org/jira/browse/HADOOP-13208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org