[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2022-01-05 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17469445#comment-17469445
 ] 

Hadoop QA commented on MAPREDUCE-5907:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  9s{color} 
| {color:red}{color} | {color:red} MAPREDUCE-5907 does not apply to trunk. 
Rebase required? Wrong Branch? See 
https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | MAPREDUCE-5907 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12648040/MAPREDUCE-5907-3.patch
 |
| Console output | 
https://ci-hadoop.apache.org/job/PreCommit-MAPREDUCE-Build/87/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org |


This message was automatically generated.



> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.4.0
>Reporter: Sumit Kumar
>Assignee: Sumit Kumar
>Priority: Major
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>
> FileInputFormat (both mapreduce and mapred implementations) use recursive 
> listing while calculating splits. They however do this by doing listing level 
> by level. That means to discover files in /foo/bar means they do listing at 
> /foo/bar first to get the immediate children, then make the same call on all 
> immediate children for /foo/bar to discover their immediate children and so 
> on. This doesn't scale well for object store based fs implementations like s3 
> and swift because every listStatus call ends up being a webservice call to 
> backend. In cases where large number of files are considered for input, this 
> makes getSplits() call slow. 
> This patch adds a new set of recursive list apis that gives opportunity to 
> the fs implementations to optimize. The behavior remains the same for other 
> implementations (that is a default implementation is provided for other fs so 
> they don't have to implement anything new). However for objectstore based fs 
> implementations it provides a simple change to include recursive flag as true 
> (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2022-01-05 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17469434#comment-17469434
 ] 

Steve Loughran commented on MAPREDUCE-5907:
---

abfs now does incremental listing, but not deep ones

> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.4.0
>Reporter: Sumit Kumar
>Assignee: Sumit Kumar
>Priority: Major
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>
> FileInputFormat (both mapreduce and mapred implementations) use recursive 
> listing while calculating splits. They however do this by doing listing level 
> by level. That means to discover files in /foo/bar means they do listing at 
> /foo/bar first to get the immediate children, then make the same call on all 
> immediate children for /foo/bar to discover their immediate children and so 
> on. This doesn't scale well for object store based fs implementations like s3 
> and swift because every listStatus call ends up being a webservice call to 
> backend. In cases where large number of files are considered for input, this 
> makes getSplits() call slow. 
> This patch adds a new set of recursive list apis that gives opportunity to 
> the fs implementations to optimize. The behavior remains the same for other 
> implementations (that is a default implementation is provided for other fs so 
> they don't have to implement anything new). However for objectstore based fs 
> implementations it provides a simple change to include recursive flag as true 
> (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2020-03-31 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072007#comment-17072007
 ] 

Hadoop QA commented on MAPREDUCE-5907:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  7s{color} 
| {color:red} MAPREDUCE-5907 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | MAPREDUCE-5907 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12648040/MAPREDUCE-5907-3.patch
 |
| Console output | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7740/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.4.0
>Reporter: Sumit Kumar
>Assignee: Sumit Kumar
>Priority: Major
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>
> FileInputFormat (both mapreduce and mapred implementations) use recursive 
> listing while calculating splits. They however do this by doing listing level 
> by level. That means to discover files in /foo/bar means they do listing at 
> /foo/bar first to get the immediate children, then make the same call on all 
> immediate children for /foo/bar to discover their immediate children and so 
> on. This doesn't scale well for object store based fs implementations like s3 
> and swift because every listStatus call ends up being a webservice call to 
> backend. In cases where large number of files are considered for input, this 
> makes getSplits() call slow. 
> This patch adds a new set of recursive list apis that gives opportunity to 
> the fs implementations to optimize. The behavior remains the same for other 
> implementations (that is a default implementation is provided for other fs so 
> they don't have to implement anything new). However for objectstore based fs 
> implementations it provides a simple change to include recursive flag as true 
> (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2018-05-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460922#comment-16460922
 ] 

Hadoop QA commented on MAPREDUCE-5907:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  8s{color} 
| {color:red} MAPREDUCE-5907 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | MAPREDUCE-5907 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12648040/MAPREDUCE-5907-3.patch
 |
| Console output | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7406/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.4.0
>Reporter: Sumit Kumar
>Assignee: Sumit Kumar
>Priority: Major
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>
> FileInputFormat (both mapreduce and mapred implementations) use recursive 
> listing while calculating splits. They however do this by doing listing level 
> by level. That means to discover files in /foo/bar means they do listing at 
> /foo/bar first to get the immediate children, then make the same call on all 
> immediate children for /foo/bar to discover their immediate children and so 
> on. This doesn't scale well for object store based fs implementations like s3 
> and swift because every listStatus call ends up being a webservice call to 
> backend. In cases where large number of files are considered for input, this 
> makes getSplits() call slow. 
> This patch adds a new set of recursive list apis that gives opportunity to 
> the fs implementations to optimize. The behavior remains the same for other 
> implementations (that is a default implementation is provided for other fs so 
> they don't have to implement anything new). However for objectstore based fs 
> implementations it provides a simple change to include recursive flag as true 
> (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2018-02-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360699#comment-16360699
 ] 

Hadoop QA commented on MAPREDUCE-5907:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  4s{color} 
| {color:red} MAPREDUCE-5907 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | MAPREDUCE-5907 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12648040/MAPREDUCE-5907-3.patch
 |
| Console output | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7333/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.4.0
>Reporter: Sumit Kumar
>Assignee: Sumit Kumar
>Priority: Major
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>
> FileInputFormat (both mapreduce and mapred implementations) use recursive 
> listing while calculating splits. They however do this by doing listing level 
> by level. That means to discover files in /foo/bar means they do listing at 
> /foo/bar first to get the immediate children, then make the same call on all 
> immediate children for /foo/bar to discover their immediate children and so 
> on. This doesn't scale well for object store based fs implementations like s3 
> and swift because every listStatus call ends up being a webservice call to 
> backend. In cases where large number of files are considered for input, this 
> makes getSplits() call slow. 
> This patch adds a new set of recursive list apis that gives opportunity to 
> the fs implementations to optimize. The behavior remains the same for other 
> implementations (that is a default implementation is provided for other fs so 
> they don't have to implement anything new). However for objectstore based fs 
> implementations it provides a simple change to include recursive flag as true 
> (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2018-02-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360698#comment-16360698
 ] 

Steve Loughran commented on MAPREDUCE-5907:
---

Catching up on this after spending many weeks in HADOOP-13786

* Performance boosts with s3 are moot because both the v1 and v2 algorithms are 
unreliable with raw s3, and slow even with a consistency layer; it's obsolete.
* none of the other stores *currently* do anything with the listFiles

accordingly, we don't see any immediate benefit in moving to the API, But 
if/when someone does implement the listFiles call in a store which also offers 
consistent metadata and O(1) renames, then this will help, especially in v1, 
where the listing is done in the serialized phase of job commit

> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.4.0
>Reporter: Sumit Kumar
>Assignee: Sumit Kumar
>Priority: Major
>  Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>
> FileInputFormat (both mapreduce and mapred implementations) use recursive 
> listing while calculating splits. They however do this by doing listing level 
> by level. That means to discover files in /foo/bar means they do listing at 
> /foo/bar first to get the immediate children, then make the same call on all 
> immediate children for /foo/bar to discover their immediate children and so 
> on. This doesn't scale well for object store based fs implementations like s3 
> and swift because every listStatus call ends up being a webservice call to 
> backend. In cases where large number of files are considered for input, this 
> makes getSplits() call slow. 
> This patch adds a new set of recursive list apis that gives opportunity to 
> the fs implementations to optimize. The behavior remains the same for other 
> implementations (that is a default implementation is provided for other fs so 
> they don't have to implement anything new). However for objectstore based fs 
> implementations it provides a simple change to include recursive flag as true 
> (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2017-11-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260506#comment-16260506
 ] 

Hadoop QA commented on MAPREDUCE-5907:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  5s{color} 
| {color:red} MAPREDUCE-5907 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | MAPREDUCE-5907 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12648040/MAPREDUCE-5907-3.patch
 |
| Console output | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7230/console |
| Powered by | Apache Yetus 0.7.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.4.0
>Reporter: Sumit Kumar
>Assignee: Sumit Kumar
>  Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>
> FileInputFormat (both mapreduce and mapred implementations) use recursive 
> listing while calculating splits. They however do this by doing listing level 
> by level. That means to discover files in /foo/bar means they do listing at 
> /foo/bar first to get the immediate children, then make the same call on all 
> immediate children for /foo/bar to discover their immediate children and so 
> on. This doesn't scale well for object store based fs implementations like s3 
> and swift because every listStatus call ends up being a webservice call to 
> backend. In cases where large number of files are considered for input, this 
> makes getSplits() call slow. 
> This patch adds a new set of recursive list apis that gives opportunity to 
> the fs implementations to optimize. The behavior remains the same for other 
> implementations (that is a default implementation is provided for other fs so 
> they don't have to implement anything new). However for objectstore based fs 
> implementations it provides a simple change to include recursive flag as true 
> (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2017-08-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16116531#comment-16116531
 ] 

Hadoop QA commented on MAPREDUCE-5907:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  7s{color} 
| {color:red} MAPREDUCE-5907 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | MAPREDUCE-5907 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12648040/MAPREDUCE-5907-3.patch
 |
| Console output | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7047/console |
| Powered by | Apache Yetus 0.5.0   http://yetus.apache.org |


This message was automatically generated.



> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.4.0
>Reporter: Sumit Kumar
>Assignee: Sumit Kumar
>  Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>
> FileInputFormat (both mapreduce and mapred implementations) use recursive 
> listing while calculating splits. They however do this by doing listing level 
> by level. That means to discover files in /foo/bar means they do listing at 
> /foo/bar first to get the immediate children, then make the same call on all 
> immediate children for /foo/bar to discover their immediate children and so 
> on. This doesn't scale well for object store based fs implementations like s3 
> and swift because every listStatus call ends up being a webservice call to 
> backend. In cases where large number of files are considered for input, this 
> makes getSplits() call slow. 
> This patch adds a new set of recursive list apis that gives opportunity to 
> the fs implementations to optimize. The behavior remains the same for other 
> implementations (that is a default implementation is provided for other fs so 
> they don't have to implement anything new). However for objectstore based fs 
> implementations it provides a simple change to include recursive flag as true 
> (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2017-04-13 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967303#comment-15967303
 ] 

Steve Loughran commented on MAPREDUCE-5907:
---

seems good. don't feel bad about pinging me every week or so for a review: I 
know how frustrating it is for things to be left, and this patch is an example 
of how neglecting patch review is bad: the patch ages and the effort to pull it 
in increases.

Alongside this work, a test for S3A can go in: created HADOOP-14302. It's a 
separate JIRA so this patch doesn't depend on it.

[~blue_impala_48d6] might have some insights here, being as he'd benefit from 
this.

> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.4.0
>Reporter: Sumit Kumar
>Assignee: Sumit Kumar
>  Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>
> FileInputFormat (both mapreduce and mapred implementations) use recursive 
> listing while calculating splits. They however do this by doing listing level 
> by level. That means to discover files in /foo/bar means they do listing at 
> /foo/bar first to get the immediate children, then make the same call on all 
> immediate children for /foo/bar to discover their immediate children and so 
> on. This doesn't scale well for object store based fs implementations like s3 
> and swift because every listStatus call ends up being a webservice call to 
> backend. In cases where large number of files are considered for input, this 
> makes getSplits() call slow. 
> This patch adds a new set of recursive list apis that gives opportunity to 
> the fs implementations to optimize. The behavior remains the same for other 
> implementations (that is a default implementation is provided for other fs so 
> they don't have to implement anything new). However for objectstore based fs 
> implementations it provides a simple change to include recursive flag as true 
> (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2017-04-12 Thread Suraj Nayak (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966089#comment-15966089
 ] 

Suraj Nayak commented on MAPREDUCE-5907:


That's a good suggestion [~steve_l]. I thought the same. There were a lot of 
changes and improvements across frameworks. I'll put next comment on overview 
of changes what can be done. You can review it and give your suggestions. Will 
add the patch to it based on that.


> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.4.0
>Reporter: Sumit Kumar
>Assignee: Sumit Kumar
>  Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>
> FileInputFormat (both mapreduce and mapred implementations) use recursive 
> listing while calculating splits. They however do this by doing listing level 
> by level. That means to discover files in /foo/bar means they do listing at 
> /foo/bar first to get the immediate children, then make the same call on all 
> immediate children for /foo/bar to discover their immediate children and so 
> on. This doesn't scale well for object store based fs implementations like s3 
> and swift because every listStatus call ends up being a webservice call to 
> backend. In cases where large number of files are considered for input, this 
> makes getSplits() call slow. 
> This patch adds a new set of recursive list apis that gives opportunity to 
> the fs implementations to optimize. The behavior remains the same for other 
> implementations (that is a default implementation is provided for other fs so 
> they don't have to implement anything new). However for objectstore based fs 
> implementations it provides a simple change to include recursive flag as true 
> (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2017-04-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965716#comment-15965716
 ] 

Steve Loughran commented on MAPREDUCE-5907:
---

I don't know anyone looking at it.

It's an out of date patch, combining optimisations in the FS code, S3N and HAR 
FS implmentations, & changes in the MR Code to match

If the changes to the mapreduce module can go in today, using the existing 
{{FileSystem.listFiles(path, recursive}} call then it''ll be straightforward: 
that's the only bit which needs review and merge; S3A already handles that 
recursively very efficiently, and the other object stores can be brought up to 
speed.

If we need changes to the FS, well, I'm not against them (there's definite 
inconsistencies there), but it's a more serious change: the HDFS team will need 
to look at that, we'll need changes to the FS spec, contract tests, etc, etc. 
Lots of work and so harder to get in.

Why not see if you can apply just the MR changes, and what happens?

> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.4.0
>Reporter: Sumit Kumar
>Assignee: Sumit Kumar
>  Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>
> FileInputFormat (both mapreduce and mapred implementations) use recursive 
> listing while calculating splits. They however do this by doing listing level 
> by level. That means to discover files in /foo/bar means they do listing at 
> /foo/bar first to get the immediate children, then make the same call on all 
> immediate children for /foo/bar to discover their immediate children and so 
> on. This doesn't scale well for object store based fs implementations like s3 
> and swift because every listStatus call ends up being a webservice call to 
> backend. In cases where large number of files are considered for input, this 
> makes getSplits() call slow. 
> This patch adds a new set of recursive list apis that gives opportunity to 
> the fs implementations to optimize. The behavior remains the same for other 
> implementations (that is a default implementation is provided for other fs so 
> they don't have to implement anything new). However for objectstore based fs 
> implementations it provides a simple change to include recursive flag as true 
> (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2017-04-11 Thread Suraj Nayak (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965069#comment-15965069
 ] 

Suraj Nayak commented on MAPREDUCE-5907:


Yeah, Thanks [~ksumit] for the quick reply. I would wait on some committer to 
give some suggestion as of today so I or you can resume work if the issue is 
till open. There has been a lot of changes since the earlier suggestion was 
made :)


> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.4.0
>Reporter: Sumit Kumar
>Assignee: Sumit Kumar
>  Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>
> FileInputFormat (both mapreduce and mapred implementations) use recursive 
> listing while calculating splits. They however do this by doing listing level 
> by level. That means to discover files in /foo/bar means they do listing at 
> /foo/bar first to get the immediate children, then make the same call on all 
> immediate children for /foo/bar to discover their immediate children and so 
> on. This doesn't scale well for object store based fs implementations like s3 
> and swift because every listStatus call ends up being a webservice call to 
> backend. In cases where large number of files are considered for input, this 
> makes getSplits() call slow. 
> This patch adds a new set of recursive list apis that gives opportunity to 
> the fs implementations to optimize. The behavior remains the same for other 
> implementations (that is a default implementation is provided for other fs so 
> they don't have to implement anything new). However for objectstore based fs 
> implementations it provides a simple change to include recursive flag as true 
> (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2017-04-11 Thread Sumit Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965026#comment-15965026
 ] 

Sumit Kumar commented on MAPREDUCE-5907:


[~snayakm] Feel free to take over if you want to work on this. I've to confess 
that i lost focus on this one.

> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.4.0
>Reporter: Sumit Kumar
>Assignee: Sumit Kumar
>  Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>
> FileInputFormat (both mapreduce and mapred implementations) use recursive 
> listing while calculating splits. They however do this by doing listing level 
> by level. That means to discover files in /foo/bar means they do listing at 
> /foo/bar first to get the immediate children, then make the same call on all 
> immediate children for /foo/bar to discover their immediate children and so 
> on. This doesn't scale well for object store based fs implementations like s3 
> and swift because every listStatus call ends up being a webservice call to 
> backend. In cases where large number of files are considered for input, this 
> makes getSplits() call slow. 
> This patch adds a new set of recursive list apis that gives opportunity to 
> the fs implementations to optimize. The behavior remains the same for other 
> implementations (that is a default implementation is provided for other fs so 
> they don't have to implement anything new). However for objectstore based fs 
> implementations it provides a simple change to include recursive flag as true 
> (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2017-04-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965025#comment-15965025
 ] 

Hadoop QA commented on MAPREDUCE-5907:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 12s {color} 
| {color:red} MAPREDUCE-5907 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12648040/MAPREDUCE-5907-3.patch
 |
| JIRA Issue | MAPREDUCE-5907 |
| Console output | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6951/console |
| Powered by | Apache Yetus 0.3.0   http://yetus.apache.org |


This message was automatically generated.



> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.4.0
>Reporter: Sumit Kumar
>Assignee: Sumit Kumar
>  Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>
> FileInputFormat (both mapreduce and mapred implementations) use recursive 
> listing while calculating splits. They however do this by doing listing level 
> by level. That means to discover files in /foo/bar means they do listing at 
> /foo/bar first to get the immediate children, then make the same call on all 
> immediate children for /foo/bar to discover their immediate children and so 
> on. This doesn't scale well for object store based fs implementations like s3 
> and swift because every listStatus call ends up being a webservice call to 
> backend. In cases where large number of files are considered for input, this 
> makes getSplits() call slow. 
> This patch adds a new set of recursive list apis that gives opportunity to 
> the fs implementations to optimize. The behavior remains the same for other 
> implementations (that is a default implementation is provided for other fs so 
> they don't have to implement anything new). However for objectstore based fs 
> implementations it provides a simple change to include recursive flag as true 
> (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2017-04-11 Thread Suraj Nayak (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965018#comment-15965018
 ] 

Suraj Nayak commented on MAPREDUCE-5907:


Is this still open or some other JIRA was implemented to optimize? I see this 
patch was almost 2years ago. Any work required here ?  Like refactoring to 
trunk ?

> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.4.0
>Reporter: Sumit Kumar
>Assignee: Sumit Kumar
>  Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>
> FileInputFormat (both mapreduce and mapred implementations) use recursive 
> listing while calculating splits. They however do this by doing listing level 
> by level. That means to discover files in /foo/bar means they do listing at 
> /foo/bar first to get the immediate children, then make the same call on all 
> immediate children for /foo/bar to discover their immediate children and so 
> on. This doesn't scale well for object store based fs implementations like s3 
> and swift because every listStatus call ends up being a webservice call to 
> backend. In cases where large number of files are considered for input, this 
> makes getSplits() call slow. 
> This patch adds a new set of recursive list apis that gives opportunity to 
> the fs implementations to optimize. The behavior remains the same for other 
> implementations (that is a default implementation is provided for other fs so 
> they don't have to implement anything new). However for objectstore based fs 
> implementations it provides a simple change to include recursive flag as true 
> (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2015-05-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14524821#comment-14524821
 ] 

Hadoop QA commented on MAPREDUCE-5907:
--

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch |   0m  0s | The patch command could not apply 
the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12648040/MAPREDUCE-5907-3.patch 
|
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / f1a152c |
| Console output | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5546/console |


This message was automatically generated.

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
 MAPREDUCE-5907.patch


 FileInputFormat (both mapreduce and mapred implementations) use recursive 
 listing while calculating splits. They however do this by doing listing level 
 by level. That means to discover files in /foo/bar means they do listing at 
 /foo/bar first to get the immediate children, then make the same call on all 
 immediate children for /foo/bar to discover their immediate children and so 
 on. This doesn't scale well for object store based fs implementations like s3 
 and swift because every listStatus call ends up being a webservice call to 
 backend. In cases where large number of files are considered for input, this 
 makes getSplits() call slow. 
 This patch adds a new set of recursive list apis that gives opportunity to 
 the fs implementations to optimize. The behavior remains the same for other 
 implementations (that is a default implementation is provided for other fs so 
 they don't have to implement anything new). However for objectstore based fs 
 implementations it provides a simple change to include recursive flag as true 
 (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2015-05-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14524788#comment-14524788
 ] 

Hadoop QA commented on MAPREDUCE-5907:
--

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch |   0m  0s | The patch command could not apply 
the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12648040/MAPREDUCE-5907-3.patch 
|
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / f1a152c |
| Console output | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5543/console |


This message was automatically generated.

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
 MAPREDUCE-5907.patch


 FileInputFormat (both mapreduce and mapred implementations) use recursive 
 listing while calculating splits. They however do this by doing listing level 
 by level. That means to discover files in /foo/bar means they do listing at 
 /foo/bar first to get the immediate children, then make the same call on all 
 immediate children for /foo/bar to discover their immediate children and so 
 on. This doesn't scale well for object store based fs implementations like s3 
 and swift because every listStatus call ends up being a webservice call to 
 backend. In cases where large number of files are considered for input, this 
 makes getSplits() call slow. 
 This patch adds a new set of recursive list apis that gives opportunity to 
 the fs implementations to optimize. The behavior remains the same for other 
 implementations (that is a default implementation is provided for other fs so 
 they don't have to implement anything new). However for objectstore based fs 
 implementations it provides a simple change to include recursive flag as true 
 (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2014-06-05 Thread Sumit Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14019332#comment-14019332
 ] 

Sumit Kumar commented on MAPREDUCE-5907:


[~ste...@apache.org] Seeking your attention to this JIRA. Thanks!

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
 MAPREDUCE-5907.patch


 FileInputFormat (both mapreduce and mapred implementations) use recursive 
 listing while calculating splits. They however do this by doing listing level 
 by level. That means to discover files in /foo/bar means they do listing at 
 /foo/bar first to get the immediate children, then make the same call on all 
 immediate children for /foo/bar to discover their immediate children and so 
 on. This doesn't scale well for object store based fs implementations like s3 
 and swift because every listStatus call ends up being a webservice call to 
 backend. In cases where large number of files are considered for input, this 
 makes getSplits() call slow. 
 This patch adds a new set of recursive list apis that gives opportunity to 
 the fs implementations to optimize. The behavior remains the same for other 
 implementations (that is a default implementation is provided for other fs so 
 they don't have to implement anything new). However for objectstore based fs 
 implementations it provides a simple change to include recursive flag as true 
 (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2014-06-02 Thread Sumit Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015515#comment-14015515
 ] 

Sumit Kumar commented on MAPREDUCE-5907:


Added new changes to:
1. use iterator based listLocatedStatus apis instead of listStatus apis. 
Removed not-required recursive flavors of listStatus apis that could cause 
memory concerns raised in HADOOP-10634
2. change s3n implementation to use the listLocatedStatus api abstraction, this 
validated the implementation of the new recursive apis as well.
3. added a test case that demonstrates how such a recursive listing benefits 
s3N. The test case simulates an hourly rotated log aggregation and processing a 
year long data. Total number of calls reduces to just 10 instead of 360 calls 
(12 months * 30 days).
4. Fixed few bugs in InMemoryNativeFileSystemStore while validating the test 
case.

I did spend sometime on Swift object store implementation but it doesn't have 
that iteration based abstraction (neither at store level, nor at the filesystem 
level). Looking at the recursive implementation, for swift fs, it seems that it 
would try to get all the files/directories from the backend in just one 
webservice call. I suspect it would suffer from memory issues when such 
recursive calls are made. I may be wrong though so please correct me if i'm 
wrong.

[~ste...@apache.org] How should we deal with this? Are you aware of an 
iterative webservice api where we could list a swift fs directory recursively 
but in batches of say 1000 or 1 entries (as may seem appropriate).

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907.patch


 FileInputFormat (both mapreduce and mapred implementations) use recursive 
 listing while calculating splits. They however do this by doing listing level 
 by level. That means to discover files in /foo/bar means they do listing at 
 /foo/bar first to get the immediate children, then make the same call on all 
 immediate children for /foo/bar to discover their immediate children and so 
 on. This doesn't scale well for object store based fs implementations like s3 
 and swift because every listStatus call ends up being a webservice call to 
 backend. In cases where large number of files are considered for input, this 
 makes getSplits() call slow. 
 This patch adds a new set of recursive list apis that gives opportunity to 
 the fs implementations to optimize. The behavior remains the same for other 
 implementations (that is a default implementation is provided for other fs so 
 they don't have to implement anything new). However for objectstore based fs 
 implementations it provides a simple change to include recursive flag as true 
 (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2014-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015589#comment-14015589
 ] 

Hadoop QA commented on MAPREDUCE-5907:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12647921/MAPREDUCE-5907-2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 2 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4639//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4639//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4639//console

This message is automatically generated.

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907.patch


 FileInputFormat (both mapreduce and mapred implementations) use recursive 
 listing while calculating splits. They however do this by doing listing level 
 by level. That means to discover files in /foo/bar means they do listing at 
 /foo/bar first to get the immediate children, then make the same call on all 
 immediate children for /foo/bar to discover their immediate children and so 
 on. This doesn't scale well for object store based fs implementations like s3 
 and swift because every listStatus call ends up being a webservice call to 
 backend. In cases where large number of files are considered for input, this 
 makes getSplits() call slow. 
 This patch adds a new set of recursive list apis that gives opportunity to 
 the fs implementations to optimize. The behavior remains the same for other 
 implementations (that is a default implementation is provided for other fs so 
 they don't have to implement anything new). However for objectstore based fs 
 implementations it provides a simple change to include recursive flag as true 
 (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2014-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14016129#comment-14016129
 ] 

Hadoop QA commented on MAPREDUCE-5907:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12648040/MAPREDUCE-5907-3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-common-project/hadoop-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core:

  org.apache.hadoop.ha.TestZKFailoverControllerStress

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4641//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4641//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4641//console

This message is automatically generated.

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
 MAPREDUCE-5907.patch


 FileInputFormat (both mapreduce and mapred implementations) use recursive 
 listing while calculating splits. They however do this by doing listing level 
 by level. That means to discover files in /foo/bar means they do listing at 
 /foo/bar first to get the immediate children, then make the same call on all 
 immediate children for /foo/bar to discover their immediate children and so 
 on. This doesn't scale well for object store based fs implementations like s3 
 and swift because every listStatus call ends up being a webservice call to 
 backend. In cases where large number of files are considered for input, this 
 makes getSplits() call slow. 
 This patch adds a new set of recursive list apis that gives opportunity to 
 the fs implementations to optimize. The behavior remains the same for other 
 implementations (that is a default implementation is provided for other fs so 
 they don't have to implement anything new). However for objectstore based fs 
 implementations it provides a simple change to include recursive flag as true 
 (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2014-06-02 Thread Sumit Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14016132#comment-14016132
 ] 

Sumit Kumar commented on MAPREDUCE-5907:


Findbugs is complaining about following return statement:

{code:java}

  return (stat != null);
}
{code}

If we know how to make findbug happy for this case, i will be happy to 
resubmit. The failed test doesn't seem to be related to this patch (this test 
passed in the previous submission).

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
 MAPREDUCE-5907.patch


 FileInputFormat (both mapreduce and mapred implementations) use recursive 
 listing while calculating splits. They however do this by doing listing level 
 by level. That means to discover files in /foo/bar means they do listing at 
 /foo/bar first to get the immediate children, then make the same call on all 
 immediate children for /foo/bar to discover their immediate children and so 
 on. This doesn't scale well for object store based fs implementations like s3 
 and swift because every listStatus call ends up being a webservice call to 
 backend. In cases where large number of files are considered for input, this 
 makes getSplits() call slow. 
 This patch adds a new set of recursive list apis that gives opportunity to 
 the fs implementations to optimize. The behavior remains the same for other 
 implementations (that is a default implementation is provided for other fs so 
 they don't have to implement anything new). However for objectstore based fs 
 implementations it provides a simple change to include recursive flag as true 
 (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2014-05-31 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014706#comment-14014706
 ] 

Steve Loughran commented on MAPREDUCE-5907:
---

marking as dependency of HADOOP-10634

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: MAPREDUCE-5907.patch


 FileInputFormat (both mapreduce and mapred implementations) use recursive 
 listing while calculating splits. They however do this by doing listing level 
 by level. That means to discover files in /foo/bar means they do listing at 
 /foo/bar first to get the immediate children, then make the same call on all 
 immediate children for /foo/bar to discover their immediate children and so 
 on. This doesn't scale well for object store based fs implementations like s3 
 and swift because every listStatus call ends up being a webservice call to 
 backend. In cases where large number of files are considered for input, this 
 makes getSplits() call slow. 
 This patch adds a new set of recursive list apis that gives opportunity to 
 the fs implementations to optimize. The behavior remains the same for other 
 implementations (that is a default implementation is provided for other fs so 
 they don't have to implement anything new). However for objectstore based fs 
 implementations it provides a simple change to include recursive flag as true 
 (as shown in the patch) to improve listing performance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2014-05-29 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012238#comment-14012238
 ] 

Steve Loughran commented on MAPREDUCE-5907:
---

# ..provide a joint patch with both changes in
# presumably this patch is only going to benefit object stores where the cost 
of any ls operation is high due to the need to set up an HTTP link for each 
directory
# How tangible are the benefits for S3N? -in cluster and remote? 
# Have you implemented the same patch for the Swift object store? What were the 
results?


 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: MAPREDUCE-5907.patch


 This patch utilizes benefits of recursive listing proposed in 
 https://issues.apache.org/jira/browse/HADOOP-10634



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2014-05-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011521#comment-14011521
 ] 

Hadoop QA commented on MAPREDUCE-5907:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12647196/MAPREDUCE-5907.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:red}-1 javac{color:red}.  The patch appears to cause the build to 
fail.

Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4631//console

This message is automatically generated.

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.4.0
Reporter: Sumit Kumar
 Attachments: MAPREDUCE-5907.patch


 This patch utilizes benefits of recursive listing proposed in 
 https://issues.apache.org/jira/browse/HADOOP-10634



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2014-05-28 Thread Sumit Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011752#comment-14011752
 ] 

Sumit Kumar commented on MAPREDUCE-5907:


this failed to compile because of a dependency on 
https://issues.apache.org/jira/browse/HADOOP-10634

I'm not sure how to map this dependency here. Can someone help please?

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: MAPREDUCE-5907.patch


 This patch utilizes benefits of recursive listing proposed in 
 https://issues.apache.org/jira/browse/HADOOP-10634



--
This message was sent by Atlassian JIRA
(v6.2#6252)