[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2018-02-12 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated MAPREDUCE-5907:
--
Labels:   (was: BB2015-05-TBR)

> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: client
> Affects Versions: 2.4.0
> Reporter: Sumit Kumar
> Assignee: Sumit Kumar
> Priority: Major
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>
> FileInputFormat (both the mapreduce and mapred implementations) lists 
> directories recursively while calculating splits, but it does so level by 
> level. To discover the files under /foo/bar, it first lists /foo/bar to get 
> the immediate children, then makes the same call on each of those children 
> to discover their immediate children, and so on. This doesn't scale well for 
> object-store-based fs implementations such as s3 and swift, because every 
> listStatus call ends up being a web service call to the backend. When a 
> large number of files is considered for input, this makes the getSplits() 
> call slow. 
> This patch adds a new set of recursive list APIs that give fs 
> implementations the opportunity to optimize. The behavior remains the same 
> for other implementations (a default implementation is provided, so they 
> don't have to implement anything new). Object-store-based fs 
> implementations, however, only need a simple change that passes the 
> recursive flag as true (as shown in the patch) to improve listing 
> performance.
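
The cost difference described above can be illustrated with a small,
self-contained Java sketch. It is a toy model under stated assumptions, not
the code from the attached patches: ToyFs, ToyObjectStore, FlatListingStore
and the remoteCalls counter are hypothetical names invented for this
illustration, and none of them are Hadoop classes. The interface supplies a
default level-by-level recursive listing, and the object-store variant
overrides it with a single prefix scan over its flat key space.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.TreeSet;

/** Toy model only: none of these types are Hadoop classes. */
public class RecursiveListingSketch {

    /** Hypothetical stand-in for a file system; directory paths end with "/". */
    public interface ToyFs {
        /** Returns the immediate children of dir (sub-directories end with "/"). */
        List<String> listStatus(String dir);

        /** Default recursive listing: level by level, one listStatus call per directory. */
        default List<String> listFiles(String dir, boolean recursive) {
            List<String> files = new ArrayList<>();
            Deque<String> pending = new ArrayDeque<>();
            pending.add(dir);
            while (!pending.isEmpty()) {
                for (String child : listStatus(pending.remove())) {
                    if (!child.endsWith("/")) {
                        files.add(child);
                    } else if (recursive) {
                        pending.add(child); // visit this sub-directory later
                    }
                }
            }
            return files;
        }
    }

    /** Toy object store: a flat, sorted key space; each listing counts as one remote call. */
    public static class ToyObjectStore implements ToyFs {
        public final TreeSet<String> keys = new TreeSet<>();
        public int remoteCalls = 0;

        @Override
        public List<String> listStatus(String dir) {
            remoteCalls++;
            TreeSet<String> children = new TreeSet<>();
            for (String key : keys.tailSet(dir)) {
                if (!key.startsWith(dir)) break; // left the prefix range
                String rest = key.substring(dir.length());
                int slash = rest.indexOf('/');
                // A key deeper than one level is reported as its first sub-directory.
                children.add(slash < 0 ? key : dir + rest.substring(0, slash + 1));
            }
            return new ArrayList<>(children);
        }
    }

    /** Same store, but the recursive listing is overridden with a single prefix scan. */
    public static class FlatListingStore extends ToyObjectStore {
        @Override
        public List<String> listFiles(String dir, boolean recursive) {
            if (!recursive) return super.listFiles(dir, false);
            remoteCalls++; // one simulated remote call for the whole subtree
            List<String> files = new ArrayList<>();
            for (String key : keys.tailSet(dir)) {
                if (!key.startsWith(dir)) break;
                files.add(key);
            }
            return files;
        }
    }

    /** Fills a store with a small sample tree under /foo/bar/. */
    public static ToyObjectStore populate(ToyObjectStore store) {
        store.keys.add("/foo/bar/4");
        store.keys.add("/foo/bar/a/1");
        store.keys.add("/foo/bar/a/2");
        store.keys.add("/foo/bar/b/c/3");
        return store;
    }

    public static void main(String[] args) {
        ToyObjectStore levelByLevel = populate(new ToyObjectStore());
        ToyObjectStore flat = populate(new FlatListingStore());
        List<String> slow = levelByLevel.listFiles("/foo/bar/", true);
        List<String> fast = flat.listFiles("/foo/bar/", true);
        System.out.println(slow + " calls=" + levelByLevel.remoteCalls);
        System.out.println(fast + " calls=" + flat.remoteCalls);
    }
}
```

In this toy tree the default walk issues one simulated call per directory
(four in total), while the overriding store returns the same files with a
single call. A real object store would also have to deal with paging and
directory markers, which the sketch leaves out.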



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2017-08-07 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated MAPREDUCE-5907:
--
Status: Open  (was: Patch Available)

> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: client
> Affects Versions: 2.4.0
> Reporter: Sumit Kumar
> Assignee: Sumit Kumar
> Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>






[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2017-08-07 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated MAPREDUCE-5907:
--
Status: Patch Available  (was: Open)

> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> 
>
> Key: MAPREDUCE-5907
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: client
> Affects Versions: 2.4.0
> Reporter: Sumit Kumar
> Assignee: Sumit Kumar
> Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>






[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2015-05-05 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated MAPREDUCE-5907:

Labels: BB2015-05-TBR  (was: )

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
  Labels: BB2015-05-TBR
 Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
 MAPREDUCE-5907.patch







[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2014-06-02 Thread Sumit Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Kumar updated MAPREDUCE-5907:
---

Attachment: MAPREDUCE-5907-2.patch

Incorporated feedback on using the iterator APIs.

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907.patch







[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2014-06-02 Thread Sumit Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Kumar updated MAPREDUCE-5907:
---

Release Note:   (was: Please review the attached patch.)
  Status: Patch Available  (was: Open)

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907.patch







[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2014-06-02 Thread Sumit Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Kumar updated MAPREDUCE-5907:
---

Attachment: MAPREDUCE-5907-3.patch

Fixed the first FindBugs warning, which was a genuine issue. I'm not sure 
whether the second one will pass now; let's see.

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
 MAPREDUCE-5907.patch







[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2014-05-29 Thread Sumit Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Kumar updated MAPREDUCE-5907:
---

Description: 
FileInputFormat (both the mapreduce and mapred implementations) lists 
directories recursively while calculating splits, but it does so level by 
level. To discover the files under /foo/bar, it first lists /foo/bar to get 
the immediate children, then makes the same call on each of those children to 
discover their immediate children, and so on. This doesn't scale well for 
object-store-based fs implementations such as s3 and swift, because every 
listStatus call ends up being a web service call to the backend. When a large 
number of files is considered for input, this makes the getSplits() call slow. 

This patch adds a new set of recursive list APIs that give fs implementations 
the opportunity to optimize. The behavior remains the same for other 
implementations (a default implementation is provided, so they don't have to 
implement anything new). Object-store-based fs implementations, however, only 
need a simple change that passes the recursive flag as true (as shown in the 
patch) to improve listing performance.

  was:This patch utilizes benefits of recursive listing proposed in 
https://issues.apache.org/jira/browse/HADOOP-10634


 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: MAPREDUCE-5907.patch







[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2014-05-29 Thread Sumit Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Kumar updated MAPREDUCE-5907:
---

Status: Open  (was: Patch Available)

I wasn't aware of the Swift object store; thanks for pointing it out.

For future reference, the Swift object store implementation is HADOOP-8545, 
and I could find the code in 
https://github.com/apache/hadoop-common/blob/trunk/hadoop-tools/hadoop-openstack

I will post test results for both s3 and swift object store performance along 
with a merged patch (including the changes for HADOOP-10634).

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: MAPREDUCE-5907.patch







[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2014-05-28 Thread Sumit Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Kumar updated MAPREDUCE-5907:
---

Attachment: MAPREDUCE-5907.patch

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.4.0
Reporter: Sumit Kumar
 Attachments: MAPREDUCE-5907.patch


 This patch utilizes benefits of recursive listing proposed in 
 https://issues.apache.org/jira/browse/HADOOP-10634





[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2014-05-28 Thread Sumit Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Kumar updated MAPREDUCE-5907:
---

Release Note: Please review the attached patch.
  Status: Patch Available  (was: Open)

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.4.0
Reporter: Sumit Kumar
 Attachments: MAPREDUCE-5907.patch


 This patch utilizes benefits of recursive listing proposed in 
 https://issues.apache.org/jira/browse/HADOOP-10634





[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

2014-05-28 Thread Sumit Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Kumar updated MAPREDUCE-5907:
---

Component/s: client

 Improve getSplits() performance for fs implementations that can utilize 
 performance gains from recursive listing
 

 Key: MAPREDUCE-5907
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: MAPREDUCE-5907.patch


 This patch utilizes benefits of recursive listing proposed in 
 https://issues.apache.org/jira/browse/HADOOP-10634


