[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated MAPREDUCE-5907:
    Labels:   (was: BB2015-05-TBR)

> Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
>
>                 Key: MAPREDUCE-5907
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 2.4.0
>            Reporter: Sumit Kumar
>            Assignee: Sumit Kumar
>            Priority: Major
>         Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, MAPREDUCE-5907.patch
>
> FileInputFormat (both the mapreduce and mapred implementations) uses recursive
> listing when calculating splits. However, it does the listing level by level:
> to discover the files under /foo/bar, it first lists /foo/bar to get the
> immediate children, then makes the same call on each of those children to
> discover their immediate children, and so on. This doesn't scale well for
> object-store-based fs implementations such as s3 and swift, because every
> listStatus call ends up being a web service call to the backend. When a large
> number of files is considered for input, this makes the getSplits() call slow.
>
> This patch adds a new set of recursive list APIs that give fs implementations
> the opportunity to optimize. The behavior remains the same for other
> implementations (a default implementation is provided, so they don't have to
> implement anything new). For object-store-based fs implementations, however,
> it offers a simple change: pass the recursive flag as true (as shown in the
> patch) to improve listing performance.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
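The cost difference the description refers to can be illustrated with a small, self-contained sketch. It does not use Hadoop or the attached patch: `ToyObjectStore`, `ListingDemo`, and their methods are hypothetical names, and the assumption is simply that every list call costs one round trip to the backend. A real object store also pages its results, which this toy ignores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy stand-in for an object store: a flat, sorted key set. Hypothetical, not a Hadoop class. */
class ToyObjectStore {
    private final TreeSet<String> keys = new TreeSet<>();
    int requests = 0; // each list call models one web service round trip

    void put(String key) {
        keys.add(key);
    }

    /** One request: immediate children of dir; names ending in "/" are directories. */
    List<String> listChildren(String dir) {
        requests++;
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        TreeSet<String> children = new TreeSet<>();
        for (String key : keys.tailSet(prefix, false)) {
            if (!key.startsWith(prefix)) {
                break; // keys are sorted, so nothing further can match
            }
            int slash = key.indexOf('/', prefix.length());
            // a deeper key contributes only its first path component, as a directory
            children.add(slash < 0 ? key : key.substring(0, slash + 1));
        }
        return new ArrayList<>(children);
    }

    /** One request: every key under dir, at any depth. */
    List<String> listRecursive(String dir) {
        requests++;
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        List<String> files = new ArrayList<>();
        for (String key : keys.tailSet(prefix, false)) {
            if (!key.startsWith(prefix)) {
                break;
            }
            files.add(key);
        }
        return files;
    }
}

class ListingDemo {
    /** Level-by-level discovery: one listChildren request per directory visited. */
    static List<String> walkLevelByLevel(ToyObjectStore store, String dir) {
        List<String> files = new ArrayList<>();
        for (String child : store.listChildren(dir)) {
            if (child.endsWith("/")) {
                files.addAll(walkLevelByLevel(store, child));
            } else {
                files.add(child);
            }
        }
        return files;
    }

    static ToyObjectStore sampleStore() {
        ToyObjectStore store = new ToyObjectStore();
        store.put("/foo/bar/a.txt");
        store.put("/foo/bar/2014/01/part-0");
        store.put("/foo/bar/2014/01/part-1");
        store.put("/foo/bar/2014/02/part-0");
        store.put("/foo/bar/2015/01/part-0");
        return store;
    }

    public static void main(String[] args) {
        ToyObjectStore store = sampleStore();
        List<String> walked = walkLevelByLevel(store, "/foo/bar");
        System.out.println(walked.size() + " files, " + store.requests + " requests (level by level)");

        store.requests = 0;
        List<String> listed = store.listRecursive("/foo/bar");
        System.out.println(listed.size() + " files, " + store.requests + " requests (recursive)");
    }
}
```

With this sample layout the level-by-level walk issues six requests (one per directory visited) to find five files, while the recursive listing finds the same five files in one request. The gap grows with the number of directories, not the number of files.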
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated MAPREDUCE-5907:
    Status: Open  (was: Patch Available)
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated MAPREDUCE-5907:
    Status: Patch Available  (was: Open)
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer updated MAPREDUCE-5907:
    Labels: BB2015-05-TBR  (was: )
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sumit Kumar updated MAPREDUCE-5907:
    Attachment: MAPREDUCE-5907-2.patch

Incorporated feedback on using the iterator apis
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sumit Kumar updated MAPREDUCE-5907:
    Release Note:   (was: Please review the attached patch.)
          Status: Patch Available  (was: Open)
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sumit Kumar updated MAPREDUCE-5907:
    Attachment: MAPREDUCE-5907-3.patch

Fixed the first FindBugs issue, which was a genuine problem. I'm not sure whether the second one will still pass now. Let's see.
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sumit Kumar updated MAPREDUCE-5907:
    Description:
FileInputFormat (both the mapreduce and mapred implementations) uses recursive listing when calculating splits. However, it does the listing level by level: to discover the files under /foo/bar, it first lists /foo/bar to get the immediate children, then makes the same call on each of those children to discover their immediate children, and so on. This doesn't scale well for object-store-based fs implementations such as s3 and swift, because every listStatus call ends up being a web service call to the backend. When a large number of files is considered for input, this makes the getSplits() call slow.

This patch adds a new set of recursive list APIs that give fs implementations the opportunity to optimize. The behavior remains the same for other implementations (a default implementation is provided, so they don't have to implement anything new). For object-store-based fs implementations, however, it offers a simple change: pass the recursive flag as true (as shown in the patch) to improve listing performance.

  was: This patch utilizes benefits of recursive listing proposed in https://issues.apache.org/jira/browse/HADOOP-10634
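The description's point that other fs implementations "don't have to implement anything new" can be sketched as an interface with a default method. This is only an illustration of the design idea, not the API from the attached patches: `Lister`, `TreeLister`, and `FlatLister` are hypothetical names, and directory arguments are assumed to end with "/".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical listing interface; the names are illustrative, not the patch's API. */
interface Lister {
    /** Immediate children of dir (dir ends with "/"); names ending in "/" are directories. */
    List<String> listChildren(String dir);

    /** Default: descend level by level, so existing implementations need no changes. */
    default List<String> listFiles(String dir, boolean recursive) {
        List<String> files = new ArrayList<>();
        for (String child : listChildren(dir)) {
            if (!child.endsWith("/")) {
                files.add(child);
            } else if (recursive) {
                files.addAll(listFiles(child, true));
            }
        }
        return files;
    }
}

/** Hierarchical fs: implements only listChildren and inherits the default listFiles. */
class TreeLister implements Lister {
    private final Map<String, List<String>> dirs;

    TreeLister(Map<String, List<String>> dirs) {
        this.dirs = dirs;
    }

    @Override
    public List<String> listChildren(String dir) {
        return dirs.getOrDefault(dir, Collections.emptyList());
    }
}

/** Flat-key store: overrides listFiles so a recursive listing is a single prefix scan. */
class FlatLister implements Lister {
    private final TreeSet<String> keys;
    int prefixScans = 0; // counts the optimized, one-shot listings

    FlatLister(List<String> keys) {
        this.keys = new TreeSet<>(keys);
    }

    @Override
    public List<String> listChildren(String dir) {
        TreeSet<String> children = new TreeSet<>();
        for (String key : keys.tailSet(dir, false)) {
            if (!key.startsWith(dir)) {
                break; // keys are sorted, so nothing further can match
            }
            int slash = key.indexOf('/', dir.length());
            children.add(slash < 0 ? key : key.substring(0, slash + 1));
        }
        return new ArrayList<>(children);
    }

    @Override
    public List<String> listFiles(String dir, boolean recursive) {
        if (!recursive) {
            return Lister.super.listFiles(dir, false);
        }
        prefixScans++;
        List<String> files = new ArrayList<>();
        for (String key : keys.tailSet(dir, false)) {
            if (!key.startsWith(dir)) {
                break;
            }
            files.add(key);
        }
        return files;
    }
}
```

Callers use `listFiles(dir, true)` in both cases and get the same files; only the implementation that can answer a prefix query in one shot overrides the method, which matches the opt-in behavior the description outlines.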
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sumit Kumar updated MAPREDUCE-5907:
    Status: Open  (was: Patch Available)

I wasn't aware of the Swift object store; thanks for pointing it out. For future reference, the Swift object store implementation is HADOOP-8545, and I found the code under
https://github.com/apache/hadoop-common/blob/trunk/hadoop-tools/hadoop-openstack

I will post test results for both s3 and swift object store performance along with the merged patch (changes for HADOOP-10634 included).
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sumit Kumar updated MAPREDUCE-5907:
    Attachment: MAPREDUCE-5907.patch
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sumit Kumar updated MAPREDUCE-5907:
    Release Note: Please review the attached patch.
          Status: Patch Available  (was: Open)
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sumit Kumar updated MAPREDUCE-5907:
    Component/s: client