[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated MAPREDUCE-5907: -- Labels: (was: BB2015-05-TBR) > Improve getSplits() performance for fs implementations that can utilize > performance gains from recursive listing > > > Key: MAPREDUCE-5907 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: client >Affects Versions: 2.4.0 >Reporter: Sumit Kumar >Assignee: Sumit Kumar >Priority: Major > Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, > MAPREDUCE-5907.patch > > > FileInputFormat (both mapreduce and mapred implementations) use recursive > listing while calculating splits. They however do this by doing listing level > by level. That means to discover files in /foo/bar means they do listing at > /foo/bar first to get the immediate children, then make the same call on all > immediate children for /foo/bar to discover their immediate children and so > on. This doesn't scale well for object store based fs implementations like s3 > and swift because every listStatus call ends up being a webservice call to > backend. In cases where large number of files are considered for input, this > makes getSplits() call slow. > This patch adds a new set of recursive list apis that gives opportunity to > the fs implementations to optimize. The behavior remains the same for other > implementations (that is a default implementation is provided for other fs so > they don't have to implement anything new). However for objectstore based fs > implementations it provides a simple change to include recursive flag as true > (as shown in the patch) to improve listing performance. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated MAPREDUCE-5907: -- Status: Open (was: Patch Available) > Improve getSplits() performance for fs implementations that can utilize > performance gains from recursive listing > > > Key: MAPREDUCE-5907 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: client >Affects Versions: 2.4.0 >Reporter: Sumit Kumar >Assignee: Sumit Kumar > Labels: BB2015-05-TBR > Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, > MAPREDUCE-5907.patch > > > FileInputFormat (both mapreduce and mapred implementations) use recursive > listing while calculating splits. They however do this by doing listing level > by level. That means to discover files in /foo/bar means they do listing at > /foo/bar first to get the immediate children, then make the same call on all > immediate children for /foo/bar to discover their immediate children and so > on. This doesn't scale well for object store based fs implementations like s3 > and swift because every listStatus call ends up being a webservice call to > backend. In cases where large number of files are considered for input, this > makes getSplits() call slow. > This patch adds a new set of recursive list apis that gives opportunity to > the fs implementations to optimize. The behavior remains the same for other > implementations (that is a default implementation is provided for other fs so > they don't have to implement anything new). However for objectstore based fs > implementations it provides a simple change to include recursive flag as true > (as shown in the patch) to improve listing performance. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated MAPREDUCE-5907: -- Status: Patch Available (was: Open) > Improve getSplits() performance for fs implementations that can utilize > performance gains from recursive listing > > > Key: MAPREDUCE-5907 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: client >Affects Versions: 2.4.0 >Reporter: Sumit Kumar >Assignee: Sumit Kumar > Labels: BB2015-05-TBR > Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, > MAPREDUCE-5907.patch > > > FileInputFormat (both mapreduce and mapred implementations) use recursive > listing while calculating splits. They however do this by doing listing level > by level. That means to discover files in /foo/bar means they do listing at > /foo/bar first to get the immediate children, then make the same call on all > immediate children for /foo/bar to discover their immediate children and so > on. This doesn't scale well for object store based fs implementations like s3 > and swift because every listStatus call ends up being a webservice call to > backend. In cases where large number of files are considered for input, this > makes getSplits() call slow. > This patch adds a new set of recursive list apis that gives opportunity to > the fs implementations to optimize. The behavior remains the same for other > implementations (that is a default implementation is provided for other fs so > they don't have to implement anything new). However for objectstore based fs > implementations it provides a simple change to include recursive flag as true > (as shown in the patch) to improve listing performance. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated MAPREDUCE-5907: Labels: BB2015-05-TBR (was: ) > Improve getSplits() performance for fs implementations that can utilize > performance gains from recursive listing > > > Key: MAPREDUCE-5907 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: client >Affects Versions: 2.4.0 >Reporter: Sumit Kumar >Assignee: Sumit Kumar > Labels: BB2015-05-TBR > Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, > MAPREDUCE-5907.patch > > > FileInputFormat (both mapreduce and mapred implementations) use recursive > listing while calculating splits. They however do this by doing listing level > by level. That means to discover files in /foo/bar means they do listing at > /foo/bar first to get the immediate children, then make the same call on all > immediate children for /foo/bar to discover their immediate children and so > on. This doesn't scale well for object store based fs implementations like s3 > and swift because every listStatus call ends up being a webservice call to > backend. In cases where large number of files are considered for input, this > makes getSplits() call slow. > This patch adds a new set of recursive list apis that gives opportunity to > the fs implementations to optimize. The behavior remains the same for other > implementations (that is a default implementation is provided for other fs so > they don't have to implement anything new). However for objectstore based fs > implementations it provides a simple change to include recursive flag as true > (as shown in the patch) to improve listing performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumit Kumar updated MAPREDUCE-5907: --- Attachment: MAPREDUCE-5907-3.patch Fixed the first findbug issue which was really an issue. I'm not sure if the second one will still pass now. Let's see > Improve getSplits() performance for fs implementations that can utilize > performance gains from recursive listing > > > Key: MAPREDUCE-5907 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: client >Affects Versions: 2.4.0 >Reporter: Sumit Kumar >Assignee: Sumit Kumar > Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, > MAPREDUCE-5907.patch > > > FileInputFormat (both mapreduce and mapred implementations) use recursive > listing while calculating splits. They however do this by doing listing level > by level. That means to discover files in /foo/bar means they do listing at > /foo/bar first to get the immediate children, then make the same call on all > immediate children for /foo/bar to discover their immediate children and so > on. This doesn't scale well for object store based fs implementations like s3 > and swift because every listStatus call ends up being a webservice call to > backend. In cases where large number of files are considered for input, this > makes getSplits() call slow. > This patch adds a new set of recursive list apis that gives opportunity to > the fs implementations to optimize. The behavior remains the same for other > implementations (that is a default implementation is provided for other fs so > they don't have to implement anything new). However for objectstore based fs > implementations it provides a simple change to include recursive flag as true > (as shown in the patch) to improve listing performance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumit Kumar updated MAPREDUCE-5907: --- Release Note: (was: Please review the attached patch.) Status: Patch Available (was: Open) > Improve getSplits() performance for fs implementations that can utilize > performance gains from recursive listing > > > Key: MAPREDUCE-5907 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: client >Affects Versions: 2.4.0 >Reporter: Sumit Kumar >Assignee: Sumit Kumar > Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907.patch > > > FileInputFormat (both mapreduce and mapred implementations) use recursive > listing while calculating splits. They however do this by doing listing level > by level. That means to discover files in /foo/bar means they do listing at > /foo/bar first to get the immediate children, then make the same call on all > immediate children for /foo/bar to discover their immediate children and so > on. This doesn't scale well for object store based fs implementations like s3 > and swift because every listStatus call ends up being a webservice call to > backend. In cases where large number of files are considered for input, this > makes getSplits() call slow. > This patch adds a new set of recursive list apis that gives opportunity to > the fs implementations to optimize. The behavior remains the same for other > implementations (that is a default implementation is provided for other fs so > they don't have to implement anything new). However for objectstore based fs > implementations it provides a simple change to include recursive flag as true > (as shown in the patch) to improve listing performance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumit Kumar updated MAPREDUCE-5907: --- Attachment: MAPREDUCE-5907-2.patch Incorporated feedback on using the iterator apis > Improve getSplits() performance for fs implementations that can utilize > performance gains from recursive listing > > > Key: MAPREDUCE-5907 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: client >Affects Versions: 2.4.0 >Reporter: Sumit Kumar >Assignee: Sumit Kumar > Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907.patch > > > FileInputFormat (both mapreduce and mapred implementations) use recursive > listing while calculating splits. They however do this by doing listing level > by level. That means to discover files in /foo/bar means they do listing at > /foo/bar first to get the immediate children, then make the same call on all > immediate children for /foo/bar to discover their immediate children and so > on. This doesn't scale well for object store based fs implementations like s3 > and swift because every listStatus call ends up being a webservice call to > backend. In cases where large number of files are considered for input, this > makes getSplits() call slow. > This patch adds a new set of recursive list apis that gives opportunity to > the fs implementations to optimize. The behavior remains the same for other > implementations (that is a default implementation is provided for other fs so > they don't have to implement anything new). However for objectstore based fs > implementations it provides a simple change to include recursive flag as true > (as shown in the patch) to improve listing performance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumit Kumar updated MAPREDUCE-5907: --- Status: Open (was: Patch Available) Wasn't aware of Swift object store, thanks for pointing out. Just for future reference, Swift object store implementation is HADOOP-8545 and i could find code pieces in https://github.com/apache/hadoop-common/blob/trunk/hadoop-tools/hadoop-openstack Will be posting test results for both s3 and swift object store performance along with merged patch (changes for HADOOP-10634 included) > Improve getSplits() performance for fs implementations that can utilize > performance gains from recursive listing > > > Key: MAPREDUCE-5907 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: client >Affects Versions: 2.4.0 >Reporter: Sumit Kumar >Assignee: Sumit Kumar > Attachments: MAPREDUCE-5907.patch > > > FileInputFormat (both mapreduce and mapred implementations) use recursive > listing while calculating splits. They however do this by doing listing level > by level. That means to discover files in /foo/bar means they do listing at > /foo/bar first to get the immediate children, then make the same call on all > immediate children for /foo/bar to discover their immediate children and so > on. This doesn't scale well for object store based fs implementations like s3 > and swift because every listStatus call ends up being a webservice call to > backend. In cases where large number of files are considered for input, this > makes getSplits() call slow. > This patch adds a new set of recursive list apis that gives opportunity to > the fs implementations to optimize. The behavior remains the same for other > implementations (that is a default implementation is provided for other fs so > they don't have to implement anything new). However for objectstore based fs > implementations it provides a simple change to include recursive flag as true > (as shown in the patch) to improve listing performance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumit Kumar updated MAPREDUCE-5907: --- Description: FileInputFormat (both mapreduce and mapred implementations) use recursive listing while calculating splits. They however do this by doing listing level by level. That means to discover files in /foo/bar means they do listing at /foo/bar first to get the immediate children, then make the same call on all immediate children for /foo/bar to discover their immediate children and so on. This doesn't scale well for object store based fs implementations like s3 and swift because every listStatus call ends up being a webservice call to backend. In cases where large number of files are considered for input, this makes getSplits() call slow. This patch adds a new set of recursive list apis that gives opportunity to the fs implementations to optimize. The behavior remains the same for other implementations (that is a default implementation is provided for other fs so they don't have to implement anything new). However for objectstore based fs implementations it provides a simple change to include recursive flag as true (as shown in the patch) to improve listing performance. was:This patch utilizes benefits of recursive listing proposed in https://issues.apache.org/jira/browse/HADOOP-10634 > Improve getSplits() performance for fs implementations that can utilize > performance gains from recursive listing > > > Key: MAPREDUCE-5907 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: client >Affects Versions: 2.4.0 >Reporter: Sumit Kumar >Assignee: Sumit Kumar > Attachments: MAPREDUCE-5907.patch > > > FileInputFormat (both mapreduce and mapred implementations) use recursive > listing while calculating splits. They however do this by doing listing level > by level. That means to discover files in /foo/bar means they do listing at > /foo/bar first to get the immediate children, then make the same call on all > immediate children for /foo/bar to discover their immediate children and so > on. This doesn't scale well for object store based fs implementations like s3 > and swift because every listStatus call ends up being a webservice call to > backend. In cases where large number of files are considered for input, this > makes getSplits() call slow. > This patch adds a new set of recursive list apis that gives opportunity to > the fs implementations to optimize. The behavior remains the same for other > implementations (that is a default implementation is provided for other fs so > they don't have to implement anything new). However for objectstore based fs > implementations it provides a simple change to include recursive flag as true > (as shown in the patch) to improve listing performance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumit Kumar updated MAPREDUCE-5907: --- Component/s: client > Improve getSplits() performance for fs implementations that can utilize > performance gains from recursive listing > > > Key: MAPREDUCE-5907 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: client >Affects Versions: 2.4.0 >Reporter: Sumit Kumar >Assignee: Sumit Kumar > Attachments: MAPREDUCE-5907.patch > > > This patch utilizes benefits of recursive listing proposed in > https://issues.apache.org/jira/browse/HADOOP-10634 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumit Kumar updated MAPREDUCE-5907: --- Release Note: Please review the attached patch. Status: Patch Available (was: Open) > Improve getSplits() performance for fs implementations that can utilize > performance gains from recursive listing > > > Key: MAPREDUCE-5907 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Affects Versions: 2.4.0 >Reporter: Sumit Kumar > Attachments: MAPREDUCE-5907.patch > > > This patch utilizes benefits of recursive listing proposed in > https://issues.apache.org/jira/browse/HADOOP-10634 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumit Kumar updated MAPREDUCE-5907: --- Attachment: MAPREDUCE-5907.patch > Improve getSplits() performance for fs implementations that can utilize > performance gains from recursive listing > > > Key: MAPREDUCE-5907 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Affects Versions: 2.4.0 >Reporter: Sumit Kumar > Attachments: MAPREDUCE-5907.patch > > > This patch utilizes benefits of recursive listing proposed in > https://issues.apache.org/jira/browse/HADOOP-10634 -- This message was sent by Atlassian JIRA (v6.2#6252)