[jira] [Commented] (HADOOP-13868) New defaults for S3A multi-part configuration
[ https://issues.apache.org/jira/browse/HADOOP-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889149#comment-16889149 ]

Hudson commented on HADOOP-13868:
---------------------------------

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #16957 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/16957/])
HADOOP-13868. [s3a] New default for S3A multi-part configuration (#1125) (github: rev 7f1b76ca3598acb0a0a843b2364c8963c70edf4d)
* (edit) hadoop-common-project/hadoop-common/src/main/resources/core-default.xml
* (edit) hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java
* (edit) hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md

> New defaults for S3A multi-part configuration
> ---------------------------------------------
>
>                 Key: HADOOP-13868
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13868
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.7.0, 3.0.0-alpha1
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>            Priority: Major
>         Attachments: HADOOP-13868.001.patch, HADOOP-13868.002.patch, optimizing-multipart-s3a.sh
>
> I've been looking at a big performance regression when writing to S3 from Spark that appears to have been introduced with HADOOP-12891.
> In the Amazon SDK, the default threshold for multi-part copies is 320x the threshold for multi-part uploads (and the block size is 20x bigger), so I don't think it's necessarily wise for us to have them be the same.
> I did some quick tests, and it seems to me the sweet spot where multi-part copies start being faster is around 512 MB. It wasn't as significant, but using 104857600 (Amazon's default) for the block size was also slightly better.
> I propose we do the following, although they're independent decisions:
> (1) Split the configuration. Ideally, I'd like to have fs.s3a.multipart.copy.threshold and fs.s3a.multipart.upload.threshold (and corresponding properties for the block size). But then there's the question of what to do with the existing fs.s3a.multipart.* properties. Deprecation? Leave them as short-hand for configuring both, overridden by the more specific properties?
> (2) Consider increasing the default values. In my tests, 256 MB seemed to be where multipart uploads came into their own, and 512 MB was where multipart copies started outperforming the alternative. Would be interested to hear what other people have seen.

--
This message was sent by Atlassian JIRA (v7.6.14#76016)
---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
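The 320x and 20x ratios quoted in the issue description fall out of the AWS SDK TransferManager defaults of that era. The specific byte values below are my assumption, not taken from this thread, but the arithmetic reproduces the figures cited above:

```python
MB = 1024 * 1024
GB = 1024 * MB

# Assumed AWS SDK TransferManagerConfiguration defaults (an assumption,
# not stated in this thread) that yield the ratios quoted in the issue:
sdk_upload_threshold = 16 * MB    # multipart upload kicks in above this
sdk_copy_threshold = 5 * GB       # multipart copy kicks in above this
sdk_upload_part_size = 5 * MB     # upload part ("block") size
sdk_copy_part_size = 100 * MB     # copy part size, i.e. 104857600 bytes

# Copy threshold is 320x the upload threshold:
print(sdk_copy_threshold // sdk_upload_threshold)   # 320
# Copy part size is 20x the upload part size:
print(sdk_copy_part_size // sdk_upload_part_size)   # 20
# "Amazon's default" block size cited in the issue:
print(sdk_copy_part_size)                           # 104857600
```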
[jira] [Commented] (HADOOP-13868) New defaults for S3A multi-part configuration
[ https://issues.apache.org/jira/browse/HADOOP-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888395#comment-16888395 ]

Sean Mackrory commented on HADOOP-13868:
----------------------------------------

Bit-rot after only 2 1/2 years? Imagine that! Actually, the only part that doesn't apply cleanly is the documentation, and that's just because it's looking 100 lines away from where it should. Resubmitted as a pull request to verify a clean Yetus run, but as the patch is virtually identical I'll assume your +1 still applies unless I hear otherwise.
[jira] [Commented] (HADOOP-13868) New defaults for S3A multi-part configuration
[ https://issues.apache.org/jira/browse/HADOOP-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888392#comment-16888392 ]

Hadoop QA commented on HADOOP-13868:
------------------------------------

-1 overall

|| Vote || Subsystem || Runtime || Comment ||
|  0 | reexec | 0m 0s | Docker mode activated. |
| -1 | patch  | 0m 5s | HADOOP-13868 does not apply to trunk. Rebase required? Wrong branch? See https://wiki.apache.org/hadoop/HowToContribute for help. |

|| Subsystem || Report/Notes ||
| JIRA Issue | HADOOP-13868 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12842566/HADOOP-13868.002.patch |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/16388/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HADOOP-13868) New defaults for S3A multi-part configuration
[ https://issues.apache.org/jira/browse/HADOOP-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888382#comment-16888382 ]

Steve Loughran commented on HADOOP-13868:
-----------------------------------------

LGTM +1
[jira] [Commented] (HADOOP-13868) New defaults for S3A multi-part configuration
[ https://issues.apache.org/jira/browse/HADOOP-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866128#comment-15866128 ]

Sean Mackrory commented on HADOOP-13868:
----------------------------------------

Just pinging on this - I'd like to resolve it soon.
[jira] [Commented] (HADOOP-13868) New defaults for S3A multi-part configuration
[ https://issues.apache.org/jira/browse/HADOOP-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15735690#comment-15735690 ]

Hadoop QA commented on HADOOP-13868:
------------------------------------

-1 overall

|| Vote || Subsystem || Runtime || Comment ||
|  0 | reexec | 0m 12s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|  0 | mvndep | 0m 15s | Maven dependency ordering for branch |
| +1 | mvninstall | 6m 47s | trunk passed |
| +1 | compile | 9m 36s | trunk passed |
| +1 | checkstyle | 1m 34s | trunk passed |
| +1 | mvnsite | 1m 35s | trunk passed |
| +1 | mvneclipse | 0m 40s | trunk passed |
| +1 | findbugs | 2m 19s | trunk passed |
| +1 | javadoc | 1m 13s | trunk passed |
|  0 | mvndep | 0m 18s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 14s | the patch passed |
| +1 | compile | 10m 53s | the patch passed |
| +1 | javac | 10m 53s | the patch passed |
| +1 | checkstyle | 1m 40s | the patch passed |
| +1 | mvnsite | 1m 43s | the patch passed |
| +1 | mvneclipse | 0m 41s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | xml | 0m 1s | The patch has no ill-formed XML file. |
| +1 | findbugs | 2m 19s | the patch passed |
| +1 | javadoc | 1m 18s | the patch passed |
| +1 | unit | 8m 26s | hadoop-common in the patch passed. |
| +1 | unit | 0m 34s | hadoop-aws in the patch passed. |
| +1 | asflicense | 0m 37s | The patch does not generate ASF License warnings. |
|    |        | 77m 9s | |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:a9ad5d6 |
| JIRA Issue | HADOOP-13868 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12842566/HADOOP-13868.002.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit xml findbugs checkstyle |
| uname | Linux dfd8afad53d8 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 80b8023 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/11236/testReport/ |
| modules | C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: . |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/11236/console |
| Powered by | Apache Yetus |
[jira] [Commented] (HADOOP-13868) New defaults for S3A multi-part configuration
[ https://issues.apache.org/jira/browse/HADOOP-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15735529#comment-15735529 ]

Sean Mackrory commented on HADOOP-13868:
----------------------------------------

{quote}128MB seems a reasonable increase{quote}

Just to be clear, it's a decrease. I was mistaken about what the previous defaults were in trunk. But the current value is also significantly sub-optimal (at least in all the US regions I tested, despite significantly varying raw performance between them).
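As a sketch only, the split configuration proposed in this issue might look like this in core-site.xml. The fs.s3a.multipart.upload.threshold and fs.s3a.multipart.copy.threshold names are the proposal's hypothetical names, not properties known to have been committed; the values reflect the sweet spots reported in the description:

```xml
<!-- Hypothetical sketch: these *.upload.* / *.copy.* property names come
     from the proposal in this issue and may not exist in any release. -->
<property>
  <name>fs.s3a.multipart.upload.threshold</name>
  <value>268435456</value> <!-- 256 MB: where multipart uploads came into their own -->
</property>
<property>
  <name>fs.s3a.multipart.copy.threshold</name>
  <value>536870912</value> <!-- 512 MB: where multipart copies started winning -->
</property>
```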
[jira] [Commented] (HADOOP-13868) New defaults for S3A multi-part configuration
[ https://issues.apache.org/jira/browse/HADOOP-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15735019#comment-15735019 ]

Steve Loughran commented on HADOOP-13868:
-----------------------------------------

128MB seems a reasonable increase. But could the patched values be of the form 128M, rather than the multiplied-out number? That way it's easier for people reading it to see what the actual number means.
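Hadoop's Configuration class can parse human-readable size suffixes (via getLongBytes()), so the form suggested above is self-documenting. A sketch in core-site.xml using the existing property names, assuming these keys are read with suffix-aware parsing:

```xml
<!-- Human-readable form suggested above; assumes the S3A client reads these
     keys with Configuration.getLongBytes(), which accepts K/M/G suffixes. -->
<property>
  <name>fs.s3a.multipart.size</name>
  <value>128M</value>
</property>
<property>
  <name>fs.s3a.multipart.threshold</name>
  <value>128M</value>
</property>
```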