[jira] [Commented] (HADOOP-14688) Intern strings in KeyVersion and EncryptedKeyVersion
[ https://issues.apache.org/jira/browse/HADOOP-14688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450978#comment-16450978 ]

Hudson commented on HADOOP-14688:

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14057 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/14057/])
HADOOP-14688. Intern strings in KeyVersion and EncryptedKeyVersion. (xyao: rev 89ec91cb004ff36b3b7f327167cb3b45b8baadd2)
* (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/crypto/key/KeyProviderCryptoExtension.java
* (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/crypto/key/KeyProvider.java

> Intern strings in KeyVersion and EncryptedKeyVersion
>
> Key: HADOOP-14688
> URL: https://issues.apache.org/jira/browse/HADOOP-14688
> Project: Hadoop Common
> Issue Type: Improvement
> Components: kms
> Reporter: Xiao Chen
> Assignee: Xiao Chen
> Priority: Major
> Fix For: 2.9.0, 3.0.0-beta1
> Attachments: GC root of the String.png, HADOOP-14688.01.patch, heapdump analysis.png, jxray.report
>
> This is inspired by [~mi...@cloudera.com]'s work on HDFS-11383.
> The key names and key version names are usually the same for a bunch of {{KeyVersion}} and {{EncryptedKeyVersion}}. We should not create duplicate objects for them.
> This is more important to HDFS-10899, where we try to re-encrypt all files' EDEKs in a given EZ. Those EDEKs all have the same key name, and mostly use no more than a couple of key version names.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
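The issue description quoted above proposes that equal key names and key version names should share one String object. A minimal sketch of that idea follows, assuming the interning happens in the constructor. `InternSketch` and `SketchKeyVersion` are hypothetical names used for illustration only; this is not the code from HADOOP-14688.01.patch.

```java
public class InternSketch {
    /** Hypothetical stand-in for a key-version holder; not Hadoop's actual class. */
    static final class SketchKeyVersion {
        private final String name;
        private final String versionName;

        SketchKeyVersion(String name, String versionName) {
            // String.intern() returns the canonical instance from the JVM string
            // pool, so equal values end up referencing a single String object.
            this.name = name == null ? null : name.intern();
            this.versionName = versionName == null ? null : versionName.intern();
        }

        String getName() { return name; }
        String getVersionName() { return versionName; }
    }

    public static void main(String[] args) {
        // new String(...) forces distinct objects, much as decoding a request would.
        SketchKeyVersion a = new SketchKeyVersion(new String("key1"), new String("key1@0"));
        SketchKeyVersion b = new SketchKeyVersion(new String("key1"), new String("key1@0"));
        System.out.println(a.getName() == b.getName());               // prints true
        System.out.println(a.getVersionName() == b.getVersionName()); // prints true
    }
}
```

Without the `intern()` calls both comparisons would print false, because each instance would keep its own copy of the same characters.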
[jira] [Commented] (HADOOP-14688) Intern strings in KeyVersion and EncryptedKeyVersion
[ https://issues.apache.org/jira/browse/HADOOP-14688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16157692#comment-16157692 ]

Hudson commented on HADOOP-14688:

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #12811 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/12811/])
HADOOP-14688. Intern strings in KeyVersion and EncryptedKeyVersion. (weichiu: rev ad32759fd9f33e7bd18758ad1a5a464dab3bcbd9)
* (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/crypto/key/KeyProviderCryptoExtension.java
* (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/crypto/key/KeyProvider.java
[jira] [Commented] (HADOOP-14688) Intern strings in KeyVersion and EncryptedKeyVersion
[ https://issues.apache.org/jira/browse/HADOOP-14688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156000#comment-16156000 ]

Xiao Chen commented on HADOOP-14688:

Thanks Wei-Chiu for the review and commit. Also thanks Daryn and Misha for commenting.
[jira] [Commented] (HADOOP-14688) Intern strings in KeyVersion and EncryptedKeyVersion
[ https://issues.apache.org/jira/browse/HADOOP-14688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154008#comment-16154008 ]

Wei-Chiu Chuang commented on HADOOP-14688:

+1. Will commit today.
[jira] [Commented] (HADOOP-14688) Intern strings in KeyVersion and EncryptedKeyVersion
[ https://issues.apache.org/jira/browse/HADOOP-14688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16150085#comment-16150085 ]

Wei-Chiu Chuang commented on HADOOP-14688:

Thanks [~mi...@cloudera.com] for your insightful comment, and thanks [~xiaochen] for the performance test. I think the performance optimization makes sense, especially in the context of key rotation.

Hi [~daryn], what do you think? I would like to cast my +1 if there are no objections.
[jira] [Commented] (HADOOP-14688) Intern strings in KeyVersion and EncryptedKeyVersion
[ https://issues.apache.org/jira/browse/HADOOP-14688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139371#comment-16139371 ]

Misha Dmitriev commented on HADOOP-14688:

[~daryn]: when a live heap dump is captured, as done here, a full GC is performed before the heap snapshot is taken. So if the given application produces objects that are very short-lived, i.e. quickly become garbage, then we will only see those that are live at that moment, which is typically not many. Conversely, most objects in a live heap dump tend to be relatively long-lived.

Furthermore, experience has shown that for reasonably long-lived strings, the CPU overhead of interning is small compared to the benefit of lower memory pressure, shorter GC pauses, etc. That is, the cost of a fast internal String.intern() call is comparable to the cost of the GC scanning and moving around all the extra copies of a string that remain in memory without interning.
[jira] [Commented] (HADOOP-14688) Intern strings in KeyVersion and EncryptedKeyVersion
[ https://issues.apache.org/jira/browse/HADOOP-14688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129144#comment-16129144 ]

Xiao Chen commented on HADOOP-14688:

Bumping up on this...

I understand the concern about the added overhead for a normal operation like getFileEncryptionInfo. But in internal runs I did not see this interning cause any visible impact on NN throughput. On the other hand, the heap is pretty ugly without it during re-encryption.

Attaching a report generated by [jxray|http://www.jxray.com/]. The most relevant section is:

{quote}
7. DUPLICATE STRINGS

Total strings: 2,570,432   Unique strings: 1,033,993   Duplicate values: 3,559   Overhead: 170,572K (8.4%)

Top duplicate strings:

Ovhd              Num char[]s   Num objs   Value
103,775K (5.1%)   830205        830205     "mGscEhbOphwD8GQkGxVfHnV4PVo3lhmpPWurw3vGsLf"
23,042K (1.1%)    184337        184337     "0OFmjElLqXgtjvWKkgfRoLpUj92dHrEaQCPeh3VDh8V"
8,668K (0.4%)     184937        184937     "EEK"
2,853K (0.1%)     12176         12176      "POST /kms/v1/keyversion/mGscEhbOphwD8GQkGxVfHnV4PVo3lhmpPWurw3vGsLf/_eek?eek_op=reencrypt HTTP/1.1"
2,473K (0.1%)     12177         12177      "/kms/v1/keyversion/mGscEhbOphwD8GQkGxVfHnV4PVo3lhmpPWurw3vGsLf/_eek?eek_op=reencrypt"
2,298K (0.1%)     13374         13374      "/kms/v1/keyversion/mGscEhbOphwD8GQkGxVfHnV4PVo3lhmpPWurw3vGsLf/_eek"
{quote}
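The overhead figures for the top three strings in the report above can be reproduced with a simple estimate: every copy beyond the first is counted as waste. The sketch below assumes a 64-bit HotSpot JVM on Java 8 with compressed oops (a 24-byte String object, a 16-byte char[] header, 2 bytes per character, 8-byte alignment) and assumes the report truncates to whole K. These layout numbers and the formula are inferred from the figures, not taken from jxray documentation, and `DuplicateOverhead` is a hypothetical name.

```java
public class DuplicateOverhead {
    /** Rounds n up to the next multiple of 8 (assumed object alignment). */
    static long alignTo8(long n) {
        return (n + 7) / 8 * 8;
    }

    /** Assumed heap footprint of one String of the given length, in bytes. */
    static long bytesPerString(int length) {
        long stringObject = 24;                       // assumed String object size
        long charArray = alignTo8(16 + 2L * length);  // assumed array header + UTF-16 chars
        return stringObject + charArray;
    }

    /** Estimated waste in K: every copy except one is redundant. */
    static long overheadK(long copies, int length) {
        return (copies - 1) * bytesPerString(length) / 1024;
    }

    public static void main(String[] args) {
        // Copy counts are taken from the report; 43 and 3 are the string lengths.
        System.out.println(overheadK(830205, 43)); // prints 103775
        System.out.println(overheadK(184337, 43)); // prints 23042
        System.out.println(overheadK(184937, 3));  // prints 8668
    }
}
```

Under these assumptions each 43-character key version name costs 128 bytes per copy, which is why two values alone account for roughly 124 MB of the heap.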
[jira] [Commented] (HADOOP-14688) Intern strings in KeyVersion and EncryptedKeyVersion
[ https://issues.apache.org/jira/browse/HADOOP-14688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103609#comment-16103609 ]

Xiao Chen commented on HADOOP-14688:

The heap dumps are too big to attach here, so I uploaded a screenshot of the most relevant analysis result from them. The 2 most duplicated strings (mG... and 0O...) are the 2 key version names.

I was running re-encryption on a zone with 1M files. 2 different key versions were in use among those files in this run. I verified that after interning, the duplication goes away.

[~daryn], do you think this makes sense? Thanks!
[jira] [Commented] (HADOOP-14688) Intern strings in KeyVersion and EncryptedKeyVersion
[ https://issues.apache.org/jira/browse/HADOOP-14688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102240#comment-16102240 ]

Xiao Chen commented on HADOOP-14688:

Thanks again [~daryn] for the review series! :)

True, the EDEKs are stored in an xattr. During re-encryption though, the EDEK object is constructed and sent to the KMS, where a new EDEK is returned. What's tricky here is that contacting the KMS has to be done outside of the lock. Therefore, the EDEK object has to exist for that duration. And since we're trying to re-encrypt many EDEKs per batch, there are many EDEK objects in flight. The relevant code is in {{ReencryptionHandler$EDEKReencryptCallable#call}} of HDFS-10899.

To make things worse, since the KMS has proven to be the bottleneck here, we'd like to multi-thread the 'contact KMS' part, which means even more EDEKs in flight (see the multi-threading part of HDFS-10899's [doc|https://issues.apache.org/jira/secure/attachment/12874358/Re-encrypt%20edek%20design%20doc%20V2.pdf])
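The pattern described above (build a batch of EDEK objects under the lock, then contact the KMS from worker threads with the lock released) can be sketched roughly as follows. Everything here is hypothetical and for illustration only: `ReencryptBatchSketch`, `Edek`, `collectBatch`, `reencryptAll`, and the `kms` function are made-up names, and this is not the code in `ReencryptionHandler` from HDFS-10899.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.UnaryOperator;

public class ReencryptBatchSketch {
    /** Hypothetical EDEK holder; the real class carries key material as well. */
    static final class Edek {
        final String keyName;
        final String keyVersionName;

        Edek(String keyName, String keyVersionName) {
            // Interned, so all in-flight EDEKs share the same few String objects.
            this.keyName = keyName.intern();
            this.keyVersionName = keyVersionName.intern();
        }
    }

    private final ReentrantLock lock = new ReentrantLock();

    /** Builds one batch while holding the lock (stand-in for reading xattrs). */
    List<Edek> collectBatch(int size, String keyName, String versionName) {
        lock.lock();
        try {
            List<Edek> batch = new ArrayList<>(size);
            for (int i = 0; i < size; i++) {
                // new String(...) mimics a freshly decoded, non-shared string.
                batch.add(new Edek(new String(keyName), new String(versionName)));
            }
            return batch;
        } finally {
            lock.unlock();
        }
    }

    /** Applies the caller-supplied "KMS" function on worker threads, lock not held. */
    List<Edek> reencryptAll(List<List<Edek>> batches, UnaryOperator<Edek> kms, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<Edek>>> futures = new ArrayList<>();
            for (List<Edek> batch : batches) {
                futures.add(pool.submit(() -> {
                    List<Edek> out = new ArrayList<>(batch.size());
                    for (Edek e : batch) {
                        out.add(kms.apply(e));
                    }
                    return out;
                }));
            }
            List<Edek> results = new ArrayList<>();
            for (Future<List<Edek>> f : futures) {
                results.addAll(f.get()); // every batch stays alive until its call returns
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        ReencryptBatchSketch sketch = new ReencryptBatchSketch();
        List<List<Edek>> batches = new ArrayList<>();
        batches.add(sketch.collectBatch(1000, "key1", "key1@0"));
        batches.add(sketch.collectBatch(1000, "key1", "key1@0"));
        // The fake KMS just returns an EDEK under a newer key version.
        List<Edek> out = sketch.reencryptAll(batches, e -> new Edek(e.keyName, "key1@1"), 2);
        System.out.println(out.size());
    }
}
```

The more batches and threads are in flight, the more `Edek` objects are live at once, which is where sharing the name strings pays off.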
[jira] [Commented] (HADOOP-14688) Intern strings in KeyVersion and EncryptedKeyVersion
[ https://issues.apache.org/jira/browse/HADOOP-14688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102198#comment-16102198 ]

Hadoop QA commented on HADOOP-14688:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 13s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 13m 45s | trunk passed |
| +1 | compile | 14m 18s | trunk passed |
| +1 | checkstyle | 0m 37s | trunk passed |
| +1 | mvnsite | 1m 30s | trunk passed |
| +1 | findbugs | 1m 33s | trunk passed |
| +1 | javadoc | 0m 51s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 42s | the patch passed |
| +1 | compile | 11m 8s | the patch passed |
| +1 | javac | 11m 8s | the patch passed |
| +1 | checkstyle | 0m 38s | the patch passed |
| +1 | mvnsite | 1m 30s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | findbugs | 1m 38s | the patch passed |
| +1 | javadoc | 0m 52s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 8m 29s | hadoop-common in the patch failed. |
| +1 | asflicense | 0m 35s | The patch does not generate ASF License warnings. |
| | | 60m 10s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.security.TestKDiag |
| | hadoop.net.TestDNS |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HADOOP-14688 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12879026/HADOOP-14688.01.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux 0665569c2ac5 3.13.0-117-generic #164-Ubuntu SMP Fri Apr 7 11:05:26 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 27a1a5f |
| Default Java | 1.8.0_131 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-HADOOP-Build/12863/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/12863/testReport/ |
| modules | C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/12863/console |
| Powered by | Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HADOOP-14688) Intern strings in KeyVersion and EncryptedKeyVersion
[ https://issues.apache.org/jira/browse/HADOOP-14688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102183#comment-16102183 ]

Daryn Sharp commented on HADOOP-14688:

Where does the interning play a meaningful part in the process of the related jira? I haven't dug into the code, but isn't this information typically encoded in an xattr, and doesn't it exist only transiently during a namesystem operation? If yes, the overhead of interning is likely not worth it. I.e., you already have a unique string, so why bother with what is effectively a hash lookup to substitute it with another string if the unique instance is going to be GC'ed soon anyway?