[jira] [Commented] (HIVE-19937) Intern fields in MapWork on deserialization
[ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568535#comment-16568535 ] Vihang Karajgaonkar commented on HIVE-19937: Hi [~stakiar], sorry for the delay in responding. Can you please add a comment mentioning that the set method interns the duplicate strings, just before the {{mapWork.setPathToPartitionInfo}} and {{partitionDesc.setBaseFileName}} calls in the custom deserializer's read method, so that it's more obvious why we are calling the set method? I feel it is easier to understand that way. The rest looks good. Thanks for the patch. +1
> Intern fields in MapWork on deserialization
> ---
>
> Key: HIVE-19937
> URL: https://issues.apache.org/jira/browse/HIVE-19937
> Project: Hive
> Issue Type: Improvement
> Components: Spark
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Major
> Attachments: HIVE-19937.1.patch, HIVE-19937.2.patch, HIVE-19937.3.patch, HIVE-19937.4.patch, HIVE-19937.5.patch, post-patch-report.html, report.html
>
> When fixing HIVE-16395, we decided that each new Spark task should clone the
> {{JobConf}} object to prevent any {{ConcurrentModificationException}} from
> being thrown. However, setting this variable comes at a cost of storing a
> duplicate {{JobConf}} object for each Spark task. These objects can take up a
> significant amount of memory, we should intern them so that Spark tasks
> running in the same JVM don't store duplicate copies.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
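For readers following the review comment above, here is a minimal, self-contained sketch of the pattern under discussion: a setter that interns its argument, called from a deserializer-style read step with a comment explaining why. `PartitionInfoSketch` and `readFrom` are hypothetical stand-ins, not Hive's `PartitionDesc` or the serializer in the patch; only `String.intern()` is standard Java API.

```java
// Hypothetical stand-in for a plan object with a String field; not Hive's PartitionDesc.
class PartitionInfoSketch {
    private String baseFileName;

    // The setter interns the value, so equal strings share one instance in the JVM.
    void setBaseFileName(String baseFileName) {
        this.baseFileName = baseFileName == null ? null : baseFileName.intern();
    }

    String getBaseFileName() {
        return baseFileName;
    }

    // Hypothetical "read" step: simulates a deserializer that produces a fresh
    // String instance for every object it reads.
    static PartitionInfoSketch readFrom(char[] rawChars) {
        PartitionInfoSketch info = new PartitionInfoSketch();
        String freshCopy = new String(rawChars); // distinct instance on each call
        // Call the setter (rather than assigning the field directly) because the
        // setter interns the string and so removes the duplicate copy.
        info.setBaseFileName(freshCopy);
        return info;
    }

    public static void main(String[] args) {
        char[] raw = "part-00000".toCharArray();
        PartitionInfoSketch a = readFrom(raw);
        PartitionInfoSketch b = readFrom(raw);
        // Reference comparison: checks whether both objects hold the same String instance.
        System.out.println(a.getBaseFileName() == b.getBaseFileName());
    }
}
```

The comment inside `readFrom` is the kind of note requested in the review: without it, a reader may wonder why the read path goes through the setter at all.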
[jira] [Commented] (HIVE-19937) Intern fields in MapWork on deserialization
[ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567402#comment-16567402 ] Sahil Takiar commented on HIVE-19937: - Ping [~vihangk1]
[jira] [Commented] (HIVE-19937) Intern fields in MapWork on deserialization
[ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554477#comment-16554477 ] Sahil Takiar commented on HIVE-19937: - [~vihangk1] could you take a look?
[jira] [Commented] (HIVE-19937) Intern fields in MapWork on deserialization
[ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553538#comment-16553538 ] Sahil Takiar commented on HIVE-19937: - Thanks Misha, yes I will keep this in mind. I don't think this part of the code is CPU bound, so it shouldn't be an issue.
[jira] [Commented] (HIVE-19937) Intern fields in MapWork on deserialization
[ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553279#comment-16553279 ] Misha Dmitriev commented on HIVE-19937: --- The last patch looks good to me. The only slight concern that I got from looking at it once again is the following: in one or two places you switched from passing around Strings to passing around Paths, and subsequently switched some HashMap(s) from String keys to Path keys. Note that a lookup in a map where keys are complex objects is slower, because Path.equals(Path) is slower than String.equals(String) - it may involve comparison of many strings, etc. I haven't seen any reports on Hive CPU performance problems, and I hope this code is not on a critical path, and/or that GC-related savings would offset the potential hashmap lookup slowdown... but anyway, I guess it's worth remembering about this.
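To make the lookup-cost concern above concrete, here is a small sketch using a hypothetical multi-component key. `CompositeKeySketch` is an assumption made for illustration and is not Hadoop's `Path`; it only shows that `equals()` on a structured key can involve several string comparisons, where a `String` key needs one. It is not a benchmark.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical multi-component key standing in for a structured, path-like object.
// It is NOT Hadoop's Path; it only illustrates why equals() on a complex key can
// cost more than a single String comparison.
final class CompositeKeySketch {
    static int componentComparisons = 0; // counts string comparisons made by equals()

    private final String scheme;
    private final String authority;
    private final String path;

    CompositeKeySketch(String scheme, String authority, String path) {
        this.scheme = scheme;
        this.authority = authority;
        this.path = path;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CompositeKeySketch)) return false;
        CompositeKeySketch other = (CompositeKeySketch) o;
        // Each component needs its own string comparison.
        return same(scheme, other.scheme)
            && same(authority, other.authority)
            && same(path, other.path);
    }

    private static boolean same(String a, String b) {
        componentComparisons++;
        return Objects.equals(a, b);
    }

    @Override
    public int hashCode() {
        return Objects.hash(scheme, authority, path);
    }

    public static void main(String[] args) {
        Map<CompositeKeySketch, String> byKey = new HashMap<>();
        byKey.put(new CompositeKeySketch("hdfs", "nn:8020", "/warehouse/t/p=1"), "partition-1");

        // Look up with an equal but distinct key instance, as happens after deserialization.
        String found = byKey.get(new CompositeKeySketch("hdfs", "nn:8020", "/warehouse/t/p=1"));
        System.out.println(found + ", component comparisons: " + componentComparisons);
    }
}
```

Whether this extra work matters depends on how hot the lookup is, which is the trade-off against GC savings raised in the comment.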
[jira] [Commented] (HIVE-19937) Intern fields in MapWork on deserialization
[ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552113#comment-16552113 ] Hive QA commented on HIVE-19937:
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12932601/HIVE-19937.5.patch
{color:red}ERROR:{color} -1 due to no test(s) being added or modified.
{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 14675 tests executed
*Failed tests:*
{noformat}
TestMiniDruidCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=192)
	[druidmini_dynamic_partition.q,druidmini_expressions.q,druidmini_test_alter.q,druidmini_test1.q,druidmini_test_insert.q]
org.apache.hadoop.hive.cli.TestMiniDruidCliDriver.testCliDriver[druid_timestamptz] (batchId=193)
org.apache.hadoop.hive.cli.TestMiniDruidCliDriver.testCliDriver[druidmini_joins] (batchId=193)
org.apache.hadoop.hive.cli.TestMiniDruidCliDriver.testCliDriver[druidmini_masking] (batchId=193)
{noformat}
Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/12780/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12780/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-12780/
Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}
This message is automatically generated.
ATTACHMENT ID: 12932601 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-19937) Intern fields in MapWork on deserialization
[ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552067#comment-16552067 ] Hive QA commented on HIVE-19937:
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 3s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 8s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 41s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 4m 2s{color} | {color:blue} ql in master has 2280 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 59s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 5s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 41s{color} | {color:red} ql: The patch generated 5 new + 298 unchanged - 4 fixed = 303 total (was 302) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 4m 18s{color} | {color:red} ql generated 3 new + 2279 unchanged - 1 fixed = 2282 total (was 2280) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 56s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 12s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 24m 1s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:ql |
| | Redundant nullcheck of nominal which is known to be null in org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(Path) Redundant null check at AbstractMapOperator.java:is known to be null in org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(Path) Redundant null check at AbstractMapOperator.java:[line 113] |
| | org.apache.hadoop.hive.ql.exec.SerializationUtilities$MapWorkSerializer implements Comparator but not Serializable At SerializationUtilities.java:Serializable At SerializationUtilities.java:[lines 551-562] |
| | org.apache.hadoop.hive.ql.exec.SerializationUtilities$PartitionDescSerializer implements Comparator but not Serializable At SerializationUtilities.java:Serializable At SerializationUtilities.java:[lines 572-585] |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-12780/dev-support/hive-personality.sh |
| git revision | master / cce3a05 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-12780/yetus/diff-checkstyle-ql.txt |
| findbugs | http://104.198.109.242/logs//PreCommit-HIVE-Build-12780/yetus/new-findbugs-ql.html |
| modules | C: ql U: ql |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-12780/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |
This message was automatically generated.
[jira] [Commented] (HIVE-19937) Intern fields in MapWork on deserialization
[ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550341#comment-16550341 ] Hive QA commented on HIVE-19937:
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12932285/HIVE-19937.4.patch
{color:red}ERROR:{color} -1 due to build exiting with an error
Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/12721/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12721/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-12721/
Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Tests exited with: Exception: Patch URL https://issues.apache.org/jira/secure/attachment/12932285/HIVE-19937.4.patch was found in seen patch url's cache and a test was probably run already on it. Aborting...
{noformat}
This message is automatically generated.
ATTACHMENT ID: 12932285 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-19937) Intern fields in MapWork on deserialization
[ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550340#comment-16550340 ] Hive QA commented on HIVE-19937:
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12932285/HIVE-19937.4.patch
{color:red}ERROR:{color} -1 due to no test(s) being added or modified.
{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 14679 tests executed
*Failed tests:*
{noformat}
org.apache.hive.minikdc.TestJdbcWithMiniKdc.testTokenAuth (batchId=263)
{noformat}
Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/12720/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12720/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-12720/
Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}
This message is automatically generated.
ATTACHMENT ID: 12932285 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-19937) Intern fields in MapWork on deserialization
[ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550312#comment-16550312 ] Hive QA commented on HIVE-19937:
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 23s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 42s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 4m 8s{color} | {color:blue} ql in master has 2281 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 0s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 5s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 43s{color} | {color:red} ql: The patch generated 5 new + 298 unchanged - 4 fixed = 303 total (was 302) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 4m 18s{color} | {color:red} ql generated 3 new + 2280 unchanged - 1 fixed = 2283 total (was 2281) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 13s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 24m 39s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:ql |
| | Redundant nullcheck of nominal which is known to be null in org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(Path) Redundant null check at AbstractMapOperator.java:is known to be null in org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(Path) Redundant null check at AbstractMapOperator.java:[line 113] |
| | org.apache.hadoop.hive.ql.exec.SerializationUtilities$MapWorkSerializer implements Comparator but not Serializable At SerializationUtilities.java:Serializable At SerializationUtilities.java:[lines 551-562] |
| | org.apache.hadoop.hive.ql.exec.SerializationUtilities$PartitionDescSerializer implements Comparator but not Serializable At SerializationUtilities.java:Serializable At SerializationUtilities.java:[lines 572-585] |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-12720/dev-support/hive-personality.sh |
| git revision | master / 851c8ab |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-12720/yetus/diff-checkstyle-ql.txt |
| findbugs | http://104.198.109.242/logs//PreCommit-HIVE-Build-12720/yetus/new-findbugs-ql.html |
| modules | C: ql U: ql |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-12720/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |
This message was automatically generated.
[jira] [Commented] (HIVE-19937) Intern fields in MapWork on deserialization
[ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549768#comment-16549768 ] Misha Dmitriev commented on HIVE-19937: --- +1 One small comment: instead of {{this.baseFileName = baseFileName == null ? null : baseFileName.intern();}} you can just use {{this.baseFileName = StringUtils.internIfNotNull(baseFileName);}}
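The two forms in the comment above are meant to be equivalent. As a hedged illustration, here is a minimal null-safe intern helper; the implementation of `StringUtils.internIfNotNull` is not shown in this thread, so the `InternSketch` class below is an assumption about its behavior, written only with standard Java.

```java
// Hypothetical helper class for illustration; the thread refers to a
// StringUtils.internIfNotNull method whose implementation is not shown there.
class InternSketch {
    // Returns the canonical (interned) instance of s, or null if s is null.
    static String internIfNotNull(String s) {
        return s == null ? null : s.intern();
    }

    public static void main(String[] args) {
        String first = new String("warehouse/part-00000");
        String second = new String("warehouse/part-00000");
        // Two separately constructed strings are equal but are distinct instances.
        System.out.println(first.equals(second) && first != second);
        // After interning, both calls return the same canonical instance.
        System.out.println(internIfNotNull(first) == internIfNotNull(second));
        // A null argument passes through without a NullPointerException.
        System.out.println(internIfNotNull(null) == null);
    }
}
```

Centralizing the null check in one helper keeps setters to a single short line and avoids repeating the ternary in every class that interns a field.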
[jira] [Commented] (HIVE-19937) Intern fields in MapWork on deserialization
[ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549704#comment-16549704 ] Sahil Takiar commented on HIVE-19937: - The overheads probably won't grow in proportion to the heap, but the goal is to allow users to run Hive-on-Spark successfully even with low heap settings (e.g. 1g). The overheads are more of a function of the workload. In this case, the workload is TPC-DS (a standard SQL benchmark). Hive users who run queries that scan more partitions can expect the overheads to increase.
[jira] [Commented] (HIVE-19937) Intern fields in MapWork on deserialization
[ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549667#comment-16549667 ] Misha Dmitriev commented on HIVE-19937: --- Thank you for publishing the updated jxray report, [~stakiar]. Looks like things work as expected. One small question/concern: I see that this test uses a pretty small heap, less than 1GB. Is it expected that all the overheads that you are fixing will grow proportionally with a bigger heap?
[jira] [Commented] (HIVE-19937) Intern fields in MapWork on deserialization
[ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549601#comment-16549601 ] Sahil Takiar commented on HIVE-19937: - Attached a jxray report taken after applying the patch on a cluster and running the same test. Comparing the two reports, the overhead from fields in {{MapWork}} is now gone. There is still a lot of overhead due to duplicate entries in {{Properties}} objects; I need to fix that. I also noticed a bit of overhead in {{ProjectionPusher}} that I plan to fix in the next version of this patch.