[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388915#comment-16388915 ]

Rui Li commented on HIVE-15104:
-------------------------------

[~stakiar], thanks for trying this out.

bq. The HiveKryoRegistrator still seems to be serializing the hashCode so where are the actual savings coming from?

I didn't look deeply into kryo, but I think the reason is that the generic kryo SerDe has some overhead to store class meta info, while in {{HiveKryoRegistrator}} we just store the data. My earlier [comment|https://issues.apache.org/jira/browse/HIVE-15104?focusedCommentId=16007788=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16007788] shows that a custom SerDe can bring improvements for BytesWritable too.

bq. I'm not sure I understand why the performance should improve when hive.spark.use.groupby.shuffle is set to false.

I guess the difference is due to the different shuffle we used: if {{hive.spark.use.groupby.shuffle}} is false, the group-by-key shuffle is replaced with a repartition-and-sort-within-partition shuffle. And yes, the registrator is the same for the two cases.

bq. why do we need the hashCode after deserializing the data?

For MR, the hash code is not needed for a deserialized HiveKey (see HiveKey::hashCode), because when a HiveKey is deserialized, it has already been distributed to the proper reducer. For Spark, RDDs may get cached during the execution. So if we deserialize a cached RDD and try to partition it to a downstream reducer, we'll need the hash code available after deserialization.
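To make the two points in this comment concrete (a data-only encoding is smaller than a generic one that also records class metadata, and the hash code has to survive a round trip so the key can be partitioned again), here is a minimal plain-JDK sketch. It is an analogy only: it uses {{java.io}} serialization rather than kryo, and {{Key}}, {{encode}}, {{decode}} and {{partition}} are hypothetical names, not the code in {{HiveKryoRegistrator}}.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class KeyCodecSketch {
    /** Hypothetical stand-in for a shuffle key: raw bytes plus a precomputed hash code. */
    static final class Key implements Serializable {
        private static final long serialVersionUID = 1L;
        final byte[] bytes;
        final int hash;
        Key(byte[] bytes, int hash) { this.bytes = bytes; this.hash = hash; }
    }

    /** Data-only encoding: 4-byte hash, 4-byte length, then the payload. */
    static byte[] encode(Key k) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(k.hash);
            out.writeInt(k.bytes.length);
            out.write(k.bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Reads the fields back in the same order, so the hash code is restored. */
    static Key decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int hash = in.readInt();
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            return new Key(bytes, hash);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Generic encoding via Java serialization, which also writes a class descriptor. */
    static byte[] encodeGeneric(Key k) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(k);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Picks a downstream partition from the hash code carried by the key. */
    static int partition(Key k, int numPartitions) {
        return (k.hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        Key key = new Key(new byte[] {1, 2, 3, 4}, 42);
        System.out.println("data-only bytes: " + encode(key).length);
        System.out.println("generic bytes:   " + encodeGeneric(key).length);
        System.out.println("partition after round trip: " + partition(decode(encode(key)), 8));
    }
}
```

The sizes printed here say nothing about kryo's actual overhead; the sketch only shows the shape of the trade-off described above.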
> Hive on Spark generate more shuffle data than hive on mr
> --------------------------------------------------------
>
> Key: HIVE-15104
> URL: https://issues.apache.org/jira/browse/HIVE-15104
> Project: Hive
> Issue Type: Bug
> Components: Spark
> Affects Versions: 1.2.1
> Reporter: wangwenli
> Assignee: Rui Li
> Priority: Major
> Fix For: 3.0.0
>
> Attachments: HIVE-15104.1.patch, HIVE-15104.10.patch,
> HIVE-15104.2.patch, HIVE-15104.3.patch, HIVE-15104.4.patch,
> HIVE-15104.5.patch, HIVE-15104.6.patch, HIVE-15104.7.patch,
> HIVE-15104.8.patch, HIVE-15104.9.patch, TPC-H 100G.xlsx
>
>
> The same SQL, run on the Spark and MR engines, generates different amounts
> of shuffle data.
> I think this is because Hive on MR serializes only part of HiveKey, while Hive
> on Spark, which uses kryo, serializes the full HiveKey object.
> What is your opinion?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388551#comment-16388551 ]

Sahil Takiar commented on HIVE-15104:
-------------------------------------

Hey [~lirui] I found some time to do some internal testing of this patch. I ran a 1 TB Parquet TPC-DS benchmark (subset of 49 queries, run three times each) on a physical cluster and found similar results to your TPC-DS benchmark.

|| ||Baseline Run (default configuration)||Optimized Serde||Optimized Serde + No GroupBy||
||Shuffle Bytes Read|699.5 GB|530.7 GB|531.6 GB|
||Shuffle Bytes Written|690 GB|529.5 GB|530.3 GB|
||Total Latency (min)|202|191|190|

So roughly a 24% reduction in shuffle data and a 5% improvement in latency. I think the reduction in shuffle data is significant.

A few questions on the implementation.
* The {{HiveKryoRegistrator}} still seems to be serializing the {{hashCode}} so where are the actual savings coming from?
* I'm not sure I understand why the performance should improve when {{hive.spark.use.groupby.shuffle}} is set to {{false}}. It's still using the same registrator, right?
* You said that we need to serialize the {{hashCode}} because "The cached RDDs will be serialized when stored to disk or transferred via network, then we need the hash code after the data is deserialized". Why do we need the {{hashCode}} after deserializing the data?
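The percentages quoted in this comment can be rechecked from the benchmark table. The following is a small sketch of that arithmetic; the class and method names are made up for illustration, and the inputs are the table values as reported.

```java
public class ShuffleSavings {
    /** Relative reduction of {@code optimized} versus {@code baseline}, in percent. */
    static double reductionPercent(double baseline, double optimized) {
        return (baseline - optimized) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Values from the benchmark table (GB for shuffle bytes, minutes for latency).
        System.out.printf("Shuffle bytes read:    %.1f%%%n", reductionPercent(699.5, 530.7));
        System.out.printf("Shuffle bytes written: %.1f%%%n", reductionPercent(690.0, 529.5));
        System.out.printf("Total latency:         %.1f%%%n", reductionPercent(202.0, 191.0));
    }
}
```

With the reported numbers this comes out at a little over 24% for bytes read, a little over 23% for bytes written, and a bit more than 5% for latency.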
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233602#comment-16233602 ]

Lefty Leverenz commented on HIVE-15104:
---------------------------------------

Good doc, thanks [~lirui]. I removed the TODOC3.0 label.

Here's a direct link to the doc:
* [hive.spark.optimize.shuffle.serde | https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.spark.optimize.shuffle.serde]
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224326#comment-16224326 ]

Rui Li commented on HIVE-15104:
-------------------------------

Thanks [~leftylev] for the reminder. I've updated the wiki.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16219861#comment-16219861 ]

Lefty Leverenz commented on HIVE-15104:
---------------------------------------

Doc note: This adds *hive.spark.optimize.shuffle.serde* to HiveConf.java, so it needs to be documented in the wiki.
* [Configuration Properties -- Spark | https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Spark]

Added a TODOC3.0 label.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217849#comment-16217849 ]

Hive QA commented on HIVE-15104:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12893666/HIVE-15104.10.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 11319 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multi] (batchId=110)
org.apache.hadoop.hive.cli.control.TestDanglingQOuts.checkDanglingQOut (batchId=205)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints (batchId=222)
org.apache.hadoop.hive.ql.parse.authorization.plugin.sqlstd.TestOperation2Privilege.checkHiveOperationTypeMatch (batchId=270)
org.apache.hive.jdbc.TestTriggersWorkloadManager.testTriggerHighShuffleBytes (batchId=229)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/7456/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/7456/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-7456/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12893666 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217431#comment-16217431 ]

Xuefu Zhang commented on HIVE-15104:
------------------------------------

+1
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208904#comment-16208904 ]

Rui Li commented on HIVE-15104:
-------------------------------

The sub-query failures are tracked by HIVE-17823. Others are not related.
I've put the 9th patch on RB. [~xuefuz], could you take another look? Thanks.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207563#comment-16207563 ]

Hive QA commented on HIVE-15104:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12892549/HIVE-15104.9.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 13 failed/errored test(s), 11276 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[unionDistinct_1] (batchId=145)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[optimize_nullscan] (batchId=163)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multi] (batchId=110)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_notin] (batchId=133)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_scalar] (batchId=119)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_select] (batchId=119)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_views] (batchId=108)
org.apache.hadoop.hive.cli.TestSparkPerfCliDriver.testCliDriver[query16] (batchId=243)
org.apache.hadoop.hive.cli.TestSparkPerfCliDriver.testCliDriver[query94] (batchId=243)
org.apache.hadoop.hive.cli.TestTezPerfCliDriver.testCliDriver[query16] (batchId=241)
org.apache.hadoop.hive.cli.TestTezPerfCliDriver.testCliDriver[query94] (batchId=241)
org.apache.hadoop.hive.cli.control.TestDanglingQOuts.checkDanglingQOut (batchId=204)
org.apache.hive.jdbc.TestTriggersWorkloadManager.testTriggerHighShuffleBytes (batchId=229)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/7348/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/7348/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-7348/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 13 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12892549 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207028#comment-16207028 ]

Hive QA commented on HIVE-15104:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12892511/HIVE-15104.7.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/7341/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/7341/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-7341/

Messages:
{noformat}
This message was trimmed, see log for full details
+ PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'MAVEN_OPTS=-Xmx1g '
+ MAVEN_OPTS='-Xmx1g '
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-7341/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z master ]]
+ [[ -d apache-github-source-source ]]
+ [[ ! -d apache-github-source-source/.git ]]
+ [[ ! -d apache-github-source-source ]]
+ date '+%Y-%m-%d %T.%3N'
2017-10-17 05:22:03.427
+ cd apache-github-source-source
+ git fetch origin
+ git reset --hard HEAD
HEAD is now at 8c3f0e4 HIVE-17815: prevent OOM with Atlas Hive hook (Anishek Agarwal reviewed by Thejas Nair)
+ git clean -f -d
+ git checkout master
Already on 'master'
Your branch is up-to-date with 'origin/master'.
+ git reset --hard origin/master
HEAD is now at 8c3f0e4 HIVE-17815: prevent OOM with Atlas Hive hook (Anishek Agarwal reviewed by Thejas Nair)
+ git merge --ff-only origin/master
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2017-10-17 05:22:03.911
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
+ chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh
+ /data/hiveptest/working/scratch/smart-apply-patch.sh /data/hiveptest/working/scratch/build.patch
Going to apply patch with: patch -p0
patching file common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
patching file hive-kryo-registrator/pom.xml
patching file hive-kryo-registrator/src/main/java/org/apache/hive/spark/HiveKryoRegistrator.java
patching file itests/src/test/resources/testconfiguration.properties
patching file packaging/pom.xml
patching file pom.xml
patching file ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveSparkClientFactory.java
patching file ql/src/java/org/apache/hadoop/hive/ql/exec/spark/LocalHiveSparkClient.java
patching file ql/src/test/queries/clientpositive/spark_opt_shuffle_serde.q
patching file ql/src/test/results/clientpositive/spark/spark_opt_shuffle_serde.q.out
patching file spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java
patching file spark-client/src/main/java/org/apache/hive/spark/client/SparkClientUtilities.java
+ [[ maven == \m\a\v\e\n ]]
+ rm -rf /data/hiveptest/working/maven/org/apache/hive
+ mvn -B clean install -DskipTests -T 4 -q -Dmaven.repo.local=/data/hiveptest/working/maven
protoc-jar: protoc version: 250, detected platform: linux/amd64
protoc-jar: executing: [/tmp/protoc9130095883787762036.exe, -I/data/hiveptest/working/apache-github-source-source/standalone-metastore/src/main/protobuf/org/apache/hadoop/hive/metastore, --java_out=/data/hiveptest/working/apache-github-source-source/standalone-metastore/target/generated-sources, /data/hiveptest/working/apache-github-source-source/standalone-metastore/src/main/protobuf/org/apache/hadoop/hive/metastore/metastore.proto]
ANTLR Parser Generator Version 3.5.2
Output file /data/hiveptest/working/apache-github-source-source/standalone-metastore/target/generated-sources/antlr3/org/apache/hadoop/hive/metastore/parser/FilterParser.java does not exist: must build /data/hiveptest/working/apache-github-source-source/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/parser/Filter.g
org/apache/hadoop/hive/metastore/parser/Filter.g
DataNucleus Enhancer (version 4.1.17) for API "JDO"
DataNucleus Enhancer : Classpath
>> /usr/share/maven/boot/plexus-classworlds-2.x.jar
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MDatabase
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MFieldSchema
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MType
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MTable
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MConstraint
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MSerDeInfo
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MOrder
ENHANCED (Persistable) :
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205863#comment-16205863 ]

Hive QA commented on HIVE-15104:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12892380/HIVE-15104.6.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/7323/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/7323/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-7323/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ date '+%Y-%m-%d %T.%3N'
2017-10-16 13:13:35.520
+ [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]]
+ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ export PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'MAVEN_OPTS=-Xmx1g '
+ MAVEN_OPTS='-Xmx1g '
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-7323/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z master ]]
+ [[ -d apache-github-source-source ]]
+ [[ ! -d apache-github-source-source/.git ]]
+ [[ ! -d apache-github-source-source ]]
+ date '+%Y-%m-%d %T.%3N'
2017-10-16 13:13:35.522
+ cd apache-github-source-source
+ git fetch origin
From https://github.com/apache/hive
   6339936..da304ef  master     -> origin/master
+ git reset --hard HEAD
HEAD is now at 6339936 HIVE-17749: Multiple class have missed the ASF header (Saijin Huang via Rui)
+ git clean -f -d
Removing standalone-metastore/src/gen/org/
+ git checkout master
Already on 'master'
Your branch is behind 'origin/master' by 2 commits, and can be fast-forwarded.
  (use "git pull" to update your local branch)
+ git reset --hard origin/master
HEAD is now at da304ef HIVE-17798: When replacing the src table names in BeeLine testing, the table names shouldn't be changed to lower case (Marta Kuczora, via Peter Vary)
+ git merge --ff-only origin/master
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2017-10-16 13:13:39.565
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
+ chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh
+ /data/hiveptest/working/scratch/smart-apply-patch.sh /data/hiveptest/working/scratch/build.patch
error: a/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java: No such file or directory
error: a/itests/src/test/resources/testconfiguration.properties: No such file or directory
error: a/packaging/pom.xml: No such file or directory
error: a/pom.xml: No such file or directory
error: a/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveSparkClientFactory.java: No such file or directory
error: a/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/LocalHiveSparkClient.java: No such file or directory
error: a/spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java: No such file or directory
error: a/spark-client/src/main/java/org/apache/hive/spark/client/SparkClientUtilities.java: No such file or directory
The patch does not appear to apply with p0, p1, or p2
+ exit 1
'
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12892380 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203867#comment-16203867 ]

Xuefu Zhang commented on HIVE-15104:
------------------------------------

I think it's fairly safe to assume that hive-exec.jar and the new jar are in the same location. We can error out if the jar cannot be found in that location.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203332#comment-16203332 ]

Rui Li commented on HIVE-15104:
-------------------------------

[~xuefuz], we need to locate the jar on the Hive side, before we call spark-submit. I made Hive include it in the {{HIVE_HOME/lib}} directory. I guess we can find the path to hive-exec.jar (which is also under lib) and search for the registrator jar under the same path (or relative ones). But that will totally depend on how Hive is installed.
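The lookup discussed here (start from the path of hive-exec.jar, search the same directory for the registrator jar, and error out if it is missing) could be sketched roughly as follows. This is only an illustration with plain {{java.nio.file}}: {{findSiblingJar}} and {{makeDemoLib}} are hypothetical helpers, and the jar file names with a {{3.0.0}} suffix are assumed for the demo, not taken from the patch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SiblingJarFinder {
    /**
     * Looks in the directory that holds knownJar for a jar whose file name
     * starts with prefix; fails loudly if there is none.
     */
    static Path findSiblingJar(Path knownJar, String prefix) {
        Path dir = knownJar.toAbsolutePath().getParent();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, prefix + "*.jar")) {
            for (Path jar : jars) {
                return jar; // first match wins in this sketch
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        throw new IllegalStateException("No " + prefix + "*.jar found next to " + knownJar);
    }

    /** Demo helper: creates a temporary "lib" directory containing empty files. */
    static Path makeDemoLib(String... fileNames) {
        try {
            Path lib = Files.createTempDirectory("hive-lib-demo");
            for (String name : fileNames) {
                Files.createFile(lib.resolve(name));
            }
            return lib;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed file names, for illustration only.
        Path lib = makeDemoLib("hive-exec-3.0.0.jar", "hive-kryo-registrator-3.0.0.jar");
        Path found = findSiblingJar(lib.resolve("hive-exec-3.0.0.jar"), "hive-kryo-registrator");
        System.out.println("registrator jar: " + found.getFileName());
    }
}
```

As noted in the comment, any such lookup depends on how Hive is installed, which is why the sketch throws instead of guessing another location.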
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202897#comment-16202897 ]

Xuefu Zhang commented on HIVE-15104:
------------------------------------

Hi [~lirui], to locate the jar, can we assume that the jar is located somewhere in Hive's installation path? I'm not sure where (Hive, spark-submit, or remote driver) we need to find the location of the jar.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201695#comment-16201695 ]

Rui Li commented on HIVE-15104:
-------------------------------

One correction: the {{NoClassDefFoundError}} is for {{com.esotericsoftware.kryo.Serializer}}. That's because our HiveKey and BytesWritable serializers extend kryo's Serializer. When loading our classes, the super class also needs to be loaded, hence the error.
Since the serializers are static nested classes of HiveKryoRegistrator, I tried loading the class w/o linking it, i.e. by calling {{ClassLoader.loadClass()}}. That avoids the NoClassDefFoundError, but I'm not sure whether this is reliable and independent of JVM implementations.
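The difference relied on here can be observed with a small plain-JDK experiment: {{ClassLoader.loadClass(String)}} loads a class without initializing it, whereas {{Class.forName(String)}} also initializes it. The sketch below uses static initializers as an observable proxy; it does not reproduce the kryo {{NoClassDefFoundError}} itself, and all class names in it are made up.

```java
public class LazyLoadDemo {
    // Flags flipped by the static initializers of the two probe classes below.
    static volatile boolean loadedOnlyInitialized = false;
    static volatile boolean forNameInitialized = false;

    /** Probe that is only ever loaded through ClassLoader.loadClass(). */
    static class LoadedOnly {
        static { loadedOnlyInitialized = true; }
    }

    /** Probe that is loaded through Class.forName(), which also initializes it. */
    static class LoadedAndInitialized {
        static { forNameInitialized = true; }
    }

    /** Loads LoadedOnly by name without initializing it, then reports its flag. */
    static boolean initializedAfterLoadClass() {
        try {
            ClassLoader loader = LazyLoadDemo.class.getClassLoader();
            loader.loadClass(LazyLoadDemo.class.getName() + "$LoadedOnly");
            return loadedOnlyInitialized;
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Loads and initializes LoadedAndInitialized by name, then reports its flag. */
    static boolean initializedAfterForName() {
        try {
            Class.forName(LazyLoadDemo.class.getName() + "$LoadedAndInitialized");
            return forNameInitialized;
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("initialized after loadClass(): " + initializedAfterLoadClass());
        System.out.println("initialized after forName():   " + initializedAfterForName());
    }
}
```

Initialization is specified behavior, while when linking happens is left largely to the JVM, which matches the caution in the comment about relying on this across JVM implementations.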
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201405#comment-16201405 ]

Rui Li commented on HIVE-15104:
-------------------------------

Hi [~xuefuz], sorry for taking so long to update. I tried out your proposal. The idea is to build the trivial package and add it with the {{--jars}} config when we launch the Spark app. So we need to locate the jar at runtime.
To locate the jar, we can use reflection to get the class and call {{SparkContext.jarOfClass}} to get the URI. But since kryo is shaded, we don't have {{com.esotericsoftware.kryo.Kryo}} in our class path on the Hive side. When I try to get the registrator class, I get a {{NoClassDefFoundError}} for {{com.esotericsoftware.kryo.Kryo}}.
I guess one workaround is to let the user specify the path to the jar, but that doesn't seem very friendly. Any suggestions?
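For readers unfamiliar with the idea of asking a class where it was loaded from, here is a plain-JDK analogue. It is not Spark's {{SparkContext.jarOfClass}} implementation, only a sketch of the same idea using the class's code source; {{JarLocator}} and {{locationOf}} are hypothetical names.

```java
import java.net.URL;
import java.security.CodeSource;
import java.security.ProtectionDomain;
import java.util.Optional;

public class JarLocator {
    /**
     * Returns the location (jar or directory) a class was loaded from,
     * or empty when the JVM does not record one, e.g. for core JDK classes.
     */
    static Optional<URL> locationOf(Class<?> cls) {
        ProtectionDomain domain = cls.getProtectionDomain();
        CodeSource source = (domain == null) ? null : domain.getCodeSource();
        return (source == null) ? Optional.empty() : Optional.ofNullable(source.getLocation());
    }

    public static void main(String[] args) {
        System.out.println("JarLocator loaded from: " + locationOf(JarLocator.class));
        System.out.println("String loaded from:     " + locationOf(String.class));
    }
}
```

Note that this approach needs the {{Class}} object first, so it runs into the same problem described in the comment whenever obtaining the class fails because a dependency is not on the class path.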
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149777#comment-16149777 ] Xuefu Zhang commented on HIVE-15104: Hi [~lirui], I think creating a trivial package is still better than the runtime compilation/packaging. Plus, non-HoS developers don't need to bother with that package. I don't foresee any problem with a separate project.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148485#comment-16148485 ] Rui Li commented on HIVE-15104: --- [~xuefuz], I'll check whether that's feasible. Do you think it's OK to create a package just for one single class?
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148351#comment-16148351 ] Xuefu Zhang commented on HIVE-15104: I see. It might be possible to put this class in a new package (jar), for which we don't relocate kryo?
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148340#comment-16148340 ] Rui Li commented on HIVE-15104: --- [~xuefuz], my previous [comment|https://issues.apache.org/jira/browse/HIVE-15104?focusedCommentId=15998177=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15998177] has some explanations about the relocation problem. Basically, the problem is that we need to implement a method defined by Spark, and the method accepts a kryo parameter. With relocation, Hive's kryo and Spark's kryo are in different packages. If we compile the class in Hive and run it in Spark, Spark will find the method not implemented because it has a different signature.
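The signature mismatch can be illustrated without Spark or kryo. In the sketch below every name is hypothetical: {{Original.Kryo}} and {{Relocated.Kryo}} stand in for the same class under its original and relocated package names, and {{registerClasses}} stands in for the method the framework expects. Because the JVM identifies a method by its name plus its parameter types, a method compiled against the relocated type is a different method.

```java
import java.lang.invoke.MethodType;

final class RelocationMismatch {
    // Stand-ins for one class under its original and its relocated name.
    static class Original { static class Kryo {} }
    static class Relocated { static class Kryo {} }

    // Stand-in for a class compiled against the relocated name:
    // same method name, but a different parameter type.
    static class CompiledAgainstRelocated {
        public void registerClasses(Relocated.Kryo kryo) { }
    }

    /** JVM method descriptor for a void method taking one parameter of the given type. */
    static String descriptor(Class<?> param) {
        return MethodType.methodType(void.class, param).toMethodDescriptorString();
    }

    /** True if owner has a public method with exactly this name and parameter type. */
    static boolean hasMethod(Class<?> owner, String name, Class<?> param) {
        try {
            owner.getMethod(name, param);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(descriptor(Original.Kryo.class));
        System.out.println(descriptor(Relocated.Kryo.class));
        // The caller looks for the method taking the original type and does not find it.
        System.out.println(hasMethod(CompiledAgainstRelocated.class,
                "registerClasses", Original.Kryo.class));
    }
}
```

The two descriptors differ only in the package part of the parameter type, which is enough for the lookup with the original type to fail.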
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148174#comment-16148174 ] Xuefu Zhang commented on HIVE-15104: The patch looks good to me. My only concern is about the reliability of the runtime compilation and jar creation. I think it's best if we can avoid that. I'm not 100% sure about the class loading problem we faced. If we define class HiveKryoRegistrator in Hive, with relocation, is Spark's unrelocated kryo unable to find it?
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141431#comment-16141431 ] Hive QA commented on HIVE-15104:

Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12883637/HIVE-15104.5.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 11001 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_values_orig_table_use_metadata] (batchId=61)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning] (batchId=169)
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testHttpRetryOnServerIdleTimeout (batchId=228)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6535/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6535/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6535/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12883637 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16134639#comment-16134639 ] Rui Li commented on HIVE-15104: --- Thanks [~xuefuz], and take your time. I guess we can also run a round of QA tests with the switch turned on.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133290#comment-16133290 ] Xuefu Zhang commented on HIVE-15104: [~lirui], I found it difficult to backport HIVE-17114 to our code base, so I had to give up. However, since you have a configuration to turn this on/off, I think it's fine to have this and postpone the verification on my side until we upgrade our Hive. I need some time to review your latest patch, as it differs from the previous one and has some low-level class/jar manipulations.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129895#comment-16129895 ] Xuefu Zhang commented on HIVE-15104: Hi [~lirui], thanks for continuing the work here. The improvement is impressive and not much perf degradation is observed. Let me get back my old benchmarks and see if those patches help.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129871#comment-16129871 ] Rui Li commented on HIVE-15104: --- Hi [~xuefuz], with HIVE-17114 and HIVE-17321 the benchmark results become more stable and the improvement is a little higher. Here's the latest [100GB TPC-DS result|https://docs.google.com/spreadsheets/d/1ba-AbUpJOHNb0_5PZyWQHzrH4wfRljMxQP9vA9JACHg/edit?usp=sharing]. Would you mind sharing your benchmark tool so that I can look into the perf degradation? Thanks.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085465#comment-16085465 ] Hive QA commented on HIVE-15104:

Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12877028/HIVE-15104.4.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 10889 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[create_merge_compressed] (batchId=237)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[llap_smb] (batchId=143)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] (batchId=232)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionRegistrationWithCustomSchema (batchId=177)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema (batchId=177)
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation (batchId=177)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6004/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6004/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6004/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12877028 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085384#comment-16085384 ] Rui Li commented on HIVE-15104: --- Hi [~xuefuz], I can't reproduce the perf degradation on my side. Some cases run slower with the patch, which I think is just variation. The overall perf (total time taken to run the benchmark) is still slightly improved. I don't have physical nodes to run the benchmark, so it was done on VMs. It'd be great if you could rerun your benchmark so that we can take a closer look at the degradation. Thanks.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085324#comment-16085324 ] Xuefu Zhang commented on HIVE-15104: [~lirui], I'm wondering if there is anything new (other than moving code around). Last time we benchmarked, we found there was actual performance degradation. We can do that again, and if the perf degradation still exists, we may not want this, at least not by default. I wasn't able to figure out why this degradation might happen.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085276#comment-16085276 ] Rui Li commented on HIVE-15104: --- I also ran another round of TPC-DS. The overall shuffle data is reduced by 12%. The query time improvement, however, is negligible: about 1.5%. [~xuefuz], do you think it's worth the effort?
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052006#comment-16052006 ] Rui Li commented on HIVE-15104: --- The approach here can cause problems when we cache RDDs, e.g. when combining equivalent works. The cached RDDs will be serialized when stored to disk or transferred via the network, and we then need the hash code after the data is deserialized. I think we have to ser/de the hash code anyway to be safe.
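Why the hash code has to travel with the key can be sketched with plain {{java.io}} streams. This is not Hive's {{HiveKey}} or the {{HiveKryoRegistrator}} serializer; {{Key}}, {{write}}, {{read}} and {{partitionFor}} are hypothetical names. The sketch assumes, as the comment above implies, that the hash code is assigned to the key separately and cannot be recomputed from the key bytes, so a deserialized key can only be re-partitioned if the hash code was written out too.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class KeyWithHashCodec {
    /** Hypothetical key whose hash code is assigned by the producer, not derived from the bytes. */
    static final class Key {
        final byte[] bytes;
        final int assignedHash;

        Key(byte[] bytes, int assignedHash) {
            this.bytes = bytes;
            this.assignedHash = assignedHash;
        }

        @Override
        public int hashCode() {
            return assignedHash;
        }
    }

    /** Writes length, payload and the assigned hash code. */
    static byte[] write(Key key) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(key.bytes.length);
            out.write(key.bytes);
            out.writeInt(key.assignedHash); // dropping this would lose the partitioning input
        }
        return buffer.toByteArray();
    }

    /** Reads a key back, restoring the hash code the producer assigned. */
    static Key read(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            return new Key(bytes, in.readInt());
        }
    }

    /** Picks a partition from the key's hash code; floorMod keeps the result non-negative. */
    static int partitionFor(Key key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) throws IOException {
        Key original = new Key(new byte[] {1, 2, 3}, 12345);
        Key restored = read(write(original));
        System.out.println(partitionFor(original, 8) == partitionFor(restored, 8));
    }
}
```

The cost of this safety is four extra bytes per key, which is part of the trade-off discussed in the thread.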
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16018507#comment-16018507 ] Hive QA commented on HIVE-15104:

Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12869112/TPC-H%20100G.xlsx

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5367/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5367/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5367/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ date '+%Y-%m-%d %T.%3N'
2017-05-20 15:31:32.562
+ [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]]
+ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ export PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'MAVEN_OPTS=-Xmx1g '
+ MAVEN_OPTS='-Xmx1g '
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-5367/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z master ]]
+ [[ -d apache-github-source-source ]]
+ [[ ! -d apache-github-source-source/.git ]]
+ [[ ! -d apache-github-source-source ]]
+ date '+%Y-%m-%d %T.%3N'
2017-05-20 15:31:32.564
+ cd apache-github-source-source
+ git fetch origin
+ git reset --hard HEAD
HEAD is now at 7429f5f HIVE-16717: Extend shared scan optimizer to handle partitions (Jesus Camacho Rodriguez, reviewed by Ashutosh Chauhan)
+ git clean -f -d
Removing ql/src/gen/vectorization/UDAFTemplates/VectorUDAFAvgDecimal.txt
Removing ql/src/gen/vectorization/UDAFTemplates/VectorUDAFAvgDecimalMerge.txt
Removing ql/src/gen/vectorization/UDAFTemplates/VectorUDAFAvgMerge.txt
Removing ql/src/gen/vectorization/UDAFTemplates/VectorUDAFAvgTimestamp.txt
Removing ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/aggregates/VectorUDAFSumTimestamp.java
+ git checkout master
Already on 'master'
Your branch is up-to-date with 'origin/master'.
+ git reset --hard origin/master
HEAD is now at 7429f5f HIVE-16717: Extend shared scan optimizer to handle partitions (Jesus Camacho Rodriguez, reviewed by Ashutosh Chauhan)
+ git merge --ff-only origin/master
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2017-05-20 15:31:33.510
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
+ chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh
+ /data/hiveptest/working/scratch/smart-apply-patch.sh /data/hiveptest/working/scratch/build.patch
patch: Only garbage was found in the patch input.
patch: Only garbage was found in the patch input.
patch: Only garbage was found in the patch input.
fatal: unrecognized input
The patch does not appear to apply with p0, p1, or p2
+ exit 1
'
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12869112 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017621#comment-16017621 ] Rui Li commented on HIVE-15104: --- Patch v3 compiles the registrators at runtime, so that we don't have to disable kryo relocation. I've also put it on RB. Will upload a benchmark result later.
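Runtime compilation in general can be sketched with the JDK's {{javax.tools}} API. This is not the code from patch v3; {{RuntimeCompileSketch}}, {{compileAndInstantiate}} and the generated {{Hello}} class are hypothetical. The sketch writes a source string to a temporary directory, compiles it there and loads the result. It assumes the process runs on a full JDK, since {{ToolProvider.getSystemJavaCompiler()}} returns {{null}} when no compiler is available.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

final class RuntimeCompileSketch {
    /** Compiles a top-level class in the default package and returns a new instance of it. */
    static Object compileAndInstantiate(String className, String source) throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            // Happens on runtimes that ship without the compiler module.
            throw new IllegalStateException("no system Java compiler available; a JDK is required");
        }

        // Write the source where the compiler can read it.
        Path dir = Files.createTempDirectory("runtime-compile");
        Path file = dir.resolve(className + ".java");
        Files.write(file, source.getBytes(StandardCharsets.UTF_8));

        // null streams mean System.in/out/err; -d puts the .class file next to the source.
        int status = compiler.run(null, null, null, "-d", dir.toString(), file.toString());
        if (status != 0) {
            throw new IllegalStateException("compilation failed with status " + status);
        }

        // Load the freshly compiled class from the temporary directory.
        // The loader is left open so the class can still resolve dependencies later.
        URLClassLoader loader = new URLClassLoader(new URL[] {dir.toUri().toURL()});
        return loader.loadClass(className).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        String source = "public class Hello {"
                + " @Override public String toString() { return \"compiled at runtime\"; } }";
        System.out.println(compileAndInstantiate("Hello", source));
    }
}
```

The dependency on a compiler being present at runtime is one reason such an approach can be less reliable than shipping a prebuilt jar.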
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017527#comment-16017527 ] Hive QA commented on HIVE-15104:

Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12868935/HIVE-15104.3.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 10738 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_3] (batchId=97)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3] (batchId=97)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query24] (batchId=231)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5349/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5349/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5349/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12868935 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010433#comment-16010433 ] Hive QA commented on HIVE-15104:

Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12868055/HIVE-15104.2.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 10698 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_ppd_decimal] (batchId=9)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr] (batchId=144)
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testConcurrentStatements (batchId=225)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5260/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5260/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5260/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12868055 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009026#comment-16009026 ] Rui Li commented on HIVE-15104: --- [~xuefuz], kryo was relocated in HIVE-5915. So it's not intended for Spark. Actually, we're on the same version as Spark-2.0.0: kryo-shaded-3.0.3.

bq. I'm concerned that class conflicts might come back if we stop relocating Kryo

You're right. I'm not sure whether it's a conflict or a loading issue, but when I tried to run some TPC-H benchmarks, I got a ClassNotFoundException, although the class is there in hive-exec.jar. I'll see how to work around this. BTW, the test in my last comment shuffles very little data. That's why optimizing the overhead can yield a significant improvement. I guess this won't be the case in real-world queries. That's why I want to run a more serious benchmark.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008502#comment-16008502 ]

Hive QA commented on HIVE-15104:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12867727/HIVE-15104.1.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 7 failed/errored test(s), 10688 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr] (batchId=144)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_3] (batchId=97)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3] (batchId=97)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.org.apache.hadoop.hive.cli.TestNegativeCliDriver (batchId=89)
org.apache.hive.jdbc.TestJdbcWithLocalClusterSpark.testSparkQuery (batchId=225)
org.apache.hive.jdbc.TestJdbcWithLocalClusterSpark.testTempTable (batchId=225)
org.apache.hive.jdbc.TestMultiSessionsHS2WithLocalClusterSpark.testSparkQuery (batchId=225)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5226/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5226/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5226/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 7 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12867727 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008107#comment-16008107 ]

Xuefu Zhang commented on HIVE-15104:

[~lirui], great progress! Thanks for keeping up the effort. As to Kryo class relocation, I think Hive did that to avoid version difference between Spark and Hive. (git history might confirm this.) I'm concerned that class conflicts might come back if we stop relocating Kryo. Any thoughts?
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998437#comment-15998437 ]

Rui Li commented on HIVE-15104:

Tried disabling relocation locally. It does solve the AbstractMethodError. However, it seems Spark still needs the hashCode on the reducer side. Will dig more ...
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998177#comment-15998177 ]

Rui Li commented on HIVE-15104:

I looked at Spark's shuffle writers and none of them seems to need the hashCode/partitionId after the HiveKey is serialized. But I ran into a problem during implementation. The plan is to implement this Spark trait:
{code}
trait KryoRegistrator {
  def registerClasses(kryo: Kryo): Unit
}
{code}
Then we set {{spark.kryo.registrator}} to the implementing class. At runtime, Spark will use reflection to instantiate our class and call its registerClasses to register the optimized SerDe for HiveKey.

However, Kryo is relocated in Hive. After the build, the method signature of our class will actually be: {{public void registerClasses(org.apache.hive.com.esotericsoftware.kryo.Kryo kryo)}}. When Spark calls the method, we get an {{AbstractMethodError}}. I suppose this is because the {{public void registerClasses(com.esotericsoftware.kryo.Kryo kryo)}} method is not really implemented. Does anybody know how this can be resolved?
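The signature mismatch described in the comment above can be illustrated without Spark or Kryo on the classpath. The sketch below is only an analogy: the nested classes {{Original.Kryo}} and {{Relocated.Kryo}} are hypothetical stand-ins for {{com.esotericsoftware.kryo.Kryo}} and {{org.apache.hive.com.esotericsoftware.kryo.Kryo}}, {{ShadedRegistrator}} is a made-up name, and a failed reflective lookup stands in for the {{AbstractMethodError}} that the JVM raises at call time.

```java
import java.lang.reflect.Method;

final class RelocationDemo {
    // Hypothetical stand-ins for the unshaded and the relocated Kryo class.
    static final class Original { static final class Kryo {} }
    static final class Relocated { static final class Kryo {} }

    // What a registrator looks like once shading has rewritten its parameter type.
    static final class ShadedRegistrator {
        public void registerClasses(Relocated.Kryo kryo) {
            // a real registrator would register the HiveKey serializer here
        }
    }

    // Returns true if `type` has a public registerClasses method taking `kryoClass`.
    static boolean hasRegisterClasses(Class<?> type, Class<?> kryoClass) {
        try {
            Method m = type.getMethod("registerClasses", kryoClass);
            return m != null;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The method exists for the relocated parameter type ...
        System.out.println(hasRegisterClasses(ShadedRegistrator.class, Relocated.Kryo.class));
        // ... but not for the type the caller (Spark, in the real case) expects.
        System.out.println(hasRegisterClasses(ShadedRegistrator.class, Original.Kryo.class));
    }
}
```

The two methods share a name but differ in parameter type, so the JVM treats them as unrelated methods. That is consistent with the guess above that the interface method is "not really implemented" after relocation.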
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960801#comment-15960801 ]

Aihua Xu commented on HIVE-15104:

[~lirui] I didn't have time to work on that. Feel free to take it over.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960146#comment-15960146 ]

Rui Li commented on HIVE-15104:

Hi [~aihuaxu], are you still working on this? If not, do you mind if I take over?
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15644296#comment-15644296 ]

Aihua Xu commented on HIVE-15104:

I will take a look at Spark to see whether the hash code is still needed after the HiveKey is serialized.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634914#comment-15634914 ]

Rui Li commented on HIVE-15104:

[~xuefuz], [~aihuaxu], both MR and Spark need HiveKey.hashCode to compute the partition number. HiveKey's [hashCode|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveKey.java#L58] is not computed on demand. Instead, it's a field we set on the HiveKey. I think what we need to investigate is whether the hash code is still needed after the HiveKey is serialized, e.g. whether Spark somehow deserializes the HiveKey and accesses the hash code again. If not, we can just ser/de the BytesWritable part of HiveKey because the hash code is not needed any more.
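To make the "field, not computed on demand" point concrete, here is a minimal sketch. {{StoredHashDemo.Key}} is a hypothetical stand-in for HiveKey, not the real class: its {{hashCode()}} returns whatever value was assigned at construction, while {{contentHash}} shows what a hash derived purely from the bytes would look like.

```java
import java.util.Arrays;

final class StoredHashDemo {
    // Hypothetical stand-in for HiveKey: the hash code is assigned from outside.
    static final class Key {
        final byte[] bytes;
        private final int assignedHash;

        Key(byte[] bytes, int assignedHash) {
            this.bytes = bytes;
            this.assignedHash = assignedHash;
        }

        // Returns the stored field; the payload bytes play no part in it.
        @Override
        public int hashCode() {
            return assignedHash;
        }
    }

    // A hash derived only from the payload, for comparison.
    static int contentHash(Key key) {
        return Arrays.hashCode(key.bytes);
    }

    public static void main(String[] args) {
        Key a = new Key(new byte[] {1, 2}, 5);
        Key b = new Key(new byte[] {1, 2}, 6);
        // Same payload, same content hash, but different assigned hash codes.
        System.out.println(contentHash(a) == contentHash(b));
        System.out.println(a.hashCode() + " vs " + b.hashCode());
    }
}
```

Because the assigned value cannot be recomputed from the bytes, dropping it from the serialized form is only safe if nothing reads {{hashCode()}} after deserialization, which is exactly the open question in the comment above.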
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634112#comment-15634112 ]

Xuefu Zhang commented on HIVE-15104:

I checked the source code and it seems that both Spark (Partitioner.scala) and MapReduce (HashPartitioner.java) call key.hashCode() to get the partition number.
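For reference, the arithmetic behind those two partitioners can be sketched in a few lines. The method names {{hadoopPartition}} and {{sparkPartition}} are invented for this example. The first mirrors the expression in Hadoop's HashPartitioner and the second mirrors the non-negative modulo used by Spark's HashPartitioner. Null-key handling is omitted.

```java
final class PartitionDemo {
    // Hadoop style: clear the sign bit, then take the remainder.
    static int hadoopPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Spark style: take the remainder, then shift negative results into range.
    static int sparkPartition(Object key, int numPartitions) {
        int rawMod = key.hashCode() % numPartitions;
        return rawMod + (rawMod < 0 ? numPartitions : 0);
    }

    public static void main(String[] args) {
        // Integer.hashCode() is the int value itself, which keeps the demo readable.
        Object[] keys = {7, -1, 42};
        for (Object key : keys) {
            System.out.println(key + " -> hadoop=" + hadoopPartition(key, 3)
                    + " spark=" + sparkPartition(key, 3));
        }
    }
}
```

Both variants depend only on the value returned by {{hashCode()}}, so for a HiveKey the partition is determined by the stored hash field rather than by the payload bytes. The two formulas can pick different partitions for negative hash codes, but each always yields an index in the valid range.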
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633820#comment-15633820 ]

Xuefu Zhang commented on HIVE-15104:

[~lirui], thanks for sharing your findings. Can you confirm that Spark also uses BytesWritable.hashcode() to partition the RS output rows? If this is true, then there is no difference for Spark because the actual object Hive passed to Spark by RS is HiveKey, whose hashcode will be used for partitioning. If this is the case, then we should be able to define the output of our map function and reduce function just as, for which we don't need a custom serializer because we don't need to declare the type as . It seems that there is still a gap in our understanding.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15632623#comment-15632623 ]

Aihua Xu commented on HIVE-15104:

[~lirui] So what you are saying is, it depends on how spark shuffles the data and whether spark relies on such hashCode to shuffle the data?
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15632244#comment-15632244 ]

Rui Li commented on HIVE-15104:

[~xuefuz], here's what I've found so far.

Firstly, MR uses HiveKey as the [key type|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecDriver.java#L253]. So we're in line with MR. HiveKey extends BytesWritable. I think the main reason we need HiveKey is that we don't want the hash code simply computed from the internal bytes. Instead, we compute the hash code elsewhere and set it into HiveKey.

During shuffle, MR passes the HiveKey to OutputCollector. OutputCollector computes the proper partition for the HiveKey (using the hash code), and uses [WritableSerialization|https://github.com/apache/hadoop/blob/release-2.7.2-RC2/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/serializer/WritableSerialization.java#L97] to serialize it. At this point, it's OK to just serialize the BytesWritable part since the partition has already been figured out.

On the reduce side, WritableSerialization is used again to deserialize the input key as a HiveKey, which is then fed to ExecReducer. Of course at this point, the HiveKey's hash code is just 0. But it doesn't matter because it's not needed any more. And ExecReducer just casts the input key to BytesWritable.

Therefore, I think whether we can achieve the same depends on how Spark deals with the HiveKey we pass to it. If that's possible, we can register a custom Serializer for HiveKey and only ser/de the BytesWritable part. Here are some docs on how to do that:
[spark.kryo.registrator|http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization]
[kryo registration|https://github.com/EsotericSoftware/kryo#registration]
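The MR flow described above (partition first, then serialize only the payload, then deserialize with the hash left unset) can be sketched with plain JDK streams. Everything here is a simplified assumption: {{MrShuffleDemo.Key}} is a hypothetical stand-in for HiveKey, and the length-prefixed layout only imitates what a BytesWritable-style {{write}} would produce. It is not Hadoop's or Hive's actual code path.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class MrShuffleDemo {
    // Hypothetical stand-in for HiveKey: payload bytes plus an externally set hash.
    static final class Key {
        byte[] bytes = new byte[0];
        int hash; // stays 0 unless someone sets it
    }

    // Map side, step 1: pick the reducer from the stored hash.
    static int partitionFor(Key key, int numReducers) {
        return (key.hash & Integer.MAX_VALUE) % numReducers;
    }

    // Map side, step 2: write only the length-prefixed payload, not the hash.
    static byte[] serializePayload(Key key) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(key.bytes.length);
            out.write(key.bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Reduce side: rebuild the key; the hash was never written, so it is not restored.
    static Key deserializePayload(byte[] data) {
        Key key = new Key();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            key.bytes = new byte[in.readInt()];
            in.readFully(key.bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return key;
    }

    public static void main(String[] args) {
        Key key = new Key();
        key.bytes = new byte[] {9, 8, 7};
        key.hash = 42;
        int partition = partitionFor(key, 4);
        Key onReducer = deserializePayload(serializePayload(key));
        System.out.println("partition=" + partition + " hashAfterShuffle=" + onReducer.hash);
    }
}
```

The ordering is the important part: the scheme works for MR because the partition is fixed before the hash is dropped. Whether the same holds for Spark depends on whether anything asks the deserialized key for its hash code again.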
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15631376#comment-15631376 ]

Xuefu Zhang commented on HIVE-15104:

This is rather interesting. I know I originally reviewed HIVE-8017, but I didn't really know why BytesWritable works for MR while we need HiveKey for Spark. Since Spark is stable now, it would be interesting to at least find out why, whether or not we can optimize. [~ruili], since you originally discovered the problem, could you revisit the issue? Thanks.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15631053#comment-15631053 ]

Rui Li commented on HIVE-15104:

We need to use HiveKey because it holds the proper hash code to be used for partitioning. MR also uses HiveKey, but in OutputCollector it seems to serialize only the BytesWritable part. [~wenli], is this what you mean? I suspect we'll need help from Spark if we want to do something similar.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15630435#comment-15630435 ]

Aihua Xu commented on HIVE-15104:

This is changed by HIVE-8017. [~lirui] Do you recall what kind of issues it caused?
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15628605#comment-15628605 ]

Rui Li commented on HIVE-15104:

Seems MR can just serialize the key as BytesWritable instead of HiveKey. We once hit some problem when only serializing the BytesWritable part. But I think it's worth investigating whether we can improve.
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627812#comment-15627812 ]

wangwenli commented on HIVE-15104:

Try {{select count(distinct col1), count(distinct col2) from table}} and check the shuffle data size in the statistics. If you can't reproduce it, let me know. I will find a table in TPC-H, reproduce it there, and then share the detailed steps. Thank you~
[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15626085#comment-15626085 ] Aihua Xu commented on HIVE-15104: - [~wenli] Can you give an example that I can run and compare? > Hive on Spark generate more shuffle data than hive on mr > > > Key: HIVE-15104 > URL: https://issues.apache.org/jira/browse/HIVE-15104 > Project: Hive > Issue Type: Bug > Components: Spark >Affects Versions: 1.2.1 >Reporter: wangwenli >Assignee: Aihua Xu >Priority: Critical > > the same sql, running on spark and mr engine, will generate different size > of shuffle data. > i think it is because of hive on mr just serialize part of HiveKey, but hive > on spark which using kryo will serialize full of Hivekey object. > what is your opionion? -- This message was sent by Atlassian JIRA (v6.3.4#6332)