[jira] [Created] (HIVE-12888) TestSparkNegativeCliDriver does not run in Spark mode[Spark Branch]
Chengxiang Li created HIVE-12888: Summary: TestSparkNegativeCliDriver does not run in Spark mode[Spark Branch] Key: HIVE-12888 URL: https://issues.apache.org/jira/browse/HIVE-12888 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.2.1 Reporter: Chengxiang Li Assignee: Chengxiang Li During testing, I found that TestSparkNegativeCliDriver actually runs in MR mode; this should be fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12515) Clean up the SparkCounters-related code after removing counter-based stats collection[Spark Branch]
Chengxiang Li created HIVE-12515: Summary: Clean up the SparkCounters-related code after removing counter-based stats collection[Spark Branch] Key: HIVE-12515 URL: https://issues.apache.org/jira/browse/HIVE-12515 Project: Hive Issue Type: Improvement Components: Spark Reporter: Chengxiang Li Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Review Request 36475: HIVE-11082 Support multi edge between nodes in SparkPlan[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36475/ --- (Updated July 16, 2015, 2:33 a.m.) Review request for hive and Xuefu Zhang. Changes --- fix nit format issues. Bugs: HIVE-11082 https://issues.apache.org/jira/browse/HIVE-11082 Repository: hive-git Description --- see JIRA description. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java 762f734 ql/src/java/org/apache/hadoop/hive/ql/optimizer/OperatorComparatorFactory.java 3518823 ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java b7c57e8 ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java 37012b4 ql/src/test/queries/clientpositive/dynamic_rdd_cache.q a380b15 ql/src/test/results/clientpositive/dynamic_rdd_cache.q.out bc716a0 ql/src/test/results/clientpositive/spark/auto_sortmerge_join_10.q.out 90085a8 ql/src/test/results/clientpositive/spark/dynamic_rdd_cache.q.out 505cc59 ql/src/test/results/clientpositive/spark/skewjoinopt9.q.out 155515d ql/src/test/results/clientpositive/spark/union15.q.out 6be13c9 ql/src/test/results/clientpositive/spark/union16.q.out 5e2c77b ql/src/test/results/clientpositive/spark/union2.q.out e4afb1b ql/src/test/results/clientpositive/spark/union25.q.out 5193c06 ql/src/test/results/clientpositive/spark/union9.q.out d420ef1 Diff: https://reviews.apache.org/r/36475/diff/ Testing --- Thanks, chengxiang li
Re: Review Request 36475: HIVE-11082 Support multi edge between nodes in SparkPlan[Spark Branch]
On July 15, 2015, 1:18 p.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java, line 223 https://reviews.apache.org/r/36475/diff/2/?file=1012279#file1012279line223 Nit: maybe we should return an empty map instead. It's only used by ExplainTask and OperatorComparators, and both of them can handle null. I just kept this consistent with other similar explain methods like JoinDesc::getFiltersStringMap. I'm not sure how an empty map would affect the printed explain plan, so I think we can just return null here. - chengxiang --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36475/#review91736 --- On July 15, 2015, 6:59 a.m., chengxiang li wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36475/ --- (Updated July 15, 2015, 6:59 a.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-11082 https://issues.apache.org/jira/browse/HIVE-11082 Repository: hive-git Description --- see JIRA description. Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java 762f734 ql/src/java/org/apache/hadoop/hive/ql/optimizer/OperatorComparatorFactory.java 3518823 ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java b7c57e8 ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java 37012b4 ql/src/test/queries/clientpositive/dynamic_rdd_cache.q a380b15 ql/src/test/results/clientpositive/dynamic_rdd_cache.q.out bc716a0 ql/src/test/results/clientpositive/spark/auto_sortmerge_join_10.q.out 90085a8 ql/src/test/results/clientpositive/spark/dynamic_rdd_cache.q.out 505cc59 ql/src/test/results/clientpositive/spark/skewjoinopt9.q.out 155515d ql/src/test/results/clientpositive/spark/union15.q.out 6be13c9 ql/src/test/results/clientpositive/spark/union16.q.out 5e2c77b ql/src/test/results/clientpositive/spark/union2.q.out e4afb1b ql/src/test/results/clientpositive/spark/union25.q.out 5193c06 ql/src/test/results/clientpositive/spark/union9.q.out d420ef1 Diff: https://reviews.apache.org/r/36475/diff/ Testing --- Thanks, chengxiang li
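For context, the trade-off debated above is the usual "return null vs. return an empty collection" choice for a getter whose result feeds the explain plan. A minimal sketch of the two options, assuming illustrative field and method names rather than the actual JoinDesc code:

import java.util.Collections;
import java.util.Map;

public class ExplainableDesc {
  // May legitimately be unset for plans that carry no filter strings.
  private Map<Integer, String> filtersStringMap;

  // The approach kept in the patch: return null and rely on the callers
  // (ExplainTask and the operator comparators) to handle it, consistent
  // with similar explain methods such as JoinDesc::getFiltersStringMap.
  public Map<Integer, String> getFiltersStringMapNullable() {
    return filtersStringMap;
  }

  // The reviewer's suggested alternative: never hand out null, so callers
  // need no null checks; the open question was how an empty map renders
  // in the printed explain plan.
  public Map<Integer, String> getFiltersStringMapOrEmpty() {
    if (filtersStringMap == null) {
      return Collections.<Integer, String>emptyMap();
    }
    return filtersStringMap;
  }
}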
Re: Review Request 36475: HIVE-11082 Support multi edge between nodes in SparkPlan[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36475/ --- (Updated July 15, 2015, 6:59 a.m.) Review request for hive and Xuefu Zhang. Changes --- update qtest output Bugs: HIVE-11082 https://issues.apache.org/jira/browse/HIVE-11082 Repository: hive-git Description --- see JIRA description. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java 762f734 ql/src/java/org/apache/hadoop/hive/ql/optimizer/OperatorComparatorFactory.java 3518823 ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java b7c57e8 ql/src/java/org/apache/hadoop/hive/ql/plan/JoinDesc.java 37012b4 ql/src/test/queries/clientpositive/dynamic_rdd_cache.q a380b15 ql/src/test/results/clientpositive/dynamic_rdd_cache.q.out bc716a0 ql/src/test/results/clientpositive/spark/auto_sortmerge_join_10.q.out 90085a8 ql/src/test/results/clientpositive/spark/dynamic_rdd_cache.q.out 505cc59 ql/src/test/results/clientpositive/spark/skewjoinopt9.q.out 155515d ql/src/test/results/clientpositive/spark/union15.q.out 6be13c9 ql/src/test/results/clientpositive/spark/union16.q.out 5e2c77b ql/src/test/results/clientpositive/spark/union2.q.out e4afb1b ql/src/test/results/clientpositive/spark/union25.q.out 5193c06 ql/src/test/results/clientpositive/spark/union9.q.out d420ef1 Diff: https://reviews.apache.org/r/36475/diff/ Testing --- Thanks, chengxiang li
[jira] [Created] (HIVE-11267) Combine equivalent leaf works in SparkWork[Spark Branch]
Chengxiang Li created HIVE-11267: Summary: Combine equivalent leaf works in SparkWork[Spark Branch] Key: HIVE-11267 URL: https://issues.apache.org/jira/browse/HIVE-11267 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor There could be multiple leaf works in a SparkWork, as in a self-union query. If the subqueries are identical, we may combine them, execute only once, and then fetch twice in the FetchTask. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Review Request 36475: HIVE-11082 Support multi edge between nodes in SparkPlan[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36475/ --- Review request for hive and Xuefu Zhang. Bugs: HIVE-11082 https://issues.apache.org/jira/browse/HIVE-11082 Repository: hive-git Description --- see JIRA description. Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java 762f734 ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java b7c57e8 ql/src/test/queries/clientpositive/dynamic_rdd_cache.q a380b15 ql/src/test/results/clientpositive/dynamic_rdd_cache.q.out bc716a0 ql/src/test/results/clientpositive/spark/dynamic_rdd_cache.q.out 505cc59 Diff: https://reviews.apache.org/r/36475/diff/ Testing --- Thanks, chengxiang li
Re: Review Request 34666: HIVE-9152 - Dynamic Partition Pruning [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34666/#review91427 --- Ship it! Ship It! - chengxiang li On July 8, 2015, 6:04 p.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34666/ --- (Updated July 8, 2015, 6:04 p.m.) Review request for hive, chengxiang li and Xuefu Zhang. Bugs: HIVE-9152 https://issues.apache.org/jira/browse/HIVE-9152 Repository: hive-git Description --- Tez implemented dynamic partition pruning in HIVE-7826. This is a nice optimization and we should implement the same in HOS. Diffs - common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 27f68df itests/src/test/resources/testconfiguration.properties 4f2de12 ql/if/queryplan.thrift c8dfa35 ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java e18f935 ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java f58a10b ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveSparkClientFactory.java 21398d8 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkDynamicPartitionPruner.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java ca0ffb6 ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorSparkPartitionPruningSinkOperator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1de7e40 ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 2ff3951 ql/src/java/org/apache/hadoop/hive/ql/optimizer/DynamicPartitionPruningOptimization.java 8546d21 ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java a7cf8b7 ql/src/java/org/apache/hadoop/hive/ql/optimizer/SparkRemoveDynamicPruningBySize.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java ad47547 ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkPartitionPruningSinkDesc.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 447f104 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java 7992c88 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/OptimizeSparkProcContext.java f7586a4 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 7f2c079 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkPartitionPruningSinkOperator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SplitOpTreeForDPP.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java 3217df2 ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 9e9a2a2 ql/src/java/org/apache/hadoop/hive/ql/ppd/SyntheticJoinPredicate.java 363e49e ql/src/test/queries/clientpositive/spark_dynamic_partition_pruning.q PRE-CREATION ql/src/test/queries/clientpositive/spark_dynamic_partition_pruning_2.q PRE-CREATION ql/src/test/queries/clientpositive/spark_vectorized_dynamic_partition_pruning.q PRE-CREATION ql/src/test/results/clientpositive/spark/spark_dynamic_partition_pruning.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/spark_dynamic_partition_pruning_2.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/spark_vectorized_dynamic_partition_pruning.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/vectorized_dynamic_partition_pruning.q.out PRE-CREATION Diff: https://reviews.apache.org/r/34666/diff/ Testing --- spark_dynamic_partition_pruning.q, spark_dynamic_partition_pruning_2.q - both are cloned from Tez's tests. Thanks, Chao Sun
[jira] [Created] (HIVE-11204) Research on recent failed qtests[Spark Branch]
Chengxiang Li created HIVE-11204: Summary: Research on recent failed qtests[Spark Branch] Key: HIVE-11204 URL: https://issues.apache.org/jira/browse/HIVE-11204 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Chengxiang Li Priority: Minor Found some strange qtest failures in the HIVE-11053 Hive QA run. As it's fairly certain that these failures are not related to the HIVE-11053 patch, they are reproduced and investigated here. Failed tests: org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_bigdata org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_resolution org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_sort_1_23 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join_literals org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_mapreduce1 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_skewjoinopt2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_smb_mapjoin_15 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_19 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_4 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_8 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_view -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Review Request 36156: HIVE-11053: Add more tests for HIVE-10844[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36156/#review90859 --- Ship it! Ship It! - chengxiang li On July 8, 2015, 3:05 a.m., lun gao wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36156/ --- (Updated July 8, 2015, 3:05 a.m.) Review request for hive and chengxiang li. Bugs: HIVE-11053 https://issues.apache.org/jira/browse/HIVE-11053 Repository: hive-git Description --- Add some test cases for self union, self-join, CTE, and repeated sub-queries to verify the combining of equivalent works in HIVE-10844. Diffs - itests/src/test/resources/testconfiguration.properties 4f2de12 ql/src/test/queries/clientpositive/dynamic_rdd_cache.q PRE-CREATION ql/src/test/results/clientpositive/dynamic_rdd_cache.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/dynamic_rdd_cache.q.out PRE-CREATION Diff: https://reviews.apache.org/r/36156/diff/ Testing --- Thanks, lun gao
Re: Review Request 34666: HIVE-9152 - Dynamic Partition Pruning [Spark Branch]
On July 2, 2015, 6:36 a.m., chengxiang li wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/SparkRemoveDynamicPruningBySize.java, line 59 https://reviews.apache.org/r/34666/diff/1/?file=971706#file971706line59 The statistics data should be quite inaccurate after filter and group, as it's computed based on estimation at compile time. I think threshold verification on inaccurate data is unacceptable, as that means the threshold may not work at all. We may check this threshold in SparkPartitionPruningSinkOperator at runtime. Chao Sun wrote: Switching to runtime would be very different - here we want to check this threshold, and avoid generating the pruning task if possible. How inaccurate would the stats be? I'm fine if it's always more conservative. Take FilterOperator for example: in the worst case it may simply halve the input rows as its statistic (you can find the rule for FilterOperator in FilterStatsRule). So the bad news is that the estimated statistics are not always conservative, which may make the threshold not work as expected sometimes. You may create a follow-up work for this if it changes a lot. - chengxiang --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34666/#review90197 --- On July 3, 2015, 10:45 p.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34666/ --- (Updated July 3, 2015, 10:45 p.m.) Review request for hive, chengxiang li and Xuefu Zhang. Bugs: HIVE-9152 https://issues.apache.org/jira/browse/HIVE-9152 Repository: hive-git Description --- Tez implemented dynamic partition pruning in HIVE-7826. This is a nice optimization and we should implement the same in HOS. Diffs - common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc itests/src/test/resources/testconfiguration.properties 2a5f7e3 ql/if/queryplan.thrift c8dfa35 ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 91e8a02 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveSparkClientFactory.java 21398d8 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkDynamicPartitionPruner.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorSparkPartitionPruningSinkOperator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1de7e40 ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 9d5730d ql/src/java/org/apache/hadoop/hive/ql/optimizer/DynamicPartitionPruningOptimization.java 8546d21 ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java ea5efe5 ql/src/java/org/apache/hadoop/hive/ql/optimizer/RemoveDynamicPruningBySize.java 4803959 ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java 5f731d7 ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkPartitionPruningSinkDesc.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 447f104 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java e27ce0d ql/src/java/org/apache/hadoop/hive/ql/parse/spark/OptimizeSparkProcContext.java f7586a4 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkPartitionPruningSinkOperator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SplitOpTreeForDPP.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java 05a5841 
ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java aa291b9 ql/src/java/org/apache/hadoop/hive/ql/ppd/SyntheticJoinPredicate.java 363e49e ql/src/test/queries/clientpositive/spark_dynamic_partition_pruning.q PRE-CREATION ql/src/test/queries/clientpositive/spark_dynamic_partition_pruning_2.q PRE-CREATION ql/src/test/results/clientpositive/spark/bucket2.q.out 89c3b4c ql/src/test/results/clientpositive/spark/bucket3.q.out 2fc4855 ql/src/test/results/clientpositive/spark/bucket4.q.out 44e0f9f ql/src/test/results/clientpositive/spark/column_access_stats.q.out 3e16f61 ql/src/test/results/clientpositive/spark/limit_partition_metadataonly.q.out e95d2ab ql/src/test/results/clientpositive/spark/list_bucket_dml_2.q.java1.7.out e38ccf8 ql/src/test/results/clientpositive/spark/optimize_nullscan.q.out 881f41a ql/src/test/results/clientpositive/spark/pcr.q.out 4c22f0b ql/src/test/results/clientpositive/spark/sample3.q.out 2fe6b0d ql/src/test/results
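At its core, the compile-time check questioned in the thread above reduces to comparing an estimated data size against a configured threshold. A hedged sketch of that shape, with hypothetical names (the real logic lives in SparkRemoveDynamicPruningBySize and the statistics annotation rules):

public final class PruningSizeCheck {
  // Decide at compile time whether a dynamic-pruning branch is too big to keep.
  // The estimate comes from rules such as FilterStatsRule, which in the worst
  // case simply halves the input row count, so it is not guaranteed to be
  // conservative - the crux of the review discussion.
  public static boolean shouldRemovePruning(long estimatedDataSize, long thresholdBytes) {
    // If the estimate undershoots the real size, a branch that ought to be
    // removed survives; if it overshoots, useful pruning is dropped. Hence
    // the suggestion to re-check in SparkPartitionPruningSinkOperator at runtime.
    return estimatedDataSize > thresholdBytes;
  }
}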
Re: Review Request 36156: HIVE-11053: Add more tests for HIVE-10844[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36156/#review90647 --- ql/src/test/queries/clientpositive/dynamic_rdd_cache.q (line 21) https://reviews.apache.org/r/36156/#comment143770 Are the temp tables X/Y/Z actually created? - chengxiang li On July 6, 2015, 6:35 a.m., lun gao wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36156/ --- (Updated July 6, 2015, 6:35 a.m.) Review request for hive and chengxiang li. Bugs: HIVE-11053 https://issues.apache.org/jira/browse/HIVE-11053 Repository: hive-git Description --- Add some test cases for self union, self-join, CTE, and repeated sub-queries to verify the combining of equivalent works in HIVE-10844. Diffs - ql/src/test/queries/clientpositive/dynamic_rdd_cache.q PRE-CREATION ql/src/test/results/clientpositive/spark/dynamic_rdd_cache.q.out PRE-CREATION Diff: https://reviews.apache.org/r/36156/diff/ Testing --- Thanks, lun gao
Re: Review Request 36156: HIVE-11053: Add more tests for HIVE-10844[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36156/#review90314 --- ql/src/test/queries/clientpositive/dynamic_rdd_cache.q (line 78) https://reviews.apache.org/r/36156/#comment143335 This query is almost the same as the previous one; we should only need one of them. - chengxiang li On July 3, 2015, 7:34 a.m., lun gao wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36156/ --- (Updated July 3, 2015, 7:34 a.m.) Review request for hive and chengxiang li. Bugs: HIVE-11053 https://issues.apache.org/jira/browse/HIVE-11053 Repository: hive-git Description --- Add some test cases for self union, self-join, CTE, and repeated sub-queries to verify the combining of equivalent works in HIVE-10844. Diffs - ql/src/test/queries/clientpositive/dynamic_rdd_cache.q PRE-CREATION ql/src/test/results/clientpositive/spark/dynamic_rdd_cache.q.out PRE-CREATION Diff: https://reviews.apache.org/r/36156/diff/ Testing --- Thanks, lun gao
Re: Review Request 36156: HIVE-11053: Add more tests for HIVE-10844[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36156/#review90312 --- ql/src/test/queries/clientpositive/dynamic_rdd_cache.q (line 102) https://reviews.apache.org/r/36156/#comment143334 Drop the temp tables at the end. - chengxiang li On July 3, 2015, 7:34 a.m., lun gao wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36156/ --- (Updated July 3, 2015, 7:34 a.m.) Review request for hive and chengxiang li. Bugs: HIVE-11053 https://issues.apache.org/jira/browse/HIVE-11053 Repository: hive-git Description --- Add some test cases for self union, self-join, CTE, and repeated sub-queries to verify the combining of equivalent works in HIVE-10844. Diffs - ql/src/test/queries/clientpositive/dynamic_rdd_cache.q PRE-CREATION ql/src/test/results/clientpositive/spark/dynamic_rdd_cache.q.out PRE-CREATION Diff: https://reviews.apache.org/r/36156/diff/ Testing --- Thanks, lun gao
Re: Review Request 34666: HIVE-9152 - Dynamic Partition Pruning [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34666/#review90197 --- ql/src/java/org/apache/hadoop/hive/ql/optimizer/SparkRemoveDynamicPruningBySize.java (line 59) https://reviews.apache.org/r/34666/#comment143202 The statistics data should be quite inaccurate after filter and group, as it's computed based on estimation at compile time. I think threshold verification on inaccurate data is unacceptable, as that means the threshold may not work at all. We may check this threshold in SparkPartitionPruningSinkOperator at runtime. ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java (line 396) https://reviews.apache.org/r/34666/#comment143199 Why do we need a List for table/column name/partition key here? Do we support multiple PartitionPruningSinkOperators inside a single operator tree? ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkPartitionPruningSinkOperator.java (line 61) https://reviews.apache.org/r/34666/#comment143203 When the appended data size exceeds its capacity, DataOutputBuffer expands its byte array by creating a new byte array of 2x the size and copying the old one into it. An estimated initial byte array size should avoid most of the array copies. - chengxiang li On May 26, 2015, 4:28 p.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34666/ --- (Updated May 26, 2015, 4:28 p.m.) Review request for hive, chengxiang li and Xuefu Zhang. Bugs: HIVE-9152 https://issues.apache.org/jira/browse/HIVE-9152 Repository: hive-git Description --- Tez implemented dynamic partition pruning in HIVE-7826. This is a nice optimization and we should implement the same in HOS. Diffs - common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc itests/src/test/resources/testconfiguration.properties 2a5f7e3 metastore/src/gen/thrift/gen-cpp/ThriftHiveMetastore.h 0f86117 metastore/src/gen/thrift/gen-cpp/ThriftHiveMetastore.cpp a0b34cb metastore/src/gen/thrift/gen-cpp/hive_metastore_types.h 55e0385 metastore/src/gen/thrift/gen-cpp/hive_metastore_types.cpp 749c97a metastore/src/gen/thrift/gen-py/hive_metastore/ThriftHiveMetastore.py 4cc54e8 ql/if/queryplan.thrift c8dfa35 ql/src/gen/thrift/gen-cpp/queryplan_types.h ac73bc5 ql/src/gen/thrift/gen-cpp/queryplan_types.cpp 19d4806 ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java e18f935 ql/src/gen/thrift/gen-php/Types.php 7121ed4 ql/src/gen/thrift/gen-py/queryplan/ttypes.py 53c0106 ql/src/gen/thrift/gen-rb/queryplan_types.rb c2c4220 ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java 9867739 ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 91e8a02 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveSparkClientFactory.java 21398d8 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkDynamicPartitionPruner.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorSparkPartitionPruningSinkOperator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1de7e40 ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 9d5730d ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java ea5efe5 ql/src/java/org/apache/hadoop/hive/ql/optimizer/SparkDynamicPartitionPruningOptimization.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/SparkRemoveDynamicPruningBySize.java PRE-CREATION 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 8e56263 ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java 5f731d7 ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkPartitionPruningSinkDesc.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 447f104 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java e27ce0d ql/src/java/org/apache/hadoop/hive/ql/parse/spark/OptimizeSparkProcContext.java f7586a4 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkPartitionPruningOptimizer.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkPartitionPruningSinkOperator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java 05a5841 ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java aa291b9 ql/src/java/org/apache/hadoop/hive/ql
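The buffer-growth remark above refers to standard Hadoop DataOutputBuffer behavior. A minimal sketch of the suggested pre-sizing, assuming the caller can estimate the serialized size up front:

import java.io.IOException;
import org.apache.hadoop.io.DataOutputBuffer;

public class BufferSizingExample {
  public static void main(String[] args) throws IOException {
    byte[] payload = new byte[64 * 1024]; // stand-in for serialized pruning data

    // The default constructor starts small; every overflow allocates a new
    // array of twice the size and copies the old contents across.
    DataOutputBuffer growing = new DataOutputBuffer();
    growing.write(payload);

    // Pre-sizing with an estimate avoids most of those grow-and-copy cycles.
    DataOutputBuffer preSized = new DataOutputBuffer(payload.length);
    preSized.write(payload);

    System.out.println(growing.getLength() + " / " + preSized.getLength());
  }
}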
Re: Review Request 34666: HIVE-9152 - Dynamic Partition Pruning [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34666/#review90191 --- ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkDynamicPartitionPruner.java (line 246) https://reviews.apache.org/r/34666/#comment143192 Should be wrapped in LOG.isDebugEnabled(). ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkDynamicPartitionPruner.java (line 251) https://reviews.apache.org/r/34666/#comment143193 Logging at ERROR level should mean that some error happened and the process will be interrupted. If we really expect a single field here, should we throw an exception when there are more? Otherwise, we should downgrade the log level to WARN with more precise information. - chengxiang li On May 26, 2015, 4:28 p.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34666/ --- (Updated May 26, 2015, 4:28 p.m.) Review request for hive, chengxiang li and Xuefu Zhang. Bugs: HIVE-9152 https://issues.apache.org/jira/browse/HIVE-9152 Repository: hive-git Description --- Tez implemented dynamic partition pruning in HIVE-7826. This is a nice optimization and we should implement the same in HOS. Diffs - common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc itests/src/test/resources/testconfiguration.properties 2a5f7e3 metastore/src/gen/thrift/gen-cpp/ThriftHiveMetastore.h 0f86117 metastore/src/gen/thrift/gen-cpp/ThriftHiveMetastore.cpp a0b34cb metastore/src/gen/thrift/gen-cpp/hive_metastore_types.h 55e0385 metastore/src/gen/thrift/gen-cpp/hive_metastore_types.cpp 749c97a metastore/src/gen/thrift/gen-py/hive_metastore/ThriftHiveMetastore.py 4cc54e8 ql/if/queryplan.thrift c8dfa35 ql/src/gen/thrift/gen-cpp/queryplan_types.h ac73bc5 ql/src/gen/thrift/gen-cpp/queryplan_types.cpp 19d4806 ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java e18f935 ql/src/gen/thrift/gen-php/Types.php 7121ed4 ql/src/gen/thrift/gen-py/queryplan/ttypes.py 53c0106 ql/src/gen/thrift/gen-rb/queryplan_types.rb c2c4220 ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java 9867739 ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 91e8a02 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveSparkClientFactory.java 21398d8 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkDynamicPartitionPruner.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorSparkPartitionPruningSinkOperator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1de7e40 ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 9d5730d ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java ea5efe5 ql/src/java/org/apache/hadoop/hive/ql/optimizer/SparkDynamicPartitionPruningOptimization.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/SparkRemoveDynamicPruningBySize.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 8e56263 ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java 5f731d7 ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkPartitionPruningSinkDesc.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 447f104 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java e27ce0d ql/src/java/org/apache/hadoop/hive/ql/parse/spark/OptimizeSparkProcContext.java f7586a4 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 
19aae70 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkPartitionPruningOptimizer.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkPartitionPruningSinkOperator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java 05a5841 ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java aa291b9 ql/src/java/org/apache/hadoop/hive/ql/ppd/SyntheticJoinPredicate.java 363e49e ql/src/test/queries/clientpositive/spark_dynamic_partition_pruning.q PRE-CREATION ql/src/test/queries/clientpositive/spark_dynamic_partition_pruning_2.q PRE-CREATION ql/src/test/results/clientpositive/spark/bucket2.q.out 89c3b4c ql/src/test/results/clientpositive/spark/bucket3.q.out 2fc4855 ql/src/test/results/clientpositive/spark/bucket4.q.out 44e0f9f ql/src/test/results/clientpositive/spark/column_access_stats.q.out 3e16f61 ql/src/test/results/clientpositive/spark
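The first comment above is the usual logging-guard idiom, and the second is about matching log level to severity. A minimal sketch with an illustrative class and messages (Hive used commons-logging at the time; the pruner's actual messages differ):

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class PrunerLogging {
  private static final Log LOG = LogFactory.getLog(PrunerLogging.class);

  void onPartitionsPruned(Object prunedPaths) {
    // Guard so that the potentially expensive message construction only
    // happens when debug logging is actually enabled.
    if (LOG.isDebugEnabled()) {
      LOG.debug("Pruned paths: " + prunedPaths);
    }
    // A condition the code recovers from belongs at WARN, not ERROR;
    // a true error should instead throw and interrupt the process.
    LOG.warn("Expected a single source field; using only the first one");
  }
}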
Re: Review Request 34757: HIVE-10844: Combine equivalent Works for HoS[Spark Branch]
On June 23, 2015, 1:31 p.m., Xuefu Zhang wrote: ql/src/test/results/clientpositive/spark/groupby10.q.out, line 60 https://reviews.apache.org/r/34757/diff/3-4/?file=988071#file988071line60 Interesting. How come we got more stages now? Not sure; it was introduced by the latest merge from trunk. - chengxiang --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34757/#review88966 --- On June 23, 2015, 7:24 a.m., chengxiang li wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34757/ --- (Updated June 23, 2015, 7:24 a.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-10844 https://issues.apache.org/jira/browse/HIVE-10844 Repository: hive-git Description --- Some Hive queries (like TPCDS Q39) may share the same subquery, which is translated into separate but equivalent Works in the SparkWork. Combining these equivalent Works into a single one helps to benefit from the subsequent dynamic RDD caching optimization. Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/OperatorComparatorFactory.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 ql/src/java/org/apache/hadoop/hive/ql/plan/JoinCondDesc.java b307b16 ql/src/test/results/clientpositive/spark/auto_join30.q.out 7b5c5e7 ql/src/test/results/clientpositive/spark/auto_smb_mapjoin_14.q.out 8a43d78 ql/src/test/results/clientpositive/spark/groupby10.q.out dd9d9fe ql/src/test/results/clientpositive/spark/groupby7_map.q.out abd6459 ql/src/test/results/clientpositive/spark/groupby7_map_skew.q.out 5e69b31 ql/src/test/results/clientpositive/spark/groupby7_noskew.q.out 3418b99 ql/src/test/results/clientpositive/spark/groupby7_noskew_multi_single_reducer.q.out 2cb126d ql/src/test/results/clientpositive/spark/groupby8.q.out c249b61 ql/src/test/results/clientpositive/spark/groupby8_map_skew.q.out 2fb1d73 ql/src/test/results/clientpositive/spark/insert_into3.q.out 7df5ba8 ql/src/test/results/clientpositive/spark/join22.q.out b1e5b67 ql/src/test/results/clientpositive/spark/skewjoinopt11.q.out 8a278ef ql/src/test/results/clientpositive/spark/union10.q.out 5e8fe38 ql/src/test/results/clientpositive/spark/union11.q.out 20c27c7 ql/src/test/results/clientpositive/spark/union20.q.out 6f0dca6 ql/src/test/results/clientpositive/spark/union28.q.out 98582df ql/src/test/results/clientpositive/spark/union3.q.out 834b6d4 ql/src/test/results/clientpositive/spark/union30.q.out 3409623 ql/src/test/results/clientpositive/spark/union4.q.out c121ef0 ql/src/test/results/clientpositive/spark/union5.q.out afee988 ql/src/test/results/clientpositive/spark/union_remove_1.q.out ba0e293 ql/src/test/results/clientpositive/spark/union_remove_15.q.out 26cfbab ql/src/test/results/clientpositive/spark/union_remove_16.q.out 7a7aaf2 ql/src/test/results/clientpositive/spark/union_remove_18.q.out a5e15c5 ql/src/test/results/clientpositive/spark/union_remove_19.q.out ad44400 ql/src/test/results/clientpositive/spark/union_remove_20.q.out 1d67177 ql/src/test/results/clientpositive/spark/union_remove_21.q.out 9f5b070 ql/src/test/results/clientpositive/spark/union_remove_22.q.out 2e01432 ql/src/test/results/clientpositive/spark/union_remove_24.q.out 2659798 ql/src/test/results/clientpositive/spark/union_remove_25.q.out 0a94684 ql/src/test/results/clientpositive/spark/union_remove_4.q.out 6c3d596 ql/src/test/results/clientpositive/spark/union_remove_6.q.out cd36189 
ql/src/test/results/clientpositive/spark/union_remove_6_subq.q.out c981ae4 ql/src/test/results/clientpositive/spark/union_remove_7.q.out 084fbd6 ql/src/test/results/clientpositive/spark/union_top_level.q.out dede1ef Diff: https://reviews.apache.org/r/34757/diff/ Testing --- Thanks, chengxiang li
Re: Review Request 34757: HIVE-10844: Combine equivalent Works for HoS[Spark Branch]
On June 19, 2015, 1:47 p.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java, line 98 https://reviews.apache.org/r/34757/diff/3/?file=988066#file988066line98 I think the recursion should go on even if there is only one child for a given work. For example, if we have:

  w1
  |
  w2
  |
  w3
 /  \
w4    w5

Even if each of w1 and w2 has only one child, it's still possible that we can combine w4 and w5. Created HIVE-11082 to track this. On June 19, 2015, 1:47 p.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java, line 207 https://reviews.apache.org/r/34757/diff/3/?file=988066#file988066line207 Could you explain the reason here? Added comments in the latest patch. While combining multiple equivalent works into a single one, we need to update all the references to the replaced works. A leaf work's output may be read by a further SparkWork/FetchWork, and we are not able to update work references across SparkWorks, so combining leaf works may lead to errors. - chengxiang --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34757/#review88537 --- On June 23, 2015, 7:24 a.m., chengxiang li wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34757/ --- (Updated June 23, 2015, 7:24 a.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-10844 https://issues.apache.org/jira/browse/HIVE-10844 Repository: hive-git Description --- Some Hive queries (like TPCDS Q39) may share the same subquery, which is translated into separate but equivalent Works in the SparkWork. Combining these equivalent Works into a single one helps to benefit from the subsequent dynamic RDD caching optimization. Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/OperatorComparatorFactory.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 ql/src/java/org/apache/hadoop/hive/ql/plan/JoinCondDesc.java b307b16 ql/src/test/results/clientpositive/spark/auto_join30.q.out 7b5c5e7 ql/src/test/results/clientpositive/spark/auto_smb_mapjoin_14.q.out 8a43d78 ql/src/test/results/clientpositive/spark/groupby10.q.out dd9d9fe ql/src/test/results/clientpositive/spark/groupby7_map.q.out abd6459 ql/src/test/results/clientpositive/spark/groupby7_map_skew.q.out 5e69b31 ql/src/test/results/clientpositive/spark/groupby7_noskew.q.out 3418b99 ql/src/test/results/clientpositive/spark/groupby7_noskew_multi_single_reducer.q.out 2cb126d ql/src/test/results/clientpositive/spark/groupby8.q.out c249b61 ql/src/test/results/clientpositive/spark/groupby8_map_skew.q.out 2fb1d73 ql/src/test/results/clientpositive/spark/insert_into3.q.out 7df5ba8 ql/src/test/results/clientpositive/spark/join22.q.out b1e5b67 ql/src/test/results/clientpositive/spark/skewjoinopt11.q.out 8a278ef ql/src/test/results/clientpositive/spark/union10.q.out 5e8fe38 ql/src/test/results/clientpositive/spark/union11.q.out 20c27c7 ql/src/test/results/clientpositive/spark/union20.q.out 6f0dca6 ql/src/test/results/clientpositive/spark/union28.q.out 98582df ql/src/test/results/clientpositive/spark/union3.q.out 834b6d4 ql/src/test/results/clientpositive/spark/union30.q.out 3409623 ql/src/test/results/clientpositive/spark/union4.q.out c121ef0 ql/src/test/results/clientpositive/spark/union5.q.out afee988 ql/src/test/results/clientpositive/spark/union_remove_1.q.out ba0e293 
ql/src/test/results/clientpositive/spark/union_remove_15.q.out 26cfbab ql/src/test/results/clientpositive/spark/union_remove_16.q.out 7a7aaf2 ql/src/test/results/clientpositive/spark/union_remove_18.q.out a5e15c5 ql/src/test/results/clientpositive/spark/union_remove_19.q.out ad44400 ql/src/test/results/clientpositive/spark/union_remove_20.q.out 1d67177 ql/src/test/results/clientpositive/spark/union_remove_21.q.out 9f5b070 ql/src/test/results/clientpositive/spark/union_remove_22.q.out 2e01432 ql/src/test/results/clientpositive/spark/union_remove_24.q.out 2659798 ql/src/test/results/clientpositive/spark/union_remove_25.q.out 0a94684 ql/src/test/results/clientpositive/spark/union_remove_4.q.out 6c3d596 ql/src/test/results/clientpositive/spark/union_remove_6.q.out cd36189 ql/src/test/results/clientpositive/spark/union_remove_6_subq.q.out c981ae4 ql/src/test/results/clientpositive/spark
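Xuefu's first point above is that the equivalence search should keep descending even through single-child chains. A sketch of that traversal, using a hypothetical Work node type standing in for BaseWork:

import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a BaseWork node in a SparkWork DAG.
class Work {
  final String name;
  final List<Work> children = new ArrayList<>();
  Work(String name) { this.name = name; }
}

public class CombineTraversal {
  // Only sibling groups of size > 1 are candidates for combining, but the
  // recursion must continue regardless, so that w1 -> w2 -> w3 -> {w4, w5}
  // still reaches the combinable leaves w4 and w5.
  static void visit(Work work) {
    if (work.children.size() > 1) {
      compareSiblings(work.children);
    }
    for (Work child : work.children) {
      visit(child);
    }
  }

  static void compareSiblings(List<Work> siblings) {
    System.out.println("candidate group of size " + siblings.size());
  }

  public static void main(String[] args) {
    Work w1 = new Work("w1"), w2 = new Work("w2"), w3 = new Work("w3");
    Work w4 = new Work("w4"), w5 = new Work("w5");
    w1.children.add(w2);
    w2.children.add(w3);
    w3.children.add(w4);
    w3.children.add(w5);
    visit(w1); // prints one candidate group: {w4, w5}
  }
}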
Re: Review Request 34757: HIVE-10844: Combine equivalent Works for HoS[Spark Branch]
On June 19, 2015, 3:42 a.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java, line 207 https://reviews.apache.org/r/34757/diff/2/?file=986303#file986303line207 I think in SparkWork, there couldn't be two parents connecting to the same child. UnionWork would be such a child, but SparkWork doesn't have UnionWork, if I'm not mistaken. I don't think SparkPlan has a limitation of only one link between two trans. If there are two links between a parent and a child, the input will be self-unioned and the result is the input to the child. chengxiang li wrote: Take self-join for example: there would be 2 MapWorks connected to the same ReduceWork. If we combine these 2 MapWorks into 1, SparkPlan::connect would throw an exception during SparkPlan generation. Xuefu Zhang wrote: I see. Thanks for the explanation. However, I'm wondering if we should remove the restriction. Otherwise, certain cases such as self-join will not take advantage of this feature, right? Yes, this is a further optimization we can continue to work on; I will create a follow-up JIRA to research this. - chengxiang --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34757/#review88484 --- On June 19, 2015, 7:22 a.m., chengxiang li wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34757/ --- (Updated June 19, 2015, 7:22 a.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-10844 https://issues.apache.org/jira/browse/HIVE-10844 Repository: hive-git Description --- Some Hive queries (like TPCDS Q39) may share the same subquery, which is translated into separate but equivalent Works in the SparkWork. Combining these equivalent Works into a single one helps to benefit from the subsequent dynamic RDD caching optimization. 
Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/OperatorComparatorFactory.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 ql/src/java/org/apache/hadoop/hive/ql/plan/JoinCondDesc.java b307b16 ql/src/test/results/clientpositive/spark/auto_join30.q.out 7b5c5e7 ql/src/test/results/clientpositive/spark/auto_smb_mapjoin_14.q.out 8a43d78 ql/src/test/results/clientpositive/spark/groupby10.q.out 9d3cf36 ql/src/test/results/clientpositive/spark/groupby7_map.q.out abd6459 ql/src/test/results/clientpositive/spark/groupby7_map_skew.q.out 5e69b31 ql/src/test/results/clientpositive/spark/groupby7_noskew.q.out 3418b99 ql/src/test/results/clientpositive/spark/groupby7_noskew_multi_single_reducer.q.out 2cb126d ql/src/test/results/clientpositive/spark/groupby8.q.out 307395f ql/src/test/results/clientpositive/spark/groupby8_map_skew.q.out ba04a57 ql/src/test/results/clientpositive/spark/insert_into3.q.out 7df5ba8 ql/src/test/results/clientpositive/spark/join22.q.out b1e5b67 ql/src/test/results/clientpositive/spark/skewjoinopt11.q.out 8a278ef ql/src/test/results/clientpositive/spark/union10.q.out 5e8fe38 ql/src/test/results/clientpositive/spark/union11.q.out 20c27c7 ql/src/test/results/clientpositive/spark/union20.q.out 6f0dca6 ql/src/test/results/clientpositive/spark/union28.q.out 98582df ql/src/test/results/clientpositive/spark/union3.q.out 834b6d4 ql/src/test/results/clientpositive/spark/union30.q.out 3409623 ql/src/test/results/clientpositive/spark/union4.q.out c121ef0 ql/src/test/results/clientpositive/spark/union5.q.out afee988 ql/src/test/results/clientpositive/spark/union_remove_1.q.out ba0e293 ql/src/test/results/clientpositive/spark/union_remove_15.q.out 26cfbab ql/src/test/results/clientpositive/spark/union_remove_16.q.out 7a7aaf2 ql/src/test/results/clientpositive/spark/union_remove_18.q.out a5e15c5 ql/src/test/results/clientpositive/spark/union_remove_19.q.out ad44400 ql/src/test/results/clientpositive/spark/union_remove_20.q.out 1d67177 ql/src/test/results/clientpositive/spark/union_remove_21.q.out 9f5b070 ql/src/test/results/clientpositive/spark/union_remove_22.q.out 2e01432 ql/src/test/results/clientpositive/spark/union_remove_24.q.out 2659798 ql/src/test/results/clientpositive/spark/union_remove_25.q.out 0a94684 ql/src/test/results/clientpositive/spark/union_remove_4.q.out 6c3d596 ql/src/test/results/clientpositive/spark/union_remove_6.q.out cd36189 ql/src/test/results/clientpositive/spark/union_remove_6_subq.q.out c981ae4 ql/src/test
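The restriction discussed above can be pictured as a plan graph that rejects a second edge between the same pair of nodes. A hedged sketch (hypothetical classes, not the actual SparkPlan code) of why combining the two MapWorks of a self-join then fails:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical plan graph that, like the SparkPlan behavior described in the
// review, permits at most one edge between any parent/child pair.
public class PlanGraph {
  private final Map<String, Set<String>> edges = new HashMap<>();

  public void connect(String parent, String child) {
    Set<String> children = edges.computeIfAbsent(parent, k -> new HashSet<>());
    if (!children.add(child)) {
      // The failure mode for a combined self-join: the two original
      // MapWork -> ReduceWork edges collapse onto the same pair.
      throw new IllegalStateException(parent + " is already connected to " + child);
    }
  }

  public static void main(String[] args) {
    PlanGraph plan = new PlanGraph();
    plan.connect("Map 1", "Reducer 2"); // first edge: fine
    plan.connect("Map 1", "Reducer 2"); // second edge: throws
  }
}

HIVE-11082 (created below) is the follow-up that lifts this restriction by supporting multiple edges between nodes.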
[jira] [Created] (HIVE-11082) Support multi edge between nodes in SparkPlan[Spark Branch]
Chengxiang Li created HIVE-11082: Summary: Support multi edge between nodes in SparkPlan[Spark Branch] Key: HIVE-11082 URL: https://issues.apache.org/jira/browse/HIVE-11082 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li For the dynamic RDD caching optimization, we found that SparkPlan::connect throws an exception when we try to combine 2 works with the same child. Supporting multiple edges between nodes in SparkPlan would enable dynamic RDD caching in more use cases, like self-join and self-union. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Review Request 34757: HIVE-10844: Combine equivalent Works for HoS[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34757/ --- (Updated June 23, 2015, 7:24 a.m.) Review request for hive and Xuefu Zhang. Changes --- Fix Xuefu's second-round comments. Bugs: HIVE-10844 https://issues.apache.org/jira/browse/HIVE-10844 Repository: hive-git Description --- Some Hive queries (like TPCDS Q39) may share the same subquery, which is translated into separate but equivalent Works in the SparkWork. Combining these equivalent Works into a single one helps to benefit from the subsequent dynamic RDD caching optimization. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/optimizer/OperatorComparatorFactory.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 ql/src/java/org/apache/hadoop/hive/ql/plan/JoinCondDesc.java b307b16 ql/src/test/results/clientpositive/spark/auto_join30.q.out 7b5c5e7 ql/src/test/results/clientpositive/spark/auto_smb_mapjoin_14.q.out 8a43d78 ql/src/test/results/clientpositive/spark/groupby10.q.out dd9d9fe ql/src/test/results/clientpositive/spark/groupby7_map.q.out abd6459 ql/src/test/results/clientpositive/spark/groupby7_map_skew.q.out 5e69b31 ql/src/test/results/clientpositive/spark/groupby7_noskew.q.out 3418b99 ql/src/test/results/clientpositive/spark/groupby7_noskew_multi_single_reducer.q.out 2cb126d ql/src/test/results/clientpositive/spark/groupby8.q.out c249b61 ql/src/test/results/clientpositive/spark/groupby8_map_skew.q.out 2fb1d73 ql/src/test/results/clientpositive/spark/insert_into3.q.out 7df5ba8 ql/src/test/results/clientpositive/spark/join22.q.out b1e5b67 ql/src/test/results/clientpositive/spark/skewjoinopt11.q.out 8a278ef ql/src/test/results/clientpositive/spark/union10.q.out 5e8fe38 ql/src/test/results/clientpositive/spark/union11.q.out 20c27c7 ql/src/test/results/clientpositive/spark/union20.q.out 6f0dca6 ql/src/test/results/clientpositive/spark/union28.q.out 98582df ql/src/test/results/clientpositive/spark/union3.q.out 834b6d4 ql/src/test/results/clientpositive/spark/union30.q.out 3409623 ql/src/test/results/clientpositive/spark/union4.q.out c121ef0 ql/src/test/results/clientpositive/spark/union5.q.out afee988 ql/src/test/results/clientpositive/spark/union_remove_1.q.out ba0e293 ql/src/test/results/clientpositive/spark/union_remove_15.q.out 26cfbab ql/src/test/results/clientpositive/spark/union_remove_16.q.out 7a7aaf2 ql/src/test/results/clientpositive/spark/union_remove_18.q.out a5e15c5 ql/src/test/results/clientpositive/spark/union_remove_19.q.out ad44400 ql/src/test/results/clientpositive/spark/union_remove_20.q.out 1d67177 ql/src/test/results/clientpositive/spark/union_remove_21.q.out 9f5b070 ql/src/test/results/clientpositive/spark/union_remove_22.q.out 2e01432 ql/src/test/results/clientpositive/spark/union_remove_24.q.out 2659798 ql/src/test/results/clientpositive/spark/union_remove_25.q.out 0a94684 ql/src/test/results/clientpositive/spark/union_remove_4.q.out 6c3d596 ql/src/test/results/clientpositive/spark/union_remove_6.q.out cd36189 ql/src/test/results/clientpositive/spark/union_remove_6_subq.q.out c981ae4 ql/src/test/results/clientpositive/spark/union_remove_7.q.out 084fbd6 ql/src/test/results/clientpositive/spark/union_top_level.q.out dede1ef Diff: https://reviews.apache.org/r/34757/diff/ Testing --- Thanks, chengxiang li
Re: Review Request 34757: HIVE-10844: Combine equivalent Works for HoS[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34757/ --- (Updated June 19, 2015, 7:22 a.m.) Review request for hive and Xuefu Zhang. Changes --- Fix Xuefu's first-round review comments. Bugs: HIVE-10844 https://issues.apache.org/jira/browse/HIVE-10844 Repository: hive-git Description --- Some Hive queries (like TPCDS Q39) may share the same subquery, which is translated into separate but equivalent Works in the SparkWork. Combining these equivalent Works into a single one helps to benefit from the subsequent dynamic RDD caching optimization. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/optimizer/OperatorComparatorFactory.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 ql/src/java/org/apache/hadoop/hive/ql/plan/JoinCondDesc.java b307b16 ql/src/test/results/clientpositive/spark/auto_join30.q.out 7b5c5e7 ql/src/test/results/clientpositive/spark/auto_smb_mapjoin_14.q.out 8a43d78 ql/src/test/results/clientpositive/spark/groupby10.q.out 9d3cf36 ql/src/test/results/clientpositive/spark/groupby7_map.q.out abd6459 ql/src/test/results/clientpositive/spark/groupby7_map_skew.q.out 5e69b31 ql/src/test/results/clientpositive/spark/groupby7_noskew.q.out 3418b99 ql/src/test/results/clientpositive/spark/groupby7_noskew_multi_single_reducer.q.out 2cb126d ql/src/test/results/clientpositive/spark/groupby8.q.out 307395f ql/src/test/results/clientpositive/spark/groupby8_map_skew.q.out ba04a57 ql/src/test/results/clientpositive/spark/insert_into3.q.out 7df5ba8 ql/src/test/results/clientpositive/spark/join22.q.out b1e5b67 ql/src/test/results/clientpositive/spark/skewjoinopt11.q.out 8a278ef ql/src/test/results/clientpositive/spark/union10.q.out 5e8fe38 ql/src/test/results/clientpositive/spark/union11.q.out 20c27c7 ql/src/test/results/clientpositive/spark/union20.q.out 6f0dca6 ql/src/test/results/clientpositive/spark/union28.q.out 98582df ql/src/test/results/clientpositive/spark/union3.q.out 834b6d4 ql/src/test/results/clientpositive/spark/union30.q.out 3409623 ql/src/test/results/clientpositive/spark/union4.q.out c121ef0 ql/src/test/results/clientpositive/spark/union5.q.out afee988 ql/src/test/results/clientpositive/spark/union_remove_1.q.out ba0e293 ql/src/test/results/clientpositive/spark/union_remove_15.q.out 26cfbab ql/src/test/results/clientpositive/spark/union_remove_16.q.out 7a7aaf2 ql/src/test/results/clientpositive/spark/union_remove_18.q.out a5e15c5 ql/src/test/results/clientpositive/spark/union_remove_19.q.out ad44400 ql/src/test/results/clientpositive/spark/union_remove_20.q.out 1d67177 ql/src/test/results/clientpositive/spark/union_remove_21.q.out 9f5b070 ql/src/test/results/clientpositive/spark/union_remove_22.q.out 2e01432 ql/src/test/results/clientpositive/spark/union_remove_24.q.out 2659798 ql/src/test/results/clientpositive/spark/union_remove_25.q.out 0a94684 ql/src/test/results/clientpositive/spark/union_remove_4.q.out 6c3d596 ql/src/test/results/clientpositive/spark/union_remove_6.q.out cd36189 ql/src/test/results/clientpositive/spark/union_remove_6_subq.q.out c981ae4 ql/src/test/results/clientpositive/spark/union_remove_7.q.out 084fbd6 ql/src/test/results/clientpositive/spark/union_top_level.q.out dede1ef Diff: https://reviews.apache.org/r/34757/diff/ Testing --- Thanks, chengxiang li
Re: Review Request 34757: HIVE-10844: Combine equivalent Works for HoS[Spark Branch]
On June 19, 2015, 3:42 a.m., Xuefu Zhang wrote: 1. First round review, only at a high level. 2. Patch looks very good and clean. 3. It would be better if we can add some test cases for self union, self-join, CTE, and repeated sub-queries. This can be a follow-up task, though. Created HIVE-11053 to add more tests. On June 19, 2015, 3:42 a.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java, line 157 https://reviews.apache.org/r/34757/diff/2/?file=986303#file986303line157 Could parents be null, in case of top-level works? Same for children. SparkWork always returns a non-null List now, but that may change, so it never hurts to add null verification. - chengxiang --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34757/#review88484 --- On June 17, 2015, 8:59 a.m., chengxiang li wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34757/ --- (Updated June 17, 2015, 8:59 a.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-10844 https://issues.apache.org/jira/browse/HIVE-10844 Repository: hive-git Description --- Some Hive queries (like TPCDS Q39) may share the same subquery, which is translated into separate but equivalent Works in the SparkWork. Combining these equivalent Works into a single one helps to benefit from the subsequent dynamic RDD caching optimization. Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/OperatorComparatorFactory.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 ql/src/java/org/apache/hadoop/hive/ql/plan/JoinCondDesc.java b307b16 ql/src/test/results/clientpositive/spark/auto_join30.q.out 7b5c5e7 ql/src/test/results/clientpositive/spark/auto_smb_mapjoin_14.q.out 8a43d78 ql/src/test/results/clientpositive/spark/groupby10.q.out 9d3cf36 ql/src/test/results/clientpositive/spark/groupby7_map.q.out abd6459 ql/src/test/results/clientpositive/spark/groupby7_map_skew.q.out 5e69b31 ql/src/test/results/clientpositive/spark/groupby7_noskew.q.out 3418b99 ql/src/test/results/clientpositive/spark/groupby7_noskew_multi_single_reducer.q.out 2cb126d ql/src/test/results/clientpositive/spark/groupby8.q.out 307395f ql/src/test/results/clientpositive/spark/groupby8_map_skew.q.out ba04a57 ql/src/test/results/clientpositive/spark/insert_into3.q.out 7df5ba8 ql/src/test/results/clientpositive/spark/join22.q.out b1e5b67 ql/src/test/results/clientpositive/spark/skewjoinopt11.q.out 8a278ef ql/src/test/results/clientpositive/spark/union10.q.out 5e8fe38 ql/src/test/results/clientpositive/spark/union11.q.out 20c27c7 ql/src/test/results/clientpositive/spark/union20.q.out 6f0dca6 ql/src/test/results/clientpositive/spark/union28.q.out 98582df ql/src/test/results/clientpositive/spark/union3.q.out 834b6d4 ql/src/test/results/clientpositive/spark/union30.q.out 3409623 ql/src/test/results/clientpositive/spark/union4.q.out c121ef0 ql/src/test/results/clientpositive/spark/union5.q.out afee988 ql/src/test/results/clientpositive/spark/union_remove_1.q.out ba0e293 ql/src/test/results/clientpositive/spark/union_remove_15.q.out 26cfbab ql/src/test/results/clientpositive/spark/union_remove_16.q.out 7a7aaf2 ql/src/test/results/clientpositive/spark/union_remove_18.q.out a5e15c5 ql/src/test/results/clientpositive/spark/union_remove_19.q.out ad44400 ql/src/test/results/clientpositive/spark/union_remove_20.q.out 1d67177 
ql/src/test/results/clientpositive/spark/union_remove_21.q.out 9f5b070 ql/src/test/results/clientpositive/spark/union_remove_22.q.out 2e01432 ql/src/test/results/clientpositive/spark/union_remove_24.q.out 2659798 ql/src/test/results/clientpositive/spark/union_remove_25.q.out 0a94684 ql/src/test/results/clientpositive/spark/union_remove_4.q.out 6c3d596 ql/src/test/results/clientpositive/spark/union_remove_6.q.out cd36189 ql/src/test/results/clientpositive/spark/union_remove_6_subq.q.out c981ae4 ql/src/test/results/clientpositive/spark/union_remove_7.q.out 084fbd6 ql/src/test/results/clientpositive/spark/union_top_level.q.out dede1ef Diff: https://reviews.apache.org/r/34757/diff/ Testing --- Thanks, chengxiang li
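The defensive check agreed on above is small; a minimal sketch with hypothetical types (the real resolver operates on SparkWork's BaseWork lists):

import java.util.Collections;
import java.util.List;

public class NullSafeWorks {
  // SparkWork currently returns non-null lists, but guarding anyway keeps
  // the resolver safe if that contract ever changes.
  static <T> List<T> orEmpty(List<T> works) {
    return works == null ? Collections.<T>emptyList() : works;
  }

  static void process(List<String> parents, List<String> children) {
    for (String parent : orEmpty(parents)) {
      System.out.println("parent: " + parent);
    }
    for (String child : orEmpty(children)) {
      System.out.println("child: " + child);
    }
  }
}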
Re: Review Request 34757: HIVE-10844: Combine equivalent Works for HoS[Spark Branch]
On June 19, 2015, 3:42 a.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java, line 207 https://reviews.apache.org/r/34757/diff/2/?file=986303#file986303line207 I think in SparkWork, there couldn't be two parents connecting to the same child. UnionWork would be such a child, but SparkWork doesn't have UnionWork, if I'm not mistaken. I don't think SparkPlan has a limitation of only one link between two trans. If there are two links between a parent and a child, the input will be self-unioned and the result is the input to the child. Take self-join for example: there would be 2 MapWorks connected to the same ReduceWork. If we combine these 2 MapWorks into 1, SparkPlan::connect would throw an exception during SparkPlan generation. - chengxiang --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34757/#review88484 --- On June 17, 2015, 8:59 a.m., chengxiang li wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34757/ --- (Updated June 17, 2015, 8:59 a.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-10844 https://issues.apache.org/jira/browse/HIVE-10844 Repository: hive-git Description --- Some Hive queries (like TPCDS Q39) may share the same subquery, which is translated into separate but equivalent Works in the SparkWork. Combining these equivalent Works into a single one helps to benefit from the subsequent dynamic RDD caching optimization. Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/OperatorComparatorFactory.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 ql/src/java/org/apache/hadoop/hive/ql/plan/JoinCondDesc.java b307b16 ql/src/test/results/clientpositive/spark/auto_join30.q.out 7b5c5e7 ql/src/test/results/clientpositive/spark/auto_smb_mapjoin_14.q.out 8a43d78 ql/src/test/results/clientpositive/spark/groupby10.q.out 9d3cf36 ql/src/test/results/clientpositive/spark/groupby7_map.q.out abd6459 ql/src/test/results/clientpositive/spark/groupby7_map_skew.q.out 5e69b31 ql/src/test/results/clientpositive/spark/groupby7_noskew.q.out 3418b99 ql/src/test/results/clientpositive/spark/groupby7_noskew_multi_single_reducer.q.out 2cb126d ql/src/test/results/clientpositive/spark/groupby8.q.out 307395f ql/src/test/results/clientpositive/spark/groupby8_map_skew.q.out ba04a57 ql/src/test/results/clientpositive/spark/insert_into3.q.out 7df5ba8 ql/src/test/results/clientpositive/spark/join22.q.out b1e5b67 ql/src/test/results/clientpositive/spark/skewjoinopt11.q.out 8a278ef ql/src/test/results/clientpositive/spark/union10.q.out 5e8fe38 ql/src/test/results/clientpositive/spark/union11.q.out 20c27c7 ql/src/test/results/clientpositive/spark/union20.q.out 6f0dca6 ql/src/test/results/clientpositive/spark/union28.q.out 98582df ql/src/test/results/clientpositive/spark/union3.q.out 834b6d4 ql/src/test/results/clientpositive/spark/union30.q.out 3409623 ql/src/test/results/clientpositive/spark/union4.q.out c121ef0 ql/src/test/results/clientpositive/spark/union5.q.out afee988 ql/src/test/results/clientpositive/spark/union_remove_1.q.out ba0e293 ql/src/test/results/clientpositive/spark/union_remove_15.q.out 26cfbab ql/src/test/results/clientpositive/spark/union_remove_16.q.out 7a7aaf2 ql/src/test/results/clientpositive/spark/union_remove_18.q.out a5e15c5 ql/src/test/results/clientpositive/spark/union_remove_19.q.out ad44400 
ql/src/test/results/clientpositive/spark/union_remove_20.q.out 1d67177 ql/src/test/results/clientpositive/spark/union_remove_21.q.out 9f5b070 ql/src/test/results/clientpositive/spark/union_remove_22.q.out 2e01432 ql/src/test/results/clientpositive/spark/union_remove_24.q.out 2659798 ql/src/test/results/clientpositive/spark/union_remove_25.q.out 0a94684 ql/src/test/results/clientpositive/spark/union_remove_4.q.out 6c3d596 ql/src/test/results/clientpositive/spark/union_remove_6.q.out cd36189 ql/src/test/results/clientpositive/spark/union_remove_6_subq.q.out c981ae4 ql/src/test/results/clientpositive/spark/union_remove_7.q.out 084fbd6 ql/src/test/results/clientpositive/spark/union_top_level.q.out dede1ef Diff: https://reviews.apache.org/r/34757/diff/ Testing --- Thanks, chengxiang li
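The combining discussed in this thread hinges on deciding that two Works are structurally equivalent. A minimal sketch of such a pairwise operator-tree comparison follows; the simplified types (Op, confSignature) are hypothetical stand-ins for illustration, not Hive's real Operator/OperatorComparatorFactory classes, which dispatch to a per-operator comparator.
{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Illustrative stand-in for a node in an operator tree.
class Op {
    final String type;          // e.g. "TS", "FIL", "GBY"
    final String confSignature; // serialized descriptor: predicate, keys, aggregations, ...
    final List<Op> children;

    Op(String type, String confSignature, List<Op> children) {
        this.type = type;
        this.confSignature = confSignature;
        this.children = children;
    }
}

public class WorkComparator {
    // Two trees are equivalent when every node matches by operator type,
    // by descriptor, and recursively by all children in order.
    static boolean equivalent(Op a, Op b) {
        if (!a.type.equals(b.type) || !Objects.equals(a.confSignature, b.confSignature)) {
            return false;
        }
        if (a.children.size() != b.children.size()) {
            return false;
        }
        for (int i = 0; i < a.children.size(); i++) {
            if (!equivalent(a.children.get(i), b.children.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Op scan1 = new Op("TS", "table=store_sales", Collections.<Op>emptyList());
        Op scan2 = new Op("TS", "table=store_sales", Collections.<Op>emptyList());
        System.out.println(equivalent(scan1, scan2)); // true: candidates for combining
    }
}
{code}
As the review comments above show, the hard part in the real resolver is not this recursion but the edge cases: two equivalent parents feeding the same child cannot always be merged without violating SparkPlan's edge constraints.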
[jira] [Created] (HIVE-11053) Add more tests for HIVE-10844[Spark Branch]
Chengxiang Li created HIVE-11053: Summary: Add more tests for HIVE-10844[Spark Branch] Key: HIVE-11053 URL: https://issues.apache.org/jira/browse/HIVE-11053 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Priority: Minor Add some test cases for self-union, self-join, CTE, and repeated sub-queries to verify the combining of equivalent Works in HIVE-10844. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Review Request 34757: HIVE-10844: Combine equivalent Works for HoS[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34757/ --- (Updated June 17, 2015, 8:59 a.m.) Review request for hive and Xuefu Zhang. Changes --- improve the comparison algorithm and update the qfile output Bugs: HIVE-10844 https://issues.apache.org/jira/browse/HIVE-10844 Repository: hive-git Description --- Some Hive queries (like TPCDS Q39) may share the same subquery, which is translated into separate but equivalent Works in SparkWork; combining these equivalent Works into a single one helps them benefit from the subsequent dynamic RDD caching optimization. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/optimizer/OperatorComparatorFactory.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 ql/src/java/org/apache/hadoop/hive/ql/plan/JoinCondDesc.java b307b16 ql/src/test/results/clientpositive/spark/auto_join30.q.out 7b5c5e7 ql/src/test/results/clientpositive/spark/auto_smb_mapjoin_14.q.out 8a43d78 ql/src/test/results/clientpositive/spark/groupby10.q.out 9d3cf36 ql/src/test/results/clientpositive/spark/groupby7_map.q.out abd6459 ql/src/test/results/clientpositive/spark/groupby7_map_skew.q.out 5e69b31 ql/src/test/results/clientpositive/spark/groupby7_noskew.q.out 3418b99 ql/src/test/results/clientpositive/spark/groupby7_noskew_multi_single_reducer.q.out 2cb126d ql/src/test/results/clientpositive/spark/groupby8.q.out 307395f ql/src/test/results/clientpositive/spark/groupby8_map_skew.q.out ba04a57 ql/src/test/results/clientpositive/spark/insert_into3.q.out 7df5ba8 ql/src/test/results/clientpositive/spark/join22.q.out b1e5b67 ql/src/test/results/clientpositive/spark/skewjoinopt11.q.out 8a278ef ql/src/test/results/clientpositive/spark/union10.q.out 5e8fe38 ql/src/test/results/clientpositive/spark/union11.q.out 20c27c7 ql/src/test/results/clientpositive/spark/union20.q.out 6f0dca6 ql/src/test/results/clientpositive/spark/union28.q.out 98582df ql/src/test/results/clientpositive/spark/union3.q.out 834b6d4 ql/src/test/results/clientpositive/spark/union30.q.out 3409623 ql/src/test/results/clientpositive/spark/union4.q.out c121ef0 ql/src/test/results/clientpositive/spark/union5.q.out afee988 ql/src/test/results/clientpositive/spark/union_remove_1.q.out ba0e293 ql/src/test/results/clientpositive/spark/union_remove_15.q.out 26cfbab ql/src/test/results/clientpositive/spark/union_remove_16.q.out 7a7aaf2 ql/src/test/results/clientpositive/spark/union_remove_18.q.out a5e15c5 ql/src/test/results/clientpositive/spark/union_remove_19.q.out ad44400 ql/src/test/results/clientpositive/spark/union_remove_20.q.out 1d67177 ql/src/test/results/clientpositive/spark/union_remove_21.q.out 9f5b070 ql/src/test/results/clientpositive/spark/union_remove_22.q.out 2e01432 ql/src/test/results/clientpositive/spark/union_remove_24.q.out 2659798 ql/src/test/results/clientpositive/spark/union_remove_25.q.out 0a94684 ql/src/test/results/clientpositive/spark/union_remove_4.q.out 6c3d596 ql/src/test/results/clientpositive/spark/union_remove_6.q.out cd36189 ql/src/test/results/clientpositive/spark/union_remove_6_subq.q.out c981ae4 ql/src/test/results/clientpositive/spark/union_remove_7.q.out 084fbd6 ql/src/test/results/clientpositive/spark/union_top_level.q.out dede1ef Diff: https://reviews.apache.org/r/34757/diff/ Testing --- Thanks, chengxiang li
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization for HoS.[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34455/ --- (Updated May 28, 2015, 3:30 a.m.) Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang. Changes --- remove the configs, and move the common-parent matching logic directly into SparkPlanGenerator. Bugs: HIVE-10550 https://issues.apache.org/jira/browse/HIVE-10550 Repository: hive-git Description --- see jira description Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c Diff: https://reviews.apache.org/r/34455/diff/ Testing --- Thanks, chengxiang li
Review Request 34757: HIVE-10844: Combine equivalent Works for HoS[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34757/ --- Review request for hive and Xuefu Zhang. Bugs: HIVE-10844 https://issues.apache.org/jira/browse/HIVE-10844 Repository: hive-git Description --- Some Hive queries (like TPCDS Q39) may share the same subquery, which is translated into separate but equivalent Works in SparkWork; combining these equivalent Works into a single one helps them benefit from the subsequent dynamic RDD caching optimization. Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/CombineEquivalentWorkResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 Diff: https://reviews.apache.org/r/34757/diff/ Testing --- Thanks, chengxiang li
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization for HoS.[Spark Branch]
On May 27, 2015, 10:13 p.m., Xuefu Zhang wrote: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java, line 2062 https://reviews.apache.org/r/34455/diff/3/?file=972428#file972428line2062 Sorry for pointing this out late. I'm not certain if it's a good idea to expose these two configurations. Also this introduces a change of behavior. For now, can we get rid of them and change the persistency level back to MEM+DISK? We can come back to revisit this later on. At this moment, I don't feel confident to make the call. Persisting to MEM+DISK may hurt performance in certain cases; I think at least we should have a switch to enable/disable this optimization. - chengxiang --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34455/#review85451 --- On May 27, 2015, 1:50 a.m., chengxiang li wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34455/ --- (Updated May 27, 2015, 1:50 a.m.) Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang. Bugs: HIVE-10550 https://issues.apache.org/jira/browse/HIVE-10550 Repository: hive-git Description --- see jira description Diffs - common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79 Diff: https://reviews.apache.org/r/34455/diff/ Testing --- Thanks, chengxiang li
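The on/off switch chengxiang asks for above could be as small as one boolean property. A hedged sketch using a Hadoop Configuration lookup is below; the property name is hypothetical, not an actual HiveConf key, and the thread shows the configurations were ultimately removed from the patch.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class CachingSwitch {
    // Hypothetical property name, for illustration only.
    static final String RDD_CACHING_ENABLED = "hive.spark.dynamic.rdd.caching.enabled";

    static boolean shouldCache(Configuration conf) {
        // Default off: only users who understand the memory trade-off opt in.
        return conf.getBoolean(RDD_CACHING_ENABLED, false);
    }
}
{code}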
[jira] [Created] (HIVE-10844) Combine equivalent Works for HoS[Spark Branch]
Chengxiang Li created HIVE-10844: Summary: Combine equivalent Works for HoS[Spark Branch] Key: HIVE-10844 URL: https://issues.apache.org/jira/browse/HIVE-10844 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Some Hive queries (like [TPCDS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql]) may share the same subquery, which is translated into separate but equivalent Works in SparkWork; combining these equivalent Works into a single one helps them benefit from the subsequent dynamic RDD caching optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization for HoS.[Spark Branch]
On May 27, 2015, 10:13 p.m., Xuefu Zhang wrote: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java, line 2062 https://reviews.apache.org/r/34455/diff/3/?file=972428#file972428line2062 Sorry for pointing this out late. I'm not certain if it's a good idea to expose these two configurations. Also this introduces a change of behavior. For now, can we get rid of them and change the persistency level back to MEM+DISK? We can come back to revisit this later on. At this moment, I don't feel confident to make the call. chengxiang li wrote: Persisting to MEM+DISK may hurt performance in certain cases; I think at least we should have a switch to enable/disable this optimization. Xuefu Zhang wrote: Agreed. However, before we find out more about in what cases this helps or hurts, I think it's better we keep the existing behavior. This doesn't prevent us from adding a flag later on. OK, I will remove these configurations from the patch for now; we can discuss later when we have more knowledge about it. - chengxiang --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34455/#review85451 --- On May 27, 2015, 1:50 a.m., chengxiang li wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34455/ --- (Updated May 27, 2015, 1:50 a.m.) Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang. Bugs: HIVE-10550 https://issues.apache.org/jira/browse/HIVE-10550 Repository: hive-git Description --- see jira description Diffs - common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79 Diff: https://reviews.apache.org/r/34455/diff/ Testing --- Thanks, chengxiang li
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization for HoS.[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34455/ --- (Updated May 27, 2015, 1:50 a.m.) Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang. Changes --- fix what is listed in the comments. Bugs: HIVE-10550 https://issues.apache.org/jira/browse/HIVE-10550 Repository: hive-git Description --- see jira description Diffs (updated) - common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79 Diff: https://reviews.apache.org/r/34455/diff/ Testing --- Thanks, chengxiang li
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization for HoS.[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34455/ --- (Updated May 22, 2015, 6:18 a.m.) Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang. Changes --- Keep all the previous multi-insert cache code. Bugs: HIVE-10550 https://issues.apache.org/jira/browse/HIVE-10550 Repository: hive-git Description --- see jira description Diffs (updated) - common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79 Diff: https://reviews.apache.org/r/34455/diff/ Testing --- Thanks, chengxiang li
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization for HoS.[Spark Branch]
On May 20, 2015, 9:12 p.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java, line 41 https://reviews.apache.org/r/34455/diff/1/?file=964754#file964754line41 Currently the storage level is memory+disk. Any reason to change it to memory_only? Caching data to disk means the data needs serialization and deserialization; it's costly, sometimes it may overwhelm the gain of caching, and it's hard to measure programmatically, as reading from the source file only does deserialization while caching on disk needs an additional serialization. Instead of adding an optimizer which may or may not improve performance for the user, I think it may be better to narrow the optimizer's scope a little bit, to make sure this optimizer does improve performance. On May 20, 2015, 9:12 p.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java, line 63 https://reviews.apache.org/r/34455/diff/1/?file=964756#file964756line63 Can we keep the old code around? I understand it's not currently used. Of course we can; it just makes the code a little messy, you know, for others who want to read the cache-related code. On May 20, 2015, 9:12 p.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java, line 25 https://reviews.apache.org/r/34455/diff/1/?file=964757#file964757line25 I cannot construct a case where a MapTran would need caching. Do you have an example? For any query which contains a SparkWork like this: MapWork -- ReduceWork \ -- ReduceWork For example: from person_orc insert overwrite table p1 select city, count(*) as s group by city order by s insert overwrite table p2 select city, avg(age) as g group by city order by g; On May 20, 2015, 9:12 p.m., Xuefu Zhang wrote: spark-client/src/main/java/org/apache/hive/spark/client/RemoteDriver.java, line 419 https://reviews.apache.org/r/34455/diff/1/?file=964774#file964774line419 Do you think it makes sense for us to release the cache as soon as the job is completed, as it's done here? Theoretically we do not need to; I mean, it would not lead to any extra memory leak issue. The only benefit of unpersisting the cache manually I can imagine is that it reduces GC effort, as Hive would do it programmatically instead of letting GC collect it. The reason I removed it is that it adds extra complexity to the code, and is not extensible for sharing cached RDDs across Spark jobs. - chengxiang --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34455/#review84572 --- On May 20, 2015, 2:37 a.m., chengxiang li wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34455/ --- (Updated May 20, 2015, 2:37 a.m.) Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
Bugs: HIVE-10550 https://issues.apache.org/jira/browse/HIVE-10550 Repository: hive-git Description --- see jira description Diffs - common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/LocalHiveSparkClient.java 19d3fee ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 26cfebd ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java 8b15099 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java a774395 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/impl/LocalSparkJobStatus.java 5d62596 ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 8e56263 ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSkewJoinProcFactory.java 5990d17 ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SplitSparkWorkResolver.java fb20080 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79 spark-client/src/main/java/org/apache/hive/spark/client/JobContext.java af6332e spark-client/src/main/java/org/apache/hive/spark/client/JobContextImpl.java
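To make the multi-insert discussion above concrete, here is a minimal, self-contained Spark Java sketch (not Hive's CacheTran itself) of caching a shared map output once and feeding it to two downstream aggregations, the shape of the "MapWork -- two ReduceWorks" plan from the thread. The MEMORY_ONLY level reflects the trade-off argued above: it avoids the extra serialization pass that spilling to disk can incur.
{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

import java.util.Arrays;

public class SharedMapOutputDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("shared-map-output").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Stand-in for the shared MapWork output: (city, age) pairs.
        JavaPairRDD<String, Integer> mapOutput = sc
            .parallelize(Arrays.asList("a,30", "a,40", "b,20"))
            .mapToPair(line -> {
                String[] f = line.split(",");
                return new Tuple2<>(f[0], Integer.parseInt(f[1]));
            });

        // Cache once so both "insert" branches reuse it instead of recomputing the scan.
        mapOutput.persist(StorageLevel.MEMORY_ONLY());

        long branch1 = mapOutput.mapValues(v -> 1L)
            .reduceByKey(Long::sum).count();      // branch 1: count(*) per city
        long branch2 = mapOutput
            .reduceByKey(Integer::sum).count();   // branch 2: sum(age) input for avg per city

        System.out.println(branch1 + " " + branch2);
        sc.stop();
    }
}
{code}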
Review Request 34455: HIVE-10550 Dynamic RDD caching optimization for HoS.[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34455/ --- Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang. Bugs: HIVE-10550 https://issues.apache.org/jira/browse/HIVE-10550 Repository: hive-git Description --- see jira description Diffs - common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/LocalHiveSparkClient.java 19d3fee ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 26cfebd ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java 8b15099 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java a774395 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/impl/LocalSparkJobStatus.java 5d62596 ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 8e56263 ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSkewJoinProcFactory.java 5990d17 ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SplitSparkWorkResolver.java fb20080 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79 spark-client/src/main/java/org/apache/hive/spark/client/JobContext.java af6332e spark-client/src/main/java/org/apache/hive/spark/client/JobContextImpl.java beed8a3 spark-client/src/main/java/org/apache/hive/spark/client/MonitorCallback.java e1e899e spark-client/src/main/java/org/apache/hive/spark/client/RemoteDriver.java b77c9e8 spark-client/src/test/java/org/apache/hive/spark/client/TestSparkClient.java d33ad7e Diff: https://reviews.apache.org/r/34455/diff/ Testing --- Thanks, chengxiang li
Re: Review Request 34293: HIVE-10721 SparkSessionManagerImpl leaks SparkSessions [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34293/#review84094 --- ql/src/java/org/apache/hadoop/hive/ql/exec/spark/session/SparkSessionManagerImpl.java https://reviews.apache.org/r/34293/#comment135202 SparkClientFactory.initialize would be invoked only once, which means RpcServer would also be initialized only once inside it; so when we update Spark client RPC-related parameters, RpcServer is not really updated. This should be another issue; I just list it here as it was found while reading the code. - chengxiang li On May 15, 2015, 9:53 p.m., Jimmy Xiang wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34293/ --- (Updated May 15, 2015, 9:53 p.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-10721 https://issues.apache.org/jira/browse/HIVE-10721 Repository: hive-git Description --- Add a SparkSession to createdSessions only after the session is opened properly if doOpen is specified. Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/LocalHiveSparkClient.java 7e33a3f ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java bae30f3 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/session/SparkSessionImpl.java 603f1ca ql/src/java/org/apache/hadoop/hive/ql/exec/spark/session/SparkSessionManagerImpl.java ad012b6 spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java 1bcd221 Diff: https://reviews.apache.org/r/34293/diff/ Testing --- Thanks, Jimmy Xiang
Re: Review Request 34293: HIVE-10721 SparkSessionManagerImpl leaks SparkSessions [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34293/#review84096 --- ql/src/java/org/apache/hadoop/hive/ql/exec/spark/session/SparkSessionManagerImpl.java https://reviews.apache.org/r/34293/#comment135204 Just curious: it looks to me that an AtomicBoolean would work here as well; is it possible that 2 threads both execute this block? - chengxiang li On May 15, 2015, 9:53 p.m., Jimmy Xiang wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34293/ --- (Updated May 15, 2015, 9:53 p.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-10721 https://issues.apache.org/jira/browse/HIVE-10721 Repository: hive-git Description --- Add a SparkSession to createdSessions only after the session is opened properly if doOpen is specified. Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/LocalHiveSparkClient.java 7e33a3f ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java bae30f3 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/session/SparkSessionImpl.java 603f1ca ql/src/java/org/apache/hadoop/hive/ql/exec/spark/session/SparkSessionManagerImpl.java ad012b6 spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java 1bcd221 Diff: https://reviews.apache.org/r/34293/diff/ Testing --- Thanks, Jimmy Xiang
Re: Review Request 34293: HIVE-10721 SparkSessionManagerImpl leaks SparkSessions [Spark Branch]
On May 18, 2015, 2:26 a.m., chengxiang li wrote: ql/src/java/org/apache/hadoop/hive/ql/exec/spark/session/SparkSessionManagerImpl.java, line 96 https://reviews.apache.org/r/34293/diff/1/?file=961679#file961679line96 SparkClientFactory.initialize would be invoked only once, which means RpcServer would also be initialized only once inside it; so when we update Spark client RPC-related parameters, RpcServer is not really updated. This should be another issue; I just list it here as it was found while reading the code. Jimmy Xiang wrote: Are you saying the RpcServer should be restarted too, because some configuration used by RpcServer could be changed? We may need to track those related properties separately. This could complicate the code however. Of course, I agree with you this is indeed an issue. Yes, if we want to make these RPC configurations dynamically effective, RpcServer should be restarted as well. - chengxiang --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34293/#review84094 --- On May 15, 2015, 9:53 p.m., Jimmy Xiang wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34293/ --- (Updated May 15, 2015, 9:53 p.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-10721 https://issues.apache.org/jira/browse/HIVE-10721 Repository: hive-git Description --- Add a SparkSession to createdSessions only after the session is opened properly if doOpen is specified. Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/LocalHiveSparkClient.java 7e33a3f ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java bae30f3 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/session/SparkSessionImpl.java 603f1ca ql/src/java/org/apache/hadoop/hive/ql/exec/spark/session/SparkSessionManagerImpl.java ad012b6 spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java 1bcd221 Diff: https://reviews.apache.org/r/34293/diff/ Testing --- Thanks, Jimmy Xiang
Re: Review Request 34293: HIVE-10721 SparkSessionManagerImpl leaks SparkSessions [Spark Branch]
On May 18, 2015, 2:37 a.m., chengxiang li wrote: ql/src/java/org/apache/hadoop/hive/ql/exec/spark/session/SparkSessionManagerImpl.java, line 88 https://reviews.apache.org/r/34293/diff/1/?file=961679#file961679line88 Just curious: it looks to me that an AtomicBoolean would work here as well; is it possible that 2 threads both execute this block? Jimmy Xiang wrote: If several sessions connect to the same HS2, they might execute this block concurrently. One issue with AtomicBoolean instead of synchronized here is that we have to make sure the SparkClientFactory is properly initialized. Sometimes, we see it throws an exception, in which case, we may need to initialize it again. OK, I see, thanks for the explanation. - chengxiang --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34293/#review84096 --- On May 15, 2015, 9:53 p.m., Jimmy Xiang wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34293/ --- (Updated May 15, 2015, 9:53 p.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-10721 https://issues.apache.org/jira/browse/HIVE-10721 Repository: hive-git Description --- Add a SparkSession to createdSessions only after the session is opened properly if doOpen is specified. Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/LocalHiveSparkClient.java 7e33a3f ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java bae30f3 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/session/SparkSessionImpl.java 603f1ca ql/src/java/org/apache/hadoop/hive/ql/exec/spark/session/SparkSessionManagerImpl.java ad012b6 spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java 1bcd221 Diff: https://reviews.apache.org/r/34293/diff/ Testing --- Thanks, Jimmy Xiang
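Jimmy's point above is that a bare AtomicBoolean compare-and-set would mark the factory initialized even when initialize() throws, while a synchronized block lets the flag stay false on failure so a later session can retry. A hedged sketch of that pattern follows; this is a hypothetical wrapper, not the actual SparkSessionManagerImpl code, and initializeFactory stands in for SparkClientFactory.initialize.
{code:java}
import java.io.IOException;
import java.util.Map;

public class RetryableInit {
    private boolean inited = false; // guarded by "this"

    // Only flip the flag after initialization returns; if it throws, the flag
    // stays false and the next caller retries instead of trusting a half-built factory.
    public synchronized void setup(Map<String, String> conf) throws IOException {
        if (!inited) {
            initializeFactory(conf);
            inited = true;
        }
    }

    private void initializeFactory(Map<String, String> conf) throws IOException {
        // ... start the RPC server, register shutdown hooks, etc.
    }
}
{code}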
[jira] [Created] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
Chengxiang Li created HIVE-10550: Summary: Dynamic RDD caching optimization for HoS.[Spark Branch] Key: HIVE-10550 URL: https://issues.apache.org/jira/browse/HIVE-10550 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li A Hive query may try to scan the same table multiple times, as with self-join, self-union, or even a shared subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. As you may know, Spark supports caching RDD data, which means Spark would put the calculated RDD data in memory and get the data from memory directly the next time; this avoids the calculation cost of this RDD (and all the cost of its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to understand which parts of the query can be shared, so that we can reuse the cached RDD in the generated Spark job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Review Request 33119: HIVE-10235: Loop optimization for SIMD in ColumnDivideColumn.txt
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33119/ --- Review request for hive and Gopal V. Bugs: HIVE-10235 https://issues.apache.org/jira/browse/HIVE-10235 Repository: hive Description --- Found two loops which could be optimized for packed instruction sets during execution. 1. hasDivBy0 depends on the result of the last iteration, which prevents the loop from being executed vectorized. for(int i = 0; i != n; i++) { OperandType2 denom = vector2[i]; outputVector[i] = vector1[0] OperatorSymbol denom; hasDivBy0 = hasDivBy0 || (denom == 0); } 2. Same as HIVE-10180: the vector2[0] reference prevents the JVM from optimizing the loop into packed instructions. for(int i = 0; i != n; i++) { outputVector[i] = vector1[i] OperatorSymbol vector2[0]; } Diffs - trunk/itests/hive-jmh/src/main/java/org/apache/hive/benchmark/vectorization/VectorizationBench.java 1673092 trunk/ql/src/gen/vectorization/ExpressionTemplates/ColumnDivideColumn.txt 1673092 Diff: https://reviews.apache.org/r/33119/diff/ Testing --- Thanks, chengxiang li
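The hoisting fix the description calls for, plus one plausible way to take the hasDivBy0 accumulation out of the arithmetic loop, can be sketched as follows. This is illustrative Java for the double/divide specialization of the template, not the committed HIVE-10235 diff; the loop fission into a separate zero-check pass is an assumption about how the dependency could be broken.
{code:java}
public class DivideColumnSketch {
    // Before: outputVector[i] = vector1[0] / vector2[i], with hasDivBy0 updated
    // via short-circuit || inside the same loop, which blocks auto-vectorization.
    static boolean divideScalarByColumn(double[] vector1, double[] vector2,
                                        double[] outputVector, int n) {
        final double dividend = vector1[0]; // hoisted: the loop body is now invariant-free
        for (int i = 0; i != n; i++) {
            outputVector[i] = dividend / vector2[i];
        }
        boolean hasDivBy0 = false;
        for (int i = 0; i != n; i++) {
            hasDivBy0 = hasDivBy0 | (vector2[i] == 0); // non-short-circuit OR, separate pass
        }
        return hasDivBy0;
    }

    public static void main(String[] args) {
        double[] out = new double[3];
        System.out.println(divideScalarByColumn(
            new double[]{6.0}, new double[]{1.0, 2.0, 0.0}, out, 3)); // prints true
    }
}
{code}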
Re: Review Request 32920: HIVE-10189: Create a micro benchmark tool for vectorization to evaluate the performance gain after SIMD optimization
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32920/#review79327 --- itests/hive-jmh/src/main/java/org/apache/hive/benchmark/vectorization/VectorizationBench.java https://reviews.apache.org/r/32920/#comment128590 These static variables are specific to expressions of the 2-parameter operator; I think we can move them into each setup() method. - chengxiang li On April 8, 2015, 8:42 a.m., cheng xu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32920/ --- (Updated April 8, 2015, 8:42 a.m.) Review request for hive and chengxiang li. Repository: hive-git Description --- Add a microbenchmark tool to show performance improvement by JMH Diffs - itests/hive-jmh/src/main/java/org/apache/hive/benchmark/vectorization/VectorizationBench.java PRE-CREATION Diff: https://reviews.apache.org/r/32920/diff/ Testing --- Thanks, cheng xu
Re: Review Request 32920: HIVE-10189: Create a micro benchmark tool for vectorization to evaluate the performance gain after SIMD optimization
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32920/#review79341 --- Ship it! Ship It! - chengxiang li On April 8, 2015, 8:42 a.m., cheng xu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32920/ --- (Updated April 8, 2015, 8:42 a.m.) Review request for hive and chengxiang li. Repository: hive-git Description --- Add a microbenchmark tool to show performance improvement by JMH Diffs - itests/hive-jmh/src/main/java/org/apache/hive/benchmark/vectorization/VectorizationBench.java PRE-CREATION Diff: https://reviews.apache.org/r/32920/diff/ Testing --- Thanks, cheng xu
Re: Review Request 32920: HIVE-10189: Create a micro benchmark tool for vectorization to evaluate the performance gain after SIMD optimization
On April 8, 2015, 9:22 a.m., chengxiang li wrote: itests/hive-jmh/src/main/java/org/apache/hive/benchmark/vectorization/VectorizationBench.java, line 73 https://reviews.apache.org/r/32920/diff/3/?file=920776#file920776line73 These static variables are specific to expressions of the 2-parameter operator; I think we can move them into each setup() method. cheng xu wrote: Thank you for your comments. These variables are reused and will be initialized in the setup method of VectorizationBench. The variable initialization is not time-consuming, and it's outside the measured time of the benchmark method, so it should be OK to initialize them for each benchmark. Anyway, we can fix this when adding one-input-column or three-column expressions. - chengxiang --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32920/#review79327 --- On April 8, 2015, 8:42 a.m., cheng xu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32920/ --- (Updated April 8, 2015, 8:42 a.m.) Review request for hive and chengxiang li. Repository: hive-git Description --- Add a microbenchmark tool to show performance improvement by JMH Diffs - itests/hive-jmh/src/main/java/org/apache/hive/benchmark/vectorization/VectorizationBench.java PRE-CREATION Diff: https://reviews.apache.org/r/32920/diff/ Testing --- Thanks, cheng xu
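The setup-versus-measurement point discussed above is exactly what JMH's @Setup annotation provides: state initialization runs outside the measured region. For readers unfamiliar with the bench layout, a minimal JMH skeleton looks roughly like this; it is an illustrative shape only, since the real VectorizationBench wraps Hive's VectorizedRowBatch and expression classes rather than bare arrays.
{code:java}
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

import java.util.Random;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
public class DoubleColAddDoubleColBench {
    private static final int N = 1024; // one batch worth of rows
    private double[] a, b, out;

    // Runs outside the measured region, so per-benchmark re-initialization
    // (as discussed in the review) does not distort the numbers.
    @Setup
    public void setup() {
        Random r = new Random(42);
        a = new double[N];
        b = new double[N];
        out = new double[N];
        for (int i = 0; i < N; i++) { a[i] = r.nextDouble(); b[i] = r.nextDouble(); }
    }

    @Benchmark
    public double[] add() {
        for (int i = 0; i != N; i++) {
            out[i] = a[i] + b[i]; // the SIMD-friendly loop under test
        }
        return out; // returning the array defeats dead-code elimination
    }
}
{code}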
[jira] [Created] (HIVE-10235) Loop optimization for SIMD in ColumnDivideColumn.txt
Chengxiang Li created HIVE-10235: Summary: Loop optimization for SIMD in ColumnDivideColumn.txt Key: HIVE-10235 URL: https://issues.apache.org/jira/browse/HIVE-10235 Project: Hive Issue Type: Sub-task Components: Vectorization Affects Versions: 1.1.0 Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Found two loops which could be optimized for packed instruction sets during execution. 1. hasDivBy0 depends on the result of the last iteration, which prevents the loop from being executed vectorized. {code:java} for(int i = 0; i != n; i++) { OperandType2 denom = vector2[i]; outputVector[i] = vector1[0] OperatorSymbol denom; hasDivBy0 = hasDivBy0 || (denom == 0); } {code} 2. Same as HIVE-10180: the vector2\[0\] reference prevents the JVM from optimizing the loop into packed instructions. {code:java} for(int i = 0; i != n; i++) { outputVector[i] = vector1[i] OperatorSymbol vector2[0]; } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Review Request 32918: HIVE-10180 Loop optimization for SIMD in ColumnArithmeticColumn.txt
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32918/ --- (Updated April 7, 2015, 7:24 a.m.) Review request for hive. Changes --- mark variables as final. Bugs: Hive-10180 https://issues.apache.org/jira/browse/Hive-10180 Repository: hive Description --- The JVM is quite strict about the code schema that may be executed with SIMD instructions; take a loop in DoubleColAddDoubleColumn.java for example: for (int i = 0; i != n; i++) { outputVector[i] = vector1[0] + vector2[i]; } The vector1[0] reference would prevent the JVM from executing this part of the code with vectorized instructions; we need to assign vector1[0] to a variable outside the loop, and use that variable in the loop. Diffs (updated) - trunk/ql/src/gen/vectorization/ExpressionTemplates/ColumnArithmeticColumn.txt 1671736 Diff: https://reviews.apache.org/r/32918/diff/ Testing --- Thanks, chengxiang li
Re: Review Request 32920: HIVE-10189: Create a micro benchmark tool for vectorization to evaluate the performance gain after SIMD optimization
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32920/#review79136 --- itests/hive-jmh/src/main/java/org/apache/hive/benchmark/vectorization/VectorizationBench.java https://reviews.apache.org/r/32920/#comment128267 The benchmark looks good; my only concern is how we could extend this benchmark to other expressions. - chengxiang li On April 7, 2015, 6:06 a.m., cheng xu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32920/ --- (Updated April 7, 2015, 6:06 a.m.) Review request for hive and chengxiang li. Repository: hive-git Description --- Add a microbenchmark tool to show performance improvement by JMH Diffs - itests/hive-jmh/src/main/java/org/apache/hive/benchmark/vectorization/VectorizationBench.java PRE-CREATION Diff: https://reviews.apache.org/r/32920/diff/ Testing --- Thanks, cheng xu
[jira] [Created] (HIVE-10238) Loop optimization for SIMD in IfExprColumnColumn.txt
Chengxiang Li created HIVE-10238: Summary: Loop optimization for SIMD in IfExprColumnColumn.txt Key: HIVE-10238 URL: https://issues.apache.org/jira/browse/HIVE-10238 Project: Hive Issue Type: Sub-task Components: Vectorization Affects Versions: 1.1.0 Reporter: Chengxiang Li Assignee: Jitendra Nath Pandey Priority: Minor The following ?: operator could not be vectorized in a loop; we may transform it into a mathematical expression. {code:java} for(int j = 0; j != n; j++) { int i = sel[j]; outputVector[i] = (vector1[i] == 1 ? vector2[i] : vector3[i]); outputIsNull[i] = (vector1[i] == 1 ? arg2ColVector.isNull[i] : arg3ColVector.isNull[i]); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
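One way to "transform it into a mathematical expression", as the description suggests, is a branch-free select. The sketch below is a hedged illustration, not the committed HIVE-10238 patch; it assumes the condition vector holds strictly 0 or 1, and note the caveat in the comment about NaN/Infinity semantics.
{code:java}
public class IfExprSketch {
    // Computes outputVector[i] = (vector1[i] == 1 ? vector2[i] : vector3[i]) without
    // a branch: when m is 0 or 1, m*x + (1-m)*y selects x or y arithmetically.
    // Caveat: unlike ?:, this propagates NaN/Infinity from the unselected side.
    static void select(long[] vector1, double[] vector2, double[] vector3,
                       double[] outputVector, int[] sel, int n) {
        for (int j = 0; j != n; j++) {
            int i = sel[j];
            double m = vector1[i]; // assumed to be 0 or 1
            outputVector[i] = m * vector2[i] + (1 - m) * vector3[i];
        }
    }

    public static void main(String[] args) {
        double[] out = new double[2];
        select(new long[]{1, 0}, new double[]{10, 10}, new double[]{20, 20},
               out, new int[]{0, 1}, 2);
        System.out.println(out[0] + " " + out[1]); // 10.0 20.0
    }
}
{code}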
Review Request 32918: HIVE-10180 Loop optimization for SIMD in ColumnArithmeticColumn.txt
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32918/ --- Review request for hive. Bugs: Hive-10180 https://issues.apache.org/jira/browse/Hive-10180 Repository: hive Description --- The JVM is quite strict about the code schema that may be executed with SIMD instructions; take a loop in DoubleColAddDoubleColumn.java for example: for (int i = 0; i != n; i++) { outputVector[i] = vector1[0] + vector2[i]; } The vector1[0] reference would prevent the JVM from executing this part of the code with vectorized instructions; we need to assign vector1[0] to a variable outside the loop, and use that variable in the loop. Diffs - trunk/ql/src/gen/vectorization/ExpressionTemplates/ColumnArithmeticColumn.txt 1671736 Diff: https://reviews.apache.org/r/32918/diff/ Testing --- Thanks, chengxiang li
[jira] [Created] (HIVE-10180) Loop optimization in ColumnArithmeticColumn.txt
Chengxiang Li created HIVE-10180: Summary: Loop optimization in ColumnArithmeticColumn.txt Key: HIVE-10180 URL: https://issues.apache.org/jira/browse/HIVE-10180 Project: Hive Issue Type: Sub-task Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor The JVM is quite strict about the code schema that may be executed with SIMD instructions; take a loop in DoubleColAddDoubleColumn.java for example: {code:java} for (int i = 0; i != n; i++) { outputVector[i] = vector1[0] + vector2[i]; } {code} The vector1[0] reference would prevent the JVM from executing this part of the code with vectorized instructions; we need to assign vector1[0] to a variable outside the loop, and use that variable in the loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10179) Optimization for SIMD instructions in Hive
Chengxiang Li created HIVE-10179: Summary: Optimization for SIMD instructions in Hive Key: HIVE-10179 URL: https://issues.apache.org/jira/browse/HIVE-10179 Project: Hive Issue Type: Improvement Reporter: Chengxiang Li Assignee: Chengxiang Li [SIMD|http://en.wikipedia.org/wiki/SIMD] instructions can be found in most current CPUs, such as Intel's SSE2, SSE3, SSE4.x, AVX and AVX2, and it would help Hive's performance if we can vectorize the mathematical manipulation parts of Hive. This umbrella JIRA may contain, but is not limited to, subtasks like: # Code schema adaptation: the current JVM is quite strict about the code schema that can be transformed into SIMD instructions during execution. # A new implementation of the mathematical manipulation parts of Hive, designed to be optimized for SIMD instructions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10052) HiveInputFormat implementations getsplits may lead to memory leak.[Spark Branch]
Chengxiang Li created HIVE-10052: Summary: HiveInputFormat implementations getsplits may lead to memory leak.[Spark Branch] Key: HIVE-10052 URL: https://issues.apache.org/jira/browse/HIVE-10052 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li HiveInputFormat::init would cache MapWork/ReduceWork in a ThreadLocal map; we need to clear the cache after getSplits on HiveInputFormat (or its implementations), or just not cache MapWork/ReduceWork during generation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
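A hedged sketch of the first option described above, clearing the per-thread cache once getSplits is done, follows; the names (WORK_CACHE, deserializePlan, computeSplits) are hypothetical stand-ins for Hive's real work-cache plumbing, not the actual patch.
{code:java}
import java.util.HashMap;
import java.util.Map;

public class WorkCacheSketch {
    // Stand-in for the ThreadLocal map that init() populates with deserialized work.
    private static final ThreadLocal<Map<String, Object>> WORK_CACHE =
        new ThreadLocal<Map<String, Object>>() {
            @Override protected Map<String, Object> initialValue() {
                return new HashMap<String, Object>();
            }
        };

    Object[] getSplits(String plan) {
        try {
            WORK_CACHE.get().put(plan, deserializePlan(plan)); // init() caches the work
            return computeSplits(plan);
        } finally {
            // Without this, a long-lived daemon thread (such as Spark's
            // dag-scheduler-event-loop) pins every cached MapWork forever.
            WORK_CACHE.remove();
        }
    }

    private Object deserializePlan(String plan) { return new Object(); }
    private Object[] computeSplits(String plan) { return new Object[0]; }
}
{code}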
Review Request 32288: HIVE-10006 RSC has memory leak while execute multi queries
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32288/ --- Review request for hive. Bugs: HIVE-10006 https://issues.apache.org/jira/browse/HIVE-10006 Repository: hive Description --- In RSC, when Spark calls CombineHiveInputFormat::getSplits to split the job into tasks in a thread called dag-scheduler-event-loop, MapWork is added to a ThreadLocal map of dag-scheduler-event-loop and never gets removed. As the dag-scheduler-event-loop thread is a long-lived daemon thread, all the MapWorks are held in the ThreadLocal map until the RSC JVM crashes or exits. Diffs - branches/spark/serde/src/java/org/apache/hadoop/hive/serde2/lazy/objectinspector/primitive/LazyPrimitiveObjectInspectorFactory.java 1667894 branches/spark/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryUtils.java 1667894 branches/spark/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/objectinspector/LazyBinaryObjectInspectorFactory.java 1667894 branches/spark/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/MetadataListStructObjectInspector.java 1667894 branches/spark/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorFactory.java 1667894 branches/spark/serde/src/java/org/apache/hadoop/hive/serde2/typeinfo/TypeInfoFactory.java 1667894 branches/spark/serde/src/java/org/apache/hadoop/hive/serde2/typeinfo/TypeInfoUtils.java 1667894 Diff: https://reviews.apache.org/r/32288/diff/ Testing --- Thanks, chengxiang li
[jira] [Created] (HIVE-10006) RSC has memory leak while execute multi queries.[Spark Branch]
Chengxiang Li created HIVE-10006: Summary: RSC has memory leak while execute multi queries.[Spark Branch] Key: HIVE-10006 URL: https://issues.apache.org/jira/browse/HIVE-10006 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.1.0 Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Critical While executing queries with RSC, the MapWork/ReduceWork count increases all the time, leading to OOM in the end. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9425) External Function Jar files are not available for Driver when running with yarn-cluster mode [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9425: Assignee: Rui Li (was: Chengxiang Li) External Function Jar files are not available for Driver when running with yarn-cluster mode [Spark Branch] --- Key: HIVE-9425 URL: https://issues.apache.org/jira/browse/HIVE-9425 Project: Hive Issue Type: Sub-task Components: spark-branch Reporter: Xiaomin Zhang Assignee: Rui Li 15/01/20 00:27:31 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done 15/01/20 00:27:31 ERROR spark.SparkContext: Error adding jar (java.io.FileNotFoundException: hive-exec-0.15.0-SNAPSHOT.jar (No such file or directory)), was the --addJars option used? 15/01/20 00:27:31 ERROR spark.SparkContext: Error adding jar (java.io.FileNotFoundException: opennlp-maxent-3.0.3.jar (No such file or directory)), was the --addJars option used? 15/01/20 00:27:31 ERROR spark.SparkContext: Error adding jar (java.io.FileNotFoundException: bigbenchqueriesmr.jar (No such file or directory)), was the --addJars option used? 15/01/20 00:27:31 ERROR spark.SparkContext: Error adding jar (java.io.FileNotFoundException: opennlp-tools-1.5.3.jar (No such file or directory)), was the --addJars option used? 15/01/20 00:27:31 ERROR spark.SparkContext: Error adding jar (java.io.FileNotFoundException: jcl-over-slf4j-1.7.5.jar (No such file or directory)), was the --addJars option used? 15/01/20 00:27:31 INFO client.RemoteDriver: Received job request fef081b0-5408-4804-9531-d131fdd628e6 15/01/20 00:27:31 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize 15/01/20 00:27:31 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize 15/01/20 00:27:31 INFO client.RemoteDriver: Failed to run job fef081b0-5408-4804-9531-d131fdd628e6 org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: de.bankmark.bigbench.queries.q10.SentimentUDF Serialization trace: genericUDTF (org.apache.hadoop.hive.ql.plan.UDTFDesc) conf (org.apache.hadoop.hive.ql.exec.UDTFOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator) aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork) invertedWorkGraph (org.apache.hadoop.hive.ql.plan.SparkWork) at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138) at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115) It seems the additional Jar files are not uploaded to DistributedCache, so that the Driver cannot access it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9410) ClassNotFoundException occurs during hive query case execution with UDF defined [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14304382#comment-14304382 ] Chengxiang Li commented on HIVE-9410: - Not exactly. As you can see from the patch, I stored the added jar paths in a list in JobContextImpl, and add the jar paths from JobContextImpl to the current thread context class loader each time a JobStatusJob is executed; as JobContextImpl is a singleton instance for the RemoteDriver service, later request threads can get the jar paths as well. ClassNotFoundException occurs during hive query case execution with UDF defined [Spark Branch] -- Key: HIVE-9410 URL: https://issues.apache.org/jira/browse/HIVE-9410 Project: Hive Issue Type: Sub-task Components: Spark Environment: CentOS 6.5 JDK1.7 Reporter: Xin Hao Assignee: Chengxiang Li Fix For: spark-branch, 1.1.0 Attachments: HIVE-9410.1-spark.patch, HIVE-9410.2-spark.patch, HIVE-9410.3-spark.patch, HIVE-9410.4-spark.patch, HIVE-9410.4-spark.patch We have a Hive query case with a UDF defined (i.e. BigBench cases Q10, Q18, etc.). It passes in default Hive (on MR) mode, while it fails in Hive on Spark mode (both Standalone and Yarn-Client). Although we use 'add jar .jar;' to add the UDF jar explicitly, the issue still exists. BTW, if we put the UDF jar into the $HIVE_HOME/lib dir, the case passes. The detailed error message is as below (NOTE: de.bankmark.bigbench.queries.q10.SentimentUDF is the UDF contained in the jar bigbenchqueriesmr.jar, and we have added a command like 'add jar /location/to/bigbenchqueriesmr.jar;' into the .sql explicitly) {code} INFO [pool-1-thread-1]: client.RemoteDriver (RemoteDriver.java:call(316)) - Failed to run job 8dd120cb-1a4d-4d1c-ba31-61eac648c27d org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: de.bankmark.bigbench.queries.q10.SentimentUDF Serialization trace: genericUDTF (org.apache.hadoop.hive.ql.plan.UDTFDesc) conf (org.apache.hadoop.hive.ql.exec.UDTFOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.MapJoinOperator) childOperators (org.apache.hadoop.hive.ql.exec.FilterOperator) childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator) aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork) right (org.apache.commons.lang3.tuple.ImmutablePair) edgeProperties (org.apache.hadoop.hive.ql.plan.SparkWork) at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138) at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:656) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:99) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776) at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112) at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) at
org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776) at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112) ... Caused by: java.lang.ClassNotFoundException: de.bankmark.bigbench.queries.q10.SentimentUDF at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358
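A hedged reconstruction of the approach chengxiang describes above follows: the remote driver keeps every added-jar path in its singleton job context and grafts them onto the thread context class loader before each job runs, so Kryo can resolve UDF classes when deserializing the plan. The names here are illustrative, not the HIVE-9410 patch itself.
{code:java}
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Paths;
import java.util.List;

public class AddedJarsLoader {
    // addedJars stands in for the list kept in the singleton JobContextImpl.
    static void applyAddedJars(List<String> addedJars) throws Exception {
        URL[] urls = new URL[addedJars.size()];
        for (int i = 0; i < addedJars.size(); i++) {
            urls[i] = Paths.get(addedJars.get(i)).toUri().toURL();
        }
        // Chain onto the current context loader so classes from "add jar"
        // become visible to plan deserialization in this thread.
        ClassLoader parent = Thread.currentThread().getContextClassLoader();
        Thread.currentThread().setContextClassLoader(new URLClassLoader(urls, parent));
    }
}
{code}
Because the job context is a singleton, jars added by one request remain visible to later request threads, which is the behavior the comment above describes.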
[jira] [Created] (HIVE-9540) Enable infer_bucket_sort_dyn_part.q for TestMiniSparkOnYarnCliDriver test. [Spark Branch]
Chengxiang Li created HIVE-9540: --- Summary: Enable infer_bucket_sort_dyn_part.q for TestMiniSparkOnYarnCliDriver test. [Spark Branch] Key: HIVE-9540 URL: https://issues.apache.org/jira/browse/HIVE-9540 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li The output of infer_bucket_sort_dyn_part.q changes in the TestMiniSparkOnYarnCliDriver test; we should figure out why and try to enable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9211) Research on build mini HoS cluster on YARN for unit test[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9211: Attachment: HIVE-9211.6-spark.patch [~xuefuz], the output of infer_bucket_sort_dyn_part.q changes during the test, so I removed it from miniSparkOnYarn.query.files, and created HIVE-9540 to track it. Research on build mini HoS cluster on YARN for unit test[Spark Branch] -- Key: HIVE-9211 URL: https://issues.apache.org/jira/browse/HIVE-9211 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Attachments: HIVE-9211.1-spark.patch, HIVE-9211.2-spark.patch, HIVE-9211.2-spark.patch, HIVE-9211.3-spark.patch, HIVE-9211.4-spark.patch, HIVE-9211.5-spark.patch, HIVE-9211.6-spark.patch HoS on YARN is a common use case in production environments; we'd better enable unit tests for this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9542) SparkSessionImpl calcualte wrong cores number in TestSparkCliDriver [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9542: Summary: SparkSessionImpl calcualte wrong cores number in TestSparkCliDriver [Spark Branch] (was: SparkSessionImpl calcualte wrong number of cores number in TestSparkCliDriver [Spark Branch]) SparkSessionImpl calcualte wrong cores number in TestSparkCliDriver [Spark Branch] -- Key: HIVE-9542 URL: https://issues.apache.org/jira/browse/HIVE-9542 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li TestSparkCliDriver launches a local Spark cluster with [2,2,1024], which means 2 executors with 2 cores for each executor; HoS gets the core number as 2 instead of 4. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9211) Research on build mini HoS cluster on YARN for unit test[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9211: Attachment: HIVE-9211.7-spark.patch TestSparkCliDriver launches a local Spark cluster with \[2,2,1024\], which means 2 executors with 2 cores for each executor, while HoS uses the spark.executor.cores value to calculate the total core count, so TestSparkCliDriver sets the reduce partition number to 2 instead of 4. Currently the core-count calculation logic reaches into Spark internals and is easy to break; we may handle it in a better way after SPARK-5080 is resolved. groupby2.q and join1.q failed due to the previous reason during EXPLAIN queries, and HIVE-9542 was created for this issue. ql_rewrite_gbtoidx_cbo_2.q failed on TestMinimrCliDriver as I added a result order tag to the qfile before and did not update the TestMinimrCliDriver output. The encryption_join_with_different_encryption_keys.q failure should not be related to this patch, judging from the log file. Research on build mini HoS cluster on YARN for unit test[Spark Branch] -- Key: HIVE-9211 URL: https://issues.apache.org/jira/browse/HIVE-9211 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Attachments: HIVE-9211.1-spark.patch, HIVE-9211.2-spark.patch, HIVE-9211.2-spark.patch, HIVE-9211.3-spark.patch, HIVE-9211.4-spark.patch, HIVE-9211.5-spark.patch, HIVE-9211.6-spark.patch, HIVE-9211.7-spark.patch HoS on YARN is a common use case in production environments; we'd better enable unit tests for this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
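The arithmetic at stake in the comment above is small. A hedged sketch of a Spark-internals-free computation follows; the property names are the standard Spark ones, but the defaults chosen here are illustrative, not what Hive actually uses.
{code:java}
import org.apache.spark.SparkConf;

public class CoreCount {
    // For a local-cluster master like local-cluster[2,2,1024], the expected total
    // is executors * coresPerExecutor = 4; reading spark.executor.cores alone yields 2.
    static int totalCores(SparkConf conf) {
        int executors = conf.getInt("spark.executor.instances", 2);    // illustrative default
        int coresPerExecutor = conf.getInt("spark.executor.cores", 1); // illustrative default
        return executors * coresPerExecutor;
    }
}
{code}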
[jira] [Created] (HIVE-9542) SparkSessionImpl calcualte wrong number of cores number in TestSparkCliDriver [Spark Branch]
Chengxiang Li created HIVE-9542: --- Summary: SparkSessionImpl calcualte wrong number of cores number in TestSparkCliDriver [Spark Branch] Key: HIVE-9542 URL: https://issues.apache.org/jira/browse/HIVE-9542 Project: Hive Issue Type: Sub-task Reporter: Chengxiang Li TestSparkCliDriver launches a local Spark cluster with [2,2,1024], which means 2 executors with 2 cores for each executor; HoS gets the core number as 2 instead of 4. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9211) Research on build mini HoS cluster on YARN for unit test[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9211: Attachment: HIVE-9211.5-spark.patch Research on build mini HoS cluster on YARN for unit test[Spark Branch] -- Key: HIVE-9211 URL: https://issues.apache.org/jira/browse/HIVE-9211 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Attachments: HIVE-9211.1-spark.patch, HIVE-9211.2-spark.patch, HIVE-9211.2-spark.patch, HIVE-9211.3-spark.patch, HIVE-9211.4-spark.patch, HIVE-9211.5-spark.patch HoS on YARN is a common use case in production environments; we'd better enable unit tests for this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9211) Research on build mini HoS cluster on YARN for unit test[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14298642#comment-14298642 ] Chengxiang Li commented on HIVE-9211: - I built Spark v1.2.0 with -Dhadoop.version=2.6.0 locally, removed the embedded Hadoop packages, and it works. Besides, why do we remove the Hadoop packages from the Spark assembly jar? Is it to avoid potential Hadoop conflicts? Research on build mini HoS cluster on YARN for unit test[Spark Branch] -- Key: HIVE-9211 URL: https://issues.apache.org/jira/browse/HIVE-9211 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Attachments: HIVE-9211.1-spark.patch, HIVE-9211.2-spark.patch, HIVE-9211.2-spark.patch, HIVE-9211.3-spark.patch, HIVE-9211.4-spark.patch, HIVE-9211.5-spark.patch HoS on YARN is a common use case in product environment, we'd better enable unit test for this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9449) Push YARN configuration to Spark while deploying Spark on YARN[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9449: Attachment: HiveonSparkconfiguration.pdf Push YARN configuration to Spark while deploying Spark on YARN[Spark Branch] Key: HIVE-9449 URL: https://issues.apache.org/jira/browse/HIVE-9449 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Fix For: spark-branch Attachments: HIVE-9449.1-spark.patch, HIVE-9449.1-spark.patch, HIVE-9449.2-spark.patch, HiveonSparkconfiguration.pdf We currently push only Spark configuration and RSC configuration to Spark when launching the Spark cluster; in Spark on YARN mode, Spark needs extra YARN configuration to launch the cluster. Besides this, to support dynamic setting of RSC/YARN configuration, we need to recreate the SparkSession whenever RSC or YARN configuration is updated, as those settings may influence the Spark cluster deployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
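A hedged sketch of the push described here, with a hypothetical helper name (the actual change lives in HiveSparkClientFactory, per the review request further down): copy yarn.* entries from the Hive-side Hadoop configuration into the map handed to the Spark client, using Spark's spark.hadoop.* convention so they reach the Hadoop Configuration built on the remote side.
{noformat}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

public class YarnConfPusher {
  // Hypothetical helper: merge YARN properties into the Spark config map.
  static Map<String, String> withYarnConf(Configuration hiveConf,
                                          Map<String, String> sparkConf) {
    Map<String, String> merged = new HashMap<String, String>(sparkConf);
    for (Map.Entry<String, String> e : hiveConf) {
      if (e.getKey().startsWith("yarn.")) {
        // Spark copies spark.hadoop.* entries into the Hadoop
        // Configuration it builds on the driver and executors.
        merged.put("spark.hadoop." + e.getKey(), e.getValue());
      }
    }
    return merged;
  }
}
{noformat}
Recreating the SparkSession on an RSC/YARN configuration change would then amount to rebuilding this map and relaunching the remote client.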
[jira] [Commented] (HIVE-9211) Research on build mini HoS cluster on YARN for unit test[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14298426#comment-14298426 ] Chengxiang Li commented on HIVE-9211: - Hi, [~brocknoland], the missing class is from the commons-collections jar; I left the exception stack trace at the end. The Spark assembly from the current Spark tarball does not include the commons-collections jar. I built Spark v1.2.0 in my own environment, and that assembly does include it. {noformat} Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/collections/map/UnmodifiableMap at org.apache.hadoop.conf.Configuration$DeprecationContext.<init>(Configuration.java:398) at org.apache.hadoop.conf.Configuration.<clinit>(Configuration.java:438) at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.newConfiguration(YarnSparkHadoopUtil.scala:57) at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:42) at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.<init>(YarnSparkHadoopUtil.scala:45) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at java.lang.Class.newInstance0(Class.java:374) at java.lang.Class.newInstance(Class.java:327) at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:196) at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:194) at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:161) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) Caused by: java.lang.ClassNotFoundException: org.apache.commons.collections.map.UnmodifiableMap at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:423) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang {noformat} Research on build mini HoS cluster on YARN for unit test[Spark Branch] -- Key: HIVE-9211 URL: https://issues.apache.org/jira/browse/HIVE-9211 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Attachments: HIVE-9211.1-spark.patch, HIVE-9211.2-spark.patch, HIVE-9211.2-spark.patch, HIVE-9211.3-spark.patch, HIVE-9211.4-spark.patch HoS on YARN is a common use case in product environment, we'd better enable unit test for this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
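A tiny diagnostic along these lines (purely illustrative, not part of any patch) makes it easy to check whether a given assembly bundles the class that Hadoop's Configuration fails on above:
{noformat}
public class ClasspathProbe {
  public static void main(String[] args) {
    // The class Hadoop's Configuration needs, per the trace above.
    String clazz = "org.apache.commons.collections.map.UnmodifiableMap";
    try {
      Class.forName(clazz);
      System.out.println(clazz + " is on the classpath");
    } catch (ClassNotFoundException e) {
      System.out.println(clazz + " is missing; the assembly was likely"
          + " built without the commons-collections jar");
    }
  }
}
{noformat}
Running it with java -cp <spark-assembly.jar>:. ClasspathProbe against the tarball assembly versus a locally built one would show the packaging difference Chengxiang describes.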
[jira] [Updated] (HIVE-9211) Research on build mini HoS cluster on YARN for unit test[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9211: Attachment: HIVE-9211.4-spark.patch [~brocknoland], what code base is our current Spark installation built upon? I ran into some inconsistent jar dependency issues in testing, and updating the Spark installation to the latest Spark branch-1.2 code fixed them. The Hive spark branch now depends on Hadoop 2.6.0 for hadoop2, so we may need to build Spark consistently with it. Research on build mini HoS cluster on YARN for unit test[Spark Branch] -- Key: HIVE-9211 URL: https://issues.apache.org/jira/browse/HIVE-9211 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Attachments: HIVE-9211.1-spark.patch, HIVE-9211.2-spark.patch, HIVE-9211.2-spark.patch, HIVE-9211.3-spark.patch, HIVE-9211.4-spark.patch HoS on YARN is a common use case in product environment, we'd better enable unit test for this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9487) Make Remote Spark Context secure [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296395#comment-14296395 ] Chengxiang Li commented on HIVE-9487: - +1, the patch looks good to me. Make Remote Spark Context secure [Spark Branch] --- Key: HIVE-9487 URL: https://issues.apache.org/jira/browse/HIVE-9487 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Attachments: HIVE-9487.1-spark.patch The RSC currently uses an ad-hoc, insecure authentication mechanism. We should instead use a proper auth mechanism and add encryption to the mix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9211) Research on build mini HoS cluster on YARN for unit test[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9211: Attachment: HIVE-9211.3-spark.patch Research on build mini HoS cluster on YARN for unit test[Spark Branch] -- Key: HIVE-9211 URL: https://issues.apache.org/jira/browse/HIVE-9211 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Attachments: HIVE-9211.1-spark.patch, HIVE-9211.2-spark.patch, HIVE-9211.2-spark.patch, HIVE-9211.3-spark.patch HoS on YARN is a common use case in product environment, we'd better enable unit test for this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9211) Research on build mini HoS cluster on YARN for unit test[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293613#comment-14293613 ] Chengxiang Li commented on HIVE-9211: - No log files were found in the container log directory, which is quite strange; this needs further research. Research on build mini HoS cluster on YARN for unit test[Spark Branch] -- Key: HIVE-9211 URL: https://issues.apache.org/jira/browse/HIVE-9211 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Attachments: HIVE-9211.1-spark.patch, HIVE-9211.2-spark.patch, HIVE-9211.2-spark.patch, HIVE-9211.3-spark.patch HoS on YARN is a common use case in product environment, we'd better enable unit test for this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9211) Research on build mini HoS cluster on YARN for unit test[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9211: Attachment: HIVE-9211.2-spark.patch Research on build mini HoS cluster on YARN for unit test[Spark Branch] -- Key: HIVE-9211 URL: https://issues.apache.org/jira/browse/HIVE-9211 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Attachments: HIVE-9211.1-spark.patch, HIVE-9211.2-spark.patch HoS on YARN is a common use case in product environment, we'd better enable unit test for this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9211) Research on build mini HoS cluster on YARN for unit test[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9211: Attachment: HIVE-9211.2-spark.patch Research on build mini HoS cluster on YARN for unit test[Spark Branch] -- Key: HIVE-9211 URL: https://issues.apache.org/jira/browse/HIVE-9211 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Attachments: HIVE-9211.1-spark.patch, HIVE-9211.2-spark.patch HoS on YARN is a common use case in product environment, we'd better enable unit test for this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9211) Research on build mini HoS cluster on YARN for unit test[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9211: Attachment: (was: HIVE-9211.2-spark.patch) Research on build mini HoS cluster on YARN for unit test[Spark Branch] -- Key: HIVE-9211 URL: https://issues.apache.org/jira/browse/HIVE-9211 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Attachments: HIVE-9211.1-spark.patch, HIVE-9211.2-spark.patch HoS on YARN is a common use case in product environment, we'd better enable unit test for this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9211) Research on build mini HoS cluster on YARN for unit test[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293057#comment-14293057 ] Chengxiang Li commented on HIVE-9211: - I work on Linux. Research on build mini HoS cluster on YARN for unit test[Spark Branch] -- Key: HIVE-9211 URL: https://issues.apache.org/jira/browse/HIVE-9211 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Attachments: HIVE-9211.1-spark.patch, HIVE-9211.2-spark.patch HoS on YARN is a common use case in product environment, we'd better enable unit test for this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HIVE-9425) External Function Jar files are not available for Driver when running with yarn-cluster mode [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li reassigned HIVE-9425: --- Assignee: Chengxiang Li External Function Jar files are not available for Driver when running with yarn-cluster mode [Spark Branch] --- Key: HIVE-9425 URL: https://issues.apache.org/jira/browse/HIVE-9425 Project: Hive Issue Type: Sub-task Components: spark-branch Reporter: Xiaomin Zhang Assignee: Chengxiang Li 15/01/20 00:27:31 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done 15/01/20 00:27:31 ERROR spark.SparkContext: Error adding jar (java.io.FileNotFoundException: hive-exec-0.15.0-SNAPSHOT.jar (No such file or directory)), was the --addJars option used? 15/01/20 00:27:31 ERROR spark.SparkContext: Error adding jar (java.io.FileNotFoundException: opennlp-maxent-3.0.3.jar (No such file or directory)), was the --addJars option used? 15/01/20 00:27:31 ERROR spark.SparkContext: Error adding jar (java.io.FileNotFoundException: bigbenchqueriesmr.jar (No such file or directory)), was the --addJars option used? 15/01/20 00:27:31 ERROR spark.SparkContext: Error adding jar (java.io.FileNotFoundException: opennlp-tools-1.5.3.jar (No such file or directory)), was the --addJars option used? 15/01/20 00:27:31 ERROR spark.SparkContext: Error adding jar (java.io.FileNotFoundException: jcl-over-slf4j-1.7.5.jar (No such file or directory)), was the --addJars option used? 15/01/20 00:27:31 INFO client.RemoteDriver: Received job request fef081b0-5408-4804-9531-d131fdd628e6 15/01/20 00:27:31 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize 15/01/20 00:27:31 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize 15/01/20 00:27:31 INFO client.RemoteDriver: Failed to run job fef081b0-5408-4804-9531-d131fdd628e6 org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: de.bankmark.bigbench.queries.q10.SentimentUDF Serialization trace: genericUDTF (org.apache.hadoop.hive.ql.plan.UDTFDesc) conf (org.apache.hadoop.hive.ql.exec.UDTFOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator) aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork) invertedWorkGraph (org.apache.hadoop.hive.ql.plan.SparkWork) at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138) at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115) It seems the additional jar files are not uploaded to the DistributedCache, so the Driver cannot access them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
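A hedged sketch of one way to surface this earlier (a hypothetical helper, not the eventual HIVE-9425 fix): in yarn-cluster mode the driver runs on a remote node, so bare file names like bigbenchqueriesmr.jar resolve against the wrong working directory. Resolving each added jar to an absolute URI on the submitting side, where the path still means something, turns the driver-side FileNotFoundException into a submit-time error:
{noformat}
import java.io.File;
import java.io.FileNotFoundException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class JarPathResolver {
  // Illustrative helper: validate and absolutize jar paths before they
  // are handed to Spark (e.g. via --addJars or SparkContext.addJar).
  static List<URI> resolve(List<String> jarPaths) throws FileNotFoundException {
    List<URI> uris = new ArrayList<URI>();
    for (String p : jarPaths) {
      File f = new File(p);
      if (!f.isFile()) {
        // Fail on the client, where the message is actionable.
        throw new FileNotFoundException(p + " (No such file or directory)");
      }
      uris.add(f.getAbsoluteFile().toURI());
    }
    return uris;
  }
}
{noformat}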
Re: Review Request 30264: HIVE-9221 enable unit test for mini Spark on YARN cluster[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30264/ --- (Updated Jan. 27, 2015, 2:03 a.m.) Review request for hive, Szehon Ho and Xuefu Zhang. Changes --- fixed commented issues. Bugs: HIVE-9211 https://issues.apache.org/jira/browse/HIVE-9211 Repository: hive-git Description --- MiniSparkOnYarnCluster is enabled for unit test, Spark is deployed on miniYarnCluster on yarn-client mode, all qfiles in minimr.query.files are enabled in this unit test except 3 qfile: bucket_num_reducers.q, bucket_num_reducers2.q, udf_using.q, which is not supported in HoS. Diffs (updated) - data/conf/spark/hive-site.xml 016f568 data/conf/spark/standalone/hive-site.xml PRE-CREATION data/conf/spark/yarn-client/hive-site.xml PRE-CREATION itests/pom.xml e1e88f6 itests/qtest-spark/pom.xml d12fad5 itests/src/test/resources/testconfiguration.properties f583aaf itests/util/src/main/java/org/apache/hadoop/hive/ql/QTestUtil.java 095b9bd ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java 41a2ab7 ql/src/test/results/clientpositive/spark/bucket5.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/bucket6.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/bucketizedhiveinputformat.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/constprog_partitioner.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/empty_dir_in_table.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/external_table_with_space_in_location_path.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/file_with_header_footer.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/import_exported_table.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/index_bitmap3.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/index_bitmap_auto.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/infer_bucket_sort_bucketed_table.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/infer_bucket_sort_dyn_part.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/infer_bucket_sort_map_operators.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/infer_bucket_sort_merge.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/infer_bucket_sort_num_buckets.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/infer_bucket_sort_reducers_power_two.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/input16_cc.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/list_bucket_dml_10.q.java1.7.out PRE-CREATION ql/src/test/results/clientpositive/spark/load_fs2.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/load_hdfs_file_with_space_in_the_name.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/parallel_orderby.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/ql_rewrite_gbtoidx.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/ql_rewrite_gbtoidx_cbo_1.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/ql_rewrite_gbtoidx_cbo_2.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/quotedid_smb.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/reduce_deduplicate.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/remote_script.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/root_dir_external_table.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/schemeAuthority.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/schemeAuthority2.q.out PRE-CREATION 
ql/src/test/results/clientpositive/spark/temp_table_external.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/truncate_column_buckets.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/uber_reduce.q.out PRE-CREATION shims/0.20S/src/main/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java b17f465 shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java a61c3ac shims/0.23/src/main/java/org/apache/hadoop/hive/shims/MiniSparkOnYARNCluster.java PRE-CREATION shims/common/src/main/java/org/apache/hadoop/hive/shims/HadoopShims.java 064304c spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java aea90db Diff: https://reviews.apache.org/r/30264/diff/ Testing --- Thanks, chengxiang li
Re: Review Request 30264: HIVE-9221 enable unit test for mini Spark on YARN cluster[Spark Branch]
On Jan. 26, 2015, 10:30 p.m., Xuefu Zhang wrote: data/conf/spark/yarn-client/hive-site.xml, line 225 https://reviews.apache.org/r/30264/diff/1/?file=834064#file834064line225 Only one executor? Maybe 2 will make it more general. Yes, that makes sense. On Jan. 26, 2015, 10:30 p.m., Xuefu Zhang wrote: I'm wondering why we have a new set of .out files? Every Test*CliDriver has its own output directory; I didn't think much about this previously. Now that you mention it, I think, yes, we could share the golden files with TestSparkCliDriver, as its golden files should be the same as TestMiniSparkOnYarnCliDriver's for each qtest. One more thing to note here: since spark.query.files contains more than 500 qtests, and a full Hive unit test run already takes long enough, I didn't enable all spark.query.files qtests for TestMiniSparkOnYarnCliDriver; instead, I enabled the qtests from minimr.query.files, which contains about 50 qtests and takes about 10 minutes on my own desktop. - chengxiang --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30264/#review69685 --- On Jan. 26, 2015, 6:37 a.m., chengxiang li wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30264/ --- (Updated Jan. 26, 2015, 6:37 a.m.) Review request for hive, Szehon Ho and Xuefu Zhang. Bugs: HIVE-9211 https://issues.apache.org/jira/browse/HIVE-9211 Repository: hive-git Description --- MiniSparkOnYarnCluster is enabled for unit test, Spark is deployed on miniYarnCluster on yarn-client mode, all qfiles in minimr.query.files are enabled in this unit test except 3 qfile: bucket_num_reducers.q, bucket_num_reducers2.q, udf_using.q, which is not supported in HoS. Diffs - data/conf/spark/hive-site.xml 016f568 data/conf/spark/standalone/hive-site.xml PRE-CREATION data/conf/spark/yarn-client/hive-site.xml PRE-CREATION itests/pom.xml e1e88f6 itests/qtest-spark/pom.xml d12fad5 itests/src/test/resources/testconfiguration.properties f583aaf itests/util/src/main/java/org/apache/hadoop/hive/ql/QTestUtil.java 095b9bd ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java 41a2ab7 ql/src/test/results/clientpositive/miniSparkOnYarn/auto_sortmerge_join_16.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/bucket4.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/bucket5.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/bucket6.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/bucketizedhiveinputformat.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/bucketmapjoin6.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/bucketmapjoin7.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/constprog_partitioner.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/disable_merge_for_bucketing.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/empty_dir_in_table.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/external_table_with_space_in_location_path.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/file_with_header_footer.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/groupby1.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/groupby2.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/import_exported_table.q.out PRE-CREATION 
ql/src/test/results/clientpositive/miniSparkOnYarn/index_bitmap3.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/index_bitmap_auto.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/infer_bucket_sort_bucketed_table.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/infer_bucket_sort_dyn_part.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/infer_bucket_sort_map_operators.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/infer_bucket_sort_merge.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/infer_bucket_sort_num_buckets.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/infer_bucket_sort_reducers_power_two.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/input16_cc.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/join1.q.out PRE-CREATION ql/src/test/results/clientpositive
[jira] [Updated] (HIVE-9211) Research on build mini HoS cluster on YARN for unit test[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9211: Status: Patch Available (was: Open) Research on build mini HoS cluster on YARN for unit test[Spark Branch] -- Key: HIVE-9211 URL: https://issues.apache.org/jira/browse/HIVE-9211 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Attachments: HIVE-9211.1-spark.patch, HIVE-9211.2-spark.patch HoS on YARN is a common use case in product environment, we'd better enable unit test for this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9211) Research on build mini HoS cluster on YARN for unit test[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293022#comment-14293022 ] Chengxiang Li commented on HIVE-9211: - From hive.log, it seems some error happened in the YARN container, but I can't reproduce it on my own machine. The container logs are located at {HIVE_HOME}/itests/qtest-spark/target/sparkOnYarn/SparkOnYarn-logDir-nm-*_*/application_*/container_*. [~xuefuz], is there any chance these container logs can be accessed through the HTTP service the way hive.log is? Research on build mini HoS cluster on YARN for unit test[Spark Branch] -- Key: HIVE-9211 URL: https://issues.apache.org/jira/browse/HIVE-9211 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Attachments: HIVE-9211.1-spark.patch, HIVE-9211.2-spark.patch HoS on YARN is a common use case in product environment, we'd better enable unit test for this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9211) Research on build mini HoS cluster on YARN for unit test[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293070#comment-14293070 ] Chengxiang Li commented on HIVE-9211: - Great, thanks, [~brocknoland]. Research on build mini HoS cluster on YARN for unit test[Spark Branch] -- Key: HIVE-9211 URL: https://issues.apache.org/jira/browse/HIVE-9211 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Attachments: HIVE-9211.1-spark.patch, HIVE-9211.2-spark.patch, HIVE-9211.2-spark.patch HoS on YARN is a common use case in product environment, we'd better enable unit test for this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9449) Push YARN configuration to Spark while deploying Spark on YARN[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9449: Attachment: HIVE-9449.2-spark.patch Push YARN configuration to Spark while deploying Spark on YARN[Spark Branch] Key: HIVE-9449 URL: https://issues.apache.org/jira/browse/HIVE-9449 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Attachments: HIVE-9449.1-spark.patch, HIVE-9449.1-spark.patch, HIVE-9449.2-spark.patch We currently push only Spark configuration and RSC configuration to Spark when launching the Spark cluster; in Spark on YARN mode, Spark needs extra YARN configuration to launch the cluster. Besides this, to support dynamic setting of RSC/YARN configuration, we need to recreate the SparkSession whenever RSC or YARN configuration is updated, as those settings may influence the Spark cluster deployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Review Request 30208: HIVE-9449 Push YARN configuration to Spark while deploying Spark on YARN[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30208/ --- (Updated Jan. 26, 2015, 5:06 a.m.) Review request for hive and Xuefu Zhang. Changes --- Fix unit test failure. Bugs: HIVE-9449 https://issues.apache.org/jira/browse/HIVE-9449 Repository: hive-git Description --- We currently push only Spark configuration and RSC configuration to Spark when launching the Spark cluster; in Spark on YARN mode, Spark needs extra YARN configuration to launch the cluster. Besides this, to support dynamic setting of RSC/YARN configuration, we need to recreate the SparkSession whenever RSC or YARN configuration is updated, as those settings may influence the Spark cluster deployment as well. Diffs (updated) - common/src/java/org/apache/hadoop/hive/conf/HiveConf.java d4d98d7 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveSparkClientFactory.java 9dc6c47 Diff: https://reviews.apache.org/r/30208/diff/ Testing --- Thanks, chengxiang li
[jira] [Updated] (HIVE-9211) Research on build mini HoS cluster on YARN for unit test[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9211: Attachment: HIVE-9211.1-spark.patch Research on build mini HoS cluster on YARN for unit test[Spark Branch] -- Key: HIVE-9211 URL: https://issues.apache.org/jira/browse/HIVE-9211 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Attachments: HIVE-9211.1-spark.patch HoS on YARN is a common use case in product environment, we'd better enable unit test for this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Review Request 30264: HIVE-9221 enable unit test for mini Spark on YARN cluster[Spark Branch]
/smb_mapjoin_8.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/stats_counter.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/stats_counter_partitioned.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/temp_table_external.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/truncate_column_buckets.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/uber_reduce.q.out PRE-CREATION shims/0.20S/src/main/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java b17f465 shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java a61c3ac shims/0.23/src/main/java/org/apache/hadoop/hive/shims/MiniSparkOnYARNCluster.java PRE-CREATION shims/common/src/main/java/org/apache/hadoop/hive/shims/HadoopShims.java 064304c spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java aea90db Diff: https://reviews.apache.org/r/30264/diff/ Testing --- Thanks, chengxiang li
Review Request 30264: HIVE-9221 enable unit test for mini Spark on YARN cluster[Spark Branch]
ql/src/test/results/clientpositive/miniSparkOnYarn/stats_counter.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/stats_counter_partitioned.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/temp_table_external.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/truncate_column_buckets.q.out PRE-CREATION ql/src/test/results/clientpositive/miniSparkOnYarn/uber_reduce.q.out PRE-CREATION shims/0.20S/src/main/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java b17f465 shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java a61c3ac shims/0.23/src/main/java/org/apache/hadoop/hive/shims/MiniSparkOnYARNCluster.java PRE-CREATION shims/common/src/main/java/org/apache/hadoop/hive/shims/HadoopShims.java 064304c spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java aea90db Diff: https://reviews.apache.org/r/30264/diff/ Testing --- Thanks, chengxiang li
[jira] [Updated] (HIVE-9370) SparkJobMonitor timeout as sortByKey would launch extra Spark job before original job get submitted [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9370: Attachment: HIVE-9370.1-spark.patch SparkJobMonitor timeout as sortByKey would launch extra Spark job before original job get submitted [Spark Branch] -- Key: HIVE-9370 URL: https://issues.apache.org/jira/browse/HIVE-9370 Project: Hive Issue Type: Sub-task Components: Spark Reporter: yuyun.chen Assignee: Chengxiang Li Attachments: HIVE-9370.1-spark.patch enable hive on spark and run BigBench Query 8 then got the following exception: 2015-01-14 11:43:46,057 INFO [main]: impl.RemoteSparkJobStatus (RemoteSparkJobStatus.java:getSparkJobInfo(143)) - Job hasn't been submitted after 30s. Aborting it. 2015-01-14 11:43:46,061 INFO [main]: impl.RemoteSparkJobStatus (RemoteSparkJobStatus.java:getSparkJobInfo(143)) - Job hasn't been submitted after 30s. Aborting it. 2015-01-14 11:43:46,061 ERROR [main]: status.SparkJobMonitor (SessionState.java:printError(839)) - Status: Failed 2015-01-14 11:43:46,062 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) - /PERFLOG method=SparkRunJob start=1421206996052 end=1421207026062 duration=30010 from=org.apache.hadoop.hive.ql.exec.spark.status.SparkJobMonitor 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - 15/01/14 11:43:46 INFO RemoteDriver: Failed to run job 0a9a7782-0e0b-4561-8468-959a6d8df0a3 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - java.lang.InterruptedException 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at java.lang.Object.wait(Native Method) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at java.lang.Object.wait(Object.java:503) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1282) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1300) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1314) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.rdd.RDD.collect(RDD.scala:780) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:262) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.RangePartitioner.init(Partitioner.scala:124) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at 
org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:63) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:894) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:864) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.hadoop.hive.ql.exec.spark.SortByShuffler.shuffle(SortByShuffler.java:48) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.hadoop.hive.ql.exec.spark.ShuffleTran.transform(ShuffleTran.java:45) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436
[jira] [Updated] (HIVE-9370) SparkJobMonitor timeout as sortByKey would launch extra Spark job before original job get submitted [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9370: Status: Patch Available (was: Open) SparkJobMonitor timeout as sortByKey would launch extra Spark job before original job get submitted [Spark Branch] -- Key: HIVE-9370 URL: https://issues.apache.org/jira/browse/HIVE-9370 Project: Hive Issue Type: Sub-task Components: Spark Reporter: yuyun.chen Assignee: Chengxiang Li Attachments: HIVE-9370.1-spark.patch enable hive on spark and run BigBench Query 8 then got the following exception: 2015-01-14 11:43:46,057 INFO [main]: impl.RemoteSparkJobStatus (RemoteSparkJobStatus.java:getSparkJobInfo(143)) - Job hasn't been submitted after 30s. Aborting it. 2015-01-14 11:43:46,061 INFO [main]: impl.RemoteSparkJobStatus (RemoteSparkJobStatus.java:getSparkJobInfo(143)) - Job hasn't been submitted after 30s. Aborting it. 2015-01-14 11:43:46,061 ERROR [main]: status.SparkJobMonitor (SessionState.java:printError(839)) - Status: Failed 2015-01-14 11:43:46,062 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) - /PERFLOG method=SparkRunJob start=1421206996052 end=1421207026062 duration=30010 from=org.apache.hadoop.hive.ql.exec.spark.status.SparkJobMonitor 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - 15/01/14 11:43:46 INFO RemoteDriver: Failed to run job 0a9a7782-0e0b-4561-8468-959a6d8df0a3 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - java.lang.InterruptedException 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at java.lang.Object.wait(Native Method) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at java.lang.Object.wait(Object.java:503) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1282) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1300) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1314) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.rdd.RDD.collect(RDD.scala:780) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:262) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.RangePartitioner.init(Partitioner.scala:124) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at 
org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:63) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:894) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:864) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.hadoop.hive.ql.exec.spark.SortByShuffler.shuffle(SortByShuffler.java:48) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.hadoop.hive.ql.exec.spark.ShuffleTran.transform(ShuffleTran.java:45) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436
Review Request 30162: HIVE-9370 SparkJobMonitor timeout as sortByKey would launch extra Spark job before original job get submitted [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30162/ --- Review request for hive and Xuefu Zhang. Bugs: HIVE-9370 https://issues.apache.org/jira/browse/HIVE-9370 Repository: hive-git Description --- In RSC mode, the monitor is based on the new remote job state instead of the Spark job state, as we can get more detailed information through the former interface. For example, the STARTED state of the remote job indicates that it has been submitted to the RemoteDriver and the related Spark job is about to be submitted, so we should not time out once this state is detected, even though we may not have the Spark job info yet. Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/LocalHiveSparkClient.java 32e5530 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java 30a00a7 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkTask.java a4554ac ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/LocalSparkJobMonitor.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/RemoteSparkJobMonitor.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/SparkJobMonitor.java 4f54612 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/SparkJobRef.java fe2d9f7 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/impl/LocalSparkJobRef.java f28c02b ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/impl/RemoteSparkJobRef.java a2707d1 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/impl/RemoteSparkJobStatus.java a8ac482 Diff: https://reviews.apache.org/r/30162/diff/ Testing --- Thanks, chengxiang li
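To make the described behavior concrete, here is a hedged sketch of such a monitoring loop; RemoteJobState, fetchState, and the 30-second constant are illustrative stand-ins, not the actual RemoteSparkJobMonitor API:
{noformat}
enum RemoteJobState { QUEUED, STARTED, SUCCEEDED, FAILED }

public class MonitorSketch {
  static final long SUBMIT_TIMEOUT_MS = 30_000;

  static boolean waitForJob(java.util.function.Supplier<RemoteJobState> fetchState)
      throws InterruptedException {
    long waitingSince = System.currentTimeMillis();
    boolean submitted = false;
    while (true) {
      switch (fetchState.get()) {
        case QUEUED:
          // Not yet accepted by the RemoteDriver: the submission
          // timeout only applies in this phase.
          if (!submitted
              && System.currentTimeMillis() - waitingSince > SUBMIT_TIMEOUT_MS) {
            return false; // job was never submitted: abort
          }
          break;
        case STARTED:
          // Accepted by the RemoteDriver; stop timeout checks even if
          // no Spark job info is visible yet (e.g. sortByKey's extra
          // sampling job runs first).
          submitted = true;
          break;
        case SUCCEEDED:
          return true;
        case FAILED:
          return false;
      }
      Thread.sleep(1000); // poll interval (illustrative)
    }
  }
}
{noformat}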
[jira] [Updated] (HIVE-9410) ClassNotFoundException occurs during hive query case execution with UDF defined [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9410: Attachment: HIVE-9410.3-spark.patch ClassNotFoundException occurs during hive query case execution with UDF defined [Spark Branch] -- Key: HIVE-9410 URL: https://issues.apache.org/jira/browse/HIVE-9410 Project: Hive Issue Type: Sub-task Components: Spark Environment: CentOS 6.5 JDK1.7 Reporter: Xin Hao Assignee: Chengxiang Li Attachments: HIVE-9410.1-spark.patch, HIVE-9410.2-spark.patch, HIVE-9410.3-spark.patch We have a hive query case with UDF defined (i.e. BigBench case Q10, Q18 etc.). It will be passed for default Hive (on MR) mode, while failed for Hive On Spark mode (both Standalone and Yarn-Client). Although we use 'add jar .jar;' to add the UDF jar explicitly, the issue still exists. BTW, if we put the UDF jar into $HIVE_HOME/lib dir, the case will be passed. Detail Error Message is as below (NOTE: de.bankmark.bigbench.queries.q10.SentimentUDF is the UDF which contained in jar bigbenchqueriesmr.jar, and we have add command like 'add jar /location/to/bigbenchqueriesmr.jar;' into .sql explicitly) INFO [pool-1-thread-1]: client.RemoteDriver (RemoteDriver.java:call(316)) - Failed to run job 8dd120cb-1a4d-4d1c-ba31-61eac648c27d org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: de.bankmark.bigbench.queries.q10.SentimentUDF Serialization trace: genericUDTF (org.apache.hadoop.hive.ql.plan.UDTFDesc) conf (org.apache.hadoop.hive.ql.exec.UDTFOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.MapJoinOperator) childOperators (org.apache.hadoop.hive.ql.exec.FilterOperator) childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator) aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork) right (org.apache.commons.lang3.tuple.ImmutablePair) edgeProperties (org.apache.hadoop.hive.ql.plan.SparkWork) at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138) at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:656) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:99) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776) at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112) at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776) at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112) ... 
Caused by: java.lang.ClassNotFoundException: de.bankmark.bigbench.queries.q10.SentimentUDF at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:136) ... 55 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Review Request 30107: HIVE-9410, ClassNotFoundException occurs during hive query case execution with UDF defined[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30107/ --- (Updated Jan. 22, 2015, 9:23 a.m.) Review request for hive and Xuefu Zhang. Changes --- The Spark driver may need to load extra added classes in two places: first, while executing GetJobStatusJob, it needs to deserialize SparkWork; second, while HiveInputFormat gets splits, it needs to deserialize MapWork. The RemoteDriver executes AddJarJob directly in the Netty RPC thread, as it's a SyncJobRequest, and executes GetJobStatusJob (which wraps the Spark job) with its thread pool. Getting splits in HiveInputFormat may happen in the Akka thread pool, as Spark sends messages through Akka between SparkContext and DAGScheduler. So we need to reset the class loaders of both threads to enable dynamic add jar in the RSC. Bugs: HIVE-9410 https://issues.apache.org/jira/browse/HIVE-9410 Repository: hive-git Description --- The RemoteDriver does not contain added jars in its classpath, so it fails to deserialize SparkWork with a ClassNotFoundException. For Hive on MR, when 'add jar' is used through the Hive CLI, Hive adds the jar to the CLI classpath (through the thread context class loader) and to the distributed cache as well. Compared to Hive on MR, Hive on Spark has an extra RemoteDriver component, so we should add the added jars to its classpath as well. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java d7cb111 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java 30a00a7 spark-client/src/main/java/org/apache/hive/spark/client/JobContext.java 00aa4ec spark-client/src/main/java/org/apache/hive/spark/client/JobContextImpl.java 1eb3ff2 spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java 5f9be65 spark-client/src/main/java/org/apache/hive/spark/client/SparkClientUtilities.java PRE-CREATION Diff: https://reviews.apache.org/r/30107/diff/ Testing --- Thanks, chengxiang li
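The classloader reset described here can be summarized in a short hedged sketch; SparkClientUtilities is PRE-CREATION in the diff, so the names below are illustrative rather than the actual patch code. The idea is to wrap the current context class loader in a URLClassLoader that also sees the added jars, and install it on each thread that deserializes work objects:
{noformat}
import java.net.URL;
import java.net.URLClassLoader;
import java.util.List;

public final class AddedJarLoader {
  // Illustrative helper: make the added jars visible to the calling thread
  // by layering a URLClassLoader over its current context class loader.
  static void addToClassPath(List<URL> addedJars) {
    ClassLoader parent = Thread.currentThread().getContextClassLoader();
    URLClassLoader loader =
        new URLClassLoader(addedJars.toArray(new URL[0]), parent);
    Thread.currentThread().setContextClassLoader(loader);
  }
}
{noformat}
Whether to do this per-thread or to mutate the system class loader instead is exactly the trade-off debated in the review exchange below.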
[jira] [Commented] (HIVE-9410) ClassNotFoundException occurs during hive query case execution with UDF defined [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287161#comment-14287161 ] Chengxiang Li commented on HIVE-9410: - [~xuefuz], all contrib-related qtests are launched with TestContribCliDriver; we cannot enable these qtests in TestSparkCliDriver directly. I'm not sure how to do it yet, and it should be beyond this JIRA's scope; I think we may create another JIRA to track it. ClassNotFoundException occurs during hive query case execution with UDF defined [Spark Branch] -- Key: HIVE-9410 URL: https://issues.apache.org/jira/browse/HIVE-9410 Project: Hive Issue Type: Sub-task Components: Spark Environment: CentOS 6.5 JDK1.7 Reporter: Xin Hao Assignee: Chengxiang Li Attachments: HIVE-9410.1-spark.patch, HIVE-9410.2-spark.patch, HIVE-9410.3-spark.patch We have a hive query case with UDF defined (i.e. BigBench case Q10, Q18 etc.). It will be passed for default Hive (on MR) mode, while failed for Hive On Spark mode (both Standalone and Yarn-Client). Although we use 'add jar .jar;' to add the UDF jar explicitly, the issue still exists. BTW, if we put the UDF jar into $HIVE_HOME/lib dir, the case will be passed. Detail Error Message is as below (NOTE: de.bankmark.bigbench.queries.q10.SentimentUDF is the UDF which contained in jar bigbenchqueriesmr.jar, and we have add command like 'add jar /location/to/bigbenchqueriesmr.jar;' into .sql explicitly) INFO [pool-1-thread-1]: client.RemoteDriver (RemoteDriver.java:call(316)) - Failed to run job 8dd120cb-1a4d-4d1c-ba31-61eac648c27d org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: de.bankmark.bigbench.queries.q10.SentimentUDF Serialization trace: genericUDTF (org.apache.hadoop.hive.ql.plan.UDTFDesc) conf (org.apache.hadoop.hive.ql.exec.UDTFOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.MapJoinOperator) childOperators (org.apache.hadoop.hive.ql.exec.FilterOperator) childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator) aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork) right (org.apache.commons.lang3.tuple.ImmutablePair) edgeProperties (org.apache.hadoop.hive.ql.plan.SparkWork) at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138) at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:656) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:99) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776) at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112) at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106) at 
org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776) at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112) ... Caused by: java.lang.ClassNotFoundException: de.bankmark.bigbench.queries.q10.SentimentUDF at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName
Re: Review Request 30107: HIVE-9410, ClassNotFoundException occurs during hive query case execution with UDF defined[Spark Branch]
On Jan. 23, 2015, 2:05 a.m., Xuefu Zhang wrote: I'm wondering what's the story for Hive CLI. Hive CLI can add jars from the local file system. Would this work for Hive on Spark? Hive CLI adds jars to the classpath dynamically, the same as this patch does for the RemoteDriver: it updates the thread context class loader to include the added jar paths. For Hive on Spark, the Hive CLI stays the same; the issue is that the RemoteDriver does not add these jars to its classpath, so the ClassNotFound error comes out when the RemoteDriver side needs a related class. On Jan. 23, 2015, 2:05 a.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java, line 367 https://reviews.apache.org/r/30107/diff/4/?file=829688#file829688line367 Callers of getBaseWork() will add the jars to the classpath. Why is this necessary? Who are the callers? Any side effects? The reason we need to do this is that getBaseWork() generates the MapWork/ReduceWork that contains the Hive operator tree, and the UDTFOperator in it holds a class from the added jar that needs to be loaded. To load an added jar dynamically, we need to reset the thread context class loader; as mentioned in the previous change summary, unlike the Hive CLI, there are 2 threads on the RemoteDriver side that may need to load added jars, and for the Akka thread there is no proper cut-in point for adding jars to the classpath. The side effect is that many Hive CLI threads may have to check and update their class loaders unnecessarily. Another possible solution is to update the system class loader for the RemoteDriver dynamically, which must be done in a quite hacky way, such as: URLClassLoader sysloader = (URLClassLoader) ClassLoader.getSystemClassLoader(); Class sysclass = URLClassLoader.class; try { Method method = sysclass.getDeclaredMethod("addURL", new Class[] { URL.class }); method.setAccessible(true); method.invoke(sysloader, new Object[] { u }); } catch (Throwable t) { t.printStackTrace(); throw new IOException("Error, could not add URL to system classloader"); } Which one do you prefer? On Jan. 23, 2015, 2:05 a.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java, line 220 https://reviews.apache.org/r/30107/diff/4/?file=829689#file829689line220 So, this is the code that adds the jars to the classpath of the remote driver? I'm wondering why these jars are necessary in order to deserialize SparkWork. Same as the previous comments: SparkWork contains MapWork/ReduceWork, which contains the operator tree, and UTFFOperator needs to load the added jar class. - chengxiang --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30107/#review69329 --- On Jan. 22, 2015, 9:23 a.m., chengxiang li wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30107/ --- (Updated Jan. 22, 2015, 9:23 a.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-9410 https://issues.apache.org/jira/browse/HIVE-9410 Repository: hive-git Description --- The RemoteDriver does not contain added jars in its classpath, so it fails to deserialize SparkWork with a ClassNotFoundException. For Hive on MR, when 'add jar' is used through the Hive CLI, Hive adds the jar to the CLI classpath (through the thread context class loader) and to the distributed cache as well. Compared to Hive on MR, Hive on Spark has an extra RemoteDriver component, so we should add the added jars to its classpath as well. 
Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java d7cb111 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java 30a00a7 spark-client/src/main/java/org/apache/hive/spark/client/JobContext.java 00aa4ec spark-client/src/main/java/org/apache/hive/spark/client/JobContextImpl.java 1eb3ff2 spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java 5f9be65 spark-client/src/main/java/org/apache/hive/spark/client/SparkClientUtilities.java PRE-CREATION Diff: https://reviews.apache.org/r/30107/diff/ Testing --- Thanks, chengxiang li
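For readers following the classloader discussion above, the thread-context-classloader approach boils down to roughly the following minimal sketch. The class and method names here are illustrative, not the actual Hive code:

import java.net.URL;
import java.net.URLClassLoader;
import java.util.List;

public final class AddedJarLoader {
  // Wrap the current context classloader in a URLClassLoader that also knows
  // the added jar URLs, then install the new loader on the current thread.
  public static void addToContextClassLoader(List<URL> jarUrls) {
    ClassLoader current = Thread.currentThread().getContextClassLoader();
    URLClassLoader updated = new URLClassLoader(jarUrls.toArray(new URL[0]), current);
    Thread.currentThread().setContextClassLoader(updated);
  }
}

Compared with the reflective addURL hack quoted in the review above, this touches only the current thread and needs no access to protected JDK internals.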
Re: Review Request 30107: HIVE-9410, ClassNotFoundException occurs during hive query case execution with UDF defined[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30107/#review69336 ---

ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java https://reviews.apache.org/r/30107/#comment114014 #3: This is executed in the akka thread; it gets the extra jar paths from the JobConf and adds them to the current thread's classloader.

ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java https://reviews.apache.org/r/30107/#comment114013 #2: This job is executed in the RemoteDriver thread pool; it gets the extra jar paths from the JobContext, adds them to the current thread's classloader, and sets them in the JobConf.

spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java https://reviews.apache.org/r/30107/#comment114012 #1: This adds the extra jar paths to the JobContext; it is executed in the netty connection thread.

- chengxiang li
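Taken together, steps #1-#3 hand the jar paths from the netty thread to the worker threads roughly as in the sketch below. The property key and helper names are assumptions for illustration, not the patch's actual identifiers; AddedJarLoader refers to the sketch shown earlier, and paths are assumed to already be URLs such as file:/...:

import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.mapred.JobConf;

public final class AddedJarFlow {
  // Assumed property key for this sketch, not necessarily Hive's real one.
  static final String ADDED_JARS_PROP = "hive.added.jars.path";

  // Step #2 (RemoteDriver job thread): make the jars visible to this thread
  // and record them in the JobConf for threads that only see the JobConf.
  public static void propagate(List<String> addedJars, JobConf conf) throws Exception {
    AddedJarLoader.addToContextClassLoader(toUrls(addedJars));
    conf.set(ADDED_JARS_PROP, String.join(",", addedJars));
  }

  // Step #3 (akka thread, e.g. inside getBaseWork()): read the paths back and
  // refresh this thread's context classloader before deserializing any work.
  public static void refresh(JobConf conf) throws Exception {
    String jars = conf.get(ADDED_JARS_PROP);
    if (jars != null && !jars.isEmpty()) {
      AddedJarLoader.addToContextClassLoader(toUrls(java.util.Arrays.asList(jars.split(","))));
    }
  }

  private static List<URL> toUrls(List<String> paths) throws Exception {
    List<URL> urls = new ArrayList<>();
    for (String p : paths) {
      urls.add(new URL(p)); // paths are assumed to be well-formed URLs
    }
    return urls;
  }
}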
Re: Review Request 30107: HIVE-9410, ClassNotFoundException occurs during hive query case execution with UDF defined[Spark Branch]
On Jan. 23, 2015, 2:05 a.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java, line 220 https://reviews.apache.org/r/30107/diff/4/?file=829689#file829689line220 So, this is the code that adds the jars to the classpath of the remote driver? I'm wondering why these jars are necessary in order to deserialize SparkWork.

chengxiang li wrote: Same as the previous comments: SparkWork contains MapWork/ReduceWork, which contain the operator tree, and the UTFFOperator needs to load classes from the added jar.

Xuefu Zhang wrote: Sorry, but which operator? UTFFOperator? I couldn't find it in the Hive source.

Sorry, that was a typo for UDTFOperator. As you can see from the error log in the JIRA, the class from the added jar is referenced by a UDTFOperator: org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: de.bankmark.bigbench.queries.q10.SentimentUDF Serialization trace: genericUDTF (org.apache.hadoop.hive.ql.plan.UDTFDesc) conf (org.apache.hadoop.hive.ql.exec.UDTFOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.MapJoinOperator) childOperators (org.apache.hadoop.hive.ql.exec.FilterOperator) childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)

- chengxiang

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30107/#review69329 ---
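To see why deserializing SparkWork drags in the UDF class, the serialization trace above can be mirrored with stand-in classes. These are not Hive's real classes; the field names merely match the trace:

// Illustrative miniature of the object graph in the trace.
class UdtfDescLike { Object genericUDTF; }            // genericUDTF (UDTFDesc)
class UdtfOperatorLike { UdtfDescLike conf; }         // conf (UDTFOperator)
class MapWorkLike { UdtfOperatorLike aliasToWork; }   // aliasToWork (MapWork)
// When Kryo deserializes a MapWorkLike graph it must instantiate the concrete
// class stored in the genericUDTF field -- here the user's SentimentUDF -- so
// that class must be loadable on the RemoteDriver, or deserialization fails
// exactly as shown in the trace.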
Re: Review Request 30107: HIVE-9410, ClassNotFoundException occurs during hive query case execution with UDF defined[Spark Branch]
On Jan. 23, 2015, 3:02 a.m., chengxiang li wrote: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java, line 371 https://reviews.apache.org/r/30107/diff/4/?file=829688#file829688line371 #3: This is executed in the akka thread; it gets the extra jar paths from the JobConf and adds them to the current thread's classloader.

Xuefu Zhang wrote: What thread is referred to as the akka thread?

Inside the Spark driver, SparkContext submits Spark jobs to the DAGScheduler through akka messages instead of direct invocation; akka holds a thread pool to handle these messages.

- chengxiang

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30107/#review69336 ---
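A small self-contained demo of why the akka threads need their own cut-in point: context classloaders are per-thread and are only inherited at thread creation time, so pool threads created before an "add jar" never see the updated loader. This is a generic Java illustration, not Hive code:

import java.net.URL;
import java.net.URLClassLoader;

public class ContextLoaderDemo {
  public static void main(String[] args) throws Exception {
    Thread worker = new Thread(() -> {
      try {
        Thread.sleep(100); // runs after the main thread swaps its loader
      } catch (InterruptedException ignored) { }
      // Still prints the original loader, not the URLClassLoader installed
      // below: the swap on the main thread is invisible to this thread.
      System.out.println("worker sees: " + Thread.currentThread().getContextClassLoader());
    });
    worker.start(); // inherits main's *current* loader at creation time

    ClassLoader updated = new URLClassLoader(new URL[0],
        Thread.currentThread().getContextClassLoader());
    Thread.currentThread().setContextClassLoader(updated);
    System.out.println("main sees:   " + Thread.currentThread().getContextClassLoader());
    worker.join();
  }
}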
[jira] [Commented] (HIVE-9370) SparkJobMonitor timeout as sortByKey would launch extra Spark job before original job get submitted [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288684#comment-14288684 ] Chengxiang Li commented on HIVE-9370: - RSC has a timeout at the netty level, so if the remote spark context does not respond at the netty level, we get this exception. One remaining question is that the spark session is still alive: the user can still submit queries, but they fail to execute because the RPC channel is already closed, so the user needs to restart Hive CLI or use a tricky way to create a new remote spark context, such as updating the spark configuration.

SparkJobMonitor timeout as sortByKey would launch extra Spark job before original job get submitted [Spark Branch] -- Key: HIVE-9370 URL: https://issues.apache.org/jira/browse/HIVE-9370 Project: Hive Issue Type: Sub-task Components: Spark Reporter: yuyun.chen Assignee: Chengxiang Li Fix For: spark-branch Attachments: HIVE-9370.1-spark.patch

Enable Hive on Spark and run BigBench Query 8; the following exception occurs: 2015-01-14 11:43:46,057 INFO [main]: impl.RemoteSparkJobStatus (RemoteSparkJobStatus.java:getSparkJobInfo(143)) - Job hasn't been submitted after 30s. Aborting it. 2015-01-14 11:43:46,061 INFO [main]: impl.RemoteSparkJobStatus (RemoteSparkJobStatus.java:getSparkJobInfo(143)) - Job hasn't been submitted after 30s. Aborting it. 2015-01-14 11:43:46,061 ERROR [main]: status.SparkJobMonitor (SessionState.java:printError(839)) - Status: Failed 2015-01-14 11:43:46,062 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) - /PERFLOG method=SparkRunJob start=1421206996052 end=1421207026062 duration=30010 from=org.apache.hadoop.hive.ql.exec.spark.status.SparkJobMonitor 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - 15/01/14 11:43:46 INFO RemoteDriver: Failed to run job 0a9a7782-0e0b-4561-8468-959a6d8df0a3 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - java.lang.InterruptedException 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at java.lang.Object.wait(Native Method) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at java.lang.Object.wait(Object.java:503) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514) 2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1282) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1300) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1314) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.rdd.RDD.collect(RDD.scala:780) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl
(SparkClientImpl.java:run(436)) -at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:262) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.RangePartitioner.init(Partitioner.scala:124) 2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:63) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:894) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) -at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:864) 2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436
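The behavior behind this timeout is reproducible with a minimal standalone sketch (class and app names are illustrative): JavaPairRDD.sortByKey constructs a RangePartitioner, whose constructor immediately runs a sampling job (the RangePartitioner.sketch call visible in the trace above) to estimate key range bounds, before the real job is ever submitted.

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SortByKeyEagerJob {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local", "sortByKey-demo");
    JavaPairRDD<Integer, String> pairs = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(3, "c"), new Tuple2<>(1, "a"), new Tuple2<>(2, "b")));
    // This line alone triggers a Spark job (the range-bounds sampling), even
    // though no action has been called on the sorted RDD yet.
    JavaPairRDD<Integer, String> sorted = pairs.sortByKey();
    sorted.collect(); // the second job: the actual sort
    sc.stop();
  }
}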
[jira] [Commented] (HIVE-9410) ClassNotFoundException occurs during hive query case execution with UDF defined [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288797#comment-14288797 ] Chengxiang Li commented on HIVE-9410: - Yes, Spark could address this issue more properly; I've created SPARK-5377 for it. About the unit test: udf_example_add.q is not suitable to verify this issue, as Hive does not need to load the UDF class during SparkWork serialization. I would try to enable some UDTF unit tests for this instead.

ClassNotFoundException occurs during hive query case execution with UDF defined [Spark Branch] -- Key: HIVE-9410 URL: https://issues.apache.org/jira/browse/HIVE-9410 Project: Hive Issue Type: Sub-task Components: Spark Environment: CentOS 6.5 JDK1.7 Reporter: Xin Hao Assignee: Chengxiang Li Attachments: HIVE-9410.1-spark.patch, HIVE-9410.2-spark.patch, HIVE-9410.3-spark.patch

We have Hive query cases with UDFs defined (e.g. BigBench cases Q10, Q18, etc.). They pass in default Hive (on MR) mode but fail in Hive on Spark mode (both Standalone and Yarn-Client). Although we use 'add jar .jar;' to add the UDF jar explicitly, the issue still exists. BTW, if we put the UDF jar into the $HIVE_HOME/lib dir, the cases pass. The detailed error message is below (NOTE: de.bankmark.bigbench.queries.q10.SentimentUDF is the UDF contained in the jar bigbenchqueriesmr.jar, and we added a command like 'add jar /location/to/bigbenchqueriesmr.jar;' to the .sql explicitly): INFO [pool-1-thread-1]: client.RemoteDriver (RemoteDriver.java:call(316)) - Failed to run job 8dd120cb-1a4d-4d1c-ba31-61eac648c27d org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: de.bankmark.bigbench.queries.q10.SentimentUDF Serialization trace: genericUDTF (org.apache.hadoop.hive.ql.plan.UDTFDesc) conf (org.apache.hadoop.hive.ql.exec.UDTFOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.MapJoinOperator) childOperators (org.apache.hadoop.hive.ql.exec.FilterOperator) childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator) aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork) right (org.apache.commons.lang3.tuple.ImmutablePair) edgeProperties (org.apache.hadoop.hive.ql.plan.SparkWork) at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138) at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:656) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:99) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776) at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112) at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106) at
org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776) at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112) ... Caused by: java.lang.ClassNotFoundException: de.bankmark.bigbench.queries.q10.SentimentUDF at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270
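The shape of the fix can be pictured with a small Kryo sketch. Plain Kryo is used here, while Hive actually uses a shaded copy under org.apache.hive.com.esotericsoftware, and the class is illustrative rather than Hive's real deserialization path: pointing Kryo at the (previously updated) thread context classloader is what lets it resolve classes from jars added at runtime.

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;

public class KryoLoaderSketch {
  public static Object deserialize(byte[] bytes) {
    Kryo kryo = new Kryo();
    // Resolve classes through the thread context classloader, which the
    // patch refreshes to include the "add jar" URLs before deserializing.
    kryo.setClassLoader(Thread.currentThread().getContextClassLoader());
    try (Input input = new Input(bytes)) {
      // Without the right classloader, this is where
      // "KryoException: Unable to find class ..." is thrown.
      return kryo.readClassAndObject(input);
    }
  }
}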
[jira] [Commented] (HIVE-9410) ClassNotFoundException occurs during hive query case execution with UDF defined [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288848#comment-14288848 ] Chengxiang Li commented on HIVE-9410: - As the ser/deser between the Hive driver and the remote spark context is outside Spark's scope, we still need this fix even if SPARK-5377 is resolved.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9410) ClassNotFoundException occurs during hive query case execution with UDF defined [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9410: Attachment: (was: HIVE-9410.4-spark.patch)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9410) ClassNotFoundException occurs during hive query case execution with UDF defined [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-9410: Attachment: HIVE-9410.4-spark.patch
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Review Request 30107: HIVE-9410, ClassNotFoundException occurs during hive query case execution with UDF defined[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30107/ --- (Updated Jan. 23, 2015, 6:44 a.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-9410 https://issues.apache.org/jira/browse/HIVE-9410 Repository: hive-git

Description --- The RemoteDriver does not contain the added jars in its classpath, so it fails to deserialize SparkWork with a ClassNotFoundException. For Hive on MR, when a jar is added through Hive CLI, Hive adds it to the CLI classpath (through the thread context classloader) and to the distributed cache as well. Compared to Hive on MR, Hive on Spark has an extra RemoteDriver component, so we should add the added jars to its classpath as well.

Diffs (updated) - itests/src/test/resources/testconfiguration.properties 6340d1c ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 9d9f4e6 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java a4a166a ql/src/test/queries/clientpositive/lateral_view_explode2.q PRE-CREATION ql/src/test/results/clientpositive/lateral_view_explode2.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/lateral_view_explode2.q.out PRE-CREATION spark-client/src/main/java/org/apache/hive/spark/client/JobContext.java 00aa4ec spark-client/src/main/java/org/apache/hive/spark/client/JobContextImpl.java 1eb3ff2 spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java 5f9be65 spark-client/src/main/java/org/apache/hive/spark/client/SparkClientUtilities.java PRE-CREATION Diff: https://reviews.apache.org/r/30107/diff/ Testing --- Thanks, chengxiang li