[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15286005#comment-15286005 ]

Xuefu Zhang commented on HIVE-13293:

Please commit. Thanks.

> Query occurs performance degradation after enabling parallel order by for Hive on Spark
> ---
>
> Key: HIVE-13293
> URL: https://issues.apache.org/jira/browse/HIVE-13293
> Project: Hive
> Issue Type: Bug
> Components: Spark
> Affects Versions: 2.0.0
> Reporter: Lifeng Wang
> Assignee: Rui Li
> Attachments: HIVE-13293.1.patch, HIVE-13293.2.patch, HIVE-13293.3.patch, HIVE-13293.3.patch, HIVE-13293.3.patch
>
> I use TPCx-BB to do some performance test on Hive on Spark engine. And found query 10 has performance degradation when enabling parallel order by.
> It seems that sampling cost much time before running the real query.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285944#comment-15285944 ]

Rui Li commented on HIVE-13293:

Hi [~xuefuz], any further comments on this one?
[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284369#comment-15284369 ]

Rui Li commented on HIVE-13293:

None of the failures can be reproduced locally. Seems the test framework has become quite unstable though.
[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284226#comment-15284226 ]

Hive QA commented on HIVE-13293:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12804096/HIVE-13293.3.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 48 failed/errored test(s), 9941 tests executed
*Failed tests:*
{noformat}
TestHWISessionManager - did not produce a TEST-*.xml file
TestMiniLlapCliDriver - did not produce a TEST-*.xml file
TestMiniTezCliDriver-auto_join1.q-schema_evol_text_vec_mapwork_part_all_complex.q-vector_complex_join.q-and-12-more - did not produce a TEST-*.xml file
TestMiniTezCliDriver-constprog_dpp.q-dynamic_partition_pruning.q-vectorization_10.q-and-12-more - did not produce a TEST-*.xml file
TestMiniTezCliDriver-enforce_order.q-vector_partition_diff_num_cols.q-unionDistinct_1.q-and-12-more - did not produce a TEST-*.xml file
TestMiniTezCliDriver-explainuser_4.q-update_after_multiple_inserts.q-mapreduce2.q-and-12-more - did not produce a TEST-*.xml file
TestMiniTezCliDriver-join1.q-mapjoin_decimal.q-union5.q-and-12-more - did not produce a TEST-*.xml file
TestMiniTezCliDriver-load_dyn_part2.q-selectDistinctStar.q-vector_decimal_5.q-and-12-more - did not produce a TEST-*.xml file
TestMiniTezCliDriver-vector_coalesce.q-cbo_windowing.q-tez_join.q-and-12-more - did not produce a TEST-*.xml file
TestMiniTezCliDriver-vector_distinct_2.q-tez_joins_explain.q-cte_mat_1.q-and-12-more - did not produce a TEST-*.xml file
TestMiniTezCliDriver-vector_interval_2.q-schema_evol_text_nonvec_mapwork_part_all_primitive.q-tez_fsstat.q-and-12-more - did not produce a TEST-*.xml file
TestMiniTezCliDriver-vectorized_parquet.q-insert_values_non_partitioned.q-schema_evol_orc_nonvec_mapwork_part.q-and-12-more - did not produce a TEST-*.xml file
TestMinimrCliDriver-join1.q-infer_bucket_sort_bucketed_table.q-root_dir_external_table.q-and-1-more - did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ivyDownload
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_index_bitmap3
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_join18
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_join_reordering_values
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_1
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_cbo_udf_udaf
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_dynamic_rdd_cache
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby3_map
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_having
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_identity_project_remove_skip
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_insert_into1
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join22
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join_merge_multi_expressions
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_mapjoin_test_outer
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_pcr
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_ptf_seqfile
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_sample6
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_skewjoinopt8
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_smb_mapjoin_11
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_smb_mapjoin_14
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_smb_mapjoin_7
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_stats1
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_subquery_in
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_temp_table
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_udf_max
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_1
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_10
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_11
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_3
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_uniquejoin
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_vector_cast_constant
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_vectorization_short_regress
org.apache.hadoop.hive.llap.tez.TestConverters.testFragmentSpecToTaskSpec
org.apache.hadoop.hive.llap.tezplugins.TestLlapTaskCommunicator.testFinishableStateUpdateFailure
org.apache.hive.service.cli.session.TestHiveSessionImpl.testLeakOperationHandle
{noformat}

Test results: http://ec2-54-177-240-2.us-west-1.compute.amazonaws.com/job/PreCommit-HIVE-MASTER-Build/295/testReport
Console output:
[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283746#comment-15283746 ]

Hive QA commented on HIVE-13293:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12803817/HIVE-13293.3.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: http://ec2-54-177-240-2.us-west-1.compute.amazonaws.com/job/PreCommit-HIVE-MASTER-Build/278/testReport
Console output: http://ec2-54-177-240-2.us-west-1.compute.amazonaws.com/job/PreCommit-HIVE-MASTER-Build/278/console
Test logs: http://ec2-50-18-27-0.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-MASTER-Build-278/

Messages:
{noformat}
This message was trimmed, see log for full details
[INFO] Excluding org.apache.spark:spark-core_2.10:jar:1.6.0 from the shaded jar.
[INFO] Excluding com.twitter:chill_2.10:jar:0.5.0 from the shaded jar.
[INFO] Excluding com.twitter:chill-java:jar:0.5.0 from the shaded jar.
[INFO] Excluding org.apache.xbean:xbean-asm5-shaded:jar:4.4 from the shaded jar.
[INFO] Excluding org.apache.hadoop:hadoop-client:jar:2.6.0 from the shaded jar.
[INFO] Excluding org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.6.0 from the shaded jar.
[INFO] Excluding org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.6.0 from the shaded jar.
[INFO] Excluding org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.6.0 from the shaded jar.
[INFO] Excluding org.apache.spark:spark-launcher_2.10:jar:1.6.0 from the shaded jar.
[INFO] Excluding org.apache.spark:spark-network-common_2.10:jar:1.6.0 from the shaded jar.
[INFO] Excluding org.apache.spark:spark-network-shuffle_2.10:jar:1.6.0 from the shaded jar.
[INFO] Excluding org.apache.spark:spark-unsafe_2.10:jar:1.6.0 from the shaded jar.
[INFO] Excluding org.slf4j:jul-to-slf4j:jar:1.7.10 from the shaded jar.
[INFO] Excluding org.slf4j:jcl-over-slf4j:jar:1.7.10 from the shaded jar.
[INFO] Excluding com.ning:compress-lzf:jar:1.0.3 from the shaded jar.
[INFO] Excluding net.jpountz.lz4:lz4:jar:1.3.0 from the shaded jar.
[INFO] Excluding com.typesafe.akka:akka-remote_2.10:jar:2.3.11 from the shaded jar.
[INFO] Excluding com.typesafe.akka:akka-actor_2.10:jar:2.3.11 from the shaded jar.
[INFO] Excluding com.typesafe:config:jar:1.2.1 from the shaded jar.
[INFO] Excluding org.uncommons.maths:uncommons-maths:jar:1.2.2a from the shaded jar.
[INFO] Excluding com.typesafe.akka:akka-slf4j_2.10:jar:2.3.11 from the shaded jar.
[INFO] Excluding org.scala-lang:scala-library:jar:2.10.4 from the shaded jar.
[INFO] Excluding org.json4s:json4s-jackson_2.10:jar:3.2.10 from the shaded jar.
[INFO] Excluding org.json4s:json4s-core_2.10:jar:3.2.10 from the shaded jar.
[INFO] Excluding org.json4s:json4s-ast_2.10:jar:3.2.10 from the shaded jar.
[INFO] Excluding org.scala-lang:scalap:jar:2.10.0 from the shaded jar.
[INFO] Excluding org.scala-lang:scala-compiler:jar:2.10.0 from the shaded jar.
[INFO] Excluding org.apache.mesos:mesos:jar:shaded-protobuf:0.21.1 from the shaded jar.
[INFO] Excluding com.clearspring.analytics:stream:jar:2.7.0 from the shaded jar.
[INFO] Excluding io.dropwizard.metrics:metrics-graphite:jar:3.1.2 from the shaded jar.
[INFO] Excluding com.fasterxml.jackson.module:jackson-module-scala_2.10:jar:2.4.4 from the shaded jar.
[INFO] Excluding org.scala-lang:scala-reflect:jar:2.10.4 from the shaded jar.
[INFO] Excluding oro:oro:jar:2.0.8 from the shaded jar.
[INFO] Excluding org.tachyonproject:tachyon-client:jar:0.8.2 from the shaded jar.
[INFO] Excluding org.tachyonproject:tachyon-underfs-hdfs:jar:0.8.2 from the shaded jar.
[INFO] Excluding org.tachyonproject:tachyon-underfs-s3:jar:0.8.2 from the shaded jar.
[INFO] Excluding org.tachyonproject:tachyon-underfs-local:jar:0.8.2 from the shaded jar.
[INFO] Excluding net.razorvine:pyrolite:jar:4.9 from the shaded jar.
[INFO] Excluding net.sf.py4j:py4j:jar:0.9 from the shaded jar.
[INFO] Excluding org.spark-project.spark:unused:jar:1.0.0 from the shaded jar.
[INFO] Excluding org.slf4j:slf4j-api:jar:1.7.10 from the shaded jar.
[INFO] Replacing original artifact with shaded artifact.
[INFO] Replacing /data/hive-ptest/working/apache-github-source-source/ql/target/hive-exec-2.1.0-SNAPSHOT.jar with /data/hive-ptest/working/apache-github-source-source/ql/target/hive-exec-2.1.0-SNAPSHOT-shaded.jar
[INFO] Dependency-reduced POM written at: /data/hive-ptest/working/apache-github-source-source/ql/dependency-reduced-pom.xml
[INFO] Dependency-reduced POM written at: /data/hive-ptest/working/apache-github-source-source/ql/dependency-reduced-pom.xml
[INFO] Dependency-reduced POM written at: /data/hive-ptest/working/apache-github-source-source/ql/dependency-reduced-pom.xml
[INFO] Dependency-reduced POM written at: /data/hive-ptest/working/apache-github-source-source/ql/dependency-reduced-pom.xml
[INFO] Dependency-reduced POM written at:
[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282841#comment-15282841 ]

Xuefu Zhang commented on HIVE-13293:

Hi [~lirui], is checking the number of partitions necessary because the sampling job is triggered only when there is more than one partition?
[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281488#comment-15281488 ]

Hive QA commented on HIVE-13293:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12803166/HIVE-13293.1.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 68 failed/errored test(s), 9194 tests executed
*Failed tests:*
{noformat}
TestHWISessionManager - did not produce a TEST-*.xml file
TestMiniLlapCliDriver - did not produce a TEST-*.xml file
TestMiniTezCliDriver-bucket_map_join_tez1.q-auto_sortmerge_join_16.q-skewjoin.q-and-12-more - did not produce a TEST-*.xml file
TestMiniTezCliDriver-join1.q-schema_evol_orc_nonvec_mapwork_part.q-mapjoin_decimal.q-and-12-more - did not produce a TEST-*.xml file
TestMiniTezCliDriver-load_dyn_part2.q-selectDistinctStar.q-vector_decimal_5.q-and-12-more - did not produce a TEST-*.xml file
TestMiniTezCliDriver-mapjoin_mapjoin.q-insert_into1.q-vector_decimal_2.q-and-12-more - did not produce a TEST-*.xml file
TestMiniTezCliDriver-order_null.q-vector_acid3.q-orc_merge10.q-and-12-more - did not produce a TEST-*.xml file
TestNegativeCliDriver-udf_invalid.q-nopart_insert.q-insert_into_with_schema.q-and-734-more - did not produce a TEST-*.xml file
TestSparkCliDriver-ppd_transform.q-union_remove_7.q-date_udf.q-and-12-more - did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ivyDownload
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_bucket4
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_bucket5
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_bucket6
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_disable_merge_for_bucketing
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_index_bitmap3
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_infer_bucket_sort_map_operators
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_infer_bucket_sort_num_buckets
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_infer_bucket_sort_reducers_power_two
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_list_bucket_dml_10
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_orc_merge1
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_orc_merge2
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_orc_merge9
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_orc_merge_diff_fs
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_reduce_deduplicate
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_vector_outer_join1
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_vector_outer_join2
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_vector_outer_join3
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_vector_outer_join4
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_vector_outer_join5
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_4
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_cbo_stats
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby2_map_skew
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby7_noskew_multi_single_reducer
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_ppr
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join34
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join35
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join6
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_load_dyn_part2
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_load_dyn_part5
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_multi_insert_gby
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_sample5
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_skewjoinopt14
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_skewjoinopt16
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_17
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_vectorization_3
org.apache.hadoop.hive.llap.daemon.impl.TestTaskExecutorService.testPreemptionQueueComparator
org.apache.hadoop.hive.llap.daemon.impl.comparator.TestShortestJobFirstComparator.testWaitQueueComparatorWithinDagPriority
org.apache.hadoop.hive.llap.tez.TestConverters.testFragmentSpecToTaskSpec
org.apache.hadoop.hive.llap.tezplugins.TestLlapTaskCommunicator.testFinishableStateUpdateFailure
org.apache.hadoop.hive.metastore.TestAuthzApiEmbedAuthorizerInRemote.org.apache.hadoop.hive.metastore.TestAuthzApiEmbedAuthorizerInRemote
[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280380#comment-15280380 ]

Xuefu Zhang commented on HIVE-13293:

Okay. Sounds good. +1
[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280347#comment-15280347 ]

Rui Li commented on HIVE-13293:

Hi [~xuefuz], yeah, order by is mostly at the end of the stages. But that doesn't mean the amount of data is small - that's why we need parallel order by. During our benchmark, we hit OOM in several cases, which was due to a bug in Spark 1.6.0. So I thought using a memory-level cache might make it even worse.

To your second question, we unpersist cached RDDs at the end of each job. You can refer to {{RemoteDriver#JobWrapper}} for that.
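The cleanup pattern mentioned in the comment above (release whatever a job cached once the job ends) can be sketched with a toy model. This is not the code in {{RemoteDriver#JobWrapper}}; the classes `CachedDataset` and `JobScope` below are hypothetical names invented for illustration, and no Spark API is used.

```python
class CachedDataset:
    """Hypothetical stand-in for a cached RDD; it only tracks its persisted flag."""

    def __init__(self, name):
        self.name = name
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self

    def unpersist(self):
        self.persisted = False
        return self


class JobScope:
    """Hypothetical per-job scope: remembers what the job cached, releases it on exit."""

    def __init__(self):
        self._cached = []

    def cache(self, dataset):
        # Persist the dataset and remember it so it can be released later.
        self._cached.append(dataset.persist())
        return dataset

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs whether the job body succeeded or raised.
        for dataset in self._cached:
            dataset.unpersist()
        self._cached.clear()
        return False  # do not swallow exceptions raised by the job body
```

Wrapping a job body in `with JobScope() as job:` and caching through `job.cache(...)` means nothing stays persisted after the block, including when the body fails, which matters when cached data is never reused across queries.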
[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280312#comment-15280312 ]

Xuefu Zhang commented on HIVE-13293:

[~lirui], thanks for working on this. The patch looks good, but one thing I'm not very sure of is the persistence level. Order by is almost always at the end of the stages. Thus, does it make sense to have a mix of memory and disk?

As an aside, an out-of-scope question: do we need to explicitly call rdd.unpersist() for those cached RDDs once a query is completed? Right now, RDDs are never reused across queries.
[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242888#comment-15242888 ]

Hive QA commented on HIVE-13293:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12797965/HIVE-13293.1.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 9 failed/errored test(s), 9964 tests executed
*Failed tests:*
{noformat}
TestJdbcWithMiniHS2 - did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_index_compact_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_llap_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_non_ascii_literal2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vector_grouping_sets
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vector_interval_mapjoin
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vector_join_filters
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_index_bitmap3
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby7_noskew_multi_single_reducer
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/7597/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/7597/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-7597/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 9 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12797965 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238556#comment-15238556 ]

Rui Li commented on HIVE-13293:

Thanks [~xuefuz] for the review. I mean it can work with queries that have only one ShuffleMapStage. It will definitely work with queries that have multiple ShuffleMapStages too. But as I said in my previous comment, what we care about here is just the last ShuffleMapStage, because that's what gets re-computed in parallel order by. On the other hand, splitting a task that has only one ShuffleMapStage seems weird and may be bad for performance. That's why I chose to cache the RDD.
[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237434#comment-15237434 ]

Xuefu Zhang commented on HIVE-13293:

[~lirui], thanks for the investigation and the patch, which seems simple and straightforward. One question: what do you mean by "only works queries that have only one ShuffleMapStage"? In your previous example, there are actually a few such stages. Isn't your patch supposed to help that as well?
[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211748#comment-15211748 ]

Rui Li commented on HIVE-13293:

Just did some research about this. Actually the overhead is not as big as I thought. If a query is complicated, we'll have multiple stages. In Spark, intermediate stages are called {{ShuffleMapStage}} and the last stage is called {{ResultStage}}. Suppose we have the following stage graph:
{noformat}
ShuffleMapStage1
    | (join)
ShuffleMapStage2
    | (groupBy)
ShuffleMapStage3
    | (sortByKey)
ResultStage4
{noformat}
When calling sortByKey, Spark launches the sampling job. The job triggers computation of ShuffleMapStage1, ShuffleMapStage2 and a ResultStage that shares most of ShuffleMapStage3.

When we launch the real job, we'll submit the above stage graph. But at this point, Spark will consider ShuffleMapStage1 and ShuffleMapStage2 as already computed because the shuffle outputs are still on local disk. Therefore what's re-computed is just ShuffleMapStage3. I have done some tests to verify this.

That being said, when ShuffleMapStage3 is complicated enough, we'll still have some considerable overhead. And I think that's the case for Q10 in TPCx-BB. Rather than splitting the task, I think a better and easier way is to cache the RDD before calling sortByKey. We can use the DISK_ONLY storage level if memory is a concern. I'll come up with a patch for review.
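The recomputation described in the comment above can be illustrated with a small toy model. This is a sketch under stated assumptions, not Spark or Hive code: `LazyDataset` and `parallel_sort` are hypothetical names, the "sampling job" simply reads every row instead of a fraction, and the only thing being modeled is how many times the stage feeding the sort is evaluated with and without caching.

```python
class LazyDataset:
    """Toy stand-in for an RDD: recomputes its rows on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute       # expensive function producing the rows
        self._cached = None
        self._cache_enabled = False
        self.compute_calls = 0        # how many times the expensive stage ran

    def cache(self):
        self._cache_enabled = True
        return self

    def rows(self):
        if self._cache_enabled and self._cached is not None:
            return self._cached
        self.compute_calls += 1
        result = list(self._compute())
        if self._cache_enabled:
            self._cached = result
        return result


def parallel_sort(dataset, num_partitions):
    """Range-partitioned sort: one pass to pick split points, one pass to sort."""
    if num_partitions <= 1:
        # A single partition needs no range boundaries, so no sampling pass.
        return [sorted(dataset.rows())]
    # Pass 1 (the "sampling job"): derive range boundaries from the data.
    # A real sampler would read only a fraction of the rows.
    sample = sorted(dataset.rows())
    step = max(1, len(sample) // num_partitions)
    bounds = sample[step::step][: num_partitions - 1]
    # Pass 2 (the "real job"): route each row to its range, then sort each range.
    partitions = [[] for _ in range(len(bounds) + 1)]
    for row in dataset.rows():
        index = sum(1 for bound in bounds if row >= bound)
        partitions[index].append(row)
    return [sorted(partition) for partition in partitions]
```

In this model, sorting an uncached `LazyDataset` into several partitions evaluates the input twice (once for sampling, once for the sort), while calling `.cache()` first brings that down to one evaluation. The price is holding the rows somewhere in the meantime, which is the memory-versus-disk trade-off discussed elsewhere in the thread.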
-- This message was sent by Atlassian JIRA (v6.3.4#6332)