[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark

2016-05-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15286005#comment-15286005
 ] 

Xuefu Zhang commented on HIVE-13293:


Please commit. Thanks.

> Query occurs performance degradation after enabling parallel order by for 
> Hive on Spark
> ---
>
> Key: HIVE-13293
> URL: https://issues.apache.org/jira/browse/HIVE-13293
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 2.0.0
>Reporter: Lifeng Wang
>Assignee: Rui Li
> Attachments: HIVE-13293.1.patch, HIVE-13293.2.patch, 
> HIVE-13293.3.patch, HIVE-13293.3.patch, HIVE-13293.3.patch
>
>
> I use TPCx-BB to do some performance test on Hive on Spark engine. And found 
> query 10 has performance degradation when enabling parallel order by.
> It seems that sampling cost much time before running the real query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark

2016-05-16 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285944#comment-15285944
 ] 

Rui Li commented on HIVE-13293:
---

Hi [~xuefuz], any further comments on this one?

> Query occurs performance degradation after enabling parallel order by for 
> Hive on Spark
> ---
>
> Key: HIVE-13293
> URL: https://issues.apache.org/jira/browse/HIVE-13293
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 2.0.0
>Reporter: Lifeng Wang
>Assignee: Rui Li
> Attachments: HIVE-13293.1.patch, HIVE-13293.2.patch, 
> HIVE-13293.3.patch, HIVE-13293.3.patch, HIVE-13293.3.patch
>
>
> I use TPCx-BB to do some performance test on Hive on Spark engine. And found 
> query 10 has performance degradation when enabling parallel order by.
> It seems that sampling cost much time before running the real query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark

2016-05-16 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284369#comment-15284369
 ] 

Rui Li commented on HIVE-13293:
---

None of the failures can be reproduced locally. Seems the test framework has 
become quite unstable though.

> Query occurs performance degradation after enabling parallel order by for 
> Hive on Spark
> ---
>
> Key: HIVE-13293
> URL: https://issues.apache.org/jira/browse/HIVE-13293
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 2.0.0
>Reporter: Lifeng Wang
>Assignee: Rui Li
> Attachments: HIVE-13293.1.patch, HIVE-13293.2.patch, 
> HIVE-13293.3.patch, HIVE-13293.3.patch, HIVE-13293.3.patch
>
>
> I use TPCx-BB to do some performance test on Hive on Spark engine. And found 
> query 10 has performance degradation when enabling parallel order by.
> It seems that sampling cost much time before running the real query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark

2016-05-16 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284226#comment-15284226
 ] 

Hive QA commented on HIVE-13293:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12804096/HIVE-13293.3.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 48 failed/errored test(s), 9941 tests 
executed
*Failed tests:*
{noformat}
TestHWISessionManager - did not produce a TEST-*.xml file
TestMiniLlapCliDriver - did not produce a TEST-*.xml file
TestMiniTezCliDriver-auto_join1.q-schema_evol_text_vec_mapwork_part_all_complex.q-vector_complex_join.q-and-12-more
 - did not produce a TEST-*.xml file
TestMiniTezCliDriver-constprog_dpp.q-dynamic_partition_pruning.q-vectorization_10.q-and-12-more
 - did not produce a TEST-*.xml file
TestMiniTezCliDriver-enforce_order.q-vector_partition_diff_num_cols.q-unionDistinct_1.q-and-12-more
 - did not produce a TEST-*.xml file
TestMiniTezCliDriver-explainuser_4.q-update_after_multiple_inserts.q-mapreduce2.q-and-12-more
 - did not produce a TEST-*.xml file
TestMiniTezCliDriver-join1.q-mapjoin_decimal.q-union5.q-and-12-more - did not 
produce a TEST-*.xml file
TestMiniTezCliDriver-load_dyn_part2.q-selectDistinctStar.q-vector_decimal_5.q-and-12-more
 - did not produce a TEST-*.xml file
TestMiniTezCliDriver-vector_coalesce.q-cbo_windowing.q-tez_join.q-and-12-more - 
did not produce a TEST-*.xml file
TestMiniTezCliDriver-vector_distinct_2.q-tez_joins_explain.q-cte_mat_1.q-and-12-more
 - did not produce a TEST-*.xml file
TestMiniTezCliDriver-vector_interval_2.q-schema_evol_text_nonvec_mapwork_part_all_primitive.q-tez_fsstat.q-and-12-more
 - did not produce a TEST-*.xml file
TestMiniTezCliDriver-vectorized_parquet.q-insert_values_non_partitioned.q-schema_evol_orc_nonvec_mapwork_part.q-and-12-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-join1.q-infer_bucket_sort_bucketed_table.q-root_dir_external_table.q-and-1-more
 - did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ivyDownload
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_index_bitmap3
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_join18
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_join_reordering_values
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_1
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_cbo_udf_udaf
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_dynamic_rdd_cache
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby3_map
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_having
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_identity_project_remove_skip
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_insert_into1
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join22
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join_merge_multi_expressions
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_mapjoin_test_outer
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_pcr
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_ptf_seqfile
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_sample6
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_skewjoinopt8
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_smb_mapjoin_11
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_smb_mapjoin_14
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_smb_mapjoin_7
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_stats1
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_subquery_in
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_temp_table
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_udf_max
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_1
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_10
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_11
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_3
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_uniquejoin
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_vector_cast_constant
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_vectorization_short_regress
org.apache.hadoop.hive.llap.tez.TestConverters.testFragmentSpecToTaskSpec
org.apache.hadoop.hive.llap.tezplugins.TestLlapTaskCommunicator.testFinishableStateUpdateFailure
org.apache.hive.service.cli.session.TestHiveSessionImpl.testLeakOperationHandle
{noformat}

Test results: 
http://ec2-54-177-240-2.us-west-1.compute.amazonaws.com/job/PreCommit-HIVE-MASTER-Build/295/testReport
Console output: 

[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark

2016-05-14 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283746#comment-15283746
 ] 

Hive QA commented on HIVE-13293:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12803817/HIVE-13293.3.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: 
http://ec2-54-177-240-2.us-west-1.compute.amazonaws.com/job/PreCommit-HIVE-MASTER-Build/278/testReport
Console output: 
http://ec2-54-177-240-2.us-west-1.compute.amazonaws.com/job/PreCommit-HIVE-MASTER-Build/278/console
Test logs: 
http://ec2-50-18-27-0.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-MASTER-Build-278/

Messages:
{noformat}
 This message was trimmed, see log for full details 
[INFO] Excluding org.apache.spark:spark-core_2.10:jar:1.6.0 from the shaded jar.
[INFO] Excluding com.twitter:chill_2.10:jar:0.5.0 from the shaded jar.
[INFO] Excluding com.twitter:chill-java:jar:0.5.0 from the shaded jar.
[INFO] Excluding org.apache.xbean:xbean-asm5-shaded:jar:4.4 from the shaded jar.
[INFO] Excluding org.apache.hadoop:hadoop-client:jar:2.6.0 from the shaded jar.
[INFO] Excluding org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.6.0 from 
the shaded jar.
[INFO] Excluding org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.6.0 
from the shaded jar.
[INFO] Excluding org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.6.0 
from the shaded jar.
[INFO] Excluding org.apache.spark:spark-launcher_2.10:jar:1.6.0 from the shaded 
jar.
[INFO] Excluding org.apache.spark:spark-network-common_2.10:jar:1.6.0 from the 
shaded jar.
[INFO] Excluding org.apache.spark:spark-network-shuffle_2.10:jar:1.6.0 from the 
shaded jar.
[INFO] Excluding org.apache.spark:spark-unsafe_2.10:jar:1.6.0 from the shaded 
jar.
[INFO] Excluding org.slf4j:jul-to-slf4j:jar:1.7.10 from the shaded jar.
[INFO] Excluding org.slf4j:jcl-over-slf4j:jar:1.7.10 from the shaded jar.
[INFO] Excluding com.ning:compress-lzf:jar:1.0.3 from the shaded jar.
[INFO] Excluding net.jpountz.lz4:lz4:jar:1.3.0 from the shaded jar.
[INFO] Excluding com.typesafe.akka:akka-remote_2.10:jar:2.3.11 from the shaded 
jar.
[INFO] Excluding com.typesafe.akka:akka-actor_2.10:jar:2.3.11 from the shaded 
jar.
[INFO] Excluding com.typesafe:config:jar:1.2.1 from the shaded jar.
[INFO] Excluding org.uncommons.maths:uncommons-maths:jar:1.2.2a from the shaded 
jar.
[INFO] Excluding com.typesafe.akka:akka-slf4j_2.10:jar:2.3.11 from the shaded 
jar.
[INFO] Excluding org.scala-lang:scala-library:jar:2.10.4 from the shaded jar.
[INFO] Excluding org.json4s:json4s-jackson_2.10:jar:3.2.10 from the shaded jar.
[INFO] Excluding org.json4s:json4s-core_2.10:jar:3.2.10 from the shaded jar.
[INFO] Excluding org.json4s:json4s-ast_2.10:jar:3.2.10 from the shaded jar.
[INFO] Excluding org.scala-lang:scalap:jar:2.10.0 from the shaded jar.
[INFO] Excluding org.scala-lang:scala-compiler:jar:2.10.0 from the shaded jar.
[INFO] Excluding org.apache.mesos:mesos:jar:shaded-protobuf:0.21.1 from the 
shaded jar.
[INFO] Excluding com.clearspring.analytics:stream:jar:2.7.0 from the shaded jar.
[INFO] Excluding io.dropwizard.metrics:metrics-graphite:jar:3.1.2 from the 
shaded jar.
[INFO] Excluding 
com.fasterxml.jackson.module:jackson-module-scala_2.10:jar:2.4.4 from the 
shaded jar.
[INFO] Excluding org.scala-lang:scala-reflect:jar:2.10.4 from the shaded jar.
[INFO] Excluding oro:oro:jar:2.0.8 from the shaded jar.
[INFO] Excluding org.tachyonproject:tachyon-client:jar:0.8.2 from the shaded 
jar.
[INFO] Excluding org.tachyonproject:tachyon-underfs-hdfs:jar:0.8.2 from the 
shaded jar.
[INFO] Excluding org.tachyonproject:tachyon-underfs-s3:jar:0.8.2 from the 
shaded jar.
[INFO] Excluding org.tachyonproject:tachyon-underfs-local:jar:0.8.2 from the 
shaded jar.
[INFO] Excluding net.razorvine:pyrolite:jar:4.9 from the shaded jar.
[INFO] Excluding net.sf.py4j:py4j:jar:0.9 from the shaded jar.
[INFO] Excluding org.spark-project.spark:unused:jar:1.0.0 from the shaded jar.
[INFO] Excluding org.slf4j:slf4j-api:jar:1.7.10 from the shaded jar.
[INFO] Replacing original artifact with shaded artifact.
[INFO] Replacing 
/data/hive-ptest/working/apache-github-source-source/ql/target/hive-exec-2.1.0-SNAPSHOT.jar
 with 
/data/hive-ptest/working/apache-github-source-source/ql/target/hive-exec-2.1.0-SNAPSHOT-shaded.jar
[INFO] Dependency-reduced POM written at: 
/data/hive-ptest/working/apache-github-source-source/ql/dependency-reduced-pom.xml
[INFO] Dependency-reduced POM written at: 
/data/hive-ptest/working/apache-github-source-source/ql/dependency-reduced-pom.xml
[INFO] Dependency-reduced POM written at: 
/data/hive-ptest/working/apache-github-source-source/ql/dependency-reduced-pom.xml
[INFO] Dependency-reduced POM written at: 
/data/hive-ptest/working/apache-github-source-source/ql/dependency-reduced-pom.xml
[INFO] Dependency-reduced POM written at: 

[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark

2016-05-13 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282841#comment-15282841
 ] 

Xuefu Zhang commented on HIVE-13293:


Hi [~lirui], Is checking num of partitions necessary because sampling job is 
triggered only if more than one partition?

> Query occurs performance degradation after enabling parallel order by for 
> Hive on Spark
> ---
>
> Key: HIVE-13293
> URL: https://issues.apache.org/jira/browse/HIVE-13293
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 2.0.0
>Reporter: Lifeng Wang
>Assignee: Rui Li
> Attachments: HIVE-13293.1.patch, HIVE-13293.2.patch, 
> HIVE-13293.3.patch
>
>
> I use TPCx-BB to do some performance test on Hive on Spark engine. And found 
> query 10 has performance degradation when enabling parallel order by.
> It seems that sampling cost much time before running the real query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark

2016-05-12 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281488#comment-15281488
 ] 

Hive QA commented on HIVE-13293:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12803166/HIVE-13293.1.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 68 failed/errored test(s), 9194 tests 
executed
*Failed tests:*
{noformat}
TestHWISessionManager - did not produce a TEST-*.xml file
TestMiniLlapCliDriver - did not produce a TEST-*.xml file
TestMiniTezCliDriver-bucket_map_join_tez1.q-auto_sortmerge_join_16.q-skewjoin.q-and-12-more
 - did not produce a TEST-*.xml file
TestMiniTezCliDriver-join1.q-schema_evol_orc_nonvec_mapwork_part.q-mapjoin_decimal.q-and-12-more
 - did not produce a TEST-*.xml file
TestMiniTezCliDriver-load_dyn_part2.q-selectDistinctStar.q-vector_decimal_5.q-and-12-more
 - did not produce a TEST-*.xml file
TestMiniTezCliDriver-mapjoin_mapjoin.q-insert_into1.q-vector_decimal_2.q-and-12-more
 - did not produce a TEST-*.xml file
TestMiniTezCliDriver-order_null.q-vector_acid3.q-orc_merge10.q-and-12-more - 
did not produce a TEST-*.xml file
TestNegativeCliDriver-udf_invalid.q-nopart_insert.q-insert_into_with_schema.q-and-734-more
 - did not produce a TEST-*.xml file
TestSparkCliDriver-ppd_transform.q-union_remove_7.q-date_udf.q-and-12-more - 
did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ivyDownload
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_bucket4
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_bucket5
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_bucket6
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_disable_merge_for_bucketing
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_index_bitmap3
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_infer_bucket_sort_map_operators
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_infer_bucket_sort_num_buckets
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_infer_bucket_sort_reducers_power_two
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_list_bucket_dml_10
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_orc_merge1
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_orc_merge2
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_orc_merge9
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_orc_merge_diff_fs
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_reduce_deduplicate
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_vector_outer_join1
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_vector_outer_join2
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_vector_outer_join3
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_vector_outer_join4
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_vector_outer_join5
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_4
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_cbo_stats
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby2_map_skew
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby7_noskew_multi_single_reducer
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_ppr
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join34
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join35
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join6
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_load_dyn_part2
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_load_dyn_part5
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_multi_insert_gby
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_sample5
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_skewjoinopt14
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_skewjoinopt16
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_17
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_vectorization_3
org.apache.hadoop.hive.llap.daemon.impl.TestTaskExecutorService.testPreemptionQueueComparator
org.apache.hadoop.hive.llap.daemon.impl.comparator.TestShortestJobFirstComparator.testWaitQueueComparatorWithinDagPriority
org.apache.hadoop.hive.llap.tez.TestConverters.testFragmentSpecToTaskSpec
org.apache.hadoop.hive.llap.tezplugins.TestLlapTaskCommunicator.testFinishableStateUpdateFailure
org.apache.hadoop.hive.metastore.TestAuthzApiEmbedAuthorizerInRemote.org.apache.hadoop.hive.metastore.TestAuthzApiEmbedAuthorizerInRemote

[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark

2016-05-11 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280380#comment-15280380
 ] 

Xuefu Zhang commented on HIVE-13293:


Okay. Sounds good. +1

> Query occurs performance degradation after enabling parallel order by for 
> Hive on Spark
> ---
>
> Key: HIVE-13293
> URL: https://issues.apache.org/jira/browse/HIVE-13293
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 2.0.0
>Reporter: Lifeng Wang
>Assignee: Rui Li
> Attachments: HIVE-13293.1.patch, HIVE-13293.1.patch
>
>
> I use TPCx-BB to do some performance test on Hive on Spark engine. And found 
> query 10 has performance degradation when enabling parallel order by.
> It seems that sampling cost much time before running the real query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark

2016-05-11 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280347#comment-15280347
 ] 

Rui Li commented on HIVE-13293:
---

Hi [~xuefuz], yeah order by is mostly at the end of stages. But that doesn't 
mean the amount of data is small - that's why we need parallel order by. During 
our benchmark, we hit OOM for several cases, which is due to some bug in Spark 
1.6.0. So I thought using memory level cache may make it even worse.

To your second question, we unpersist cached RDDs at the end of each job. You 
can refer to {{RemoteDriver#JobWrapper}} for that.

> Query occurs performance degradation after enabling parallel order by for 
> Hive on Spark
> ---
>
> Key: HIVE-13293
> URL: https://issues.apache.org/jira/browse/HIVE-13293
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 2.0.0
>Reporter: Lifeng Wang
>Assignee: Rui Li
> Attachments: HIVE-13293.1.patch, HIVE-13293.1.patch
>
>
> I use TPCx-BB to do some performance test on Hive on Spark engine. And found 
> query 10 has performance degradation when enabling parallel order by.
> It seems that sampling cost much time before running the real query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark

2016-05-11 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280312#comment-15280312
 ] 

Xuefu Zhang commented on HIVE-13293:


[~lirui], thanks for working on this. The patch looks good, but one thing I'm 
not very sure of is the persistence level. Order by is almost always at the end 
of stages. Thus, does it make sense to have a mixed of memory and disk?

As a side, out of scope question, do we need to explicitly call rdd.unpersist() 
for those cached rdds once a query is completed? Right now, rdds are never 
reused across queries.

> Query occurs performance degradation after enabling parallel order by for 
> Hive on Spark
> ---
>
> Key: HIVE-13293
> URL: https://issues.apache.org/jira/browse/HIVE-13293
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 2.0.0
>Reporter: Lifeng Wang
>Assignee: Rui Li
> Attachments: HIVE-13293.1.patch, HIVE-13293.1.patch
>
>
> I use TPCx-BB to do some performance test on Hive on Spark engine. And found 
> query 10 has performance degradation when enabling parallel order by.
> It seems that sampling cost much time before running the real query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark

2016-04-15 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242888#comment-15242888
 ] 

Hive QA commented on HIVE-13293:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12797965/HIVE-13293.1.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 9 failed/errored test(s), 9964 tests executed
*Failed tests:*
{noformat}
TestJdbcWithMiniHS2 - did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_index_compact_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_llap_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_non_ascii_literal2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vector_grouping_sets
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vector_interval_mapjoin
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vector_join_filters
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_index_bitmap3
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby7_noskew_multi_single_reducer
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/7597/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/7597/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-7597/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 9 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12797965 - PreCommit-HIVE-TRUNK-Build

> Query occurs performance degradation after enabling parallel order by for 
> Hive on Spark
> ---
>
> Key: HIVE-13293
> URL: https://issues.apache.org/jira/browse/HIVE-13293
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 2.0.0
>Reporter: Lifeng Wang
>Assignee: Rui Li
> Attachments: HIVE-13293.1.patch
>
>
> I use TPCx-BB to do some performance test on Hive on Spark engine. And found 
> query 10 has performance degradation when enabling parallel order by.
> It seems that sampling cost much time before running the real query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark

2016-04-12 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238556#comment-15238556
 ] 

Rui Li commented on HIVE-13293:
---

Thanks [~xuefuz] for the review.
I mean it can work with queries that have only one ShuffleMapStage. It will 
definitely work with queries that have multiple ShuffleMapStage too. But as I 
said in previous comment, what we care about here is just the last 
ShuffleMapStage because that's what gets re-computed in parallel order by.
On the other hand, splitting task that has only one ShuffleMapStage seems weird 
and may be bad for performance. That's why I chose to cache the RDD.

> Query occurs performance degradation after enabling parallel order by for 
> Hive on Spark
> ---
>
> Key: HIVE-13293
> URL: https://issues.apache.org/jira/browse/HIVE-13293
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 2.0.0
>Reporter: Lifeng Wang
>Assignee: Rui Li
> Attachments: HIVE-13293.1.patch
>
>
> I use TPCx-BB to do some performance test on Hive on Spark engine. And found 
> query 10 has performance degradation when enabling parallel order by.
> It seems that sampling cost much time before running the real query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark

2016-04-12 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237434#comment-15237434
 ] 

Xuefu Zhang commented on HIVE-13293:


[~lirui], thanks for the investigation and the patch, which seems simple and 
straightforward. One question: what do you mean by "only works queries that 
have only one ShuffleMapStage"? In your previous example, there are actually a 
few such stages. Isn't your patch supposed to help that as well?

> Query occurs performance degradation after enabling parallel order by for 
> Hive on Spark
> ---
>
> Key: HIVE-13293
> URL: https://issues.apache.org/jira/browse/HIVE-13293
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 2.0.0
>Reporter: Lifeng Wang
>Assignee: Rui Li
> Attachments: HIVE-13293.1.patch
>
>
> I use TPCx-BB to do some performance test on Hive on Spark engine. And found 
> query 10 has performance degradation when enabling parallel order by.
> It seems that sampling cost much time before running the real query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark

2016-03-25 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211748#comment-15211748
 ] 

Rui Li commented on HIVE-13293:
---

Just did some research about this. Actually the overhead is not so big as I 
thought. If a query is complicated, we'll have multiple stages. For spark, 
intermediate stages are called {{ShuffleMapStage}} and the last stage is called 
{{ResultStage}}. Suppose we have the following stage graph:
{noformat}
ShuffleMapStage1
  | (join)
ShuflleMapStage2
  | (groupBy)
ShuffleMapStage3
  | (sortByKey)
   ResultStage4
{noformat}
When calling sortByKey, spark launches the sampling job. The job triggers 
computation of ShuffleMapStage1, ShuflleMapStage2 and a ResultStage that shares 
most of ShuffleMapStage3. When we launch the real job, we'll submit the above 
stage graph. But at this point, spark will consider ShuffleMapStage1 and 
ShuflleMapStage2 as already computed because the shuffle outputs are still in 
local disk. Therefore what's re-computed is just ShuffleMapStage3. I have done 
some tests to verify this.
That being said, when ShuffleMapStage3 is complicated enough, we'll still have 
some considerable overhead. And I think that's the case for Q10 in TPCx-BB.
Rather than splitting the task, I think a better and easier way is to cache the 
RDD before calling sortByKey. We can use DISK_ONLY storage level if memory is a 
concern. I'll come up with a patch for review.

> Query occurs performance degradation after enabling parallel order by for 
> Hive on Spark
> ---
>
> Key: HIVE-13293
> URL: https://issues.apache.org/jira/browse/HIVE-13293
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 2.0.0
>Reporter: Lifeng Wang
>Assignee: Rui Li
>
> I use TPCx-BB to do some performance test on Hive on Spark engine. And found 
> query 10 has performance degradation when enabling parallel order by.
> It seems that sampling cost much time before running the real query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)