[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-17 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827378#comment-15827378
 ] 

Xuefu Zhang commented on HIVE-15527:


Hi [~Ferd], [~dapengsun], we found a better fix, and the patch here will likely be 
abandoned. The new fix is in HIVE-15580, and the patch there addresses both 
issues. Please feel free to try that patch and provide your feedback. Thanks.

> Memory usage is unbound in SortByShuffler for Spark
> ---------------------------------------------------
>
> Key: HIVE-15527
> URL: https://issues.apache.org/jira/browse/HIVE-15527
> Project: Hive
> Issue Type: Improvement
> Components: Spark
> Affects Versions: 1.1.0
> Reporter: Xuefu Zhang
> Assignee: Chao Sun
> Attachments: HIVE-15527.0.patch, HIVE-15527.0.patch, 
> HIVE-15527.1.patch, HIVE-15527.2.patch, HIVE-15527.3.patch, 
> HIVE-15527.4.patch, HIVE-15527.5.patch, HIVE-15527.6.patch, 
> HIVE-15527.7.patch, HIVE-15527.8.patch, HIVE-15527.patch
>
> In SortByShuffler.java, an ArrayList is used to back the iterator for values 
> that have the same key in the shuffled result produced by the Spark transformation 
> sortByKey. It's possible for memory to be exhausted by a large key group.
> {code}
>     @Override
>     public Tuple2<HiveKey, Iterable<BytesWritable>> next() {
>       // TODO: implement this by accumulating rows with the same key into a list.
>       // Note that this list needs to be improved to prevent excessive memory
>       // usage, but this can be done in a later phase.
>       while (it.hasNext()) {
>         Tuple2<HiveKey, BytesWritable> pair = it.next();
>         if (curKey != null && !curKey.equals(pair._1())) {
>           HiveKey key = curKey;
>           List<BytesWritable> values = curValues;
>           curKey = pair._1();
>           curValues = new ArrayList<BytesWritable>();
>           curValues.add(pair._2());
>           return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, values);
>         }
>         curKey = pair._1();
>         curValues.add(pair._2());
>       }
>       if (curKey == null) {
>         throw new NoSuchElementException();
>       }
>       // if we get here, this should be the last element we have
>       HiveKey key = curKey;
>       curKey = null;
>       return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, curValues);
>     }
> {code}
> Since the output from sortByKey is already sorted on key, it should be possible to 
> back the value iterable with the same input iterator.
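To make that last idea concrete, here is a minimal, hypothetical sketch (editor's 
illustration, not the attached patch; all class and method names are made up): 
because the input is sorted by key, a grouping iterator can hand out a lazy per-key 
value iterator that reads from the same underlying input, so no per-group ArrayList 
is needed. It assumes values are non-null and that each group is fully drained 
before the next one is requested.
{code}
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch of a streaming group iterator over a key-sorted input.
class LazyGroupIterator<K, V> {

  // Simple immutable pair; stands in for scala.Tuple2 in this sketch.
  static final class Pair<K, V> {
    final K key;
    final V value;
    Pair(K key, V value) { this.key = key; this.value = value; }
  }

  private final Iterator<Pair<K, V>> input;  // key-sorted input
  private Pair<K, V> pending;                // first pair of the next group, if already read

  LazyGroupIterator(Iterator<Pair<K, V>> input) {
    this.input = input;
  }

  boolean hasNextGroup() {
    return pending != null || input.hasNext();
  }

  // Returns the next key together with a lazy iterator over its values.
  // The returned iterator shares 'input', so it must be drained before
  // nextGroup() is called again.
  Pair<K, Iterator<V>> nextGroup() {
    if (!hasNextGroup()) {
      throw new NoSuchElementException();
    }
    final Pair<K, V> first = (pending != null) ? pending : input.next();
    pending = null;
    final K key = first.key;
    Iterator<V> values = new Iterator<V>() {
      private V buffered = first.value;   // assumes values are never null
      private boolean done = false;       // set once this group's last value was seen
      @Override
      public boolean hasNext() {
        if (buffered != null) {
          return true;
        }
        if (done || !input.hasNext()) {
          return false;
        }
        Pair<K, V> pair = input.next();
        if (pair.key.equals(key)) {
          buffered = pair.value;
          return true;
        }
        pending = pair;                   // start of the following group, keep for later
        done = true;
        return false;
      }
      @Override
      public V next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        V value = buffered;
        buffered = null;
        return value;
      }
    };
    return new Pair<K, Iterator<V>>(key, values);
  }
}
{code}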





[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-17 Thread Dapeng Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827376#comment-15827376
 ] 

Dapeng Sun commented on HIVE-15527:
---

Thanks [~csun] and [~Ferd], here is the detailed log:
{noformat}
17/01/17 xx:xx:xx INFO client.RemoteDriver: Failed to run job

java.lang.NumberFormatException: null
  at java.lang.Long.parseLong(Long.java:552)
  at java.lang.Long.parseLong(Long.java:631)
  at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:202)
  at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateParentTran(SparkPlanGenerator.java:141)
  at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
  at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:335)
  at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:366)
  at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:335)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
17/01/17 xx:xx:xx INFO client.RemoteDriver: Shutting down remote driver.
{noformat}




[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-17 Thread Ferdinand Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827369#comment-15827369
 ] 

Ferdinand Xu commented on HIVE-15527:
-

Hi [~csun], thank you for working on this issue. We can reproduce the OOM caused 
by the ArrayList when the data grows to 100TB with a skewed key distribution. We're 
trying your patch to see whether it works. BTW, a minor issue exists in your patch: 
if SPARK_SHUFFLE_BUFFER_SIZE is not set, jobConf.get() returns null, and that null 
is passed to the parseLong(String) method.
{noformat}
long maxBufferSize = Long.parseLong(this.jobConf.get(HiveConf.ConfVars.SPARK_SHUFFLE_BUFFER_SIZE.varname));
{noformat}
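For what it's worth, a hedged sketch of one way to avoid that (editor's illustration, 
not part of the patch; the 1024 fallback is made up for the example, the patch 
defines its own default):
{code}
// Read the buffer size with an explicit fallback so a missing property can
// never reach Long.parseLong(null). Configuration.getLong() parses the value
// and returns the supplied default when the property is absent.
long maxBufferSize = jobConf.getLong(
    HiveConf.ConfVars.SPARK_SHUFFLE_BUFFER_SIZE.varname, 1024L);
{code}
Alternatively, HiveConf.getLongVar(jobConf, HiveConf.ConfVars.SPARK_SHUFFLE_BUFFER_SIZE) 
should fall back to the default declared on the ConfVars entry itself.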




[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-11 Thread Chao Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819439#comment-15819439
 ] 

Chao Sun commented on HIVE-15527:
-

[~lirui], your concern is valid. I was using the patch to test the cases where 
the iterator could be reused, or where the inner iterator is not fully consumed. 
Basically I just wanted to check whether this approach is possible.



[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-11 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818027#comment-15818027
 ] 

Rui Li commented on HIVE-15527:
---

Hi [~csun], it seems the current solution relies on how the iterators are used: 
the inner iterator must be fully consumed before the outer iterator is advanced. 
Is there any guarantee this assumption will hold? Even if it works at the 
moment, I'm afraid changes in Hive or Spark could easily break it.
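To make the concern concrete, a hypothetical usage sketch (names borrowed from the 
illustrative LazyGroupIterator earlier in the thread; groups stands for such an 
iterator over the sorted shuffle output and consume() is a placeholder):
{code}
// Safe: the inner (value) iterator is fully drained before the outer one advances.
while (groups.hasNextGroup()) {
  LazyGroupIterator.Pair<HiveKey, Iterator<BytesWritable>> group = groups.nextGroup();
  while (group.value.hasNext()) {
    consume(group.key, group.value.next());
  }
}

// Unsafe with a shared underlying iterator: keeping a group's value iterator
// around while advancing the outer iterator makes both pull from the same
// input, so groups get split and the saved iterator no longer corresponds
// to its original key.
Iterator<BytesWritable> saved = groups.nextGroup().value;
groups.nextGroup();   // advances the shared input before 'saved' is drained
saved.next();         // results are now undefined
{code}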



[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-11 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15817998#comment-15817998
 ] 

Hive QA commented on HIVE-15527:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12846750/HIVE-15527.7.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 19 failed/errored test(s), 10766 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=233)
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=101)

[limit_pushdown2.q,skewjoin_noskew.q,leftsemijoin_mr.q,bucket3.q,skewjoinopt13.q,bucketmapjoin9.q,auto_join15.q,ptf.q,join22.q,vectorized_nested_mapjoin.q,sample4.q,union18.q,multi_insert_gby.q,join33.q,join_cond_pushdown_unqual2.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=108)

[union_remove_1.q,ppd_outer_join2.q,date_udf.q,groupby1_noskew.q,join20.q,smb_mapjoin_13.q,groupby_rollup1.q,temp_table_gb1.q,vector_string_concat.q,smb_mapjoin_6.q,metadata_only_queries.q,auto_sortmerge_join_12.q,groupby_bigdata.q,groupby3_map_multi_distinct.q,innerjoin.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=114)

[escape_distributeby1.q,join9.q,groupby2.q,groupby4_map.q,udf_max.q,vectorization_pushdown.q,cbo_gby_empty.q,join_cond_pushdown_unqual3.q,vectorization_short_regress.q,join8.q,sample10.q,cross_product_check_1.q,auto_join_stats.q,input_part2.q,groupby_multi_single_reducer3.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=115)

[groupby_map_ppr_multi_distinct.q,vectorization_13.q,mapjoin_mapjoin.q,union2.q,join41.q,groupby8_map.q,cbo_subq_not_in.q,identity_project_remove_skip.q,stats5.q,groupby8_map_skew.q,nullgroup2.q,mapjoin_subquery.q,bucket2.q,smb_mapjoin_1.q,union_remove_8.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=117)

[timestamp_lazy.q,union29.q,runtime_skewjoin_mapjoin_spark.q,auto_join22.q,union8.q,groupby5_map.q,dynamic_rdd_cache.q,auto_join29.q,groupby6.q,merge1.q,mapjoin_distinct.q,vector_decimal_mapjoin.q,sample5.q,multi_insert_move_tasks_share_dependencies.q,join_array.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=119)

[groupby4_noskew.q,groupby3_map_skew.q,join_cond_pushdown_2.q,union19.q,union24.q,union_remove_5.q,groupby7_noskew_multi_single_reducer.q,vectorization_1.q,index_auto_self_join.q,auto_smb_mapjoin_14.q,script_env_var2.q,pcr.q,auto_join_filters.q,join0.q,join37.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=120)

[stats12.q,groupby4.q,union_top_level.q,stats2.q,groupby10.q,mapjoin_filter_on_outerjoin.q,auto_sortmerge_join_4.q,limit_partition_metadataonly.q,load_dyn_part4.q,union3.q,groupby_multi_single_reducer.q,smb_mapjoin_14.q,groupby3_noskew_multi_distinct.q,stats18.q,union_remove_21.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=123)

[groupby_complex_types.q,multigroupby_singlemr.q,union11.q,groupby7.q,join5.q,bucketmapjoin_negative2.q,vectorization_div0.q,union_script.q,add_part_multiple.q,limit_pushdown.q,union_remove_17.q,uniquejoin.q,metadata_only_queries_with_filters.q,union25.q,load_dyn_part13.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=126)

[smb_mapjoin_15.q,script_pipe.q,auto_join24.q,filter_join_breaktask.q,bucket4.q,ppd_multi_insert.q,skewjoinopt20.q,join_thrift.q,multi_insert_gby3.q,groupby8.q,join_map_ppr.q,auto_sortmerge_join_8.q,escape_clusterby1.q,groupby_multi_insert_common_distinct.q,join6.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=131)

[join2.q,join36.q,avro_joins_native.q,join18.q,smb_mapjoin_10.q,temp_table.q,union_remove_13.q,auto_sortmerge_join_5.q,groupby5_noskew.q,auto_join0.q,vectorization_17.q,auto_join_stats2.q,skewjoin_union_remove_1.q,union16.q,join_literals.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=94)

[bucketmapjoin4.q,bucket_map_join_spark4.q,union21.q,groupby2_noskew.q,timestamp_2.q,date_join1.q,mergejoins.q,smb_mapjoin_11.q,auto_sortmerge_join_3.q,mapjoin_test_outer.q,vectorization_9.q,merge2.q,groupby6_noskew.q,auto_join_without_localtask.q,multi_join_union.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=98)

[groupby_map_ppr.q,nullgroup4_multi_distinct.q,join_rc.q,union14.q,smb_mapjoin_12.q,vector_cast_constant.q,union_remove_4.q,auto_join11.q,load_dyn_part7.q,udaf_collect_set.q,vectorization_12.q,groupby_sort_skew_1.q,groupby_sort_skew_1_23.q,smb_mapjoin_25.q,skewjoinopt12.q]

[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-10 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15817121#comment-15817121
 ] 

Hive QA commented on HIVE-15527:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12846698/HIVE-15527.7.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 19 failed/errored test(s), 10769 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=233)
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=101)

[limit_pushdown2.q,skewjoin_noskew.q,leftsemijoin_mr.q,bucket3.q,skewjoinopt13.q,bucketmapjoin9.q,auto_join15.q,ptf.q,join22.q,vectorized_nested_mapjoin.q,sample4.q,union18.q,multi_insert_gby.q,join33.q,join_cond_pushdown_unqual2.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=108)

[union_remove_1.q,ppd_outer_join2.q,date_udf.q,groupby1_noskew.q,join20.q,smb_mapjoin_13.q,groupby_rollup1.q,temp_table_gb1.q,vector_string_concat.q,smb_mapjoin_6.q,metadata_only_queries.q,auto_sortmerge_join_12.q,groupby_bigdata.q,groupby3_map_multi_distinct.q,innerjoin.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=114)

[escape_distributeby1.q,join9.q,groupby2.q,groupby4_map.q,udf_max.q,vectorization_pushdown.q,cbo_gby_empty.q,join_cond_pushdown_unqual3.q,vectorization_short_regress.q,join8.q,sample10.q,cross_product_check_1.q,auto_join_stats.q,input_part2.q,groupby_multi_single_reducer3.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=115)

[groupby_map_ppr_multi_distinct.q,vectorization_13.q,mapjoin_mapjoin.q,union2.q,join41.q,groupby8_map.q,cbo_subq_not_in.q,identity_project_remove_skip.q,stats5.q,groupby8_map_skew.q,nullgroup2.q,mapjoin_subquery.q,bucket2.q,smb_mapjoin_1.q,union_remove_8.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=117)

[timestamp_lazy.q,union29.q,runtime_skewjoin_mapjoin_spark.q,auto_join22.q,union8.q,groupby5_map.q,dynamic_rdd_cache.q,auto_join29.q,groupby6.q,merge1.q,mapjoin_distinct.q,vector_decimal_mapjoin.q,sample5.q,multi_insert_move_tasks_share_dependencies.q,join_array.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=119)

[groupby4_noskew.q,groupby3_map_skew.q,join_cond_pushdown_2.q,union19.q,union24.q,union_remove_5.q,groupby7_noskew_multi_single_reducer.q,vectorization_1.q,index_auto_self_join.q,auto_smb_mapjoin_14.q,script_env_var2.q,pcr.q,auto_join_filters.q,join0.q,join37.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=120)

[stats12.q,groupby4.q,union_top_level.q,stats2.q,groupby10.q,mapjoin_filter_on_outerjoin.q,auto_sortmerge_join_4.q,limit_partition_metadataonly.q,load_dyn_part4.q,union3.q,groupby_multi_single_reducer.q,smb_mapjoin_14.q,groupby3_noskew_multi_distinct.q,stats18.q,union_remove_21.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=123)

[groupby_complex_types.q,multigroupby_singlemr.q,union11.q,groupby7.q,join5.q,bucketmapjoin_negative2.q,vectorization_div0.q,union_script.q,add_part_multiple.q,limit_pushdown.q,union_remove_17.q,uniquejoin.q,metadata_only_queries_with_filters.q,union25.q,load_dyn_part13.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=126)

[smb_mapjoin_15.q,script_pipe.q,auto_join24.q,filter_join_breaktask.q,bucket4.q,ppd_multi_insert.q,skewjoinopt20.q,join_thrift.q,multi_insert_gby3.q,groupby8.q,join_map_ppr.q,auto_sortmerge_join_8.q,escape_clusterby1.q,groupby_multi_insert_common_distinct.q,join6.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=131)

[join2.q,join36.q,avro_joins_native.q,join18.q,smb_mapjoin_10.q,temp_table.q,union_remove_13.q,auto_sortmerge_join_5.q,groupby5_noskew.q,auto_join0.q,vectorization_17.q,auto_join_stats2.q,skewjoin_union_remove_1.q,union16.q,join_literals.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=94)

[bucketmapjoin4.q,bucket_map_join_spark4.q,union21.q,groupby2_noskew.q,timestamp_2.q,date_join1.q,mergejoins.q,smb_mapjoin_11.q,auto_sortmerge_join_3.q,mapjoin_test_outer.q,vectorization_9.q,merge2.q,groupby6_noskew.q,auto_join_without_localtask.q,multi_join_union.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=98)

[groupby_map_ppr.q,nullgroup4_multi_distinct.q,join_rc.q,union14.q,smb_mapjoin_12.q,vector_cast_constant.q,union_remove_4.q,auto_join11.q,load_dyn_part7.q,udaf_collect_set.q,vectorization_12.q,groupby_sort_skew_1.q,groupby_sort_skew_1_23.q,smb_mapjoin_25.q,skewjoinopt12.q]

[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-10 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15816676#comment-15816676
 ] 

Hive QA commented on HIVE-15527:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12846683/HIVE-15527.7.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 281 failed/errored test(s), 8212 tests 
executed
*Failed tests:*
{noformat}
TestAccumuloCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=217)
TestAccumuloConnectionParameters - did not produce a TEST-*.xml file (likely 
timed out) (batchId=165)
TestAccumuloHiveRow - did not produce a TEST-*.xml file (likely timed out) 
(batchId=165)
TestAccumuloPredicateHandler - did not produce a TEST-*.xml file (likely timed 
out) (batchId=165)
TestAccumuloRangeGenerator - did not produce a TEST-*.xml file (likely timed 
out) (batchId=165)
TestAccumuloRowSerializer - did not produce a TEST-*.xml file (likely timed 
out) (batchId=165)
TestAccumuloSerDe - did not produce a TEST-*.xml file (likely timed out) 
(batchId=165)
TestAccumuloSerDeParameters - did not produce a TEST-*.xml file (likely timed 
out) (batchId=165)
TestAccumuloStorageHandler - did not produce a TEST-*.xml file (likely timed 
out) (batchId=165)
TestAcidOnTez - did not produce a TEST-*.xml file (likely timed out) 
(batchId=202)
TestAcidOnTezWithSplitUpdate - did not produce a TEST-*.xml file (likely timed 
out) (batchId=205)
TestAddResource - did not produce a TEST-*.xml file (likely timed out) 
(batchId=269)
TestAdminUser - did not produce a TEST-*.xml file (likely timed out) 
(batchId=197)
TestAggregateStatsCache - did not produce a TEST-*.xml file (likely timed out) 
(batchId=186)
TestAvroHCatLoader - did not produce a TEST-*.xml file (likely timed out) 
(batchId=172)
TestAvroHCatStorer - did not produce a TEST-*.xml file (likely timed out) 
(batchId=172)
TestBeeLineExceptionHandling - did not produce a TEST-*.xml file (likely timed 
out) (batchId=167)
TestBeeLineHistory - did not produce a TEST-*.xml file (likely timed out) 
(batchId=167)
TestBeelineArgParsing - did not produce a TEST-*.xml file (likely timed out) 
(batchId=167)
TestBlobstoreCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=229)
TestBlobstoreNegativeCliDriver - did not produce a TEST-*.xml file (likely 
timed out) (batchId=229)
TestCLIAuthzSessionContext - did not produce a TEST-*.xml file (likely timed 
out) (batchId=215)
TestCleaner2 - did not produce a TEST-*.xml file (likely timed out) 
(batchId=248)
TestClientCommandHookFactory - did not produce a TEST-*.xml file (likely timed 
out) (batchId=167)
TestColumnAccess - did not produce a TEST-*.xml file (likely timed out) 
(batchId=256)
TestColumnEncoding - did not produce a TEST-*.xml file (likely timed out) 
(batchId=165)
TestColumnMapper - did not produce a TEST-*.xml file (likely timed out) 
(batchId=165)
TestColumnMappingFactory - did not produce a TEST-*.xml file (likely timed out) 
(batchId=165)
TestCommandProcessorFactory - did not produce a TEST-*.xml file (likely timed 
out) (batchId=274)
TestCompileProcessor - did not produce a TEST-*.xml file (likely timed out) 
(batchId=274)
TestConditionalResolverCommonJoin - did not produce a TEST-*.xml file (likely 
timed out) (batchId=270)
TestContext - did not produce a TEST-*.xml file (likely timed out) (batchId=261)
TestContribCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=220)
TestCreateMacroDesc - did not produce a TEST-*.xml file (likely timed out) 
(batchId=270)
TestCuckooSet - did not produce a TEST-*.xml file (likely timed out) 
(batchId=263)
TestDbTxnManager - did not produce a TEST-*.xml file (likely timed out) 
(batchId=274)
TestDeadline - did not produce a TEST-*.xml file (likely timed out) 
(batchId=187)
TestDefaultAccumuloRowIdFactory - did not produce a TEST-*.xml file (likely 
timed out) (batchId=165)
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=233)
TestDoubleCompare - did not produce a TEST-*.xml file (likely timed out) 
(batchId=165)
TestDropMacroDesc - did not produce a TEST-*.xml file (likely timed out) 
(batchId=270)
TestDummy - did not produce a TEST-*.xml file (likely timed out) (batchId=217)
TestDummyTxnManager - did not produce a TEST-*.xml file (likely timed out) 
(batchId=274)
TestE2EScenarios - did not produce a TEST-*.xml file (likely timed out) 
(batchId=172)
TestEmbeddedLockManager - did not produce a TEST-*.xml file (likely timed out) 
(batchId=273)
TestEmbeddedThriftBinaryCLIService - did not produce a TEST-*.xml file (likely 
timed out) (batchId=211)
TestEncryptedHDFSCliDriver - did not produce a TEST-*.xml file (likely timed 
out) (batchId=156)


[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-10 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15815544#comment-15815544
 ] 

Hive QA commented on HIVE-15527:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12846609/HIVE-15527.0.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 51 failed/errored test(s), 10296 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=233)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] 
(batchId=61)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] 
(batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_coalesce] 
(batchId=75)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] 
(batchId=134)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a]
 (batchId=135)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[offset_limit_ppd_optimizer]
 (batchId=150)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[schema_evol_text_vec_part]
 (batchId=148)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver
 (batchId=159)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver
 (batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver
 (batchId=161)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=100)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=101)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=102)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=103)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=104)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=105)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=106)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=107)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=108)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=109)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=110)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=111)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=112)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=113)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=114)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=115)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=116)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=117)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=118)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=119)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=120)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=121)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=122)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=123)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=124)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=125)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=126)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=127)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=128)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=129)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 

[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-10 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15815326#comment-15815326
 ] 

Hive QA commented on HIVE-15527:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12846604/HIVE-15527.0.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 51 failed/errored test(s), 10296 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=233)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] 
(batchId=61)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] 
(batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_coalesce] 
(batchId=75)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] 
(batchId=134)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a]
 (batchId=135)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[offset_limit_ppd_optimizer]
 (batchId=150)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[schema_evol_text_vec_part]
 (batchId=148)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver
 (batchId=159)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver
 (batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver
 (batchId=161)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=100)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=101)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=102)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=103)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=104)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=105)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=106)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=107)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=108)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=109)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=110)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=111)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=112)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=113)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=114)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=115)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=116)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=117)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=118)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=119)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=120)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=121)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=122)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=123)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=124)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=125)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=126)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=127)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=128)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=129)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 

[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-09 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813670#comment-15813670
 ] 

Hive QA commented on HIVE-15527:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12846469/HIVE-15527.6.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 14 failed/errored test(s), 10918 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=233)
TestSSL - did not produce a TEST-*.xml file (likely timed out) (batchId=212)
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=103)

[skewjoinopt19.q,order.q,join_merge_multi_expressions.q,skewjoinopt10.q,insert_into1.q,vectorized_math_funcs.q,vectorization_4.q,vectorization_2.q,skewjoinopt6.q,union_remove_19.q,decimal_1_1.q,join14.q,outer_join_ppr.q,rcfile_bigdata.q,load_dyn_part10.q]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] 
(batchId=61)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] 
(batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_coalesce] 
(batchId=75)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] 
(batchId=134)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[offset_limit_ppd_optimizer]
 (batchId=150)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[schema_evol_text_vec_part]
 (batchId=148)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_notin]
 (batchId=150)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby7_noskew_multi_single_reducer]
 (batchId=119)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[multi_insert_with_join]
 (batchId=122)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ptf] (batchId=101)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorized_ptf] 
(batchId=121)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2852/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2852/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2852/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 14 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12846469 - PreCommit-HIVE-Build


[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-09 Thread Chao Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813088#comment-15813088
 ] 

Chao Sun commented on HIVE-15527:
-

The test failures are not related. [~lirui], [~xuefuz], can you take a look? Thanks.



[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-09 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812719#comment-15812719
 ] 

Hive QA commented on HIVE-15527:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12846390/HIVE-15527.5.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 10918 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=233)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] 
(batchId=61)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] 
(batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_coalesce] 
(batchId=75)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] 
(batchId=134)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[offset_limit_ppd_optimizer]
 (batchId=150)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2843/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2843/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2843/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12846390 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-04 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15799429#comment-15799429
 ] 

Hive QA commented on HIVE-15527:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12845609/HIVE-15527.4.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 25 failed/errored test(s), 10914 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=233)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] 
(batchId=61)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] 
(batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_coalesce] 
(batchId=75)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] 
(batchId=134)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a]
 (batchId=135)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[vector_outer_join5]
 (batchId=161)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_join_without_localtask]
 (batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_3]
 (batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[bucket_map_join_spark4]
 (batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[bucketmapjoin4] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[date_join1] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby2_noskew] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby6_noskew] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[mapjoin_test_outer] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[merge2] (batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[multi_join_union] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ptf] (batchId=101)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[smb_mapjoin_11] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[timestamp_2] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[union21] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vector_between_in] 
(batchId=118)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorization_9] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorized_ptf] 
(batchId=121)
org.apache.hive.jdbc.TestMultiSessionsHS2WithLocalClusterSpark.testSparkQuery 
(batchId=215)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2781/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2781/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2781/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 25 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12845609 - PreCommit-HIVE-Build


[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-03 Thread Chao Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15795782#comment-15795782
 ] 

Chao Sun commented on HIVE-15527:
-

I'll take over this from [~xuefuz]. Will attempt to address the above issues 
and submit a new patch.



[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-02 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15794194#comment-15794194
 ] 

Xuefu Zhang commented on HIVE-15527:


[~csun], [~lirui], and [~kellyzly], thanks for your feedback. The patch here is 
more like a POC, so improvement is needed for production. Here are a few 
thoughts:

1) I'm not sure what caused the result diff, though there might be a bug in 
HiveKVResultCache that is being surfaced here. The diff seems invalid when compared 
to the MR result. Also, there seems to be some randomness in how the diff is generated.

2) As to performance, Rui's concern is valid. What I tried to demo is that 
we need something similar to HiveKVResultCache but only for values (see the 
sketch after this comment).

3) Similar to 2), we need a good cache size to avoid file I/O for regular 
group sizes. Currently HiveKVResultCache keeps only 1024 rows in its in-memory 
cache, which seems rather small.

4) The performance impact needs to be evaluated.

5) The idea here could also be used to solve the same problem for Spark's 
groupByKey() in Hive: we could use Spark's reduceByKey() instead, and do the 
in-group value caching in Hive the same way as here.

I'm not sure if I have bandwidth to move this forward at full speed. Please 
feel free to take this (and other issues) forward. Thanks.
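As an editor's illustration of points 2) and 3) (hypothetical code, not 
HiveKVResultCache and not part of any attached patch), the shape of a values-only 
cache with a configurable in-memory threshold might look like the sketch below; 
reading the rows back and the actual spill storage are left out:
{code}
import java.util.ArrayList;
import java.util.List;

// Values-only cache for a single key group: keeps up to 'threshold' rows in
// memory and pushes the overflow to some spillable store. HiveKVResultCache
// plays a similar role today, but it stores key/value pairs and keeps a fixed
// 1024 rows in memory.
abstract class ValueCache<V> {
  private final int threshold;                       // configurable, per point 3)
  private final List<V> inMemory = new ArrayList<V>();

  ValueCache(int threshold) {
    this.threshold = threshold;
  }

  void add(V value) {
    if (inMemory.size() < threshold) {
      inMemory.add(value);                           // common case: small groups never spill
    } else {
      spill(value);                                  // only very large key groups pay file I/O
    }
  }

  // How spilled rows are stored and read back (local file, row container, ...)
  // is the part that needs a real implementation and a performance evaluation.
  protected abstract void spill(V value);
}
{code}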




[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-02 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15793967#comment-15793967
 ] 

liyunzhang_intel commented on HIVE-15527:
-

[~xuefuz] and [~lirui]: HiveKVResultCache writes key/value pairs to disk when its 
buffer is full, which puts some limit on memory usage. But is there anything 
showing that the ArrayList actually uses a lot of memory? Was this tested with a 
memory analysis tool?
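
One quick way to get a rough number without a full memory analysis tool is to instrument the value list itself and log its retained payload when a group crosses some threshold. The sketch below is hypothetical; the per-element overhead constant is only a ballpark, since the exact figure depends on the JVM:

{code}
import java.util.List;

import org.apache.hadoop.io.BytesWritable;

final class GroupSizeEstimator {
  // Rough per-element JVM overhead (object header, length/size fields, byte[] header,
  // ArrayList slot). This is an estimate, not an exact measurement.
  private static final long PER_ELEMENT_OVERHEAD_BYTES = 64L;

  // Approximate heap retained by the curValues list for one key group.
  static long estimateRetainedBytes(List<BytesWritable> curValues) {
    long payload = 0L;
    for (BytesWritable v : curValues) {
      payload += v.getCapacity();  // bytes actually allocated in the backing array
    }
    return payload + (long) curValues.size() * PER_ELEMENT_OVERHEAD_BYTES;
  }
}
{code}

Logging this per group would show directly whether a skewed key is inflating the ArrayList, without needing a heap dump.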






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-02 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15793903#comment-15793903
 ] 

Rui Li commented on HIVE-15527:
---

Since the HiveKVResultCache here only stores values for the same key, I think 
we can avoid Ser/De of the HiveKey to improve performance?
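
A rough sketch of what that saving looks like at the spill boundary: a value-only cache writes just the BytesWritable payload, whereas a key/value cache also serializes (and later deserializes) the HiveKey for every row even though the key is identical within the group. The method names below are illustrative and not taken from HiveKVResultCache:

{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.hive.ql.io.HiveKey;
import org.apache.hadoop.io.BytesWritable;

final class SpillFormats {
  // Key/value layout: the same HiveKey bytes get written once per row.
  static void writeKeyValue(DataOutput out, HiveKey key, BytesWritable value) throws IOException {
    key.write(out);
    value.write(out);
  }

  // Value-only layout: within one group the key is implicit, so only the value is written.
  static void writeValueOnly(DataOutput out, BytesWritable value) throws IOException {
    value.write(out);
  }

  static BytesWritable readValueOnly(DataInput in) throws IOException {
    BytesWritable value = new BytesWritable();
    value.readFields(in);
    return value;
  }
}
{code}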




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-02 Thread Chao Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15793570#comment-15793570
 ] 

Chao Sun commented on HIVE-15527:
-

Patch looks good. Any idea why the qfile result is different?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-01 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15791917#comment-15791917
 ] 

Hive QA commented on HIVE-15527:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12845259/HIVE-15527.3.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 10913 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=233)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] 
(batchId=134)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] 
(batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ptf] (batchId=101)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorized_ptf] 
(batchId=121)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2757/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2757/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2757/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12845259 - PreCommit-HIVE-Build




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-01 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15791783#comment-15791783
 ] 

Hive QA commented on HIVE-15527:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12845257/HIVE-15527.2.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 8 failed/errored test(s), 10883 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=233)
TestMiniLlapLocalCliDriver - did not produce a TEST-*.xml file (likely timed 
out) (batchId=151)

[smb_mapjoin_15.q,insert_values_partitioned.q,selectDistinctStar.q,bucket4.q,vectorized_distinct_gby.q,vector_groupby_mapjoin.q,insert_values_dynamic_partitioned.q,vector_nvl.q,join_nullsafe.q,vectorized_mapjoin.q,vectorized_shufflejoin.q,tez_smb_1.q,cbo_union.q,tez_vector_dynpart_hashjoin_1.q,vector_coalesce_2.q,filter_join_breaktask2.q,table_access_keys_stats.q,vector_data_types.q,multiMapJoin2.q,filter_join_breaktask.q,schema_evol_orc_nonvec_part.q,alter_merge_2_orc.q,vectorization_3.q,union4.q,auto_sortmerge_join_8.q,stats_based_fetch_decision.q,vectorized_date_funcs.q,auto_sortmerge_join_10.q,vector_varchar_simple.q,vector_decimal_udf2.q]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] 
(batchId=134)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a]
 (batchId=135)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[offset_limit_ppd_optimizer]
 (batchId=150)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_1] 
(batchId=92)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] 
(batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorized_ptf] 
(batchId=121)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2756/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2756/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2756/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 8 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12845257 - PreCommit-HIVE-Build




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2017-01-01 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15791340#comment-15791340
 ] 

Hive QA commented on HIVE-15527:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12845241/HIVE-15527.1.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 21 failed/errored test(s), 10898 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=233)
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=123)

[groupby_complex_types.q,multigroupby_singlemr.q,mapjoin_decimal.q,groupby7.q,join5.q,bucketmapjoin_negative2.q,vectorization_div0.q,union_script.q,add_part_multiple.q,limit_pushdown.q,union_remove_17.q,uniquejoin.q,metadata_only_queries_with_filters.q,union25.q,load_dyn_part13.q]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] 
(batchId=134)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a]
 (batchId=135)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_1] 
(batchId=92)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] 
(batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_join_without_localtask]
 (batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_3]
 (batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[bucket_map_join_spark4]
 (batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[bucketmapjoin4] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[date_join1] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby2_noskew] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby6_noskew] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[mapjoin_test_outer] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[merge2] (batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[multi_join_union] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[smb_mapjoin_11] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[timestamp_2] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[union21] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorization_9] 
(batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorized_ptf] 
(batchId=121)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2754/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2754/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2754/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 21 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12845241 - PreCommit-HIVE-Build


[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

2016-12-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15789063#comment-15789063
 ] 

Hive QA commented on HIVE-15527:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12845215/HIVE-15527.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 358 failed/errored test(s), 10895 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=233)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] 
(batchId=134)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a]
 (batchId=135)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[columnstats_part_coltype]
 (batchId=150)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[auto_sortmerge_join_16]
 (batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[bucket4] 
(batchId=161)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[bucket5] 
(batchId=161)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[bucket6] 
(batchId=159)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[bucketmapjoin6]
 (batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[bucketmapjoin7]
 (batchId=159)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[constprog_partitioner]
 (batchId=161)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[constprog_semijoin]
 (batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[disable_merge_for_bucketing]
 (batchId=161)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[empty_dir_in_table]
 (batchId=159)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[gen_udf_example_add10]
 (batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[index_bitmap3]
 (batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[index_bitmap_auto]
 (batchId=159)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[infer_bucket_sort_bucketed_table]
 (batchId=161)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[infer_bucket_sort_map_operators]
 (batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[infer_bucket_sort_merge]
 (batchId=161)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[infer_bucket_sort_reducers_power_two]
 (batchId=159)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[leftsemijoin_mr]
 (batchId=159)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[orc_merge_incompat2]
 (batchId=161)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[parallel_orderby]
 (batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[quotedid_smb]
 (batchId=159)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[reduce_deduplicate]
 (batchId=161)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[remote_script]
 (batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[root_dir_external_table]
 (batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[schemeAuthority2]
 (batchId=161)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[schemeAuthority]
 (batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[scriptfile1]
 (batchId=161)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[truncate_column_buckets]
 (batchId=159)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[vector_outer_join1]
 (batchId=159)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[vector_outer_join2]
 (batchId=159)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[vector_outer_join3]
 (batchId=159)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[vector_outer_join4]
 (batchId=161)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[vector_outer_join5]
 (batchId=161)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] 
(batchId=93)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_3] 
(batchId=92)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_4] 
(batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=97)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_join0] 
(batchId=131)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_join15] 
(batchId=101)