[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827378#comment-15827378 ] Xuefu Zhang commented on HIVE-15527: Hi [~Ferd], [~dapengsun], We found a better fix, and the patch here is likely abandoned. The new fix is in HIVE-15580 and the patch there addresses both issues. Please feel free to try the patch there and provide your feedback. Thanks. > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chao Sun > Attachments: HIVE-15527.0.patch, HIVE-15527.0.patch, > HIVE-15527.1.patch, HIVE-15527.2.patch, HIVE-15527.3.patch, > HIVE-15527.4.patch, HIVE-15527.5.patch, HIVE-15527.6.patch, > HIVE-15527.7.patch, HIVE-15527.8.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in shuffled result produced by spark transformation > sortByKey. It's possible that memory can be exhausted because of a large key > group. > {code} > @Override > public Tuple2next() { > // TODO: implement this by accumulating rows with the same key > into a list. > // Note that this list needs to improved to prevent excessive > memory usage, but this > // can be done in later phase. > while (it.hasNext()) { > Tuple2 pair = it.next(); > if (curKey != null && !curKey.equals(pair._1())) { > HiveKey key = curKey; > List values = curValues; > curKey = pair._1(); > curValues = new ArrayList(); > curValues.add(pair._2()); > return new Tuple2 (key, > values); > } > curKey = pair._1(); > curValues.add(pair._2()); > } > if (curKey == null) { > throw new NoSuchElementException(); > } > // if we get here, this should be the last element we have > HiveKey key = curKey; > curKey = null; > return new Tuple2 (key, > curValues); > } > {code} > Since the output from sortByKey is already sorted on key, it's possible to > backup the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827376#comment-15827376 ] Dapeng Sun commented on HIVE-15527: --- Thank [~csun] and [~Ferd], here is the detail log: 17/01/17 xx:xx:xx INFO client.RemoteDriver: Failed to run job java.lang.NumberFormatException: null at java.lang.Long.parseLong(Long.java:552) at java.lang.Long.parseLong(Long.java:631) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:202) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateParentTran(SparkPlanGenerator.java:141) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109) at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:335) at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:366) at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:335) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 17/01/17 xx:xx:xx INFO client.RemoteDriver: Shutting down remote driver. > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chao Sun > Attachments: HIVE-15527.0.patch, HIVE-15527.0.patch, > HIVE-15527.1.patch, HIVE-15527.2.patch, HIVE-15527.3.patch, > HIVE-15527.4.patch, HIVE-15527.5.patch, HIVE-15527.6.patch, > HIVE-15527.7.patch, HIVE-15527.8.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in shuffled result produced by spark transformation > sortByKey. It's possible that memory can be exhausted because of a large key > group. > {code} > @Override > public Tuple2next() { > // TODO: implement this by accumulating rows with the same key > into a list. > // Note that this list needs to improved to prevent excessive > memory usage, but this > // can be done in later phase. > while (it.hasNext()) { > Tuple2 pair = it.next(); > if (curKey != null && !curKey.equals(pair._1())) { > HiveKey key = curKey; > List values = curValues; > curKey = pair._1(); > curValues = new ArrayList(); > curValues.add(pair._2()); > return new Tuple2 (key, > values); > } > curKey = pair._1(); > curValues.add(pair._2()); > } > if (curKey == null) { > throw new NoSuchElementException(); > } > // if we get here, this should be the last element we have > HiveKey key = curKey; > curKey = null; > return new Tuple2 (key, > curValues); > } > {code} > Since the output from sortByKey is already sorted on key, it's possible to > backup the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827369#comment-15827369 ] Ferdinand Xu commented on HIVE-15527: - Hi [~csun], thank you for working on this issue and we can reproduce OOM caused by Arraylist when data increased 100TB at a skewed data. We're trying to use your patch to see whether it works. BTW, a minor issue exists in your patch. If SPARK_SHUFFLE_BUFFER_SIZE is not set, the default long value will be passed to parseLong(String) method. {noformat} long maxBufferSize = Long.parseLong(this.jobConf.get(HiveConf.ConfVars.SPARK_SHUFFLE_BUFFER_SIZE.varname)); {noformat} > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chao Sun > Attachments: HIVE-15527.0.patch, HIVE-15527.0.patch, > HIVE-15527.1.patch, HIVE-15527.2.patch, HIVE-15527.3.patch, > HIVE-15527.4.patch, HIVE-15527.5.patch, HIVE-15527.6.patch, > HIVE-15527.7.patch, HIVE-15527.8.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in shuffled result produced by spark transformation > sortByKey. It's possible that memory can be exhausted because of a large key > group. > {code} > @Override > public Tuple2next() { > // TODO: implement this by accumulating rows with the same key > into a list. > // Note that this list needs to improved to prevent excessive > memory usage, but this > // can be done in later phase. > while (it.hasNext()) { > Tuple2 pair = it.next(); > if (curKey != null && !curKey.equals(pair._1())) { > HiveKey key = curKey; > List values = curValues; > curKey = pair._1(); > curValues = new ArrayList(); > curValues.add(pair._2()); > return new Tuple2 (key, > values); > } > curKey = pair._1(); > curValues.add(pair._2()); > } > if (curKey == null) { > throw new NoSuchElementException(); > } > // if we get here, this should be the last element we have > HiveKey key = curKey; > curKey = null; > return new Tuple2 (key, > curValues); > } > {code} > Since the output from sortByKey is already sorted on key, it's possible to > backup the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819439#comment-15819439 ] Chao Sun commented on HIVE-15527: - [~lirui], your concern is valid. I was using the patch to test the cases that the iterator could be reused, or that the inner iterator is not consumed. Basically just want to check whether this approach is possible. > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chao Sun > Attachments: HIVE-15527.0.patch, HIVE-15527.0.patch, > HIVE-15527.1.patch, HIVE-15527.2.patch, HIVE-15527.3.patch, > HIVE-15527.4.patch, HIVE-15527.5.patch, HIVE-15527.6.patch, > HIVE-15527.7.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in shuffled result produced by spark transformation > sortByKey. It's possible that memory can be exhausted because of a large key > group. > {code} > @Override > public Tuple2next() { > // TODO: implement this by accumulating rows with the same key > into a list. > // Note that this list needs to improved to prevent excessive > memory usage, but this > // can be done in later phase. > while (it.hasNext()) { > Tuple2 pair = it.next(); > if (curKey != null && !curKey.equals(pair._1())) { > HiveKey key = curKey; > List values = curValues; > curKey = pair._1(); > curValues = new ArrayList(); > curValues.add(pair._2()); > return new Tuple2 (key, > values); > } > curKey = pair._1(); > curValues.add(pair._2()); > } > if (curKey == null) { > throw new NoSuchElementException(); > } > // if we get here, this should be the last element we have > HiveKey key = curKey; > curKey = null; > return new Tuple2 (key, > curValues); > } > {code} > Since the output from sortByKey is already sorted on key, it's possible to > backup the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818027#comment-15818027 ] Rui Li commented on HIVE-15527: --- Hi [~csun], it seems the current solution relies on how the iterators are used - the inner iterator must be consumed before moving forward the outer iterator. Is there any guarantee this assumption will hold? Even if it works at the moment, I'm afraid changes in Hive or Spark can easily break it. > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chao Sun > Attachments: HIVE-15527.0.patch, HIVE-15527.0.patch, > HIVE-15527.1.patch, HIVE-15527.2.patch, HIVE-15527.3.patch, > HIVE-15527.4.patch, HIVE-15527.5.patch, HIVE-15527.6.patch, > HIVE-15527.7.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in shuffled result produced by spark transformation > sortByKey. It's possible that memory can be exhausted because of a large key > group. > {code} > @Override > public Tuple2next() { > // TODO: implement this by accumulating rows with the same key > into a list. > // Note that this list needs to improved to prevent excessive > memory usage, but this > // can be done in later phase. > while (it.hasNext()) { > Tuple2 pair = it.next(); > if (curKey != null && !curKey.equals(pair._1())) { > HiveKey key = curKey; > List values = curValues; > curKey = pair._1(); > curValues = new ArrayList(); > curValues.add(pair._2()); > return new Tuple2 (key, > values); > } > curKey = pair._1(); > curValues.add(pair._2()); > } > if (curKey == null) { > throw new NoSuchElementException(); > } > // if we get here, this should be the last element we have > HiveKey key = curKey; > curKey = null; > return new Tuple2 (key, > curValues); > } > {code} > Since the output from sortByKey is already sorted on key, it's possible to > backup the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15817998#comment-15817998 ] Hive QA commented on HIVE-15527: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12846750/HIVE-15527.7.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 19 failed/errored test(s), 10766 tests executed *Failed tests:* {noformat} TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=233) TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=101) [limit_pushdown2.q,skewjoin_noskew.q,leftsemijoin_mr.q,bucket3.q,skewjoinopt13.q,bucketmapjoin9.q,auto_join15.q,ptf.q,join22.q,vectorized_nested_mapjoin.q,sample4.q,union18.q,multi_insert_gby.q,join33.q,join_cond_pushdown_unqual2.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=108) [union_remove_1.q,ppd_outer_join2.q,date_udf.q,groupby1_noskew.q,join20.q,smb_mapjoin_13.q,groupby_rollup1.q,temp_table_gb1.q,vector_string_concat.q,smb_mapjoin_6.q,metadata_only_queries.q,auto_sortmerge_join_12.q,groupby_bigdata.q,groupby3_map_multi_distinct.q,innerjoin.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=114) [escape_distributeby1.q,join9.q,groupby2.q,groupby4_map.q,udf_max.q,vectorization_pushdown.q,cbo_gby_empty.q,join_cond_pushdown_unqual3.q,vectorization_short_regress.q,join8.q,sample10.q,cross_product_check_1.q,auto_join_stats.q,input_part2.q,groupby_multi_single_reducer3.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=115) [groupby_map_ppr_multi_distinct.q,vectorization_13.q,mapjoin_mapjoin.q,union2.q,join41.q,groupby8_map.q,cbo_subq_not_in.q,identity_project_remove_skip.q,stats5.q,groupby8_map_skew.q,nullgroup2.q,mapjoin_subquery.q,bucket2.q,smb_mapjoin_1.q,union_remove_8.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=117) [timestamp_lazy.q,union29.q,runtime_skewjoin_mapjoin_spark.q,auto_join22.q,union8.q,groupby5_map.q,dynamic_rdd_cache.q,auto_join29.q,groupby6.q,merge1.q,mapjoin_distinct.q,vector_decimal_mapjoin.q,sample5.q,multi_insert_move_tasks_share_dependencies.q,join_array.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=119) [groupby4_noskew.q,groupby3_map_skew.q,join_cond_pushdown_2.q,union19.q,union24.q,union_remove_5.q,groupby7_noskew_multi_single_reducer.q,vectorization_1.q,index_auto_self_join.q,auto_smb_mapjoin_14.q,script_env_var2.q,pcr.q,auto_join_filters.q,join0.q,join37.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=120) [stats12.q,groupby4.q,union_top_level.q,stats2.q,groupby10.q,mapjoin_filter_on_outerjoin.q,auto_sortmerge_join_4.q,limit_partition_metadataonly.q,load_dyn_part4.q,union3.q,groupby_multi_single_reducer.q,smb_mapjoin_14.q,groupby3_noskew_multi_distinct.q,stats18.q,union_remove_21.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=123) [groupby_complex_types.q,multigroupby_singlemr.q,union11.q,groupby7.q,join5.q,bucketmapjoin_negative2.q,vectorization_div0.q,union_script.q,add_part_multiple.q,limit_pushdown.q,union_remove_17.q,uniquejoin.q,metadata_only_queries_with_filters.q,union25.q,load_dyn_part13.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=126) [smb_mapjoin_15.q,script_pipe.q,auto_join24.q,filter_join_breaktask.q,bucket4.q,ppd_multi_insert.q,skewjoinopt20.q,join_thrift.q,multi_insert_gby3.q,groupby8.q,join_map_ppr.q,auto_sortmerge_join_8.q,escape_clusterby1.q,groupby_multi_insert_common_distinct.q,join6.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=131) [join2.q,join36.q,avro_joins_native.q,join18.q,smb_mapjoin_10.q,temp_table.q,union_remove_13.q,auto_sortmerge_join_5.q,groupby5_noskew.q,auto_join0.q,vectorization_17.q,auto_join_stats2.q,skewjoin_union_remove_1.q,union16.q,join_literals.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=94) [bucketmapjoin4.q,bucket_map_join_spark4.q,union21.q,groupby2_noskew.q,timestamp_2.q,date_join1.q,mergejoins.q,smb_mapjoin_11.q,auto_sortmerge_join_3.q,mapjoin_test_outer.q,vectorization_9.q,merge2.q,groupby6_noskew.q,auto_join_without_localtask.q,multi_join_union.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=98) [groupby_map_ppr.q,nullgroup4_multi_distinct.q,join_rc.q,union14.q,smb_mapjoin_12.q,vector_cast_constant.q,union_remove_4.q,auto_join11.q,load_dyn_part7.q,udaf_collect_set.q,vectorization_12.q,groupby_sort_skew_1.q,groupby_sort_skew_1_23.q,smb_mapjoin_25.q,skewjoinopt12.q]
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15817121#comment-15817121 ] Hive QA commented on HIVE-15527: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12846698/HIVE-15527.7.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 19 failed/errored test(s), 10769 tests executed *Failed tests:* {noformat} TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=233) TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=101) [limit_pushdown2.q,skewjoin_noskew.q,leftsemijoin_mr.q,bucket3.q,skewjoinopt13.q,bucketmapjoin9.q,auto_join15.q,ptf.q,join22.q,vectorized_nested_mapjoin.q,sample4.q,union18.q,multi_insert_gby.q,join33.q,join_cond_pushdown_unqual2.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=108) [union_remove_1.q,ppd_outer_join2.q,date_udf.q,groupby1_noskew.q,join20.q,smb_mapjoin_13.q,groupby_rollup1.q,temp_table_gb1.q,vector_string_concat.q,smb_mapjoin_6.q,metadata_only_queries.q,auto_sortmerge_join_12.q,groupby_bigdata.q,groupby3_map_multi_distinct.q,innerjoin.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=114) [escape_distributeby1.q,join9.q,groupby2.q,groupby4_map.q,udf_max.q,vectorization_pushdown.q,cbo_gby_empty.q,join_cond_pushdown_unqual3.q,vectorization_short_regress.q,join8.q,sample10.q,cross_product_check_1.q,auto_join_stats.q,input_part2.q,groupby_multi_single_reducer3.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=115) [groupby_map_ppr_multi_distinct.q,vectorization_13.q,mapjoin_mapjoin.q,union2.q,join41.q,groupby8_map.q,cbo_subq_not_in.q,identity_project_remove_skip.q,stats5.q,groupby8_map_skew.q,nullgroup2.q,mapjoin_subquery.q,bucket2.q,smb_mapjoin_1.q,union_remove_8.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=117) [timestamp_lazy.q,union29.q,runtime_skewjoin_mapjoin_spark.q,auto_join22.q,union8.q,groupby5_map.q,dynamic_rdd_cache.q,auto_join29.q,groupby6.q,merge1.q,mapjoin_distinct.q,vector_decimal_mapjoin.q,sample5.q,multi_insert_move_tasks_share_dependencies.q,join_array.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=119) [groupby4_noskew.q,groupby3_map_skew.q,join_cond_pushdown_2.q,union19.q,union24.q,union_remove_5.q,groupby7_noskew_multi_single_reducer.q,vectorization_1.q,index_auto_self_join.q,auto_smb_mapjoin_14.q,script_env_var2.q,pcr.q,auto_join_filters.q,join0.q,join37.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=120) [stats12.q,groupby4.q,union_top_level.q,stats2.q,groupby10.q,mapjoin_filter_on_outerjoin.q,auto_sortmerge_join_4.q,limit_partition_metadataonly.q,load_dyn_part4.q,union3.q,groupby_multi_single_reducer.q,smb_mapjoin_14.q,groupby3_noskew_multi_distinct.q,stats18.q,union_remove_21.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=123) [groupby_complex_types.q,multigroupby_singlemr.q,union11.q,groupby7.q,join5.q,bucketmapjoin_negative2.q,vectorization_div0.q,union_script.q,add_part_multiple.q,limit_pushdown.q,union_remove_17.q,uniquejoin.q,metadata_only_queries_with_filters.q,union25.q,load_dyn_part13.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=126) [smb_mapjoin_15.q,script_pipe.q,auto_join24.q,filter_join_breaktask.q,bucket4.q,ppd_multi_insert.q,skewjoinopt20.q,join_thrift.q,multi_insert_gby3.q,groupby8.q,join_map_ppr.q,auto_sortmerge_join_8.q,escape_clusterby1.q,groupby_multi_insert_common_distinct.q,join6.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=131) [join2.q,join36.q,avro_joins_native.q,join18.q,smb_mapjoin_10.q,temp_table.q,union_remove_13.q,auto_sortmerge_join_5.q,groupby5_noskew.q,auto_join0.q,vectorization_17.q,auto_join_stats2.q,skewjoin_union_remove_1.q,union16.q,join_literals.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=94) [bucketmapjoin4.q,bucket_map_join_spark4.q,union21.q,groupby2_noskew.q,timestamp_2.q,date_join1.q,mergejoins.q,smb_mapjoin_11.q,auto_sortmerge_join_3.q,mapjoin_test_outer.q,vectorization_9.q,merge2.q,groupby6_noskew.q,auto_join_without_localtask.q,multi_join_union.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=98) [groupby_map_ppr.q,nullgroup4_multi_distinct.q,join_rc.q,union14.q,smb_mapjoin_12.q,vector_cast_constant.q,union_remove_4.q,auto_join11.q,load_dyn_part7.q,udaf_collect_set.q,vectorization_12.q,groupby_sort_skew_1.q,groupby_sort_skew_1_23.q,smb_mapjoin_25.q,skewjoinopt12.q]
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15816676#comment-15816676 ] Hive QA commented on HIVE-15527: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12846683/HIVE-15527.7.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 281 failed/errored test(s), 8212 tests executed *Failed tests:* {noformat} TestAccumuloCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=217) TestAccumuloConnectionParameters - did not produce a TEST-*.xml file (likely timed out) (batchId=165) TestAccumuloHiveRow - did not produce a TEST-*.xml file (likely timed out) (batchId=165) TestAccumuloPredicateHandler - did not produce a TEST-*.xml file (likely timed out) (batchId=165) TestAccumuloRangeGenerator - did not produce a TEST-*.xml file (likely timed out) (batchId=165) TestAccumuloRowSerializer - did not produce a TEST-*.xml file (likely timed out) (batchId=165) TestAccumuloSerDe - did not produce a TEST-*.xml file (likely timed out) (batchId=165) TestAccumuloSerDeParameters - did not produce a TEST-*.xml file (likely timed out) (batchId=165) TestAccumuloStorageHandler - did not produce a TEST-*.xml file (likely timed out) (batchId=165) TestAcidOnTez - did not produce a TEST-*.xml file (likely timed out) (batchId=202) TestAcidOnTezWithSplitUpdate - did not produce a TEST-*.xml file (likely timed out) (batchId=205) TestAddResource - did not produce a TEST-*.xml file (likely timed out) (batchId=269) TestAdminUser - did not produce a TEST-*.xml file (likely timed out) (batchId=197) TestAggregateStatsCache - did not produce a TEST-*.xml file (likely timed out) (batchId=186) TestAvroHCatLoader - did not produce a TEST-*.xml file (likely timed out) (batchId=172) TestAvroHCatStorer - did not produce a TEST-*.xml file (likely timed out) (batchId=172) TestBeeLineExceptionHandling - did not produce a TEST-*.xml file (likely timed out) (batchId=167) TestBeeLineHistory - did not produce a TEST-*.xml file (likely timed out) (batchId=167) TestBeelineArgParsing - did not produce a TEST-*.xml file (likely timed out) (batchId=167) TestBlobstoreCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=229) TestBlobstoreNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=229) TestCLIAuthzSessionContext - did not produce a TEST-*.xml file (likely timed out) (batchId=215) TestCleaner2 - did not produce a TEST-*.xml file (likely timed out) (batchId=248) TestClientCommandHookFactory - did not produce a TEST-*.xml file (likely timed out) (batchId=167) TestColumnAccess - did not produce a TEST-*.xml file (likely timed out) (batchId=256) TestColumnEncoding - did not produce a TEST-*.xml file (likely timed out) (batchId=165) TestColumnMapper - did not produce a TEST-*.xml file (likely timed out) (batchId=165) TestColumnMappingFactory - did not produce a TEST-*.xml file (likely timed out) (batchId=165) TestCommandProcessorFactory - did not produce a TEST-*.xml file (likely timed out) (batchId=274) TestCompileProcessor - did not produce a TEST-*.xml file (likely timed out) (batchId=274) TestConditionalResolverCommonJoin - did not produce a TEST-*.xml file (likely timed out) (batchId=270) TestContext - did not produce a TEST-*.xml file (likely timed out) (batchId=261) TestContribCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=220) TestCreateMacroDesc - did not produce a TEST-*.xml file (likely timed out) (batchId=270) TestCuckooSet - did not produce a TEST-*.xml file (likely timed out) (batchId=263) TestDbTxnManager - did not produce a TEST-*.xml file (likely timed out) (batchId=274) TestDeadline - did not produce a TEST-*.xml file (likely timed out) (batchId=187) TestDefaultAccumuloRowIdFactory - did not produce a TEST-*.xml file (likely timed out) (batchId=165) TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=233) TestDoubleCompare - did not produce a TEST-*.xml file (likely timed out) (batchId=165) TestDropMacroDesc - did not produce a TEST-*.xml file (likely timed out) (batchId=270) TestDummy - did not produce a TEST-*.xml file (likely timed out) (batchId=217) TestDummyTxnManager - did not produce a TEST-*.xml file (likely timed out) (batchId=274) TestE2EScenarios - did not produce a TEST-*.xml file (likely timed out) (batchId=172) TestEmbeddedLockManager - did not produce a TEST-*.xml file (likely timed out) (batchId=273) TestEmbeddedThriftBinaryCLIService - did not produce a TEST-*.xml file (likely timed out) (batchId=211) TestEncryptedHDFSCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=156)
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15815544#comment-15815544 ] Hive QA commented on HIVE-15527: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12846609/HIVE-15527.0.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 51 failed/errored test(s), 10296 tests executed *Failed tests:* {noformat} TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=233) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] (batchId=61) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] (batchId=28) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_coalesce] (batchId=75) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=134) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a] (batchId=135) org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[offset_limit_ppd_optimizer] (batchId=150) org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[schema_evol_text_vec_part] (batchId=148) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver (batchId=159) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver (batchId=160) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver (batchId=161) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=100) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=101) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=102) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=103) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=104) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=105) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=106) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=107) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=108) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=109) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=110) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=111) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=112) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=113) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=114) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=115) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=116) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=117) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=118) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=119) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=120) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=121) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=122) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=123) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=124) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=125) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=126) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=127) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=128) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=129) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15815326#comment-15815326 ] Hive QA commented on HIVE-15527: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12846604/HIVE-15527.0.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 51 failed/errored test(s), 10296 tests executed *Failed tests:* {noformat} TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=233) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] (batchId=61) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] (batchId=28) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_coalesce] (batchId=75) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=134) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a] (batchId=135) org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[offset_limit_ppd_optimizer] (batchId=150) org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[schema_evol_text_vec_part] (batchId=148) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver (batchId=159) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver (batchId=160) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver (batchId=161) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=100) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=101) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=102) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=103) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=104) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=105) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=106) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=107) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=108) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=109) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=110) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=111) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=112) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=113) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=114) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=115) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=116) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=117) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=118) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=119) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=120) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=121) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=122) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=123) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=124) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=125) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=126) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=127) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=128) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=129) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813670#comment-15813670 ] Hive QA commented on HIVE-15527: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12846469/HIVE-15527.6.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 14 failed/errored test(s), 10918 tests executed *Failed tests:* {noformat} TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=233) TestSSL - did not produce a TEST-*.xml file (likely timed out) (batchId=212) TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=103) [skewjoinopt19.q,order.q,join_merge_multi_expressions.q,skewjoinopt10.q,insert_into1.q,vectorized_math_funcs.q,vectorization_4.q,vectorization_2.q,skewjoinopt6.q,union_remove_19.q,decimal_1_1.q,join14.q,outer_join_ppr.q,rcfile_bigdata.q,load_dyn_part10.q] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] (batchId=61) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] (batchId=28) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_coalesce] (batchId=75) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=134) org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[offset_limit_ppd_optimizer] (batchId=150) org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[schema_evol_text_vec_part] (batchId=148) org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_notin] (batchId=150) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby7_noskew_multi_single_reducer] (batchId=119) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[multi_insert_with_join] (batchId=122) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ptf] (batchId=101) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorized_ptf] (batchId=121) {noformat} Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2852/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2852/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2852/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 14 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12846469 - PreCommit-HIVE-Build > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chao Sun > Attachments: HIVE-15527.1.patch, HIVE-15527.2.patch, > HIVE-15527.3.patch, HIVE-15527.4.patch, HIVE-15527.5.patch, > HIVE-15527.6.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in shuffled result produced by spark transformation > sortByKey. It's possible that memory can be exhausted because of a large key > group. > {code} > @Override > public Tuple2next() { > // TODO: implement this by accumulating rows with the same key > into a list. > // Note that this list needs to improved to prevent excessive > memory usage, but this > // can be done in later phase. > while (it.hasNext()) { > Tuple2 pair = it.next(); > if (curKey != null && !curKey.equals(pair._1())) { > HiveKey key = curKey; > List values = curValues; > curKey = pair._1(); > curValues = new ArrayList(); > curValues.add(pair._2()); > return new Tuple2 (key, > values); > } > curKey = pair._1(); > curValues.add(pair._2()); > } > if (curKey == null) { > throw new NoSuchElementException(); > } > // if we get here, this should be the last element we have > HiveKey key = curKey; > curKey = null; > return new Tuple2 (key, > curValues); > } > {code} > Since the output from sortByKey is already sorted on key, it's possible to > backup the value iterable using the same input iterator. --
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813088#comment-15813088 ] Chao Sun commented on HIVE-15527: - Test failures not related. [~lirui], [~xuefuz]: can you take a look? thanks. > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chao Sun > Attachments: HIVE-15527.1.patch, HIVE-15527.2.patch, > HIVE-15527.3.patch, HIVE-15527.4.patch, HIVE-15527.5.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in shuffled result produced by spark transformation > sortByKey. It's possible that memory can be exhausted because of a large key > group. > {code} > @Override > public Tuple2next() { > // TODO: implement this by accumulating rows with the same key > into a list. > // Note that this list needs to improved to prevent excessive > memory usage, but this > // can be done in later phase. > while (it.hasNext()) { > Tuple2 pair = it.next(); > if (curKey != null && !curKey.equals(pair._1())) { > HiveKey key = curKey; > List values = curValues; > curKey = pair._1(); > curValues = new ArrayList(); > curValues.add(pair._2()); > return new Tuple2 (key, > values); > } > curKey = pair._1(); > curValues.add(pair._2()); > } > if (curKey == null) { > throw new NoSuchElementException(); > } > // if we get here, this should be the last element we have > HiveKey key = curKey; > curKey = null; > return new Tuple2 (key, > curValues); > } > {code} > Since the output from sortByKey is already sorted on key, it's possible to > backup the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812719#comment-15812719 ] Hive QA commented on HIVE-15527: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12846390/HIVE-15527.5.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 10918 tests executed *Failed tests:* {noformat} TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=233) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] (batchId=61) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] (batchId=28) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_coalesce] (batchId=75) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=134) org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[offset_limit_ppd_optimizer] (batchId=150) {noformat} Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2843/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2843/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2843/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 6 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12846390 - PreCommit-HIVE-Build > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chao Sun > Attachments: HIVE-15527.1.patch, HIVE-15527.2.patch, > HIVE-15527.3.patch, HIVE-15527.4.patch, HIVE-15527.5.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in shuffled result produced by spark transformation > sortByKey. It's possible that memory can be exhausted because of a large key > group. > {code} > @Override > public Tuple2next() { > // TODO: implement this by accumulating rows with the same key > into a list. > // Note that this list needs to improved to prevent excessive > memory usage, but this > // can be done in later phase. > while (it.hasNext()) { > Tuple2 pair = it.next(); > if (curKey != null && !curKey.equals(pair._1())) { > HiveKey key = curKey; > List values = curValues; > curKey = pair._1(); > curValues = new ArrayList(); > curValues.add(pair._2()); > return new Tuple2 (key, > values); > } > curKey = pair._1(); > curValues.add(pair._2()); > } > if (curKey == null) { > throw new NoSuchElementException(); > } > // if we get here, this should be the last element we have > HiveKey key = curKey; > curKey = null; > return new Tuple2 (key, > curValues); > } > {code} > Since the output from sortByKey is already sorted on key, it's possible to > backup the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15799429#comment-15799429 ] Hive QA commented on HIVE-15527: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12845609/HIVE-15527.4.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 25 failed/errored test(s), 10914 tests executed *Failed tests:* {noformat} TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=233) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] (batchId=61) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] (batchId=28) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_coalesce] (batchId=75) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=134) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a] (batchId=135) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[vector_outer_join5] (batchId=161) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_join_without_localtask] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_3] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[bucket_map_join_spark4] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[bucketmapjoin4] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[date_join1] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby2_noskew] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby6_noskew] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[mapjoin_test_outer] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[merge2] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[multi_join_union] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ptf] (batchId=101) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[smb_mapjoin_11] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[timestamp_2] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[union21] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vector_between_in] (batchId=118) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorization_9] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorized_ptf] (batchId=121) org.apache.hive.jdbc.TestMultiSessionsHS2WithLocalClusterSpark.testSparkQuery (batchId=215) {noformat} Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2781/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2781/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2781/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 25 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12845609 - PreCommit-HIVE-Build > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chao Sun > Attachments: HIVE-15527.1.patch, HIVE-15527.2.patch, > HIVE-15527.3.patch, HIVE-15527.4.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in shuffled result produced by spark transformation > sortByKey. It's possible that memory can be exhausted because of a large key > group. > {code} > @Override > public Tuple2next() { > // TODO: implement this by accumulating rows with the same key > into a list. > // Note that this list needs to improved to prevent excessive > memory usage, but this > // can be done in later phase. > while (it.hasNext()) { > Tuple2 pair = it.next(); > if (curKey != null && !curKey.equals(pair._1())) { > HiveKey key = curKey; > List values = curValues; > curKey = pair._1(); > curValues = new ArrayList(); > curValues.add(pair._2()); > return
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15795782#comment-15795782 ] Chao Sun commented on HIVE-15527: - I'll take over this from [~xuefuz]. Will attempt to address the above issues and submit a new patch. > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Chao Sun > Attachments: HIVE-15527.1.patch, HIVE-15527.2.patch, > HIVE-15527.3.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in shuffled result produced by spark transformation > sortByKey. It's possible that memory can be exhausted because of a large key > group. > {code} > @Override > public Tuple2next() { > // TODO: implement this by accumulating rows with the same key > into a list. > // Note that this list needs to improved to prevent excessive > memory usage, but this > // can be done in later phase. > while (it.hasNext()) { > Tuple2 pair = it.next(); > if (curKey != null && !curKey.equals(pair._1())) { > HiveKey key = curKey; > List values = curValues; > curKey = pair._1(); > curValues = new ArrayList(); > curValues.add(pair._2()); > return new Tuple2 (key, > values); > } > curKey = pair._1(); > curValues.add(pair._2()); > } > if (curKey == null) { > throw new NoSuchElementException(); > } > // if we get here, this should be the last element we have > HiveKey key = curKey; > curKey = null; > return new Tuple2 (key, > curValues); > } > {code} > Since the output from sortByKey is already sorted on key, it's possible to > backup the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15794194#comment-15794194 ] Xuefu Zhang commented on HIVE-15527: [~csun], [~lirui], and [~kellyzly], thanks for your feedback. The patch here is more like a POC, so improvement is needed for production. Here are a few thoughts: 1) I'm not sure what caused the result diff, though there might be a bug in HiveKVResultCache that's manifested. The diff seems invalid when comparing to MR result. Also, there seems some randomness generating the diff. 2) As to performance, Rui's idea concern is valid. What I tried to demo is that we need something similar to HiveKVResultCache but only for values. 3) Similar to 2), we need to have a good cache size to avoid FIO for regular group sizes. Currently HiveKVREsultCache has cache only for 1024 rows, which seems rather small. 4) Performance impact needs to be evaluated. 5) The idea here could be used to solve the same problem for Spark's groupByKey() in Hive. We could use Spark's reduceByKey() instead and in Hive we do in-group value caching like what we can do here. I'm not sure if I have bandwidth to move this forward at full speed. Please feel free to take this (and other issues) forward. Thanks. > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Attachments: HIVE-15527.1.patch, HIVE-15527.2.patch, > HIVE-15527.3.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in shuffled result produced by spark transformation > sortByKey. It's possible that memory can be exhausted because of a large key > group. > {code} > @Override > public Tuple2next() { > // TODO: implement this by accumulating rows with the same key > into a list. > // Note that this list needs to improved to prevent excessive > memory usage, but this > // can be done in later phase. > while (it.hasNext()) { > Tuple2 pair = it.next(); > if (curKey != null && !curKey.equals(pair._1())) { > HiveKey key = curKey; > List values = curValues; > curKey = pair._1(); > curValues = new ArrayList(); > curValues.add(pair._2()); > return new Tuple2 (key, > values); > } > curKey = pair._1(); > curValues.add(pair._2()); > } > if (curKey == null) { > throw new NoSuchElementException(); > } > // if we get here, this should be the last element we have > HiveKey key = curKey; > curKey = null; > return new Tuple2 (key, > curValues); > } > {code} > Since the output from sortByKey is already sorted on key, it's possible to > backup the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15793967#comment-15793967 ] liyunzhang_intel commented on HIVE-15527: - [~xuefuz] and [~lirui]: HiveKVResultCache will write key value pair if buffer is full and this will do some limits to the memory usage. But is there anything to show that the ArrayList use a lot of memory? test this by memory analysis tool? > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Attachments: HIVE-15527.1.patch, HIVE-15527.2.patch, > HIVE-15527.3.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in shuffled result produced by spark transformation > sortByKey. It's possible that memory can be exhausted because of a large key > group. > {code} > @Override > public Tuple2next() { > // TODO: implement this by accumulating rows with the same key > into a list. > // Note that this list needs to improved to prevent excessive > memory usage, but this > // can be done in later phase. > while (it.hasNext()) { > Tuple2 pair = it.next(); > if (curKey != null && !curKey.equals(pair._1())) { > HiveKey key = curKey; > List values = curValues; > curKey = pair._1(); > curValues = new ArrayList(); > curValues.add(pair._2()); > return new Tuple2 (key, > values); > } > curKey = pair._1(); > curValues.add(pair._2()); > } > if (curKey == null) { > throw new NoSuchElementException(); > } > // if we get here, this should be the last element we have > HiveKey key = curKey; > curKey = null; > return new Tuple2 (key, > curValues); > } > {code} > Since the output from sortByKey is already sorted on key, it's possible to > backup the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15793903#comment-15793903 ] Rui Li commented on HIVE-15527: --- Since the HiveKVResultCache here only stores values for the same key, I think we can avoid Ser/De the HiveKey to improve performance? > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Attachments: HIVE-15527.1.patch, HIVE-15527.2.patch, > HIVE-15527.3.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in shuffled result produced by spark transformation > sortByKey. It's possible that memory can be exhausted because of a large key > group. > {code} > @Override > public Tuple2next() { > // TODO: implement this by accumulating rows with the same key > into a list. > // Note that this list needs to improved to prevent excessive > memory usage, but this > // can be done in later phase. > while (it.hasNext()) { > Tuple2 pair = it.next(); > if (curKey != null && !curKey.equals(pair._1())) { > HiveKey key = curKey; > List values = curValues; > curKey = pair._1(); > curValues = new ArrayList(); > curValues.add(pair._2()); > return new Tuple2 (key, > values); > } > curKey = pair._1(); > curValues.add(pair._2()); > } > if (curKey == null) { > throw new NoSuchElementException(); > } > // if we get here, this should be the last element we have > HiveKey key = curKey; > curKey = null; > return new Tuple2 (key, > curValues); > } > {code} > Since the output from sortByKey is already sorted on key, it's possible to > backup the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15793570#comment-15793570 ] Chao Sun commented on HIVE-15527: - Patch looks good. Any idea why the qfile result is different? > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Attachments: HIVE-15527.1.patch, HIVE-15527.2.patch, > HIVE-15527.3.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in shuffled result produced by spark transformation > sortByKey. It's possible that memory can be exhausted because of a large key > group. > {code} > @Override > public Tuple2next() { > // TODO: implement this by accumulating rows with the same key > into a list. > // Note that this list needs to improved to prevent excessive > memory usage, but this > // can be done in later phase. > while (it.hasNext()) { > Tuple2 pair = it.next(); > if (curKey != null && !curKey.equals(pair._1())) { > HiveKey key = curKey; > List values = curValues; > curKey = pair._1(); > curValues = new ArrayList(); > curValues.add(pair._2()); > return new Tuple2 (key, > values); > } > curKey = pair._1(); > curValues.add(pair._2()); > } > if (curKey == null) { > throw new NoSuchElementException(); > } > // if we get here, this should be the last element we have > HiveKey key = curKey; > curKey = null; > return new Tuple2 (key, > curValues); > } > {code} > Since the output from sortByKey is already sorted on key, it's possible to > backup the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15791917#comment-15791917 ] Hive QA commented on HIVE-15527: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12845259/HIVE-15527.3.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 10913 tests executed *Failed tests:* {noformat} TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=233) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=134) org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=93) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ptf] (batchId=101) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorized_ptf] (batchId=121) {noformat} Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2757/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2757/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2757/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 5 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12845259 - PreCommit-HIVE-Build > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Attachments: HIVE-15527.1.patch, HIVE-15527.2.patch, > HIVE-15527.3.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in shuffled result produced by spark transformation > sortByKey. It's possible that memory can be exhausted because of a large key > group. > {code} > @Override > public Tuple2next() { > // TODO: implement this by accumulating rows with the same key > into a list. > // Note that this list needs to improved to prevent excessive > memory usage, but this > // can be done in later phase. > while (it.hasNext()) { > Tuple2 pair = it.next(); > if (curKey != null && !curKey.equals(pair._1())) { > HiveKey key = curKey; > List values = curValues; > curKey = pair._1(); > curValues = new ArrayList(); > curValues.add(pair._2()); > return new Tuple2 (key, > values); > } > curKey = pair._1(); > curValues.add(pair._2()); > } > if (curKey == null) { > throw new NoSuchElementException(); > } > // if we get here, this should be the last element we have > HiveKey key = curKey; > curKey = null; > return new Tuple2 (key, > curValues); > } > {code} > Since the output from sortByKey is already sorted on key, it's possible to > backup the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15791783#comment-15791783 ] Hive QA commented on HIVE-15527: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12845257/HIVE-15527.2.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 8 failed/errored test(s), 10883 tests executed *Failed tests:* {noformat} TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=233) TestMiniLlapLocalCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=151) [smb_mapjoin_15.q,insert_values_partitioned.q,selectDistinctStar.q,bucket4.q,vectorized_distinct_gby.q,vector_groupby_mapjoin.q,insert_values_dynamic_partitioned.q,vector_nvl.q,join_nullsafe.q,vectorized_mapjoin.q,vectorized_shufflejoin.q,tez_smb_1.q,cbo_union.q,tez_vector_dynpart_hashjoin_1.q,vector_coalesce_2.q,filter_join_breaktask2.q,table_access_keys_stats.q,vector_data_types.q,multiMapJoin2.q,filter_join_breaktask.q,schema_evol_orc_nonvec_part.q,alter_merge_2_orc.q,vectorization_3.q,union4.q,auto_sortmerge_join_8.q,stats_based_fetch_decision.q,vectorized_date_funcs.q,auto_sortmerge_join_10.q,vector_varchar_simple.q,vector_decimal_udf2.q] org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=134) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a] (batchId=135) org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[offset_limit_ppd_optimizer] (batchId=150) org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_1] (batchId=92) org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=93) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorized_ptf] (batchId=121) {noformat} Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2756/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2756/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2756/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 8 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12845257 - PreCommit-HIVE-Build > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Attachments: HIVE-15527.1.patch, HIVE-15527.2.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in shuffled result produced by spark transformation > sortByKey. It's possible that memory can be exhausted because of a large key > group. > {code} > @Override > public Tuple2next() { > // TODO: implement this by accumulating rows with the same key > into a list. > // Note that this list needs to improved to prevent excessive > memory usage, but this > // can be done in later phase. > while (it.hasNext()) { > Tuple2 pair = it.next(); > if (curKey != null && !curKey.equals(pair._1())) { > HiveKey key = curKey; > List values = curValues; > curKey = pair._1(); > curValues = new ArrayList(); > curValues.add(pair._2()); > return new Tuple2 (key, > values); > } > curKey = pair._1(); > curValues.add(pair._2()); > } > if (curKey == null) { > throw new NoSuchElementException(); > } > // if we get here, this should be the last element we have > HiveKey key = curKey; > curKey = null; > return new Tuple2 (key, > curValues); > } > {code} > Since the output from sortByKey is already sorted on key, it's possible to > backup the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15791340#comment-15791340 ] Hive QA commented on HIVE-15527: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12845241/HIVE-15527.1.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 21 failed/errored test(s), 10898 tests executed *Failed tests:* {noformat} TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=233) TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=123) [groupby_complex_types.q,multigroupby_singlemr.q,mapjoin_decimal.q,groupby7.q,join5.q,bucketmapjoin_negative2.q,vectorization_div0.q,union_script.q,add_part_multiple.q,limit_pushdown.q,union_remove_17.q,uniquejoin.q,metadata_only_queries_with_filters.q,union25.q,load_dyn_part13.q] org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=134) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a] (batchId=135) org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_1] (batchId=92) org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=93) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_join_without_localtask] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_3] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[bucket_map_join_spark4] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[bucketmapjoin4] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[date_join1] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby2_noskew] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby6_noskew] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[mapjoin_test_outer] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[merge2] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[multi_join_union] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[smb_mapjoin_11] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[timestamp_2] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[union21] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorization_9] (batchId=94) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorized_ptf] (batchId=121) {noformat} Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2754/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2754/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2754/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 21 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12845241 - PreCommit-HIVE-Build > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Attachments: HIVE-15527.1.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in shuffled result produced by spark transformation > sortByKey. It's possible that memory can be exhausted because of a large key > group. > {code} > @Override > public Tuple2next() { > // TODO: implement this by accumulating rows with the same key > into a list. > // Note that this list needs to improved to prevent excessive > memory usage, but this > // can be done in later phase. > while (it.hasNext()) { > Tuple2 pair = it.next(); > if (curKey != null && !curKey.equals(pair._1())) { > HiveKey key = curKey; > List values = curValues; > curKey = pair._1(); > curValues = new ArrayList(); > curValues.add(pair._2()); > return new Tuple2 (key, > values); > } > curKey = pair._1(); >
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15789063#comment-15789063 ] Hive QA commented on HIVE-15527: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12845215/HIVE-15527.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 358 failed/errored test(s), 10895 tests executed *Failed tests:* {noformat} TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=233) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=134) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a] (batchId=135) org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[columnstats_part_coltype] (batchId=150) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[auto_sortmerge_join_16] (batchId=160) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[bucket4] (batchId=161) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[bucket5] (batchId=161) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[bucket6] (batchId=159) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[bucketmapjoin6] (batchId=160) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[bucketmapjoin7] (batchId=159) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[constprog_partitioner] (batchId=161) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[constprog_semijoin] (batchId=160) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[disable_merge_for_bucketing] (batchId=161) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[empty_dir_in_table] (batchId=159) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[gen_udf_example_add10] (batchId=160) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[index_bitmap3] (batchId=160) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[index_bitmap_auto] (batchId=159) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[infer_bucket_sort_bucketed_table] (batchId=161) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[infer_bucket_sort_map_operators] (batchId=160) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[infer_bucket_sort_merge] (batchId=161) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[infer_bucket_sort_reducers_power_two] (batchId=159) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[leftsemijoin_mr] (batchId=159) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[orc_merge_incompat2] (batchId=161) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[parallel_orderby] (batchId=160) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[quotedid_smb] (batchId=159) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[reduce_deduplicate] (batchId=161) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[remote_script] (batchId=160) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[root_dir_external_table] (batchId=160) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[schemeAuthority2] (batchId=161) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[schemeAuthority] (batchId=160) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[scriptfile1] (batchId=161) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[truncate_column_buckets] (batchId=159) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[vector_outer_join1] (batchId=159) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[vector_outer_join2] (batchId=159) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[vector_outer_join3] (batchId=159) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[vector_outer_join4] (batchId=161) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[vector_outer_join5] (batchId=161) org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=93) org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_3] (batchId=92) org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_4] (batchId=93) org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=97) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_join0] (batchId=131) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_join15] (batchId=101)