Re: Review Request 60632: HIVE-16659: Query plan should reflect hive.spark.use.groupby.shuffle

2017-07-04 Thread Rui Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60632/#review179595
---


Ship It!

- Rui Li


On July 5, 2017, 4:07 a.m., Bing Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60632/
> ---
> 
> (Updated July 5, 2017, 4:07 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-16659: Query plan should reflect hive.spark.use.groupby.shuffle
> 
> 
> Diffs
> ---
> 
>   itests/src/test/resources/testconfiguration.properties 19ff316
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RepartitionShuffler.java d0c708c
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 5f85f9e
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java b9901da
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java afbeccb
>   ql/src/test/queries/clientpositive/spark_explain_groupbyshuffle.q PRE-CREATION
>   ql/src/test/results/clientpositive/spark/spark_explain_groupbyshuffle.q.out PRE-CREATION
> 
> 
> Diff: https://reviews.apache.org/r/60632/diff/2/

Re: Review Request 60632: HIVE-16659: Query plan should reflect hive.spark.use.groupby.shuffle

2017-07-04 Thread Bing Li via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60632/
---

(Updated July 5, 2017, 4:07 a.m.)


Review request for hive.


Changes
---

Update GenSparkUtils.java based on Rui's comments


Repository: hive-git


Description
---

HIVE-16659: Query plan should reflect hive.spark.use.groupby.shuffle


Diffs (updated)
---

  itests/src/test/resources/testconfiguration.properties 19ff316
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RepartitionShuffler.java d0c708c
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 5f85f9e
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java b9901da
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java afbeccb
  ql/src/test/queries/clientpositive/spark_explain_groupbyshuffle.q PRE-CREATION
  ql/src/test/results/clientpositive/spark/spark_explain_groupbyshuffle.q.out PRE-CREATION


Diff: https://reviews.apache.org/r/60632/diff/2/

Changes: https://reviews.apache.org/r/60632/diff/1-2/


Testing
---

set hive.spark.use.groupby.shuffle=true;
explain select key, count(val) from t1 group by key;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 2)
      DagName: root_20170625202742_58335619-7107-4026-9911-43d2ec449088:2
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t1
                  Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: key (type: int), val (type: string)
                    outputColumnNames: key, val
                    Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: count(val)
                      keys: key (type: int)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col1 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink


set hive.spark.use.groupby.shuffle=false;
explain select key, count(val) from t1 group by key;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 2)
      DagName: root_20170625203122_3afe01dd-41cc-477e-9098-ddd58b37ad4e:3
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t1
                  Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: key (type: int), val (type: string)
                    outputColumnNames: key, val
                    Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: count(val)
                      keys: key (type: int)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num r

Re: Review Request 60632: HIVE-16659: Query plan should reflect hive.spark.use.groupby.shuffle

2017-07-04 Thread Rui Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60632/#review179554
---




ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java
Lines 68 (patched)


Please avoid * import



ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java
Lines 432 (patched)


It's preferable to use HiveConf::getBoolVar to read this boolean setting.
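
To illustrate why a typed accessor is usually nicer than parsing the raw string at the call site, here is a minimal, self-contained sketch. It does not use Hive at all: `BoolVarSketch`, `ConfVar`, `Conf` and the `true` default are hypothetical stand-ins chosen for the example, and only the property name `hive.spark.use.groupby.shuffle` comes from this review. The point is that the key name and its default live in one place, so callers cannot disagree about either.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: none of these types are Hive classes.
class BoolVarSketch {

  /** Hypothetical typed config key: name and default are declared together. */
  enum ConfVar {
    // The default of true is an assumption made for this sketch.
    SPARK_USE_GROUPBY_SHUFFLE("hive.spark.use.groupby.shuffle", true);

    final String varname;
    final boolean defaultBoolVal;

    ConfVar(String varname, boolean defaultBoolVal) {
      this.varname = varname;
      this.defaultBoolVal = defaultBoolVal;
    }
  }

  /** Hypothetical configuration object that stores raw string values. */
  static final class Conf {
    private final Map<String, String> values = new HashMap<>();

    void set(String name, String value) {
      values.put(name, value);
    }

    String get(String name) {
      return values.get(name);
    }
  }

  /** Typed accessor: falls back to the declared default when the key is unset. */
  static boolean getBoolVar(Conf conf, ConfVar var) {
    String raw = conf.get(var.varname);
    return raw == null ? var.defaultBoolVal : Boolean.parseBoolean(raw.trim());
  }

  public static void main(String[] args) {
    Conf conf = new Conf();
    // Key not set yet, so the declared default is returned.
    System.out.println(getBoolVar(conf, ConfVar.SPARK_USE_GROUPBY_SHUFFLE));
    conf.set("hive.spark.use.groupby.shuffle", "false");
    System.out.println(getBoolVar(conf, ConfVar.SPARK_USE_GROUPBY_SHUFFLE));
  }
}
```

The sketch also sticks to explicit imports rather than a wildcard, in line with the earlier comment on this file.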



ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java
Line 438 (original), 441 (patched)


nit: extra space before !useSparkGroupBy



ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java
Line 471 (original), 477 (patched)


let's delete this comment


- Rui Li


On July 4, 2017, 8:48 a.m., Bing Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60632/
> ---
> 
> (Updated July 4, 2017, 8:48 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-16659: Query plan should reflect hive.spark.use.groupby.shuffle
> 
> 
> Diffs
> ---
> 
>   itests/src/test/resources/testconfiguration.properties 19ff316
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RepartitionShuffler.java d0c708c
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 5f85f9e
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java b9901da
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java afbeccb
>   ql/src/test/queries/clientpositive/spark_explain_groupbyshuffle.q PRE-CREATION
>   ql/src/test/results/clientpositive/spark/spark_explain_groupbyshuffle.q.out PRE-CREATION
> 
> 
> Diff: https://reviews.apache.org/r/60632/diff/1/

Review Request 60632: HIVE-16659: Query plan should reflect hive.spark.use.groupby.shuffle

2017-07-04 Thread Bing Li via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60632/
---

Review request for hive.


Repository: hive-git


Description
---

HIVE-16659: Query plan should reflect hive.spark.use.groupby.shuffle


Diffs
---

  itests/src/test/resources/testconfiguration.properties 19ff316
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RepartitionShuffler.java d0c708c
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 5f85f9e
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java b9901da
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java afbeccb
  ql/src/test/queries/clientpositive/spark_explain_groupbyshuffle.q PRE-CREATION
  ql/src/test/results/clientpositive/spark/spark_explain_groupbyshuffle.q.out PRE-CREATION


Diff: https://reviews.apache.org/r/60632/diff/1/


Testing
---

set hive.spark.use.groupby.shuffle=true;
explain select key, count(val) from t1 group by key;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 2)
      DagName: root_20170625202742_58335619-7107-4026-9911-43d2ec449088:2
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t1
                  Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: key (type: int), val (type: string)
                    outputColumnNames: key, val
                    Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: count(val)
                      keys: key (type: int)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col1 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
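
The part of the plan that names the shuffle is the Edges line, e.g. "Reducer 2 <- Map 1 (GROUP, 2)". When comparing EXPLAIN output under the two settings, it can help to pull that label out mechanically rather than eyeballing it. The following is only a rough sketch: `EdgeTypeSketch` and `shuffleType` are hypothetical names, not part of Hive, and the regular expression assumes the single-parent edge format shown in this message.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a tiny helper for reading the edge label from EXPLAIN text.
class EdgeTypeSketch {

  // Matches "<child> <- <parent> (<TYPE>, <n>)" and captures <TYPE>.
  // Assumes the parent name contains no parentheses.
  private static final Pattern EDGE =
      Pattern.compile("<-\\s*[^()]+\\(([^,()]+),\\s*\\d+\\)");

  /** Returns the label of the first edge on the line, or empty if there is none. */
  static Optional<String> shuffleType(String line) {
    Matcher m = EDGE.matcher(line);
    return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
  }

  public static void main(String[] args) {
    // Line copied from the plan in this message.
    System.out.println(shuffleType("        Reducer 2 <- Map 1 (GROUP, 2)"));
    // A line that is not an edge yields Optional.empty.
    System.out.println(shuffleType("      Edges:"));
  }
}
```

If a line lists several parents, this sketch returns only the first label; a fuller helper would loop over `find()`.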


set hive.spark.use.groupby.shuffle=false;
explain select key, count(val) from t1 group by key;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 2)
      DagName: root_20170625203122_3afe01dd-41cc-477e-9098-ddd58b37ad4e:3
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t1
                  Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: key (type: int), val (type: string)
                    outputColumnNames: key, val
                    Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: count(val)
                      keys: key (type: int)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 20 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col1 (type: bigint)
        Reducer 2
            Reduce Oper