[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-27 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r260656870
 
 

 ##
 File path: 
ql/src/test/results/clientpositive/distinct_groupby_without_cbo.q.out
 ##
 @@ -0,0 +1,1191 @@
+PREHOOK: query: explain select distinct count(*) from src1 where key in 
(128,146,150)
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here 
+POSTHOOK: query: explain select distinct count(*) from src1 where key in 
(128,146,150)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here 
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+  Stage: Stage-1
+Map Reduce
+  Map Operator Tree:
+  TableScan
+alias: src1
+filterExpr: (key) IN (128, 146, 150) (type: boolean)
+Statistics: Num rows: 25 Data size: 2150 Basic stats: COMPLETE 
Column stats: COMPLETE
+Filter Operator
+  predicate: (key) IN (128, 146, 150) (type: boolean)
+  Statistics: Num rows: 5 Data size: 430 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Select Operator
+Statistics: Num rows: 5 Data size: 430 Basic stats: COMPLETE 
Column stats: COMPLETE
+Group By Operator
+  aggregations: count()
+  mode: hash
+  outputColumnNames: _col0
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Reduce Output Operator
+sort order: 
+Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
+value expressions: _col0 (type: bigint)
+  Reduce Operator Tree:
+Group By Operator
+  aggregations: count(VALUE._col0)
+  mode: mergepartial
+  outputColumnNames: _col0
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: COMPLETE
+  File Output Operator
+compressed: false
+Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: COMPLETE
+table:
+input format: org.apache.hadoop.mapred.SequenceFileInputFormat
+output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
+serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+
+  Stage: Stage-0
+Fetch Operator
+  limit: -1
+  Processor Tree:
+ListSink
+
+PREHOOK: query: select distinct count(*) from src1 where key in (128,146,150)
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here 
+POSTHOOK: query: select distinct count(*) from src1 where key in (128,146,150)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here 
+3
+PREHOOK: query: explain select distinct * from (select distinct count(*) from 
src1 where key in (128,146,150)) as T
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here 
+POSTHOOK: query: explain select distinct * from (select distinct count(*) from 
src1 where key in (128,146,150)) as T
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here 
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-2 depends on stages: Stage-1
+  Stage-0 depends on stages: Stage-2
+
+STAGE PLANS:
+  Stage: Stage-1
+Map Reduce
+  Map Operator Tree:
+  TableScan
+alias: src1
+filterExpr: (key) IN (128, 146, 150) (type: boolean)
+Statistics: Num rows: 25 Data size: 2150 Basic stats: COMPLETE 
Column stats: COMPLETE
+Filter Operator
+  predicate: (key) IN (128, 146, 150) (type: boolean)
+  Statistics: Num rows: 5 Data size: 430 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Select Operator
+Statistics: Num rows: 5 Data size: 430 Basic stats: COMPLETE 
Column stats: COMPLETE
+Group By Operator
+  aggregations: count()
+  mode: hash
+  outputColumnNames: _col0
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Reduce Output Operator
+sort order: 
+Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
+value expressions: _col0 (type: bigint)
+  Reduce Operator Tree:
+Group By Operator
+  aggregations: count(VALUE._col0)
+  mode: mergepartial
+  outputColumnNames: _col0
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: COMPLETE
+  Group By Operator
+keys: _col0 (type: bigint)
+mode: hash
+

[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-27 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r260653962
 
 

 ##
 File path: 
ql/src/test/results/clientpositive/distinct_groupby_without_cbo.q.out
 ##
 @@ -0,0 +1,2018 @@
+PREHOOK: query: explain select distinct count(a.value) from src a group by 
a.key
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src
+ A masked pattern was here 
+POSTHOOK: query: explain select distinct count(a.value) from src a group by 
a.key
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src
+ A masked pattern was here 
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+  Stage: Stage-1
+Map Reduce
+  Map Operator Tree:
+  TableScan
+alias: a
+Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE 
Column stats: COMPLETE
+Select Operator
+  expressions: key (type: string), value (type: string)
+  outputColumnNames: key, value
+  Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Group By Operator
+aggregations: count(value)
+keys: key (type: string)
+mode: hash
+outputColumnNames: _col0, _col1
+Statistics: Num rows: 250 Data size: 23750 Basic stats: 
COMPLETE Column stats: COMPLETE
+Reduce Output Operator
+  key expressions: _col0 (type: string)
+  sort order: +
+  Map-reduce partition columns: _col0 (type: string)
+  Statistics: Num rows: 250 Data size: 23750 Basic stats: 
COMPLETE Column stats: COMPLETE
+  value expressions: _col1 (type: bigint)
+  Execution mode: vectorized
+  Reduce Operator Tree:
+Group By Operator
+  aggregations: count(VALUE._col0)
+  keys: KEY._col0 (type: string)
+  mode: mergepartial
+  outputColumnNames: _col0, _col1
+  Statistics: Num rows: 250 Data size: 23750 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Select Operator
+expressions: _col1 (type: bigint)
+outputColumnNames: _col0
+Statistics: Num rows: 250 Data size: 2000 Basic stats: COMPLETE 
Column stats: COMPLETE
+File Output Operator
+  compressed: false
+  Statistics: Num rows: 250 Data size: 2000 Basic stats: COMPLETE 
Column stats: COMPLETE
+  table:
+  input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
+  output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
+  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+
+  Stage: Stage-0
+Fetch Operator
+  limit: -1
+  Processor Tree:
+ListSink
+
+PREHOOK: query: select distinct count(a.value) from src a group by a.key
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src
+ A masked pattern was here 
+POSTHOOK: query: select distinct count(a.value) from src a group by a.key
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src
+ A masked pattern was here 
+3
+1
+2
+2
+2
+1
 
 Review comment:
   this resultset is not distinct; I guess we have an issue with the non-cbo 
path


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-27 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r260653418
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
 ##
 @@ -4195,21 +4195,18 @@ public static long unsetBit(long bitmap, int bitIdx) {
* DISTINCT, if present, will be handled when generating the SELECT.
*/
   List getGroupByForClause(QBParseInfo parseInfo, String dest) throws 
SemanticException {
-// When *not* invoked by CalcitePlanner, return the DISTINCT as a GBY
-// CBO will handle the DISTINCT in 
CalcitePlannerAction.genSelectLogicalPlan
 ASTNode selectExpr = parseInfo.getSelForClause(dest);
 Collection aggregateFunction = 
parseInfo.getDestToAggregationExprs().get(dest).values();
-if (isSelectDistinct(selectExpr) && !isGroupBy(selectExpr) && 
!isAggregateInSelect(selectExpr, aggregateFunction)) {
+if (isSelectDistinct(selectExpr) && !hasGroupBySibling(selectExpr) &&
+!isAggregateInSelect(selectExpr, aggregateFunction)) {
   List result = new ArrayList(selectExpr == null ? 0 : 
selectExpr.getChildCount());
 
 Review comment:
   remove this `== null` check ; it's not needed 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-27 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r260649158
 
 

 ##
 File path: ql/src/test/results/clientpositive/distinct_groupby.q.out
 ##
 @@ -0,0 +1,1709 @@
+PREHOOK: query: explain select distinct count(value) from src group by key
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src
+ A masked pattern was here 
+POSTHOOK: query: explain select distinct count(value) from src group by key
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src
+ A masked pattern was here 
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-2 depends on stages: Stage-1
+  Stage-0 depends on stages: Stage-2
+
+STAGE PLANS:
+  Stage: Stage-1
+Map Reduce
+  Map Operator Tree:
+  TableScan
+alias: src
+Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE 
Column stats: COMPLETE
+Select Operator
+  expressions: key (type: string), value (type: string)
+  outputColumnNames: key, value
+  Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Group By Operator
+aggregations: count(value)
+keys: key (type: string)
+mode: hash
+outputColumnNames: _col0, _col1
+Statistics: Num rows: 250 Data size: 23750 Basic stats: 
COMPLETE Column stats: COMPLETE
+Reduce Output Operator
+  key expressions: _col0 (type: string)
+  sort order: +
+  Map-reduce partition columns: _col0 (type: string)
+  Statistics: Num rows: 250 Data size: 23750 Basic stats: 
COMPLETE Column stats: COMPLETE
+  value expressions: _col1 (type: bigint)
+  Execution mode: vectorized
+  Reduce Operator Tree:
+Group By Operator
+  aggregations: count(VALUE._col0)
+  keys: KEY._col0 (type: string)
+  mode: mergepartial
+  outputColumnNames: _col0, _col1
+  Statistics: Num rows: 250 Data size: 23750 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Select Operator
+expressions: _col1 (type: bigint)
+outputColumnNames: _col1
+Statistics: Num rows: 250 Data size: 23750 Basic stats: COMPLETE 
Column stats: COMPLETE
+Group By Operator
+  keys: _col1 (type: bigint)
+  mode: hash
+  outputColumnNames: _col0
+  Statistics: Num rows: 125 Data size: 1000 Basic stats: COMPLETE 
Column stats: COMPLETE
+  File Output Operator
+compressed: false
+table:
+input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
+output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
+serde: 
org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
+
+  Stage: Stage-2
+Map Reduce
+  Map Operator Tree:
+  TableScan
+Reduce Output Operator
+  key expressions: _col0 (type: bigint)
+  sort order: +
+  Map-reduce partition columns: _col0 (type: bigint)
+  Statistics: Num rows: 125 Data size: 1000 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Execution mode: vectorized
+  Reduce Operator Tree:
+Group By Operator
+  keys: KEY._col0 (type: bigint)
+  mode: mergepartial
+  outputColumnNames: _col0
+  Statistics: Num rows: 125 Data size: 1000 Basic stats: COMPLETE 
Column stats: COMPLETE
+  File Output Operator
+compressed: false
+Statistics: Num rows: 125 Data size: 1000 Basic stats: COMPLETE 
Column stats: COMPLETE
+table:
+input format: org.apache.hadoop.mapred.SequenceFileInputFormat
+output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
+serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+
+  Stage: Stage-0
+Fetch Operator
+  limit: -1
+  Processor Tree:
+ListSink
+
+PREHOOK: query: select distinct count(value) from src group by key
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src
+ A masked pattern was here 
+POSTHOOK: query: select distinct count(value) from src group by key
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src
+ A masked pattern was here 
+1
+2
+3
+4
+5
+PREHOOK: query: explain select distinct count(*) from src1 where key in 
(128,146,150)
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here 
+POSTHOOK: query: explain select distinct count(*) from src1 where key in 
(128,146,150)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here 
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 

[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259878502
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
 ##
 @@ -4230,6 +4229,34 @@ public static long unsetBit(long bitmap, int bitIdx) {
 }
   }
 
+  protected boolean isGroupBy(ASTNode expr) {
+boolean isGroupBy = false;
+if (expr.getParent() != null && expr.getParent() instanceof Node)
+for (Node sibling : ((Node)expr.getParent()).getChildren()) {
+  isGroupBy |= sibling instanceof ASTNode && ((ASTNode)sibling).getType() 
== HiveParser.TOK_GROUPBY;
+}
+
+return isGroupBy;
+  }
+
+  protected boolean isSelectDistinct(ASTNode expr) {
+return expr.getType() == HiveParser.TOK_SELECTDI;
+  }
+
+  protected boolean isAggregateInSelect(Node node, Collection 
aggregateFunction) {
+if (node.getChildren() == null) {
+  return false;
+}
+
+for (Node child : node.getChildren()) {
 
 Review comment:
   I was thinking something really odd:
   ```
   select distinct (select count(*) from t where t.a=e.a) from e
   ```
   but in this case (beyond that it might not accept by hive at all) the count 
aggregate is not present at the top level.
   Do you know an example when this method returns false; however there are 
aggreagations being done?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259780016
 
 

 ##
 File path: ql/src/test/results/clientpositive/distinct_groupby.q.out
 ##
 @@ -0,0 +1,1530 @@
+PREHOOK: query: explain select distinct count(*) from src1 where key in 
(128,146,150)
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here 
+POSTHOOK: query: explain select distinct count(*) from src1 where key in 
(128,146,150)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here 
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+  Stage: Stage-1
+Map Reduce
+  Map Operator Tree:
+  TableScan
+alias: src1
+filterExpr: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: 
boolean)
+Statistics: Num rows: 25 Data size: 2150 Basic stats: COMPLETE 
Column stats: COMPLETE
+Filter Operator
+  predicate: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: 
boolean)
+  Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Select Operator
+Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE 
Column stats: COMPLETE
+Group By Operator
+  aggregations: count()
+  mode: hash
+  outputColumnNames: _col0
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Reduce Output Operator
+sort order: 
+Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
+value expressions: _col0 (type: bigint)
+  Execution mode: vectorized
+  Reduce Operator Tree:
+Group By Operator
+  aggregations: count(VALUE._col0)
+  mode: mergepartial
+  outputColumnNames: _col0
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: COMPLETE
+  File Output Operator
+compressed: false
+Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: COMPLETE
+table:
+input format: org.apache.hadoop.mapred.SequenceFileInputFormat
+output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
+serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+
+  Stage: Stage-0
+Fetch Operator
+  limit: -1
+  Processor Tree:
+ListSink
+
+PREHOOK: query: select distinct count(*) from src1 where key in (128,146,150)
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here 
+POSTHOOK: query: select distinct count(*) from src1 where key in (128,146,150)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here 
+3
+PREHOOK: query: explain select distinct * from (select distinct count(*) from 
src1 where key in (128,146,150)) as T
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here 
+POSTHOOK: query: explain select distinct * from (select distinct count(*) from 
src1 where key in (128,146,150)) as T
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here 
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+  Stage: Stage-1
+Map Reduce
+  Map Operator Tree:
+  TableScan
+alias: src1
+filterExpr: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: 
boolean)
+Statistics: Num rows: 25 Data size: 2150 Basic stats: COMPLETE 
Column stats: COMPLETE
+Filter Operator
+  predicate: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: 
boolean)
+  Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Select Operator
+Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE 
Column stats: COMPLETE
+Group By Operator
+  aggregations: count()
+  mode: hash
+  outputColumnNames: _col0
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Reduce Output Operator
+sort order: 
+Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
+value expressions: _col0 (type: bigint)
+  Execution mode: vectorized
+  Reduce Operator Tree:
+Group By Operator
+  aggregations: count(VALUE._col0)
+  mode: mergepartial
+  outputColumnNames: _col0
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 

[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259788172
 
 

 ##
 File path: ql/src/test/results/clientpositive/distinct_groupby.q.out
 ##
 @@ -0,0 +1,1530 @@
+PREHOOK: query: explain select distinct count(*) from src1 where key in 
(128,146,150)
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here 
+POSTHOOK: query: explain select distinct count(*) from src1 where key in 
(128,146,150)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here 
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+  Stage: Stage-1
+Map Reduce
+  Map Operator Tree:
+  TableScan
+alias: src1
+filterExpr: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: 
boolean)
+Statistics: Num rows: 25 Data size: 2150 Basic stats: COMPLETE 
Column stats: COMPLETE
+Filter Operator
+  predicate: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: 
boolean)
+  Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Select Operator
+Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE 
Column stats: COMPLETE
+Group By Operator
+  aggregations: count()
+  mode: hash
+  outputColumnNames: _col0
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Reduce Output Operator
+sort order: 
+Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
+value expressions: _col0 (type: bigint)
+  Execution mode: vectorized
+  Reduce Operator Tree:
+Group By Operator
+  aggregations: count(VALUE._col0)
+  mode: mergepartial
+  outputColumnNames: _col0
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: COMPLETE
+  File Output Operator
+compressed: false
+Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: COMPLETE
+table:
+input format: org.apache.hadoop.mapred.SequenceFileInputFormat
+output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
+serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+
+  Stage: Stage-0
+Fetch Operator
+  limit: -1
+  Processor Tree:
+ListSink
+
+PREHOOK: query: select distinct count(*) from src1 where key in (128,146,150)
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here 
+POSTHOOK: query: select distinct count(*) from src1 where key in (128,146,150)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here 
+3
+PREHOOK: query: explain select distinct * from (select distinct count(*) from 
src1 where key in (128,146,150)) as T
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here 
+POSTHOOK: query: explain select distinct * from (select distinct count(*) from 
src1 where key in (128,146,150)) as T
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here 
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+  Stage: Stage-1
+Map Reduce
+  Map Operator Tree:
+  TableScan
+alias: src1
+filterExpr: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: 
boolean)
+Statistics: Num rows: 25 Data size: 2150 Basic stats: COMPLETE 
Column stats: COMPLETE
+Filter Operator
+  predicate: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: 
boolean)
+  Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Select Operator
+Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE 
Column stats: COMPLETE
+Group By Operator
+  aggregations: count()
+  mode: hash
+  outputColumnNames: _col0
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Reduce Output Operator
+sort order: 
+Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
+value expressions: _col0 (type: bigint)
+  Execution mode: vectorized
+  Reduce Operator Tree:
+Group By Operator
+  aggregations: count(VALUE._col0)
+  mode: mergepartial
+  outputColumnNames: _col0
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 

[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259784389
 
 

 ##
 File path: ql/src/test/results/clientpositive/distinct_groupby.q.out
 ##
 @@ -0,0 +1,1530 @@
+PREHOOK: query: explain select distinct count(*) from src1 where key in 
(128,146,150)
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here 
+POSTHOOK: query: explain select distinct count(*) from src1 where key in 
(128,146,150)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here 
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+  Stage: Stage-1
+Map Reduce
+  Map Operator Tree:
+  TableScan
+alias: src1
+filterExpr: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: 
boolean)
+Statistics: Num rows: 25 Data size: 2150 Basic stats: COMPLETE 
Column stats: COMPLETE
+Filter Operator
+  predicate: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: 
boolean)
+  Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Select Operator
+Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE 
Column stats: COMPLETE
+Group By Operator
+  aggregations: count()
+  mode: hash
+  outputColumnNames: _col0
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Reduce Output Operator
+sort order: 
+Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
+value expressions: _col0 (type: bigint)
+  Execution mode: vectorized
+  Reduce Operator Tree:
+Group By Operator
+  aggregations: count(VALUE._col0)
+  mode: mergepartial
+  outputColumnNames: _col0
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: COMPLETE
+  File Output Operator
+compressed: false
+Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: COMPLETE
+table:
+input format: org.apache.hadoop.mapred.SequenceFileInputFormat
+output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
+serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+
+  Stage: Stage-0
+Fetch Operator
+  limit: -1
+  Processor Tree:
+ListSink
+
+PREHOOK: query: select distinct count(*) from src1 where key in (128,146,150)
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here 
+POSTHOOK: query: select distinct count(*) from src1 where key in (128,146,150)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here 
+3
+PREHOOK: query: explain select distinct * from (select distinct count(*) from 
src1 where key in (128,146,150)) as T
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here 
+POSTHOOK: query: explain select distinct * from (select distinct count(*) from 
src1 where key in (128,146,150)) as T
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here 
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+  Stage: Stage-1
+Map Reduce
+  Map Operator Tree:
+  TableScan
+alias: src1
+filterExpr: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: 
boolean)
+Statistics: Num rows: 25 Data size: 2150 Basic stats: COMPLETE 
Column stats: COMPLETE
+Filter Operator
+  predicate: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: 
boolean)
+  Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Select Operator
+Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE 
Column stats: COMPLETE
+Group By Operator
+  aggregations: count()
+  mode: hash
+  outputColumnNames: _col0
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
+  Reduce Output Operator
+sort order: 
+Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
+value expressions: _col0 (type: bigint)
+  Execution mode: vectorized
+  Reduce Operator Tree:
+Group By Operator
+  aggregations: count(VALUE._col0)
+  mode: mergepartial
+  outputColumnNames: _col0
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 

[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259794509
 
 

 ##
 File path: ql/src/test/queries/clientpositive/distinct_groupby.q
 ##
 @@ -0,0 +1,57 @@
+--! qt:dataset:src1
+
 
 Review comment:
   could we have some tests for non-cbo path as well?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259793850
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
 ##
 @@ -4230,6 +4229,34 @@ public static long unsetBit(long bitmap, int bitIdx) {
 }
   }
 
+  protected boolean isGroupBy(ASTNode expr) {
+boolean isGroupBy = false;
+if (expr.getParent() != null && expr.getParent() instanceof Node)
+for (Node sibling : ((Node)expr.getParent()).getChildren()) {
+  isGroupBy |= sibling instanceof ASTNode && ((ASTNode)sibling).getType() 
== HiveParser.TOK_GROUPBY;
+}
+
+return isGroupBy;
+  }
+
+  protected boolean isSelectDistinct(ASTNode expr) {
+return expr.getType() == HiveParser.TOK_SELECTDI;
+  }
+
+  protected boolean isAggregateInSelect(Node node, Collection 
aggregateFunction) {
+if (node.getChildren() == null) {
+  return false;
+}
+
+for (Node child : node.getChildren()) {
 
 Review comment:
   is it safe to traverse all the children? I'm wondering if can't we can get a 
false + somehow...


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259763434
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
 ##
 @@ -3730,7 +3697,9 @@ private RelNode genGBLogicalPlan(QB qb, RelNode srcRel) 
throws SemanticException
 ASTNode node = (ASTNode) selExprList.getChild(0).getChild(0);
 if (node.getToken().getType() == HiveParser.TOK_ALLCOLREF) {
   // As we said before, here we use genSelectLogicalPlan to rewrite 
AllColRef
-  srcRel = genSelectLogicalPlan(qb, srcRel, srcRel, null, null, 
true).getKey();
+  if (!(isSelectDistinct(selExprList) && isGroupBy(selExprList))) {
 
 Review comment:
   `isSelectDistinct(selExprList)` is always true here; condition could be 
changed to `!isGroupBy`
   
   I'm not sure if I understand why we would "skip" rewrite `TOK_ALLCOLREF`  in 
case it's not in groupby; I don't think the below statements make sense:
   ```
   create table t (c1 integer,c2 integer);
   select distinct * from t group by c1;
   ```
   I might be missing something here..


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259776469
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
 ##
 @@ -4194,27 +4191,29 @@ public static long unsetBit(long bitmap, int bitIdx) {
   }
 
   /**
-   * This function is a wrapper of parseInfo.getGroupByForClause which
-   * automatically translates SELECT DISTINCT a,b,c to SELECT a,b,c GROUP BY
-   * a,b,c.
+   * Returns the GBY, if present;
+   * DISTINCT, if present, will be handled when generating the SELECT.
*/
   List getGroupByForClause(QBParseInfo parseInfo, String dest) throws 
SemanticException {
-if (parseInfo.getSelForClause(dest).getToken().getType() == 
HiveParser.TOK_SELECTDI) {
-  ASTNode selectExprs = parseInfo.getSelForClause(dest);
-  List result = new ArrayList(selectExprs == null ? 0
-  : selectExprs.getChildCount());
-  if (selectExprs != null) {
-for (int i = 0; i < selectExprs.getChildCount(); ++i) {
-  if (((ASTNode) selectExprs.getChild(i)).getToken().getType() == 
HiveParser.QUERY_HINT) {
+// When *not* invoked by CalcitePlanner, return the DISTINCT as a GBY
+// CBO will handle the DISTINCT in 
CalcitePlannerAction.genSelectLogicalPlan
+ASTNode selectExpr = parseInfo.getSelForClause(dest);
+Collection aggregateFunction = 
parseInfo.getDestToAggregationExprs().get(dest).values();
+if (isSelectDistinct(selectExpr) && !isGroupBy(selectExpr) && 
!isAggregateInSelect(selectExpr, aggregateFunction)) {
+  List result = new ArrayList(selectExpr == null ? 0 : 
selectExpr.getChildCount());
+  if (selectExpr != null) {
 
 Review comment:
   please kill these null checks
   if it would be null we would already have run into an NPE in the earlier and 
in the new code


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259778186
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
 ##
 @@ -4194,27 +4191,29 @@ public static long unsetBit(long bitmap, int bitIdx) {
   }
 
   /**
-   * This function is a wrapper of parseInfo.getGroupByForClause which
-   * automatically translates SELECT DISTINCT a,b,c to SELECT a,b,c GROUP BY
-   * a,b,c.
+   * Returns the GBY, if present;
+   * DISTINCT, if present, will be handled when generating the SELECT.
*/
   List getGroupByForClause(QBParseInfo parseInfo, String dest) throws 
SemanticException {
-if (parseInfo.getSelForClause(dest).getToken().getType() == 
HiveParser.TOK_SELECTDI) {
-  ASTNode selectExprs = parseInfo.getSelForClause(dest);
-  List result = new ArrayList(selectExprs == null ? 0
-  : selectExprs.getChildCount());
-  if (selectExprs != null) {
-for (int i = 0; i < selectExprs.getChildCount(); ++i) {
-  if (((ASTNode) selectExprs.getChild(i)).getToken().getType() == 
HiveParser.QUERY_HINT) {
+// When *not* invoked by CalcitePlanner, return the DISTINCT as a GBY
+// CBO will handle the DISTINCT in 
CalcitePlannerAction.genSelectLogicalPlan
+ASTNode selectExpr = parseInfo.getSelForClause(dest);
+Collection aggregateFunction = 
parseInfo.getDestToAggregationExprs().get(dest).values();
+if (isSelectDistinct(selectExpr) && !isGroupBy(selectExpr) && 
!isAggregateInSelect(selectExpr, aggregateFunction)) {
 
 Review comment:
   what plan are we ending up if we have select distinct which has an aggregate 
(this condition will be false) and the cbo is not working? is there a test for 
that?
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259792059
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
 ##
 @@ -4230,6 +4229,34 @@ public static long unsetBit(long bitmap, int bitIdx) {
 }
   }
 
+  protected boolean isGroupBy(ASTNode expr) {
 
 Review comment:
   this method name suggest to me that the expression *is* a groupby; but 
instead it seems it checks for wether the expression has a "brother" which is a 
groupby


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services