[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=502643&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-502643
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 20/Oct/20 11:56
Start Date: 20/Oct/20 11:56
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk merged pull request #1553:
URL: https://github.com/apache/hive/pull/1553


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 502643)
Time Spent: 3h 40m  (was: 3.5h)

> Enhance shared work optimizer to merge scans with filters on both sides
> ---
>
> Key: HIVE-24231
> URL: https://issues.apache.org/jira/browse/HIVE-24231
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=502642&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-502642
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 20/Oct/20 11:56
Start Date: 20/Oct/20 11:56
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r508438776



##
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java
##
@@ -284,6 +304,54 @@ public ParseContext transform(ParseContext pctx) throws SemanticException {
     return pctx;
   }
 
+  /** SharedWorkOptimization strategy modes */
+  public enum Mode {
+    /**
+     * Merges two identical subtrees.
+     */
+    SubtreeMerge,
+    /**
+     * Merges a filtered scan into a non-filtered scan.
+     *
+     * In case we are already scanning the whole table - we should not scan it twice.
+     */
+    RemoveSemijoin,
+    /**
+     * Fuses two filtered table scans into a single one.
+     *
+     * Dynamic filter subtree is kept on both sides - but the table is onlt scanned once.

Review comment:
   added fix to HIVE-24241
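
The third mode in the quoted Javadoc can be pictured with a small sketch. Plain Java collections stand in for Hive operators here; `FusedScanSketch`, `scan`, `filter` and `fused` are hypothetical names rather than Hive classes, and OR-ing the two scan filters is an assumption based on merged `filterExpr` values such as `(key is not null or (key > '9'))` in the q.out diffs quoted elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Illustration only: plain collections stand in for Hive operators. */
final class FusedScanSketch {

  /** Number of passes made over the table so far. */
  static int scanCount = 0;

  /** Hypothetical table scan with a pushed-down filter; counts each pass. */
  static List<Integer> scan(List<Integer> table, Predicate<Integer> scanFilter) {
    scanCount++;
    return filter(table, scanFilter);
  }

  /** Hypothetical filter operator: keeps the rows matching the predicate. */
  static List<Integer> filter(List<Integer> rows, Predicate<Integer> predicate) {
    List<Integer> kept = new ArrayList<>();
    for (Integer row : rows) {
      if (predicate.test(row)) {
        kept.add(row);
      }
    }
    return kept;
  }

  /**
   * Fused plan: one scan with (left OR right) as its filter, after which
   * each branch re-applies its own predicate to the shared rows.
   */
  static List<List<Integer>> fused(List<Integer> table,
      Predicate<Integer> left, Predicate<Integer> right) {
    List<Integer> shared = scan(table, left.or(right));
    List<List<Integer>> branches = new ArrayList<>();
    branches.add(filter(shared, left));
    branches.add(filter(shared, right));
    return branches;
  }

  public static void main(String[] args) {
    // Made-up sample rows and predicates.
    List<Integer> table = Arrays.asList(1, 5, 9, 12, 20);
    List<List<Integer>> branches = fused(table, v -> v < 6, v -> v > 10);
    System.out.println("left branch:  " + branches.get(0));
    System.out.println("right branch: " + branches.get(1));
    System.out.println("table scans:  " + scanCount);
  }
}
```

Each branch still sees exactly the rows its own filter admits, which is why the filter subtree has to stay on both sides even though the table is read once.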







Issue Time Tracking
---

Worklog Id: (was: 502642)
Time Spent: 3.5h  (was: 3h 20m)

> Enhance shared work optimizer to merge scans with filters on both sides
> ---
>
> Key: HIVE-24231
> URL: https://issues.apache.org/jira/browse/HIVE-24231
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=502520&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-502520
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 20/Oct/20 05:39
Start Date: 20/Oct/20 05:39
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r508220345



##
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java
##
@@ -284,6 +304,54 @@ public ParseContext transform(ParseContext pctx) throws SemanticException {
     return pctx;
   }
 
+  /** SharedWorkOptimization strategy modes */
+  public enum Mode {
+    /**
+     * Merges two identical subtrees.
+     */
+    SubtreeMerge,
+    /**
+     * Merges a filtered scan into a non-filtered scan.
+     *
+     * In case we are already scanning the whole table - we should not scan it twice.
+     */
+    RemoveSemijoin,
+    /**
+     * Fuses two filtered table scans into a single one.
+     *
+     * Dynamic filter subtree is kept on both sides - but the table is onlt scanned once.

Review comment:
   Typo: "onlt" should be "only".







Issue Time Tracking
---

Worklog Id: (was: 502520)
Time Spent: 3h 20m  (was: 3h 10m)

> Enhance shared work optimizer to merge scans with filters on both sides
> ---
>
> Key: HIVE-24231
> URL: https://issues.apache.org/jira/browse/HIVE-24231
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=500151&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-500151
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 13/Oct/20 16:30
Start Date: 13/Oct/20 16:30
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r504075393



##
File path: ql/src/test/results/clientpositive/llap/vectorized_dynamic_partition_pruning.q.out
##
@@ -4816,18 +4816,34 @@ STAGE PLANS:
   alias: srcpart
   filterExpr: ds is not null (type: boolean)
   Statistics: Num rows: 2000 Data size: 389248 Basic stats: COMPLETE Column stats: COMPLETE
-  Group By Operator
-keys: ds (type: string)
-minReductionHashAggr: 0.99
-mode: hash
-outputColumnNames: _col0
-Statistics: Num rows: 2 Data size: 368 Basic stats: COMPLETE Column stats: COMPLETE
-Reduce Output Operator
-  key expressions: _col0 (type: string)
-  null sort order: z
-  sort order: +
-  Map-reduce partition columns: _col0 (type: string)
+  Filter Operator

Review comment:
   These two Filter Operators will be merged by the "downstream merge" patch.







Issue Time Tracking
---

Worklog Id: (was: 500151)
Time Spent: 3h 10m  (was: 3h)

> Enhance shared work optimizer to merge scans with filters on both sides
> ---
>
> Key: HIVE-24231
> URL: https://issues.apache.org/jira/browse/HIVE-24231
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=499982&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-499982
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 13/Oct/20 10:31
Start Date: 13/Oct/20 10:31
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r503843227



##
File path: ql/src/test/results/clientpositive/perf/tez/constraints/query2.q.out
##
@@ -128,46 +128,104 @@ Plan optimized by CBO.
 
 Vertex dependency in root stage
 Map 1 <- Union 2 (CONTAINS)
-Map 9 <- Union 2 (CONTAINS)
-Reducer 3 <- Map 10 (SIMPLE_EDGE), Union 2 (SIMPLE_EDGE)
+Map 13 <- Union 14 (CONTAINS)
+Map 15 <- Union 14 (CONTAINS)
+Map 8 <- Union 2 (CONTAINS)
+Reducer 10 <- Map 9 (SIMPLE_EDGE), Union 14 (SIMPLE_EDGE)
+Reducer 11 <- Reducer 10 (SIMPLE_EDGE)
+Reducer 12 <- Map 9 (SIMPLE_EDGE), Reducer 11 (SIMPLE_EDGE)
+Reducer 3 <- Map 9 (SIMPLE_EDGE), Union 2 (SIMPLE_EDGE)
 Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
-Reducer 5 <- Map 10 (SIMPLE_EDGE), Reducer 4 (SIMPLE_EDGE)
-Reducer 6 <- Reducer 5 (SIMPLE_EDGE), Reducer 8 (SIMPLE_EDGE)
+Reducer 5 <- Map 9 (SIMPLE_EDGE), Reducer 4 (SIMPLE_EDGE)
+Reducer 6 <- Reducer 12 (SIMPLE_EDGE), Reducer 5 (SIMPLE_EDGE)
 Reducer 7 <- Reducer 6 (SIMPLE_EDGE)
-Reducer 8 <- Map 10 (SIMPLE_EDGE), Reducer 4 (SIMPLE_EDGE)
 
 Stage-0
   Fetch Operator
 limit:-1
 Stage-1
   Reducer 7 vectorized
-  File Output Operator [FS_173]
-Select Operator [SEL_172] (rows=12881 width=788)
+  File Output Operator [FS_187]
+Select Operator [SEL_186] (rows=12881 width=788)
   Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"]
 <-Reducer 6 [SIMPLE_EDGE]
   SHUFFLE [RS_57]
 Select Operator [SEL_56] (rows=12881 width=788)
   Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"]
   Merge Join Operator [MERGEJOIN_146] (rows=12881 width=1572)
 Conds:RS_53.(_col0 - 53)=RS_54._col0(Inner),Output:["_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16"]
+  <-Reducer 12 [SIMPLE_EDGE]
+SHUFFLE [RS_54]
+  PartitionCols:_col0
+  Merge Join Operator [MERGEJOIN_145] (rows=652 width=788)
+Conds:RS_185._col0=RS_181._col0(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"]
+  <-Map 9 [SIMPLE_EDGE] vectorized
+SHUFFLE [RS_181]
+  PartitionCols:_col0
+  Select Operator [SEL_177] (rows=652 width=4)
+Output:["_col0"]
+Filter Operator [FIL_173] (rows=652 width=8)
+  predicate:((d_year = 2001) and d_week_seq is not null)
+  TableScan [TS_8] (rows=73049 width=99)
+default@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_week_seq","d_day_name","d_year"]
+  <-Reducer 11 [SIMPLE_EDGE] vectorized
+SHUFFLE [RS_185]
+  PartitionCols:_col0
+  Group By Operator [GBY_184] (rows=13152 width=788)
+Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"],aggregations:["sum(VALUE._col0)","sum(VALUE._col1)","sum(VALUE._col2)","sum(VALUE._col3)","sum(VALUE._col4)","sum(VALUE._col5)","sum(VALUE._col6)"],keys:KEY._col0
+  <-Reducer 10 [SIMPLE_EDGE]
+SHUFFLE [RS_40]
+  PartitionCols:_col0
+  Group By Operator [GBY_39] (rows=3182784 width=788)
+Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"],aggregations:["sum(_col1)","sum(_col2)","sum(_col3)","sum(_col4)","sum(_col5)","sum(_col6)","sum(_col7)"],keys:_col0
+Select Operator [SEL_37] (rows=430516591 width=143)
+  Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"]
+  Merge Join Operator [MERGEJOIN_144] (rows=430516591 width=143)
+Conds:Union 14._col0=RS_180._col0(Inner),Output:["_col1","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10"]
+  <-Map 9 [SIMPLE_EDGE] vectorized
+SHUFFLE [RS_180]
+  PartitionCols:_col0
+  Select Operator [SEL_176] (rows=73049 width=36)
+

[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=499981&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-499981
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 13/Oct/20 10:27
Start Date: 13/Oct/20 10:27
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r503840610



##
File path: ql/src/test/results/clientpositive/perf/tez/constraints/query44.q.out
##
@@ -103,102 +107,143 @@ Stage-0
 Top N Key Operator [TNK_99] (rows=6951 width=218)
   keys:_col1,top n:100
   Merge Join Operator [MERGEJOIN_116] (rows=6951 width=218)
-Conds:RS_66._col2=RS_146._col0(Inner),Output:["_col1","_col5","_col7"]
-  <-Map 11 [SIMPLE_EDGE] vectorized
-SHUFFLE [RS_146]
+Conds:RS_66._col2=RS_163._col0(Inner),Output:["_col1","_col5","_col7"]
+  <-Map 14 [SIMPLE_EDGE] vectorized
+SHUFFLE [RS_163]
   PartitionCols:_col0
-  Select Operator [SEL_144] (rows=462000 width=111)
+  Select Operator [SEL_161] (rows=462000 width=111)
 Output:["_col0","_col1"]
 TableScan [TS_56] (rows=462000 width=111)
   default@item,i1,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_product_name"]
   <-Reducer 6 [SIMPLE_EDGE]
 SHUFFLE [RS_66]
   PartitionCols:_col2
   Merge Join Operator [MERGEJOIN_115] (rows=6951 width=115)
-Conds:RS_63._col0=RS_145._col0(Inner),Output:["_col1","_col2","_col5"]
-  <-Map 11 [SIMPLE_EDGE] vectorized
-SHUFFLE [RS_145]
+Conds:RS_63._col0=RS_162._col0(Inner),Output:["_col1","_col2","_col5"]
+  <-Map 14 [SIMPLE_EDGE] vectorized
+SHUFFLE [RS_162]
   PartitionCols:_col0
-   Please refer to the previous Select Operator [SEL_144]
+   Please refer to the previous Select Operator [SEL_161]
   <-Reducer 5 [SIMPLE_EDGE]
 SHUFFLE [RS_63]
   PartitionCols:_col0
   Merge Join Operator [MERGEJOIN_114] (rows=6951 width=12)
-Conds:RS_138._col1=RS_143._col1(Inner),Output:["_col0","_col1","_col2"]
-  <-Reducer 4 [SIMPLE_EDGE] vectorized
-SHUFFLE [RS_138]
+Conds:RS_146._col1=RS_160._col1(Inner),Output:["_col0","_col1","_col2"]
+  <-Reducer 12 [SIMPLE_EDGE] vectorized
+SHUFFLE [RS_160]
   PartitionCols:_col1
-  Select Operator [SEL_137] (rows=6951 width=8)
+  Select Operator [SEL_159] (rows=6951 width=8)
 Output:["_col0","_col1"]
-Filter Operator [FIL_136] (rows=6951 width=116)
+Filter Operator [FIL_158] (rows=6951 width=116)
   predicate:(rank_window_0 < 11)
-  PTF Operator [PTF_135] (rows=20854 width=116)
-Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col1 ASC NULLS LAST","partition by:":"0"}]
-Select Operator [SEL_134] (rows=20854 width=116)
+  PTF Operator [PTF_157] (rows=20854 width=116)
+Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col1 DESC NULLS FIRST","partition by:":"0"}]
+Select Operator [SEL_156] (rows=20854 width=116)
   Output:["_col0","_col1"]
-<-Reducer 3 [SIMPLE_EDGE]
-  SHUFFLE [RS_21]
+<-Reducer 11 [SIMPLE_EDGE]
+  SHUFFLE [RS_49]
 PartitionCols:0
-Top N Key Operator [TNK_100] (rows=20854 width=228)
+Top N Key Operator [TNK_101] (rows=20854 width=228)
   keys:_col1,top n:11
-  Filter Operator [FIL_20] (rows=20854 width=228)
+  

[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=499979&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-499979
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 13/Oct/20 10:26
Start Date: 13/Oct/20 10:26
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r503840362



##
File path: ql/src/test/results/clientpositive/llap/subquery_in.q.out
##
@@ -5078,9 +5087,10 @@ STAGE PLANS:
   Edges:
 Reducer 2 <- Map 1 (SIMPLE_EDGE), Reducer 5 (SIMPLE_EDGE)
 Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Reducer 6 (SIMPLE_EDGE)
-Reducer 4 <- Reducer 3 (SIMPLE_EDGE), Reducer 6 (SIMPLE_EDGE)
+Reducer 4 <- Reducer 3 (SIMPLE_EDGE), Reducer 7 (SIMPLE_EDGE)
 Reducer 5 <- Map 1 (SIMPLE_EDGE)
 Reducer 6 <- Map 1 (SIMPLE_EDGE)
+Reducer 7 <- Map 1 (SIMPLE_EDGE)

Review comment:
   yes; I was too brave to enable ts merging for every existing op... :)







Issue Time Tracking
---

Worklog Id: (was: 499979)
Time Spent: 2h 40m  (was: 2.5h)

> Enhance shared work optimizer to merge scans with filters on both sides
> ---
>
> Key: HIVE-24231
> URL: https://issues.apache.org/jira/browse/HIVE-24231
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=499978&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-499978
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 13/Oct/20 10:24
Start Date: 13/Oct/20 10:24
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r503838695



##
File path: ql/src/test/results/clientpositive/llap/subquery_in.q.out
##
@@ -4355,6 +4355,9 @@ STAGE PLANS:
 sort order: +
 Map-reduce partition columns: _col0 (type: string)
 Statistics: Num rows: 13 Data size: 1352 Basic stats: COMPLETE Column stats: COMPLETE
+  Filter Operator

Review comment:
   this difference is gone - no more ts merging for now; will get back to it later







Issue Time Tracking
---

Worklog Id: (was: 499978)
Time Spent: 2.5h  (was: 2h 20m)

> Enhance shared work optimizer to merge scans with filters on both sides
> ---
>
> Key: HIVE-24231
> URL: https://issues.apache.org/jira/browse/HIVE-24231
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=499977&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-499977
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 13/Oct/20 10:23
Start Date: 13/Oct/20 10:23
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r503838522



##
File path: ql/src/test/results/clientpositive/llap/special_character_in_tabnames_1.q.out
##
@@ -1986,18 +1986,18 @@ STAGE PLANS:
 Tez
  A masked pattern was here 
   Edges:
-Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 6 (SIMPLE_EDGE)
+Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 7 (SIMPLE_EDGE)
 Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
-Reducer 4 <- Reducer 3 (SIMPLE_EDGE), Reducer 7 (SIMPLE_EDGE)
+Reducer 4 <- Reducer 3 (SIMPLE_EDGE), Reducer 6 (SIMPLE_EDGE)
 Reducer 5 <- Reducer 4 (SIMPLE_EDGE)
-Reducer 7 <- Map 6 (SIMPLE_EDGE)
+Reducer 6 <- Map 1 (SIMPLE_EDGE)
  A masked pattern was here 
   Vertices:
 Map 1 
 Map Operator Tree:
 TableScan
   alias: b
-  filterExpr: key is not null (type: boolean)
+  filterExpr: (key is not null or (key > '9')) (type: boolean)

Review comment:
   this difference is gone - no more ts merging for now; will get back to it later







Issue Time Tracking
---

Worklog Id: (was: 499977)
Time Spent: 2h 20m  (was: 2h 10m)

> Enhance shared work optimizer to merge scans with filters on both sides
> ---
>
> Key: HIVE-24231
> URL: https://issues.apache.org/jira/browse/HIVE-24231
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=499976&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-499976
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 13/Oct/20 10:23
Start Date: 13/Oct/20 10:23
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r503838169



##
File path: ql/src/test/results/clientpositive/llap/sharedworkresidual.q.out
##
@@ -143,6 +143,10 @@ STAGE PLANS:
 sort order: 
 Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: NONE
 value expressions: _col0 (type: string)
+Select Operator

Review comment:
   this difference is gone - no more ts merging for now; will get back to it later







Issue Time Tracking
---

Worklog Id: (was: 499976)
Time Spent: 2h 10m  (was: 2h)

> Enhance shared work optimizer to merge scans with filters on both sides
> ---
>
> Key: HIVE-24231
> URL: https://issues.apache.org/jira/browse/HIVE-24231
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=499975&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-499975
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 13/Oct/20 10:22
Start Date: 13/Oct/20 10:22
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r503837757



##
File path: ql/src/test/results/clientpositive/llap/ppd_repeated_alias.q.out
##
@@ -348,14 +348,14 @@ STAGE PLANS:
  A masked pattern was here 
   Edges:
 Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE)
-Reducer 3 <- Map 4 (XPROD_EDGE), Reducer 2 (XPROD_EDGE)
+Reducer 3 <- Map 1 (XPROD_EDGE), Reducer 2 (XPROD_EDGE)
  A masked pattern was here 
   Vertices:
 Map 1 
 Map Operator Tree:
 TableScan
   alias: c
-  filterExpr: foo is not null (type: boolean)
+  filterExpr: (foo is not null or (foo = 1)) (type: boolean)

Review comment:
   this difference is gone - no more ts merging for now; will get back to it later
   
   yes; we might want to run simplification on these filter expressions - I've opened HIVE-24269 to add that
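
To see why simplification should be safe for the merged expression in this diff, here is a small sketch that evaluates `foo is not null or (foo = 1)` and plain `foo is not null` under SQL three-valued logic. The class and method names are hypothetical, `Boolean` with `null` stands in for UNKNOWN, and this is not how Hive represents expressions; it only checks that the two filters keep the same rows.

```java
import java.util.Arrays;
import java.util.List;

/** Illustration only: SQL three-valued logic modelled with Boolean, null meaning UNKNOWN. */
final class FilterSimplificationSketch {

  /** SQL OR: TRUE if either side is TRUE, else UNKNOWN if either side is UNKNOWN, else FALSE. */
  static Boolean or3(Boolean a, Boolean b) {
    if (Boolean.TRUE.equals(a) || Boolean.TRUE.equals(b)) {
      return Boolean.TRUE;
    }
    if (a == null || b == null) {
      return null;
    }
    return Boolean.FALSE;
  }

  /** "foo is not null" never evaluates to UNKNOWN. */
  static Boolean isNotNull(Integer foo) {
    return foo != null;
  }

  /** "foo = 1" evaluates to UNKNOWN when foo is NULL. */
  static Boolean equalsOne(Integer foo) {
    if (foo == null) {
      return null;
    }
    return foo == 1;
  }

  /** Merged scan filter as printed in the plan; a row is kept only when the result is TRUE. */
  static boolean mergedKeeps(Integer foo) {
    return Boolean.TRUE.equals(or3(isNotNull(foo), equalsOne(foo)));
  }

  /** Candidate simplification: foo is not null. */
  static boolean simplifiedKeeps(Integer foo) {
    return Boolean.TRUE.equals(isNotNull(foo));
  }

  public static void main(String[] args) {
    // Made-up sample values, including NULL.
    List<Integer> samples = Arrays.asList(null, 0, 1, 2);
    for (Integer foo : samples) {
      System.out.println(foo + ": merged=" + mergedKeeps(foo)
          + ", simplified=" + simplifiedKeeps(foo));
    }
  }
}
```

Since `foo = 1` can only be true when `foo` is not null, the second disjunct never admits a row that the first one rejects.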







Issue Time Tracking
---

Worklog Id: (was: 499975)
Time Spent: 2h  (was: 1h 50m)

> Enhance shared work optimizer to merge scans with filters on both sides
> ---
>
> Key: HIVE-24231
> URL: https://issues.apache.org/jira/browse/HIVE-24231
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=499972&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-499972
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 13/Oct/20 10:18
Start Date: 13/Oct/20 10:18
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r503835536



##
File path: ql/src/test/results/clientpositive/llap/join_parse.q.out
##
@@ -499,34 +499,41 @@ STAGE PLANS:
 sort order: +
 Map-reduce partition columns: _col0 (type: string)
 Statistics: Num rows: 500 Data size: 43500 Basic stats: COMPLETE Column stats: COMPLETE
+  Filter Operator
+predicate: (value is not null and key is not null) (type: boolean)
+Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
+Select Operator
+  expressions: key (type: string), value (type: string)
+  outputColumnNames: _col0, _col1
+  Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
   Reduce Output Operator
 key expressions: _col0 (type: string)
 null sort order: z
 sort order: +
 Map-reduce partition columns: _col0 (type: string)
-Statistics: Num rows: 500 Data size: 43500 Basic stats: COMPLETE Column stats: COMPLETE
+Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
+value expressions: _col1 (type: string)
 Execution mode: vectorized, llap
 LLAP IO: all inputs
 Map 6 
 Map Operator Tree:
 TableScan
-  alias: src1
-  filterExpr: (value is not null and key is not null) (type: boolean)
-  Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
+  alias: src2

Review comment:
   difference is gone - no more ts merging for now; will get back to it later







Issue Time Tracking
---

Worklog Id: (was: 499972)
Time Spent: 1h 50m  (was: 1h 40m)

> Enhance shared work optimizer to merge scans with filters on both sides
> ---
>
> Key: HIVE-24231
> URL: https://issues.apache.org/jira/browse/HIVE-24231
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=499969&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-499969
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 13/Oct/20 10:17
Start Date: 13/Oct/20 10:17
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r503834803



##
File path: ql/src/test/results/clientpositive/llap/dynamic_partition_pruning.q.out
##
@@ -4277,21 +4277,37 @@ STAGE PLANS:
   alias: srcpart
   filterExpr: ds is not null (type: boolean)
   Statistics: Num rows: 2000 Data size: 389248 Basic stats: COMPLETE Column stats: COMPLETE
-  Group By Operator
-keys: ds (type: string)
-minReductionHashAggr: 0.99
-mode: hash
-outputColumnNames: _col0
-Statistics: Num rows: 2 Data size: 368 Basic stats: COMPLETE Column stats: COMPLETE
-Reduce Output Operator
-  key expressions: _col0 (type: string)
-  null sort order: z
-  sort order: +
-  Map-reduce partition columns: _col0 (type: string)
+  Filter Operator
+predicate: ds is not null (type: boolean)

Review comment:
   I see 2 filter operators doing the same thing in this plan - they will be merged by the "downstream merge" patch.
   
   However, to the best of my knowledge the TS filterExpr should only be considered "best-effort", because the reader may decide not to filter by some parts of the expression.
   
   In this plan we should probably have removed 2 extra srcpart scans by the `SubTree` logic - that should be investigated - I've opened HIVE-24268
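
The "best-effort" point can be sketched as follows. The reader, its `honourPushdown` flag, and the sample `ds` values are all hypothetical and only model a reader that may or may not apply the pushed-down `filterExpr`; this is not Hive's reader API. Under that assumption, the Filter Operator after the scan is what makes the result independent of the reader's choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Illustration only: a scan filter the reader may ignore, followed by a Filter Operator. */
final class BestEffortFilterSketch {

  /** Hypothetical reader: applies the pushed-down predicate only if honourPushdown is true. */
  static List<String> read(List<String> rows, Predicate<String> pushed, boolean honourPushdown) {
    List<String> out = new ArrayList<>();
    for (String row : rows) {
      if (!honourPushdown || pushed.test(row)) {
        out.add(row);
      }
    }
    return out;
  }

  /** Stand-in for the Filter Operator that follows the scan in the plan. */
  static List<String> filterOperator(List<String> rows, Predicate<String> predicate) {
    List<String> out = new ArrayList<>();
    for (String row : rows) {
      if (predicate.test(row)) {
        out.add(row);
      }
    }
    return out;
  }

  /** Scan with filterExpr "ds is not null", then a Filter Operator with the same predicate. */
  static List<String> plan(List<String> rows, boolean honourPushdown) {
    Predicate<String> dsIsNotNull = ds -> ds != null;
    return filterOperator(read(rows, dsIsNotNull, honourPushdown), dsIsNotNull);
  }

  public static void main(String[] args) {
    // Made-up partition values, including a NULL.
    List<String> rows = Arrays.asList("2008-04-08", null, "2008-04-09");
    System.out.println("reader filters: " + plan(rows, true));
    System.out.println("reader ignores: " + plan(rows, false));
  }
}
```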
   
   







Issue Time Tracking
---

Worklog Id: (was: 499969)
Time Spent: 1h 40m  (was: 1.5h)

> Enhance shared work optimizer to merge scans with filters on both sides
> ---
>
> Key: HIVE-24231
> URL: https://issues.apache.org/jira/browse/HIVE-24231
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=499962&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-499962
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 13/Oct/20 10:03
Start Date: 13/Oct/20 10:03
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r503826556



##
File path: ql/src/test/results/clientpositive/llap/cbo_SortUnionTransposeRule.q.out
##
@@ -1006,6 +1027,22 @@ STAGE PLANS:
   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
 Reducer 5 
+Execution mode: vectorized, llap

Review comment:
   This is caused by enabling the schema merge for all the optimizations. Apparently the greedy operator chain matching logic worked slightly better when it was first run w/o ts-merge and only then executed again to also consider merging the schema.
   
   In this patch I will only introduce the new optimization, and do the generalization either in the "downstream merge" patch or completely separately; it will be worth it because it will enable the `RemoveSemijoin` mode to consider merging the ts schema







Issue Time Tracking
---

Worklog Id: (was: 499962)
Time Spent: 1.5h  (was: 1h 20m)



[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=499959&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-499959
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 13/Oct/20 09:58
Start Date: 13/Oct/20 09:58
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r503823443



##
File path: ql/src/test/queries/clientpositive/explainuser_1.q
##
@@ -9,6 +9,7 @@
 --! qt:dataset:cbo_t1
 set hive.vectorized.execution.enabled=false;
 set hive.strict.checks.bucketing=false;
+set hive.optimize.shared.work.dppunion=false;

Review comment:
   I didn't want the new feature to further change the q.out of existing 
"directed" tests.
   
   I've removed these set calls from the q files.







Issue Time Tracking
---

Worklog Id: (was: 499959)
Time Spent: 1h 20m  (was: 1h 10m)



[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=499957&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-499957
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 13/Oct/20 09:54
Start Date: 13/Oct/20 09:54
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r503821169



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java
##
@@ -386,125 +456,81 @@ public boolean sharedWorkOptimization(ParseContext pctx, SharedWorkOptimizerCach
   LOG.debug("Merging subtree starting at {} into subtree starting at {}",
   discardableTsOp, retainableTsOp);
 } else {
-  ExprNodeDesc newRetainableTsFilterExpr = null;
-  List semijoinExprNodes = new ArrayList<>();
-  if (retainableTsOp.getConf().getFilterExpr() != null) {
-// Gather SJ expressions and normal expressions
-List allExprNodesExceptSemijoin = new ArrayList<>();
-splitExpressions(retainableTsOp.getConf().getFilterExpr(),
-allExprNodesExceptSemijoin, semijoinExprNodes);
-// Create new expressions
-if (allExprNodesExceptSemijoin.size() > 1) {
-  newRetainableTsFilterExpr = ExprNodeGenericFuncDesc.newInstance(
-  new GenericUDFOPAnd(), allExprNodesExceptSemijoin);
-} else if (allExprNodesExceptSemijoin.size() > 0 &&
-allExprNodesExceptSemijoin.get(0) instanceof ExprNodeGenericFuncDesc) {
-  newRetainableTsFilterExpr = allExprNodesExceptSemijoin.get(0);
-}
-// Push filter on top of children for retainable
-pushFilterToTopOfTableScan(optimizerCache, retainableTsOp);
+
+  if (sr.discardableOps.size() > 1) {
+throw new RuntimeException("we can't discard more in this path");
   }
-  ExprNodeDesc newDiscardableTsFilterExpr = null;
-  if (discardableTsOp.getConf().getFilterExpr() != null) {
-// If there is a single discardable operator, it is a TableScanOperator
-// and it means that we will merge filter expressions for it. Thus, we
-// might need to remove DPP predicates before doing that
-List allExprNodesExceptSemijoin = new ArrayList<>();
-splitExpressions(discardableTsOp.getConf().getFilterExpr(),
-allExprNodesExceptSemijoin, new ArrayList<>());
-// Create new expressions
-if (allExprNodesExceptSemijoin.size() > 1) {
-  newDiscardableTsFilterExpr = ExprNodeGenericFuncDesc.newInstance(
-  new GenericUDFOPAnd(), allExprNodesExceptSemijoin);
-} else if (allExprNodesExceptSemijoin.size() > 0 &&
-allExprNodesExceptSemijoin.get(0) instanceof ExprNodeGenericFuncDesc) {
-  newDiscardableTsFilterExpr = allExprNodesExceptSemijoin.get(0);
-}
-// Remove and add semijoin filter from expressions
-replaceSemijoinExpressions(discardableTsOp, semijoinExprNodes);
-// Push filter on top of children for discardable
-pushFilterToTopOfTableScan(optimizerCache, discardableTsOp);
+
+  SharedWorkModel modelR = new SharedWorkModel(retainableTsOp);
+  SharedWorkModel modelD = new SharedWorkModel(discardableTsOp);
+
+  // Push filter on top of children for retainable
+  pushFilterToTopOfTableScan(optimizerCache, retainableTsOp);
+
+  if (mode == Mode.RemoveSemijoin || mode == Mode.SubtreeMerge) {
+// FIXME: I think idea here is to clear the discardable's semijoin filter
+// - by using the retainable's (which should be empty in case of this mode)
+replaceSemijoinExpressions(discardableTsOp, modelR.getSemiJoinFilter());
   }
+  // Push filter on top of children for discardable
+  pushFilterToTopOfTableScan(optimizerCache, discardableTsOp);
+
   // Obtain filter for shared TS operator
-  ExprNodeGenericFuncDesc exprNode = null;
-  if (newRetainableTsFilterExpr != null && newDiscardableTsFilterExpr != null) {
-// Combine
-exprNode = (ExprNodeGenericFuncDesc) newRetainableTsFilterExpr;
-if (!exprNode.isSame(newDiscardableTsFilterExpr)) {
-  // We merge filters from previous scan by ORing with filters from current scan
-  if 

[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=499948&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-499948
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 13/Oct/20 09:40
Start Date: 13/Oct/20 09:40
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r503812132



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java
##
@@ -386,125 +456,81 @@ public boolean sharedWorkOptimization(ParseContext pctx, SharedWorkOptimizerCach
   LOG.debug("Merging subtree starting at {} into subtree starting at {}",
   discardableTsOp, retainableTsOp);
 } else {
-  ExprNodeDesc newRetainableTsFilterExpr = null;
-  List semijoinExprNodes = new ArrayList<>();
-  if (retainableTsOp.getConf().getFilterExpr() != null) {
-// Gather SJ expressions and normal expressions
-List allExprNodesExceptSemijoin = new ArrayList<>();
-splitExpressions(retainableTsOp.getConf().getFilterExpr(),
-allExprNodesExceptSemijoin, semijoinExprNodes);
-// Create new expressions
-if (allExprNodesExceptSemijoin.size() > 1) {
-  newRetainableTsFilterExpr = ExprNodeGenericFuncDesc.newInstance(
-  new GenericUDFOPAnd(), allExprNodesExceptSemijoin);
-} else if (allExprNodesExceptSemijoin.size() > 0 &&
-allExprNodesExceptSemijoin.get(0) instanceof ExprNodeGenericFuncDesc) {
-  newRetainableTsFilterExpr = allExprNodesExceptSemijoin.get(0);
-}
-// Push filter on top of children for retainable
-pushFilterToTopOfTableScan(optimizerCache, retainableTsOp);
+
+  if (sr.discardableOps.size() > 1) {
+throw new RuntimeException("we can't discard more in this path");

Review comment:
   A few things could go wrong here; one is that pushing filters out of the 
discardable ts will most likely not work as desired.
   
   I feel tempted to remove this multi-operator matching in HIVE-24241, 
because that approach is much simpler and better separated from the merging 
of the operators.







Issue Time Tracking
---

Worklog Id: (was: 499948)
Time Spent: 1h  (was: 50m)



[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=499945&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-499945
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 13/Oct/20 09:35
Start Date: 13/Oct/20 09:35
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r503809039



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java
##
@@ -326,6 +372,7 @@ public boolean sharedWorkOptimization(ParseContext pctx, SharedWorkOptimizerCach
 LOG.debug("{} and {} cannot be merged", retainableTsOp, discardableTsOp);
 continue;
   }
+  // FIXME: I think this optimization is assymetric; but the check is symmetric

Review comment:
   This could be done as a cleanup; however, I've already concluded that, 
because of the table ordering, the problematic case will never actually 
happen, so we are safe.
   
   I've removed the FIXME.
   







Issue Time Tracking
---

Worklog Id: (was: 499945)
Time Spent: 50m  (was: 40m)



[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=499942&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-499942
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 13/Oct/20 09:32
Start Date: 13/Oct/20 09:32
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r503806798



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java
##
@@ -386,125 +456,81 @@ public boolean sharedWorkOptimization(ParseContext pctx, SharedWorkOptimizerCach
   LOG.debug("Merging subtree starting at {} into subtree starting at {}",
   discardableTsOp, retainableTsOp);
 } else {
-  ExprNodeDesc newRetainableTsFilterExpr = null;
-  List semijoinExprNodes = new ArrayList<>();
-  if (retainableTsOp.getConf().getFilterExpr() != null) {
-// Gather SJ expressions and normal expressions
-List allExprNodesExceptSemijoin = new ArrayList<>();
-splitExpressions(retainableTsOp.getConf().getFilterExpr(),
-allExprNodesExceptSemijoin, semijoinExprNodes);
-// Create new expressions
-if (allExprNodesExceptSemijoin.size() > 1) {
-  newRetainableTsFilterExpr = ExprNodeGenericFuncDesc.newInstance(
-  new GenericUDFOPAnd(), allExprNodesExceptSemijoin);
-} else if (allExprNodesExceptSemijoin.size() > 0 &&
-allExprNodesExceptSemijoin.get(0) instanceof ExprNodeGenericFuncDesc) {
-  newRetainableTsFilterExpr = allExprNodesExceptSemijoin.get(0);
-}
-// Push filter on top of children for retainable
-pushFilterToTopOfTableScan(optimizerCache, retainableTsOp);
+
+  if (sr.discardableOps.size() > 1) {
+throw new RuntimeException("we can't discard more in this path");
   }
-  ExprNodeDesc newDiscardableTsFilterExpr = null;
-  if (discardableTsOp.getConf().getFilterExpr() != null) {
-// If there is a single discardable operator, it is a TableScanOperator
-// and it means that we will merge filter expressions for it. Thus, we
-// might need to remove DPP predicates before doing that
-List allExprNodesExceptSemijoin = new ArrayList<>();
-splitExpressions(discardableTsOp.getConf().getFilterExpr(),
-allExprNodesExceptSemijoin, new ArrayList<>());
-// Create new expressions
-if (allExprNodesExceptSemijoin.size() > 1) {
-  newDiscardableTsFilterExpr = ExprNodeGenericFuncDesc.newInstance(
-  new GenericUDFOPAnd(), allExprNodesExceptSemijoin);
-} else if (allExprNodesExceptSemijoin.size() > 0 &&
-allExprNodesExceptSemijoin.get(0) instanceof ExprNodeGenericFuncDesc) {
-  newDiscardableTsFilterExpr = allExprNodesExceptSemijoin.get(0);
-}
-// Remove and add semijoin filter from expressions
-replaceSemijoinExpressions(discardableTsOp, semijoinExprNodes);
-// Push filter on top of children for discardable
-pushFilterToTopOfTableScan(optimizerCache, discardableTsOp);
+
+  SharedWorkModel modelR = new SharedWorkModel(retainableTsOp);
+  SharedWorkModel modelD = new SharedWorkModel(discardableTsOp);
+
+  // Push filter on top of children for retainable
+  pushFilterToTopOfTableScan(optimizerCache, retainableTsOp);
+
+  if (mode == Mode.RemoveSemijoin || mode == Mode.SubtreeMerge) {
+// FIXME: I think idea here is to clear the discardable's semijoin filter

Review comment:
   I made this note because it was not obvious to me what is happening 
here. I've rephrased it to be easier to understand.
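   
   For readers following the thread, the behaviour the FIXME describes can be 
sketched in isolation: the discardable scan's semijoin (DPP) predicates are 
dropped and the retainable scan's are put in their place, and since the 
retainable side is expected to have none in these modes, the net effect is to 
clear them. This is only a toy model under that reading; `Conjunct` and 
`replaceSemijoin` are hypothetical stand-ins, not Hive's `ExprNodeDesc` or 
`replaceSemijoinExpressions`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SemijoinFilterSketch {
  /** Toy filter conjunct; the flag marks semijoin (DPP) predicates. Hypothetical type. */
  static final class Conjunct {
    final String text;
    final boolean semijoin;

    Conjunct(String text, boolean semijoin) {
      this.text = text;
      this.semijoin = semijoin;
    }
  }

  /**
   * Keeps the non-semijoin conjuncts of the discardable scan's filter and
   * appends the retainable scan's semijoin conjuncts in place of its own.
   */
  static List<Conjunct> replaceSemijoin(List<Conjunct> discardableFilter,
      List<Conjunct> retainableSemijoin) {
    List<Conjunct> result = new ArrayList<>();
    for (Conjunct c : discardableFilter) {
      if (!c.semijoin) {
        result.add(c);
      }
    }
    result.addAll(retainableSemijoin);
    return result;
  }

  public static void main(String[] args) {
    List<Conjunct> discardable = Arrays.asList(
        new Conjunct("key is not null", false),
        new Conjunct("key in semijoin bloom filter", true));
    // The retainable side carries no semijoin filter, so none survives the replacement.
    List<Conjunct> replaced =
        replaceSemijoin(discardable, Collections.<Conjunct>emptyList());
    for (Conjunct c : replaced) {
      System.out.println(c.text);
    }
  }
}
```

   With an empty retainable semijoin list, only the ordinary predicate of the 
discardable scan remains, which matches the "clear the discardable's semijoin 
filter" wording of the note.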







Issue Time Tracking
---

Worklog Id: (was: 499942)
Time Spent: 40m  (was: 0.5h)


[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=499298&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-499298
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 12/Oct/20 09:40
Start Date: 12/Oct/20 09:40
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r503169537



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java
##
@@ -159,9 +158,15 @@ public ParseContext transform(ParseContext pctx) throws SemanticException {
 // Gather information about the DPP table scans and store it in the cache
 gatherDPPTableScanOps(pctx, optimizerCache);
 
+BaseSharedWorkOptimizer swo;
+if (pctx.getConf().getBoolVar(ConfVars.HIVE_SHARED_WORK_MERGE_TS_SCHEMA)) {
+  swo = new BaseSharedWorkOptimizer();

Review comment:
   `SchemaAwareSharedWorkOptimizer` is the stricter version:
   
   
https://github.com/apache/hive/blob/78d42f0321c846ee74794a58b94a92f65797430d/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java#L591
   
   I just wanted to enable schema merge for all the optimizations, so I've 
moved it here.
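   
   As a rough illustration of the selection shown in the diff, here is a 
minimal, self-contained sketch. The diff only shows the `if` branch; that the 
`else` branch falls back to the stricter class, and that "stricter" means 
requiring both scans to read the same columns, are assumptions made for the 
example. `BaseVariant`, `SchemaAwareVariant`, `extendedCheck` and `select` are 
hypothetical names, not the Hive classes.

```java
import java.util.Arrays;
import java.util.List;

class OptimizerSelectionSketch {
  /** Permissive stand-in: no extra schema check before merging two scans. */
  static class BaseVariant {
    boolean extendedCheck(List<String> neededColsA, List<String> neededColsB) {
      return true;
    }
  }

  /** Stricter stand-in: additionally requires both scans to read the same columns. */
  static class SchemaAwareVariant extends BaseVariant {
    @Override
    boolean extendedCheck(List<String> neededColsA, List<String> neededColsB) {
      return neededColsA.equals(neededColsB);
    }
  }

  /** Mirrors the shape of the diff: the merge-schema flag picks the permissive variant. */
  static BaseVariant select(boolean mergeTsSchema) {
    return mergeTsSchema ? new BaseVariant() : new SchemaAwareVariant();
  }

  public static void main(String[] args) {
    List<String> a = Arrays.asList("key", "value");
    List<String> b = Arrays.asList("key");
    // Flag on: scans with different column sets still pass the extended check.
    System.out.println(select(true).extendedCheck(a, b));
    // Flag off: the stricter variant rejects them.
    System.out.println(select(false).extendedCheck(a, b));
  }
}
```

   Choosing the variant once, at the top of `transform`, is what lets every 
optimization mode share the same schema-merge behaviour.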







Issue Time Tracking
---

Worklog Id: (was: 499298)
Time Spent: 0.5h  (was: 20m)



[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with filters on both sides

2020-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=499095&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-499095
 ]

ASF GitHub Bot logged work on HIVE-24231:
-

Author: ASF GitHub Bot
Created on: 11/Oct/20 19:01
Start Date: 11/Oct/20 19:01
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1553:
URL: https://github.com/apache/hive/pull/1553#discussion_r502816547



##
File path: Jenkinsfile
##
@@ -174,6 +174,17 @@ def loadWS() {
 tar -xf archive.tar'''
 }
 
+def saveFile(name) {

Review comment:
   It seems this is unrelated to this patch? It may be better to split into 
multiple JIRAs/PRs.

##
File path: pom.xml
##
@@ -104,7 +104,7 @@
 2.17
 1.12
 2.10
-3.0.0-M4
+3.0.0-M5

Review comment:
   Same as above: I'm not sure this belongs in the same PR; it would be 
better to have different JIRAs/PRs for these different issues.

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java
##
@@ -386,125 +456,81 @@ public boolean sharedWorkOptimization(ParseContext pctx, SharedWorkOptimizerCach
   LOG.debug("Merging subtree starting at {} into subtree starting at {}",
   discardableTsOp, retainableTsOp);
 } else {
-  ExprNodeDesc newRetainableTsFilterExpr = null;
-  List semijoinExprNodes = new ArrayList<>();
-  if (retainableTsOp.getConf().getFilterExpr() != null) {
-// Gather SJ expressions and normal expressions
-List allExprNodesExceptSemijoin = new ArrayList<>();
-splitExpressions(retainableTsOp.getConf().getFilterExpr(),
-allExprNodesExceptSemijoin, semijoinExprNodes);
-// Create new expressions
-if (allExprNodesExceptSemijoin.size() > 1) {
-  newRetainableTsFilterExpr = ExprNodeGenericFuncDesc.newInstance(
-  new GenericUDFOPAnd(), allExprNodesExceptSemijoin);
-} else if (allExprNodesExceptSemijoin.size() > 0 &&
-allExprNodesExceptSemijoin.get(0) instanceof ExprNodeGenericFuncDesc) {
-  newRetainableTsFilterExpr = allExprNodesExceptSemijoin.get(0);
-}
-// Push filter on top of children for retainable
-pushFilterToTopOfTableScan(optimizerCache, retainableTsOp);
+
+  if (sr.discardableOps.size() > 1) {
+throw new RuntimeException("we can't discard more in this path");

Review comment:
   Can you leave a comment explaining how we could hit this error? What 
should have happened? Somehow we have something in retainable/discardable that 
should be equivalent but it is not?

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java
##
@@ -338,8 +385,29 @@ public boolean sharedWorkOptimization(ParseContext pctx, SharedWorkOptimizerCach
   // about the part of the tree that can be merged. We need to regenerate the
   // cache because semijoin operators have been removed
   sr = extractSharedOptimizationInfoForRoot(
-  pctx, optimizerCache, retainableTsOp, discardableTsOp);
-} else {
+  pctx, optimizerCache, retainableTsOp, discardableTsOp, true);
+} else if (mode == Mode.DPPUnion) {
+  boolean mergeable = areMergeable(pctx, retainableTsOp, discardableTsOp);
+  if (!mergeable) {
+LOG.debug("{} and {} cannot be merged", retainableTsOp, discardableTsOp);
+continue;
+  }
+  boolean validMerge =
+  areMergeableDppUninon(pctx, optimizerCache, retainableTsOp, discardableTsOp);

Review comment:
   typo -> areMergeableDppUninon

##
File path: ql/src/test/results/clientpositive/llap/join_parse.q.out
##
@@ -499,34 +499,41 @@ STAGE PLANS:
 sort order: +
 Map-reduce partition columns: _col0 (type: string)
 Statistics: Num rows: 500 Data size: 43500 Basic stats: COMPLETE Column stats: COMPLETE
+  Filter Operator
+predicate: (value is not null and key is not null) (type: boolean)
+Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
+Select Operator
+  expressions: key (type: string), value (type: string)
+  outputColumnNames: _col0, _col1
+  Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
   Reduce Output Operator