[jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-22 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-11297:

Attachment: HIVE-11297.8.patch

Some minor changes to spark_partition_pruning.q.out.

> Combine op trees for partition info generating tasks [Spark branch]
> ---
>
> Key: HIVE-11297
> URL: https://issues.apache.org/jira/browse/HIVE-11297
> Project: Hive
> Issue Type: Bug
> Affects Versions: spark-branch
> Reporter: Chao Sun
> Assignee: liyunzhang_intel
> Attachments: HIVE-11297.1.patch, HIVE-11297.2.patch, 
> HIVE-11297.3.patch, HIVE-11297.4.patch, HIVE-11297.5.patch, 
> HIVE-11297.6.patch, HIVE-11297.7.patch, HIVE-11297.8.patch, hive-site.xml
>
>
> Currently, for dynamic partition pruning in Spark, if a small table generates 
> partition info for more than one partition column, multiple operator trees 
> are created, which all start from the same table scan op but have different 
> Spark partition pruning sinks.
> As an optimization, we can combine these op trees so that the table scan 
> does not have to be performed multiple times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-22 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-11297:

Attachment: hive-site.xml



[jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-20 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-11297:

Attachment: HIVE-11297.7.patch

[~csun]: updated HIVE-11297.7.patch according to the last round of review on 
Review Board.



[jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-18 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-11297:

Attachment: HIVE-11297.6.patch

[~csun]: HIVE-11297.6 addresses all review comments except renaming filterOp. 
I explained my reasoning for that above; if I have misunderstood, please tell me.



[jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-16 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-11297:

Attachment: HIVE-11297.5.patch

[~csun]: please help review; the patch on RB has been updated. Thanks!




[jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-15 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-11297:

Attachment: HIVE-11297.4.patch



[jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-15 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-11297:

Attachment: (was: HIVE-11297.4.patch)



[jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-15 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-11297:

Attachment: HIVE-11297.4.patch

[~csun]: updated HIVE-11297.4.patch according to what you mentioned on RB.
{noformat}
 TS1         TS2
  |           |
 FIL1        FIL2
  |           |
  RS         SEL--------
  |           | \       \
  |           RS SEL    SEL
   \         /    |      |
     JOIN        GBY    GBY
                  |      |
                  |     SPARKPRUNINGSINK
                  |
                 SPARKPRUNINGSINK
{noformat}
Current algorithm:
1. Find the filter FIL2, traverse each branch below FIL2, and collect the 
children whose branches contain a SPARKPRUNINGSINK.
2. Split the tree into two separate trees.
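
The two steps above can be sketched on a toy operator tree. This is only an illustration under stated assumptions, not the code in the patch: the `Op` class and the helpers `has_sink`, `has_other_leaf`, `clone_filtered`, `split_tree`, `leaf_paths` and `build_example` are hypothetical names, not Hive classes, and the JOIN side of the diagram is left out, so RS is modeled as a leaf.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

SINK = "SPARKPRUNINGSINK"


@dataclass
class Op:
    """Toy operator node (hypothetical; not Hive's Operator class)."""
    name: str
    children: List["Op"] = field(default_factory=list)


def has_sink(op: Op) -> bool:
    """True if the subtree rooted at op contains a pruning sink."""
    return op.name == SINK or any(has_sink(c) for c in op.children)


def has_other_leaf(op: Op) -> bool:
    """True if the subtree rooted at op has a leaf that is not a pruning sink."""
    if not op.children:
        return op.name != SINK
    return any(has_other_leaf(c) for c in op.children)


def clone_filtered(op: Op, keep: Callable[[Op], bool]) -> Op:
    """Copy the tree, dropping every child subtree for which keep() is False."""
    return Op(op.name, [clone_filtered(c, keep) for c in op.children if keep(c)])


def split_tree(root: Op) -> Tuple[Op, Op]:
    """Split into (main tree without sinks, pruning tree with all sinks)."""
    return clone_filtered(root, has_other_leaf), clone_filtered(root, has_sink)


def leaf_paths(op: Op, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """List every root-to-leaf path as a tuple of operator names."""
    path = prefix + (op.name,)
    if not op.children:
        return [path]
    return [p for c in op.children for p in leaf_paths(c, path)]


def build_example() -> Op:
    """Right-hand side of the diagram: TS2 -> FIL2 -> SEL -> {RS, SEL, SEL}."""
    def sink_branch() -> Op:
        return Op("SEL", [Op("GBY", [Op(SINK)])])
    sel = Op("SEL", [Op("RS"), sink_branch(), sink_branch()])
    return Op("TS2", [Op("FIL2", [sel])])


if __name__ == "__main__":
    main_tree, pruning_tree = split_tree(build_example())
    print(leaf_paths(main_tree))
    print(leaf_paths(pruning_tree))
```

In this sketch both pruning sinks end up in a single tree that starts from one TS2, which is the point of combining the op trees: the small table is scanned once for all partition columns.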



[jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-13 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-11297:

Attachment: HIVE-11297.3.patch

[~csun]: updated SplitOpTreeForDPP to split the trees the way you suggested 
last time. The explain plan changes after this JIRA; for example, for this query:
{code}
set hive.execution.engine=spark; 
set hive.auto.convert.join.noconditionaltask.size=20; 
set hive.spark.dynamic.partition.pruning=true;
select count(*) from srcpart join srcpart_date_hour on (srcpart.ds = 
srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr) where 
srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
{code}

before
{code}
STAGE PLANS:
  Stage: Stage-2
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 5 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: ds
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            target column name: ds
                            target work: Map 1
        Map 6 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col2 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: hr
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            target column name: hr
                            target work: Map 1

  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 4 (PARTITION-LEVEL SORT, 2)
        Reducer 3 <- Reducer 2 (GROUP, 1)
{code}

now
{code}
  Stage: Stage-2
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 5 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
 

[jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-05 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-11297:

Attachment: HIVE-11297.2.patch



[jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-05-31 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-11297:

Status: Patch Available  (was: Open)



[jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-05-30 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-11297:

Attachment: HIVE-11297.1.patch

[~csun]: updated the patch. In my environment, the [case "multiple sources, single 
key"|https://issues.apache.org/jira/browse/HIVE-16780] in 
spark_dynamic_pruning.q fails, so I could not generate a new 
spark_dynamic_partition_pruning.q.out. I extracted the "multiple columns, 
single source" test case into a new qfile, 
"spark_dynamic_partition_pruning_combine.q". I also added a configuration item, 
"hive.spark.dynamic.partition.pruning.combine"; if this item is not enabled, 
the op trees for partition info are not combined.
{code}
set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=true;
set hive.spark.dynamic.partition.pruning=true;
set hive.optimize.metadataonly=false;
set hive.optimize.index.filter=true;
set hive.strict.checks.cartesian.product=false;
set hive.spark.dynamic.partition.pruning=true;
set hive.spark.dynamic.partition.pruning.combine=true;


-- SORT_QUERY_RESULTS
create table srcpart_date_hour as select ds as ds, ds as `date`, hr as hr, hr 
as hour from srcpart group by ds, hr;
-- multiple columns single source
EXPLAIN select count(*) from srcpart join srcpart_date_hour on (srcpart.ds = 
srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr) where 
srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
select count(*) from srcpart join srcpart_date_hour on (srcpart.ds = 
srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr) where 
srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
set hive.spark.dynamic.partition.pruning.combine=false;
EXPLAIN select count(*) from srcpart join srcpart_date_hour on (srcpart.ds = 
srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr) where 
srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
select count(*) from srcpart join srcpart_date_hour on (srcpart.ds = 
srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr) where 
srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
{code}

I think we can work in parallel: you can review while I continue to fix 
HIVE-16780. After fixing HIVE-16780 in my environment, I can update 
spark_dynamic_partition_pruning.q.out with the changes from HIVE-11297.



[jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-05-25 Thread Jianguo Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianguo Tian updated HIVE-11297:

Attachment: (was: HIVE-11297.1.patch)

> Combine op trees for partition info generating tasks [Spark branch]
> ---
>
> Key: HIVE-11297
> URL: https://issues.apache.org/jira/browse/HIVE-11297
> Project: Hive
>  Issue Type: Bug
>Affects Versions: spark-branch
>Reporter: Chao Sun
>Assignee: Jianguo Tian


[jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-05-25 Thread Jianguo Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianguo Tian updated HIVE-11297:

Attachment: HIVE-11297.1.patch

> Combine op trees for partition info generating tasks [Spark branch]
> ---
>
> Key: HIVE-11297
> URL: https://issues.apache.org/jira/browse/HIVE-11297
> Project: Hive
>  Issue Type: Bug
>Affects Versions: spark-branch
>Reporter: Chao Sun
>Assignee: Jianguo Tian
> Attachments: HIVE-11297.1.patch