[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-25 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062470#comment-16062470
 ] 

liyunzhang_intel commented on HIVE-11297:
-

[~ferd]: since [~csun] has finished the review, let's commit HIVE-11297.8.patch. [~csun]: 
thanks for helping with the review!

> Combine op trees for partition info generating tasks [Spark branch]
> ---
>
> Key: HIVE-11297
> URL: https://issues.apache.org/jira/browse/HIVE-11297
> Project: Hive
>  Issue Type: Bug
>Affects Versions: spark-branch
>Reporter: Chao Sun
>Assignee: liyunzhang_intel
> Attachments: HIVE-11297.1.patch, HIVE-11297.2.patch, 
> HIVE-11297.3.patch, HIVE-11297.4.patch, HIVE-11297.5.patch, 
> HIVE-11297.6.patch, HIVE-11297.7.patch, HIVE-11297.8.patch, hive-site.xml
>
>
> Currently, for dynamic partition pruning in Spark, if a small table generates 
> partition info for more than one partition columns, multiple operator trees 
> are created, which all start from the same table scan op, but have different 
> spark partition pruning sinks.
> As an optimization, we can combine these op trees and so don't have to do 
> table scan multiple times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-23 Thread Chao Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16061436#comment-16061436
 ] 

Chao Sun commented on HIVE-11297:
-

Thanks [~kellyzly]! +1 on the latest patch.



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060632#comment-16060632
 ] 

Hive QA commented on HIVE-11297:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12874190/HIVE-11297.8.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 12 failed/errored test(s), 10846 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[insert_overwrite_local_directory_1]
 (batchId=238)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[tez_smb_main]
 (batchId=150)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query16] 
(batchId=233)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=233)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query94] 
(batchId=233)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[union24] 
(batchId=125)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testBootstrapFunctionReplication
 (batchId=217)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testCreateFunctionIncrementalReplication
 (batchId=217)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testCreateFunctionWithFunctionBinaryJarsOnHDFS
 (batchId=217)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionRegistrationWithCustomSchema
 (batchId=178)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema
 (batchId=178)
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation 
(batchId=178)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5743/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5743/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5743/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 12 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12874190 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-22 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060440#comment-16060440
 ] 

Hive QA commented on HIVE-11297:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12874190/HIVE-11297.8.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 15 failed/errored test(s), 10846 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[create_merge_compressed]
 (batchId=238)
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[insert_overwrite_local_directory_1]
 (batchId=238)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_create] 
(batchId=83)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[tez_smb_main]
 (batchId=150)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] 
(batchId=99)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=233)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query16] 
(batchId=233)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query94] 
(batchId=233)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[union24] 
(batchId=125)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testBootstrapFunctionReplication
 (batchId=217)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testCreateFunctionIncrementalReplication
 (batchId=217)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testCreateFunctionWithFunctionBinaryJarsOnHDFS
 (batchId=217)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionRegistrationWithCustomSchema
 (batchId=178)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema
 (batchId=178)
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation 
(batchId=178)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5739/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5739/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5739/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 15 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12874190 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-22 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060369#comment-16060369
 ] 

liyunzhang_intel commented on HIVE-11297:
-

[~csun]: for the second query you mentioned in RB, I filed HIVE-16948 to track it.



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-22 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060336#comment-16060336
 ] 

liyunzhang_intel commented on HIVE-11297:
-

[~csun]: about the questions you mentioned in RB: the two queries are 
different.  
Explain for query1 (please use the attached hive-site.xml to verify; without the 
configuration in hive-site.xml, I cannot reproduce the following explain output):
{code}
set hive.execution.engine=spark; 
set hive.spark.dynamic.partition.pruning=true;
set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=true;
set hive.optimize.metadataonly=false;
set hive.optimize.index.filter=true;
set hive.strict.checks.cartesian.product=false;
explain select count(*) from srcpart join srcpart_date on (srcpart.ds = 
srcpart_date.ds) join srcpart_hour on (srcpart.hr = srcpart_hour.hr) 
where srcpart_date.`date` = '2008-04-08' and srcpart_hour.hour = 11 and 
srcpart.hr = 11
{code}
Previous explain output:
{code}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      DagName: root_20170622213734_eb4c35e8-952a-4c4d-8972-ba5381bf51a3:2
      Vertices:
        Map 7 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date
                  filterExpr: ((date = '2008-04-08') and ds is not null) (type: boolean)
                  Statistics: Num rows: 2 Data size: 42 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and ds is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: ds
                            Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column stats: NONE
                            target column name: ds
                            target work: Map 1

  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 5 (PARTITION-LEVEL SORT, 2)
        Reducer 3 <- Map 6 (PARTITION-LEVEL SORT, 2), Reducer 2 (PARTITION-LEVEL SORT, 2)
        Reducer 4 <- Reducer 3 (GROUP, 1)
      DagName: root_20170622213734_eb4c35e8-952a-4c4d-8972-ba5381bf51a3:1
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: srcpart
                  Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats: NONE
                  Select Operator
                    expressions: ds (type: string), hr (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats: NONE
                    Reduce Output Operator
                      key expressions: _col0 (type: string)
                      sort order: +
                      Map-reduce partition columns: _col0 (type: string)
                      Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats: NONE
                      value expressions: _col1 (type: string)
        Map 5 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date
                  filterExpr: ((date = '2008-04-08') and ds is not null) (type: boolean)
                  Statistics: Num rows: 2 Data size: 42 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and ds is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string)
                        sort order: +
 

[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-21 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16057086#comment-16057086
 ] 

Hive QA commented on HIVE-11297:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12873780/HIVE-11297.7.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 12 failed/errored test(s), 10841 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[columnstats_part_coltype]
 (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[tez_smb_main]
 (batchId=149)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr]
 (batchId=145)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query16] 
(batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query94] 
(batchId=232)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testBootstrapFunctionReplication
 (batchId=216)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testCreateFunctionIncrementalReplication
 (batchId=216)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testCreateFunctionWithFunctionBinaryJarsOnHDFS
 (batchId=216)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionRegistrationWithCustomSchema
 (batchId=177)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema
 (batchId=177)
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation 
(batchId=177)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5706/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5706/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5706/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 12 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12873780 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-20 Thread Chao Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056908#comment-16056908
 ] 

Chao Sun commented on HIVE-11297:
-

{quote}
So can you retest it in your env? if the operator tree is like what you 
mentioned, i think all the operator tree in 
spark_dynamic_partition_pruning.q.out will be different as i generated in my 
env.
{quote}

Interesting. I'm not sure what caused the difference; maybe some 
configuration? I've tried several times in my env and the FIL is always 
followed by a SEL operator. Nevertheless, this is not an important issue. Will 
take a look at the RB.



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-20 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056837#comment-16056837
 ] 

liyunzhang_intel commented on HIVE-11297:
-

[~csun]: I applied HIVE-11297.6.patch on the latest master branch (8c5f55e) and ran the 
query I posted above. I also print the operator tree for filterOp in 
SplitOpTreeForDPP#process:
{code}
.
/** print the operator tree **/
ArrayList tableScanList = new ArrayList();
tableScanList.add((TableScanOperator) stack.get(0));
LOG.debug("operator tree:" + Operator.toString(tableScanList));
/** print the operator tree **/
Operator filterOp = pruningSinkOp;
while (filterOp != null) {
  if (filterOp.getNumChild() > 1) {
    break;
  } else {
    filterOp = filterOp.getParentOperators().get(0);
  }
}
{code}

the operator tree is:
{code}
TS[1]-FIL[17]-RS[4]-JOIN[5]-GBY[8]-RS[9]-GBY[10]-FS[12]
TS[1]-FIL[17]-SEL[18]-GBY[19]-SPARKPRUNINGSINK[20]
TS[1]-FIL[17]-SEL[21]-GBY[22]-SPARKPRUNINGSINK[23]
{code}




[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-20 Thread Chao Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056577#comment-16056577
 ] 

Chao Sun commented on HIVE-11297:
-

Sorry for the late response. Will put comments in the RB.
Regarding the filterOp issue, it's a little strange since I'm seeing something 
different on my side (with the latest master branch). 
For the query you posted above, I saw:
{code}
TS[3] -> FIL[18] -> SEL[5] -> SEL[19] -> GBY[20] -> SPARKPRUNINGSINK[21]
TS[3] -> FIL[18] -> SEL[5] -> SEL[22] -> GBY[23] -> SPARKPRUNINGSINK[24]
TS[3] -> FIL[18] -> SEL[5] -> RS[7] -> JOIN[8] -> ...
{code}
inside {{SplitOpTreeForDPP}}.
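
For anyone comparing the two environments, a listing like the one above can be reproduced in miniature. This is only a sketch, not Hive code: `Node` is a hypothetical stand-in for Hive's `Operator`, and the `JOIN[8]` leaf simply marks where the quoted listing is elided. It prints every root-to-leaf path in the same `A -> B` form, which makes it visible that in this plan the branches fan out under SEL[5] rather than directly under FIL[18].

```java
import java.util.ArrayList;
import java.util.List;

final class OpTreePaths {
  /** Hypothetical stand-in for a Hive operator: a label plus child links. */
  static final class Node {
    final String name;
    final List<Node> children = new ArrayList<>();

    Node(String name) {
      this.name = name;
    }

    /** Attaches child and returns it, so chains can be built left to right. */
    Node add(Node child) {
      children.add(child);
      return child;
    }
  }

  /** Returns one "A -> B -> C" line per leaf reachable from root. */
  static List<String> paths(Node root) {
    List<String> out = new ArrayList<>();
    collect(root, "", out);
    return out;
  }

  private static void collect(Node node, String prefix, List<String> out) {
    String path = prefix.isEmpty() ? node.name : prefix + " -> " + node.name;
    if (node.children.isEmpty()) {
      out.add(path);
      return;
    }
    for (Node child : node.children) {
      collect(child, path, out);
    }
  }

  public static void main(String[] args) {
    Node ts = new Node("TS[3]");
    Node sel5 = ts.add(new Node("FIL[18]")).add(new Node("SEL[5]"));
    sel5.add(new Node("SEL[19]")).add(new Node("GBY[20]")).add(new Node("SPARKPRUNINGSINK[21]"));
    sel5.add(new Node("SEL[22]")).add(new Node("GBY[23]")).add(new Node("SPARKPRUNINGSINK[24]"));
    // The quoted listing is elided after JOIN[8]; the sketch stops there too.
    sel5.add(new Node("RS[7]")).add(new Node("JOIN[8]"));
    paths(ts).forEach(System.out::println);
  }
}
```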





[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-20 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056531#comment-16056531
 ] 

liyunzhang_intel commented on HIVE-11297:
-

[~csun]:  can you spend some time to review HIVE-11297.6.patch? thanks!



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-19 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053510#comment-16053510
 ] 

Hive QA commented on HIVE-11297:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12873432/HIVE-11297.6.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 12 failed/errored test(s), 10831 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] 
(batchId=140)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[columnstats_part_coltype]
 (batchId=157)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query16] 
(batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query94] 
(batchId=232)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testBootstrapFunctionReplication
 (batchId=216)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testCreateFunctionIncrementalReplication
 (batchId=216)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testCreateFunctionWithFunctionBinaryJarsOnHDFS
 (batchId=216)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionRegistrationWithCustomSchema
 (batchId=177)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema
 (batchId=177)
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation 
(batchId=177)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5673/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5673/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5673/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 12 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12873432 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-18 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053446#comment-16053446
 ] 

liyunzhang_intel commented on HIVE-11297:
-

[~csun]: I printed the operator tree of multi_column_single_source.q while 
debugging in 
[SplitOpTreeForDPP|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SplitOpTreeForDPP.java#L75]. 
The query is:
{code}
set hive.execution.engine=spark; 
set hive.auto.convert.join.noconditionaltask.size=20; 
set hive.spark.dynamic.partition.pruning=true;
select count(*) from srcpart join srcpart_date_hour on (srcpart.ds = 
srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr) where 
srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
{code}

physical plan 
{code}
TS[1]-FIL[17]-RS[4]-JOIN[5]-GBY[8]-RS[9]-GBY[10]-FS[12]
 -SEL[18]-GBY[19]-SPARKPRUNINGSINK[20]
 -SEL[21]-GBY[22]-SPARKPRUNINGSINK[23]
{code}
{noformat}RS[4], SEL[18] and SEL[21] are children of FIL[17]{noformat}
bq. I think in the original code the parent node of all branches is a filter op, but now it is changed
I don't think so; I think the filter op is still {noformat}FIL[17]{noformat}.  
The difference from before is in how the tree is split. Previously we split the above tree into three 
trees:
{noformat}
tree1: TS[1]-FIL[17]-RS[4]-JOIN[5]-GBY[8]-RS[9]-GBY[10]-FS[12]
tree2: TS[1]-FIL[17]-SEL[18]-GBY[19]-SPARKPRUNINGSINK[20]
tree3: TS[1]-FIL[17]-SEL[21]-GBY[22]-SPARKPRUNINGSINK[23]
{noformat}

Now we split the above tree into two trees:
{noformat}
tree1: TS[1]-FIL[17]-RS[4]-JOIN[5]-GBY[8]-RS[9]-GBY[10]-FS[12]
tree2: TS[1]-FIL[17]-SEL[18]-GBY[19]-SPARKPRUNINGSINK[20]
   -SEL[21]-GBY[22]-SPARKPRUNINGSINK[23]
{noformat}
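
The two-tree split can be illustrated with a minimal sketch. It is not the code in the patch: `Node` is a hypothetical stand-in for Hive's `Operator`, and recognising a pruning sink by its name prefix is an assumption made only to keep the example small. Given the branching operator (FIL[17] here), it groups the children into branches that stay in the main tree and branches that end in a pruning sink, so that all pruning branches can share one tree and one table scan.

```java
import java.util.ArrayList;
import java.util.List;

final class DppSplitSketch {
  /** Hypothetical stand-in for a Hive operator: a label plus child links. */
  static final class Node {
    final String name;
    final List<Node> children = new ArrayList<>();

    Node(String name) {
      this.name = name;
    }

    /** Attaches child and returns it, so chains can be built left to right. */
    Node add(Node child) {
      children.add(child);
      return child;
    }
  }

  /** True when some leaf below node is a pruning sink (matched by name in this sketch). */
  static boolean reachesPruningSink(Node node) {
    if (node.children.isEmpty()) {
      return node.name.startsWith("SPARKPRUNINGSINK");
    }
    for (Node child : node.children) {
      if (reachesPruningSink(child)) {
        return true;
      }
    }
    return false;
  }

  /**
   * Splits the children of the branching operator into two groups:
   * index 0 holds branches that stay in the main tree, index 1 holds all
   * pruning-sink branches, which together form a single second tree.
   */
  static List<List<Node>> partitionBranches(Node branchingOp) {
    List<Node> main = new ArrayList<>();
    List<Node> pruning = new ArrayList<>();
    for (Node child : branchingOp.children) {
      (reachesPruningSink(child) ? pruning : main).add(child);
    }
    List<List<Node>> groups = new ArrayList<>();
    groups.add(main);
    groups.add(pruning);
    return groups;
  }

  public static void main(String[] args) {
    Node fil = new Node("FIL[17]");
    fil.add(new Node("RS[4]")).add(new Node("JOIN[5]"));
    fil.add(new Node("SEL[18]")).add(new Node("GBY[19]")).add(new Node("SPARKPRUNINGSINK[20]"));
    fil.add(new Node("SEL[21]")).add(new Node("GBY[22]")).add(new Node("SPARKPRUNINGSINK[23]"));

    List<List<Node>> groups = partitionBranches(fil);
    for (Node n : groups.get(0)) {
      System.out.println("main tree branch: " + n.name);
    }
    for (Node n : groups.get(1)) {
      System.out.println("pruning tree branch: " + n.name);
    }
  }
}
```

With the tree from this comment, RS[4] lands in the first group and SEL[18] and SEL[21] land together in the second, matching tree1 and tree2 above.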



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-16 Thread Chao Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052701#comment-16052701
 ] 

Chao Sun commented on HIVE-11297:
-

[~kellyzly] I think in the original code the parent node of all branches is a 
filter op, but now it is changed. That's why I think it's better to rename it 
to something else to avoid confusion. And yes, the selOp is no longer needed.



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-16 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052699#comment-16052699
 ] 

liyunzhang_intel commented on HIVE-11297:
-

[~csun]: just one thing needs to be confirmed:
{code}
// Walk up from the pruning sink until an operator with more than one child is found.
Operator filterOp = pruningSinkOp;
Operator selOp = null;
while (filterOp != null) {
  if (filterOp.getNumChild() > 1) {
    break;
  } else {
    selOp = filterOp;
    filterOp = filterOp.getParentOperators().get(0);
  }
}
{code}
Here the original code finds the filterOp by traversing backward from 
pruningSinkOp. Why do we need to rename filterOp to something else? I think we 
can remove selOp here because it will not be used anymore. If my understanding 
is wrong, please tell me.
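
For illustration only, the backward traversal can be sketched outside Hive as 
follows. OpNode, addChild and findBranchingOp are hypothetical names for this 
sketch and are not part of Hive's Operator API; the neutral name "branching op" 
reflects that the node found is simply the first ancestor with more than one 
child, which need not be a filter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a Hive operator; not the real Operator class. */
class OpNode {
  final String name;
  final List<OpNode> parents = new ArrayList<>();
  final List<OpNode> children = new ArrayList<>();

  OpNode(String name) {
    this.name = name;
  }

  /** Links child under this node and returns the child so calls can be chained. */
  OpNode addChild(OpNode child) {
    children.add(child);
    child.parents.add(this);
    return child;
  }
}

class BranchingOpFinder {
  /**
   * Walks from start toward the root (following the first parent) and returns
   * the first operator with more than one child, or null if none is found.
   */
  static OpNode findBranchingOp(OpNode start) {
    OpNode current = start;
    while (current != null) {
      if (current.children.size() > 1) {
        return current;
      }
      current = current.parents.isEmpty() ? null : current.parents.get(0);
    }
    return null;
  }

  public static void main(String[] args) {
    OpNode ts = new OpNode("TS[1]");
    OpNode fil = ts.addChild(new OpNode("FIL[17]"));
    fil.addChild(new OpNode("RS[4]"));
    OpNode sink = fil.addChild(new OpNode("SEL[18]"))
        .addChild(new OpNode("GBY[19]"))
        .addChild(new OpNode("SPARKPRUNINGSINK[20]"));
    // The first ancestor of the sink with two children is FIL[17].
    System.out.println(findBranchingOp(sink).name);
  }
}
```

Unlike the quoted loop, this sketch returns null when the root is reached 
instead of calling get(0) on an empty parent list; that guard is an assumption 
of the sketch, not a statement about the patch.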




[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-16 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051742#comment-16051742
 ] 

Hive QA commented on HIVE-11297:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12873251/HIVE-11297.5.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 14 failed/errored test(s), 10831 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=140)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[columnstats_part_coltype] (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr] (batchId=145)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=99)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] (batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query16] (batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] (batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query94] (batchId=232)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testBootstrapFunctionReplication (batchId=216)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testCreateFunctionIncrementalReplication (batchId=216)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testCreateFunctionWithFunctionBinaryJarsOnHDFS (batchId=216)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionRegistrationWithCustomSchema (batchId=177)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema (batchId=177)
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation (batchId=177)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5659/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5659/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5659/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 14 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12873251 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-14 Thread Chao Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049923#comment-16049923
 ] 

Chao Sun commented on HIVE-11297:
-

Sure. I added comments in RB. Regarding the output file, you can just use 
{{-Dtest.output.overwrite=true}} to generate a new file.



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-14 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049902#comment-16049902
 ] 

liyunzhang_intel commented on HIVE-11297:
-

[~csun]: can you help review HIVE-11297.3.patch, which changes 
{{SplitOpTreeForDPP.java}} and {{spark_dynamic_partition_pruning.q.out}}? Thanks.



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-13 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047548#comment-16047548
 ] 

Hive QA commented on HIVE-11297:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12872813/HIVE-11297.3.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 8 failed/errored test(s), 10829 tests executed
*Failed tests:*
{noformat}
TestSSLWithMiniKdc - did not produce a TEST-*.xml file (likely timed out) (batchId=238)
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[insert_overwrite_local_directory_1] (batchId=237)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=140)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] (batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query16] (batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] (batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query94] (batchId=232)
org.apache.hive.hcatalog.pig.TestHCatLoaderComplexSchema.testSyntheticComplexSchema[0] (batchId=180)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5633/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5633/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5633/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 8 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12872813 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-12 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046348#comment-16046348
 ] 

liyunzhang_intel commented on HIVE-11297:
-

[~csun]: will try to modify the 
[SplitOpTreeForDPP|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SplitOpTreeForDPP.java#L107].



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-08 Thread Chao Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043316#comment-16043316
 ] 

Chao Sun commented on HIVE-11297:
-

[~kellyzly]: more changes are needed in {{SplitOpTreeForDPP}}: you also need to 
make sure when splitting the OP tree, all pruning sinks will be kept. For 
instance, given:
{code}
      TS
      |
     FIL
      |
     SEL
    / | \
  B1  B2  B3
{code}
suppose {{B2}} and {{B3}} contain pruning sinks. In {{SplitOpTreeForDPP}} we 
should clone the OP tree for both of them. The result should be:
{code}
Original Tree    Generated Tree
     TS               TS
     |                |
    FIL              FIL
     |                |
    SEL              SEL
     |               / \
     B1            B2   B3
{code}
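
As a rough illustration of the split described above (not the actual 
{{SplitOpTreeForDPP}} code), the sketch below copies the chain down to the 
first branching operator twice: once keeping only the branches without pruning 
sinks, and once keeping all branches that contain one. SplitSketch, Node, 
containsSink, copyFiltered and render are hypothetical names invented for this 
example.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
  /** Hypothetical operator node; isPruningSink marks a pruning sink leaf. */
  static final class Node {
    final String name;
    final boolean isPruningSink;
    final List<Node> children = new ArrayList<>();

    Node(String name, boolean isPruningSink, Node... kids) {
      this.name = name;
      this.isPruningSink = isPruningSink;
      for (Node k : kids) {
        children.add(k);
      }
    }
  }

  /** Returns true if the subtree rooted at n contains a pruning sink. */
  static boolean containsSink(Node n) {
    if (n.isPruningSink) {
      return true;
    }
    for (Node c : n.children) {
      if (containsSink(c)) {
        return true;
      }
    }
    return false;
  }

  /** Deep-copies a whole subtree. */
  static Node copy(Node n) {
    Node c = new Node(n.name, n.isPruningSink);
    for (Node child : n.children) {
      c.children.add(copy(child));
    }
    return c;
  }

  /**
   * Copies the single-child chain from n down to the first operator with
   * several children, then keeps only the branches whose containsSink value
   * equals keepSinks.
   */
  static Node copyFiltered(Node n, boolean keepSinks) {
    Node c = new Node(n.name, n.isPruningSink);
    if (n.children.size() == 1) {
      c.children.add(copyFiltered(n.children.get(0), keepSinks));
    } else {
      for (Node child : n.children) {
        if (containsSink(child) == keepSinks) {
          c.children.add(copy(child));
        }
      }
    }
    return c;
  }

  /** Renders a tree as NAME(child,child,...). */
  static String render(Node n) {
    StringBuilder sb = new StringBuilder(n.name);
    if (!n.children.isEmpty()) {
      sb.append('(');
      for (int i = 0; i < n.children.size(); i++) {
        if (i > 0) {
          sb.append(',');
        }
        sb.append(render(n.children.get(i)));
      }
      sb.append(')');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Node tree = new Node("TS", false,
        new Node("FIL", false,
            new Node("SEL", false,
                new Node("B1", false),
                new Node("B2", false, new Node("SINK2", true)),
                new Node("B3", false, new Node("SINK3", true)))));
    System.out.println("Original:  " + render(copyFiltered(tree, false)));
    System.out.println("Generated: " + render(copyFiltered(tree, true)));
  }
}
```

The point of the sketch is that both sink branches end up under a single 
generated tree, so the table scan appears once for all pruning sinks rather 
than once per sink.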



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-06 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038414#comment-16038414
 ] 

liyunzhang_intel commented on HIVE-11297:
-

[~csun]: we cannot do that because 
GenSparkProcContext#clonedPruningTableScanSet is passed as the topNodes of 
GenSparkWorkWalker#startWalking, and GenSparkWorkWalker splits the tree with 
minimum cost. So if there is only one top node, it will split the following tree
{noformat}
TS[1]-FIL[17]-SEL[18]-GBY[19]-SPARKPRUNINGSINK[20]
             -SEL[21]-GBY[22]-SPARKPRUNINGSINK[23]
{noformat}
into only one tree:
{noformat}
TS[1]-FIL[17]-SEL[18]-GBY[19]-SPARKPRUNINGSINK[20]
{noformat}

The GenSparkWork log:
{code}
[root@bdpe41 hive]# grep GenSparkWork logs/hive.log
2017-06-06T16:34:12,527 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: Root operator: TS[0]
2017-06-06T16:34:12,527 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: Leaf operator: RS[2]
2017-06-06T16:34:19,070 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: First pass. Leaf operator: RS[2]
2017-06-06T16:34:19,070 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: Root operator: JOIN[5]
2017-06-06T16:34:19,070 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: Leaf operator: RS[9]
2017-06-06T16:34:22,858 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: Removing RS[2] as parent from JOIN[5]
2017-06-06T16:34:22,859 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: Removing RS[4] as parent from JOIN[5]
2017-06-06T16:34:22,859 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: First pass. Leaf operator: RS[9]
2017-06-06T16:34:22,859 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: Root operator: GBY[10]
2017-06-06T16:34:22,859 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: Leaf operator: FS[12]
2017-06-06T16:34:27,322 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: Removing RS[9] as parent from GBY[10]
2017-06-06T16:34:27,322 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: First pass. Leaf operator: FS[12]
2017-06-06T16:34:27,322 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: Root operator: TS[1]
2017-06-06T16:34:27,322 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: Leaf operator: RS[4]
2017-06-06T16:36:14,669 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: Second pass. Leaf operator: RS[4] has common downstream work:org.apache.hadoop.hive.ql.plan.ReduceWork@7e7f72
2017-06-06T16:36:14,672 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: Root operator: TS[1]
2017-06-06T16:36:14,672 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: Leaf operator: SPARKPRUNINGSINK[20]
2017-06-06T16:38:22,338 DEBUG [7e080689-d76b-498f-9a41-d8843a9b199f main] spark.GenSparkWork: First pass. Leaf operator: SPARKPRUNINGSINK[20]
{code}




[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-05 Thread Chao Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16037305#comment-16037305
 ] 

Chao Sun commented on HIVE-11297:
-

[~kellyzly]: it seems the same TableScan [could be added multiple 
times|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SplitOpTreeForDPP.java#L116]
 in {{SplitOpTreeForDPP}}, and so multiple MapWorks are generated for the same 
TableScan. Can you check if we can avoid doing that? 



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-05 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16036830#comment-16036830
 ] 

Hive QA commented on HIVE-11297:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12871201/HIVE-11297.2.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 10820 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=140)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr] (batchId=145)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=99)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] (batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query78] (batchId=232)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5531/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5531/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5531/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12871201 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-04 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16036554#comment-16036554
 ] 

liyunzhang_intel commented on HIVE-11297:
-

[~csun]: thanks for the review. I replied on the review board.
bq. Seems this removes the extra map work after it was generated. Is there a way 
to avoid generating the map work in the first place?
The physical operator tree is split at each Spark partition pruning sink.
Original tree:
{noformat}
TS[1]-FIL[17]-RS[4]-JOIN[5]
             -SEL[18]-GBY[19]-SPARKPRUNINGSINK[20]
             -SEL[21]-GBY[22]-SPARKPRUNINGSINK[23]
{noformat}
After splitting at the Spark partition pruning sinks:
{noformat}
TS[1]-FIL[17]-RS[4]-JOIN[5]
TS[1]-FIL[17]-SEL[18]-GBY[19]-SPARKPRUNINGSINK[20]
TS[1]-FIL[17]-SEL[21]-GBY[22]-SPARKPRUNINGSINK[23]
{noformat}
If we want to avoid generating multiple map works
{noformat}
TS[1]-FIL[17]-SEL[18]-GBY[19]-SPARKPRUNINGSINK[20]
TS[1]-FIL[17]-SEL[21]-GBY[22]-SPARKPRUNINGSINK[23]
{noformat}
we need to remove the rule for Spark dynamic partition pruning (the opRules 
entry quoted at the end of this comment). If we remove that rule, an exception 
will be thrown because the remaining tree will not be in a MapWork:
{noformat}
 -SEL[18]-GBY[19]-SPARKPRUNINGSINK[20]
 -SEL[21]-GBY[22]-SPARKPRUNINGSINK[23]
{noformat}
{code}
opRules.put(new RuleRegExp("Split Work - SparkPartitionPruningSink",
    SparkPartitionPruningSinkOperator.getOperatorName() + "%"), genSparkWork);
{code}

If you have any ideas about this, please give me your suggestions.
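
To make the "one tree per pruning sink" behaviour concrete, here is a small 
standalone sketch. It only enumerates root-to-sink paths as strings; the real 
code clones operator subtrees, and PerSinkSplitSketch, Op and pathsToSinks are 
hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class PerSinkSplitSketch {
  /** Hypothetical operator node identified only by its display name. */
  static final class Op {
    final String name;
    final List<Op> children = new ArrayList<>();

    Op(String name, Op... kids) {
      this.name = name;
      for (Op k : kids) {
        children.add(k);
      }
    }
  }

  /** Collects every root-to-leaf path whose leaf is a pruning sink. */
  static List<String> pathsToSinks(Op root) {
    List<String> out = new ArrayList<>();
    walk(root, "", out);
    return out;
  }

  private static void walk(Op op, String prefix, List<String> out) {
    String path = prefix.isEmpty() ? op.name : prefix + "-" + op.name;
    if (op.children.isEmpty()) {
      // Only leaves named like pruning sinks start a separate tree.
      if (op.name.startsWith("SPARKPRUNINGSINK")) {
        out.add(path);
      }
      return;
    }
    for (Op child : op.children) {
      walk(child, path, out);
    }
  }

  public static void main(String[] args) {
    Op tree = new Op("TS[1]",
        new Op("FIL[17]",
            new Op("RS[4]", new Op("JOIN[5]")),
            new Op("SEL[18]", new Op("GBY[19]", new Op("SPARKPRUNINGSINK[20]"))),
            new Op("SEL[21]", new Op("GBY[22]", new Op("SPARKPRUNINGSINK[23]")))));
    // Each printed path starts again from TS[1], i.e. one table scan per sink.
    for (String path : pathsToSinks(tree)) {
      System.out.println(path);
    }
  }
}
```

Every path produced this way begins with the same table scan, which is the 
duplication this issue aims to remove.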



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-04 Thread Jianguo Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16036540#comment-16036540
 ] 

Jianguo Tian commented on HIVE-11297:
-

[~csun]: thanks for the review; I replied on the review board.



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-04 Thread Chao Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16036514#comment-16036514
 ] 

Chao Sun commented on HIVE-11297:
-

Thanks for working on this, [~kellyzly]! Sorry for the delay, but I added some 
comments in RB.



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-04 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16036446#comment-16036446
 ] 

liyunzhang_intel commented on HIVE-11297:
-

[~csun], [~Ferd]: can you help review HIVE-11297.1.patch if you have time?



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-06-01 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032558#comment-16032558
 ] 

Hive QA commented on HIVE-11297:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12870495/HIVE-11297.1.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 10813 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_queries] (batchId=228)
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[insert_overwrite_local_directory_1] (batchId=237)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] (batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] (batchId=232)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5495/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5495/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5495/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12870495 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

2017-05-09 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002113#comment-16002113
 ] 

liyunzhang_intel commented on HIVE-11297:
-

The query for the "multiple columns single source" case in 
spark_dynamic_partition_pruning.q is:
{code}
-- multiple columns single source
EXPLAIN select count(*) from srcpart join srcpart_date_hour
  on (srcpart.ds = srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr)
  where srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
{code}

Its explain plan is:
{code}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 5 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: ds
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            target column name: ds
                            target work: Map 1
        Map 6 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col2 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: hr
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            target column name: hr
                            target work: Map 1

  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 4 (PARTITION-LEVEL SORT, 2)
        Reducer 3 <- Reducer 2 (GROUP, 1)
#### A masked pattern was here ####
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: srcpart
                  Statistics: Num rows: 2000 Data size: 21248 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: ds (type: string), hr (type: string)
                    outputColumnNames: _col0, _col1
Statistics: Num rows: 2000 Data