[jira] [Comment Edited] (HIVE-17193) HoS: don't combine map works that are targets of different DPPs

2017-10-22 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214644#comment-16214644
 ] 

liyunzhang_intel edited comment on HIVE-17193 at 10/23/17 5:24 AM:
---

I can reproduce it after disabling CBO:
{code}

set hive.explain.user=false;
set hive.spark.dynamic.partition.pruning=true;
set hive.tez.dynamic.partition.pruning=true;
set hive.auto.convert.join=false;
set hive.cbo.enable=false;
explain
select * from
  (select srcpart.ds,srcpart.key from srcpart join src on srcpart.ds=src.key) a
join
  (select srcpart.ds,srcpart.key from srcpart join src on srcpart.ds=src.value) 
b
on a.key=b.key;
{code}

The explain output:
{code}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
Spark
  DagName: root_20171023004308_4b3c304e-3deb-4193-846d-12cf9e6a50ab:2
  Vertices:
Map 8 
Map Operator Tree:
TableScan
  alias: src
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
Select Operator
  expressions: key (type: string)
  outputColumnNames: _col0
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Group By Operator
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
Spark Partition Pruning Sink Operator
  Target column: ds (string)
  partition key expr: ds
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  target work: Map 1

  Stage: Stage-1
Spark
  Edges:
Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 1), Map 4 (PARTITION-LEVEL 
SORT, 1)
Reducer 3 <- Reducer 2 (PARTITION-LEVEL SORT, 1), Reducer 6 
(PARTITION-LEVEL SORT, 1)
Reducer 6 <- Map 1 (PARTITION-LEVEL SORT, 1), Map 7 (PARTITION-LEVEL 
SORT, 1)
  DagName: root_20171023004308_4b3c304e-3deb-4193-846d-12cf9e6a50ab:1
  Vertices:
Map 1 
Map Operator Tree:
TableScan
  alias: srcpart
  Statistics: Num rows: 232 Data size: 23248 Basic stats: 
COMPLETE Column stats: NONE
  Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 232 Data size: 23248 Basic stats: 
COMPLETE Column stats: NONE
Reduce Output Operator
  key expressions: ds (type: string)
  sort order: +
  Map-reduce partition columns: ds (type: string)
  Statistics: Num rows: 232 Data size: 23248 Basic stats: 
COMPLETE Column stats: NONE
  value expressions: key (type: string)
Map 4 
Map Operator Tree:
TableScan
  alias: src
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
Reduce Output Operator
  key expressions: key (type: string)
  sort order: +
  Map-reduce partition columns: key (type: string)
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
Map 7 
Map Operator Tree:
TableScan
  alias: src
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Filter Operator
predicate: value is not null (type: boolean)
Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
Reduce Output Operator
  key expressions: value (type: string)
  sort order: +
  Map-reduce partition columns: value (type: string)
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
Reducer 2 
Reduce Operator Tree:
  Join Operator
condition map:

[jira] [Updated] (HIVE-16948) Invalid explain when running dynamic partition pruning query in Hive On Spark

2017-10-22 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-16948:

Attachment: 17193_compare_RS_in_Map_5_1.PNG

> Invalid explain when running dynamic partition pruning query in Hive On Spark
> -
>
> Key: HIVE-16948
> URL: https://issues.apache.org/jira/browse/HIVE-16948
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: 3.0.0
>
> Attachments: 17193_compare_RS_in_Map_5_1.PNG, HIVE-16948.2.patch, 
> HIVE-16948.5.patch, HIVE-16948.6.patch, HIVE-16948.7.patch, HIVE-16948.patch, 
> HIVE-16948_1.patch
>
>
> For 
> [union_subquery.q|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/spark_dynamic_partition_pruning.q#L107] 
> in spark_dynamic_partition_pruning.q:
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.strict.checks.cartesian.product=false;
> explain select ds from (select distinct(ds) as ds from srcpart union all 
> select distinct(ds) as ds from srcpart) s where s.ds in (select 
> max(srcpart.ds) from srcpart union all select min(srcpart.ds) from srcpart);
> {code}
> the explain output:
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>   Edges:
> Reducer 11 <- Map 10 (GROUP, 1)
> Reducer 13 <- Map 12 (GROUP, 1)
>   DagName: root_20170622231525_20a777e5-e659-4138-b605-65f8395e18e2:2
>   Vertices:
> Map 10 
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 1 Data size: 23248 Basic stats: 
> PARTIAL Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 1 Data size: 23248 Basic stats: 
> PARTIAL Column stats: NONE
> Group By Operator
>   aggregations: max(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order: 
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Map 12 
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 1 Data size: 23248 Basic stats: 
> PARTIAL Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 1 Data size: 23248 Basic stats: 
> PARTIAL Column stats: NONE
> Group By Operator
>   aggregations: min(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order: 
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Reducer 11 
> Reduce Operator Tree:
>   Group By Operator
> aggregations: max(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE 
> Column stats: NONE
> Filter Operator
>   predicate: _col0 is not null (type: boolean)
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: _col0 (type: string)
>   outputColumnNames: _col0
>   Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
>   

[jira] [Commented] (HIVE-17193) HoS: don't combine map works that are targets of different DPPs

2017-10-22 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214644#comment-16214644
 ] 

liyunzhang_intel commented on HIVE-17193:
-

I can reproduce it after disabling CBO:
{code}

set hive.explain.user=false;
set hive.spark.dynamic.partition.pruning=true;
set hive.tez.dynamic.partition.pruning=true;
set hive.auto.convert.join=false;
set hive.cbo.enable=false;
explain
select * from
  (select srcpart.ds,srcpart.key from srcpart join src on srcpart.ds=src.key) a
join
  (select srcpart.ds,srcpart.key from srcpart join src on srcpart.ds=src.value) 
b
on a.key=b.key;
{code}

The explain output:
{code}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
Spark
  DagName: root_20171023004308_4b3c304e-3deb-4193-846d-12cf9e6a50ab:2
  Vertices:
Map 8 
Map Operator Tree:
TableScan
  alias: src
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
Select Operator
  expressions: key (type: string)
  outputColumnNames: _col0
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Group By Operator
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
Spark Partition Pruning Sink Operator
  Target column: ds (string)
  partition key expr: ds
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  target work: Map 1

  Stage: Stage-1
Spark
  Edges:
Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 1), Map 4 (PARTITION-LEVEL 
SORT, 1)
Reducer 3 <- Reducer 2 (PARTITION-LEVEL SORT, 1), Reducer 6 
(PARTITION-LEVEL SORT, 1)
Reducer 6 <- Map 1 (PARTITION-LEVEL SORT, 1), Map 7 (PARTITION-LEVEL 
SORT, 1)
  DagName: root_20171023004308_4b3c304e-3deb-4193-846d-12cf9e6a50ab:1
  Vertices:
Map 1 
Map Operator Tree:
TableScan
  alias: srcpart
  Statistics: Num rows: 232 Data size: 23248 Basic stats: 
COMPLETE Column stats: NONE
  Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 232 Data size: 23248 Basic stats: 
COMPLETE Column stats: NONE
Reduce Output Operator
  key expressions: ds (type: string)
  sort order: +
  Map-reduce partition columns: ds (type: string)
  Statistics: Num rows: 232 Data size: 23248 Basic stats: 
COMPLETE Column stats: NONE
  value expressions: key (type: string)
Map 4 
Map Operator Tree:
TableScan
  alias: src
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
Reduce Output Operator
  key expressions: key (type: string)
  sort order: +
  Map-reduce partition columns: key (type: string)
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
Map 7 
Map Operator Tree:
TableScan
  alias: src
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Filter Operator
predicate: value is not null (type: boolean)
Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
Reduce Output Operator
  key expressions: value (type: string)
  sort order: +
  Map-reduce partition columns: value (type: string)
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
Reducer 2 
Reduce Operator Tree:
  Join Operator
condition map:
 Inner Join 0 to 1
keys:
  

[jira] [Commented] (HIVE-17193) HoS: don't combine map works that are targets of different DPPs

2017-10-22 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214603#comment-16214603
 ] 

liyunzhang_intel commented on HIVE-17193:
-

[~lirui]: I remember this problem from when I developed HIVE-16948, but I 
cannot reproduce it on Hive (commit a51ae9c) now:
{code}
set hive.explain.user=false;
set hive.spark.dynamic.partition.pruning=true;
set hive.tez.dynamic.partition.pruning=true;
set hive.auto.convert.join=false;
explain
select * from
  (select srcpart.ds,srcpart.key from srcpart join src on srcpart.ds=src.key) a
join
  (select srcpart.ds,srcpart.key from srcpart join src on srcpart.ds=src.value) 
b
on a.key=b.key;
{code}
The explain output:
{code}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
Spark
  DagName: root_20171022233200_990c146c-b49f-49b9-9a5b-a0028e34f200:2
  Vertices:
Map 8 
Map Operator Tree:
TableScan
  alias: src
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
Select Operator
  expressions: key (type: string)
  outputColumnNames: _col0
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Select Operator
expressions: _col0 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
Group By Operator
  keys: _col0 (type: string)
  mode: hash
  outputColumnNames: _col0
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Spark Partition Pruning Sink Operator
Target column: ds (string)
partition key expr: ds
Statistics: Num rows: 58 Data size: 5812 Basic 
stats: COMPLETE Column stats: NONE
target work: Map 1
Map 9 
Map Operator Tree:
TableScan
  alias: src
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Filter Operator
predicate: value is not null (type: boolean)
Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
Select Operator
  expressions: value (type: string)
  outputColumnNames: _col0
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Select Operator
expressions: _col0 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
Group By Operator
  keys: _col0 (type: string)
  mode: hash
  outputColumnNames: _col0
  Statistics: Num rows: 58 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Spark Partition Pruning Sink Operator
Target column: ds (string)
partition key expr: ds
Statistics: Num rows: 58 Data size: 5812 Basic 
stats: COMPLETE Column stats: NONE
target work: Map 5

  Stage: Stage-1
Spark
  Edges:
Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 1), Map 4 (PARTITION-LEVEL 
SORT, 1)
Reducer 3 <- Reducer 2 (PARTITION-LEVEL SORT, 1), Reducer 6 
(PARTITION-LEVEL SORT, 1)
Reducer 6 <- Map 5 (PARTITION-LEVEL SORT, 1), Map 7 (PARTITION-LEVEL 
SORT, 1)
  DagName: root_20171022233200_990c146c-b49f-49b9-9a5b-a0028e34f200:1
  Vertices:
Map 1 
Map Operator Tree:
TableScan
  alias: srcpart
  Statistics: Num rows: 232 Data size: 23248 Basic stats: 
COMPLETE Column stats: NONE
  Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 232 Data size: 23248 Basic stats: 
COMPLETE Column stats: NONE
Select Operator
  expressions: key (type: string), ds (type: string)
  

[jira] [Updated] (HIVE-17634) Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)

2017-10-09 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17634:

Description:  In the statistics estimation 
([StatsRulesProcFactory|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L134]), 
we do not estimate the column stats once we set hive.stats.fetch.column.stats 
to false. Suggest estimating the data size by column type when 
{{hive.stats.fetch.column.stats}} is false, as HIVE-17634.1.patch does.  (was: 
 In the statistics estimation([StatsRulesProcFactory|), we do not estimate the 
column stats once we set hive.stats.fetch.column.stats as false.Suggest to 
estimate the data size by column type when {{hive.stats.fetch.column.stats}} as 
false like HIVE-17634.1.patch does.)
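
To make "estimate the data size by column type" concrete, here is a rough 
illustration with assumed per-type sizes (illustrative only -- this is not 
taken from HIVE-17634.1.patch):
{code}
// Hypothetical helper: derive a per-row byte estimate from the column type
// when no column statistics were fetched from the metastore.
static long estimatedBytesPerRow(String colType) {
  switch (colType.toLowerCase()) {
    case "boolean":                return 1;
    case "int":    case "float":   return 4;
    case "bigint": case "double":  return 8;
    default:                       return 100; // e.g. strings: assume an average width
  }
}
{code}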

> Estimate the column stats even not retrieve columns from 
> metastore(hive.stats.fetch.column.stats as false)
> --
>
> Key: HIVE-17634
> URL: https://issues.apache.org/jira/browse/HIVE-17634
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17634.1.patch, HIVE-17634.patch
>
>
> In the statistics estimation 
> ([StatsRulesProcFactory|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L134]), 
> we do not estimate the column stats once we set 
> hive.stats.fetch.column.stats to false. Suggest estimating the data size by 
> column type when {{hive.stats.fetch.column.stats}} is false, as 
> HIVE-17634.1.patch does.





[jira] [Updated] (HIVE-17634) Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)

2017-10-09 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17634:

Description:  In the statistics estimation 
([StatsRulesProcFactory|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L134]), 
we do not estimate the column stats once we set hive.stats.fetch.column.stats 
to false. Suggest estimating the data size by column type when 
{{hive.stats.fetch.column.stats}} is false, as HIVE-17634.1.patch does.  (was: 
in 
[RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309],
 we set {{fetchColStats}},{{fetchPartStats}} as true when call 
{{StatsUtils.collectStatistics}}
{code}

   if (!hiveTblMetadata.isPartitioned()) {
// 2.1 Handle the case for unpartitioned table.
try {
  Statistics stats = StatsUtils.collectStatistics(hiveConf, null,
  hiveTblMetadata, hiveNonPartitionCols, 
nonPartColNamesThatRqrStats,
  colStatsCached, nonPartColNamesThatRqrStats, true, true);
  ...
{code}

This will cause querying columns statistic from metastore even we set  
{{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} as 
false in HiveConf.  If we these two properties as false, we can not any column 
statistics from metastore.  Suggest to set the properties from HiveConf. )
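
For reference, a minimal sketch of the direction suggested above: read the two 
flags from HiveConf instead of hardcoding {{true, true}} (assuming the 
corresponding ConfVars entries; the argument list mirrors the snippet quoted 
above, and this is not necessarily the exact patch):
{code}
// Sketch: take fetchColStats/fetchPartStats from HiveConf rather than literals.
boolean fetchColStats =
    HiveConf.getBoolVar(hiveConf, HiveConf.ConfVars.HIVE_STATS_FETCH_COLUMN_STATS);
boolean fetchPartStats =
    HiveConf.getBoolVar(hiveConf, HiveConf.ConfVars.HIVE_STATS_FETCH_PARTITION_STATS);
Statistics stats = StatsUtils.collectStatistics(hiveConf, null,
    hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats,
    colStatsCached, nonPartColNamesThatRqrStats, fetchColStats, fetchPartStats);
{code}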

> Estimate the column stats even not retrieve columns from 
> metastore(hive.stats.fetch.column.stats as false)
> --
>
> Key: HIVE-17634
> URL: https://issues.apache.org/jira/browse/HIVE-17634
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17634.1.patch, HIVE-17634.patch
>
>
> In the statistics estimation 
> ([StatsRulesProcFactory|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L134]), 
> we do not estimate the column stats once we set 
> hive.stats.fetch.column.stats to false. Suggest estimating the data size by 
> column type when {{hive.stats.fetch.column.stats}} is false, as 
> HIVE-17634.1.patch does.





[jira] [Commented] (HIVE-17634) Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)

2017-09-30 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187239#comment-16187239
 ] 

liyunzhang_intel commented on HIVE-17634:
-

[~vgarg]: thanks for the command, I will try. Oct 1 - Oct 8 is a Chinese 
holiday, so the patch may be delayed for some time; thanks for your patience.

> Estimate the column stats even not retrieve columns from 
> metastore(hive.stats.fetch.column.stats as false)
> --
>
> Key: HIVE-17634
> URL: https://issues.apache.org/jira/browse/HIVE-17634
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17634.1.patch, HIVE-17634.patch
>
>
> In 
> [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], 
> we set {{fetchColStats}} and {{fetchPartStats}} to true when calling 
> {{StatsUtils.collectStatistics}}:
> {code}
> if (!hiveTblMetadata.isPartitioned()) {
>   // 2.1 Handle the case for unpartitioned table.
>   try {
>     Statistics stats = StatsUtils.collectStatistics(hiveConf, null,
>         hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats,
>         colStatsCached, nonPartColNamesThatRqrStats, true, true);
>     ...
> {code}
> This will cause querying column statistics from the metastore even when we 
> set {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} 
> to false in HiveConf. If we set these two properties to false, we cannot get 
> any column statistics from the metastore. Suggest reading the properties from 
> HiveConf.





[jira] [Commented] (HIVE-17634) Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)

2017-09-29 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16186851#comment-16186851
 ] 

liyunzhang_intel commented on HIVE-17634:
-

The command I use to regenerate all the q*.out files is
{code}
mvn clean test -Dtest=TestCliDriver -Dtest.output.overwrite=true -Dqfile=*
{code}

If this is not correct, please tell me which command I should use to 
regenerate all the q*.out files, thanks!

> Estimate the column stats even not retrieve columns from 
> metastore(hive.stats.fetch.column.stats as false)
> --
>
> Key: HIVE-17634
> URL: https://issues.apache.org/jira/browse/HIVE-17634
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17634.1.patch, HIVE-17634.patch
>
>
> In 
> [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], 
> we set {{fetchColStats}} and {{fetchPartStats}} to true when calling 
> {{StatsUtils.collectStatistics}}:
> {code}
> if (!hiveTblMetadata.isPartitioned()) {
>   // 2.1 Handle the case for unpartitioned table.
>   try {
>     Statistics stats = StatsUtils.collectStatistics(hiveConf, null,
>         hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats,
>         colStatsCached, nonPartColNamesThatRqrStats, true, true);
>     ...
> {code}
> This will cause querying column statistics from the metastore even when we 
> set {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} 
> to false in HiveConf. If we set these two properties to false, we cannot get 
> any column statistics from the metastore. Suggest reading the properties from 
> HiveConf.





[jira] [Commented] (HIVE-17634) Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)

2017-09-29 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16186830#comment-16186830
 ] 

liyunzhang_intel commented on HIVE-17634:
-

[~vgarg]: there are 1243 failed/errored test(s). Most failures look like:
{code}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[bucket_map_join_spark4]

Failing for the past 1 build (Since Failed#7056 )
Took 10 sec.
Error Message

Client Execution succeeded but contained differences (error code = 1) after 
executing bucket_map_join_spark4.q 
88c88
< Statistics: Num rows: 10 Data size: 1880 Basic stats: COMPLETE 
Column stats: NONE
---
> Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE 
> Column stats: NONE
{code}
This is because we now use 
[betterDS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L358] 
rather than {{ds}} to estimate the data size; the data size changed from 70 to 
1880.
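
For reference, a condensed restatement of that selection logic, simplified 
from the StatsUtils snippet quoted in the HIVE-17634 comments below (names 
mirror that snippet):
{code}
// Prefer the estimate derived from column stats when it is usable; otherwise
// keep the raw data size. E.g. chooseDataSize(70, 1880, true) == 1880,
// matching the diff above.
static long chooseDataSize(long rawDs, long betterDs, boolean haveColStats) {
  return (betterDs < 1 || !haveColStats) ? rawDs : betterDs;
}
{code}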

Do you think it is OK? If so, I will start regenerating the *.q.out files on 
my local cluster.

> Estimate the column stats even not retrieve columns from 
> metastore(hive.stats.fetch.column.stats as false)
> --
>
> Key: HIVE-17634
> URL: https://issues.apache.org/jira/browse/HIVE-17634
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17634.1.patch, HIVE-17634.patch
>
>
> In 
> [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], 
> we set {{fetchColStats}} and {{fetchPartStats}} to true when calling 
> {{StatsUtils.collectStatistics}}:
> {code}
> if (!hiveTblMetadata.isPartitioned()) {
>   // 2.1 Handle the case for unpartitioned table.
>   try {
>     Statistics stats = StatsUtils.collectStatistics(hiveConf, null,
>         hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats,
>         colStatsCached, nonPartColNamesThatRqrStats, true, true);
>     ...
> {code}
> This will cause querying column statistics from the metastore even when we 
> set {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} 
> to false in HiveConf. If we set these two properties to false, we cannot get 
> any column statistics from the metastore. Suggest reading the properties from 
> HiveConf.





[jira] [Updated] (HIVE-17634) Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)

2017-09-29 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17634:

Status: Patch Available  (was: Open)

> Estimate the column stats even not retrieve columns from 
> metastore(hive.stats.fetch.column.stats as false)
> --
>
> Key: HIVE-17634
> URL: https://issues.apache.org/jira/browse/HIVE-17634
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17634.1.patch, HIVE-17634.patch
>
>
> In 
> [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], 
> we set {{fetchColStats}} and {{fetchPartStats}} to true when calling 
> {{StatsUtils.collectStatistics}}:
> {code}
> if (!hiveTblMetadata.isPartitioned()) {
>   // 2.1 Handle the case for unpartitioned table.
>   try {
>     Statistics stats = StatsUtils.collectStatistics(hiveConf, null,
>         hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats,
>         colStatsCached, nonPartColNamesThatRqrStats, true, true);
>     ...
> {code}
> This will cause querying column statistics from the metastore even when we 
> set {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} 
> to false in HiveConf. If we set these two properties to false, we cannot get 
> any column statistics from the metastore. Suggest reading the properties from 
> HiveConf.





[jira] [Commented] (HIVE-17634) Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)

2017-09-29 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16186454#comment-16186454
 ] 

liyunzhang_intel commented on HIVE-17634:
-

[~vgarg]: thanks for the review. Triggering the tests now.

> Estimate the column stats even not retrieve columns from 
> metastore(hive.stats.fetch.column.stats as false)
> --
>
> Key: HIVE-17634
> URL: https://issues.apache.org/jira/browse/HIVE-17634
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17634.1.patch, HIVE-17634.patch
>
>
> In 
> [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], 
> we set {{fetchColStats}} and {{fetchPartStats}} to true when calling 
> {{StatsUtils.collectStatistics}}:
> {code}
> if (!hiveTblMetadata.isPartitioned()) {
>   // 2.1 Handle the case for unpartitioned table.
>   try {
>     Statistics stats = StatsUtils.collectStatistics(hiveConf, null,
>         hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats,
>         colStatsCached, nonPartColNamesThatRqrStats, true, true);
>     ...
> {code}
> This will cause querying column statistics from the metastore even when we 
> set {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} 
> to false in HiveConf. If we set these two properties to false, we cannot get 
> any column statistics from the metastore. Suggest reading the properties from 
> HiveConf.





[jira] [Updated] (HIVE-17634) Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)

2017-09-29 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17634:

Summary: Estimate the column stats even not retrieve columns from 
metastore(hive.stats.fetch.column.stats as false)  (was: Use properties from 
HiveConf about "fetchColStats" and "fetchPartStats" in 
RelOptHiveTable#updateColStats)

> Estimate the column stats even not retrieve columns from 
> metastore(hive.stats.fetch.column.stats as false)
> --
>
> Key: HIVE-17634
> URL: https://issues.apache.org/jira/browse/HIVE-17634
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17634.1.patch, HIVE-17634.patch
>
>
> In 
> [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], 
> we set {{fetchColStats}} and {{fetchPartStats}} to true when calling 
> {{StatsUtils.collectStatistics}}:
> {code}
> if (!hiveTblMetadata.isPartitioned()) {
>   // 2.1 Handle the case for unpartitioned table.
>   try {
>     Statistics stats = StatsUtils.collectStatistics(hiveConf, null,
>         hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats,
>         colStatsCached, nonPartColNamesThatRqrStats, true, true);
>     ...
> {code}
> This will cause querying column statistics from the metastore even when we 
> set {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} 
> to false in HiveConf. If we set these two properties to false, we cannot get 
> any column statistics from the metastore. Suggest reading the properties from 
> HiveConf.





[jira] [Updated] (HIVE-17634) Use properties from HiveConf about "fetchColStats" and "fetchPartStats" in RelOptHiveTable#updateColStats

2017-09-29 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17634:

Attachment: HIVE-17634.1.patch

[~vgarg]: thanks for your reply. I did indeed hit the problem of incorrect 
statistics when I set {{hive.stats.fetch.column.stats}} to false. Attaching 
HIVE-17634.1.patch; please help review.

> Use properties from HiveConf about "fetchColStats" and "fetchPartStats" in 
> RelOptHiveTable#updateColStats
> -
>
> Key: HIVE-17634
> URL: https://issues.apache.org/jira/browse/HIVE-17634
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17634.1.patch, HIVE-17634.patch
>
>
> In 
> [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], 
> we set {{fetchColStats}} and {{fetchPartStats}} to true when calling 
> {{StatsUtils.collectStatistics}}:
> {code}
> if (!hiveTblMetadata.isPartitioned()) {
>   // 2.1 Handle the case for unpartitioned table.
>   try {
>     Statistics stats = StatsUtils.collectStatistics(hiveConf, null,
>         hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats,
>         colStatsCached, nonPartColNamesThatRqrStats, true, true);
>     ...
> {code}
> This will cause querying column statistics from the metastore even when we 
> set {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} 
> to false in HiveConf. If we set these two properties to false, we cannot get 
> any column statistics from the metastore. Suggest reading the properties from 
> HiveConf.





[jira] [Commented] (HIVE-17634) Use properties from HiveConf about "fetchColStats" and "fetchPartStats" in RelOptHiveTable#updateColStats

2017-09-28 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16185327#comment-16185327
 ] 

liyunzhang_intel commented on HIVE-17634:
-

[~vgarg]: thanks for your explanation.
{quote}
I am not convinced why would user not want to fetch stats from metastore and 
instead rely upon estimated statistics?
{quote}
The documentation says "Fetching column statistics for each needed column can 
be expensive when the number of columns is high", and the default value of 
hive.stats.fetch.column.stats is false. Maybe users do not enable this property 
because they would need to run {{analyze table xxx compute statistics for 
columns}} to collect column statistics, and this command is time-consuming for 
tables with a high number of columns.
{code}
HIVE_STATS_FETCH_COLUMN_STATS("hive.stats.fetch.column.stats", false,
"Annotation of operator tree with statistics information requires 
column statistics.\n" +
"Column statistics are fetched from metastore. Fetching column 
statistics for each needed column\n" +
"can be expensive when the number of columns is high. This flag can be 
used to disable fetching\n" +
"of column statistics from metastore."),
{code}


> Use properties from HiveConf about "fetchColStats" and "fetchPartStats" in 
> RelOptHiveTable#updateColStats
> -
>
> Key: HIVE-17634
> URL: https://issues.apache.org/jira/browse/HIVE-17634
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17634.patch
>
>
> In 
> [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], 
> we set {{fetchColStats}} and {{fetchPartStats}} to true when calling 
> {{StatsUtils.collectStatistics}}:
> {code}
> if (!hiveTblMetadata.isPartitioned()) {
>   // 2.1 Handle the case for unpartitioned table.
>   try {
>     Statistics stats = StatsUtils.collectStatistics(hiveConf, null,
>         hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats,
>         colStatsCached, nonPartColNamesThatRqrStats, true, true);
>     ...
> {code}
> This will cause querying column statistics from the metastore even when we 
> set {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} 
> to false in HiveConf. If we set these two properties to false, we cannot get 
> any column statistics from the metastore. Suggest reading the properties from 
> HiveConf.





[jira] [Commented] (HIVE-17634) Use properties from HiveConf about "fetchColStats" and "fetchPartStats" in RelOptHiveTable#updateColStats

2017-09-28 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16185281#comment-16185281
 ] 

liyunzhang_intel commented on HIVE-17634:
-

[~vgarg]: thanks for your reply. I can understand the importance of column 
stats for estimating the statistics. What I am confused about is that in the 
logical plan we pass {{true}} to get the column stats from the metastore, and 
even when we get no 
[result|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L351] 
from the metastore we still 
[estimateStatsForMissingCols|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L354]. 
But in the statistics estimation 
([StatsRulesProcFactory|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L134]), 
we do not estimate the column stats at all once we set 
{{hive.stats.fetch.column.stats}} as false. Can we do some refactoring of 
[StatsUtils#collectStatistics|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L349] 
like this?
{code}
if (fetchColStats) {
  colStats = getTableColumnStats(table, schema, neededColumns, colStatsCache);
  ...
}
// Even when we do not fetch column stats from the metastore, we still
// estimate the column stats.
if (colStats == null) {
  colStats = Lists.newArrayList();
}

estimateStatsForMissingCols(neededColumns, colStats, table, conf, nr, schema);

// we should have stats for all columns (estimated or actual)
assert(neededColumns.size() == colStats.size());
long betterDS = getDataSizeFromColumnStats(nr, colStats);
ds = (betterDS < 1 || colStats.isEmpty()) ? ds : betterDS;
{code}

> Use properties from HiveConf about "fetchColStats" and "fetchPartStats" in 
> RelOptHiveTable#updateColStats
> -
>
> Key: HIVE-17634
> URL: https://issues.apache.org/jira/browse/HIVE-17634
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17634.patch
>
>
> In 
> [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], 
> we set {{fetchColStats}} and {{fetchPartStats}} to true when calling 
> {{StatsUtils.collectStatistics}}:
> {code}
> if (!hiveTblMetadata.isPartitioned()) {
>   // 2.1 Handle the case for unpartitioned table.
>   try {
>     Statistics stats = StatsUtils.collectStatistics(hiveConf, null,
>         hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats,
>         colStatsCached, nonPartColNamesThatRqrStats, true, true);
>     ...
> {code}
> This will cause querying column statistics from the metastore even when we 
> set {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} 
> to false in HiveConf. If we set these two properties to false, we cannot get 
> any column statistics from the metastore. Suggest reading the properties from 
> HiveConf.





[jira] [Updated] (HIVE-17634) Use properties from HiveConf about "fetchColStats" and "fetchPartStats" in RelOptHiveTable#updateColStats

2017-09-28 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17634:

Summary: Use properties from HiveConf about "fetchColStats" and 
"fetchPartStats" in RelOptHiveTable#updateColStats  (was: Use properties from 
HiveConf in RelOptHiveTable#updateColStats)

> Use properties from HiveConf about "fetchColStats" and "fetchPartStats" in 
> RelOptHiveTable#updateColStats
> -
>
> Key: HIVE-17634
> URL: https://issues.apache.org/jira/browse/HIVE-17634
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17634.patch
>
>
> In 
> [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], 
> we set {{fetchColStats}} and {{fetchPartStats}} to true when calling 
> {{StatsUtils.collectStatistics}}:
> {code}
> if (!hiveTblMetadata.isPartitioned()) {
>   // 2.1 Handle the case for unpartitioned table.
>   try {
>     Statistics stats = StatsUtils.collectStatistics(hiveConf, null,
>         hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats,
>         colStatsCached, nonPartColNamesThatRqrStats, true, true);
>     ...
> {code}
> This will cause querying column statistics from the metastore even when we 
> set {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} 
> to false in HiveConf. If we set these two properties to false, we cannot get 
> any column statistics from the metastore. Suggest reading the properties from 
> HiveConf.





[jira] [Updated] (HIVE-17634) Use properties from HiveConf in RelOptHiveTable#updateColStats

2017-09-28 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17634:

Attachment: HIVE-17634.patch

[~vgarg], [~jcamachorodriguez]: As you have more knowledge about 
RelOptHiveTable and statistics estimation, could you take a look at the patch? 
Thanks!

> Use properties from HiveConf in RelOptHiveTable#updateColStats
> --
>
> Key: HIVE-17634
> URL: https://issues.apache.org/jira/browse/HIVE-17634
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17634.patch
>
>
> In 
> [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], 
> we set {{fetchColStats}} and {{fetchPartStats}} to true when calling 
> {{StatsUtils.collectStatistics}}:
> {code}
> if (!hiveTblMetadata.isPartitioned()) {
>   // 2.1 Handle the case for unpartitioned table.
>   try {
>     Statistics stats = StatsUtils.collectStatistics(hiveConf, null,
>         hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats,
>         colStatsCached, nonPartColNamesThatRqrStats, true, true);
>     ...
> {code}
> This will cause querying column statistics from the metastore even when we 
> set {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} 
> to false in HiveConf. If we set these two properties to false, we cannot get 
> any column statistics from the metastore. Suggest reading the properties from 
> HiveConf.





[jira] [Assigned] (HIVE-17634) Use properties from HiveConf in RelOptHiveTable#updateColStats

2017-09-28 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel reassigned HIVE-17634:
---

Assignee: liyunzhang_intel

> Use properties from HiveConf in RelOptHiveTable#updateColStats
> --
>
> Key: HIVE-17634
> URL: https://issues.apache.org/jira/browse/HIVE-17634
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
>
> In 
> [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], 
> we set {{fetchColStats}} and {{fetchPartStats}} to true when calling 
> {{StatsUtils.collectStatistics}}:
> {code}
> if (!hiveTblMetadata.isPartitioned()) {
>   // 2.1 Handle the case for unpartitioned table.
>   try {
>     Statistics stats = StatsUtils.collectStatistics(hiveConf, null,
>         hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats,
>         colStatsCached, nonPartColNamesThatRqrStats, true, true);
>     ...
> {code}
> This will cause querying column statistics from the metastore even when we 
> set {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} 
> to false in HiveConf. If we set these two properties to false, we cannot get 
> any column statistics from the metastore. Suggest reading the properties from 
> HiveConf.





[jira] [Updated] (HIVE-17182) Invalid statistics like "RAW DATA SIZE" info for parquet file

2017-09-28 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17182:

Description: 
On the TPC-DS 200 GB scale store_sales table, 
use "describe formatted store_sales" to view the statistics:
{code}
hive> describe formatted store_sales;
OK
# col_name  data_type   comment 
 
ss_sold_time_sk bigint  
ss_item_sk  bigint  
ss_customer_sk  bigint  
ss_cdemo_sk bigint  
ss_hdemo_sk bigint  
ss_addr_sk  bigint  
ss_store_sk bigint  
ss_promo_sk bigint  
ss_ticket_numberbigint  
ss_quantity int 
ss_wholesale_cost   double  
ss_list_price   double  
ss_sales_price  double  
ss_ext_discount_amt double  
ss_ext_sales_price  double  
ss_ext_wholesale_cost   double  
ss_ext_list_price   double  
ss_ext_tax  double  
ss_coupon_amt   double  
ss_net_paid double  
ss_net_paid_inc_tax double  
ss_net_profit   double  
 
# Partition Information  
# col_name  data_type   comment 
 
ss_sold_date_sk bigint  
 
# Detailed Table Information 
Database:   tpcds_bin_partitioned_parquet_200
Owner:  root 
CreateTime: Tue Jun 06 11:51:48 CST 2017 
LastAccessTime: UNKNOWN  
Retention:  0
Location:   
hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
  
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
numFiles2023
numPartitions   1824
numRows 575995635   
rawDataSize 12671903970 
totalSize   46465926745 
transient_lastDdlTime   1496721108  
{code}
The rawDataSize is nearly 12 GB while the totalSize is nearly 46 GB.
Viewing the original data on HDFS:
{noformat}
#hadoop fs -du -h /tmp/tpcds-generate/200/
75.8 G   /tmp/tpcds-generate/200/store_sales
{noformat} 
Viewing the parquet files on HDFS:
{noformat}
# hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db
43.3 G   /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
{noformat}

The original data is nearly 75 GB on HDFS, yet the "describe formatted 
store_sales" output shows a rawDataSize of only about 12 GB.


I tried to use "analyze table store_sales compute statistics for columns" to 
update the statistics but there is no change for RAWDATASIZE;

I tried to use "analyze table store_sales partition(ss_sold_date_sk) compute 
statistics no scan" to update the statistics but fail, the error is 
{code}
2017-09-28T03:21:04,849  INFO [StatsNoJobTask-Thread-1] exec.Task: [Warning] 
could not update stats for 
tpcds_bin_partitioned_parquet_10.store_sales{ss_sold_date_sk=2451769}.Failed 
with exception Missing timezone id for parquet int96 conversion!
java.lang.IllegalArgumentException: Missing timezone id for parquet int96 
conversion!
 at 
org.apache.hadoop.hive.ql.io.parquet.timestamp.NanoTimeUtils.validateTimeZone(NanoTimeUtils.java:169)
 at 
org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.setTimeZoneConversion(ParquetRecordReaderBase.java:182)
 at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:89)
 at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:59)
 at 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:86)
 at 
org.apache.hadoop.hive.ql.exec.StatsNoJobTask$StatsCollection.run(StatsNoJobTask.java:164)
 at 

[jira] [Updated] (HIVE-17182) Invalid statistics like "RAW DATA SIZE" info for parquet file

2017-09-28 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17182:

Description: 
On the TPC-DS 200 GB scale store_sales table, 
use "describe formatted store_sales" to view the statistics:
{code}
hive> describe formatted store_sales;
OK
# col_name  data_type   comment 
 
ss_sold_time_sk bigint  
ss_item_sk  bigint  
ss_customer_sk  bigint  
ss_cdemo_sk bigint  
ss_hdemo_sk bigint  
ss_addr_sk  bigint  
ss_store_sk bigint  
ss_promo_sk bigint  
ss_ticket_numberbigint  
ss_quantity int 
ss_wholesale_cost   double  
ss_list_price   double  
ss_sales_price  double  
ss_ext_discount_amt double  
ss_ext_sales_price  double  
ss_ext_wholesale_cost   double  
ss_ext_list_price   double  
ss_ext_tax  double  
ss_coupon_amt   double  
ss_net_paid double  
ss_net_paid_inc_tax double  
ss_net_profit   double  
 
# Partition Information  
# col_name  data_type   comment 
 
ss_sold_date_sk bigint  
 
# Detailed Table Information 
Database:   tpcds_bin_partitioned_parquet_200
Owner:  root 
CreateTime: Tue Jun 06 11:51:48 CST 2017 
LastAccessTime: UNKNOWN  
Retention:  0
Location:   
hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
  
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
numFiles2023
numPartitions   1824
numRows 575995635   
rawDataSize 12671903970 
totalSize   46465926745 
transient_lastDdlTime   1496721108  
{code}
The rawDataSize is nearly 12 GB while the totalSize is nearly 46 GB.
Viewing the original data on HDFS:
{noformat}
#hadoop fs -du -h /tmp/tpcds-generate/200/
75.8 G   /tmp/tpcds-generate/200/store_sales
{noformat} 
Viewing the parquet files on HDFS:
{noformat}
# hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db
43.3 G   /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
{noformat}

The original data is nearly 75 GB on HDFS, yet the "describe formatted 
store_sales" output shows a rawDataSize of only about 12 GB.


I tried to use "analyze table store_sales compute statistics for columns" to 
update the statistics but there is no change for RAWDATASIZE;

I tried to use "analyze table store_sales partition(ss_sold_date_sk) compute 
statistics no scan" to update the statistics but fail, the error is 
{code}

{code}


  was:
on TPC-DS 200g scale store_sales
use "describe formatted store_sales" to view the statistics
{code}
hive> describe formatted store_sales;
OK
# col_name  data_type   comment 
 
ss_sold_time_sk bigint  
ss_item_sk  bigint  
ss_customer_sk  bigint  
ss_cdemo_sk bigint  
ss_hdemo_sk bigint  
ss_addr_sk  bigint  
ss_store_sk bigint  
ss_promo_sk bigint  
ss_ticket_numberbigint  
ss_quantity int 
ss_wholesale_cost   double  
ss_list_price   double  

[jira] [Updated] (HIVE-17182) Invalid statistics like "RAW DATA SIZE" info for parquet file

2017-09-28 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17182:

Description: 
On the TPC-DS 200 GB scale store_sales table, 
use "describe formatted store_sales" to view the statistics:
{code}
hive> describe formatted store_sales;
OK
# col_name  data_type   comment 
 
ss_sold_time_sk bigint  
ss_item_sk  bigint  
ss_customer_sk  bigint  
ss_cdemo_sk bigint  
ss_hdemo_sk bigint  
ss_addr_sk  bigint  
ss_store_sk bigint  
ss_promo_sk bigint  
ss_ticket_numberbigint  
ss_quantity int 
ss_wholesale_cost   double  
ss_list_price   double  
ss_sales_price  double  
ss_ext_discount_amt double  
ss_ext_sales_price  double  
ss_ext_wholesale_cost   double  
ss_ext_list_price   double  
ss_ext_tax  double  
ss_coupon_amt   double  
ss_net_paid double  
ss_net_paid_inc_tax double  
ss_net_profit   double  
 
# Partition Information  
# col_name  data_type   comment 
 
ss_sold_date_sk bigint  
 
# Detailed Table Information 
Database:   tpcds_bin_partitioned_parquet_200
Owner:  root 
CreateTime: Tue Jun 06 11:51:48 CST 2017 
LastAccessTime: UNKNOWN  
Retention:  0
Location:   
hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
  
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
numFiles2023
numPartitions   1824
numRows 575995635   
rawDataSize 12671903970 
totalSize   46465926745 
transient_lastDdlTime   1496721108  
{code}
The rawDataSize is nearly 12 GB while the totalSize is nearly 46 GB.
Viewing the original data on HDFS:
{noformat}
#hadoop fs -du -h /tmp/tpcds-generate/200/
75.8 G   /tmp/tpcds-generate/200/store_sales
{noformat} 
Viewing the parquet files on HDFS:
{noformat}
# hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db
43.3 G   /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
{noformat}

The original data is nearly 75 GB on HDFS, yet the "describe formatted 
store_sales" output shows a rawDataSize of only about 12 GB.


I tried to use "analyze table store_sales compute statistics for columns" to 
update the statistics but there is no change for RAWDATASIZE;

I tried to use "analyze table store_sales partition() compute statistics no 
scan" to update the statistics but fail, the error is 
{code}
FAILED: SemanticException [Error 10115]: Table is partitioned and partition 
specification is needed
{code}


  was:
on TPC-DS 200g scale store_sales
use "describe formatted store_sales" to view the statistics
{code}
hive> describe formatted store_sales;
OK
# col_name  data_type   comment 
 
ss_sold_time_sk bigint  
ss_item_sk  bigint  
ss_customer_sk  bigint  
ss_cdemo_sk bigint  
ss_hdemo_sk bigint  
ss_addr_sk  bigint  
ss_store_sk bigint  
ss_promo_sk bigint  
ss_ticket_numberbigint  
ss_quantity int 
ss_wholesale_cost   double   

[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-09-27 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17486:

Description: 
HIVE-16602 implemented shared scans for Tez.

Given a query plan, the goal is to identify scans on input tables that can be 
merged so the data is read only once. Optimization will be carried out at the 
physical level. In Hive on Spark, the result of a Spark work is cached if the 
work is used by more than 1 child Spark work. Once SharedWorkOptimizer is 
enabled in the HoS physical plan, identical table scans are merged into 1 
table scan, whose result is then used by more than 1 child Spark work. Thus 
the cache mechanism lets us avoid repeating the same computation.

  was:
in HIVE-16602, Implement shared scans with Tez.

Given a query plan, the goal is to identify scans on input tables that can be 
merged so the data is read only once. Optimization will be carried out at the 
physical level.


> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
>
> HIVE-16602 implemented shared scans for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level. In Hive on Spark, the result of a Spark work is cached if the 
> work is used by more than 1 child Spark work. Once SharedWorkOptimizer is 
> enabled in the HoS physical plan, identical table scans are merged into 1 
> table scan, whose result is then used by more than 1 child Spark work. Thus 
> the cache mechanism lets us avoid repeating the same computation.





[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-09-27 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17486:

Description: 
HIVE-16602 implemented shared scans for Tez.

Given a query plan, the goal is to identify scans on input tables that can be 
merged so the data is read only once. Optimization will be carried out at the 
physical level. In Hive on Spark, the result of a Spark work is cached if the 
work is used by more than 1 child Spark work. Once SharedWorkOptimizer is 
enabled in the HoS physical plan, identical table scans are merged into 1 
table scan, whose result is then used by more than 1 child Spark work. Thus 
the cache mechanism lets us avoid repeating the same computation.

  was:
in HIVE-16602, Implement shared scans with Tez.

Given a query plan, the goal is to identify scans on input tables that can be 
merged so the data is read only once. Optimization will be carried out at the 
physical level.  In Hive on Spark, it caches the result ofsSpark work if the 
spark work is used by more than 1 child spark work. After sharedWorkOptimizer 
is enabled in physical plan in HoS, the identical table scans are merged to 1 
table scan. This result of table scan will be used by more 1 child spark work. 
Thus we need not do the same computation because of cache mechanism.


> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. The optimization is carried out at the 
> physical level. In Hive on Spark, the result of a Spark work is cached if that 
> work is used by more than 1 child Spark work. Once SharedWorkOptimizer is 
> enabled in the physical plan in HoS, identical table scans are merged into 1 
> table scan, whose result is then used by more than 1 child Spark work, so the 
> same computation is not repeated thanks to the cache mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17545) Make HoS RDD Cacheing Optimization Configurable

2017-09-25 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16180288#comment-16180288
 ] 

liyunzhang_intel commented on HIVE-17545:
-

[~lirui]: thanks for the explanation. If the cache is disabled, then even when 
equivalent works are combined, the computation for the same work is still 
executed repeatedly.

> Make HoS RDD Cacheing Optimization Configurable
> ---
>
> Key: HIVE-17545
> URL: https://issues.apache.org/jira/browse/HIVE-17545
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer, Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-17545.1.patch, HIVE-17545.2.patch
>
>
> The RDD caching optimization added in HIVE-10550 is enabled by default. We 
> should make it configurable in case users want to disable it. We can leave it 
> on by default to preserve backwards compatibility.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17545) Make HoS RDD Cacheing Optimization Configurable

2017-09-25 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16180135#comment-16180135
 ] 

liyunzhang_intel commented on HIVE-17545:
-

[~lirui]:  {quote}

if user turns on combining equivalent works and turns off RDD caching, then 
there won't be perf improvement right?
{quote}
If users turn on combining equivalent works, the duplicated map/reduce work will 
be removed. The performance will not change whether RDD caching is enabled or not.
 
 In HoS, the cache is enabled only when the parent spark work has more than 
[1 
child|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L264].
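
A minimal sketch of such a plan shape (the target tables {{t1}}/{{t2}} are 
hypothetical): the map work that scans {{src}} has two child works, which is 
the case where HoS caches its result RDD.
{code}
-- illustrative multi-insert: one parent work feeding two child works
from src
insert overwrite table t1 select key, count(*) group by key
insert overwrite table t2 select value, count(*) group by value;
{code}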
 
If my understanding is not right, please tell me.




> Make HoS RDD Cacheing Optimization Configurable
> ---
>
> Key: HIVE-17545
> URL: https://issues.apache.org/jira/browse/HIVE-17545
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer, Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-17545.1.patch, HIVE-17545.2.patch
>
>
> The RDD caching optimization added in HIVE-10550 is enabled by default. We 
> should make it configurable in case users want to disable it. We can leave it 
> on by default to preserve backwards compatibility.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS

2017-09-25 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel resolved HIVE-17474.
-
Resolution: Not A Problem

> Poor Performance about subquery like DS/query70 on HoS
> --
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
> Attachments: explain.70.after.analyze, explain.70.before.analyze, 
> explain.70.vec
>
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  {code}
> select  
> sum(ss_net_profit) as total_sum
>,s_state
>,s_county
>,grouping__id as lochierarchy
>, rank() over(partition by grouping__id, case when grouping__id == 2 then 
> s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
> store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
> join store s on s.s_store_sk  = ss.ss_store_sk
>  where
> d1.d_month_seq between 1193 and 1193+11
>  and s.s_state in
>  ( select s_state
>from  (select s_state as s_state, sum(ss_net_profit),
>  rank() over ( partition by s_state order by 
> sum(ss_net_profit) desc) as ranking
>   from   store_sales, store, date_dim
>   where  d_month_seq between 1193 and 1193+11
> and date_dim.d_date_sk = 
> store_sales.ss_sold_date_sk
> and store.s_store_sk  = store_sales.ss_store_sk
>   group by s_state
>  ) tmp1 
>where ranking <= 5
>  )
>  group by s_state,s_county with rollup
> order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then s_state end
>   ,rank_within_parent
>  limit 100;
> {code}
>  let's analyze the query:
> part1: the sub-query computes, for each state, the rank by sum(ss_net_profit) 
> and keeps the states whose ranking is <= 5.
> part2: the big table store_sales joins the small tables date_dim and store.
> part3: part1 joins part2.
> the problem is in part3, which is a common join. The key cardinality of part1 
> and part2 is low, as there are not many different values for state (actually 
> there are 30 different values in the table store). With a common join, the big 
> data goes to only 30 reducers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17545) Make HoS RDD Cacheing Optimization Configurable

2017-09-25 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16180063#comment-16180063
 ] 

liyunzhang_intel commented on HIVE-17545:
-

[~stakiar]: sounds good.  But I don't know why the cache optimization was not made 
configurable before. [~lirui]: as you are more familiar with the code, could you 
take some time to look?

> Make HoS RDD Cacheing Optimization Configurable
> ---
>
> Key: HIVE-17545
> URL: https://issues.apache.org/jira/browse/HIVE-17545
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer, Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-17545.1.patch, HIVE-17545.2.patch
>
>
> The RDD caching optimization added in HIVE-10550 is enabled by default. We 
> should make it configurable in case users want to disable it. We can leave it 
> on by default to preserve backwards compatibility.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17545) Make HoS RDD Cacheing Optimization Configurable

2017-09-25 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16178604#comment-16178604
 ] 

liyunzhang_intel commented on HIVE-17545:
-

[~stakiar]: why do we need to make the RDD caching optimization configurable?  Is 
there any problem or performance degradation when the RDD cache optimization is 
enabled?


> Make HoS RDD Cacheing Optimization Configurable
> ---
>
> Key: HIVE-17545
> URL: https://issues.apache.org/jira/browse/HIVE-17545
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer, Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-17545.1.patch, HIVE-17545.2.patch
>
>
> The RDD caching optimization added in HIVE-10550 is enabled by default. We 
> should make it configurable in case users want to disable it. We can leave it 
> on by default to preserve backwards compatibility.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17565) NullPointerException occurs when hive.optimize.skewjoin and hive.auto.convert.join are switched on at the same time

2017-09-21 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175846#comment-16175846
 ] 

liyunzhang_intel commented on HIVE-17565:
-

I can reproduce it in Hive on MR at commit (fafa953); will investigate it later.
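
For reference, the switch combination reported to trigger the NPE (taken from the 
issue title; the failing query itself is in the quoted description below):
{code}
set hive.optimize.skewjoin=true;
set hive.auto.convert.join=true;
{code}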

> NullPointerException occurs when hive.optimize.skewjoin and 
> hive.auto.convert.join are switched on at the same time
> ---
>
> Key: HIVE-17565
> URL: https://issues.apache.org/jira/browse/HIVE-17565
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.2.1
>Reporter: Xin Hao
>Assignee: liyunzhang_intel
>
> (A) NullPointerException occurs when hive.optimize.skewjoin and 
> hive.auto.convert.join are switched on at the same time.
> The query passes when hive.optimize.skewjoin=true and hive.auto.convert.join=false.
> (B) Hive Version:
> Found on Apache Hive 1.2.1
> (C) Workload:
> (1) TPCx-BB Q19
> (2) A small case as below, which was actually simplified from Q19:
> SELECT *
> FROM store_returns sr,
> (
>   SELECT d1.d_date_sk
>   FROM date_dim d1, date_dim d2
>   WHERE d1.d_week_seq = d2.d_week_seq
> ) sr_dateFilter
> WHERE sr.sr_returned_date_sk = d_date_sk;
> (D)Exception Error Message:
> Error: java.lang.RuntimeException: java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:179)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:194)
> at 
> org.apache.hadoop.hive.ql.exec.MapJoinOperator.cleanUpInputFileChangedOp(MapJoinOperator.java:223)
> at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1051)
> at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1055)
> at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1055)
> at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:490)
> at 
> org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)
> ... 8 more



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (HIVE-17565) NullPointerException occurs when hive.optimize.skewjoin and hive.auto.convert.join are switched on at the same time

2017-09-20 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel reassigned HIVE-17565:
---

Assignee: liyunzhang_intel

> NullPointerException occurs when hive.optimize.skewjoin and 
> hive.auto.convert.join are switched on at the same time
> ---
>
> Key: HIVE-17565
> URL: https://issues.apache.org/jira/browse/HIVE-17565
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.2.1
>Reporter: Xin Hao
>Assignee: liyunzhang_intel
>
> (A) NullPointerException occurs when hive.optimize.skewjoin and 
> hive.auto.convert.join are switched on at the same time.
> The query passes when hive.optimize.skewjoin=true and hive.auto.convert.join=false.
> (B) Hive Version:
> Found on Apache Hive 1.2.1
> (C) Workload:
> (1) TPCx-BB Q19
> (2) A small case as below, which was actually simplified from Q19:
> SELECT *
> FROM store_returns sr,
> (
>   SELECT d1.d_date_sk
>   FROM date_dim d1, date_dim d2
>   WHERE d1.d_week_seq = d2.d_week_seq
> ) sr_dateFilter
> WHERE sr.sr_returned_date_sk = d_date_sk;
> (D)Exception Error Message:
> Error: java.lang.RuntimeException: java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:179)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:194)
> at 
> org.apache.hadoop.hive.ql.exec.MapJoinOperator.cleanUpInputFileChangedOp(MapJoinOperator.java:223)
> at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1051)
> at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1055)
> at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1055)
> at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:490)
> at 
> org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)
> ... 8 more



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17565) NullPointerException occurs when hive.optimize.skewjoin and hive.auto.convert.join are switched on at the same time

2017-09-20 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174112#comment-16174112
 ] 

liyunzhang_intel commented on HIVE-17565:
-

HaoXin: does this happen on Hive on MR or on Hive on Spark?


> NullPointerException occurs when hive.optimize.skewjoin and 
> hive.auto.convert.join are switched on at the same time
> ---
>
> Key: HIVE-17565
> URL: https://issues.apache.org/jira/browse/HIVE-17565
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.2.1
>Reporter: Xin Hao
>
> (A) NullPointerException occurs when hive.optimize.skewjoin and 
> hive.auto.convert.join are switched on at the same time.
> The query passes when hive.optimize.skewjoin=true and hive.auto.convert.join=false.
> (B) Hive Version:
> Found on Apache Hive 1.2.1
> (C) Workload:
> (1) TPCx-BB Q19
> (2) A small case as below, which was actually simplified from Q19:
> SELECT *
> FROM store_returns sr,
> (
>   SELECT d1.d_date_sk
>   FROM date_dim d1, date_dim d2
>   WHERE d1.d_week_seq = d2.d_week_seq
> ) sr_dateFilter
> WHERE sr.sr_returned_date_sk = d_date_sk;
> (D)Exception Error Message:
> Error: java.lang.RuntimeException: java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:179)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:194)
> at 
> org.apache.hadoop.hive.ql.exec.MapJoinOperator.cleanUpInputFileChangedOp(MapJoinOperator.java:223)
> at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1051)
> at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1055)
> at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1055)
> at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:490)
> at 
> org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)
> ... 8 more



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16602) Implement shared scans with Tez

2017-09-20 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172918#comment-16172918
 ] 

liyunzhang_intel commented on HIVE-16602:
-

[~jcamachorodriguez]: thanks for your reply.

{quote}
...it appears multiple times in the query.
{quote}
I mean the TS (TableScan) is used more than once in the query, so the shared scan 
optimization will take effect.  I tested this at a 10g data scale with DS queries 
like 
[query7|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query17.sql] 
and 
[query70|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql], 
but did not see a big improvement. I guess the reason may be that the data scale 
is small, so not much time is saved even though fewer TS are executed.

||query||Before HIVE-16602||HIVE-16602||
|query7|53.677s|51.934s|
|query70|46.951s|47.48s|


> Implement shared scans with Tez
> ---
>
> Key: HIVE-16602
> URL: https://issues.apache.org/jira/browse/HIVE-16602
> Project: Hive
>  Issue Type: New Feature
>  Components: Physical Optimizer
>Affects Versions: 3.0.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>  Labels: TODOC3.0
> Fix For: 3.0.0
>
> Attachments: HIVE-16602.01.patch, HIVE-16602.02.patch, 
> HIVE-16602.03.patch, HIVE-16602.04.patch, HIVE-16602.patch
>
>
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level.
> In the longer term, identification of equivalent expressions and 
> reutilization of intermediary results should be done at the logical layer via 
> Spool operator.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16602) Implement shared scans with Tez

2017-09-19 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171349#comment-16171349
 ] 

liyunzhang_intel commented on HIVE-16602:
-

[~jcamachorodriguez]: I am evaluating the performance improvement of 
HIVE-16602 on Tez.
I use TPC-DS to compare the execution time between a package without HIVE-16602 
and one with HIVE-16602 at a 10g data scale. I guess there is an improvement with 
this feature, as it loads a table only once even if it appears multiple times in 
the query. Have you done any benchmark tests for this feature?


> Implement shared scans with Tez
> ---
>
> Key: HIVE-16602
> URL: https://issues.apache.org/jira/browse/HIVE-16602
> Project: Hive
>  Issue Type: New Feature
>  Components: Physical Optimizer
>Affects Versions: 3.0.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>  Labels: TODOC3.0
> Fix For: 3.0.0
>
> Attachments: HIVE-16602.01.patch, HIVE-16602.02.patch, 
> HIVE-16602.03.patch, HIVE-16602.04.patch, HIVE-16602.patch
>
>
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level.
> In the longer term, identification of equivalent expressions and 
> reutilization of intermediary results should be done at the logical layer via 
> Spool operator.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS

2017-09-19 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171184#comment-16171184
 ] 

liyunzhang_intel edited comment on HIVE-17474 at 9/19/17 6:47 AM:
--

I found that we need to execute
"analyze table xxx compute statistics for columns" before executing the query.
Attaching the two different 
explains ([before_analyze|https://issues.apache.org/jira/secure/attachment/12887836/explain.70.before.analyze], [after_analyze|https://issues.apache.org/jira/secure/attachment/12887837/explain.70.after.analyze]).
Here is an example showing the influence of column statistics:
{code}(select s_state as s_state, sum(ss_net_profit),
 rank() over ( partition by s_state order by 
sum(ss_net_profit) desc) as ranking
  from   store_sales, store, date_dim
  where  d_month_seq between 1193 and 1193+11
and date_dim.d_date_sk = store_sales.ss_sold_date_sk
and store.s_store_sk  = store_sales.ss_store_sk
  group by s_state
 ) {code}
Before computing column statistics:
{code}
 Map 9 
Map Operator Tree:
TableScan
  alias: store_sales
  filterExpr: (ss_store_sk is not null and ss_sold_date_sk is 
not null) (type: boolean)
  Statistics: Num rows: 27504814 Data size: 825144420 Basic 
stats: COMPLETE Column stats: PARTIAL
  Filter Operator
predicate: ss_store_sk is not null (type: boolean)
Statistics: Num rows: 27504814 Data size: 220038512 Basic 
stats: COMPLETE Column stats: PARTIAL
Select Operator
  expressions: ss_store_sk (type: bigint), ss_net_profit 
(type: double), ss_sold_date_sk (type: bigint)
  outputColumnNames: _col0, _col1, _col2
  Statistics: Num rows: 27504814 Data size: 220038512 Basic 
stats: COMPLETE Column stats: PARTIAL
  Map Join Operator
condition map:
 Inner Join 0 to 1
keys:
  0 _col0 (type: bigint)
  1 _col0 (type: bigint)
outputColumnNames: _col1, _col2, _col4
input vertices:
  1 Map 12
Statistics: Num rows: 30255296 Data size: 242042368 
Basic stats: COMPLETE Column stats: NONE
Map Join Operator
  condition map:
   Inner Join 0 to 1
  keys:
0 _col2 (type: bigint)
1 _col0 (type: bigint)
  outputColumnNames: _col1, _col4
  input vertices:
1 Map 13
  Statistics: Num rows: 33280826 Data size: 266246610 
Basic stats: COMPLETE Column stats: NONE
  Select Operator
expressions: _col4 (type: string), _col1 (type: 
double)
outputColumnNames: _col4, _col1
Statistics: Num rows: 33280826 Data size: 266246610 
Basic stats: COMPLETE Column stats: NONE
Group By Operator
  aggregations: sum(_col1)
  keys: _col4 (type: string)
  mode: hash
  outputColumnNames: _col0, _col1
  Statistics: Num rows: 33280826 Data size: 
266246610 Basic stats: COMPLETE Column stats: NONE
  Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: 
string)
Statistics: Num rows: 33280826 Data size: 
266246610 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: double)

{code}
the estimated data size is 266246610.

After computing column statistics:
{code}
  Map 7 
Map Operator Tree:
TableScan
  alias: store_sales
  filterExpr: (ss_store_sk is not null and ss_sold_date_sk is 
not null) (type: boolean)
  Statistics: Num rows: 27504814 Data size: 649740104 Basic 
stats: COMPLETE Column stats: PARTIAL
  Filter Operator
predicate: ss_store_sk is not null (type: boolean)
Statistics: Num rows: 26856871 Data size: 634433888 Basic 
stats: COMPLETE Column stats: PARTIAL
Select Operator
  

[jira] [Comment Edited] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS

2017-09-19 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171184#comment-16171184
 ] 

liyunzhang_intel edited comment on HIVE-17474 at 9/19/17 6:44 AM:
--

I found that we need to execute
"analyze table xxx compute statistics for columns" before executing the query.
Attaching the two different 
explains ([before_analyze|https://issues.apache.org/jira/secure/attachment/12887836/explain.70.before.analyze], [after_analyze|https://issues.apache.org/jira/secure/attachment/12887837/explain.70.after.analyze]).
Here is an example showing the influence of column statistics:
{code}(select s_state as s_state, sum(ss_net_profit),
 rank() over ( partition by s_state order by 
sum(ss_net_profit) desc) as ranking
  from   store_sales, store, date_dim
  where  d_month_seq between 1193 and 1193+11
and date_dim.d_date_sk = store_sales.ss_sold_date_sk
and store.s_store_sk  = store_sales.ss_store_sk
  group by s_state
 ) {code}
Before computing column statistics:
{code}
 Map 9 
Map Operator Tree:
TableScan
  alias: store_sales
  filterExpr: (ss_store_sk is not null and ss_sold_date_sk is 
not null) (type: boolean)
  Statistics: Num rows: 27504814 Data size: 825144420 Basic 
stats: COMPLETE Column stats: PARTIAL
  Filter Operator
predicate: ss_store_sk is not null (type: boolean)
Statistics: Num rows: 27504814 Data size: 220038512 Basic 
stats: COMPLETE Column stats: PARTIAL
Select Operator
  expressions: ss_store_sk (type: bigint), ss_net_profit 
(type: double), ss_sold_date_sk (type: bigint)
  outputColumnNames: _col0, _col1, _col2
  Statistics: Num rows: 27504814 Data size: 220038512 Basic 
stats: COMPLETE Column stats: PARTIAL
  Map Join Operator
condition map:
 Inner Join 0 to 1
keys:
  0 _col0 (type: bigint)
  1 _col0 (type: bigint)
outputColumnNames: _col1, _col2, _col4
input vertices:
  1 Map 12
Statistics: Num rows: 30255296 Data size: 242042368 
Basic stats: COMPLETE Column stats: NONE
Map Join Operator
  condition map:
   Inner Join 0 to 1
  keys:
0 _col2 (type: bigint)
1 _col0 (type: bigint)
  outputColumnNames: _col1, _col4
  input vertices:
1 Map 13
  Statistics: Num rows: 33280826 Data size: 266246610 
Basic stats: COMPLETE Column stats: NONE
  Select Operator
expressions: _col4 (type: string), _col1 (type: 
double)
outputColumnNames: _col4, _col1
Statistics: Num rows: 33280826 Data size: 266246610 
Basic stats: COMPLETE Column stats: NONE
Group By Operator
  aggregations: sum(_col1)
  keys: _col4 (type: string)
  mode: hash
  outputColumnNames: _col0, _col1
  Statistics: Num rows: 33280826 Data size: 
266246610 Basic stats: COMPLETE Column stats: NONE
  Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: 
string)
Statistics: Num rows: 33280826 Data size: 
266246610 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: double)

{code}
the estimated data size is 266246610.

After computing column statistics:
{code}
  Map 7 
Map Operator Tree:
TableScan
  alias: store_sales
  filterExpr: (ss_store_sk is not null and ss_sold_date_sk is 
not null) (type: boolean)
  Statistics: Num rows: 27504814 Data size: 649740104 Basic 
stats: COMPLETE Column stats: PARTIAL
  Filter Operator
predicate: ss_store_sk is not null (type: boolean)
Statistics: Num rows: 26856871 Data size: 634433888 Basic 
stats: COMPLETE Column stats: PARTIAL
Select Operator
  

[jira] [Updated] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS

2017-09-19 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17474:

Attachment: explain.70.after.analyze
explain.70.before.analyze

> Poor Performance about subquery like DS/query70 on HoS
> --
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
> Attachments: explain.70.after.analyze, explain.70.before.analyze, 
> explain.70.vec
>
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  {code}
> select  
> sum(ss_net_profit) as total_sum
>,s_state
>,s_county
>,grouping__id as lochierarchy
>, rank() over(partition by grouping__id, case when grouping__id == 2 then 
> s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
> store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
> join store s on s.s_store_sk  = ss.ss_store_sk
>  where
> d1.d_month_seq between 1193 and 1193+11
>  and s.s_state in
>  ( select s_state
>from  (select s_state as s_state, sum(ss_net_profit),
>  rank() over ( partition by s_state order by 
> sum(ss_net_profit) desc) as ranking
>   from   store_sales, store, date_dim
>   where  d_month_seq between 1193 and 1193+11
> and date_dim.d_date_sk = 
> store_sales.ss_sold_date_sk
> and store.s_store_sk  = store_sales.ss_store_sk
>   group by s_state
>  ) tmp1 
>where ranking <= 5
>  )
>  group by s_state,s_county with rollup
> order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then s_state end
>   ,rank_within_parent
>  limit 100;
> {code}
>  let's analyze the query:
> part1: the sub-query computes, for each state, the rank by sum(ss_net_profit) 
> and keeps the states whose ranking is <= 5.
> part2: the big table store_sales joins the small tables date_dim and store.
> part3: part1 joins part2.
> the problem is in part3, which is a common join. The key cardinality of part1 
> and part2 is low, as there are not many different values for state (actually 
> there are 30 different values in the table store). With a common join, the big 
> data goes to only 30 reducers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS

2017-09-19 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171184#comment-16171184
 ] 

liyunzhang_intel commented on HIVE-17474:
-

I found that we need to execute
"analyze table xxx compute statistics for columns" before executing the query.
Attaching the different explains from before and after analyzing statistics.
Here is an example showing the influence of column statistics:
{code}(select s_state as s_state, sum(ss_net_profit),
 rank() over ( partition by s_state order by 
sum(ss_net_profit) desc) as ranking
  from   store_sales, store, date_dim
  where  d_month_seq between 1193 and 1193+11
and date_dim.d_date_sk = store_sales.ss_sold_date_sk
and store.s_store_sk  = store_sales.ss_store_sk
  group by s_state
 ) {code}
Before computing column statistics:
{code}
 Map 9 
Map Operator Tree:
TableScan
  alias: store_sales
  filterExpr: (ss_store_sk is not null and ss_sold_date_sk is 
not null) (type: boolean)
  Statistics: Num rows: 27504814 Data size: 825144420 Basic 
stats: COMPLETE Column stats: PARTIAL
  Filter Operator
predicate: ss_store_sk is not null (type: boolean)
Statistics: Num rows: 27504814 Data size: 220038512 Basic 
stats: COMPLETE Column stats: PARTIAL
Select Operator
  expressions: ss_store_sk (type: bigint), ss_net_profit 
(type: double), ss_sold_date_sk (type: bigint)
  outputColumnNames: _col0, _col1, _col2
  Statistics: Num rows: 27504814 Data size: 220038512 Basic 
stats: COMPLETE Column stats: PARTIAL
  Map Join Operator
condition map:
 Inner Join 0 to 1
keys:
  0 _col0 (type: bigint)
  1 _col0 (type: bigint)
outputColumnNames: _col1, _col2, _col4
input vertices:
  1 Map 12
Statistics: Num rows: 30255296 Data size: 242042368 
Basic stats: COMPLETE Column stats: NONE
Map Join Operator
  condition map:
   Inner Join 0 to 1
  keys:
0 _col2 (type: bigint)
1 _col0 (type: bigint)
  outputColumnNames: _col1, _col4
  input vertices:
1 Map 13
  Statistics: Num rows: 33280826 Data size: 266246610 
Basic stats: COMPLETE Column stats: NONE
  Select Operator
expressions: _col4 (type: string), _col1 (type: 
double)
outputColumnNames: _col4, _col1
Statistics: Num rows: 33280826 Data size: 266246610 
Basic stats: COMPLETE Column stats: NONE
Group By Operator
  aggregations: sum(_col1)
  keys: _col4 (type: string)
  mode: hash
  outputColumnNames: _col0, _col1
  Statistics: Num rows: 33280826 Data size: 
266246610 Basic stats: COMPLETE Column stats: NONE
  Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: 
string)
Statistics: Num rows: 33280826 Data size: 
266246610 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: double)

{code}
the estimated data size is 266246610.

After computing column statistics:
{code}
  Map 7 
Map Operator Tree:
TableScan
  alias: store_sales
  filterExpr: (ss_store_sk is not null and ss_sold_date_sk is 
not null) (type: boolean)
  Statistics: Num rows: 27504814 Data size: 649740104 Basic 
stats: COMPLETE Column stats: PARTIAL
  Filter Operator
predicate: ss_store_sk is not null (type: boolean)
Statistics: Num rows: 26856871 Data size: 634433888 Basic 
stats: COMPLETE Column stats: PARTIAL
Select Operator
  expressions: ss_store_sk (type: bigint), ss_net_profit 
(type: double), ss_sold_date_sk (type: bigint)
  outputColumnNames: _col0, _col1, _col2
  Statistics: Num rows: 26856871 Data 

[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-09-15 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167441#comment-16167441
 ] 

liyunzhang_intel commented on HIVE-17486:
-

The reason why CombineEquivalentWorkResolver does not consider Map1 the same as 
Map5, and Map4 the same as Map7, is the following:
when comparing Map4 and Map7,
Map4
{code}
TS[2]-SEL[3]-RS[13]
{code}
Map7 
{code}
TS[6]-SEL[7]-RS[9]
{code}

it returns "not equal" when comparing RS\[13\] and RS\[9\] at 
[ExprNodeColumnDesc#isSame|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeColumnDesc.java#L181].
 {code}
if ( tabAlias != null && dest.tabAlias != null ) {
  if ( !tabAlias.equals(dest.tabAlias) ) {
return false;
  }
}
{code}

Here {{tabAlias}} is {{$hdt$_1}} while {{dest.tabAlias}} is {{$hdt$_3}}, even 
though {{$hdt$_1}} and {{$hdt$_3}} actually both point to the table {{test2}}.
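
A hedged sketch of the query shape that produces this (only the table name 
{{test2}} comes from this issue; the columns are illustrative): scanning the same 
table in two different subqueries gives the two scans different internal aliases, 
so their reduce sinks compare as not equal.
{code}
-- illustrative: test2 is scanned under two internal aliases
select a.key as k1, b.key as k2
from (select key from test2) a
join (select key from test2) b
on a.key = b.key;
{code}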

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS

2017-09-14 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167233#comment-16167233
 ] 

liyunzhang_intel commented on HIVE-17474:
-

I enlarged the map join threshold size to trick Hive into treating part1 as a small 
table (at runtime, the size of part1 is indeed very small). After that the 
execution plan changed, and the execution time on 3TB was reduced from 12 min to 
78 seconds. For such cases, where the join keys have low cardinality, a map join 
may be the best solution.
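
A hedged sketch of "enlarging the map join threshold"; the exact property and 
value used are not stated here, so the settings below are only illustrative:
{code}
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- example value only; pick something larger than the estimated size of part1
set hive.auto.convert.join.noconditionaltask.size=4000000000;
{code}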

> Poor Performance about subquery like DS/query70 on HoS
> --
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
> Attachments: explain.70.vec
>
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  {code}
> select  
> sum(ss_net_profit) as total_sum
>,s_state
>,s_county
>,grouping__id as lochierarchy
>, rank() over(partition by grouping__id, case when grouping__id == 2 then 
> s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
> store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
> join store s on s.s_store_sk  = ss.ss_store_sk
>  where
> d1.d_month_seq between 1193 and 1193+11
>  and s.s_state in
>  ( select s_state
>from  (select s_state as s_state, sum(ss_net_profit),
>  rank() over ( partition by s_state order by 
> sum(ss_net_profit) desc) as ranking
>   from   store_sales, store, date_dim
>   where  d_month_seq between 1193 and 1193+11
> and date_dim.d_date_sk = 
> store_sales.ss_sold_date_sk
> and store.s_store_sk  = store_sales.ss_store_sk
>   group by s_state
>  ) tmp1 
>where ranking <= 5
>  )
>  group by s_state,s_county with rollup
> order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then s_state end
>   ,rank_within_parent
>  limit 100;
> {code}
>  let's analyze the query:
> part1: the sub-query computes, for each state, the rank by sum(ss_net_profit) 
> and keeps the states whose ranking is <= 5.
> part2: the big table store_sales joins the small tables date_dim and store.
> part3: part1 joins part2.
> the problem is in part3, which is a common join. The key cardinality of part1 
> and part2 is low, as there are not many different values for state (actually 
> there are 30 different values in the table store). With a common join, the big 
> data goes to only 30 reducers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS

2017-09-12 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164180#comment-16164180
 ] 

liyunzhang_intel commented on HIVE-17474:
-

[~lirui]: thanks for the reply. I am debugging whether there is a problem with the 
statistics.
By the way, can we solve the problem by converting the common join to a skew join?
As all keys in part2 are very big and there are very few distinct keys (less than 
30), can we treat this as a skew case? I have tried setting 
hive.optimize.skewjoin to true and hive.skewjoin.key to 10, but it seems to have 
no effect. I am very curious why the skew join does not take effect. From the 
doc, the rewrite works like this:
{code}
A join B on A.id=B.id 
And A skews for id=1. Then we perform the following two joins: 
1.  A join B on A.id=B.id and A.id!=1 
2.  A join B on A.id=B.id and A.id=1 
If B doesn’t skew on id=1, then #2 will be a map join.
{code}
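
In SQL terms, the rewrite amounts to something like the following (tables {{A}} 
and {{B}} are hypothetical, with {{A}} skewed on id=1):
{code}
-- illustrative rendering of the two-join rewrite from the doc
select * from (
  select A.id from A join B on A.id = B.id and A.id != 1  -- common join path
  union all
  select A.id from A join B on A.id = B.id and A.id = 1   -- map join path if B is not skewed on id=1
) t;
{code}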
I think that after enabling skew join, all keys in part2 will be treated as skewed 
keys, so part2 will be map-joined with part1.

> Poor Performance about subquery like DS/query70 on HoS
> --
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
> Attachments: explain.70.vec
>
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  {code}
> select  
> sum(ss_net_profit) as total_sum
>,s_state
>,s_county
>,grouping__id as lochierarchy
>, rank() over(partition by grouping__id, case when grouping__id == 2 then 
> s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
> store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
> join store s on s.s_store_sk  = ss.ss_store_sk
>  where
> d1.d_month_seq between 1193 and 1193+11
>  and s.s_state in
>  ( select s_state
>from  (select s_state as s_state, sum(ss_net_profit),
>  rank() over ( partition by s_state order by 
> sum(ss_net_profit) desc) as ranking
>   from   store_sales, store, date_dim
>   where  d_month_seq between 1193 and 1193+11
> and date_dim.d_date_sk = 
> store_sales.ss_sold_date_sk
> and store.s_store_sk  = store_sales.ss_store_sk
>   group by s_state
>  ) tmp1 
>where ranking <= 5
>  )
>  group by s_state,s_county with rollup
> order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then s_state end
>   ,rank_within_parent
>  limit 100;
> {code}
>  let's analyze the query:
> part1: the sub-query computes, for each state, the rank by sum(ss_net_profit) 
> and keeps the states whose ranking is <= 5.
> part2: the big table store_sales joins the small tables date_dim and store.
> part3: part1 joins part2.
> the problem is in part3, which is a common join. The key cardinality of part1 
> and part2 is low, as there are not many different values for state (actually 
> there are 30 different values in the table store). With a common join, the big 
> data goes to only 30 reducers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS

2017-09-12 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16163762#comment-16163762
 ] 

liyunzhang_intel commented on HIVE-17474:
-

[~lirui], [~xuefuz]: after debugging in Tez, I found that the part2-join-part1 
step is a common merge join (CommonMergeJoinOperator).
{code}
  Reducer 2 
Reduce Operator Tree:
  Merge Join Operator
condition map:
 Inner Join 0 to 1
keys:
  0 _col7 (type: string)
  1 _col0 (type: string)

{code}


Below is the class comment describing the implementation of CommonMergeJoin. Does 
Hive on Spark enable CommonMergeJoin?
{code}
/*
 * With an aim to consolidate the join algorithms to either hash based joins 
(MapJoinOperator) or
 * sort-merge based joins, this operator is being introduced. This operator 
executes a sort-merge
 * based algorithm. It replaces both the JoinOperator and the 
SMBMapJoinOperator for the tez side of
 * things. It works in either the map phase or reduce phase.
 *
 * The basic algorithm is as follows:
 *
 * 1. The processOp receives a row from a "big" table.
 * 2. In order to process it, the operator does a fetch for rows from the other 
tables.
 * 3. Once we have a set of rows from the other tables (till we hit a new key), 
more rows are
 *brought in from the big table and a join is performed.
 */
{code}

> Poor Performance about subquery like DS/query70 on HoS
> --
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
> Attachments: explain.70.vec
>
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  {code}
> select  
> sum(ss_net_profit) as total_sum
>,s_state
>,s_county
>,grouping__id as lochierarchy
>, rank() over(partition by grouping__id, case when grouping__id == 2 then 
> s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
> store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
> join store s on s.s_store_sk  = ss.ss_store_sk
>  where
> d1.d_month_seq between 1193 and 1193+11
>  and s.s_state in
>  ( select s_state
>from  (select s_state as s_state, sum(ss_net_profit),
>  rank() over ( partition by s_state order by 
> sum(ss_net_profit) desc) as ranking
>   from   store_sales, store, date_dim
>   where  d_month_seq between 1193 and 1193+11
> and date_dim.d_date_sk = 
> store_sales.ss_sold_date_sk
> and store.s_store_sk  = store_sales.ss_store_sk
>   group by s_state
>  ) tmp1 
>where ranking <= 5
>  )
>  group by s_state,s_county with rollup
> order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then s_state end
>   ,rank_within_parent
>  limit 100;
> {code}
>  let's analyze the query:
> part1: the sub-query computes, for each state, the rank by sum(ss_net_profit) 
> and keeps the states whose ranking is <= 5.
> part2: the big table store_sales joins the small tables date_dim and store.
> part3: part1 joins part2.
> the problem is in part3, which is a common join. The key cardinality of part1 
> and part2 is low, as there are not many different values for state (actually 
> there are 30 different values in the table store). With a common join, the big 
> data goes to only 30 reducers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS

2017-09-12 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162687#comment-16162687
 ] 

liyunzhang_intel commented on HIVE-17474:
-

After debugging the code, I found that part2 join part1 is a map join in Tez; this 
is the difference from Hive on Spark. Will update with the detailed reason later.

> Poor Performance about subquery like DS/query70 on HoS
> --
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
> Attachments: explain.70.vec
>
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  {code}
> select  
> sum(ss_net_profit) as total_sum
>,s_state
>,s_county
>,grouping__id as lochierarchy
>, rank() over(partition by grouping__id, case when grouping__id == 2 then 
> s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
> store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
> join store s on s.s_store_sk  = ss.ss_store_sk
>  where
> d1.d_month_seq between 1193 and 1193+11
>  and s.s_state in
>  ( select s_state
>from  (select s_state as s_state, sum(ss_net_profit),
>  rank() over ( partition by s_state order by 
> sum(ss_net_profit) desc) as ranking
>   from   store_sales, store, date_dim
>   where  d_month_seq between 1193 and 1193+11
> and date_dim.d_date_sk = 
> store_sales.ss_sold_date_sk
> and store.s_store_sk  = store_sales.ss_store_sk
>   group by s_state
>  ) tmp1 
>where ranking <= 5
>  )
>  group by s_state,s_county with rollup
> order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then s_state end
>   ,rank_within_parent
>  limit 100;
> {code}
>  let's analyze the query:
> part1: the sub-query computes, for each state, the rank by sum(ss_net_profit) 
> and keeps the states whose ranking is <= 5.
> part2: the big table store_sales joins the small tables date_dim and store.
> part3: part1 joins part2.
> the problem is in part3, which is a common join. The key cardinality of part1 
> and part2 is low, as there are not many different values for state (actually 
> there are 30 different values in the table store). With a common join, the big 
> data goes to only 30 reducers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS

2017-09-12 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17474:

Summary: Poor Performance about subquery like DS/query70 on HoS  (was: Poor 
Performance about subquery like DS/query70)

> Poor Performance about subquery like DS/query70 on HoS
> --
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
> Attachments: explain.70.vec
>
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  {code}
> select  
> sum(ss_net_profit) as total_sum
>,s_state
>,s_county
>,grouping__id as lochierarchy
>, rank() over(partition by grouping__id, case when grouping__id == 2 then 
> s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
> store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
> join store s on s.s_store_sk  = ss.ss_store_sk
>  where
> d1.d_month_seq between 1193 and 1193+11
>  and s.s_state in
>  ( select s_state
>from  (select s_state as s_state, sum(ss_net_profit),
>  rank() over ( partition by s_state order by 
> sum(ss_net_profit) desc) as ranking
>   from   store_sales, store, date_dim
>   where  d_month_seq between 1193 and 1193+11
> and date_dim.d_date_sk = 
> store_sales.ss_sold_date_sk
> and store.s_store_sk  = store_sales.ss_store_sk
>   group by s_state
>  ) tmp1 
>where ranking <= 5
>  )
>  group by s_state,s_county with rollup
> order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then s_state end
>   ,rank_within_parent
>  limit 100;
> {code}
>  let's analyze the query:
> part1: the sub-query computes, for each state, the rank by sum(ss_net_profit) 
> and keeps the states whose ranking is <= 5.
> part2: the big table store_sales joins the small tables date_dim and store.
> part3: part1 joins part2.
> the problem is in part3, which is a common join. The key cardinality of part1 
> and part2 is low, as there are not many different values for state (actually 
> there are 30 different values in the table store). With a common join, the big 
> data goes to only 30 reducers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17474) Poor Performance about subquery like DS/query70

2017-09-12 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162659#comment-16162659
 ] 

liyunzhang_intel commented on HIVE-17474:
-

[~xuefuz], [~lirui]: can you help take a look at the above issue? Thanks!

> Poor Performance about subquery like DS/query70
> ---
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
> Attachments: explain.70.vec
>
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  {code}
> select  
> sum(ss_net_profit) as total_sum
>,s_state
>,s_county
>,grouping__id as lochierarchy
>, rank() over(partition by grouping__id, case when grouping__id == 2 then 
> s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
> store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
> join store s on s.s_store_sk  = ss.ss_store_sk
>  where
> d1.d_month_seq between 1193 and 1193+11
>  and s.s_state in
>  ( select s_state
>from  (select s_state as s_state, sum(ss_net_profit),
>  rank() over ( partition by s_state order by 
> sum(ss_net_profit) desc) as ranking
>   from   store_sales, store, date_dim
>   where  d_month_seq between 1193 and 1193+11
> and date_dim.d_date_sk = 
> store_sales.ss_sold_date_sk
> and store.s_store_sk  = store_sales.ss_store_sk
>   group by s_state
>  ) tmp1 
>where ranking <= 5
>  )
>  group by s_state,s_county with rollup
> order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then s_state end
>   ,rank_within_parent
>  limit 100;
> {code}
>  let's analyze the query:
> part1: the sub-query computes, for each state, the rank by sum(ss_net_profit) 
> and keeps the states whose ranking is <= 5.
> part2: the big table store_sales joins the small tables date_dim and store.
> part3: part1 joins part2.
> the problem is in part3, which is a common join. The key cardinality of part1 
> and part2 is low, as there are not many different values for state (actually 
> there are 30 different values in the table store). With a common join, the big 
> data goes to only 30 reducers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17474) Poor Performance about subquery like DS/query70

2017-09-12 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162657#comment-16162657
 ] 

liyunzhang_intel commented on HIVE-17474:
-

The execution plan of Hive on Spark for DS/query70 is 
[attached|https://issues.apache.org/jira/secure/attachment/12886590/explain.70.vec].
Investigating the problem, I found several points:
1. The statistics for the sub-query are not correct: it estimates nearly 36g for 
the result while the actual result is very small (nearly 30 rows of state info). 
Because of this, the join between part1 and part2 (see the jira description) is a 
common join, not a map join. Maybe the statistics estimation needs to be more 
intelligent for such a complex sub-query.
{code}
  Reducer 12 
Reduce Operator Tree:
  Select Operator
expressions: KEY.reducesinkkey0 (type: string), 
KEY.reducesinkkey1 (type: double)
outputColumnNames: _col0, _col1
Statistics: Num rows: 4991930471 Data size: 109822470377 Basic 
stats: COMPLETE Column stats: NONE
PTF Operator
  Function definitions:
  Input definition
input alias: ptf_0
output shape: _col0: string, _col1: double
type: WINDOWING
  Windowing table definition
input alias: ptf_1
name: windowingtablefunction
order by: _col1 DESC NULLS LAST
partition by: _col0
raw input shape:
window functions:
window function definition
  alias: rank_window_0
  arguments: _col1
  name: rank
  window function: GenericUDAFRankEvaluator
  window frame: PRECEDING(MAX)~FOLLOWING(MAX)
  isPivotResult: true
  Statistics: Num rows: 4991930471 Data size: 109822470377 
Basic stats: COMPLETE Column stats: NONE
  Filter Operator
predicate: (rank_window_0 <= 5) (type: boolean)
Statistics: Num rows: 1663976823 Data size: 36607490111 
Basic stats: COMPLETE Column stats: NONE
Select Operator
  expressions: _col0 (type: string)
  outputColumnNames: _col0
  Statistics: Num rows: 1663976823 Data size: 36607490111 
Basic stats: COMPLETE Column stats: NONE
  Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1663976823 Data size: 36607490111 
Basic stats: COMPLETE Column stats: NONE
{code}


> Poor Performance about subquery like DS/query70
> ---
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
> Attachments: explain.70.vec
>
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  {code}
> select  
> sum(ss_net_profit) as total_sum
>,s_state
>,s_county
>,grouping__id as lochierarchy
>, rank() over(partition by grouping__id, case when grouping__id == 2 then 
> s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
> store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
> join store s on s.s_store_sk  = ss.ss_store_sk
>  where
> d1.d_month_seq between 1193 and 1193+11
>  and s.s_state in
>  ( select s_state
>from  (select s_state as s_state, sum(ss_net_profit),
>  rank() over ( partition by s_state order by 
> sum(ss_net_profit) desc) as ranking
>   from   store_sales, store, date_dim
>   where  d_month_seq between 1193 and 1193+11
> and date_dim.d_date_sk = 
> store_sales.ss_sold_date_sk
> and store.s_store_sk  = store_sales.ss_store_sk
>   group by s_state
>  ) tmp1 
>where ranking <= 5
>  )
>  group by s_state,s_county with rollup
> order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then s_state end
>   ,rank_within_parent
>  limit 100;
> {code}
>  let's analyze the query:
> part1: the sub-query computes, for each state, the rank by sum(ss_net_profit) 
> and keeps the states whose ranking is <= 5.
> part2: the big table store_sales joins the small tables 

[jira] [Updated] (HIVE-17474) Poor Performance about subquery like DS/query70

2017-09-12 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17474:

Attachment: explain.70.vec

> Poor Performance about subquery like DS/query70
> ---
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
> Attachments: explain.70.vec
>
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  {code}
> select  
> sum(ss_net_profit) as total_sum
>,s_state
>,s_county
>,grouping__id as lochierarchy
>, rank() over(partition by grouping__id, case when grouping__id == 2 then 
> s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
> store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
> join store s on s.s_store_sk  = ss.ss_store_sk
>  where
> d1.d_month_seq between 1193 and 1193+11
>  and s.s_state in
>  ( select s_state
>from  (select s_state as s_state, sum(ss_net_profit),
>  rank() over ( partition by s_state order by 
> sum(ss_net_profit) desc) as ranking
>   from   store_sales, store, date_dim
>   where  d_month_seq between 1193 and 1193+11
> and date_dim.d_date_sk = 
> store_sales.ss_sold_date_sk
> and store.s_store_sk  = store_sales.ss_store_sk
>   group by s_state
>  ) tmp1 
>where ranking <= 5
>  )
>  group by s_state,s_county with rollup
> order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then s_state end
>   ,rank_within_parent
>  limit 100;
> {code}
>  let's analyze the query:
> part1: the sub-query computes, for each state, the rank by sum(ss_net_profit) 
> and keeps the states whose ranking is <= 5.
> part2: the big table store_sales joins the small tables date_dim and store.
> part3: part1 joins part2.
> the problem is in part3, which is a common join. The key cardinality of part1 
> and part2 is low, as there are not many different values for state (actually 
> there are 30 different values in the table store). With a common join, the big 
> data goes to only 30 reducers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Issue Comment Deleted] (HIVE-17474) Poor Performance about subquery like DS/query70

2017-09-12 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17474:

Comment: was deleted

(was: After HIVE-15192, the store table join is converted to a map join.
The logical plan will always be:
{code}
TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60]
TS[1]-FIL[64]-RS[5]-JOIN[6]
TS[2]-FIL[65]-RS[10]-JOIN[11]
TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44]
TS[13]-FIL[69]-RS[18]-JOIN[19]
TS[14]-FIL[70]-RS[22]-JOIN[23]

{code}

It is reasonable that the small table store is converted to a map join, so I am 
closing the jira.)

> Poor Performance about subquery like DS/query70
> ---
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  {code}
> select  
> sum(ss_net_profit) as total_sum
>,s_state
>,s_county
>,grouping__id as lochierarchy
>, rank() over(partition by grouping__id, case when grouping__id == 2 then 
> s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
> store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
> join store s on s.s_store_sk  = ss.ss_store_sk
>  where
> d1.d_month_seq between 1193 and 1193+11
>  and s.s_state in
>  ( select s_state
>from  (select s_state as s_state, sum(ss_net_profit),
>  rank() over ( partition by s_state order by 
> sum(ss_net_profit) desc) as ranking
>   from   store_sales, store, date_dim
>   where  d_month_seq between 1193 and 1193+11
> and date_dim.d_date_sk = 
> store_sales.ss_sold_date_sk
> and store.s_store_sk  = store_sales.ss_store_sk
>   group by s_state
>  ) tmp1 
>where ranking <= 5
>  )
>  group by s_state,s_county with rollup
> order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then s_state end
>   ,rank_within_parent
>  limit 100;
> {code}
>  let's analyze the query:
> part1: the sub-query computes, per state, the rank of sum(ss_net_profit) and 
> keeps the states whose ranking is at most 5.
> part2: the big table store_sales joins the small tables date_dim and store to 
> get the result.
> part3: the result of part1 joins the result of part2.
> the problem is in part3, which is a common join. The cardinality of part1 and 
> part2 is low because there are few distinct state values (actually only 30 
> distinct values in the table store). With a common join, a large amount of 
> data goes to only 30 reducers.





[jira] [Updated] (HIVE-17474) Poor Performance about subquery like DS/query70

2017-09-12 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17474:

Summary: Poor Performance about subquery like DS/query70  (was: Different 
logical plan of same query(TPC-DS/70) with same settings)

> Poor Performance about subquery like DS/query70
> ---
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  {code}
> select  
> sum(ss_net_profit) as total_sum
>,s_state
>,s_county
>,grouping__id as lochierarchy
>, rank() over(partition by grouping__id, case when grouping__id == 2 then 
> s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
> store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
> join store s on s.s_store_sk  = ss.ss_store_sk
>  where
> d1.d_month_seq between 1193 and 1193+11
>  and s.s_state in
>  ( select s_state
>from  (select s_state as s_state, sum(ss_net_profit),
>  rank() over ( partition by s_state order by 
> sum(ss_net_profit) desc) as ranking
>   from   store_sales, store, date_dim
>   where  d_month_seq between 1193 and 1193+11
> and date_dim.d_date_sk = 
> store_sales.ss_sold_date_sk
> and store.s_store_sk  = store_sales.ss_store_sk
>   group by s_state
>  ) tmp1 
>where ranking <= 5
>  )
>  group by s_state,s_county with rollup
> order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then s_state end
>   ,rank_within_parent
>  limit 100;
> {code}
>  let's analyze the query:
> part1: the sub-query computes, per state, the rank of sum(ss_net_profit) and 
> keeps the states whose ranking is at most 5.
> part2: the big table store_sales joins the small tables date_dim and store to 
> get the result.
> part3: the result of part1 joins the result of part2.
> the problem is in part3, which is a common join. The cardinality of part1 and 
> part2 is low because there are few distinct state values (actually only 30 
> distinct values in the table store). With a common join, a large amount of 
> data goes to only 30 reducers.





[jira] [Updated] (HIVE-17474) Different logical plan of same query(TPC-DS/70) with same settings

2017-09-12 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17474:

Description: 
in 
[DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
 {code}
select  
sum(ss_net_profit) as total_sum
   ,s_state
   ,s_county
   ,grouping__id as lochierarchy
   , rank() over(partition by grouping__id, case when grouping__id == 2 then 
s_state end order by sum(ss_net_profit)) as rank_within_parent
from
store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
join store s on s.s_store_sk  = ss.ss_store_sk
 where
d1.d_month_seq between 1193 and 1193+11
 and s.s_state in
 ( select s_state
   from  (select s_state as s_state, sum(ss_net_profit),
 rank() over ( partition by s_state order by 
sum(ss_net_profit) desc) as ranking
  from   store_sales, store, date_dim
  where  d_month_seq between 1193 and 1193+11
and date_dim.d_date_sk = store_sales.ss_sold_date_sk
and store.s_store_sk  = store_sales.ss_store_sk
  group by s_state
 ) tmp1 
   where ranking <= 5
 )
 group by s_state,s_county with rollup
order by
   lochierarchy desc
  ,case when lochierarchy = 0 then s_state end
  ,rank_within_parent
 limit 100;
{code}
 let's analyze the query:
part1: the sub-query computes, per state, the rank of sum(ss_net_profit) and 
keeps the states whose ranking is at most 5.
part2: the big table store_sales joins the small tables date_dim and store to 
get the result.
part3: the result of part1 joins the result of part2.
the problem is in part3, which is a common join. The cardinality of part1 and 
part2 is low because there are few distinct state values (actually only 30 
distinct values in the table store). With a common join, a large amount of 
data goes to only 30 reducers.
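As a side note on the lochierarchy column above, a minimal sketch (the value 
mapping is inferred from the query's own "case when grouping__id == 2" branch, 
not taken from Hive documentation) of what the rollup produces:
{code}
-- inferred mapping for: group by s_state, s_county with rollup
--   grouping__id == 0 -> (s_state, s_county) detail row
--   grouping__id == 2 -> s_county rolled up (state-level subtotal)
--   grouping__id == 3 -> both rolled up (grand total)
select s_state, s_county, grouping__id as lochierarchy
from store_sales ss join store s on s.s_store_sk = ss.ss_store_sk
group by s_state, s_county with rollup;
{code}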

  was:
in 
[DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
 The explain of Hive on Spark is
{code}


{code}


> Different logical plan of same query(TPC-DS/70) with same settings
> --
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  {code}
> select  
> sum(ss_net_profit) as total_sum
>,s_state
>,s_county
>,grouping__id as lochierarchy
>, rank() over(partition by grouping__id, case when grouping__id == 2 then 
> s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
> store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
> join store s on s.s_store_sk  = ss.ss_store_sk
>  where
> d1.d_month_seq between 1193 and 1193+11
>  and s.s_state in
>  ( select s_state
>from  (select s_state as s_state, sum(ss_net_profit),
>  rank() over ( partition by s_state order by 
> sum(ss_net_profit) desc) as ranking
>   from   store_sales, store, date_dim
>   where  d_month_seq between 1193 and 1193+11
> and date_dim.d_date_sk = 
> store_sales.ss_sold_date_sk
> and store.s_store_sk  = store_sales.ss_store_sk
>   group by s_state
>  ) tmp1 
>where ranking <= 5
>  )
>  group by s_state,s_county with rollup
> order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then s_state end
>   ,rank_within_parent
>  limit 100;
> {code}
>  let's analyze the query:
> part1: the sub-query computes, per state, the rank of sum(ss_net_profit) and 
> keeps the states whose ranking is at most 5.
> part2: the big table store_sales joins the small tables date_dim and store to 
> get the result.
> part3: the result of part1 joins the result of part2.
> the problem is in part3, which is a common join. The cardinality of part1 and 
> part2 is low because there are few distinct state values (actually only 30 
> distinct values in the table store). With a common join, a large amount of 
> data goes to only 30 reducers.





[jira] [Updated] (HIVE-17474) Different logical plan of same query(TPC-DS/70) with same settings

2017-09-12 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17474:

Description: 
in 
[DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
 The explain of Hive on Spark is
{code}


{code}

  was:
in 
[DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
 On Hive version (d3b88f6), I found that the logical plan differs at runtime 
with the same settings.

sometimes the logical plan
{code}
TS[0]-FIL[63]-SEL[2]-RS[43]-JOIN[45]-RS[46]-JOIN[48]-SEL[49]-GBY[50]-RS[51]-GBY[52]-SEL[53]-RS[54]-SEL[55]-PTF[56]-SEL[57]-RS[59]-SEL[60]-LIM[61]-FS[62]
TS[3]-FIL[64]-SEL[5]-RS[44]-JOIN[45]
TS[6]-FIL[65]-SEL[8]-RS[39]-JOIN[41]-RS[47]-JOIN[48]
TS[9]-FIL[67]-SEL[11]-RS[18]-JOIN[20]-RS[21]-JOIN[23]-SEL[24]-GBY[25]-RS[26]-GBY[27]-RS[29]-SEL[30]-PTF[31]-FIL[66]-SEL[32]-GBY[38]-RS[40]-JOIN[41]
TS[12]-FIL[68]-SEL[14]-RS[19]-JOIN[20]
TS[15]-FIL[69]-SEL[17]-RS[22]-JOIN[23]
{code}
 TS\[6\] connects with TS\[9\] on JOIN\[41\] and connects with TS\[0\] on 
JOIN\[48\].

sometimes 
{code}
TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60]
TS[1]-FIL[64]-RS[5]-JOIN[6]
TS[2]-FIL[65]-RS[10]-JOIN[11]
TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44]
TS[13]-FIL[69]-RS[18]-JOIN[19]
TS[14]-FIL[70]-RS[22]-JOIN[23]
{code}
TS\[2\] connects with TS\[0\] on JOIN\[11\]

Although TS\[2\] and TS\[6\] have different operator ids, they both refer to the 
table store in the query.

The difference causes a different Spark execution plan and a different execution 
time.  I am very confused why there are different logical plans with the same 
settings. Does anyone know where to investigate the root cause?
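As a way to observe the instability, a minimal sketch (EXPLAIN LOGICAL is 
standard Hive syntax; the simplified cut of query70 and the run-and-diff idea 
are my assumptions, not steps from this report):
{code}
-- run the same statement several times with identical settings and diff the
-- printed operator trees; an unstable plan shows up as different TS/JOIN wiring
explain logical
select sum(ss_net_profit), s_state, s_county
from store_sales ss
join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
join store s on s.s_store_sk = ss.ss_store_sk
where d1.d_month_seq between 1193 and 1193+11
group by s_state, s_county with rollup;
{code}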


> Different logical plan of same query(TPC-DS/70) with same settings
> --
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  The explain of Hive on Spark is
> {code}
> {code}





[jira] [Reopened] (HIVE-17474) Different logical plan of same query(TPC-DS/70) with same settings

2017-09-12 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel reopened HIVE-17474:
-

> Different logical plan of same query(TPC-DS/70) with same settings
> --
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  On Hive version (d3b88f6), I found that the logical plan differs at runtime 
> with the same settings.
> sometimes the logical plan
> {code}
> TS[0]-FIL[63]-SEL[2]-RS[43]-JOIN[45]-RS[46]-JOIN[48]-SEL[49]-GBY[50]-RS[51]-GBY[52]-SEL[53]-RS[54]-SEL[55]-PTF[56]-SEL[57]-RS[59]-SEL[60]-LIM[61]-FS[62]
> TS[3]-FIL[64]-SEL[5]-RS[44]-JOIN[45]
> TS[6]-FIL[65]-SEL[8]-RS[39]-JOIN[41]-RS[47]-JOIN[48]
> TS[9]-FIL[67]-SEL[11]-RS[18]-JOIN[20]-RS[21]-JOIN[23]-SEL[24]-GBY[25]-RS[26]-GBY[27]-RS[29]-SEL[30]-PTF[31]-FIL[66]-SEL[32]-GBY[38]-RS[40]-JOIN[41]
> TS[12]-FIL[68]-SEL[14]-RS[19]-JOIN[20]
> TS[15]-FIL[69]-SEL[17]-RS[22]-JOIN[23]
> {code}
>  TS\[6\] connects with TS\[9\] on JOIN\[41\] and connects with TS\[0\] on 
> JOIN\[48\].
> sometimes 
> {code}
> TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60]
> TS[1]-FIL[64]-RS[5]-JOIN[6]
> TS[2]-FIL[65]-RS[10]-JOIN[11]
> TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44]
> TS[13]-FIL[69]-RS[18]-JOIN[19]
> TS[14]-FIL[70]-RS[22]-JOIN[23]
> {code}
> TS\[2\] connects with TS\[0\] on JOIN\[11\]
> Although TS\[2\] and TS\[6\] have different operator ids, they both refer to 
> the table store in the query.
> The difference causes a different Spark execution plan and a different 
> execution time.  I am very confused why there are different logical plans 
> with the same settings. Does anyone know where to investigate the root cause?





[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-09-11 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160943#comment-16160943
 ] 

liyunzhang_intel commented on HIVE-17486:
-

[~stakiar]: thanks for your interest in it.  I guess this optimization has the 
following effect.
{code}
set hive.strict.checks.cartesian.product=false;
set hive.join.emit.interval=2;
set hive.auto.convert.join=false;

explain SELECT *
FROM (
  SELECT test1.key AS key1, test1.value AS value1, test1.col_1 AS col_1,
 test2.key AS key2, test2.value AS value2, test2.col_2 AS col_2
  FROM test1 RIGHT OUTER JOIN test2
  ON (test1.value=test2.value
AND (test1.key between 100 and 102
  OR test2.key between 100 and 102))
  ) sq1
FULL OUTER JOIN (
  SELECT test1.key AS key3, test1.value AS value3, test1.col_1 AS col_3,
 test2.key AS key4, test2.value AS value4, test2.col_2 AS col_4
  FROM test1 LEFT OUTER JOIN test2
  ON (test1.value=test2.value
AND (test1.key between 100 and 102
  OR test2.key between 100 and 102))
  ) sq2
ON (sq1.value1 is null or sq2.value4 is null and sq2.value3 != sq1.value2);

{code}

the Spark explain (note that test1 is scanned twice, by Map 1 and Map 5, and 
test2 twice, by Map 4 and Map 7; these duplicated scans are what 
SharedWorkOptimizer would merge):
{code}
 STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
Spark
  Edges:
Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 12), Map 4 (PARTITION-LEVEL 
SORT, 12)
Reducer 3 <- Reducer 2 (PARTITION-LEVEL SORT, 1), Reducer 6 
(PARTITION-LEVEL SORT, 1)
Reducer 6 <- Map 5 (PARTITION-LEVEL SORT, 12), Map 7 (PARTITION-LEVEL 
SORT, 12)
  DagName: root_20170911043433_e314705a-beca-41a0-b28a-c85c5f811a67:1
  Vertices:
Map 1 
Map Operator Tree:
TableScan
  alias: test1
  Statistics: Num rows: 6 Data size: 56 Basic stats: COMPLETE 
Column stats: NONE
  Select Operator
expressions: key (type: int), value (type: int), col_1 
(type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 6 Data size: 56 Basic stats: COMPLETE 
Column stats: NONE
Reduce Output Operator
  key expressions: _col1 (type: int)
  sort order: +
  Map-reduce partition columns: _col1 (type: int)
  Statistics: Num rows: 6 Data size: 56 Basic stats: 
COMPLETE Column stats: NONE
  value expressions: _col0 (type: int), _col2 (type: string)
Map 4 
Map Operator Tree:
TableScan
  alias: test2
  Statistics: Num rows: 4 Data size: 38 Basic stats: COMPLETE 
Column stats: NONE
  Select Operator
expressions: key (type: int), value (type: int), col_2 
(type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 4 Data size: 38 Basic stats: COMPLETE 
Column stats: NONE
Reduce Output Operator
  key expressions: _col1 (type: int)
  sort order: +
  Map-reduce partition columns: _col1 (type: int)
  Statistics: Num rows: 4 Data size: 38 Basic stats: 
COMPLETE Column stats: NONE
  value expressions: _col0 (type: int), _col2 (type: string)
Map 5 
Map Operator Tree:
TableScan
  alias: test1
  Statistics: Num rows: 6 Data size: 56 Basic stats: COMPLETE 
Column stats: NONE
  Select Operator
expressions: key (type: int), value (type: int), col_1 
(type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 6 Data size: 56 Basic stats: COMPLETE 
Column stats: NONE
Reduce Output Operator
  key expressions: _col1 (type: int)
  sort order: +
  Map-reduce partition columns: _col1 (type: int)
  Statistics: Num rows: 6 Data size: 56 Basic stats: 
COMPLETE Column stats: NONE
  value expressions: _col0 (type: int), _col2 (type: string)
Map 7 
Map Operator Tree:
TableScan
  alias: test2
  Statistics: Num rows: 4 Data size: 38 Basic stats: COMPLETE 
Column stats: NONE
  Select Operator
expressions: key (type: int), value (type: int), col_2 
(type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 4 Data size: 38 Basic stats: COMPLETE 
Column stats: NONE
Reduce Output Operator
  key expressions: _col1 (type: int)
  sort order: +
  Map-reduce partition 

[jira] [Assigned] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-09-08 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel reassigned HIVE-17486:
---


> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so that the data is read only once. The optimization is carried out at 
> the physical level.
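> As a sketch of what enabling it could look like (hive.optimize.shared.work 
> is the HiveConf flag added with HIVE-16602 for Tez; that this jira would 
> reuse the same switch for HOS is an assumption):
> {code}
> set hive.optimize.shared.work=true;
> -- two branches scanning the same table should end up sharing one TableScan
> explain
> select count(*) from (select key from src where key > '100'
>                       union all
>                       select key from src where key < '10') t;
> {code}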





[jira] [Comment Edited] (HIVE-17474) Different logical plan of same query(TPC-DS/70) with same settings

2017-09-07 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156692#comment-16156692
 ] 

liyunzhang_intel edited comment on HIVE-17474 at 9/7/17 8:42 AM:
-

After HIVE-15192, the join with store is converted to a map join.
The logical plan will always be 
{code}
TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60]
TS[1]-FIL[64]-RS[5]-JOIN[6]
TS[2]-FIL[65]-RS[10]-JOIN[11]
TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44]
TS[13]-FIL[69]-RS[18]-JOIN[19]
TS[14]-FIL[70]-RS[22]-JOIN[23]

{code}

It is reasonable that the small table store is converted to a map join, so I am 
closing the jira.


was (Author: kellyzly):
After HIVE-15192, the join with store is converted to a map join.
The execution plan will always be 
{code}
TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60]
TS[1]-FIL[64]-RS[5]-JOIN[6]
TS[2]-FIL[65]-RS[10]-JOIN[11]
TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44]
TS[13]-FIL[69]-RS[18]-JOIN[19]
TS[14]-FIL[70]-RS[22]-JOIN[23]

{code}

It is reasonable that the small table store is converted to a map join, so I am 
closing the jira.

> Different logical plan of same query(TPC-DS/70) with same settings
> --
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  On Hive version (d3b88f6), I found that the logical plan differs at runtime 
> with the same settings.
> sometimes the logical plan
> {code}
> TS[0]-FIL[63]-SEL[2]-RS[43]-JOIN[45]-RS[46]-JOIN[48]-SEL[49]-GBY[50]-RS[51]-GBY[52]-SEL[53]-RS[54]-SEL[55]-PTF[56]-SEL[57]-RS[59]-SEL[60]-LIM[61]-FS[62]
> TS[3]-FIL[64]-SEL[5]-RS[44]-JOIN[45]
> TS[6]-FIL[65]-SEL[8]-RS[39]-JOIN[41]-RS[47]-JOIN[48]
> TS[9]-FIL[67]-SEL[11]-RS[18]-JOIN[20]-RS[21]-JOIN[23]-SEL[24]-GBY[25]-RS[26]-GBY[27]-RS[29]-SEL[30]-PTF[31]-FIL[66]-SEL[32]-GBY[38]-RS[40]-JOIN[41]
> TS[12]-FIL[68]-SEL[14]-RS[19]-JOIN[20]
> TS[15]-FIL[69]-SEL[17]-RS[22]-JOIN[23]
> {code}
>  TS\[6\] connects with TS\[9\] on JOIN\[41\] and connects with TS\[0\] on 
> JOIN\[48\].
> sometimes 
> {code}
> TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60]
> TS[1]-FIL[64]-RS[5]-JOIN[6]
> TS[2]-FIL[65]-RS[10]-JOIN[11]
> TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44]
> TS[13]-FIL[69]-RS[18]-JOIN[19]
> TS[14]-FIL[70]-RS[22]-JOIN[23]
> {code}
> TS\[2\] connects with TS\[0\] on JOIN\[11\]
> Although TS\[2\] and TS\[6\] have different operator ids, they both refer to 
> the table store in the query.
> The difference causes a different Spark execution plan and a different 
> execution time.  I am very confused why there are different logical plans 
> with the same settings. Does anyone know where to investigate the root cause?





[jira] [Resolved] (HIVE-17474) Different logical plan of same query(TPC-DS/70) with same settings

2017-09-07 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel resolved HIVE-17474.
-
Resolution: Not A Bug

> Different logical plan of same query(TPC-DS/70) with same settings
> --
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  On Hive version (d3b88f6), I found that the logical plan differs at runtime 
> with the same settings.
> sometimes the logical plan
> {code}
> TS[0]-FIL[63]-SEL[2]-RS[43]-JOIN[45]-RS[46]-JOIN[48]-SEL[49]-GBY[50]-RS[51]-GBY[52]-SEL[53]-RS[54]-SEL[55]-PTF[56]-SEL[57]-RS[59]-SEL[60]-LIM[61]-FS[62]
> TS[3]-FIL[64]-SEL[5]-RS[44]-JOIN[45]
> TS[6]-FIL[65]-SEL[8]-RS[39]-JOIN[41]-RS[47]-JOIN[48]
> TS[9]-FIL[67]-SEL[11]-RS[18]-JOIN[20]-RS[21]-JOIN[23]-SEL[24]-GBY[25]-RS[26]-GBY[27]-RS[29]-SEL[30]-PTF[31]-FIL[66]-SEL[32]-GBY[38]-RS[40]-JOIN[41]
> TS[12]-FIL[68]-SEL[14]-RS[19]-JOIN[20]
> TS[15]-FIL[69]-SEL[17]-RS[22]-JOIN[23]
> {code}
>  TS\[6\] connects with TS\[9\] on JOIN\[41\] and connects with TS\[0\] on 
> JOIN\[48\].
> sometimes 
> {code}
> TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60]
> TS[1]-FIL[64]-RS[5]-JOIN[6]
> TS[2]-FIL[65]-RS[10]-JOIN[11]
> TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44]
> TS[13]-FIL[69]-RS[18]-JOIN[19]
> TS[14]-FIL[70]-RS[22]-JOIN[23]
> {code}
> TS\[2\] connects with TS\[0\] on JOIN\[11\]
> Although TS\[2\] and TS\[6\] have different operator ids, they both refer to 
> the table store in the query.
> The difference causes a different Spark execution plan and a different 
> execution time.  I am very confused why there are different logical plans 
> with the same settings. Does anyone know where to investigate the root cause?





[jira] [Commented] (HIVE-17474) Different logical plan of same query(TPC-DS/70) with same settings

2017-09-07 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156692#comment-16156692
 ] 

liyunzhang_intel commented on HIVE-17474:
-

After HIVE-15192, the join with store is converted to a map join.
The execution plan will always be 
{code}
TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60]
TS[1]-FIL[64]-RS[5]-JOIN[6]
TS[2]-FIL[65]-RS[10]-JOIN[11]
TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44]
TS[13]-FIL[69]-RS[18]-JOIN[19]
TS[14]-FIL[70]-RS[22]-JOIN[23]

{code}

It is reasonable that the small table store is converted to a map join, so I am 
closing the jira.

> Different logical plan of same query(TPC-DS/70) with same settings
> --
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  On Hive version (d3b88f6), I found that the logical plan differs at runtime 
> with the same settings.
> sometimes the logical plan
> {code}
> TS[0]-FIL[63]-SEL[2]-RS[43]-JOIN[45]-RS[46]-JOIN[48]-SEL[49]-GBY[50]-RS[51]-GBY[52]-SEL[53]-RS[54]-SEL[55]-PTF[56]-SEL[57]-RS[59]-SEL[60]-LIM[61]-FS[62]
> TS[3]-FIL[64]-SEL[5]-RS[44]-JOIN[45]
> TS[6]-FIL[65]-SEL[8]-RS[39]-JOIN[41]-RS[47]-JOIN[48]
> TS[9]-FIL[67]-SEL[11]-RS[18]-JOIN[20]-RS[21]-JOIN[23]-SEL[24]-GBY[25]-RS[26]-GBY[27]-RS[29]-SEL[30]-PTF[31]-FIL[66]-SEL[32]-GBY[38]-RS[40]-JOIN[41]
> TS[12]-FIL[68]-SEL[14]-RS[19]-JOIN[20]
> TS[15]-FIL[69]-SEL[17]-RS[22]-JOIN[23]
> {code}
>  TS\[6\] connects with TS\[9\] on JOIN\[41\] and connects with TS\[0\] on 
> JOIN\[48\].
> sometimes 
> {code}
> TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60]
> TS[1]-FIL[64]-RS[5]-JOIN[6]
> TS[2]-FIL[65]-RS[10]-JOIN[11]
> TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44]
> TS[13]-FIL[69]-RS[18]-JOIN[19]
> TS[14]-FIL[70]-RS[22]-JOIN[23]
> {code}
> TS\[2\] connects with TS\[0\] on JOIN\[11\]
> Although TS\[2\] and TS\[6\] have different operator ids, they both refer to 
> the table store in the query.
> The difference causes a different Spark execution plan and a different 
> execution time.  I am very confused why there are different logical plans 
> with the same settings. Does anyone know where to investigate the root cause?





[jira] [Updated] (HIVE-17474) Different logical plan of same query(TPC-DS/70) with same settings

2017-09-06 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17474:

Summary: Different logical plan of same query(TPC-DS/70) with same settings 
 (was: Different physical plan of same query(TPC-DS/70) on HOS)

> Different logical plan of same query(TPC-DS/70) with same settings
> --
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  On Hive version (d3b88f6), I found that the physical plan differs at runtime 
> with the same settings.
> sometimes the physical plan
> {code}
> TS[0]-FIL[63]-SEL[2]-RS[43]-JOIN[45]-RS[46]-JOIN[48]-SEL[49]-GBY[50]-RS[51]-GBY[52]-SEL[53]-RS[54]-SEL[55]-PTF[56]-SEL[57]-RS[59]-SEL[60]-LIM[61]-FS[62]
> TS[3]-FIL[64]-SEL[5]-RS[44]-JOIN[45]
> TS[6]-FIL[65]-SEL[8]-RS[39]-JOIN[41]-RS[47]-JOIN[48]
> TS[9]-FIL[67]-SEL[11]-RS[18]-JOIN[20]-RS[21]-JOIN[23]-SEL[24]-GBY[25]-RS[26]-GBY[27]-RS[29]-SEL[30]-PTF[31]-FIL[66]-SEL[32]-GBY[38]-RS[40]-JOIN[41]
> TS[12]-FIL[68]-SEL[14]-RS[19]-JOIN[20]
> TS[15]-FIL[69]-SEL[17]-RS[22]-JOIN[23]
> {code}
>  TS\[6\] connects with TS\[9\] on JOIN\[41\] and connects with TS\[0\] on 
> JOIN\[48\].
> sometimes 
> {code}
> TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60]
> TS[1]-FIL[64]-RS[5]-JOIN[6]
> TS[2]-FIL[65]-RS[10]-JOIN[11]
> TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44]
> TS[13]-FIL[69]-RS[18]-JOIN[19]
> TS[14]-FIL[70]-RS[22]-JOIN[23]
> {code}
> TS\[2\] connects with TS\[0\] on JOIN\[11\]
> Although TS\[2\] and TS\[6\] have different operator ids, they both refer to 
> the table store in the query.
> The difference causes a different Spark execution plan and a different 
> execution time.  I am very confused why there are different physical plans 
> with the same settings. Does anyone know where to investigate the root cause?





[jira] [Updated] (HIVE-17474) Different logical plan of same query(TPC-DS/70) with same settings

2017-09-06 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17474:

Description: 
in 
[DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
 On Hive version (d3b88f6), I found that the logical plan differs at runtime 
with the same settings.

sometimes the logical plan
{code}
TS[0]-FIL[63]-SEL[2]-RS[43]-JOIN[45]-RS[46]-JOIN[48]-SEL[49]-GBY[50]-RS[51]-GBY[52]-SEL[53]-RS[54]-SEL[55]-PTF[56]-SEL[57]-RS[59]-SEL[60]-LIM[61]-FS[62]
TS[3]-FIL[64]-SEL[5]-RS[44]-JOIN[45]
TS[6]-FIL[65]-SEL[8]-RS[39]-JOIN[41]-RS[47]-JOIN[48]
TS[9]-FIL[67]-SEL[11]-RS[18]-JOIN[20]-RS[21]-JOIN[23]-SEL[24]-GBY[25]-RS[26]-GBY[27]-RS[29]-SEL[30]-PTF[31]-FIL[66]-SEL[32]-GBY[38]-RS[40]-JOIN[41]
TS[12]-FIL[68]-SEL[14]-RS[19]-JOIN[20]
TS[15]-FIL[69]-SEL[17]-RS[22]-JOIN[23]
{code}
 TS\[6\] connects with TS\[9\] on JOIN\[41\] and connects with TS\[0\] on 
JOIN\[48\].

sometimes 
{code}
TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60]
TS[1]-FIL[64]-RS[5]-JOIN[6]
TS[2]-FIL[65]-RS[10]-JOIN[11]
TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44]
TS[13]-FIL[69]-RS[18]-JOIN[19]
TS[14]-FIL[70]-RS[22]-JOIN[23]
{code}
TS\[2\] connects with TS\[0\] on JOIN\[11\]

Although TS\[2\] and TS\[6\] have different operator ids, they both refer to the 
table store in the query.

The difference causes a different Spark execution plan and a different execution 
time.  I am very confused why there are different logical plans with the same 
settings. Does anyone know where to investigate the root cause?

  was:
in 
[DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
 On Hive version (d3b88f6), I found that the physical plan differs at runtime 
with the same settings.

sometimes the physical plan
{code}
TS[0]-FIL[63]-SEL[2]-RS[43]-JOIN[45]-RS[46]-JOIN[48]-SEL[49]-GBY[50]-RS[51]-GBY[52]-SEL[53]-RS[54]-SEL[55]-PTF[56]-SEL[57]-RS[59]-SEL[60]-LIM[61]-FS[62]
TS[3]-FIL[64]-SEL[5]-RS[44]-JOIN[45]
TS[6]-FIL[65]-SEL[8]-RS[39]-JOIN[41]-RS[47]-JOIN[48]
TS[9]-FIL[67]-SEL[11]-RS[18]-JOIN[20]-RS[21]-JOIN[23]-SEL[24]-GBY[25]-RS[26]-GBY[27]-RS[29]-SEL[30]-PTF[31]-FIL[66]-SEL[32]-GBY[38]-RS[40]-JOIN[41]
TS[12]-FIL[68]-SEL[14]-RS[19]-JOIN[20]
TS[15]-FIL[69]-SEL[17]-RS[22]-JOIN[23]
{code}
 TS\[6\] connects with TS\[9\] on JOIN\[41\] and connects with TS\[0\] on 
JOIN\[48\].

sometimes 
{code}
TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60]
TS[1]-FIL[64]-RS[5]-JOIN[6]
TS[2]-FIL[65]-RS[10]-JOIN[11]
TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44]
TS[13]-FIL[69]-RS[18]-JOIN[19]
TS[14]-FIL[70]-RS[22]-JOIN[23]
{code}
TS\[2\] connects with TS\[0\] on JOIN\[11\]

Although TS\[2\] and TS\[6\] have different operator ids, they both refer to the 
table store in the query.

The difference causes a different Spark execution plan and a different execution 
time.  I am very confused why there are different physical plans with the same 
settings. Does anyone know where to investigate the root cause?


> Different logical plan of same query(TPC-DS/70) with same settings
> --
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  On Hive version (d3b88f6), I found that the logical plan differs at runtime 
> with the same settings.
> sometimes the logical plan
> {code}
> TS[0]-FIL[63]-SEL[2]-RS[43]-JOIN[45]-RS[46]-JOIN[48]-SEL[49]-GBY[50]-RS[51]-GBY[52]-SEL[53]-RS[54]-SEL[55]-PTF[56]-SEL[57]-RS[59]-SEL[60]-LIM[61]-FS[62]
> TS[3]-FIL[64]-SEL[5]-RS[44]-JOIN[45]
> TS[6]-FIL[65]-SEL[8]-RS[39]-JOIN[41]-RS[47]-JOIN[48]
> TS[9]-FIL[67]-SEL[11]-RS[18]-JOIN[20]-RS[21]-JOIN[23]-SEL[24]-GBY[25]-RS[26]-GBY[27]-RS[29]-SEL[30]-PTF[31]-FIL[66]-SEL[32]-GBY[38]-RS[40]-JOIN[41]
> TS[12]-FIL[68]-SEL[14]-RS[19]-JOIN[20]
> TS[15]-FIL[69]-SEL[17]-RS[22]-JOIN[23]
> {code}
>  TS\[6\] connects with TS\[9\] on JOIN\[41\] and connects with TS\[0\] on 
> JOIN\[48\].
> sometimes 
> {code}
> TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60]
> TS[1]-FIL[64]-RS[5]-JOIN[6]
> TS[2]-FIL[65]-RS[10]-JOIN[11]
> 

[jira] [Commented] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-09-06 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156308#comment-16156308
 ] 

liyunzhang_intel commented on HIVE-17414:
-

thanks for [~lirui] and [~stakiar]'s review

> HoS DPP + Vectorization generates invalid explain plan due to 
> CombineEquivalentWorkResolver
> ---
>
> Key: HIVE-17414
> URL: https://issues.apache.org/jira/browse/HIVE-17414
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang_intel
> Fix For: 3.0.0
>
> Attachments: HIVE-17414.1.patch, HIVE-17414.2.patch, 
> HIVE-17414.3.patch, HIVE-17414.4.patch, HIVE-17414.5.patch, HIVE-17414.patch
>
>
> Similar to HIVE-16948, the following query generates an invalid explain plan 
> when HoS DPP is enabled + vectorization:
> {code:sql}
> select ds from (select distinct(ds) as ds from srcpart union all select 
> distinct(ds) as ds from srcpart) s where s.ds in (select max(srcpart.ds) from 
> srcpart union all select min(srcpart.ds) from srcpart)
> {code}
> Explain Plan:
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>   Edges:
> Reducer 11 <- Map 10 (GROUP, 1)
> Reducer 13 <- Map 12 (GROUP, 1)
>  A masked pattern was here 
>   Vertices:
> Map 10
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: max(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Map 12
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: min(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Reducer 11
> Execution mode: vectorized
> Reduce Operator Tree:
>   Group By Operator
> aggregations: max(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE 
> Column stats: NONE
> Filter Operator
>   predicate: _col0 is not null (type: boolean)
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: _col0 (type: string)
>   outputColumnNames: _col0
>   Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> 

[jira] [Commented] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-09-04 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16153103#comment-16153103
 ] 

liyunzhang_intel commented on HIVE-17414:
-

[~Ferd]: please commit the 5th patch, thanks!

> HoS DPP + Vectorization generates invalid explain plan due to 
> CombineEquivalentWorkResolver
> ---
>
> Key: HIVE-17414
> URL: https://issues.apache.org/jira/browse/HIVE-17414
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang_intel
> Attachments: HIVE-17414.1.patch, HIVE-17414.2.patch, 
> HIVE-17414.3.patch, HIVE-17414.4.patch, HIVE-17414.5.patch, HIVE-17414.patch
>
>
> Similar to HIVE-16948, the following query generates an invalid explain plan 
> when HoS DPP is enabled + vectorization:
> {code:sql}
> select ds from (select distinct(ds) as ds from srcpart union all select 
> distinct(ds) as ds from srcpart) s where s.ds in (select max(srcpart.ds) from 
> srcpart union all select min(srcpart.ds) from srcpart)
> {code}
> Explain Plan:
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>   Edges:
> Reducer 11 <- Map 10 (GROUP, 1)
> Reducer 13 <- Map 12 (GROUP, 1)
>  A masked pattern was here 
>   Vertices:
> Map 10
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: max(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Map 12
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: min(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Reducer 11
> Execution mode: vectorized
> Reduce Operator Tree:
>   Group By Operator
> aggregations: max(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE 
> Column stats: NONE
> Filter Operator
>   predicate: _col0 is not null (type: boolean)
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: _col0 (type: string)
>   outputColumnNames: _col0
>   Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> 

[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-09-04 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17414:

Attachment: HIVE-17414.5.patch

[~stakiar]: thanks for your reminder.  Attaching the 5th patch to trigger QA tests.

> HoS DPP + Vectorization generates invalid explain plan due to 
> CombineEquivalentWorkResolver
> ---
>
> Key: HIVE-17414
> URL: https://issues.apache.org/jira/browse/HIVE-17414
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang_intel
> Attachments: HIVE-17414.1.patch, HIVE-17414.2.patch, 
> HIVE-17414.3.patch, HIVE-17414.4.patch, HIVE-17414.5.patch, HIVE-17414.patch
>
>
> Similar to HIVE-16948, the following query generates an invalid explain plan 
> when HoS DPP is enabled + vectorization:
> {code:sql}
> select ds from (select distinct(ds) as ds from srcpart union all select 
> distinct(ds) as ds from srcpart) s where s.ds in (select max(srcpart.ds) from 
> srcpart union all select min(srcpart.ds) from srcpart)
> {code}
> Explain Plan:
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>   Edges:
> Reducer 11 <- Map 10 (GROUP, 1)
> Reducer 13 <- Map 12 (GROUP, 1)
>  A masked pattern was here 
>   Vertices:
> Map 10
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: max(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Map 12
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: min(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Reducer 11
> Execution mode: vectorized
> Reduce Operator Tree:
>   Group By Operator
> aggregations: max(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE 
> Column stats: NONE
> Filter Operator
>   predicate: _col0 is not null (type: boolean)
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: _col0 (type: string)
>   outputColumnNames: _col0
>   Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> 

[jira] [Commented] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-09-04 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152291#comment-16152291
 ] 

liyunzhang_intel commented on HIVE-17414:
-

[~lirui]: yes, I mean the 4th patch. [~ferd], as [~lirui] and [~stakiar] have 
finished the review, please commit the 4th patch.

> HoS DPP + Vectorization generates invalid explain plan due to 
> CombineEquivalentWorkResolver
> ---
>
> Key: HIVE-17414
> URL: https://issues.apache.org/jira/browse/HIVE-17414
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang_intel
> Attachments: HIVE-17414.1.patch, HIVE-17414.2.patch, 
> HIVE-17414.3.patch, HIVE-17414.4.patch, HIVE-17414.patch
>
>
> Similar to HIVE-16948, the following query generates an invalid explain plan 
> when HoS DPP is enabled + vectorization:
> {code:sql}
> select ds from (select distinct(ds) as ds from srcpart union all select 
> distinct(ds) as ds from srcpart) s where s.ds in (select max(srcpart.ds) from 
> srcpart union all select min(srcpart.ds) from srcpart)
> {code}
> Explain Plan:
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>   Edges:
> Reducer 11 <- Map 10 (GROUP, 1)
> Reducer 13 <- Map 12 (GROUP, 1)
>  A masked pattern was here 
>   Vertices:
> Map 10
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: max(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Map 12
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: min(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Reducer 11
> Execution mode: vectorized
> Reduce Operator Tree:
>   Group By Operator
> aggregations: max(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE 
> Column stats: NONE
> Filter Operator
>   predicate: _col0 is not null (type: boolean)
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: _col0 (type: string)
>   outputColumnNames: _col0
>   Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> 

[jira] [Commented] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-09-04 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152271#comment-16152271
 ] 

liyunzhang_intel commented on HIVE-17414:
-

thanks for [~lirui] and [~stakiar]'s review. The changes in HIVE-17414.3.patch:
1. remove Map 4, which does not exist in the explain output
2. other changes around {code}explain select count(*) from srcpart join 
srcpart_date on (srcpart.ds = srcpart_date.ds) join srcpart_hour on (srcpart.hr 
= srcpart_hour.hr) where srcpart_date.`date` = '2008-04-08' and srcpart.hr 
= 13;{code} are due to HIVE-16811.

> HoS DPP + Vectorization generates invalid explain plan due to 
> CombineEquivalentWorkResolver
> ---
>
> Key: HIVE-17414
> URL: https://issues.apache.org/jira/browse/HIVE-17414
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang_intel
> Attachments: HIVE-17414.1.patch, HIVE-17414.2.patch, 
> HIVE-17414.3.patch, HIVE-17414.4.patch, HIVE-17414.patch
>
>
> Similar to HIVE-16948, the following query generates an invalid explain plan 
> when HoS DPP is enabled + vectorization:
> {code:sql}
> select ds from (select distinct(ds) as ds from srcpart union all select 
> distinct(ds) as ds from srcpart) s where s.ds in (select max(srcpart.ds) from 
> srcpart union all select min(srcpart.ds) from srcpart)
> {code}
> Explain Plan:
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>   Edges:
> Reducer 11 <- Map 10 (GROUP, 1)
> Reducer 13 <- Map 12 (GROUP, 1)
>  A masked pattern was here 
>   Vertices:
> Map 10
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: max(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Map 12
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: min(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Reducer 11
> Execution mode: vectorized
> Reduce Operator Tree:
>   Group By Operator
> aggregations: max(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE 
> Column stats: NONE
> Filter Operator
>   predicate: _col0 is not null (type: boolean)
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: _col0 (type: string)
>   

[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-09-04 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17414:

Attachment: HIVE-17414.4.patch

> HoS DPP + Vectorization generates invalid explain plan due to 
> CombineEquivalentWorkResolver
> ---
>
> Key: HIVE-17414
> URL: https://issues.apache.org/jira/browse/HIVE-17414
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang_intel
> Attachments: HIVE-17414.1.patch, HIVE-17414.2.patch, 
> HIVE-17414.3.patch, HIVE-17414.4.patch, HIVE-17414.patch
>
>
> Similar to HIVE-16948, the following query generates an invalid explain plan 
> when HoS DPP is enabled + vectorization:
> {code:sql}
> select ds from (select distinct(ds) as ds from srcpart union all select 
> distinct(ds) as ds from srcpart) s where s.ds in (select max(srcpart.ds) from 
> srcpart union all select min(srcpart.ds) from srcpart)
> {code}
> Explain Plan:
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>   Edges:
> Reducer 11 <- Map 10 (GROUP, 1)
> Reducer 13 <- Map 12 (GROUP, 1)
>  A masked pattern was here 
>   Vertices:
> Map 10
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: max(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Map 12
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: min(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Reducer 11
> Execution mode: vectorized
> Reduce Operator Tree:
>   Group By Operator
> aggregations: max(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE 
> Column stats: NONE
> Filter Operator
>   predicate: _col0 is not null (type: boolean)
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: _col0 (type: string)
>   outputColumnNames: _col0
>   Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column 

[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-09-03 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17414:

Attachment: HIVE-17414.3.patch

trigger Hive QA

> HoS DPP + Vectorization generates invalid explain plan due to 
> CombineEquivalentWorkResolver
> ---
>
> Key: HIVE-17414
> URL: https://issues.apache.org/jira/browse/HIVE-17414
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang_intel
> Attachments: HIVE-17414.1.patch, HIVE-17414.2.patch, 
> HIVE-17414.3.patch, HIVE-17414.patch
>

[jira] [Commented] (HIVE-17383) ArrayIndexOutOfBoundsException in VectorGroupByOperator

2017-09-01 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16150330#comment-16150330
 ] 

liyunzhang_intel commented on HIVE-17383:
-

why say "The failures can't be reproduced locally.". Actually it can be 
reproduced in my env. Do you mean in latest master, this is fixed?


Not very well understand the logic of vectorization.
why {{firstOutputColumnIndex}} starts from {{initialColumnNames.length}}.  for 
example,if have 1 column, the {{firstOutputColumnIndex}} is from 1( normally 
the index is from 0).  When we construct the output batch, the column is from 
1, is this right?
{code}
  // Convenient constructor for initial batch creation takes
  // a list of column names and maps them to 0..n-1 indices.
  public VectorizationContext(String contextName, List<String> initialColumnNames,
      HiveConf hiveConf) {
    this.contextName = contextName;
    level = 0;
    this.initialColumnNames = initialColumnNames;
    this.projectionColumnNames = initialColumnNames;

    projectedColumns = new ArrayList<Integer>();
    projectionColumnMap = new HashMap<String, Integer>();
    for (int i = 0; i < this.projectionColumnNames.size(); i++) {
      projectedColumns.add(i);
      projectionColumnMap.put(projectionColumnNames.get(i), i);
    }

    int firstOutputColumnIndex = projectedColumns.size();
    this.ocm = new OutputColumnManager(firstOutputColumnIndex);
    this.firstOutputColumnIndex = firstOutputColumnIndex;
    vMap = new VectorExpressionDescriptor();

    if (hiveConf != null) {
      setHiveConfVars(hiveConf);
    }
  }
{code}
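
As a minimal sketch (my own illustration with made-up names, not Hive code), the 
row batch can be viewed as one flat array of column vectors: indices 0..n-1 hold 
the projected input columns, and scratch/output columns are appended after them so 
the inputs are never overwritten:
{code}
// Sketch only: why scratch columns start at initialColumnNames.length.
import java.util.ArrayList;
import java.util.List;

public class OutputColumnSketch {
  public static void main(String[] args) {
    List<String> initialColumnNames = new ArrayList<>();
    initialColumnNames.add("key");                          // input column at index 0

    int firstOutputColumnIndex = initialColumnNames.size(); // == 1

    // Allocating scratch/output columns appends them after the inputs,
    // mirroring what OutputColumnManager does with its starting index.
    int scratch1 = firstOutputColumnIndex;                  // index 1
    int scratch2 = firstOutputColumnIndex + 1;              // index 2

    System.out.println("inputs occupy [0.." + (firstOutputColumnIndex - 1) + "]");
    System.out.println("first scratch column: " + scratch1 + ", next: " + scratch2);
  }
}
{code}
Under that reading, a batch built for only the input column has length 1, so any 
access at index 1 overflows; that may be related to the 
{{ArrayIndexOutOfBoundsException: 1}} in the stack trace below.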


> ArrayIndexOutOfBoundsException in VectorGroupByOperator
> ---
>
> Key: HIVE-17383
> URL: https://issues.apache.org/jira/browse/HIVE-17383
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-17383.1.patch
>
>
> Query to reproduce:
> {noformat}
> set hive.cbo.enable=false;
> select count(*) from (select key from src group by key) s where s.key='98';
> {noformat}
> The stack trace is:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:831)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeBase.processBatch(VectorGroupByOperator.java:174)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.process(VectorGroupByOperator.java:1046)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.processVectorGroup(ReduceRecordSource.java:462)
>   ... 18 more
> {noformat}
> More details can be found in HIVE-16823





[jira] [Commented] (HIVE-17405) HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT

2017-08-31 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16150076#comment-16150076
 ] 

liyunzhang_intel commented on HIVE-17405:
-

[~stakiar]: thanks for the explanation. The [different file format test 
case|https://issues.apache.org/jira/secure/attachment/12884191/HIVE-17216.4.patch]
 was added to spark_dynamic_partition_pruning.q in HIVE-17216.


> HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT
> -
>
> Key: HIVE-17405
> URL: https://issues.apache.org/jira/browse/HIVE-17405
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-17405.1.patch, HIVE-17405.2.patch, 
> HIVE-17405.3.patch, HIVE-17405.4.patch, HIVE-17405.5.patch, 
> HIVE-17405.6.patch, HIVE-17405.7.patch
>
>
> In {{SparkCompiler#runDynamicPartitionPruning}} we should change {{new 
> ConstantPropagate().transform(parseContext)}} to {{new 
> ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(parseContext)}}
> Hive-on-Tez does the same thing.
> Running the full constant propagation isn't really necessary, we just want to 
> eliminate any {{and true}} predicates that were introduced by 
> {{SyntheticJoinPredicate}} and {{DynamicPartitionPruningOptimization}}. The 
> {{SyntheticJoinPredicate}} will introduce dummy filter predicates into the 
> operator tree, and {{DynamicPartitionPruningOptimization}} will replace them. 
> The predicates introduced via {{SyntheticJoinPredicate}} are necessary to 
> help {{DynamicPartitionPruningOptimization}} determine if DPP can be used or 
> not.





[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-08-31 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17414:

Attachment: HIVE-17414.1.patch

[~lirui]: updated the comments. There is a test case in 
[spark_vectorized_dynamic_partition_pruning.q|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/spark_vectorized_dynamic_partition_pruning.q#L112].
 After HIVE-17405 is resolved, I will update the q.out of that case.

> HoS DPP + Vectorization generates invalid explain plan due to 
> CombineEquivalentWorkResolver
> ---
>
> Key: HIVE-17414
> URL: https://issues.apache.org/jira/browse/HIVE-17414
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang_intel
> Attachments: HIVE-17414.1.patch, HIVE-17414.2.patch, HIVE-17414.patch
>

[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-08-31 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17414:

Attachment: HIVE-17414.2.patch

> HoS DPP + Vectorization generates invalid explain plan due to 
> CombineEquivalentWorkResolver
> ---
>
> Key: HIVE-17414
> URL: https://issues.apache.org/jira/browse/HIVE-17414
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang_intel
> Attachments: HIVE-17414.1.patch, HIVE-17414.2.patch, HIVE-17414.patch
>

[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-08-31 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17414:

Attachment: (was: HIVE-17414.1.patch)

> HoS DPP + Vectorization generates invalid explain plan due to 
> CombineEquivalentWorkResolver
> ---
>
> Key: HIVE-17414
> URL: https://issues.apache.org/jira/browse/HIVE-17414
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang_intel
> Attachments: HIVE-17414.1.patch, HIVE-17414.2.patch, HIVE-17414.patch
>

[jira] [Commented] (HIVE-17405) HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT

2017-08-31 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16150025#comment-16150025
 ] 

liyunzhang_intel commented on HIVE-17405:
-

[~stakiar]: why do we need to remove the following query from the original file?
{code}
-- different file format
create table srcpart_orc (key int, value string) partitioned by (ds string, hr 
int) stored as orc;


set hive.exec.dynamic.partition.mode=nonstrict;
set hive.vectorized.execution.enabled=false;
set hive.exec.max.dynamic.partitions=1000;

insert into table srcpart_orc partition (ds, hr) select key, value, ds, hr from 
srcpart;
EXPLAIN select count(*) from srcpart_orc join srcpart_date_hour on 
(srcpart_orc.ds = srcpart_date_hour.ds and srcpart_orc.hr = 
srcpart_date_hour.hr) where srcpart_date_hour.hour = 11 and 
(srcpart_date_hour.`date` = '2008-04-08' or srcpart_date_hour.`date` = 
'2008-04-09');
select count(*) from srcpart_orc join srcpart_date_hour on (srcpart_orc.ds = 
srcpart_date_hour.ds and srcpart_orc.hr = srcpart_date_hour.hr) where 
srcpart_date_hour.hour = 11 and (srcpart_date_hour.`date` = '2008-04-08' or 
srcpart_date_hour.`date` = '2008-04-09');
select count(*) from srcpart where (ds = '2008-04-08' or ds = '2008-04-09') and 
hr = 11;

drop table srcpart_orc;

{code}

> HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT
> -
>
> Key: HIVE-17405
> URL: https://issues.apache.org/jira/browse/HIVE-17405
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-17405.1.patch, HIVE-17405.2.patch, 
> HIVE-17405.3.patch, HIVE-17405.4.patch, HIVE-17405.5.patch, 
> HIVE-17405.6.patch, HIVE-17405.7.patch
>





[jira] [Commented] (HIVE-17405) HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT

2017-08-31 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149920#comment-16149920
 ] 

liyunzhang_intel commented on HIVE-17405:
-

[~lirui]: in TezCompiler, constant propagation runs at the end of 
optimizeOperatorPlan. I think 
{{ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(parseContext)}} is 
not only for DPP; it should benefit the whole plan.
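
To illustrate what the SHORTCUT option does, here is a minimal sketch (my own toy 
model with made-up names, not Hive's ConstantPropagate) of the simplification it 
performs: dropping trivial {{and true}} conjuncts and short-circuiting on 
{{and false}}, without attempting full constant folding:
{code}
// Sketch only: a toy model of ConstantPropagateOption.SHORTCUT semantics.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShortcutSketch {
  static List<String> shortcut(List<String> conjuncts) {
    List<String> kept = new ArrayList<>();
    for (String c : conjuncts) {
      if (c.equals("true")) {
        continue;                       // "and true" is a no-op, drop it
      }
      if (c.equals("false")) {
        return Arrays.asList("false");  // "and false" makes the whole filter false
      }
      kept.add(c);
    }
    return kept;
  }

  public static void main(String[] args) {
    // "true" stands in for the residue left behind once the synthetic join
    // predicate has been replaced during DPP optimization.
    List<String> filter = Arrays.asList("ds is not null", "true");
    System.out.println(shortcut(filter)); // [ds is not null]
  }
}
{code}
This matches the {{and true}} residue described in HIVE-17405, which is why the 
cheap mode should suffice.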

> HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT
> -
>
> Key: HIVE-17405
> URL: https://issues.apache.org/jira/browse/HIVE-17405
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-17405.1.patch, HIVE-17405.2.patch, 
> HIVE-17405.3.patch, HIVE-17405.4.patch, HIVE-17405.5.patch, HIVE-17405.6.patch
>





[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-08-31 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17414:

Attachment: HIVE-17414.1.patch

fixed according to the last round of review

> HoS DPP + Vectorization generates invalid explain plan due to 
> CombineEquivalentWorkResolver
> ---
>
> Key: HIVE-17414
> URL: https://issues.apache.org/jira/browse/HIVE-17414
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang_intel
> Attachments: HIVE-17414.1.patch, HIVE-17414.patch
>

[jira] [Commented] (HIVE-17412) Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q

2017-08-31 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148613#comment-16148613
 ] 

liyunzhang_intel commented on HIVE-17412:
-

[~Ferd]: I think that even if I trigger Hive QA, 
spark_vectorized_dynamic_partition_pruning.q will still fail. After HIVE-17405 
(which is blocked by HIVE-17383) is resolved, 
spark_vectorized_dynamic_partition_pruning will pass.

> Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-17412
> URL: https://issues.apache.org/jira/browse/HIVE-17412
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17412.patch
>
>
> for query
> {code}
>  set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> select distinct ds from srcpart;
> {code}
> the result is 
> {code}
> 2008-04-09
> 2008-04-08
> {code}
> the result of groupby in spark is not in order. Sometimes it returns 
> {code}
> 2008-04-08
> 2008-04-09
> {code}
> Sometimes it returns
> {code}
> 2008-04-09
> 2008-04-08
> {code}





[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-08-30 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17414:

Status: Patch Available  (was: Open)

> HoS DPP + Vectorization generates invalid explain plan due to 
> CombineEquivalentWorkResolver
> ---
>
> Key: HIVE-17414
> URL: https://issues.apache.org/jira/browse/HIVE-17414
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang_intel
> Attachments: HIVE-17414.patch
>
>
> Similar to HIVE-16948, the following query generates an invalid explain plan 
> when HoS DPP is enabled + vectorization:
> {code:sql}
> select ds from (select distinct(ds) as ds from srcpart union all select 
> distinct(ds) as ds from srcpart) s where s.ds in (select max(srcpart.ds) from 
> srcpart union all select min(srcpart.ds) from srcpart)
> {code}
> Explain Plan:
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>   Edges:
> Reducer 11 <- Map 10 (GROUP, 1)
> Reducer 13 <- Map 12 (GROUP, 1)
>  A masked pattern was here 
>   Vertices:
> Map 10
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: max(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Map 12
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: min(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Reducer 11
> Execution mode: vectorized
> Reduce Operator Tree:
>   Group By Operator
> aggregations: max(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE 
> Column stats: NONE
> Filter Operator
>   predicate: _col0 is not null (type: boolean)
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: _col0 (type: string)
>   outputColumnNames: _col0
>   Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
> Spark Partition Pruning Sink Operator
>  

[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-08-30 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17414:

Attachment: HIVE-17414.patch

[~stakiar],[~lirui]: please help review.
Previously we restricted {{clazz}} to exactly "SparkPartitionPruningSinkOperator" 
when calling 
SparkUtilities#collectOp(Collection<Operator<?>> result, Operator<?> root, 
Class<?> clazz), so when VectorSparkPartitionPruningSinkOperator is used, 
HIVE-16948 does not work. The changes in the patch:
{code}
 if (root == null) {
   return;
 }
-if (clazz.equals(root.getClass())) {
+if (clazz.equals(root.getClass()) || clazz.isAssignableFrom(root.getClass())) {
   result.add(root);
 }
{code}
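
A minimal standalone sketch (hypothetical classes of my own, not Hive's operators) 
of why the exact-class check misses the vectorized subclass:
{code}
// Sketch only: Class#equals requires the exact runtime class, while
// Class#isAssignableFrom also matches subclasses.
public class CollectOpSketch {
  static class PruningSink {}
  static class VectorPruningSink extends PruningSink {}

  public static void main(String[] args) {
    Class<?> clazz = PruningSink.class;
    Object root = new VectorPruningSink();

    System.out.println(clazz.equals(root.getClass()));           // false
    System.out.println(clazz.isAssignableFrom(root.getClass())); // true
  }
}
{code}
Note that {{isAssignableFrom}} alone would suffice, since a class is always 
assignable from itself; keeping the {{equals}} check just makes the common 
exact-match case explicit.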


> HoS DPP + Vectorization generates invalid explain plan due to 
> CombineEquivalentWorkResolver
> ---
>
> Key: HIVE-17414
> URL: https://issues.apache.org/jira/browse/HIVE-17414
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang_intel
> Attachments: HIVE-17414.patch
>

[jira] [Commented] (HIVE-17412) Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q

2017-08-30 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148336#comment-16148336
 ] 

liyunzhang_intel commented on HIVE-17412:
-

[~Ferd]: As Xuefu and Sahil have finished the review, can you help commit the 
patch? Thanks. The reason why I triggered Hive QA is that HIVE-17405 will update 
the other change in spark_vectorized_dynamic_partition_pruning.q.out.

> Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-17412
> URL: https://issues.apache.org/jira/browse/HIVE-17412
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17412.patch
>





[jira] [Assigned] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-08-30 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel reassigned HIVE-17414:
---

Assignee: liyunzhang_intel

> HoS DPP + Vectorization generates invalid explain plan due to 
> CombineEquivalentWorkResolver
> ---
>
> Key: HIVE-17414
> URL: https://issues.apache.org/jira/browse/HIVE-17414
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang_intel
>

[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q

2017-08-30 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148125#comment-16148125
 ] 

liyunzhang_intel commented on HIVE-16823:
-

let's fix spark_vectorized_dynamic_partition_pruning.q in HIVE-17405 after 
HIVE-17383 is resolved, although spark_vectorized_dynamic_partition_pruning.q is 
not the target of HIVE-17405.

> "ArrayIndexOutOfBoundsException" in 
> spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-16823
> URL: https://issues.apache.org/jira/browse/HIVE-16823
> Project: Hive
>  Issue Type: Bug
>Reporter: Jianguo Tian
>Assignee: liyunzhang_intel
> Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, 
> HIVE-16823.patch
>
>
> spark_vectorized_dynamic_partition_pruning.q
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> -- parent is reduce tasks
> select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart 
> group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
> {code}
> The exceptions are as follows:
> {code}
> 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] 
> spark.SparkReduceRecordHandler: Fatal error: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
> ~[scala-library-2.11.8.jar:?]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:85) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [?:1.8.0_112]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [?:1.8.0_112]
>   at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:832)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeBase.processBatch(VectorGroupByOperator.java:179)
>  

[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q

2017-08-30 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148105#comment-16148105
 ] 

liyunzhang_intel commented on HIVE-16823:
-

[~stakiar]: {quote}
Maybe a follow up JIRA would be to see what happens when we run 
{{ConstantPropagate()}} at the end of SparkCompiler#optimizeOperatorPlan? 
Theoretically, it should improve performance? But sounds like there are some 
bugs we need to address before getting to that stage.
{quote}

Are there any unit test failures if we put the following code at the end of 
SparkCompiler#optimizeOperatorPlan?
{code}
if (procCtx.conf.getBoolVar(ConfVars.HIVEOPTCONSTANTPROPAGATION)) {
  new ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(procCtx.parseContext);
}
{code}

I think it is better to put it at the end of SparkCompiler#optimizeOperatorPlan 
than in runDynamicPartitionPruning; this is not related to DPP, the bug was just 
found in a DPP unit test. Besides, why should it improve performance? If you know, 
please tell me, thanks!

> "ArrayIndexOutOfBoundsException" in 
> spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-16823
> URL: https://issues.apache.org/jira/browse/HIVE-16823
> Project: Hive
>  Issue Type: Bug
>Reporter: Jianguo Tian
>Assignee: liyunzhang_intel
> Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, 
> HIVE-16823.patch
>

[jira] [Updated] (HIVE-17412) Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q

2017-08-29 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17412:

Attachment: HIVE-17412.patch

[~stakiar], [~lirui]: Please help review, thanks!

> Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-17412
> URL: https://issues.apache.org/jira/browse/HIVE-17412
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17412.patch
>





[jira] [Assigned] (HIVE-17412) Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q

2017-08-29 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel reassigned HIVE-17412:
---


> Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-17412
> URL: https://issues.apache.org/jira/browse/HIVE-17412
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
>





[jira] [Updated] (HIVE-17407) TPC-DS/query65 hangs on HoS in certain settings

2017-08-29 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17407:

Description: 
[TPC-DS/query65.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query65.sql]
 hangs when using the following settings at 3TB scale.
{code}
set hive.auto.convert.join.noconditionaltask.size=300;
{code}
The explain is attached as 
[explain65|https://issues.apache.org/jira/secure/attachment/12884210/explain.65]. 
The 
[screenshot|https://issues.apache.org/jira/secure/attachment/12884209/hang.PNG] 
shows that it hung in Stage5.

Let's explain why it hangs.
{code}
   Reducer 10 <- Map 9 (GROUP, 1009)
Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 1), Map 5 (PARTITION-LEVEL 
SORT, 1), Reducer 7 (PARTITION-LEVEL SORT, 1)
Reducer 3 <- Reducer 10 (PARTITION-LEVEL SORT, 1009), Reducer 2 
(PARTITION-LEVEL SORT, 1009)
Reducer 4 <- Reducer 3 (SORT, 1)
Reducer 7 <- Map 6 (GROUP PARTITION-LEVEL SORT, 1009)
{code}

The numPartitions of SparkEdgeProperty which connects Reducer 2 and Reducer 3 
is 1. This is because 
org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils#createReduceWork
{code}
public ReduceWork createReduceWork(GenSparkProcContext context,
    Operator<?> root, SparkWork sparkWork) throws SemanticException {
  ...
  for (Operator<? extends OperatorDesc> parentOfRoot : root.getParentOperators()) {
    Preconditions.checkArgument(parentOfRoot instanceof ReduceSinkOperator,
        "AssertionError: expected parentOfRoot to be an "
        + "instance of ReduceSinkOperator, but was "
        + parentOfRoot.getClass().getName());
    ReduceSinkOperator reduceSink = (ReduceSinkOperator) parentOfRoot;
    maxExecutors = Math.max(maxExecutors, reduceSink.getConf().getNumReducers());
  }
  reduceWork.setNumReduceTasks(maxExecutors);

{code}
Here the numReducers of every parentOfRoot is 1 (in the explain, the parallelism 
of Map 1, Map 5 and Reducer 7 is 1), so the numPartitions of the SparkEdgeProperty 
which connects Reducer 2 and Reducer 3 is 1. 
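
To make the effect of that rule concrete, here is a minimal standalone sketch (hypothetical Java, not the actual Hive classes) of how taking the maximum numReducers over the parent ReduceSinkOperators yields a single-partition edge when every parent reports 1:
{code}
// Minimal sketch (hypothetical, not Hive code) of the rule quoted above:
// the reduce work takes the maximum numReducers over its parents.
public class EdgeParallelismSketch {
    public static void main(String[] args) {
        // numReducers reported by Map 1, Map 5 and Reducer 7 in this plan:
        int[] parentNumReducers = {1, 1, 1};
        int maxExecutors = Integer.MIN_VALUE;
        for (int numReducers : parentNumReducers) {
            maxExecutors = Math.max(maxExecutors, numReducers);
        }
        // reduceWork.setNumReduceTasks(maxExecutors) => a single task must
        // shuffle all the data between Reducer 2 and Reducer 3.
        System.out.println("numReduceTasks = " + maxExecutors);
    }
}
{code}
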
To explain further why the parallelism of Map 1, Map 5 and Reducer 7 is 1: the 
physical plan of the query is 
{code}
TS[0]-FIL[50]-RS[2]-JOIN[5]-FIL[49]-SEL[7]-GBY[8]-RS[9]-GBY[10]-SEL[11]-GBY[15]-SEL[16]-RS[33]-JOIN[34]-RS[36]-JOIN[39]-FIL[48]-SEL[41]-RS[42]-SEL[43]-LIM[44]-FS[45]
TS[1]-FIL[51]-RS[4]-JOIN[5]
TS[17]-FIL[53]-RS[19]-JOIN[22]-FIL[52]-SEL[24]-GBY[25]-RS[26]-GBY[27]-RS[38]-JOIN[39]
TS[18]-FIL[54]-RS[21]-JOIN[22]
TS[29]-FIL[55]-RS[31]-JOIN[34]
TS[30]-FIL[56]-RS[32]-JOIN[34]
{code}
The RSs related to Map 1, Map 5 and Reducer 7 are RS\[31\], RS\[32\] and RS\[33\]. 
Their parallelism is set by 
[SemanticAnalyzer#genJoinReduceSinkChild|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L8267].
It seems that there is no logical error in the code, but it is not reasonable 
to use a single task to deal with such big data (more than 30GB). Is there any 
way to make the query pass in this situation? (The reason why I set 
hive.auto.convert.join.noconditionaltask.size to 300 is that if the join is 
converted to a map join, it throws a disk error.)

  was:
[TPC-DS/query65.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query65.sql] hangs when using the following settings at 3TB scale.
{code}
set hive.auto.convert.join.noconditionaltask.size=300;
{code}
The explain is attached as 
[explain65|https://issues.apache.org/jira/secure/attachment/12884210/explain.65]. 
The [screenshot|https://issues.apache.org/jira/secure/attachment/12884209/hang.PNG] 
shows that it hung in Stage-5.

Let me explain why it hangs.
{code}
   Reducer 10 <- Map 9 (GROUP, 1009)
Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 1), Map 5 (PARTITION-LEVEL 
SORT, 1), Reducer 7 (PARTITION-LEVEL SORT, 1)
Reducer 3 <- Reducer 10 (PARTITION-LEVEL SORT, 1009), Reducer 2 
(PARTITION-LEVEL SORT, 1009)
Reducer 4 <- Reducer 3 (SORT, 1)
Reducer 7 <- Map 6 (GROUP PARTITION-LEVEL SORT, 1009)
{code}

The numPartitions of SparkEdgeProperty which connects Reducer 2 and Reducer 3 
is 1. This is because 
org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils#createReduceWork
{code}
public ReduceWork createReduceWork(GenSparkProcContext context, Operator root,
    SparkWork sparkWork) throws SemanticException {
  // ...
  for (Operator parentOfRoot : root.getParentOperators()) {
    Preconditions.checkArgument(parentOfRoot instanceof ReduceSinkOperator,
        "AssertionError: expected parentOfRoot to be an "
            + "instance of ReduceSinkOperator, but was "
            + parentOfRoot.getClass().getName());
    ReduceSinkOperator reduceSink = (ReduceSinkOperator) parentOfRoot;
    maxExecutors = Math.max(maxExecutors, reduceSink.getConf().getNumReducers());
  }
  reduceWork.setNumReduceTasks(maxExecutors);
  // ...
{code}
Here the numReducers of every parentOfRoot is 1 (in the explain, the parallelism 
of Map 1, 

[jira] [Updated] (HIVE-17407) TPC-DS/query65 hangs on HoS in certain settings

2017-08-29 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17407:

Attachment: explain.65
hang.PNG

> TPC-DS/query65 hangs on HoS in certain settings
> ---
>
> Key: HIVE-17407
> URL: https://issues.apache.org/jira/browse/HIVE-17407
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
> Attachments: explain.65, hang.PNG
>
>
> [TPC-DS/query65.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query65.sql]
>  hangs when using the following settings at 3TB scale.
> {code}
> set hive.auto.convert.join.noconditionaltask.size=300;
> {code}
>   The explain is attached as explain65. The screenshot shows that it hung 
> in Stage-5.
> Let me explain why it hangs.
> {code}
>Reducer 10 <- Map 9 (GROUP, 1009)
> Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 1), Map 5 (PARTITION-LEVEL 
> SORT, 1), Reducer 7 (PARTITION-LEVEL SORT, 1)
> Reducer 3 <- Reducer 10 (PARTITION-LEVEL SORT, 1009), Reducer 2 
> (PARTITION-LEVEL SORT, 1009)
> Reducer 4 <- Reducer 3 (SORT, 1)
> Reducer 7 <- Map 6 (GROUP PARTITION-LEVEL SORT, 1009)
> {code}
> The numPartitions of SparkEdgeProperty which connects Reducer 2 and Reducer 3 
> is 1. This is because 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils#createReduceWork
> {code}
> public ReduceWork createReduceWork(GenSparkProcContext context, Operator root,
>     SparkWork sparkWork) throws SemanticException {
>   // ...
>   for (Operator parentOfRoot : root.getParentOperators()) {
>     Preconditions.checkArgument(parentOfRoot instanceof ReduceSinkOperator,
>         "AssertionError: expected parentOfRoot to be an "
>             + "instance of ReduceSinkOperator, but was "
>             + parentOfRoot.getClass().getName());
>     ReduceSinkOperator reduceSink = (ReduceSinkOperator) parentOfRoot;
>     maxExecutors = Math.max(maxExecutors, reduceSink.getConf().getNumReducers());
>   }
>   reduceWork.setNumReduceTasks(maxExecutors);
>   // ...
> {code}
> Here the numReducers of every parentOfRoot is 1 (in the explain, the 
> parallelism of Map 1, Map 5 and Reducer 7 is 1), so the numPartitions of 
> the SparkEdgeProperty which connects Reducer 2 and Reducer 3 is 1. 
> To explain further why the parallelism of Map 1, Map 5 and Reducer 7 is 1: 
> the physical plan of the query is 
> {code}
> TS[0]-FIL[50]-RS[2]-JOIN[5]-FIL[49]-SEL[7]-GBY[8]-RS[9]-GBY[10]-SEL[11]-GBY[15]-SEL[16]-RS[33]-JOIN[34]-RS[36]-JOIN[39]-FIL[48]-SEL[41]-RS[42]-SEL[43]-LIM[44]-FS[45]
> TS[1]-FIL[51]-RS[4]-JOIN[5]
> TS[17]-FIL[53]-RS[19]-JOIN[22]-FIL[52]-SEL[24]-GBY[25]-RS[26]-GBY[27]-RS[38]-JOIN[39]
> TS[18]-FIL[54]-RS[21]-JOIN[22]
> TS[29]-FIL[55]-RS[31]-JOIN[34]
> TS[30]-FIL[56]-RS[32]-JOIN[34]
> {code}
> The RSs related to Map 1, Map 5 and Reducer 7 are RS\[31\], RS\[32\] and 
> RS\[33\]. Their parallelism is set by 
> [SemanticAnalyzer#genJoinReduceSinkChild|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L8267].
> It seems that there is no logical error in the code, but it is not reasonable 
> to use a single task to deal with such big data (more than 30GB). Is there 
> any way to make the query pass in this situation? (The reason why I set 
> hive.auto.convert.join.noconditionaltask.size to 300 is that if the join is 
> converted to a map join, it throws a disk error.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17407) TPC-DS/query65 hangs on HoS in certain settings

2017-08-29 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17407:

Description: 
[TPC-DS/query65.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query65.sql] hangs when using the following settings at 3TB scale.
{code}
set hive.auto.convert.join.noconditionaltask.size=300;
{code}
The explain is attached as 
[explain65|https://issues.apache.org/jira/secure/attachment/12884210/explain.65]. 
The [screenshot|https://issues.apache.org/jira/secure/attachment/12884209/hang.PNG] 
shows that it hung in Stage-5.

Let me explain why it hangs.
{code}
   Reducer 10 <- Map 9 (GROUP, 1009)
Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 1), Map 5 (PARTITION-LEVEL 
SORT, 1), Reducer 7 (PARTITION-LEVEL SORT, 1)
Reducer 3 <- Reducer 10 (PARTITION-LEVEL SORT, 1009), Reducer 2 
(PARTITION-LEVEL SORT, 1009)
Reducer 4 <- Reducer 3 (SORT, 1)
Reducer 7 <- Map 6 (GROUP PARTITION-LEVEL SORT, 1009)
{code}

The numPartitions of SparkEdgeProperty which connects Reducer 2 and Reducer 3 
is 1. This is because 
org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils#createReduceWork
{code}
public ReduceWork createReduceWork(GenSparkProcContext context, Operator root,
    SparkWork sparkWork) throws SemanticException {
  // ...
  for (Operator parentOfRoot : root.getParentOperators()) {
    Preconditions.checkArgument(parentOfRoot instanceof ReduceSinkOperator,
        "AssertionError: expected parentOfRoot to be an "
            + "instance of ReduceSinkOperator, but was "
            + parentOfRoot.getClass().getName());
    ReduceSinkOperator reduceSink = (ReduceSinkOperator) parentOfRoot;
    maxExecutors = Math.max(maxExecutors, reduceSink.getConf().getNumReducers());
  }
  reduceWork.setNumReduceTasks(maxExecutors);
  // ...
{code}
Here the numReducers of every parentOfRoot is 1 (in the explain, the parallelism 
of Map 1, Map 5 and Reducer 7 is 1), so the numPartitions of the SparkEdgeProperty 
which connects Reducer 2 and Reducer 3 is 1. 
To explain further why the parallelism of Map 1, Map 5 and Reducer 7 is 1: the 
physical plan of the query is 
{code}
TS[0]-FIL[50]-RS[2]-JOIN[5]-FIL[49]-SEL[7]-GBY[8]-RS[9]-GBY[10]-SEL[11]-GBY[15]-SEL[16]-RS[33]-JOIN[34]-RS[36]-JOIN[39]-FIL[48]-SEL[41]-RS[42]-SEL[43]-LIM[44]-FS[45]
TS[1]-FIL[51]-RS[4]-JOIN[5]
TS[17]-FIL[53]-RS[19]-JOIN[22]-FIL[52]-SEL[24]-GBY[25]-RS[26]-GBY[27]-RS[38]-JOIN[39]
TS[18]-FIL[54]-RS[21]-JOIN[22]
TS[29]-FIL[55]-RS[31]-JOIN[34]
TS[30]-FIL[56]-RS[32]-JOIN[34]
{code}
The RSs related to Map 1, Map 5 and Reducer 7 are RS\[31\], RS\[32\] and RS\[33\]. 
Their parallelism is set by 
[SemanticAnalyzer#genJoinReduceSinkChild|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L8267].
It seems that there is no logical error in the code, but it is not reasonable 
to use a single task to deal with such big data (more than 30GB). Is there any 
way to make the query pass in this situation? (The reason why I set 
hive.auto.convert.join.noconditionaltask.size to 300 is that if the join is 
converted to a map join, it throws a disk error.)

  was:
[TPC-DS/query65.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query65.sql] hangs when using the following settings at 3TB scale.
{code}
set hive.auto.convert.join.noconditionaltask.size=300;
{code}
The explain is attached as explain65. The screenshot shows that it hung in 
Stage-5.

Let me explain why it hangs.
{code}
   Reducer 10 <- Map 9 (GROUP, 1009)
Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 1), Map 5 (PARTITION-LEVEL 
SORT, 1), Reducer 7 (PARTITION-LEVEL SORT, 1)
Reducer 3 <- Reducer 10 (PARTITION-LEVEL SORT, 1009), Reducer 2 
(PARTITION-LEVEL SORT, 1009)
Reducer 4 <- Reducer 3 (SORT, 1)
Reducer 7 <- Map 6 (GROUP PARTITION-LEVEL SORT, 1009)
{code}

The numPartitions of SparkEdgeProperty which connects Reducer 2 and Reducer 3 
is 1. This is because 
org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils#createReduceWork
{code}
public ReduceWork createReduceWork(GenSparkProcContext context, Operator root,
    SparkWork sparkWork) throws SemanticException {
  // ...
  for (Operator parentOfRoot : root.getParentOperators()) {
    Preconditions.checkArgument(parentOfRoot instanceof ReduceSinkOperator,
        "AssertionError: expected parentOfRoot to be an "
            + "instance of ReduceSinkOperator, but was "
            + parentOfRoot.getClass().getName());
    ReduceSinkOperator reduceSink = (ReduceSinkOperator) parentOfRoot;
    maxExecutors = Math.max(maxExecutors, reduceSink.getConf().getNumReducers());
  }
  reduceWork.setNumReduceTasks(maxExecutors);
  // ...
{code}
Here the numReducers of every parentOfRoot is 1 (in the explain, the parallelism 
of Map 1, Map 5 and Reducer 7 is 1), so the numPartitions of the SparkEdgeProperty 
which connects Reducer 2 and Reducer 3 is 1. 
To explain further why the 

[jira] [Commented] (HIVE-17383) ArrayIndexOutOfBoundsException in VectorGroupByOperator

2017-08-28 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143467#comment-16143467
 ] 

liyunzhang_intel commented on HIVE-17383:
-

[~lirui]: after enabling vectorization, it throws an ArrayIndexOutOfBoundsException.
query
{code}
set hive.cbo.enable=false;
set hive.user.install.directory=file:///tmp;
set fs.default.name=file:///;
set fs.defaultFS=file:///;
set tez.staging-dir=/tmp;
set tez.ignore.lib.uris=true;
set tez.runtime.optimize.local.fetch=true;
set tez.local.mode=true;
set hive.explain.user=false;
set hive.vectorized.execution.enabled=true;
select count(*) from (select key from src group by key) s where s.key='98';
{code}
the explain
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
Tez
  DagId: root_20170828025707_7b882df3-3e96-47f0-b189-9b6919d44512:1
  Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)
  DagName: root_20170828025707_7b882df3-3e96-47f0-b189-9b6919d44512:1
  Vertices:
Map 1 
Map Operator Tree:
TableScan
  alias: src
  Statistics: Num rows: 2906 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Filter Operator
predicate: (key = '98') (type: boolean)
Statistics: Num rows: 1453 Data size: 2906 Basic stats: 
COMPLETE Column stats: NONE
Select Operator
  Statistics: Num rows: 1453 Data size: 2906 Basic stats: 
COMPLETE Column stats: NONE
  Group By Operator
keys: '98' (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1453 Data size: 2906 Basic stats: 
COMPLETE Column stats: NONE
Reduce Output Operator
  key expressions: '98' (type: string)
  sort order: +
  Map-reduce partition columns: '98' (type: string)
  Statistics: Num rows: 1453 Data size: 2906 Basic 
stats: COMPLETE Column stats: NONE
Execution mode: vectorized
Reducer 2 
Execution mode: vectorized
Reduce Operator Tree:
  Group By Operator
keys: '98' (type: string)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 726 Data size: 1452 Basic stats: COMPLETE 
Column stats: NONE
Select Operator
  Statistics: Num rows: 726 Data size: 1452 Basic stats: 
COMPLETE Column stats: NONE
  Group By Operator
aggregations: count()
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
Reduce Output Operator
  sort order: 
  Statistics: Num rows: 1 Data size: 8 Basic stats: 
COMPLETE Column stats: NONE
  value expressions: _col0 (type: bigint)
Reducer 3 
Execution mode: vectorized
Reduce Operator Tree:
  Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
File Output Operator
  compressed: false
  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
  table:
  input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
ListSink
{code}

> ArrayIndexOutOfBoundsException in VectorGroupByOperator
> ---
>
> Key: HIVE-17383
> URL: https://issues.apache.org/jira/browse/HIVE-17383
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>
> Query to reproduce:
> {noformat}
> set hive.cbo.enable=false;
> select count(*) from (select key from src group by key) s where s.key='98';
> {noformat}
> The stack trace is:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107)
>   at 
> 

[jira] [Commented] (HIVE-17383) ArrayIndexOutOfBoundsException in VectorGroupByOperator

2017-08-28 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143439#comment-16143439
 ] 

liyunzhang_intel commented on HIVE-17383:
-

[~lirui]: this passes on the latest master (6be50b7) in my Tez env. If there is 
something wrong with the configuration, please tell me!
query
{code}
set hive.cbo.enable=false;
set hive.user.install.directory=file:///tmp;
set fs.default.name=file:///;
set fs.defaultFS=file:///;
set tez.staging-dir=/tmp;
set tez.ignore.lib.uris=true;
set tez.runtime.optimize.local.fetch=true;
set tez.local.mode=true;
set hive.explain.user=false;
explain select count(*) from (select key from src group by key) s where 
s.key='98';
{code}
explain
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
Tez
  DagId: root_20170828023743_be3df7bf-49cc-4c71-a4a7-25814558804c:1
  Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)
  DagName: root_20170828023743_be3df7bf-49cc-4c71-a4a7-25814558804c:1
  Vertices:
Map 1 
Map Operator Tree:
TableScan
  alias: src
  Statistics: Num rows: 2906 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Filter Operator
predicate: (key = '98') (type: boolean)
Statistics: Num rows: 1453 Data size: 2906 Basic stats: 
COMPLETE Column stats: NONE
Select Operator
  Statistics: Num rows: 1453 Data size: 2906 Basic stats: 
COMPLETE Column stats: NONE
  Group By Operator
keys: '98' (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1453 Data size: 2906 Basic stats: 
COMPLETE Column stats: NONE
Reduce Output Operator
  key expressions: '98' (type: string)
  sort order: +
  Map-reduce partition columns: '98' (type: string)
  Statistics: Num rows: 1453 Data size: 2906 Basic 
stats: COMPLETE Column stats: NONE
Reducer 2 
Reduce Operator Tree:
  Group By Operator
keys: '98' (type: string)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 726 Data size: 1452 Basic stats: COMPLETE 
Column stats: NONE
Select Operator
  Statistics: Num rows: 726 Data size: 1452 Basic stats: 
COMPLETE Column stats: NONE
  Group By Operator
aggregations: count()
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
Reduce Output Operator
  sort order: 
  Statistics: Num rows: 1 Data size: 8 Basic stats: 
COMPLETE Column stats: NONE
  value expressions: _col0 (type: bigint)
Reducer 3 
Reduce Operator Tree:
  Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
File Output Operator
  compressed: false
  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
  table:
  input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
ListSink

{code}

> ArrayIndexOutOfBoundsException in VectorGroupByOperator
> ---
>
> Key: HIVE-17383
> URL: https://issues.apache.org/jira/browse/HIVE-17383
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>
> Query to reproduce:
> {noformat}
> set hive.cbo.enable=false;
> select count(*) from (select key from src group by key) s where s.key='98';
> {noformat}
> The stack trace is:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:831)
>   at 
> 

[jira] [Comment Edited] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q

2017-08-28 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143397#comment-16143397
 ] 

liyunzhang_intel edited comment on HIVE-16823 at 8/28/17 6:06 AM:
--

[~lirui]: can you help review the patch?
I have one question about {{spark_vectorized_dynamic_partition_pruning.q}}: 
should we add {{-- SORT_QUERY_RESULTS}} to the file? Otherwise, 
the result of 
{code}
select distinct ds from srcpart
{code}
{code}
2008-04-09  
2008-04-08
{code}

while the result in the q.out is
{code}
2008-04-08  
2008-04-09
{code}
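
For illustration, here is a minimal sketch (a hypothetical helper, not the actual QTestUtil implementation) of what {{-- SORT_QUERY_RESULTS}} achieves: sort both the actual and the expected rows before diffing, so a nondeterministic group-by order cannot fail the comparison.
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not the real q-file test harness: compare result
// sets after sorting both sides.
public class SortQueryResultsSketch {
    static boolean sameResults(List<String> actual, List<String> expected) {
        List<String> a = new ArrayList<>(actual);
        List<String> e = new ArrayList<>(expected);
        Collections.sort(a);
        Collections.sort(e);
        return a.equals(e);
    }

    public static void main(String[] args) {
        // The two orders observed for "select distinct ds from srcpart":
        List<String> run = Arrays.asList("2008-04-09", "2008-04-08");
        List<String> qOut = Arrays.asList("2008-04-08", "2008-04-09");
        System.out.println(sameResults(run, qOut)); // prints: true
    }
}
{code}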


was (Author: kellyzly):
[~lirui]: can you help review the patch?
I have one question about {{spark_vectorized_dynamic_partition_pruning.q}}: 
should we add {{-- SORT_QUERY_RESULTS}} to the file? Otherwise, in the q.out 
the result of 
{code}
select distinct ds from srcpart
{code}
{code}
2008-04-09  
2008-04-08
{code}


> "ArrayIndexOutOfBoundsException" in 
> spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-16823
> URL: https://issues.apache.org/jira/browse/HIVE-16823
> Project: Hive
>  Issue Type: Bug
>Reporter: Jianguo Tian
>Assignee: liyunzhang_intel
> Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, 
> HIVE-16823.patch
>
>
> spark_vectorized_dynamic_partition_pruning.q
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> -- parent is reduce tasks
> select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart 
> group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
> {code}
> The exceptions are as follows:
> {code}
> 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] 
> spark.SparkReduceRecordHandler: Fatal error: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
> ~[scala-library-2.11.8.jar:?]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:85) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [?:1.8.0_112]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [?:1.8.0_112]
>   at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1

[jira] [Commented] (HIVE-17383) ArrayIndexOutOfBoundsException in VectorGroupByOperator

2017-08-28 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143405#comment-16143405
 ] 

liyunzhang_intel commented on HIVE-17383:
-

[~lirui]: can you help verify whether the ArrayIndexOutOfBoundsException appears 
for the above query? In my env (hive version: f86878b) no similar exception is 
thrown and the query passes. If an RS follows the GBY, the exception is not 
thrown.


> ArrayIndexOutOfBoundsException in VectorGroupByOperator
> ---
>
> Key: HIVE-17383
> URL: https://issues.apache.org/jira/browse/HIVE-17383
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>
> Query to reproduce:
> {noformat}
> set hive.cbo.enable=false;
> select count(*) from (select key from src group by key) s where s.key='98';
> {noformat}
> The stack trace is:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:831)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeBase.processBatch(VectorGroupByOperator.java:174)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.process(VectorGroupByOperator.java:1046)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.processVectorGroup(ReduceRecordSource.java:462)
>   ... 18 more
> {noformat}
> More details can be found in HIVE-16823



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q

2017-08-27 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143397#comment-16143397
 ] 

liyunzhang_intel commented on HIVE-16823:
-

[~lirui]: can you help review the patch?
I have one question about {{spark_vectorized_dynamic_partition_pruning.q}}: 
should we add {{-- SORT_QUERY_RESULTS}} to the file? Otherwise, in the q.out 
the result of 
{code}
select distinct ds from srcpart
{code}
{code}
2008-04-09  
2008-04-08
{code}


> "ArrayIndexOutOfBoundsException" in 
> spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-16823
> URL: https://issues.apache.org/jira/browse/HIVE-16823
> Project: Hive
>  Issue Type: Bug
>Reporter: Jianguo Tian
>Assignee: liyunzhang_intel
> Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, 
> HIVE-16823.patch
>
>
> spark_vectorized_dynamic_partition_pruning.q
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> -- parent is reduce tasks
> select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart 
> group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
> {code}
> The exceptions are as follows:
> {code}
> 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] 
> spark.SparkReduceRecordHandler: Fatal error: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
> ~[scala-library-2.11.8.jar:?]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:85) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [?:1.8.0_112]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [?:1.8.0_112]
>   at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:832)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> 

[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q

2017-08-25 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141446#comment-16141446
 ] 

liyunzhang_intel commented on HIVE-16823:
-

Some updates.
{quote}
This is why the query runs if map join is disabled, in which case GBY is 
followed by SEL/RS instead of SparkHashTableSinkOperator.
{quote}
More explanation about this:
{code}
set spark.master=local;
set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=true;
set hive.spark.dynamic.partition.pruning=false;
set hive.optimize.metadataonly=false;
set hive.optimize.index.filter=true;
set hive.vectorized.execution.enabled=true;
set hive.strict.checks.cartesian.product=false;
set hive.auto.convert.join=false;
set hive.cbo.enable=false;
set hive.optimize.constant.propagation=true;
select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart 
group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
{code}
With cbo disabled the explain is not right: the key of the GroupBy in the Reducer 
is {{keys: '2008-04-08' (type: string)}} while it should be {{keys: KEY._col0 
(type: string)}}, yet the query still finishes successfully. The reason is that 
there is an {{RS\[9\]}} after {{GBY\[4\]}}: 
{code}
GBY[4]-SEL[5]-RS[9]
{code}

Vectorizing {{RS\[9\]}} goes through the following stack, in which 
OutputColumnManager#allocateOutputColumn makes 
OutputColumnManager#getScratchColumnTypeNames return a non-empty value.
{code}
org.apache.hadoop.hive.ql.exec.vector.VectorizationContext$OutputColumnManager.allocateOutputColumn(VectorizationContext.java:478)
  at 
org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getConstantVectorExpression(VectorizationContext.java:1153)
  at 
org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getVectorExpression(VectorizationContext.java:688)
  at 
org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getVectorExpressions(VectorizationContext.java:590)
  at 
org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getVectorExpressions(VectorizationContext.java:578)
  at 
org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.canSpecializeReduceSink(Vectorizer.java:3490)
  at 
org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.vectorizeOperator(Vectorizer.java:4174)
  at 
org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationNodeProcessor.doVectorize(Vectorizer.java:1632)
  at 
org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$ReduceWorkVectorizationNodeProcessor.process(Vectorizer.java:1772)
{code}

In the log, we can see that after VectorizationNodeProcessor#doVectorize handles 
{{GBY\[4\]}}, the vectorization context is 
{code}
2017-08-25T03:40:21,316 DEBUG [cd30697a-7797-4bbe-ad92-1fcec8a89689 main] 
physical.Vectorizer: Vectorized ReduceWork reduce shuffle vectorization context 
Context name __Reduce_Shuffle__, level 0, sorted projectionColumnMap 
{0=KEY._col0}, scratchColumnTypeNames []
{code}
After VectorizationNodeProcessor#doVectorize handles {{RS\[9\]}}, the 
vectorization context is (here scratchColumnTypeNames returns a value): 
{code}
2017-08-25T03:48:00,245 DEBUG [cd30697a-7797-4bbe-ad92-1fcec8a89689 main] 
physical.Vectorizer: vectorizeOperator 
org.apache.hadoop.hive.ql.plan.ReduceSinkDesc
2017-08-25T03:48:43,101 DEBUG [cd30697a-7797-4bbe-ad92-1fcec8a89689 main] 
physical.Vectorizer: Vectorized ReduceWork operator RS added vectorization 
context Context name SEL, level 1, sorted projectionColumnMap {}, 
scratchColumnTypeNames [string]
{code}

The difference in scratchColumnTypeNames leads to a different outputBatch in 
[VectorGroupKeyHelper|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupKeyHelper.java#L107].
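
To make the failure mode concrete, here is a toy sketch (hypothetical plain arrays, not the actual VectorizedRowBatch/VectorGroupKeyHelper classes) of why an empty scratchColumnTypeNames ends in {{ArrayIndexOutOfBoundsException: 1}}: the group key is copied into an output column whose index was never allocated.
{code}
// Toy illustration (hypothetical, not the actual Hive vectorization classes):
// with scratchColumnTypeNames == [], the output batch holds only the projected
// key column at index 0, but the merge-partial group-by copies the key into
// column index 1, which exists only when a scratch column was allocated.
public class ScratchColumnSketch {
    public static void main(String[] args) {
        // Output batch with a single allocated column (no scratch columns).
        String[][] outputBatchColumns = new String[1][];
        outputBatchColumns[0] = new String[1024];

        // copyGroupKey-style write into the missing scratch column.
        int scratchColumnIndex = 1;
        outputBatchColumns[scratchColumnIndex][0] = "2008-04-08";
        // -> java.lang.ArrayIndexOutOfBoundsException: 1, as in the trace.
    }
}
{code}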

I guess the ArrayIndexOutOfBoundsException can be reproduced under the following 
conditions, whether in Spark or Tez mode:
1. cbo is disabled
2. there is no RS following the GBY in the reducer


> "ArrayIndexOutOfBoundsException" in 
> spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-16823
> URL: https://issues.apache.org/jira/browse/HIVE-16823
> Project: Hive
>  Issue Type: Bug
>Reporter: Jianguo Tian
>Assignee: liyunzhang_intel
> Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, 
> HIVE-16823.patch
>
>
> spark_vectorized_dynamic_partition_pruning.q
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> -- parent is reduce tasks
> select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart 
> group by ds) s on (srcpart.ds = s.ds) where s.`date` = 

[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q

2017-08-24 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139655#comment-16139655
 ] 

liyunzhang_intel commented on HIVE-16823:
-

[~lirui]: although ConstantPropagate influences the logical plan, Hive on Tez 
does not throw the exception.

{code}
set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=true;
set hive.tez.dynamic.partition.pruning=true;
set hive.optimize.metadataonly=false;
set hive.optimize.index.filter=true;
set hive.vectorized.execution.enabled=true;
set hive.strict.checks.cartesian.product=false;
set hive.cbo.enable=false;
set hive.user.install.directory=file:///tmp;
set fs.default.name=file:///;
set fs.defaultFS=file:///;
set tez.staging-dir=/tmp;
set tez.ignore.lib.uris=true;
set tez.runtime.optimize.local.fetch=true;
set tez.local.mode=true;
set hive.explain.user=false;
select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart 
group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
{code}

The explain (it seems the key of the GroupByOperator is not right):
{code}
 Reducer 2 
Execution mode: vectorized
Reduce Operator Tree:
  Group By Operator
keys: '2008-04-08' (type: string)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE 
Column stats: NONE
Select Operator
  Statistics: Num rows: 1 Data size: 11624 Basic stats: 
COMPLETE Column stats: NONE
  Map Join Operator
{code}

I need more time to investigate why Tez is not affected when cbo is disabled. 
But I guess this is another problem; any suggestions?

> "ArrayIndexOutOfBoundsException" in 
> spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-16823
> URL: https://issues.apache.org/jira/browse/HIVE-16823
> Project: Hive
>  Issue Type: Bug
>Reporter: Jianguo Tian
>Assignee: liyunzhang_intel
> Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, 
> HIVE-16823.patch
>
>
> spark_vectorized_dynamic_partition_pruning.q
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> -- parent is reduce tasks
> select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart 
> group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
> {code}
> The exceptions are as follows:
> {code}
> 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] 
> spark.SparkReduceRecordHandler: Fatal error: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
> ~[scala-library-2.11.8.jar:?]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> 

[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q

2017-08-23 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139569#comment-16139569
 ] 

liyunzhang_intel commented on HIVE-16823:
-

[~lirui]: I found that if cbo is enabled with your settings, everything works fine.
{code}
set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=true;
set hive.spark.dynamic.partition.pruning=false;
set hive.optimize.metadataonly=false;
set hive.optimize.index.filter=true;
set hive.vectorized.execution.enabled=true;
set hive.strict.checks.cartesian.product=false;
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask = true;
set hive.auto.convert.join.noconditionaltask.size = 1000;
set hive.optimize.constant.propagation=true;
{code}

when enabling cbo, the explain is
{code}
 Map 3 
    Map Operator Tree:
        TableScan
          alias: srcpart
          filterExpr: (ds = '2008-04-08') (type: boolean)
          Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats: NONE
          Select Operator
            Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats: NONE
            Group By Operator
              keys: '2008-04-08' (type: string)
              mode: hash
              outputColumnNames: _col0
              Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: string)
                sort order: +
                Map-reduce partition columns: _col0 (type: string)
                Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
    Execution mode: vectorized
 Reducer 4 
    Execution mode: vectorized
    Local Work:
      Map Reduce Local Work
    Reduce Operator Tree:
      Group By Operator
        keys: KEY._col0 (type: string)
        mode: mergepartial
        outputColumnNames: _col0

{code}

when disabling cbo, the explain is
{code}
 Map 1 
    Map Operator Tree:
        TableScan
          alias: srcpart
          filterExpr: (true and (ds = '2008-04-08')) (type: boolean)
          Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats: NONE
          Filter Operator
            predicate: true (type: boolean)
            Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                keys: '2008-04-08' (type: string)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: '2008-04-08' (type: string)
                  sort order: +
                  Map-reduce partition columns: '2008-04-08' (type: string)
                  Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
    Execution mode: vectorized
 Reducer 2 
    Execution mode: vectorized
    Local Work:
      Map Reduce Local Work
    Reduce Operator Tree:
      Group By Operator
        keys: '2008-04-08' (type: string)
{code}

The difference is the key of the GroupByOperator in the Reducer, but I do not 
know yet why disabling cbo leads to the wrong explain. I need to investigate.

> "ArrayIndexOutOfBoundsException" in 
> spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-16823
> URL: https://issues.apache.org/jira/browse/HIVE-16823
> Project: Hive
>  Issue Type: Bug
>Reporter: Jianguo Tian
>Assignee: liyunzhang_intel
> Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, 
> HIVE-16823.patch
>
>
> spark_vectorized_dynamic_partition_pruning.q
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> -- parent is reduce tasks
> select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart 
> group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
> {code}
> The exceptions are as follows:
> {code}
> 

[jira] [Commented] (HIVE-10349) overflow in stats

2017-08-23 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139497#comment-16139497
 ] 

liyunzhang_intel commented on HIVE-10349:
-

[~sershe]: I met a similar overflow problem when running TPC-DS/query17 on 
Hive on Spark; the explain is in this 
[link|https://issues.apache.org/jira/secure/attachment/12875204/query17_explain.log]. 
What is the root cause of the problem?
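
For illustration, here is a small sketch (hypothetical row width, not Hive's actual stats code) of how multiplying cardinalities in a signed 64-bit long overflows, which would explain a Data size pinned at 9223372036854775807 (Long.MAX_VALUE) as in the plan quoted below:
{code}
// Hypothetical illustration (not Hive's stats code) of a long overflow in
// size estimation: row count * bytes-per-row exceeds Long.MAX_VALUE.
public class StatsOverflowSketch {
    public static void main(String[] args) {
        long rows = 1047651367827495040L; // row estimate from the quoted plan
        long bytesPerRow = 100L;          // assumed average row width

        long naive = rows * bytesPerRow;  // silently wraps around
        System.out.println(naive);        // a negative value => overflow

        long clamped;
        try {
            clamped = Math.multiplyExact(rows, bytesPerRow);
        } catch (ArithmeticException e) {
            clamped = Long.MAX_VALUE;     // saturate instead of wrapping
        }
        System.out.println(clamped);      // 9223372036854775807
    }
}
{code}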

> overflow in stats
> -
>
> Key: HIVE-10349
> URL: https://issues.apache.org/jira/browse/HIVE-10349
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Prasanth Jayachandran
>
> Discovered while running q17 in LLAP.
> {noformat}
> Reducer 2 
> Execution mode: llap
> Reduce Operator Tree:
>   Merge Join Operator
> condition map:
>  Inner Join 0 to 1
> keys:
>   0 _col28 (type: int), _col27 (type: int)
>   1 cs_bill_customer_sk (type: int), cs_item_sk (type: int)
> outputColumnNames: _col1, _col2, _col6, _col8, _col9, _col22, 
> _col27, _col28, _col34, _col35, _col45, _col51, _col63, _col66, _col82
> Statistics: Num rows: 1047651367827495040 Data size: 
> 9223372036854775807 Basic stats: COMPLETE Column stats: PARTIAL
> Map Join Operator
>   condition map:
>Inner Join 0 to 1
>   keys:
> 0 _col22 (type: int)
> 1 d_date_sk (type: int)
>   outputColumnNames: _col1, _col2, _col6, _col8, _col9, 
> _col22, _col27, _col28, _col34, _col35, _col45, _col51, _col63, _col66, 
> _col82, _col86
>   input vertices:
> 1 Map 7
>   Statistics: Num rows: 1152416529588199552 Data size: 
> 9223372036854775807 Basic stats: COMPLETE Column stats: NONE
> {noformat}
> Data size overflows and row count also looks wrong. I wonder if this is why 
> it generates 1009 reducers for this stage on 6 machines



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q

2017-08-23 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138071#comment-16138071
 ] 

liyunzhang_intel commented on HIVE-16823:
-

I explained more about the big changes in 
spark_vectorized_dynamic_partition_pruning.q.out on the review board. [~lirui] 
and [~stakiar]: if you have time, please help review, thanks!

> "ArrayIndexOutOfBoundsException" in 
> spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-16823
> URL: https://issues.apache.org/jira/browse/HIVE-16823
> Project: Hive
>  Issue Type: Bug
>Reporter: Jianguo Tian
>Assignee: liyunzhang_intel
> Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, 
> HIVE-16823.patch
>
>
> spark_vectorized_dynamic_partition_pruning.q
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> -- parent is reduce tasks
> select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart 
> group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
> {code}
> The exceptions are as follows:
> {code}
> 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] 
> spark.SparkReduceRecordHandler: Fatal error: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
> ~[scala-library-2.11.8.jar:?]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:85) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [?:1.8.0_112]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [?:1.8.0_112]
>   at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:832)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeBase.processBatch(VectorGroupByOperator.java:179)
>  

[jira] [Updated] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q

2017-08-23 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-16823:

Attachment: HIVE-16823.1.patch

> "ArrayIndexOutOfBoundsException" in 
> spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-16823
> URL: https://issues.apache.org/jira/browse/HIVE-16823
> Project: Hive
>  Issue Type: Bug
>Reporter: Jianguo Tian
>Assignee: liyunzhang_intel
> Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, 
> HIVE-16823.patch
>
>
> spark_vectorized_dynamic_partition_pruning.q
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> -- parent is reduce tasks
> select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart 
> group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
> {code}
> The exceptions are as follows:
> {code}
> 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] 
> spark.SparkReduceRecordHandler: Fatal error: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
> ~[scala-library-2.11.8.jar:?]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:85) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [?:1.8.0_112]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [?:1.8.0_112]
>   at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:832)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeBase.processBatch(VectorGroupByOperator.java:179)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.process(VectorGroupByOperator.java:1035)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 

[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q

2017-08-23 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137991#comment-16137991
 ] 

liyunzhang_intel commented on HIVE-16823:
-

Updated the q*.out changes in HIVE-16823.1.patch. Most changes look like the 
following; this is because we remove ConstantPropagate in 
SparkCompiler#runDynamicPartitionPruning.
{code}
Map Operator Tree:
    TableScan
      alias: srcpart_date
-     filterExpr: ((date = '2008-04-08') and ds is not null) (type: boolean)
+     filterExpr: ((date = '2008-04-08') and ds is not null and true) (type: boolean)
      Statistics: Num rows: 2 Data size: 42 Basic stats: COMPLETE Column stats: NONE
      Filter Operator
-       predicate: ((date = '2008-04-08') and ds is not null) (type: boolean)
+       predicate: ((date = '2008-04-08') and ds is not null and true) (type: boolean)
        Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column stats: NONE
{code}
There are big changes in spark_vectorized_dynamic_partition_pruning.q.out, as 
this file has not been updated for a long time.
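
To make the residual {{and true}} concrete: below is a minimal, self-contained 
sketch (an illustration with assumed stand-in types, not Hive's actual 
ConstantPropagate source) of the AND shortcut folding that 
{{ConstantPropagateOption.SHORTCUT}} performs. Without this pass running after 
DPP, the {{true}} conjuncts the optimizer leaves behind stay in the predicate, 
exactly as the diff above shows.
{code}
// Illustration only: a tiny expression model, not Hive's ExprNodeDesc tree.
abstract class Expr {}
final class Const extends Expr {
  final boolean v;
  Const(boolean v) { this.v = v; }
  public String toString() { return Boolean.toString(v); }
}
final class Pred extends Expr {
  final String s;
  Pred(String s) { this.s = s; }
  public String toString() { return s; }
}
final class And extends Expr {
  final Expr l, r;
  And(Expr l, Expr r) { this.l = l; this.r = r; }
  public String toString() { return "(" + l + " and " + r + ")"; }
}

public class ShortcutFold {
  // Shortcut folding: only AND with a constant true/false operand is
  // simplified; no general constant propagation is attempted.
  static Expr fold(Expr e) {
    if (!(e instanceof And)) {
      return e;
    }
    Expr l = fold(((And) e).l);
    Expr r = fold(((And) e).r);
    if (l instanceof Const) {
      return ((Const) l).v ? r : new Const(false);  // true AND r -> r
    }
    if (r instanceof Const) {
      return ((Const) r).v ? l : new Const(false);  // l AND true -> l
    }
    return new And(l, r);
  }

  public static void main(String[] args) {
    Expr p = new And(new And(new Pred("(date = '2008-04-08')"),
                             new Pred("ds is not null")),
                     new Const(true));
    // Prints: ((date = '2008-04-08') and ds is not null)
    System.out.println(fold(p));
  }
}
{code}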

> "ArrayIndexOutOfBoundsException" in 
> spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-16823
> URL: https://issues.apache.org/jira/browse/HIVE-16823
> Project: Hive
>  Issue Type: Bug
>Reporter: Jianguo Tian
>Assignee: liyunzhang_intel
> Attachments: explain.spark, explain.tez, HIVE-16823.patch
>
>
> spark_vectorized_dynamic_partition_pruning.q
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> -- parent is reduce tasks
> select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart 
> group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
> {code}
> The exceptions are as follows:
> {code}
> 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] 
> spark.SparkReduceRecordHandler: Fatal error: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
> ~[scala-library-2.11.8.jar:?]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:85) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> 

[jira] [Updated] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q

2017-08-21 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-16823:

Status: Patch Available  (was: Open)

> "ArrayIndexOutOfBoundsException" in 
> spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-16823
> URL: https://issues.apache.org/jira/browse/HIVE-16823
> Project: Hive
>  Issue Type: Bug
>Reporter: Jianguo Tian
>Assignee: liyunzhang_intel
> Attachments: explain.spark, explain.tez, HIVE-16823.patch
>
>
> spark_vectorized_dynamic_partition_pruning.q
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> -- parent is reduce tasks
> select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart 
> group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
> {code}
> The exceptions are as follows:
> {code}
> 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] 
> spark.SparkReduceRecordHandler: Fatal error: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
> ~[scala-library-2.11.8.jar:?]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:85) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [?:1.8.0_112]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [?:1.8.0_112]
>   at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:832)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeBase.processBatch(VectorGroupByOperator.java:179)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.process(VectorGroupByOperator.java:1035)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> 

[jira] [Updated] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q

2017-08-21 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-16823:

Attachment: HIVE-16823.patch

In HIVE-15269 (Dynamic Min-Max/BloomFilter runtime-filtering for Tez), the 
ConstantPropagate run was removed from TezCompiler#runDynamicPartitionPruning. 
The similar code below should be removed from 
SparkCompiler#runDynamicPartitionPruning:
{code}
  private void runDynamicPartitionPruning(OptimizeTezProcContext procCtx, Set<ReadEntity> inputs,
      Set<WriteEntity> outputs) throws SemanticException {

    if (!procCtx.conf.getBoolVar(ConfVars.TEZ_DYNAMIC_PARTITION_PRUNING)) {
      return;
    }

    // Sequence of TableScan operators to be walked
    Deque<Operator<?>> deque = new LinkedList<Operator<?>>();
    deque.addAll(procCtx.parseContext.getTopOps().values());

    Map<Rule, NodeProcessor> opRules = new LinkedHashMap<Rule, NodeProcessor>();
    opRules.put(
        new RuleRegExp(new String("Dynamic Partition Pruning"), FilterOperator.getOperatorName()
            + "%"), new DynamicPartitionPruningOptimization());

    // The dispatcher fires the processor corresponding to the closest matching
    // rule and passes the context along
    Dispatcher disp = new DefaultRuleDispatcher(null, opRules, procCtx);
    List<Node> topNodes = new ArrayList<Node>();
    topNodes.addAll(procCtx.parseContext.getTopOps().values());
    GraphWalker ogw = new ForwardWalker(disp);
    ogw.startWalking(topNodes, null);

    /* The similar block below is what HIVE-15269 (Dynamic Min-Max/BloomFilter
       runtime-filtering for Tez) removed from TezCompiler. */
    // need a new run of the constant folding because we might have created lots
    // of "and true and true" conditions.
    // Rather than run the full constant folding just need to shortcut AND/OR expressions
    // involving constant true/false values.
    if (procCtx.conf.getBoolVar(ConfVars.HIVEOPTCONSTANTPROPAGATION)) {
      new ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(procCtx.parseContext);
    }
  }
{code}
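For reference, a hedged sketch of what the removal would look like, assuming 
SparkCompiler#runDynamicPartitionPruning ends with the same block as the Tez 
code above (the exact hunk in the real patch may differ):
{code}
     GraphWalker ogw = new ForwardWalker(disp);
     ogw.startWalking(topNodes, null);
-
-    // need a new run of the constant folding because we might have created lots
-    // of "and true and true" conditions.
-    // Rather than run the full constant folding just need to shortcut AND/OR expressions
-    // involving constant true/false values.
-    if (procCtx.conf.getBoolVar(ConfVars.HIVEOPTCONSTANTPROPAGATION)) {
-      new ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(procCtx.parseContext);
-    }
   }
{code}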
[~lirui], [~stakiar]: can you help review?


> "ArrayIndexOutOfBoundsException" in 
> spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-16823
> URL: https://issues.apache.org/jira/browse/HIVE-16823
> Project: Hive
>  Issue Type: Bug
>Reporter: Jianguo Tian
>Assignee: liyunzhang_intel
> Attachments: explain.spark, explain.tez, HIVE-16823.patch
>
>
> spark_vectorized_dynamic_partition_pruning.q
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> -- parent is reduce tasks
> select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart 
> group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
> {code}
> The exceptions are as follows:
> {code}
> 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] 
> spark.SparkReduceRecordHandler: Fatal error: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
> ~[scala-library-2.11.8.jar:?]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  

[jira] [Comment Edited] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q

2017-08-21 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16136156#comment-16136156
 ] 

liyunzhang_intel edited comment on HIVE-16823 at 8/22/17 2:16 AM:
--

Some updates about the JIRA.
The root cause of the problem is the difference in how the sub-query {{select 
ds as ds, ds as `date` from srcpart group by ds}} is planned between Tez and 
Spark mode.
The Spark explain (the full Spark explain is attached 
[here|https://issues.apache.org/jira/secure/attachment/12883036/explain.spark]):
{code}
  Map 3 
    Map Operator Tree:
        TableScan
          alias: srcpart
          filterExpr: (ds = '2008-04-08') (type: boolean)
          Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats: NONE
          Select Operator
            Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats: NONE
            Group By Operator
              keys: '2008-04-08' (type: string)
              mode: hash
              outputColumnNames: _col0
              Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: '2008-04-08' (type: string)
                sort order: +
                Map-reduce partition columns: '2008-04-08' (type: string)
                Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
  Reducer 4 
    Local Work:
      Map Reduce Local Work
    Reduce Operator Tree:
      Group By Operator
        keys: '2008-04-08' (type: string)
        mode: mergepartial
        outputColumnNames: _col0
        Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
{code}

The Tez explain (the full Tez explain is attached 
[here|https://issues.apache.org/jira/secure/attachment/12883035/explain.tez]):
{code}
  Map 2 
    Map Operator Tree:
        TableScan
          alias: srcpart
          filterExpr: (ds = '2008-04-08') (type: boolean)
          Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats: NONE
          Select Operator
            Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats: NONE
            Group By Operator
              keys: '2008-04-08' (type: string)
              mode: hash
              outputColumnNames: _col0
              Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: string)
                sort order: +
                Map-reduce partition columns: _col0 (type: string)
                Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
        Execution mode: vectorized
    Reducer 3 
        Execution mode: vectorized
        Reduce Operator Tree:
          Group By Operator
            keys: KEY._col0 (type: string)
            mode: mergepartial
            outputColumnNames: _col0
{code}

The Group By Operator appears in both the Map and the Reducer in Tez and Spark 
mode, but the key of the GroupByOperator in the Reducer differs: in Tez the key 
is {{keys: KEY._col0 (type: string)}}, while in Spark it is {{keys: 
'2008-04-08' (type: string)}}. This difference causes 
[VectorizationContext#getVectorExpression|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizationContext.java#L579]
 to return 
[getColumnVectorExpression|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizationContext.java#L582]
 in Tez mode but 
[getConstantVectorExpression|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizationContext.java#L660]
 in Spark mode.
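
A minimal, self-contained model of that dispatch (simplified stand-in types 
for illustration; not the real VectorizationContext API): a column reference 
such as {{KEY._col0}} maps to an expression that reads an actual batch column, 
while a literal such as {{'2008-04-08'}} maps to a constant expression that 
materializes no key column. That is consistent with the 
ArrayIndexOutOfBoundsException thrown from VectorGroupKeyHelper.copyGroupKey 
in the stack trace above, which indexes into key columns the constant path 
never produced.
{code}
// Illustration only -- simplified stand-ins, not Hive's ExprNodeDesc/VectorExpression types.
abstract class ExprDesc {}
final class ColumnDesc extends ExprDesc {        // e.g. KEY._col0 (Tez plan)
  final String name;
  ColumnDesc(String name) { this.name = name; }
}
final class ConstantDesc extends ExprDesc {      // e.g. '2008-04-08' (Spark plan)
  final Object value;
  ConstantDesc(Object value) { this.value = value; }
}

public class VectorDispatchModel {
  // Mirrors the branch that differs between the two plans: column descriptors
  // take the column-expression path, constant descriptors the constant path.
  static String getVectorExpression(ExprDesc desc) {
    if (desc instanceof ColumnDesc) {
      return "ColumnVectorExpression(" + ((ColumnDesc) desc).name + ")";      // reads a batch column
    }
    if (desc instanceof ConstantDesc) {
      return "ConstantVectorExpression(" + ((ConstantDesc) desc).value + ")"; // no batch column
    }
    throw new IllegalArgumentException("unsupported descriptor: " + desc);
  }

  public static void main(String[] args) {
    System.out.println(getVectorExpression(new ColumnDesc("KEY._col0")));     // Tez reducer key
    System.out.println(getVectorExpression(new ConstantDesc("2008-04-08"))); // Spark reducer key
  }
}
{code}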


