[jira] [Comment Edited] (HIVE-17193) HoS: don't combine map works that are targets of different DPPs
[ https://issues.apache.org/jira/browse/HIVE-17193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214644#comment-16214644 ]

liyunzhang_intel edited comment on HIVE-17193 at 10/23/17 5:24 AM:
-------------------------------------------------------------------

I can reproduce after disabling cbo
{code}
set hive.explain.user=false;
set hive.spark.dynamic.partition.pruning=true;
set hive.tez.dynamic.partition.pruning=true;
set hive.auto.convert.join=false;
set hive.cbo.enable=false;
explain select * from (select srcpart.ds,srcpart.key from srcpart join src on srcpart.ds=src.key) a join (select srcpart.ds,srcpart.key from srcpart join src on srcpart.ds=src.value) b on a.key=b.key;
{code}
the explain
{code}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      DagName: root_20171023004308_4b3c304e-3deb-4193-846d-12cf9e6a50ab:2
      Vertices:
        Map 8
            Map Operator Tree:
                TableScan
                  alias: src
                  Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: key (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        keys: _col0 (type: string)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                        Spark Partition Pruning Sink Operator
                          Target column: ds (string)
                          partition key expr: ds
                          Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                          target work: Map 1

  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 1), Map 4 (PARTITION-LEVEL SORT, 1)
        Reducer 3 <- Reducer 2 (PARTITION-LEVEL SORT, 1), Reducer 6 (PARTITION-LEVEL SORT, 1)
        Reducer 6 <- Map 1 (PARTITION-LEVEL SORT, 1), Map 7 (PARTITION-LEVEL SORT, 1)
      DagName: root_20171023004308_4b3c304e-3deb-4193-846d-12cf9e6a50ab:1
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: srcpart
                  Statistics: Num rows: 232 Data size: 23248 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 232 Data size: 23248 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: ds (type: string)
                      sort order: +
                      Map-reduce partition columns: ds (type: string)
                      Statistics: Num rows: 232 Data size: 23248 Basic stats: COMPLETE Column stats: NONE
                      value expressions: key (type: string)
        Map 4
            Map Operator Tree:
                TableScan
                  alias: src
                  Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: key (type: string)
                      sort order: +
                      Map-reduce partition columns: key (type: string)
                      Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
        Map 7
            Map Operator Tree:
                TableScan
                  alias: src
                  Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: value is not null (type: boolean)
                    Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: value (type: string)
                      sort order: +
                      Map-reduce partition columns: value (type: string)
                      Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
        Reducer 2
            Reduce Operator Tree:
              Join Operator
                condition map:
[jira] [Updated] (HIVE-16948) Invalid explain when running dynamic partition pruning query in Hive On Spark
[ https://issues.apache.org/jira/browse/HIVE-16948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated HIVE-16948:
    Attachment: 17193_compare_RS_in_Map_5_1.PNG

> Invalid explain when running dynamic partition pruning query in Hive On Spark
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-16948
>                 URL: https://issues.apache.org/jira/browse/HIVE-16948
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: 3.0.0
>
>         Attachments: 17193_compare_RS_in_Map_5_1.PNG, HIVE-16948.2.patch, HIVE-16948.5.patch, HIVE-16948.6.patch, HIVE-16948.7.patch, HIVE-16948.patch, HIVE-16948_1.patch
>
>
> in [union_subquery.q|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/spark_dynamic_partition_pruning.q#L107] in spark_dynamic_partition_pruning.q
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.strict.checks.cartesian.product=false;
> explain select ds from (select distinct(ds) as ds from srcpart union all select distinct(ds) as ds from srcpart) s where s.ds in (select max(srcpart.ds) from srcpart union all select min(srcpart.ds) from srcpart);
> {code}
> explain
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
>
> STAGE PLANS:
>   Stage: Stage-2
>     Spark
>       Edges:
>         Reducer 11 <- Map 10 (GROUP, 1)
>         Reducer 13 <- Map 12 (GROUP, 1)
>       DagName: root_20170622231525_20a777e5-e659-4138-b605-65f8395e18e2:2
>       Vertices:
>         Map 10
>             Map Operator Tree:
>                 TableScan
>                   alias: srcpart
>                   Statistics: Num rows: 1 Data size: 23248 Basic stats: PARTIAL Column stats: NONE
>                   Select Operator
>                     expressions: ds (type: string)
>                     outputColumnNames: ds
>                     Statistics: Num rows: 1 Data size: 23248 Basic stats: PARTIAL Column stats: NONE
>                     Group By Operator
>                       aggregations: max(ds)
>                       mode: hash
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: NONE
>                       Reduce Output Operator
>                         sort order:
>                         Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: NONE
>                         value expressions: _col0 (type: string)
>         Map 12
>             Map Operator Tree:
>                 TableScan
>                   alias: srcpart
>                   Statistics: Num rows: 1 Data size: 23248 Basic stats: PARTIAL Column stats: NONE
>                   Select Operator
>                     expressions: ds (type: string)
>                     outputColumnNames: ds
>                     Statistics: Num rows: 1 Data size: 23248 Basic stats: PARTIAL Column stats: NONE
>                     Group By Operator
>                       aggregations: min(ds)
>                       mode: hash
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: NONE
>                       Reduce Output Operator
>                         sort order:
>                         Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: NONE
>                         value expressions: _col0 (type: string)
>         Reducer 11
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: max(VALUE._col0)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: NONE
>                 Filter Operator
>                   predicate: _col0 is not null (type: boolean)
>                   Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: NONE
>                   Group By Operator
>                     keys: _col0 (type: string)
>                     mode: hash
>                     outputColumnNames: _col0
>                     Statistics: Num rows: 2 Data size: 368 Basic stats: COMPLETE Column stats: NONE
>                     Select Operator
>                       expressions: _col0 (type: string)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 2 Data size: 368 Basic stats: COMPLETE Column stats: NONE
>
[jira] [Commented] (HIVE-17193) HoS: don't combine map works that are targets of different DPPs
[ https://issues.apache.org/jira/browse/HIVE-17193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214603#comment-16214603 ]

liyunzhang_intel commented on HIVE-17193:
-----------------------------------------

[~lirui]: I remember this problem from when I developed HIVE-16948, but I can not reproduce it on hive (commit a51ae9c) now
{code}
set hive.explain.user=false;
set hive.spark.dynamic.partition.pruning=true;
set hive.tez.dynamic.partition.pruning=true;
set hive.auto.convert.join=false;
explain select * from (select srcpart.ds,srcpart.key from srcpart join src on srcpart.ds=src.key) a join (select srcpart.ds,srcpart.key from srcpart join src on srcpart.ds=src.value) b on a.key=b.key;
{code}
the explain
{code}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      DagName: root_20171022233200_990c146c-b49f-49b9-9a5b-a0028e34f200:2
      Vertices:
        Map 8
            Map Operator Tree:
                TableScan
                  alias: src
                  Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: key (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            Target column: ds (string)
                            partition key expr: ds
                            Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                            target work: Map 1
        Map 9
            Map Operator Tree:
                TableScan
                  alias: src
                  Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: value is not null (type: boolean)
                    Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: value (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            Target column: ds (string)
                            partition key expr: ds
                            Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                            target work: Map 5

  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 1), Map 4 (PARTITION-LEVEL SORT, 1)
        Reducer 3 <- Reducer 2 (PARTITION-LEVEL SORT, 1), Reducer 6 (PARTITION-LEVEL SORT, 1)
        Reducer 6 <- Map 5 (PARTITION-LEVEL SORT, 1), Map 7 (PARTITION-LEVEL SORT, 1)
      DagName: root_20171022233200_990c146c-b49f-49b9-9a5b-a0028e34f200:1
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: srcpart
                  Statistics: Num rows: 232 Data size: 23248 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 232 Data size: 23248 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: key (type: string), ds (type: string)
[jira] [Updated] (HIVE-17634) Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)
[ https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated HIVE-17634:
    Description:
In the statistics estimation ([StatsRulesProcFactory|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L134]), we do not estimate the column stats once we set hive.stats.fetch.column.stats as false. Suggest to estimate the data size by column type when {{hive.stats.fetch.column.stats}} is false, like HIVE-17634.1.patch does.

  (was: In the statistics estimation ([StatsRulesProcFactory|), we do not estimate the column stats once we set hive.stats.fetch.column.stats as false. Suggest to estimate the data size by column type when {{hive.stats.fetch.column.stats}} is false, like HIVE-17634.1.patch does.)

> Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-17634
>                 URL: https://issues.apache.org/jira/browse/HIVE-17634
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-17634.1.patch, HIVE-17634.patch
>
>
> In the statistics estimation ([StatsRulesProcFactory|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L134]), we do not estimate the column stats once we set hive.stats.fetch.column.stats as false. Suggest to estimate the data size by column type when {{hive.stats.fetch.column.stats}} is false, like HIVE-17634.1.patch does.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Updated] (HIVE-17634) Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)
[ https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated HIVE-17634:
    Description:
In the statistics estimation ([StatsRulesProcFactory|), we do not estimate the column stats once we set hive.stats.fetch.column.stats as false. Suggest to estimate the data size by column type when {{hive.stats.fetch.column.stats}} is false, like HIVE-17634.1.patch does.

  (was: in [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], we set {{fetchColStats}}, {{fetchPartStats}} as true when calling {{StatsUtils.collectStatistics}}
{code}
    if (!hiveTblMetadata.isPartitioned()) {
      // 2.1 Handle the case for unpartitioned table.
      try {
        Statistics stats = StatsUtils.collectStatistics(hiveConf, null,
            hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats,
            colStatsCached, nonPartColNamesThatRqrStats, true, true);
        ...
{code}
This will cause querying column statistics from the metastore even when we set {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} as false in HiveConf. If we set these two properties as false, we can not get any column statistics from the metastore. Suggest to read the properties from HiveConf.)
[jira] [Commented] (HIVE-17634) Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)
[ https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187239#comment-16187239 ]

liyunzhang_intel commented on HIVE-17634:
-----------------------------------------

[~vgarg]: thanks for your comment, I will try. Oct 1 - Oct 8 is a Chinese holiday, so the patch may be delayed for some time; thanks for your patience.
[jira] [Commented] (HIVE-17634) Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)
[ https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186851#comment-16186851 ]

liyunzhang_intel commented on HIVE-17634:
-----------------------------------------

the command to regenerate all the q*.out files is
{code}
mvn clean test -Dtest=TestCliDriver -Dtest.output.overwrite=true -Dqfile=*
{code}
If it is not correct, tell me which command to use to regenerate all the q*.out files, thanks!
[jira] [Commented] (HIVE-17634) Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)
[ https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186830#comment-16186830 ]

liyunzhang_intel commented on HIVE-17634:
-----------------------------------------

[~vgarg]: there are 1243 failed/errored test(s). Most failures look like
{code}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[bucket_map_join_spark4]
Failing for the past 1 build (Since Failed#7056 )
Took 10 sec.
Error Message
Client Execution succeeded but contained differences (error code = 1) after executing bucket_map_join_spark4.q
88c88
<                 Statistics: Num rows: 10 Data size: 1880 Basic stats: COMPLETE Column stats: NONE
---
>                 Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE Column stats: NONE
{code}
This is because we now use [betterDS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L358], not {{ds}}, to estimate the data size, so the data size changed from 70 to 1880. Do you think this is ok? If so, I will start to regenerate the *.q.out files in my local cluster.
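The 70 → 1880 change discussed above comes from deriving the data size as rows × summed per-column width estimates instead of the raw on-disk size. A toy sketch of that arithmetic (the widths below are made-up illustrative numbers, not Hive's actual per-type constants):

```python
# Toy sketch of estimating "Data size" from per-column width estimates
# rather than the raw file size. The widths are hypothetical, chosen only
# to show how a 10-row table's reported size can jump from 70 to 1880.
def data_size_from_column_stats(num_rows, avg_col_widths):
    # same shape of computation as a rows * sum(column widths) estimate
    return num_rows * sum(avg_col_widths)

raw_file_size = 70                                       # size previously reported
estimated = data_size_from_column_stats(10, [100, 88])   # two columns, made-up widths
print(raw_file_size, estimated)
```

The point is only that the two numbers are computed from different inputs, so golden-file diffs like the one above are expected when switching estimators.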
[jira] [Updated] (HIVE-17634) Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)
[ https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated HIVE-17634:
    Status: Patch Available  (was: Open)
[jira] [Commented] (HIVE-17634) Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)
[ https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186454#comment-16186454 ]

liyunzhang_intel commented on HIVE-17634:
-----------------------------------------

[~vgarg]: thanks for the review. Triggering the test now.
[jira] [Updated] (HIVE-17634) Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)
[ https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated HIVE-17634:
    Summary: Estimate the column stats even not retrieve columns from metastore(hive.stats.fetch.column.stats as false)  (was: Use properties from HiveConf about "fetchColStats" and "fetchPartStats" in RelOptHiveTable#updateColStats)
[jira] [Updated] (HIVE-17634) Use properties from HiveConf about "fetchColStats" and "fetchPartStats" in RelOptHiveTable#updateColStats
[ https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated HIVE-17634:
    Attachment: HIVE-17634.1.patch

[~vgarg]: thanks for your reply. I did indeed hit the problem that the statistics are not correct when I set {{hive.stats.fetch.column.stats}} as false. Attaching HIVE-17634.1.patch; please help review.
[jira] [Commented] (HIVE-17634) Use properties from HiveConf about "fetchColStats" and "fetchPartStats" in RelOptHiveTable#updateColStats
[ https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185327#comment-16185327 ] liyunzhang_intel commented on HIVE-17634: - [~vgarg]: thanks for your explanation. {quote} I am not convinced why would user not want to fetch stats from metastore and instead rely upon estimated statistics? {quote} from the document it said "Fetching column statistics for each needed column can be expensive when the number of columns is high". The default value of hive.stats.fetch.column.stats is false. Maybe users do not enable this property because they need use {{analyze table xxx compute statistics for columns}} to collect column statistics and this command are time-consuming for table with high number of columns. {code} HIVE_STATS_FETCH_COLUMN_STATS("hive.stats.fetch.column.stats", false, "Annotation of operator tree with statistics information requires column statistics.\n" + "Column statistics are fetched from metastore. Fetching column statistics for each needed column\n" + "can be expensive when the number of columns is high. This flag can be used to disable fetching\n" + "of column statistics from metastore."), {code} > Use properties from HiveConf about "fetchColStats" and "fetchPartStats" in > RelOptHiveTable#updateColStats > - > > Key: HIVE-17634 > URL: https://issues.apache.org/jira/browse/HIVE-17634 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: HIVE-17634.patch > > > in > [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], > we set {{fetchColStats}},{{fetchPartStats}} as true when call > {{StatsUtils.collectStatistics}} > {code} >if (!hiveTblMetadata.isPartitioned()) { > // 2.1 Handle the case for unpartitioned table. 
> try { > Statistics stats = StatsUtils.collectStatistics(hiveConf, null, > hiveTblMetadata, hiveNonPartitionCols, > nonPartColNamesThatRqrStats, > colStatsCached, nonPartColNamesThatRqrStats, true, true); > ... > {code} > This will cause querying columns statistic from metastore even we set > {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} as > false in HiveConf. If we these two properties as false, we can not any > column statistics from metastore. Suggest to set the properties from > HiveConf. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
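The fix suggested in the description — reading {{fetchColStats}} and {{fetchPartStats}} from the configuration instead of hardcoding {{true}} — can be sketched as below. This is a hypothetical, self-contained model: java.util.Properties stands in for HiveConf, the two property keys are the real Hive ones discussed above, and the "false" defaults used here are only illustrative, not Hive's actual defaults.

```java
import java.util.Properties;

// Hypothetical sketch of the suggested fix: derive the two fetch flags
// from configuration instead of passing hardcoded true values to
// collectStatistics. Properties stands in for HiveConf; the "false"
// defaults below are illustrative, not Hive's actual defaults.
public class FetchFlagsSketch {

    // Returns {fetchColStats, fetchPartStats} as read from the config.
    static boolean[] fetchFlags(Properties conf) {
        boolean fetchColStats = Boolean.parseBoolean(
            conf.getProperty("hive.stats.fetch.column.stats", "false"));
        boolean fetchPartStats = Boolean.parseBoolean(
            conf.getProperty("hive.stats.fetch.partition.stats", "false"));
        return new boolean[] { fetchColStats, fetchPartStats };
    }
}
```

With both properties left false in the config, the call site would then skip the metastore lookups, which matches the behavior the reporter expects.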
[jira] [Commented] (HIVE-17634) Use properties from HiveConf about "fetchColStats" and "fetchPartStats" in RelOptHiveTable#updateColStats
[ https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185281#comment-16185281 ] liyunzhang_intel commented on HIVE-17634: - [~vgarg]: thanks for your reply. I can understand the importance of column stats to estimate the statistics. What confuses me is that in the logical plan we pass {{true}} to fetch the column stats from the metastore, even though we may not get a [result |https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L351] from the metastore and then fall back to [estimateStatsForMissingCols|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L354]. But in the statistics estimation ([StatsRulesProcFactory|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L134]), we do not even estimate the column stats once we set {{hive.stats.fetch.column.stats}} to false. Can we refactor [StatsUtils#collectStatistics|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L349] like {code}
if (fetchColStats) {
  colStats = getTableColumnStats(table, schema, neededColumns, colStatsCache);
}
// Even when column stats are not fetched from the metastore, we still estimate them
if (colStats == null) {
  colStats = Lists.newArrayList();
}
estimateStatsForMissingCols(neededColumns, colStats, table, conf, nr, schema);
// we should have stats for all columns (estimated or actual)
assert(neededColumns.size() == colStats.size());
long betterDS = getDataSizeFromColumnStats(nr, colStats);
ds = (betterDS < 1 || colStats.isEmpty()) ?
ds : betterDS; {code} > Use properties from HiveConf about "fetchColStats" and "fetchPartStats" in > RelOptHiveTable#updateColStats > - > > Key: HIVE-17634 > URL: https://issues.apache.org/jira/browse/HIVE-17634 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: HIVE-17634.patch > > > in > [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], > we set {{fetchColStats}},{{fetchPartStats}} as true when call > {{StatsUtils.collectStatistics}} > {code} >if (!hiveTblMetadata.isPartitioned()) { > // 2.1 Handle the case for unpartitioned table. > try { > Statistics stats = StatsUtils.collectStatistics(hiveConf, null, > hiveTblMetadata, hiveNonPartitionCols, > nonPartColNamesThatRqrStats, > colStatsCached, nonPartColNamesThatRqrStats, true, true); > ... > {code} > This will cause querying columns statistic from metastore even we set > {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} as > false in HiveConf. If we these two properties as false, we can not any > column statistics from metastore. Suggest to set the properties from > HiveConf. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
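The refactor proposed above can be modeled end-to-end in a small, self-contained sketch. Everything here is a stand-in (String lists instead of Hive's ColStatistics objects, simplified method bodies), not the real StatsUtils API; it only demonstrates the control flow being proposed: consult the metastore only when the flag is on, and always backfill with estimates so every needed column ends up with a stats entry.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the proposed collectStatistics flow:
// the metastore is consulted only when fetchColStats is true, and
// estimateStatsForMissingCols always backfills, so the planner ends up
// with one stats entry per needed column either way.
public class CollectStatsSketch {

    // Stand-in for StatsUtils.getTableColumnStats: pretend every needed
    // column has actual stats in the metastore.
    static List<String> getTableColumnStats(List<String> neededColumns) {
        List<String> stats = new ArrayList<>();
        for (String c : neededColumns) {
            stats.add(c + ":actual");
        }
        return stats;
    }

    // Stand-in for StatsUtils.estimateStatsForMissingCols: add an
    // estimated entry for every column that has no stats yet.
    static void estimateStatsForMissingCols(List<String> neededColumns,
                                            List<String> colStats) {
        for (String c : neededColumns) {
            boolean present = false;
            for (String s : colStats) {
                if (s.startsWith(c + ":")) { present = true; break; }
            }
            if (!present) {
                colStats.add(c + ":estimated");
            }
        }
    }

    static List<String> collectStatistics(List<String> neededColumns,
                                          boolean fetchColStats) {
        List<String> colStats = null;
        if (fetchColStats) {
            colStats = getTableColumnStats(neededColumns);
        }
        // Even when stats are not fetched from the metastore, estimate them.
        if (colStats == null) {
            colStats = new ArrayList<>();
        }
        estimateStatsForMissingCols(neededColumns, colStats);
        // we should have stats for all columns (estimated or actual)
        assert neededColumns.size() == colStats.size();
        return colStats;
    }
}
```

The point of the sketch is that disabling the fetch no longer leaves the planner with zero column stats; it degrades gracefully to estimates.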
[jira] [Updated] (HIVE-17634) Use properties from HiveConf about "fetchColStats" and "fetchPartStats" in RelOptHiveTable#updateColStats
[ https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17634: Summary: Use properties from HiveConf about "fetchColStats" and "fetchPartStats" in RelOptHiveTable#updateColStats (was: Use properties from HiveConf in RelOptHiveTable#updateColStats) > Use properties from HiveConf about "fetchColStats" and "fetchPartStats" in > RelOptHiveTable#updateColStats > - > > Key: HIVE-17634 > URL: https://issues.apache.org/jira/browse/HIVE-17634 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: HIVE-17634.patch > > > in > [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], > we set {{fetchColStats}},{{fetchPartStats}} as true when call > {{StatsUtils.collectStatistics}} > {code} >if (!hiveTblMetadata.isPartitioned()) { > // 2.1 Handle the case for unpartitioned table. > try { > Statistics stats = StatsUtils.collectStatistics(hiveConf, null, > hiveTblMetadata, hiveNonPartitionCols, > nonPartColNamesThatRqrStats, > colStatsCached, nonPartColNamesThatRqrStats, true, true); > ... > {code} > This will cause querying columns statistic from metastore even we set > {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} as > false in HiveConf. If we these two properties as false, we can not any > column statistics from metastore. Suggest to set the properties from > HiveConf. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVE-17634) Use properties from HiveConf in RelOptHiveTable#updateColStats
[ https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17634: Attachment: HIVE-17634.patch [~vgarg],[~jcamachorodriguez]:As you have more knowledge about RelOptHiveTable and statistics estimation, can you take a look about the patch? thanks! > Use properties from HiveConf in RelOptHiveTable#updateColStats > -- > > Key: HIVE-17634 > URL: https://issues.apache.org/jira/browse/HIVE-17634 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: HIVE-17634.patch > > > in > [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], > we set {{fetchColStats}},{{fetchPartStats}} as true when call > {{StatsUtils.collectStatistics}} > {code} >if (!hiveTblMetadata.isPartitioned()) { > // 2.1 Handle the case for unpartitioned table. > try { > Statistics stats = StatsUtils.collectStatistics(hiveConf, null, > hiveTblMetadata, hiveNonPartitionCols, > nonPartColNamesThatRqrStats, > colStatsCached, nonPartColNamesThatRqrStats, true, true); > ... > {code} > This will cause querying columns statistic from metastore even we set > {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} as > false in HiveConf. If we these two properties as false, we can not any > column statistics from metastore. Suggest to set the properties from > HiveConf. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (HIVE-17634) Use properties from HiveConf in RelOptHiveTable#updateColStats
[ https://issues.apache.org/jira/browse/HIVE-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel reassigned HIVE-17634: --- Assignee: liyunzhang_intel > Use properties from HiveConf in RelOptHiveTable#updateColStats > -- > > Key: HIVE-17634 > URL: https://issues.apache.org/jira/browse/HIVE-17634 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > > in > [RelOptHiveTable#updateColStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L309], > we set {{fetchColStats}},{{fetchPartStats}} as true when call > {{StatsUtils.collectStatistics}} > {code} >if (!hiveTblMetadata.isPartitioned()) { > // 2.1 Handle the case for unpartitioned table. > try { > Statistics stats = StatsUtils.collectStatistics(hiveConf, null, > hiveTblMetadata, hiveNonPartitionCols, > nonPartColNamesThatRqrStats, > colStatsCached, nonPartColNamesThatRqrStats, true, true); > ... > {code} > This will cause querying columns statistic from metastore even we set > {{hive.stats.fetch.column.stats}} and {{hive.stats.fetch.partition.stats}} as > false in HiveConf. If we these two properties as false, we can not any > column statistics from metastore. Suggest to set the properties from > HiveConf. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVE-17182) Invalid statistics like "RAW DATA SIZE" info for parquet file
[ https://issues.apache.org/jira/browse/HIVE-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17182: Description: on TPC-DS 200g scale store_sales use "describe formatted store_sales" to view the statistics {code} hive> describe formatted store_sales; OK # col_name data_type comment ss_sold_time_sk bigint ss_item_sk bigint ss_customer_sk bigint ss_cdemo_sk bigint ss_hdemo_sk bigint ss_addr_sk bigint ss_store_sk bigint ss_promo_sk bigint ss_ticket_numberbigint ss_quantity int ss_wholesale_cost double ss_list_price double ss_sales_price double ss_ext_discount_amt double ss_ext_sales_price double ss_ext_wholesale_cost double ss_ext_list_price double ss_ext_tax double ss_coupon_amt double ss_net_paid double ss_net_paid_inc_tax double ss_net_profit double # Partition Information # col_name data_type comment ss_sold_date_sk bigint # Detailed Table Information Database: tpcds_bin_partitioned_parquet_200 Owner: root CreateTime: Tue Jun 06 11:51:48 CST 2017 LastAccessTime: UNKNOWN Retention: 0 Location: hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales Table Type: MANAGED_TABLE Table Parameters: COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"} numFiles2023 numPartitions 1824 numRows 575995635 rawDataSize 12671903970 totalSize 46465926745 transient_lastDdlTime 1496721108 {code} the rawDataSize is nearly 12G while the totalSize is nearly 46G. view the original data on hdfs {noformat} #hadoop fs -du -h /tmp/tpcds-generate/200/ 75.8 G /tmp/tpcds-generate/200/store_sales {noformat} view the parquet file on hdfs {noformat} # hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db 43.3 G /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales {noformat} It seems that the rawDataSize is nearly 75G but in "describe formatted store_sales" command, it shows only 12G. 
I tried to use "analyze table store_sales compute statistics for columns" to update the statistics but there is no change for RAWDATASIZE; I tried to use "analyze table store_sales partition(ss_sold_date_sk) compute statistics no scan" to update the statistics but it fails, the error is {code}
2017-09-28T03:21:04,849 INFO [StatsNoJobTask-Thread-1] exec.Task: [Warning] could not update stats for tpcds_bin_partitioned_parquet_10.store_sales{ss_sold_date_sk=2451769}. Failed with exception Missing timezone id for parquet int96 conversion!
java.lang.IllegalArgumentException: Missing timezone id for parquet int96 conversion!
  at org.apache.hadoop.hive.ql.io.parquet.timestamp.NanoTimeUtils.validateTimeZone(NanoTimeUtils.java:169)
  at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.setTimeZoneConversion(ParquetRecordReaderBase.java:182)
  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:89)
  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:59)
  at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:86)
  at org.apache.hadoop.hive.ql.exec.StatsNoJobTask$StatsCollection.run(StatsNoJobTask.java:164)
  at
[jira] [Updated] (HIVE-17182) Invalid statistics like "RAW DATA SIZE" info for parquet file
[ https://issues.apache.org/jira/browse/HIVE-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17182: Description: on TPC-DS 200g scale store_sales use "describe formatted store_sales" to view the statistics {code} hive> describe formatted store_sales; OK # col_name data_type comment ss_sold_time_sk bigint ss_item_sk bigint ss_customer_sk bigint ss_cdemo_sk bigint ss_hdemo_sk bigint ss_addr_sk bigint ss_store_sk bigint ss_promo_sk bigint ss_ticket_numberbigint ss_quantity int ss_wholesale_cost double ss_list_price double ss_sales_price double ss_ext_discount_amt double ss_ext_sales_price double ss_ext_wholesale_cost double ss_ext_list_price double ss_ext_tax double ss_coupon_amt double ss_net_paid double ss_net_paid_inc_tax double ss_net_profit double # Partition Information # col_name data_type comment ss_sold_date_sk bigint # Detailed Table Information Database: tpcds_bin_partitioned_parquet_200 Owner: root CreateTime: Tue Jun 06 11:51:48 CST 2017 LastAccessTime: UNKNOWN Retention: 0 Location: hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales Table Type: MANAGED_TABLE Table Parameters: COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"} numFiles2023 numPartitions 1824 numRows 575995635 rawDataSize 12671903970 totalSize 46465926745 transient_lastDdlTime 1496721108 {code} the rawDataSize is nearly 12G while the totalSize is nearly 46G. view the original data on hdfs {noformat} #hadoop fs -du -h /tmp/tpcds-generate/200/ 75.8 G /tmp/tpcds-generate/200/store_sales {noformat} view the parquet file on hdfs {noformat} # hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db 43.3 G /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales {noformat} It seems that the rawDataSize is nearly 75G but in "describe formatted store_sales" command, it shows only 12G. 
I tried to use "analyze table store_sales compute statistics for columns" to update the statistics but there is no change for RAWDATASIZE; I tried to use "analyze table store_sales partition(ss_sold_date_sk) compute statistics no scan" to update the statistics but fail, the error is {code} {code} was: on TPC-DS 200g scale store_sales use "describe formatted store_sales" to view the statistics {code} hive> describe formatted store_sales; OK # col_name data_type comment ss_sold_time_sk bigint ss_item_sk bigint ss_customer_sk bigint ss_cdemo_sk bigint ss_hdemo_sk bigint ss_addr_sk bigint ss_store_sk bigint ss_promo_sk bigint ss_ticket_numberbigint ss_quantity int ss_wholesale_cost double ss_list_price double
[jira] [Updated] (HIVE-17182) Invalid statistics like "RAW DATA SIZE" info for parquet file
[ https://issues.apache.org/jira/browse/HIVE-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17182: Description: on TPC-DS 200g scale store_sales use "describe formatted store_sales" to view the statistics {code} hive> describe formatted store_sales; OK # col_name data_type comment ss_sold_time_sk bigint ss_item_sk bigint ss_customer_sk bigint ss_cdemo_sk bigint ss_hdemo_sk bigint ss_addr_sk bigint ss_store_sk bigint ss_promo_sk bigint ss_ticket_numberbigint ss_quantity int ss_wholesale_cost double ss_list_price double ss_sales_price double ss_ext_discount_amt double ss_ext_sales_price double ss_ext_wholesale_cost double ss_ext_list_price double ss_ext_tax double ss_coupon_amt double ss_net_paid double ss_net_paid_inc_tax double ss_net_profit double # Partition Information # col_name data_type comment ss_sold_date_sk bigint # Detailed Table Information Database: tpcds_bin_partitioned_parquet_200 Owner: root CreateTime: Tue Jun 06 11:51:48 CST 2017 LastAccessTime: UNKNOWN Retention: 0 Location: hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales Table Type: MANAGED_TABLE Table Parameters: COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"} numFiles2023 numPartitions 1824 numRows 575995635 rawDataSize 12671903970 totalSize 46465926745 transient_lastDdlTime 1496721108 {code} the rawDataSize is nearly 12G while the totalSize is nearly 46G. view the original data on hdfs {noformat} #hadoop fs -du -h /tmp/tpcds-generate/200/ 75.8 G /tmp/tpcds-generate/200/store_sales {noformat} view the parquet file on hdfs {noformat} # hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db 43.3 G /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales {noformat} It seems that the rawDataSize is nearly 75G but in "describe formatted store_sales" command, it shows only 12G. 
I tried to use "analyze table store_sales compute statistics for columns" to update the statistics but there is no change for RAWDATASIZE; I tried to use "analyze table store_sales partition() compute statistics no scan" to update the statistics but fail, the error is {code} FAILED: SemanticException [Error 10115]: Table is partitioned and partition specification is needed {code} was: on TPC-DS 200g scale store_sales use "describe formatted store_sales" to view the statistics {code} hive> describe formatted store_sales; OK # col_name data_type comment ss_sold_time_sk bigint ss_item_sk bigint ss_customer_sk bigint ss_cdemo_sk bigint ss_hdemo_sk bigint ss_addr_sk bigint ss_store_sk bigint ss_promo_sk bigint ss_ticket_numberbigint ss_quantity int ss_wholesale_cost double
[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17486: Description: in HIVE-16602, Implement shared scans with Tez. Given a query plan, the goal is to identify scans on input tables that can be merged so the data is read only once. Optimization will be carried out at the physical level. In Hive on Spark, it caches the result ofsSpark work if the spark work is used by more than 1 child spark work. After sharedWorkOptimizer is enabled in physical plan in HoS, the identical table scans are merged to 1 table scan. This result of table scan will be used by more 1 child spark work. Thus we need not do the same computation because of cache mechanism. was: in HIVE-16602, Implement shared scans with Tez. Given a query plan, the goal is to identify scans on input tables that can be merged so the data is read only once. Optimization will be carried out at the physical level. > Enable SharedWorkOptimizer in tez on HOS > > > Key: HIVE-17486 > URL: https://issues.apache.org/jira/browse/HIVE-17486 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > > in HIVE-16602, Implement shared scans with Tez. > Given a query plan, the goal is to identify scans on input tables that can be > merged so the data is read only once. Optimization will be carried out at the > physical level. In Hive on Spark, it caches the result ofsSpark work if the > spark work is used by more than 1 child spark work. After sharedWorkOptimizer > is enabled in physical plan in HoS, the identical table scans are merged to 1 > table scan. This result of table scan will be used by more 1 child spark > work. Thus we need not do the same computation because of cache mechanism. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17486: Description: in HIVE-16602, Implement shared scans with Tez. Given a query plan, the goal is to identify scans on input tables that can be merged so the data is read only once. Optimization will be carried out at the physical level. In Hive on Spark, it caches the result of spark work if the spark work is used by more than 1 child spark work. After sharedWorkOptimizer is enabled in physical plan in HoS, the identical table scans are merged to 1 table scan. This result of table scan will be used by more 1 child spark work. Thus we need not do the same computation because of cache mechanism. was: in HIVE-16602, Implement shared scans with Tez. Given a query plan, the goal is to identify scans on input tables that can be merged so the data is read only once. Optimization will be carried out at the physical level. In Hive on Spark, it caches the result ofsSpark work if the spark work is used by more than 1 child spark work. After sharedWorkOptimizer is enabled in physical plan in HoS, the identical table scans are merged to 1 table scan. This result of table scan will be used by more 1 child spark work. Thus we need not do the same computation because of cache mechanism. > Enable SharedWorkOptimizer in tez on HOS > > > Key: HIVE-17486 > URL: https://issues.apache.org/jira/browse/HIVE-17486 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > > in HIVE-16602, Implement shared scans with Tez. > Given a query plan, the goal is to identify scans on input tables that can be > merged so the data is read only once. Optimization will be carried out at the > physical level. In Hive on Spark, it caches the result of spark work if the > spark work is used by more than 1 child spark work. 
After sharedWorkOptimizer > is enabled in physical plan in HoS, the identical table scans are merged to 1 > table scan. This result of table scan will be used by more 1 child spark > work. Thus we need not do the same computation because of cache mechanism. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17545) Make HoS RDD Cacheing Optimization Configurable
[ https://issues.apache.org/jira/browse/HIVE-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16180288#comment-16180288 ] liyunzhang_intel commented on HIVE-17545: - [~lirui]: thanks for the explanation. If the cache is disabled, then even when equivalent works are combined, the computation for the same work is still executed. > Make HoS RDD Cacheing Optimization Configurable > --- > > Key: HIVE-17545 > URL: https://issues.apache.org/jira/browse/HIVE-17545 > Project: Hive > Issue Type: Improvement > Components: Physical Optimizer, Spark >Reporter: Sahil Takiar >Assignee: Sahil Takiar > Attachments: HIVE-17545.1.patch, HIVE-17545.2.patch > > > The RDD cacheing optimization add in HIVE-10550 is enabled by default. We > should make it configurable in case users want to disable it. We can leave it > on by default to preserve backwards compatibility. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17545) Make HoS RDD Cacheing Optimization Configurable
[ https://issues.apache.org/jira/browse/HIVE-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16180135#comment-16180135 ] liyunzhang_intel commented on HIVE-17545: - [~lirui]: {quote} if user turns on combining equivalent works and turns off RDD caching, then there won't be perf improvement right? {quote} if users turn on combining equivalent works, duplicated map/reduce works will be removed. The performance will not change whether RDD caching is enabled or not. In HoS, the cache is enabled only when the parent spark work has more than [1 child|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L264]. If my understanding is not right, please tell me. > Make HoS RDD Cacheing Optimization Configurable > --- > > Key: HIVE-17545 > URL: https://issues.apache.org/jira/browse/HIVE-17545 > Project: Hive > Issue Type: Improvement > Components: Physical Optimizer, Spark >Reporter: Sahil Takiar >Assignee: Sahil Takiar > Attachments: HIVE-17545.1.patch, HIVE-17545.2.patch > > > The RDD cacheing optimization add in HIVE-10550 is enabled by default. We > should make it configurable in case users want to disable it. We can leave it > on by default to preserve backwards compatibility. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
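The caching condition described above — a spark work's output RDD is cached only when that work feeds more than one child work — can be sketched as follows. This is a hypothetical model of the SparkPlanGenerator check, not the real API; the vertex names used in the test are taken from the example explain plan earlier in this thread, where Map 1 feeds both Reducer 2 and Reducer 6.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the HoS caching decision discussed above: a
// work's result RDD is cached only when the work has more than one child
// work, so combining equivalent works produces exactly the multi-child
// shape that triggers the cache.
public class CacheDecisionSketch {
    static boolean shouldCache(Map<String, List<String>> childrenOf, String work) {
        List<String> children = childrenOf.getOrDefault(work, List.of());
        return children.size() > 1;
    }
}
```

This also explains the comment above: with caching disabled, a combined work feeding two children is recomputed per child even though the plan shape is shared.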
[jira] [Resolved] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel resolved HIVE-17474. - Resolution: Not A Problem > Poor Performance about subquery like DS/query70 on HoS > -- > > Key: HIVE-17474 > URL: https://issues.apache.org/jira/browse/HIVE-17474 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel > Attachments: explain.70.after.analyze, explain.70.before.analyze, > explain.70.vec > > > in > [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql]. > {code} > select > sum(ss_net_profit) as total_sum >,s_state >,s_county >,grouping__id as lochierarchy >, rank() over(partition by grouping__id, case when grouping__id == 2 then > s_state end order by sum(ss_net_profit)) as rank_within_parent > from > store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk > join store s on s.s_store_sk = ss.ss_store_sk > where > d1.d_month_seq between 1193 and 1193+11 > and s.s_state in > ( select s_state >from (select s_state as s_state, sum(ss_net_profit), > rank() over ( partition by s_state order by > sum(ss_net_profit) desc) as ranking > from store_sales, store, date_dim > where d_month_seq between 1193 and 1193+11 > and date_dim.d_date_sk = > store_sales.ss_sold_date_sk > and store.s_store_sk = store_sales.ss_store_sk > group by s_state > ) tmp1 >where ranking <= 5 > ) > group by s_state,s_county with rollup > order by >lochierarchy desc > ,case when lochierarchy = 0 then s_state end > ,rank_within_parent > limit 100; > {code} > let's analyze the query, > part1: it calculates the sub-query and get the result of the state which > ss_net_profit is less than 5. > part2: big table store_sales join small tables date_dim, store and get the > result. > part3: part1 join part2 > the problem is on the part3, this is common join. 
The cardinality of part1 > and part2 is low, as there are not many distinct values for state ( > actually there are 30 different values in the table store). If we use common > join, big data will go to the 30 reducers. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17545) Make HoS RDD Cacheing Optimization Configurable
[ https://issues.apache.org/jira/browse/HIVE-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16180063#comment-16180063 ] liyunzhang_intel commented on HIVE-17545: - [~stakiar]: sounds good. But i don't know why cache optimization was not configurable before. [~lirui]: As you are more familiar with the code, can you take some time to look? > Make HoS RDD Cacheing Optimization Configurable > --- > > Key: HIVE-17545 > URL: https://issues.apache.org/jira/browse/HIVE-17545 > Project: Hive > Issue Type: Improvement > Components: Physical Optimizer, Spark >Reporter: Sahil Takiar >Assignee: Sahil Takiar > Attachments: HIVE-17545.1.patch, HIVE-17545.2.patch > > > The RDD cacheing optimization add in HIVE-10550 is enabled by default. We > should make it configurable in case users want to disable it. We can leave it > on by default to preserve backwards compatibility. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17545) Make HoS RDD Cacheing Optimization Configurable
[ https://issues.apache.org/jira/browse/HIVE-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16178604#comment-16178604 ] liyunzhang_intel commented on HIVE-17545: - [~stakiar]: why need to make RDD caching optimization configurable? Is there any problem or performance degradation if enable rdd cache optimization? > Make HoS RDD Cacheing Optimization Configurable > --- > > Key: HIVE-17545 > URL: https://issues.apache.org/jira/browse/HIVE-17545 > Project: Hive > Issue Type: Improvement > Components: Physical Optimizer, Spark >Reporter: Sahil Takiar >Assignee: Sahil Takiar > Attachments: HIVE-17545.1.patch, HIVE-17545.2.patch > > > The RDD cacheing optimization add in HIVE-10550 is enabled by default. We > should make it configurable in case users want to disable it. We can leave it > on by default to preserve backwards compatibility. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17565) NullPointerException occurs when hive.optimize.skewjoin and hive.auto.convert.join are switched on at the same time
[ https://issues.apache.org/jira/browse/HIVE-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175846#comment-16175846 ] liyunzhang_intel commented on HIVE-17565: - i can reproduce it in Hive on MR in commit(fafa953), will investigate it later. > NullPointerException occurs when hive.optimize.skewjoin and > hive.auto.convert.join are switched on at the same time > --- > > Key: HIVE-17565 > URL: https://issues.apache.org/jira/browse/HIVE-17565 > Project: Hive > Issue Type: Bug >Affects Versions: 1.2.1 >Reporter: Xin Hao >Assignee: liyunzhang_intel > > (A)NullPointerException occurs when hive.optimize.skewjoin and > hive.auto.convert.join are switched on at the same time. > Could pass when hive.optimize.skewjoin=true and hive.auto.convert.join=false. > (B)Hive Version: > Found on Apache Hive 1.2.1 > (C)Workload: > (1)TPCx-BB Q19 > (2) A small case as below,which was actually simplified from Q19: > SELECT * > FROM store_returns sr, > ( > SELECT d1.d_date_sk > FROM date_dim d1, date_dim d2 > WHERE d1.d_week_seq = d2.d_week_seq > ) sr_dateFilter > WHERE sr.sr_returned_date_sk = d_date_sk; > (D)Exception Error Message: > Error: java.lang.RuntimeException: java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:179) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) > at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796) > at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:194) > at > 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.cleanUpInputFileChangedOp(MapJoinOperator.java:223) > at > org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1051) > at > org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1055) > at > org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1055) > at > org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:490) > at > org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170) > ... 8 more -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (HIVE-17565) NullPointerException occurs when hive.optimize.skewjoin and hive.auto.convert.join are switched on at the same time
[ https://issues.apache.org/jira/browse/HIVE-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel reassigned HIVE-17565: --- Assignee: liyunzhang_intel > NullPointerException occurs when hive.optimize.skewjoin and > hive.auto.convert.join are switched on at the same time > --- > > Key: HIVE-17565 > URL: https://issues.apache.org/jira/browse/HIVE-17565 > Project: Hive > Issue Type: Bug >Affects Versions: 1.2.1 >Reporter: Xin Hao >Assignee: liyunzhang_intel > > (A)NullPointerException occurs when hive.optimize.skewjoin and > hive.auto.convert.join are switched on at the same time. > Could pass when hive.optimize.skewjoin=true and hive.auto.convert.join=false. > (B)Hive Version: > Found on Apache Hive 1.2.1 > (C)Workload: > (1)TPCx-BB Q19 > (2) A small case as below,which was actually simplified from Q19: > SELECT * > FROM store_returns sr, > ( > SELECT d1.d_date_sk > FROM date_dim d1, date_dim d2 > WHERE d1.d_week_seq = d2.d_week_seq > ) sr_dateFilter > WHERE sr.sr_returned_date_sk = d_date_sk; > (D)Exception Error Message: > Error: java.lang.RuntimeException: java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:179) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) > at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796) > at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:194) > at > org.apache.hadoop.hive.ql.exec.MapJoinOperator.cleanUpInputFileChangedOp(MapJoinOperator.java:223) 
> at > org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1051) > at > org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1055) > at > org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1055) > at > org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:490) > at > org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170) > ... 8 more -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17565) NullPointerException occurs when hive.optimize.skewjoin and hive.auto.convert.join are switched on at the same time
[ https://issues.apache.org/jira/browse/HIVE-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174112#comment-16174112 ] liyunzhang_intel commented on HIVE-17565: - HaoXin: this happens on Hive on MR or Hive on Spark? > NullPointerException occurs when hive.optimize.skewjoin and > hive.auto.convert.join are switched on at the same time > --- > > Key: HIVE-17565 > URL: https://issues.apache.org/jira/browse/HIVE-17565 > Project: Hive > Issue Type: Bug >Affects Versions: 1.2.1 >Reporter: Xin Hao > > (A)NullPointerException occurs when hive.optimize.skewjoin and > hive.auto.convert.join are switched on at the same time. > Could pass when hive.optimize.skewjoin=true and hive.auto.convert.join=false. > (B)Hive Version: > Found on Apache Hive 1.2.1 > (C)Workload: > (1)TPCx-BB Q19 > (2) A small case as below,which was actually simplified from Q19: > SELECT * > FROM store_returns sr, > ( > SELECT d1.d_date_sk > FROM date_dim d1, date_dim d2 > WHERE d1.d_week_seq = d2.d_week_seq > ) sr_dateFilter > WHERE sr.sr_returned_date_sk = d_date_sk; > (D)Exception Error Message: > Error: java.lang.RuntimeException: java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:179) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) > at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796) > at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:194) > at > 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.cleanUpInputFileChangedOp(MapJoinOperator.java:223) > at > org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1051) > at > org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1055) > at > org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1055) > at > org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:490) > at > org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170) > ... 8 more -- This message was sent by Atlassian JIRA (v6.4.14#64029)
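A practical workaround follows directly from the description above: the NullPointerException appears only when both flags are enabled together, so runtime skew handling can be kept while automatic map-join conversion is disabled. The settings below are exactly the ones named in the report:

{code}
-- Works: skew join optimization without automatic map-join conversion
set hive.optimize.skewjoin=true;
set hive.auto.convert.join=false;

-- Fails with NullPointerException in MapJoinOperator.loadHashTable:
-- set hive.optimize.skewjoin=true;
-- set hive.auto.convert.join=true;
{code}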
[jira] [Commented] (HIVE-16602) Implement shared scans with Tez
[ https://issues.apache.org/jira/browse/HIVE-16602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172918#comment-16172918 ] liyunzhang_intel commented on HIVE-16602: - [~jcamachorodriguez]: thanks for your reply. {quote} ...it appears multiple times in the query. {quote} I mean the TS (TableScan) is used more than once in the query, so the shared scan optimization should apply. I tested this at 10g scale with TPC-DS queries like [query7|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query17.sql],[query70 |https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql ], but did not see a big improvement. I guess the reason may be that the data scale is small, so not much time is saved even though fewer TableScans are executed. ||query||Before HIVE-16602||With HIVE-16602|| |query7 |53.677s|51.934s | |query70 |46.951s| 47.48s| > Implement shared scans with Tez > --- > > Key: HIVE-16602 > URL: https://issues.apache.org/jira/browse/HIVE-16602 > Project: Hive > Issue Type: New Feature > Components: Physical Optimizer >Affects Versions: 3.0.0 >Reporter: Jesus Camacho Rodriguez >Assignee: Jesus Camacho Rodriguez > Labels: TODOC3.0 > Fix For: 3.0.0 > > Attachments: HIVE-16602.01.patch, HIVE-16602.02.patch, > HIVE-16602.03.patch, HIVE-16602.04.patch, HIVE-16602.patch > > > Given a query plan, the goal is to identify scans on input tables that can be > merged so the data is read only once. Optimization will be carried out at the > physical level. > In the longer term, identification of equivalent expressions and > reutilization of intermediary results should be done at the logical layer via > Spool operator. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-16602) Implement shared scans with Tez
[ https://issues.apache.org/jira/browse/HIVE-16602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171349#comment-16171349 ] liyunzhang_intel commented on HIVE-16602: - [~jcamachorodriguez]: I am evaluating the performance improvement of HIVE-16602 on Tez. I am using TPC-DS to compare the execution time of builds without and with HIVE-16602 at 10g data scale. I expect an improvement from this feature, since it loads a table only once even if the table appears multiple times in the query. Have you done any benchmark tests on this feature? > Implement shared scans with Tez > --- > > Key: HIVE-16602 > URL: https://issues.apache.org/jira/browse/HIVE-16602 > Project: Hive > Issue Type: New Feature > Components: Physical Optimizer >Affects Versions: 3.0.0 >Reporter: Jesus Camacho Rodriguez >Assignee: Jesus Camacho Rodriguez > Labels: TODOC3.0 > Fix For: 3.0.0 > > Attachments: HIVE-16602.01.patch, HIVE-16602.02.patch, > HIVE-16602.03.patch, HIVE-16602.04.patch, HIVE-16602.patch > > > Given a query plan, the goal is to identify scans on input tables that can be > merged so the data is read only once. Optimization will be carried out at the > physical level. > In the longer term, identification of equivalent expressions and > reutilization of intermediary results should be done at the logical layer via > Spool operator. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171184#comment-16171184 ] liyunzhang_intel edited comment on HIVE-17474 at 9/19/17 6:47 AM: -- I found that we need execute "analyze table xxx compute statistics for columns" before executing the query. Attach the different explain([before_analyze|https://issues.apache.org/jira/secure/attachment/12887836/explain.70.before.analyze],[after_analyze|https://issues.apache.org/jira/secure/attachment/12887837/explain.70.after.analyze] ) give an example to show the influence of column statistics {code}(select s_state as s_state, sum(ss_net_profit), rank() over ( partition by s_state order by sum(ss_net_profit) desc) as ranking from store_sales, store, date_dim where d_month_seq between 1193 and 1193+11 and date_dim.d_date_sk = store_sales.ss_sold_date_sk and store.s_store_sk = store_sales.ss_store_sk group by s_state ) {code} before compute column statistics {code} Map 9 Map Operator Tree: TableScan alias: store_sales filterExpr: (ss_store_sk is not null and ss_sold_date_sk is not null) (type: boolean) Statistics: Num rows: 27504814 Data size: 825144420 Basic stats: COMPLETE Column stats: PARTIAL Filter Operator predicate: ss_store_sk is not null (type: boolean) Statistics: Num rows: 27504814 Data size: 220038512 Basic stats: COMPLETE Column stats: PARTIAL Select Operator expressions: ss_store_sk (type: bigint), ss_net_profit (type: double), ss_sold_date_sk (type: bigint) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 27504814 Data size: 220038512 Basic stats: COMPLETE Column stats: PARTIAL Map Join Operator condition map: Inner Join 0 to 1 keys: 0 _col0 (type: bigint) 1 _col0 (type: bigint) outputColumnNames: _col1, _col2, _col4 input vertices: 1 Map 12 Statistics: Num rows: 30255296 Data size: 242042368 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 _col2 (type: bigint) 1 _col0 (type: 
bigint) outputColumnNames: _col1, _col4 input vertices: 1 Map 13 Statistics: Num rows: 33280826 Data size: 266246610 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col4 (type: string), _col1 (type: double) outputColumnNames: _col4, _col1 Statistics: Num rows: 33280826 Data size: 266246610 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: sum(_col1) keys: _col4 (type: string) mode: hash outputColumnNames: _col0, _col1 Statistics: Num rows: 33280826 Data size: 266246610 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 33280826 Data size: 266246610 Basic stats: COMPLETE Column stats: NONE value expressions: _col1 (type: double) {code} the data size is 266246610 After computing column statistics {code} Map 7 Map Operator Tree: TableScan alias: store_sales filterExpr: (ss_store_sk is not null and ss_sold_date_sk is not null) (type: boolean) Statistics: Num rows: 27504814 Data size: 649740104 Basic stats: COMPLETE Column stats: PARTIAL Filter Operator predicate: ss_store_sk is not null (type: boolean) Statistics: Num rows: 26856871 Data size: 634433888 Basic stats: COMPLETE Column stats: PARTIAL Select Operator
[jira] [Updated] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17474: Attachment: explain.70.after.analyze explain.70.before.analyze > Poor Performance about subquery like DS/query70 on HoS > -- > > Key: HIVE-17474 > URL: https://issues.apache.org/jira/browse/HIVE-17474 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel > Attachments: explain.70.after.analyze, explain.70.before.analyze, > explain.70.vec > > > in > [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql]. > {code} > select > sum(ss_net_profit) as total_sum >,s_state >,s_county >,grouping__id as lochierarchy >, rank() over(partition by grouping__id, case when grouping__id == 2 then > s_state end order by sum(ss_net_profit)) as rank_within_parent > from > store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk > join store s on s.s_store_sk = ss.ss_store_sk > where > d1.d_month_seq between 1193 and 1193+11 > and s.s_state in > ( select s_state >from (select s_state as s_state, sum(ss_net_profit), > rank() over ( partition by s_state order by > sum(ss_net_profit) desc) as ranking > from store_sales, store, date_dim > where d_month_seq between 1193 and 1193+11 > and date_dim.d_date_sk = > store_sales.ss_sold_date_sk > and store.s_store_sk = store_sales.ss_store_sk > group by s_state > ) tmp1 >where ranking <= 5 > ) > group by s_state,s_county with rollup > order by >lochierarchy desc > ,case when lochierarchy = 0 then s_state end > ,rank_within_parent > limit 100; > {code} > let's analyze the query, > part1: it calculates the sub-query and get the result of the state which > ss_net_profit is less than 5. > part2: big table store_sales join small tables date_dim, store and get the > result. > part3: part1 join part2 > the problem is on the part3, this is common join. 
The cardinality of part1 > and part2 is low as there are not very different values about states( > actually there are 30 different values in the table store). If use common > join, big data will go to the 30 reducers. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171184#comment-16171184 ] liyunzhang_intel commented on HIVE-17474: - I found that we need to execute "analyze table xxx compute statistics for columns" before executing the query. Attach the different explain before and after analyze statistics. give an example to show the influence of column statistics {code}(select s_state as s_state, sum(ss_net_profit), rank() over ( partition by s_state order by sum(ss_net_profit) desc) as ranking from store_sales, store, date_dim where d_month_seq between 1193 and 1193+11 and date_dim.d_date_sk = store_sales.ss_sold_date_sk and store.s_store_sk = store_sales.ss_store_sk group by s_state ) {code} before compute column statistics {code} Map 9 Map Operator Tree: TableScan alias: store_sales filterExpr: (ss_store_sk is not null and ss_sold_date_sk is not null) (type: boolean) Statistics: Num rows: 27504814 Data size: 825144420 Basic stats: COMPLETE Column stats: PARTIAL Filter Operator predicate: ss_store_sk is not null (type: boolean) Statistics: Num rows: 27504814 Data size: 220038512 Basic stats: COMPLETE Column stats: PARTIAL Select Operator expressions: ss_store_sk (type: bigint), ss_net_profit (type: double), ss_sold_date_sk (type: bigint) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 27504814 Data size: 220038512 Basic stats: COMPLETE Column stats: PARTIAL Map Join Operator condition map: Inner Join 0 to 1 keys: 0 _col0 (type: bigint) 1 _col0 (type: bigint) outputColumnNames: _col1, _col2, _col4 input vertices: 1 Map 12 Statistics: Num rows: 30255296 Data size: 242042368 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 _col2 (type: bigint) 1 _col0 (type: bigint) outputColumnNames: _col1, _col4 input vertices: 1 Map 13 Statistics: Num rows: 33280826 Data size: 266246610 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: 
_col4 (type: string), _col1 (type: double) outputColumnNames: _col4, _col1 Statistics: Num rows: 33280826 Data size: 266246610 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: sum(_col1) keys: _col4 (type: string) mode: hash outputColumnNames: _col0, _col1 Statistics: Num rows: 33280826 Data size: 266246610 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 33280826 Data size: 266246610 Basic stats: COMPLETE Column stats: NONE value expressions: _col1 (type: double) {code} the data size is 266246610 After computing column statistics {code} Map 7 Map Operator Tree: TableScan alias: store_sales filterExpr: (ss_store_sk is not null and ss_sold_date_sk is not null) (type: boolean) Statistics: Num rows: 27504814 Data size: 649740104 Basic stats: COMPLETE Column stats: PARTIAL Filter Operator predicate: ss_store_sk is not null (type: boolean) Statistics: Num rows: 26856871 Data size: 634433888 Basic stats: COMPLETE Column stats: PARTIAL Select Operator expressions: ss_store_sk (type: bigint), ss_net_profit (type: double), ss_sold_date_sk (type: bigint) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 26856871 Data
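The "analyze table xxx compute statistics for columns" step mentioned in the comment above can be run per table before issuing the query. A sketch for the tables involved (table names are taken from the quoted sub-query; the syntax is standard HiveQL):

{code}
-- basic row-count/size statistics
ANALYZE TABLE store_sales COMPUTE STATISTICS;

-- column-level statistics, which change the row-count and data-size
-- estimates shown in the explain plans above
ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS;
ANALYZE TABLE store COMPUTE STATISTICS FOR COLUMNS;
ANALYZE TABLE date_dim COMPUTE STATISTICS FOR COLUMNS;
{code}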
[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167441#comment-16167441 ] liyunzhang_intel commented on HIVE-17486: - The reason why CombineEquivalentWorkResolver does not consider Map 1 equal to Map 5, or Map 4 equal to Map 7, is the following. When comparing Map 4 and Map 7: Map4 {code} TS[2]-SEL[3]-RS[13] {code} Map7 {code} TS[6]-SEL[7]-RS[9] {code} The comparison of RS\[13\] and RS\[9\] returns not-equal at [ExprNodeColumnDesc#isSame|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeColumnDesc.java#L181]. {code} if ( tabAlias != null && dest.tabAlias != null ) { if ( !tabAlias.equals(dest.tabAlias) ) { return false; } } {code} Here {{tabAlias}} is {{$hdt$_1}} while {{dest.tabAlias}} is {{$hdt$_3}}, but {{$hdt$_1}} and {{$hdt$_3}} actually point to the same table {{test2}}. > Enable SharedWorkOptimizer in tez on HOS > > > Key: HIVE-17486 > URL: https://issues.apache.org/jira/browse/HIVE-17486 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > > in HIVE-16602, Implement shared scans with Tez. > Given a query plan, the goal is to identify scans on input tables that can be > merged so the data is read only once. Optimization will be carried out at the > physical level. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167233#comment-16167233 ] liyunzhang_intel commented on HIVE-17474: - I enlarged the map join threshold size so that Hive treats part1 as a small table (at runtime, part1 is in fact very small). After that the execution plan changed, and the execution time on 3 TB data was reduced from 12 minutes to 78 seconds. For cases like this, where the join keys have low cardinality, a map join may be the best solution. > Poor Performance about subquery like DS/query70 on HoS > -- > > Key: HIVE-17474 > URL: https://issues.apache.org/jira/browse/HIVE-17474 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel > Attachments: explain.70.vec > > > in > [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql]. > {code} > select > sum(ss_net_profit) as total_sum >,s_state >,s_county >,grouping__id as lochierarchy >, rank() over(partition by grouping__id, case when grouping__id == 2 then > s_state end order by sum(ss_net_profit)) as rank_within_parent > from > store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk > join store s on s.s_store_sk = ss.ss_store_sk > where > d1.d_month_seq between 1193 and 1193+11 > and s.s_state in > ( select s_state >from (select s_state as s_state, sum(ss_net_profit), > rank() over ( partition by s_state order by > sum(ss_net_profit) desc) as ranking > from store_sales, store, date_dim > where d_month_seq between 1193 and 1193+11 > and date_dim.d_date_sk = > store_sales.ss_sold_date_sk > and store.s_store_sk = store_sales.ss_store_sk > group by s_state > ) tmp1 >where ranking <= 5 > ) > group by s_state,s_county with rollup > order by >lochierarchy desc > ,case when lochierarchy = 0 then s_state end > ,rank_within_parent > limit 100; > {code} > let's analyze the query, > part1: it calculates the sub-query and get the result of the state which > ss_net_profit is less than 5.
> part2: big table store_sales join small tables date_dim, store and get the > result. > part3: part1 join part2 > the problem is on the part3, this is common join. The cardinality of part1 > and part2 is low as there are not very different values about states( > actually there are 30 different values in the table store). If use common > join, big data will go to the 30 reducers. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
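The comment above does not spell out which threshold was enlarged. Assuming the standard Hive auto map-join settings (the size value below is illustrative, not the one actually used), the change would look roughly like this:

{code}
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- Raise the small-table size threshold so that part1 qualifies for a map join.
-- Illustrative value; tune it to exceed the optimizer's size estimate of part1.
set hive.auto.convert.join.noconditionaltask.size=1000000000;
{code}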
[jira] [Commented] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164180#comment-16164180 ] liyunzhang_intel commented on HIVE-17474: - [~lirui]: thanks for the reply. I am debugging whether there is a problem with the statistics. By the way, can we solve the problem by converting the common join to a skew join? As every key in part2 carries a large amount of data and there are very few distinct keys (fewer than 30), can we treat this as a skew case? I tried setting hive.optimize.skewjoin to true and hive.skewjoin.key to 10, but it seems to have no effect, which I find curious. From the doc: {code} A join B on A.id=B.id And A skews for id=1. Then we perform the following two joins: 1. A join B on A.id=B.id and A.id!=1 2. A join B on A.id=B.id and A.id=1 If B doesn’t skew on id=1, then #2 will be a map join. {code} I think that after enabling skew join, all keys in part2 would be treated as skewed keys, so part2 would map-join with part1. > Poor Performance about subquery like DS/query70 on HoS > -- > > Key: HIVE-17474 > URL: https://issues.apache.org/jira/browse/HIVE-17474 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel > Attachments: explain.70.vec > > > in > [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
> {code} > select > sum(ss_net_profit) as total_sum >,s_state >,s_county >,grouping__id as lochierarchy >, rank() over(partition by grouping__id, case when grouping__id == 2 then > s_state end order by sum(ss_net_profit)) as rank_within_parent > from > store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk > join store s on s.s_store_sk = ss.ss_store_sk > where > d1.d_month_seq between 1193 and 1193+11 > and s.s_state in > ( select s_state >from (select s_state as s_state, sum(ss_net_profit), > rank() over ( partition by s_state order by > sum(ss_net_profit) desc) as ranking > from store_sales, store, date_dim > where d_month_seq between 1193 and 1193+11 > and date_dim.d_date_sk = > store_sales.ss_sold_date_sk > and store.s_store_sk = store_sales.ss_store_sk > group by s_state > ) tmp1 >where ranking <= 5 > ) > group by s_state,s_county with rollup > order by >lochierarchy desc > ,case when lochierarchy = 0 then s_state end > ,rank_within_parent > limit 100; > {code} > let's analyze the query, > part1: it calculates the sub-query and get the result of the state which > ss_net_profit is less than 5. > part2: big table store_sales join small tables date_dim, store and get the > result. > part3: part1 join part2 > the problem is on the part3, this is common join. The cardinality of part1 > and part2 is low as there are not very different values about states( > actually there are 30 different values in the table store). If use common > join, big data will go to the 30 reducers. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
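The skew-join attempt described in the comment above, written out as a runnable snippet (the values are the ones the commenter reports trying; the default for hive.skewjoin.key is much larger, so 10 makes nearly every key count as skewed):

{code}
set hive.optimize.skewjoin=true;
-- number of rows sharing a join key beyond which the key is treated as skewed
set hive.skewjoin.key=10;
{code}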
[jira] [Commented] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16163762#comment-16163762 ] liyunzhang_intel commented on HIVE-17474: - [~lirui] , [~xuefuz]: after debugging in Tez, I found that part2 join part1 is a common merge join (CommonMergeJoinOperator). {code} Reducer 2 Reduce Operator Tree: Merge Join Operator condition map: Inner Join 0 to 1 keys: 0 _col7 (type: string) 1 _col0 (type: string) {code} The implementation notes of CommonMergeJoin are below. Does Hive on Spark enable CommonMergeJoin? {code} /* * With an aim to consolidate the join algorithms to either hash based joins (MapJoinOperator) or * sort-merge based joins, this operator is being introduced. This operator executes a sort-merge * based algorithm. It replaces both the JoinOperator and the SMBMapJoinOperator for the tez side of * things. It works in either the map phase or reduce phase. * * The basic algorithm is as follows: * * 1. The processOp receives a row from a "big" table. * 2. In order to process it, the operator does a fetch for rows from the other tables. * 3. Once we have a set of rows from the other tables (till we hit a new key), more rows are *brought in from the big table and a join is performed. */ {code} > Poor Performance about subquery like DS/query70 on HoS > -- > > Key: HIVE-17474 > URL: https://issues.apache.org/jira/browse/HIVE-17474 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel > Attachments: explain.70.vec > > > in > [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
> {code} > select > sum(ss_net_profit) as total_sum >,s_state >,s_county >,grouping__id as lochierarchy >, rank() over(partition by grouping__id, case when grouping__id == 2 then > s_state end order by sum(ss_net_profit)) as rank_within_parent > from > store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk > join store s on s.s_store_sk = ss.ss_store_sk > where > d1.d_month_seq between 1193 and 1193+11 > and s.s_state in > ( select s_state >from (select s_state as s_state, sum(ss_net_profit), > rank() over ( partition by s_state order by > sum(ss_net_profit) desc) as ranking > from store_sales, store, date_dim > where d_month_seq between 1193 and 1193+11 > and date_dim.d_date_sk = > store_sales.ss_sold_date_sk > and store.s_store_sk = store_sales.ss_store_sk > group by s_state > ) tmp1 >where ranking <= 5 > ) > group by s_state,s_county with rollup > order by >lochierarchy desc > ,case when lochierarchy = 0 then s_state end > ,rank_within_parent > limit 100; > {code} > let's analyze the query, > part1: it calculates the sub-query and get the result of the state which > ss_net_profit is less than 5. > part2: big table store_sales join small tables date_dim, store and get the > result. > part3: part1 join part2 > the problem is on the part3, this is common join. The cardinality of part1 > and part2 is low as there are not very different values about states( > actually there are 30 different values in the table store). If use common > join, big data will go to the 30 reducers. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162687#comment-16162687 ] liyunzhang_intel commented on HIVE-17474: - after debugging code, i found part2 join part1 is a map join in tez, this is the difference with hive on spark.Will update the detail reason later. > Poor Performance about subquery like DS/query70 on HoS > -- > > Key: HIVE-17474 > URL: https://issues.apache.org/jira/browse/HIVE-17474 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel > Attachments: explain.70.vec > > > in > [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql]. > {code} > select > sum(ss_net_profit) as total_sum >,s_state >,s_county >,grouping__id as lochierarchy >, rank() over(partition by grouping__id, case when grouping__id == 2 then > s_state end order by sum(ss_net_profit)) as rank_within_parent > from > store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk > join store s on s.s_store_sk = ss.ss_store_sk > where > d1.d_month_seq between 1193 and 1193+11 > and s.s_state in > ( select s_state >from (select s_state as s_state, sum(ss_net_profit), > rank() over ( partition by s_state order by > sum(ss_net_profit) desc) as ranking > from store_sales, store, date_dim > where d_month_seq between 1193 and 1193+11 > and date_dim.d_date_sk = > store_sales.ss_sold_date_sk > and store.s_store_sk = store_sales.ss_store_sk > group by s_state > ) tmp1 >where ranking <= 5 > ) > group by s_state,s_county with rollup > order by >lochierarchy desc > ,case when lochierarchy = 0 then s_state end > ,rank_within_parent > limit 100; > {code} > let's analyze the query, > part1: it calculates the sub-query and get the result of the state which > ss_net_profit is less than 5. > part2: big table store_sales join small tables date_dim, store and get the > result. > part3: part1 join part2 > the problem is on the part3, this is common join. 
The cardinality of part1 > and part2 is low as there are not very different values about states( > actually there are 30 different values in the table store). If use common > join, big data will go to the 30 reducers. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17474: Summary: Poor Performance about subquery like DS/query70 on HoS (was: Poor Performance about subquery like DS/query70) > Poor Performance about subquery like DS/query70 on HoS > -- > > Key: HIVE-17474 > URL: https://issues.apache.org/jira/browse/HIVE-17474 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel > Attachments: explain.70.vec > > > in > [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql]. > {code} > select > sum(ss_net_profit) as total_sum >,s_state >,s_county >,grouping__id as lochierarchy >, rank() over(partition by grouping__id, case when grouping__id == 2 then > s_state end order by sum(ss_net_profit)) as rank_within_parent > from > store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk > join store s on s.s_store_sk = ss.ss_store_sk > where > d1.d_month_seq between 1193 and 1193+11 > and s.s_state in > ( select s_state >from (select s_state as s_state, sum(ss_net_profit), > rank() over ( partition by s_state order by > sum(ss_net_profit) desc) as ranking > from store_sales, store, date_dim > where d_month_seq between 1193 and 1193+11 > and date_dim.d_date_sk = > store_sales.ss_sold_date_sk > and store.s_store_sk = store_sales.ss_store_sk > group by s_state > ) tmp1 >where ranking <= 5 > ) > group by s_state,s_county with rollup > order by >lochierarchy desc > ,case when lochierarchy = 0 then s_state end > ,rank_within_parent > limit 100; > {code} > let's analyze the query, > part1: it calculates the sub-query and get the result of the state which > ss_net_profit is less than 5. > part2: big table store_sales join small tables date_dim, store and get the > result. > part3: part1 join part2 > the problem is on the part3, this is common join. 
The cardinality of part1 and part2 is low because there are few distinct state values (actually there are 30 distinct values in the table store). With a common join, the bulk of the data goes to only 30 reducers.
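The skew described above can be sketched in a few lines: a common (shuffle) join hash-partitions rows by the join key, so with only ~30 distinct s_state values at most 30 reducers ever receive data, no matter how much parallelism is configured. A minimal Python sketch (row count and reducer count are hypothetical):

```python
import random
from collections import Counter

# Hypothetical data: 1,000,000 store_sales rows keyed by s_state,
# which has only 30 distinct values (as in the `store` table above).
states = [f"state_{i:02d}" for i in range(30)]
rows = [random.choice(states) for _ in range(1_000_000)]

num_reducers = 200  # hypothetical cluster parallelism

# A common join hash-partitions rows by join key, so every row
# with the same s_state lands on the same reducer.
load = Counter(hash(s) % num_reducers for s in rows)

# At most 30 of the 200 reducers receive any data; the rest stay idle.
assert len(load) <= 30
```

This is why a map join (no shuffle on the low-cardinality key) is the better choice here.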
[jira] [Commented] (HIVE-17474) Poor Performance about subquery like DS/query70
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162659#comment-16162659 ] liyunzhang_intel commented on HIVE-17474: - [~xuefuz], [~lirui]: could you help review the issue above? Thanks!
[jira] [Commented] (HIVE-17474) Poor Performance about subquery like DS/query70
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162657#comment-16162657 ] liyunzhang_intel commented on HIVE-17474: - the execution plan of hive on spark for DS/query70 is [attached|https://issues.apache.org/jira/secure/attachment/12886590/explain.70.vec]. Investigating the problem, I found several points. 1. The statistics for the sub-query are not correct: it estimates nearly 36G for the result, while the actual result is very small (about 30 rows of state info). Because of this, the join between part1 and part2 (see the jira description) is a common join, not a map join. Maybe the statistics estimation needs to be more intelligent for such a complex sub-query. {code} Reducer 12 Reduce Operator Tree: Select Operator expressions: KEY.reducesinkkey0 (type: string), KEY.reducesinkkey1 (type: double) outputColumnNames: _col0, _col1 Statistics: Num rows: 4991930471 Data size: 109822470377 Basic stats: COMPLETE Column stats: NONE PTF Operator Function definitions: Input definition input alias: ptf_0 output shape: _col0: string, _col1: double type: WINDOWING Windowing table definition input alias: ptf_1 name: windowingtablefunction order by: _col1 DESC NULLS LAST partition by: _col0 raw input shape: window functions: window function definition alias: rank_window_0 arguments: _col1 name: rank window function: GenericUDAFRankEvaluator window frame: PRECEDING(MAX)~FOLLOWING(MAX) isPivotResult: true Statistics: Num rows: 4991930471 Data size: 109822470377 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (rank_window_0 <= 5) (type: boolean) Statistics: Num rows: 1663976823 Data size: 36607490111 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: string) outputColumnNames: _col0 Statistics: Num rows: 1663976823 Data size: 36607490111 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: +
Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 1663976823 Data size: 36607490111 Basic stats: COMPLETE Column stats: NONE {code}
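The row-count drop in the plan above matches a fixed selectivity heuristic rather than real data: without usable column statistics, the estimate for the range predicate `rank_window_0 <= 5` appears to be cut by a constant factor of 1/3 (judging by the numbers in the plan), regardless of how selective the filter actually is. A quick check of the figures:

```python
# Row estimates copied from the plan above.
rows_before_filter = 4_991_930_471   # PTF Operator output
rows_after_filter  = 1_663_976_823   # Filter (rank_window_0 <= 5) output

# The estimate is exactly a 1/3 cut of the input, not data-driven:
assert rows_before_filter // 3 == rows_after_filter

# The real output is ~30 rows (one per state), so the estimate is off
# by about eight orders of magnitude, which is why the planner keeps a
# common join instead of converting to a map join.
```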
[jira] [Updated] (HIVE-17474) Poor Performance about subquery like DS/query70
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17474: Attachment: explain.70.vec
[jira] [Issue Comment Deleted] (HIVE-17474) Poor Performance about subquery like DS/query70
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17474: Comment: was deleted (was: After HIVE-15192, the store is converted to a map join, and the logical plan is now always {code} TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60] TS[1]-FIL[64]-RS[5]-JOIN[6] TS[2]-FIL[65]-RS[10]-JOIN[11] TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44] TS[13]-FIL[69]-RS[18]-JOIN[19] TS[14]-FIL[70]-RS[22]-JOIN[23] {code} It is reasonable that the small table store is converted to a map join, so I am closing the jira.) > Poor Performance about subquery like DS/query70 > --- > > Key: HIVE-17474 > URL: https://issues.apache.org/jira/browse/HIVE-17474 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel > > in > [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
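The map-join conversion mentioned in the comment above (HIVE-15192) is what sidesteps the reducer skew: the small `store` table is loaded into an in-memory hash table broadcast to every map task, and the big table is streamed through it on the map side, so there is no shuffle on the low-cardinality key at all. A minimal sketch of the build/probe phases (table contents hypothetical):

```python
# Hypothetical small/big tables keyed on s_state.
store = [("state_00", "county_a"), ("state_01", "county_b")]               # small side
store_sales = [("state_00", 10.0), ("state_01", -3.5), ("state_00", 7.2)]  # big side

# Build phase: hash the small table (this is what gets broadcast to map tasks).
hash_table = {}
for state, county in store:
    hash_table.setdefault(state, []).append(county)

# Probe phase: stream the big table through the hash table;
# no shuffle, so no reducer hotspot on the 30 state values.
joined = [(state, county, profit)
          for state, profit in store_sales
          for county in hash_table.get(state, [])]

assert len(joined) == 3
```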
[jira] [Updated] (HIVE-17474) Poor Performance about subquery like DS/query70
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17474: Summary: Poor Performance about subquery like DS/query70 (was: Different logical plan of same query(TPC-DS/70) with same settings)
[jira] [Updated] (HIVE-17474) Different logical plan of same query(TPC-DS/70) with same settings
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17474: Description: in [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql]. {code} select sum(ss_net_profit) as total_sum ,s_state ,s_county ,grouping__id as lochierarchy , rank() over(partition by grouping__id, case when grouping__id == 2 then s_state end order by sum(ss_net_profit)) as rank_within_parent from store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk join store s on s.s_store_sk = ss.ss_store_sk where d1.d_month_seq between 1193 and 1193+11 and s.s_state in ( select s_state from (select s_state as s_state, sum(ss_net_profit), rank() over ( partition by s_state order by sum(ss_net_profit) desc) as ranking from store_sales, store, date_dim where d_month_seq between 1193 and 1193+11 and date_dim.d_date_sk = store_sales.ss_sold_date_sk and store.s_store_sk = store_sales.ss_store_sk group by s_state ) tmp1 where ranking <= 5 ) group by s_state,s_county with rollup order by lochierarchy desc ,case when lochierarchy = 0 then s_state end ,rank_within_parent limit 100; {code} Let's analyze the query. part1: evaluate the sub-query to get the states whose ss_net_profit ranking is within the top 5. part2: the big table store_sales joins the small tables date_dim and store. part3: part1 join part2. The problem is in part3, which is a common join. The cardinality of part1 and part2 is low because there are few distinct state values (actually there are 30 distinct values in the table store). With a common join, the bulk of the data goes to only 30 reducers. was: in [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
The explain of hive on spark is {code} {code}
[jira] [Updated] (HIVE-17474) Different logical plan of same query(TPC-DS/70) with same settings
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17474: Description: in [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql]. The explain of hive on spark is {code} {code} was: in [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql]. On hive version (d3b88f6), I found that the logical plan differs between runs with the same settings. Sometimes the logical plan is {code} TS[0]-FIL[63]-SEL[2]-RS[43]-JOIN[45]-RS[46]-JOIN[48]-SEL[49]-GBY[50]-RS[51]-GBY[52]-SEL[53]-RS[54]-SEL[55]-PTF[56]-SEL[57]-RS[59]-SEL[60]-LIM[61]-FS[62] TS[3]-FIL[64]-SEL[5]-RS[44]-JOIN[45] TS[6]-FIL[65]-SEL[8]-RS[39]-JOIN[41]-RS[47]-JOIN[48] TS[9]-FIL[67]-SEL[11]-RS[18]-JOIN[20]-RS[21]-JOIN[23]-SEL[24]-GBY[25]-RS[26]-GBY[27]-RS[29]-SEL[30]-PTF[31]-FIL[66]-SEL[32]-GBY[38]-RS[40]-JOIN[41] TS[12]-FIL[68]-SEL[14]-RS[19]-JOIN[20] TS[15]-FIL[69]-SEL[17]-RS[22]-JOIN[23] {code} TS\[6\] connects with TS\[9\] on JOIN\[41\] and with TS\[0\] on JOIN\[48\]. Sometimes it is {code} TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60] TS[1]-FIL[64]-RS[5]-JOIN[6] TS[2]-FIL[65]-RS[10]-JOIN[11] TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44] TS[13]-FIL[69]-RS[18]-JOIN[19] TS[14]-FIL[70]-RS[22]-JOIN[23] {code} TS\[2\] connects with TS\[0\] on JOIN\[11\]. Although TS\[2\] and TS\[6\] have different operator ids, both scan the table store in the query. The difference leads to different spark execution plans and different execution times. I'm confused about why the logical plan differs with the same settings. Does anyone know where to investigate the root cause?
> Different logical plan of same query(TPC-DS/70) with same settings > -- > > Key: HIVE-17474 > URL: https://issues.apache.org/jira/browse/HIVE-17474 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel > > in > [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql]. > The explain of hive on spark is > {code} > {code}
[jira] [Reopened] (HIVE-17474) Different logical plan of same query(TPC-DS/70) with same settings
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel reopened HIVE-17474: -
[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160943#comment-16160943 ] liyunzhang_intel commented on HIVE-17486: - [~stakiar]: thanks for your interest in it. I guess this optimization would have the following effect. {code} set hive.strict.checks.cartesian.product=false; set hive.join.emit.interval=2; set hive.auto.convert.join=false; explain SELECT * FROM ( SELECT test1.key AS key1, test1.value AS value1, test1.col_1 AS col_1, test2.key AS key2, test2.value AS value2, test2.col_2 AS col_2 FROM test1 RIGHT OUTER JOIN test2 ON (test1.value=test2.value AND (test1.key between 100 and 102 OR test2.key between 100 and 102)) ) sq1 FULL OUTER JOIN ( SELECT test1.key AS key3, test1.value AS value3, test1.col_1 AS col_3, test2.key AS key4, test2.value AS value4, test2.col_2 AS col_4 FROM test1 LEFT OUTER JOIN test2 ON (test1.value=test2.value AND (test1.key between 100 and 102 OR test2.key between 100 and 102)) ) sq2 ON (sq1.value1 is null or sq2.value4 is null and sq2.value3 != sq1.value2); {code} the spark explain {code} STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Spark Edges: Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 12), Map 4 (PARTITION-LEVEL SORT, 12) Reducer 3 <- Reducer 2 (PARTITION-LEVEL SORT, 1), Reducer 6 (PARTITION-LEVEL SORT, 1) Reducer 6 <- Map 5 (PARTITION-LEVEL SORT, 12), Map 7 (PARTITION-LEVEL SORT, 12) DagName: root_20170911043433_e314705a-beca-41a0-b28a-c85c5f811a67:1 Vertices: Map 1 Map Operator Tree: TableScan alias: test1 Statistics: Num rows: 6 Data size: 56 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: key (type: int), value (type: int), col_1 (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 6 Data size: 56 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col1 (type: int) sort order: + Map-reduce partition columns: _col1 (type: int)
Statistics: Num rows: 6 Data size: 56 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: int), _col2 (type: string) Map 4 Map Operator Tree: TableScan alias: test2 Statistics: Num rows: 4 Data size: 38 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: key (type: int), value (type: int), col_2 (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 4 Data size: 38 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col1 (type: int) sort order: + Map-reduce partition columns: _col1 (type: int) Statistics: Num rows: 4 Data size: 38 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: int), _col2 (type: string) Map 5 Map Operator Tree: TableScan alias: test1 Statistics: Num rows: 6 Data size: 56 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: key (type: int), value (type: int), col_1 (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 6 Data size: 56 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col1 (type: int) sort order: + Map-reduce partition columns: _col1 (type: int) Statistics: Num rows: 6 Data size: 56 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: int), _col2 (type: string) Map 7 Map Operator Tree: TableScan alias: test2 Statistics: Num rows: 4 Data size: 38 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: key (type: int), value (type: int), col_2 (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 4 Data size: 38 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col1 (type: int) sort order: + Map-reduce partition
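The plan above scans `test1` twice (Map 1 and Map 5) and `test2` twice (Map 4 and Map 7) with identical scan shapes, which is exactly what SharedWorkOptimizer targets: table scans with the same table, projection, and filter can be merged so the data is read only once. A toy sketch of that deduplication step (the tuple representation of a map work is hypothetical, not Hive's actual data structure):

```python
from collections import OrderedDict

# Hypothetical map works taken from the plan above: (alias, columns, filter).
map_works = [
    ("Map 1", ("test1", ("key", "value", "col_1"), None)),
    ("Map 4", ("test2", ("key", "value", "col_2"), None)),
    ("Map 5", ("test1", ("key", "value", "col_1"), None)),
    ("Map 7", ("test2", ("key", "value", "col_2"), None)),
]

# Merge scans whose (table, projection, filter) signature matches,
# remembering which original works were folded into each survivor.
merged = OrderedDict()
for name, signature in map_works:
    merged.setdefault(signature, []).append(name)

assert len(merged) == 2  # test1 and test2 are each read only once
assert list(merged.values()) == [["Map 1", "Map 5"], ["Map 4", "Map 7"]]
```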
[jira] [Assigned] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel reassigned HIVE-17486: --- > Enable SharedWorkOptimizer in tez on HOS > > > Key: HIVE-17486 > URL: https://issues.apache.org/jira/browse/HIVE-17486 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > > in HIVE-16602, Implement shared scans with Tez. > Given a query plan, the goal is to identify scans on input tables that can be > merged so the data is read only once. Optimization will be carried out at the > physical level.
[jira] [Comment Edited] (HIVE-17474) Different logical plan of same query(TPC-DS/70) with same settings
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156692#comment-16156692 ] liyunzhang_intel edited comment on HIVE-17474 at 9/7/17 8:42 AM: - After HIVE-15192, the store is converted to map join. the logical plan will be forever {code} TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60] TS[1]-FIL[64]-RS[5]-JOIN[6] TS[2]-FIL[65]-RS[10]-JOIN[11] TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44] TS[13]-FIL[69]-RS[18]-JOIN[19] TS[14]-FIL[70]-RS[22]-JOIN[23] {code} It is reasonable the small table store is converted to map join. so close the jira. was (Author: kellyzly): After HIVE-15192, the store is converted to map join. the execution plan will be forever {code} TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60] TS[1]-FIL[64]-RS[5]-JOIN[6] TS[2]-FIL[65]-RS[10]-JOIN[11] TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44] TS[13]-FIL[69]-RS[18]-JOIN[19] TS[14]-FIL[70]-RS[22]-JOIN[23] {code} It is reasonable the small table store is converted to map join. so close the jira. > Different logical plan of same query(TPC-DS/70) with same settings > -- > > Key: HIVE-17474 > URL: https://issues.apache.org/jira/browse/HIVE-17474 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel > > in > [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql]. > On hive version(d3b88f6), i found that the logical plan is different in > runtime with the same settings. 
[jira] [Resolved] (HIVE-17474) Different logical plan of same query(TPC-DS/70) with same settings
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel resolved HIVE-17474.
- Resolution: Not A Bug

> Different logical plan of same query(TPC-DS/70) with same settings
> --
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
> Issue Type: Bug
> Reporter: liyunzhang_intel
>
> In [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql], on Hive version d3b88f6, I found that the logical plan differs between runs with the same settings.
> Sometimes the logical plan is:
> {code}
> TS[0]-FIL[63]-SEL[2]-RS[43]-JOIN[45]-RS[46]-JOIN[48]-SEL[49]-GBY[50]-RS[51]-GBY[52]-SEL[53]-RS[54]-SEL[55]-PTF[56]-SEL[57]-RS[59]-SEL[60]-LIM[61]-FS[62]
> TS[3]-FIL[64]-SEL[5]-RS[44]-JOIN[45]
> TS[6]-FIL[65]-SEL[8]-RS[39]-JOIN[41]-RS[47]-JOIN[48]
> TS[9]-FIL[67]-SEL[11]-RS[18]-JOIN[20]-RS[21]-JOIN[23]-SEL[24]-GBY[25]-RS[26]-GBY[27]-RS[29]-SEL[30]-PTF[31]-FIL[66]-SEL[32]-GBY[38]-RS[40]-JOIN[41]
> TS[12]-FIL[68]-SEL[14]-RS[19]-JOIN[20]
> TS[15]-FIL[69]-SEL[17]-RS[22]-JOIN[23]
> {code}
> Here TS\[6\] connects with TS\[9\] on JOIN\[41\] and with TS\[0\] on JOIN\[48\].
> Sometimes it is:
> {code}
> TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60]
> TS[1]-FIL[64]-RS[5]-JOIN[6]
> TS[2]-FIL[65]-RS[10]-JOIN[11]
> TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44]
> TS[13]-FIL[69]-RS[18]-JOIN[19]
> TS[14]-FIL[70]-RS[22]-JOIN[23]
> {code}
> Here TS\[2\] connects with TS\[0\] on JOIN\[11\].
> Although TS\[2\] and TS\[6\] have different operator ids, they are the same table (store) in the query.
> The difference causes a different Spark execution plan and a different execution time.
> I'm confused about why there are different logical plans with the same settings. Does anyone know where to investigate the root cause?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
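One way to start investigating is to check whether the two runs merely renumbered the operators or produced structurally different DAG branches. A minimal Python sketch of that comparison (the plan strings are abbreviated from the description above, and `parse_branch` is a hypothetical helper, not part of Hive):

```python
# Illustrative only: compare branches of two operator-chain dumps with the
# numeric operator ids stripped, to see whether the structure itself differs.

def parse_branch(line):
    """Split 'TS[0]-FIL[63]-SEL[2]' into operator names without ids."""
    return [op.split("[")[0] for op in line.split("-")]

# Abbreviated branches from the two plans in the description.
plan_a = [
    "TS[0]-FIL[63]-SEL[2]-RS[43]-JOIN[45]",
    "TS[3]-FIL[64]-SEL[5]-RS[44]-JOIN[45]",
]
plan_b = [
    "TS[0]-FIL[63]-RS[3]-JOIN[6]",
    "TS[1]-FIL[64]-RS[5]-JOIN[6]",
]

# A mismatch here means the runs produced structurally different plans,
# not just plans with renumbered operator ids.
for a, b in zip(plan_a, plan_b):
    ops_a, ops_b = parse_branch(a), parse_branch(b)
    if ops_a != ops_b:
        print("branches differ:", ops_a, "vs", ops_b)
```

In the first branch above the id-stripped sequences differ (a SEL appears in one plan but not the other), which is the symptom the description reports.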
[jira] [Commented] (HIVE-17474) Different logical plan of same query(TPC-DS/70) with same settings
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156692#comment-16156692 ] liyunzhang_intel commented on HIVE-17474:
- After HIVE-15192, the join with the store table is converted to a map join, and the execution plan is now always:
{code}
TS[0]-FIL[63]-RS[3]-JOIN[6]-RS[8]-JOIN[11]-RS[41]-JOIN[44]-SEL[46]-GBY[47]-RS[48]-GBY[49]-RS[50]-GBY[51]-RS[52]-SEL[53]-PTF[54]-SEL[55]-RS[57]-SEL[58]-LIM[59]-FS[60]
TS[1]-FIL[64]-RS[5]-JOIN[6]
TS[2]-FIL[65]-RS[10]-JOIN[11]
TS[12]-FIL[68]-RS[16]-JOIN[19]-RS[20]-JOIN[23]-FIL[67]-SEL[25]-GBY[26]-RS[27]-GBY[28]-RS[29]-GBY[30]-RS[31]-SEL[32]-PTF[33]-FIL[66]-SEL[34]-GBY[39]-RS[43]-JOIN[44]
TS[13]-FIL[69]-RS[18]-JOIN[19]
TS[14]-FIL[70]-RS[22]-JOIN[23]
{code}
It is reasonable that the small table (store) is converted to a map join, so I am closing this JIRA.
[jira] [Updated] (HIVE-17474) Different logical plan of same query(TPC-DS/70) with same settings
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17474:
Summary: Different logical plan of same query(TPC-DS/70) with same settings (was: Different physical plan of same query(TPC-DS/70) on HOS)
[jira] [Updated] (HIVE-17474) Different logical plan of same query(TPC-DS/70) with same settings
[ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17474:
Description: changed "physical plan" to "logical plan" throughout; the rest of the description is unchanged.
[jira] [Commented] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver
[ https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156308#comment-16156308 ] liyunzhang_intel commented on HIVE-17414: - thanks for [~lirui] and [~stakiar]'s review > HoS DPP + Vectorization generates invalid explain plan due to > CombineEquivalentWorkResolver > --- > > Key: HIVE-17414 > URL: https://issues.apache.org/jira/browse/HIVE-17414 > Project: Hive > Issue Type: Sub-task > Components: Spark >Reporter: Sahil Takiar >Assignee: liyunzhang_intel > Fix For: 3.0.0 > > Attachments: HIVE-17414.1.patch, HIVE-17414.2.patch, > HIVE-17414.3.patch, HIVE-17414.4.patch, HIVE-17414.5.patch, HIVE-17414.patch > > > Similar to HIVE-16948, the following query generates an invalid explain plan > when HoS DPP is enabled + vectorization: > {code:sql} > select ds from (select distinct(ds) as ds from srcpart union all select > distinct(ds) as ds from srcpart) s where s.ds in (select max(srcpart.ds) from > srcpart union all select min(srcpart.ds) from srcpart) > {code} > Explain Plan: > {code} > STAGE DEPENDENCIES: > Stage-2 is a root stage > Stage-1 depends on stages: Stage-2 > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-2 > Spark > Edges: > Reducer 11 <- Map 10 (GROUP, 1) > Reducer 13 <- Map 12 (GROUP, 1) > A masked pattern was here > Vertices: > Map 10 > Map Operator Tree: > TableScan > alias: srcpart > Statistics: Num rows: 2000 Data size: 21248 Basic stats: > COMPLETE Column stats: NONE > Select Operator > expressions: ds (type: string) > outputColumnNames: ds > Statistics: Num rows: 2000 Data size: 21248 Basic stats: > COMPLETE Column stats: NONE > Group By Operator > aggregations: max(ds) > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 184 Basic stats: > COMPLETE Column stats: NONE > Reduce Output Operator > sort order: > Statistics: Num rows: 1 Data size: 184 Basic stats: > COMPLETE Column stats: NONE > value expressions: _col0 (type: string) > Execution 
mode: vectorized > Map 12 > Map Operator Tree: > TableScan > alias: srcpart > Statistics: Num rows: 2000 Data size: 21248 Basic stats: > COMPLETE Column stats: NONE > Select Operator > expressions: ds (type: string) > outputColumnNames: ds > Statistics: Num rows: 2000 Data size: 21248 Basic stats: > COMPLETE Column stats: NONE > Group By Operator > aggregations: min(ds) > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 184 Basic stats: > COMPLETE Column stats: NONE > Reduce Output Operator > sort order: > Statistics: Num rows: 1 Data size: 184 Basic stats: > COMPLETE Column stats: NONE > value expressions: _col0 (type: string) > Execution mode: vectorized > Reducer 11 > Execution mode: vectorized > Reduce Operator Tree: > Group By Operator > aggregations: max(VALUE._col0) > mode: mergepartial > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE > Column stats: NONE > Filter Operator > predicate: _col0 is not null (type: boolean) > Statistics: Num rows: 1 Data size: 184 Basic stats: > COMPLETE Column stats: NONE > Group By Operator > keys: _col0 (type: string) > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 2 Data size: 368 Basic stats: > COMPLETE Column stats: NONE > Select Operator > expressions: _col0 (type: string) > outputColumnNames: _col0 > Statistics: Num rows: 2 Data size: 368 Basic stats: > COMPLETE Column stats: NONE > Group By Operator > keys: _col0 (type: string) > mode: hash > outputColumnNames: _col0 >
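CombineEquivalentWorkResolver, named in the summary, merges Spark map/reduce works whose operator trees are equivalent so the same scan is not executed twice. A rough Python sketch of that idea (an illustration only, not Hive's actual Java implementation; the field names and `dpp_target` guard are assumptions, the latter reflecting the related rule in HIVE-17193 that works targeted by different DPP sinks must not be combined):

```python
# Sketch: keep one representative per equivalent "work"; two works are
# equivalent only if their operator signatures match AND they are targets
# of the same (or no) dynamic-partition-pruning sink.
# All names here are illustrative, not Hive's real classes.

def combine_equivalent_works(works):
    """works: list of dicts with an operator 'signature' and the id of
    the DPP sink targeting the work (or None)."""
    seen = {}      # (signature, dpp_target) -> representative work
    kept = []
    for w in works:
        key = (w["signature"], w["dpp_target"])
        if key in seen:
            continue   # equivalent to an already-kept work: merged away
        seen[key] = w
        kept.append(w)
    return kept

works = [
    {"name": "Map 10", "signature": "TS-SEL-GBY(max)-RS", "dpp_target": None},
    {"name": "Map 12", "signature": "TS-SEL-GBY(min)-RS", "dpp_target": None},
    {"name": "Map 14", "signature": "TS-SEL-GBY(max)-RS", "dpp_target": None},
]
kept = combine_equivalent_works(works)
print([w["name"] for w in kept])   # Map 14 merges into Map 10
```

The bug in this issue class arises when the combining step and later rewrites (vectorization, DPP) disagree about which works still exist, leaving the explain plan referencing a work that was merged away.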
[jira] [Commented] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver
[ https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16153103#comment-16153103 ] liyunzhang_intel commented on HIVE-17414:
- [~Ferd]: please commit the 5th patch, thanks!
[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver
[ https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17414:
Attachment: HIVE-17414.5.patch
[~stakiar]: thanks for your reminder. Attaching the 5th patch to trigger the QA tests.
[jira] [Commented] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver
[ https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152291#comment-16152291 ] liyunzhang_intel commented on HIVE-17414:
- [~lirui]: yes, I mean the 4th patch. [~Ferd]: as [~lirui] and [~stakiar] have finished the review, please commit the 4th patch.
[jira] [Commented] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver
[ https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152271#comment-16152271 ] liyunzhang_intel commented on HIVE-17414:
- Thanks to [~lirui] and [~stakiar] for the review. The changes in HIVE-17414.3.patch:
1. Remove Map 4, which does not exist in the explain output.
2. Other changes around
{code}
explain select count(*) from srcpart join srcpart_date on (srcpart.ds = srcpart_date.ds) join srcpart_hour on (srcpart.hr = srcpart_hour.hr) where srcpart_date.`date` = '2008-04-08' and srcpart.hr = 13;
{code}
These are caused by HIVE-16811.
[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver
[ https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17414:
Attachment: HIVE-17414.4.patch
[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver
[ https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated HIVE-17414:
    Attachment: HIVE-17414.3.patch

trigger HIVE QA
[jira] [Commented] (HIVE-17383) ArrayIndexOutOfBoundsException in VectorGroupByOperator
[ https://issues.apache.org/jira/browse/HIVE-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16150330#comment-16150330 ]

liyunzhang_intel commented on HIVE-17383:
-----------------------------------------

Why do you say "The failures can't be reproduced locally."? Actually it can be reproduced in my environment. Do you mean this is fixed in the latest master?

I don't understand the vectorization logic very well: why does {{firstOutputColumnIndex}} start from {{initialColumnNames.length}}? For example, if there is 1 column, {{firstOutputColumnIndex}} is 1 (normally indices start from 0). When we construct the output batch, the output columns start from 1 — is this right?
{code}
  // Convenient constructor for initial batch creation takes
  // a list of columns names and maps them to 0..n-1 indices.
  public VectorizationContext(String contextName, List<String> initialColumnNames, HiveConf hiveConf) {
    this.contextName = contextName;
    level = 0;
    this.initialColumnNames = initialColumnNames;
    this.projectionColumnNames = initialColumnNames;
    projectedColumns = new ArrayList<Integer>();
    projectionColumnMap = new HashMap<String, Integer>();
    for (int i = 0; i < this.projectionColumnNames.size(); i++) {
      projectedColumns.add(i);
      projectionColumnMap.put(projectionColumnNames.get(i), i);
    }
    int firstOutputColumnIndex = projectedColumns.size();
    this.ocm = new OutputColumnManager(firstOutputColumnIndex);
    this.firstOutputColumnIndex = firstOutputColumnIndex;
    vMap = new VectorExpressionDescriptor();
    if (hiveConf != null) {
      setHiveConfVars(hiveConf);
    }
  }
{code}

> ArrayIndexOutOfBoundsException in VectorGroupByOperator
> -------------------------------------------------------
>
>                 Key: HIVE-17383
>                 URL: https://issues.apache.org/jira/browse/HIVE-17383
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rui Li
>            Assignee: Rui Li
>         Attachments: HIVE-17383.1.patch
>
> Query to reproduce:
> {noformat}
> set hive.cbo.enable=false;
> select count(*) from (select key from src group by key) s where s.key='98';
> {noformat}
> The stack trace is:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>         at org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107)
>         at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:831)
>         at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeBase.processBatch(VectorGroupByOperator.java:174)
>         at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.process(VectorGroupByOperator.java:1046)
>         at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.processVectorGroup(ReduceRecordSource.java:462)
>         ... 18 more
> {noformat}
> More details can be found in HIVE-16823

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
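To make the question concrete, here is a minimal sketch (a hypothetical demo class; only the index arithmetic mirrors the quoted constructor) of the layout being asked about — input columns occupy 0..n-1, so the first scratch/output column lands at n:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; only the index arithmetic mirrors the
// VectorizationContext constructor quoted in the comment above.
public class ColumnIndexDemo {
    public static void main(String[] args) {
        List<String> initialColumnNames = Arrays.asList("key");

        // Input columns are mapped to indices 0..n-1...
        Map<String, Integer> projectionColumnMap = new HashMap<>();
        for (int i = 0; i < initialColumnNames.size(); i++) {
            projectionColumnMap.put(initialColumnNames.get(i), i);
        }

        // ...so the first scratch/output column starts right after them.
        int firstOutputColumnIndex = initialColumnNames.size();

        System.out.println(projectionColumnMap.get("key"));  // 0
        System.out.println(firstOutputColumnIndex);          // 1
    }
}
```

With a single input column, the input occupies slot 0 and any scratch column produced by an expression is allocated from slot 1 onward, so starting the output index at {{initialColumnNames.length}} avoids clobbering the inputs.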
[jira] [Commented] (HIVE-17405) HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT
[ https://issues.apache.org/jira/browse/HIVE-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16150076#comment-16150076 ]

liyunzhang_intel commented on HIVE-17405:
-----------------------------------------

[~stakiar]: thanks for the explanation. The [different file format test case|https://issues.apache.org/jira/secure/attachment/12884191/HIVE-17216.4.patch] was added to spark_dynamic_partition_pruning.q in HIVE-17216.

> HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT
> ---------------------------------------------------------------------
>
>                 Key: HIVE-17405
>                 URL: https://issues.apache.org/jira/browse/HIVE-17405
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>         Attachments: HIVE-17405.1.patch, HIVE-17405.2.patch, HIVE-17405.3.patch, HIVE-17405.4.patch, HIVE-17405.5.patch, HIVE-17405.6.patch, HIVE-17405.7.patch
>
> In {{SparkCompiler#runDynamicPartitionPruning}} we should change {{new ConstantPropagate().transform(parseContext)}} to {{new ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(parseContext)}}
> Hive-on-Tez does the same thing.
> Running the full constant propagation isn't really necessary, we just want to eliminate any {{and true}} predicates that were introduced by {{SyntheticJoinPredicate}} and {{DynamicPartitionPruningOptimization}}. The {{SyntheticJoinPredicate}} will introduce dummy filter predicates into the operator tree, and {{DynamicPartitionPruningOptimization}} will replace them. The predicates introduced via {{SyntheticJoinPredicate}} are necessary to help {{DynamicPartitionPruningOptimization}} determine if DPP can be used or not.
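The {{and true}} elimination the issue description refers to can be illustrated with a toy expression tree. All class names below are hypothetical (this is not Hive's ConstantPropagate code); only the boolean identity (p AND true) -> p mirrors what the "shortcut" mode applies:

```java
// Toy expression tree; hypothetical classes, not Hive's AST.
abstract class Expr {}
class BoolConst extends Expr {
    final boolean value;
    BoolConst(boolean value) { this.value = value; }
}
class Col extends Expr {
    final String name;
    Col(String name) { this.name = name; }
}
class And extends Expr {
    final Expr left, right;
    And(Expr left, Expr right) { this.left = left; this.right = right; }
}

public class ShortcutDemo {
    // Fold AND-with-true without evaluating anything else; no full
    // constant evaluation, just the boolean identity.
    static Expr shortcut(Expr e) {
        if (e instanceof And) {
            Expr l = shortcut(((And) e).left);
            Expr r = shortcut(((And) e).right);
            if (l instanceof BoolConst && ((BoolConst) l).value) return r;
            if (r instanceof BoolConst && ((BoolConst) r).value) return l;
            return new And(l, r);
        }
        return e;
    }

    public static void main(String[] args) {
        // A synthetic predicate (ds_filter AND true) collapses to ds_filter.
        Expr pred = new And(new Col("ds_filter"), new BoolConst(true));
        System.out.println(shortcut(pred) instanceof Col);  // true
    }
}
```

This is why the shortcut mode is cheap: it only removes the dummy {{true}} conjuncts left behind by the DPP rewrite, rather than re-evaluating every constant subexpression in the plan.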
[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver
[ https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated HIVE-17414:
    Attachment: HIVE-17414.1.patch

[~lirui]: updated the comments. There is a test case in [spark_vectorized_dynamic_partition_pruning.q|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/spark_vectorized_dynamic_partition_pruning.q#L112]. After HIVE-17405 is resolved, I will update the q.out of the case.
[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver
[ https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated HIVE-17414:
    Attachment: HIVE-17414.2.patch
[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver
[ https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated HIVE-17414:
    Attachment: (was: HIVE-17414.1.patch)
[jira] [Commented] (HIVE-17405) HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT
[ https://issues.apache.org/jira/browse/HIVE-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16150025#comment-16150025 ]

liyunzhang_intel commented on HIVE-17405:
-----------------------------------------

[~stakiar]: why do we need to remove the following query from the original file?
{code}
-- different file format
create table srcpart_orc (key int, value string) partitioned by (ds string, hr int) stored as orc;

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.vectorized.execution.enabled=false;
set hive.exec.max.dynamic.partitions=1000;

insert into table srcpart_orc partition (ds, hr) select key, value, ds, hr from srcpart;

EXPLAIN select count(*) from srcpart_orc join srcpart_date_hour on (srcpart_orc.ds = srcpart_date_hour.ds and srcpart_orc.hr = srcpart_date_hour.hr) where srcpart_date_hour.hour = 11 and (srcpart_date_hour.`date` = '2008-04-08' or srcpart_date_hour.`date` = '2008-04-09');

select count(*) from srcpart_orc join srcpart_date_hour on (srcpart_orc.ds = srcpart_date_hour.ds and srcpart_orc.hr = srcpart_date_hour.hr) where srcpart_date_hour.hour = 11 and (srcpart_date_hour.`date` = '2008-04-08' or srcpart_date_hour.`date` = '2008-04-09');

select count(*) from srcpart where (ds = '2008-04-08' or ds = '2008-04-09') and hr = 11;

drop table srcpart_orc;
{code}
[jira] [Commented] (HIVE-17405) HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT
[ https://issues.apache.org/jira/browse/HIVE-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149920#comment-16149920 ]

liyunzhang_intel commented on HIVE-17405:
-----------------------------------------

[~lirui]: in TezCompiler, constant propagation runs at the end of optimizeOperatorPlan. I think {{new ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(parseContext)}} is not only for DPP; it should benefit the whole plan.
[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver
[ https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated HIVE-17414:
    Attachment: HIVE-17414.1.patch

fix according to last round of review
[jira] [Commented] (HIVE-17412) Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q
[ https://issues.apache.org/jira/browse/HIVE-17412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148613#comment-16148613 ]

liyunzhang_intel commented on HIVE-17412:
-----------------------------------------

[~Ferd]: I think even if I trigger Hive-QA now, spark_vectorized_dynamic_partition_pruning.q will still fail. After HIVE-17405 (blocked by HIVE-17383) is resolved, spark_vectorized_dynamic_partition_pruning will pass.

> Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-17412
>                 URL: https://issues.apache.org/jira/browse/HIVE-17412
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-17412.patch
>
> for query
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> select distinct ds from srcpart;
> {code}
> the result is
> {code}
> 2008-04-09
> 2008-04-08
> {code}
> the result of group-by in Spark is not in order. Sometimes it returns
> {code}
> 2008-04-08
> 2008-04-09
> {code}
> Sometimes it returns
> {code}
> 2008-04-09
> 2008-04-08
> {code}
[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver
[ https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated HIVE-17414:
    Status: Patch Available  (was: Open)
[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver
[ https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17414: Attachment: HIVE-17414.patch [~stakiar],[~lirui]: please help review. Previously we restricted clazz to exactly "SparkPartitionPruningSinkOperator" when calling SparkUtilities#collectOp(Collection result, Operator root, Class clazz), so when VectorSparkPartitionPruningSinkOperator is used, the fix from HIVE-16948 does not take effect. The changes in the patch: {code} if (root == null) { return; } -if (clazz.equals(root.getClass())) { +if (clazz.equals(root.getClass()) || clazz.isAssignableFrom(root.getClass())) { result.add(root); } {code} > HoS DPP + Vectorization generates invalid explain plan due to > CombineEquivalentWorkResolver > --- > > Key: HIVE-17414 > URL: https://issues.apache.org/jira/browse/HIVE-17414 > Project: Hive > Issue Type: Sub-task > Components: Spark >Reporter: Sahil Takiar >Assignee: liyunzhang_intel > Attachments: HIVE-17414.patch > > > Similar to HIVE-16948, the following query generates an invalid explain plan > when HoS DPP is enabled + vectorization: > {code:sql} > select ds from (select distinct(ds) as ds from srcpart union all select > distinct(ds) as ds from srcpart) s where s.ds in (select max(srcpart.ds) from > srcpart union all select min(srcpart.ds) from srcpart) > {code} > Explain Plan: > {code} > STAGE DEPENDENCIES: > Stage-2 is a root stage > Stage-1 depends on stages: Stage-2 > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-2 > Spark > Edges: > Reducer 11 <- Map 10 (GROUP, 1) > Reducer 13 <- Map 12 (GROUP, 1) > A masked pattern was here > Vertices: > Map 10 > Map Operator Tree: > TableScan > alias: srcpart > Statistics: Num rows: 2000 Data size: 21248 Basic stats: > COMPLETE Column stats: NONE > Select Operator > expressions: ds (type: string) > outputColumnNames: ds > Statistics: Num rows: 2000 Data size: 21248 Basic stats: > COMPLETE Column stats: NONE > Group By Operator > 
aggregations: max(ds) > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 184 Basic stats: > COMPLETE Column stats: NONE > Reduce Output Operator > sort order: > Statistics: Num rows: 1 Data size: 184 Basic stats: > COMPLETE Column stats: NONE > value expressions: _col0 (type: string) > Execution mode: vectorized > Map 12 > Map Operator Tree: > TableScan > alias: srcpart > Statistics: Num rows: 2000 Data size: 21248 Basic stats: > COMPLETE Column stats: NONE > Select Operator > expressions: ds (type: string) > outputColumnNames: ds > Statistics: Num rows: 2000 Data size: 21248 Basic stats: > COMPLETE Column stats: NONE > Group By Operator > aggregations: min(ds) > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 184 Basic stats: > COMPLETE Column stats: NONE > Reduce Output Operator > sort order: > Statistics: Num rows: 1 Data size: 184 Basic stats: > COMPLETE Column stats: NONE > value expressions: _col0 (type: string) > Execution mode: vectorized > Reducer 11 > Execution mode: vectorized > Reduce Operator Tree: > Group By Operator > aggregations: max(VALUE._col0) > mode: mergepartial > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE > Column stats: NONE > Filter Operator > predicate: _col0 is not null (type: boolean) > Statistics: Num rows: 1 Data size: 184 Basic stats: > COMPLETE Column stats: NONE > Group By Operator > keys: _col0 (type: string) > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 2 Data size: 368 Basic stats: > COMPLETE Column stats: NONE > Select Operator > expressions: _col0
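The collectOp change in the HIVE-17414 patch above can be read as: accept any operator whose class is clazz or a subclass of it, which is exactly what Class#isAssignableFrom checks (it already covers equality, so the remaining equals() test is redundant but harmless). A minimal self-contained sketch of the traversal, using hypothetical simplified operator types rather than Hive's actual Operator API:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Simplified stand-in for Hive's operator tree node (hypothetical).
class Op {
    final List<Op> children = new ArrayList<>();
}
class PruningSinkOp extends Op {}
// Vectorized variant, analogous to VectorSparkPartitionPruningSinkOperator.
class VectorPruningSinkOp extends PruningSinkOp {}

public class CollectOpSketch {
    // Collect every operator in the subtree that is an instance of clazz,
    // including subclasses -- this is what isAssignableFrom buys over equals().
    static void collectOp(Collection<Op> result, Op root, Class<? extends Op> clazz) {
        if (root == null) {
            return;
        }
        if (clazz.isAssignableFrom(root.getClass())) {
            result.add(root);
        }
        for (Op child : root.children) {
            collectOp(result, child, clazz);
        }
    }

    public static void main(String[] args) {
        Op root = new Op();
        root.children.add(new VectorPruningSinkOp());
        List<Op> found = new ArrayList<>();
        // With an exact-class equals() check, the vectorized subclass
        // would be missed; with isAssignableFrom it is collected.
        collectOp(found, root, PruningSinkOp.class);
        System.out.println(found.size()); // prints 1
    }
}
```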
[jira] [Commented] (HIVE-17412) Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q
[ https://issues.apache.org/jira/browse/HIVE-17412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148336#comment-16148336 ] liyunzhang_intel commented on HIVE-17412: - [~Ferd]: As Xuefu and Sahil have finished the review, can you help commit the patch? Thanks. The reason I triggered Hive QA again is that HIVE-17405 will update the other changes in spark_vectorized_dynamic_partition_pruning.q.out. > Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q > > > Key: HIVE-17412 > URL: https://issues.apache.org/jira/browse/HIVE-17412 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: HIVE-17412.patch > > > for query > {code} > set hive.optimize.ppd=true; > set hive.ppd.remove.duplicatefilters=true; > set hive.spark.dynamic.partition.pruning=true; > set hive.optimize.metadataonly=false; > set hive.optimize.index.filter=true; > set hive.vectorized.execution.enabled=true; > set hive.strict.checks.cartesian.product=false; > select distinct ds from srcpart; > {code} > the result is > {code} > 2008-04-09 > 2008-04-08 > {code} > the result of groupby in spark is not in order. Sometimes it returns > {code} > 2008-04-08 > 2008-04-09 > {code} > Sometimes it returns > {code} > 2008-04-09 > 2008-04-08 > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
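For context, the "-- SORT_QUERY_RESULTS" directive makes Hive's qtest framework sort the query output lines before diffing them against the golden file, so either group-by ordering passes. A hedged sketch of that idea (this is an illustration, not the actual QTestUtil code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SortedCompare {
    // Sort result lines on both sides before comparing, so a
    // nondeterministic output order (2008-04-08 first vs. 2008-04-09
    // first) no longer causes a spurious test failure.
    static boolean sameResults(List<String> actual, List<String> expected) {
        List<String> a = actual.stream().sorted().collect(Collectors.toList());
        List<String> e = expected.stream().sorted().collect(Collectors.toList());
        return a.equals(e);
    }

    public static void main(String[] args) {
        List<String> run = Arrays.asList("2008-04-09", "2008-04-08");
        List<String> golden = Arrays.asList("2008-04-08", "2008-04-09");
        System.out.println(sameResults(run, golden)); // prints true
    }
}
```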
[jira] [Assigned] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver
[ https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel reassigned HIVE-17414: --- Assignee: liyunzhang_intel > HoS DPP + Vectorization generates invalid explain plan due to > CombineEquivalentWorkResolver > --- > > Key: HIVE-17414 > URL: https://issues.apache.org/jira/browse/HIVE-17414 > Project: Hive > Issue Type: Sub-task > Components: Spark >Reporter: Sahil Takiar >Assignee: liyunzhang_intel > > Similar to HIVE-16948, the following query generates an invalid explain plan > when HoS DPP is enabled + vectorization: > {code:sql} > select ds from (select distinct(ds) as ds from srcpart union all select > distinct(ds) as ds from srcpart) s where s.ds in (select max(srcpart.ds) from > srcpart union all select min(srcpart.ds) from srcpart) > {code} > Explain Plan: > {code} > STAGE DEPENDENCIES: > Stage-2 is a root stage > Stage-1 depends on stages: Stage-2 > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-2 > Spark > Edges: > Reducer 11 <- Map 10 (GROUP, 1) > Reducer 13 <- Map 12 (GROUP, 1) > A masked pattern was here > Vertices: > Map 10 > Map Operator Tree: > TableScan > alias: srcpart > Statistics: Num rows: 2000 Data size: 21248 Basic stats: > COMPLETE Column stats: NONE > Select Operator > expressions: ds (type: string) > outputColumnNames: ds > Statistics: Num rows: 2000 Data size: 21248 Basic stats: > COMPLETE Column stats: NONE > Group By Operator > aggregations: max(ds) > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 184 Basic stats: > COMPLETE Column stats: NONE > Reduce Output Operator > sort order: > Statistics: Num rows: 1 Data size: 184 Basic stats: > COMPLETE Column stats: NONE > value expressions: _col0 (type: string) > Execution mode: vectorized > Map 12 > Map Operator Tree: > TableScan > alias: srcpart > Statistics: Num rows: 2000 Data size: 21248 Basic stats: > COMPLETE Column stats: NONE > Select Operator > expressions: ds 
(type: string) > outputColumnNames: ds > Statistics: Num rows: 2000 Data size: 21248 Basic stats: > COMPLETE Column stats: NONE > Group By Operator > aggregations: min(ds) > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 184 Basic stats: > COMPLETE Column stats: NONE > Reduce Output Operator > sort order: > Statistics: Num rows: 1 Data size: 184 Basic stats: > COMPLETE Column stats: NONE > value expressions: _col0 (type: string) > Execution mode: vectorized > Reducer 11 > Execution mode: vectorized > Reduce Operator Tree: > Group By Operator > aggregations: max(VALUE._col0) > mode: mergepartial > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE > Column stats: NONE > Filter Operator > predicate: _col0 is not null (type: boolean) > Statistics: Num rows: 1 Data size: 184 Basic stats: > COMPLETE Column stats: NONE > Group By Operator > keys: _col0 (type: string) > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 2 Data size: 368 Basic stats: > COMPLETE Column stats: NONE > Select Operator > expressions: _col0 (type: string) > outputColumnNames: _col0 > Statistics: Num rows: 2 Data size: 368 Basic stats: > COMPLETE Column stats: NONE > Group By Operator > keys: _col0 (type: string) > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 2 Data size: 368 Basic stats: > COMPLETE Column stats: NONE > Spark Partition Pruning Sink Operator > Target column: ds
[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q
[ https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148125#comment-16148125 ] liyunzhang_intel commented on HIVE-16823: - Let's fix spark_vectorized_dynamic_partition_pruning.q in HIVE-17405 after HIVE-17383 is resolved, although spark_vectorized_dynamic_partition_pruning.q is not the original target of HIVE-17405. > "ArrayIndexOutOfBoundsException" in > spark_vectorized_dynamic_partition_pruning.q > > > Key: HIVE-16823 > URL: https://issues.apache.org/jira/browse/HIVE-16823 > Project: Hive > Issue Type: Bug >Reporter: Jianguo Tian >Assignee: liyunzhang_intel > Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, > HIVE-16823.patch > > > spark_vectorized_dynamic_partition_pruning.q > {code} > set hive.optimize.ppd=true; > set hive.ppd.remove.duplicatefilters=true; > set hive.spark.dynamic.partition.pruning=true; > set hive.optimize.metadataonly=false; > set hive.optimize.index.filter=true; > set hive.vectorized.execution.enabled=true; > set hive.strict.checks.cartesian.product=false; > -- parent is reduce tasks > select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart > group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08'; > {code} > The exceptions are as follows: > {code} > 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] > spark.SparkReduceRecordHandler: Fatal error: > org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing > vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES > ["2008-04-08", "2008-04-08"] > org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing > vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES > ["2008-04-08", "2008-04-08"] > at > org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > 
org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) > ~[scala-library-2.11.8.jar:?] > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > ~[scala-library-2.11.8.jar:?] > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > ~[scala-library-2.11.8.jar:?] > at > org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at org.apache.spark.scheduler.Task.run(Task.scala:85) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [?:1.8.0_112] > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [?:1.8.0_112] > at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112] > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:832) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeBase.processBatch(VectorGroupByOperator.java:179) >
[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q
[ https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148105#comment-16148105 ] liyunzhang_intel commented on HIVE-16823: - [~stakiar]: {quote} Maybe a follow up JIRA would be to see what happens when we run {{ConstantPropagate()}} at the end of SparkCompiler#optimizeOperatorPlan? Theoretically, it should improve performance? But sounds like there are some bugs we need to address before getting to that stage. {quote} Are there any unit test failures if we put the following code at the end of SparkCompiler#optimizeOperatorPlan? {code} if (procCtx.conf.getBoolVar(ConfVars.HIVEOPTCONSTANTPROPAGATION)) { new ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(procCtx.parseContext); } {code} I think it is better to put it at the end of SparkCompiler#optimizeOperatorPlan than in runDynamicPartitionPruning. This is not related to DPP; I just found the bug in a DPP unit test. Besides, why should it improve performance? If you know, please tell me, thanks! 
> "ArrayIndexOutOfBoundsException" in > spark_vectorized_dynamic_partition_pruning.q > > > Key: HIVE-16823 > URL: https://issues.apache.org/jira/browse/HIVE-16823 > Project: Hive > Issue Type: Bug >Reporter: Jianguo Tian >Assignee: liyunzhang_intel > Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, > HIVE-16823.patch > > > spark_vectorized_dynamic_partition_pruning.q > {code} > set hive.optimize.ppd=true; > set hive.ppd.remove.duplicatefilters=true; > set hive.spark.dynamic.partition.pruning=true; > set hive.optimize.metadataonly=false; > set hive.optimize.index.filter=true; > set hive.vectorized.execution.enabled=true; > set hive.strict.checks.cartesian.product=false; > -- parent is reduce tasks > select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart > group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08'; > {code} > The exceptions are as follows: > {code} > 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] > spark.SparkReduceRecordHandler: Fatal error: > org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing > vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES > ["2008-04-08", "2008-04-08"] > org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing > vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES > ["2008-04-08", "2008-04-08"] > at > org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > 
org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) > ~[scala-library-2.11.8.jar:?] > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > ~[scala-library-2.11.8.jar:?] > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > ~[scala-library-2.11.8.jar:?] > at > org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at org.apache.spark.scheduler.Task.run(Task.scala:85) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [?:1.8.0_112] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) >
[jira] [Updated] (HIVE-17412) Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q
[ https://issues.apache.org/jira/browse/HIVE-17412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17412: Attachment: HIVE-17412.patch [~stakiar], [~lirui]: Please help review, thanks! > Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q > > > Key: HIVE-17412 > URL: https://issues.apache.org/jira/browse/HIVE-17412 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: HIVE-17412.patch > > > for query > {code} > set hive.optimize.ppd=true; > set hive.ppd.remove.duplicatefilters=true; > set hive.spark.dynamic.partition.pruning=true; > set hive.optimize.metadataonly=false; > set hive.optimize.index.filter=true; > set hive.vectorized.execution.enabled=true; > set hive.strict.checks.cartesian.product=false; > select distinct ds from srcpart; > {code} > the result is > {code} > 2008-04-09 > 2008-04-08 > {code} > the result of groupby in spark is not in order. Sometimes it returns > {code} > 2008-04-08 > 2008-04-09 > {code} > Sometimes it returns > {code} > 2008-04-09 > 2008-04-08 > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (HIVE-17412) Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q
[ https://issues.apache.org/jira/browse/HIVE-17412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel reassigned HIVE-17412: --- > Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q > > > Key: HIVE-17412 > URL: https://issues.apache.org/jira/browse/HIVE-17412 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > > for query > {code} > set hive.optimize.ppd=true; > set hive.ppd.remove.duplicatefilters=true; > set hive.spark.dynamic.partition.pruning=true; > set hive.optimize.metadataonly=false; > set hive.optimize.index.filter=true; > set hive.vectorized.execution.enabled=true; > set hive.strict.checks.cartesian.product=false; > select distinct ds from srcpart; > {code} > the result is > {code} > 2008-04-09 > 2008-04-08 > {code} > the result of groupby in spark is not in order. Sometimes it returns > {code} > 2008-04-08 > 2008-04-09 > {code} > Sometimes it returns > {code} > 2008-04-09 > 2008-04-08 > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVE-17407) TPC-DS/query65 hangs on HoS in certain settings
[ https://issues.apache.org/jira/browse/HIVE-17407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17407: Description: [TPC-DS/query65.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query65.sql] hangs when using following settings on 3TB scale. {code} set hive.auto.convert.join.noconditionaltask.size=300; {code} the explain is attached in [explain65|https://issues.apache.org/jira/secure/attachment/12884210/explain.65]. The [screenshot|https://issues.apache.org/jira/secure/attachment/12884209/hang.PNG] shows that it hanged in the Stage5. Let's explain why hang. {code} Reducer 10 <- Map 9 (GROUP, 1009) Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 1), Map 5 (PARTITION-LEVEL SORT, 1), Reducer 7 (PARTITION-LEVEL SORT, 1) Reducer 3 <- Reducer 10 (PARTITION-LEVEL SORT, 1009), Reducer 2 (PARTITION-LEVEL SORT, 1009) Reducer 4 <- Reducer 3 (SORT, 1) Reducer 7 <- Map 6 (GROUP PARTITION-LEVEL SORT, 1009) {code} The numPartitions of SparkEdgeProperty which connects Reducer 2 and Reducer 3 is 1. This is because org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils#createReduceWork {code} public ReduceWork createReduceWork(GenSparkProcContext context, Operator root, SparkWork sparkWork) throws SemanticException { for (Operator parentOfRoot : root.getParentOperators()) { Preconditions.checkArgument(parentOfRoot instanceof ReduceSinkOperator, "AssertionError: expected parentOfRoot to be an " + "instance of ReduceSinkOperator, but was " + parentOfRoot.getClass().getName()); ReduceSinkOperator reduceSink = (ReduceSinkOperator) parentOfRoot; maxExecutors = Math.max(maxExecutors, reduceSink.getConf().getNumReducers()); } reduceWork.setNumReduceTasks(maxExecutors); {code} here the numReducers of all parentOfRoot is 1( in the explain, the parallelism of Map 1, Map 5, Reducer 7 is 1), so the numPartitions of SparkEdgeProperty which connects Reducer 2 and Reducer 3 is 1. 
More explain why the parallelism of Map 1, Map 5,Reducer 7 are 1. The physical plan of the query is {code} TS[0]-FIL[50]-RS[2]-JOIN[5]-FIL[49]-SEL[7]-GBY[8]-RS[9]-GBY[10]-SEL[11]-GBY[15]-SEL[16]-RS[33]-JOIN[34]-RS[36]-JOIN[39]-FIL[48]-SEL[41]-RS[42]-SEL[43]-LIM[44]-FS[45] TS[1]-FIL[51]-RS[4]-JOIN[5] TS[17]-FIL[53]-RS[19]-JOIN[22]-FIL[52]-SEL[24]-GBY[25]-RS[26]-GBY[27]-RS[38]-JOIN[39] TS[18]-FIL[54]-RS[21]-JOIN[22] TS[29]-FIL[55]-RS[31]-JOIN[34] TS[30]-FIL[56]-RS[32]-JOIN[34] {code} The related RS of Map1, Map5, Reducer 7 is RS\[31\], RS\[32\], RS\[33\]. The parallelism is set by [SemanticAnalyzer#genJoinReduceSinkChild|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L8267] It seems that there is no logical error in the code. But it is not reasonable to use 1 task to execute to deal with so big data(more than 30GB). Is there any way to pass the query in this situation( the reason why i set hive.auto.convert.join.noconditionaltask.size as 300, if the join is converted to the map join, it will throw disk error). was: [TPC-DS/query65.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query65.sql] hangs when using following settings on 3TB scale. {code} set hive.auto.convert.join.noconditionaltask.size=300; {code} the explain is attached in [explain65|https://issues.apache.org/jira/secure/attachment/12884210/explain.65]. The [screenshot| shows that it hanged in the Stage5. Let's explain why hang. {code} Reducer 10 <- Map 9 (GROUP, 1009) Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 1), Map 5 (PARTITION-LEVEL SORT, 1), Reducer 7 (PARTITION-LEVEL SORT, 1) Reducer 3 <- Reducer 10 (PARTITION-LEVEL SORT, 1009), Reducer 2 (PARTITION-LEVEL SORT, 1009) Reducer 4 <- Reducer 3 (SORT, 1) Reducer 7 <- Map 6 (GROUP PARTITION-LEVEL SORT, 1009) {code} The numPartitions of SparkEdgeProperty which connects Reducer 2 and Reducer 3 is 1. 
This is because org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils#createReduceWork {code} public ReduceWork createReduceWork(GenSparkProcContext context, Operator root, SparkWork sparkWork) throws SemanticException { for (Operator parentOfRoot : root.getParentOperators()) { Preconditions.checkArgument(parentOfRoot instanceof ReduceSinkOperator, "AssertionError: expected parentOfRoot to be an " + "instance of ReduceSinkOperator, but was " + parentOfRoot.getClass().getName()); ReduceSinkOperator reduceSink = (ReduceSinkOperator) parentOfRoot; maxExecutors = Math.max(maxExecutors, reduceSink.getConf().getNumReducers()); } reduceWork.setNumReduceTasks(maxExecutors); {code} here the numReducers of all parentOfRoot is 1( in the explain, the parallelism of Map 1,
[jira] [Updated] (HIVE-17407) TPC-DS/query65 hangs on HoS in certain settings
[ https://issues.apache.org/jira/browse/HIVE-17407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17407: Attachment: explain.65 hang.PNG > TPC-DS/query65 hangs on HoS in certain settings > --- > > Key: HIVE-17407 > URL: https://issues.apache.org/jira/browse/HIVE-17407 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel > Attachments: explain.65, hang.PNG > > > [TPC-DS/query65.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query65.sql] > hangs when using following settings on 3TB scale. > {code} > set hive.auto.convert.join.noconditionaltask.size=300; > {code} > the explain is attached in explain65. The screenshot shows that it hanged > in the Stage5. > Let's explain why hang. > {code} >Reducer 10 <- Map 9 (GROUP, 1009) > Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 1), Map 5 (PARTITION-LEVEL > SORT, 1), Reducer 7 (PARTITION-LEVEL SORT, 1) > Reducer 3 <- Reducer 10 (PARTITION-LEVEL SORT, 1009), Reducer 2 > (PARTITION-LEVEL SORT, 1009) > Reducer 4 <- Reducer 3 (SORT, 1) > Reducer 7 <- Map 6 (GROUP PARTITION-LEVEL SORT, 1009) > {code} > The numPartitions of SparkEdgeProperty which connects Reducer 2 and Reducer 3 > is 1. 
This is because > org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils#createReduceWork > {code} > public ReduceWork createReduceWork(GenSparkProcContext context, Operator > root, > SparkWork sparkWork) throws SemanticException { > > for (Operator parentOfRoot : > root.getParentOperators()) { > Preconditions.checkArgument(parentOfRoot instanceof ReduceSinkOperator, > "AssertionError: expected parentOfRoot to be an " > + "instance of ReduceSinkOperator, but was " > + parentOfRoot.getClass().getName()); > ReduceSinkOperator reduceSink = (ReduceSinkOperator) parentOfRoot; > maxExecutors = Math.max(maxExecutors, > reduceSink.getConf().getNumReducers()); > } > reduceWork.setNumReduceTasks(maxExecutors); > {code} > here the numReducers of all parentOfRoot is 1( in the explain, the > parallelism of Map 1, Map 5, Reducer 7 is 1), so the numPartitions of > SparkEdgeProperty which connects Reducer 2 and Reducer 3 is 1. > More explain why the parallelism of Map 1, Map 5,Reducer 7 are 1. The > physical plan of the query is > {code} > TS[0]-FIL[50]-RS[2]-JOIN[5]-FIL[49]-SEL[7]-GBY[8]-RS[9]-GBY[10]-SEL[11]-GBY[15]-SEL[16]-RS[33]-JOIN[34]-RS[36]-JOIN[39]-FIL[48]-SEL[41]-RS[42]-SEL[43]-LIM[44]-FS[45] > TS[1]-FIL[51]-RS[4]-JOIN[5] > TS[17]-FIL[53]-RS[19]-JOIN[22]-FIL[52]-SEL[24]-GBY[25]-RS[26]-GBY[27]-RS[38]-JOIN[39] > TS[18]-FIL[54]-RS[21]-JOIN[22] > TS[29]-FIL[55]-RS[31]-JOIN[34] > TS[30]-FIL[56]-RS[32]-JOIN[34] > {code} > The related RS of Map1, Map5, Reducer 7 is RS\[31\], RS\[32\], RS\[33\]. The > parallelism is set by > [SemanticAnalyzer#genJoinReduceSinkChild|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L8267] > It seems that there is no logical error in the code. But it is not reasonable > to use 1 task to execute to deal with so big data(more than 30GB). 
Is there > any way to make the query pass in this situation? (The reason why I set > hive.auto.convert.join.noconditionaltask.size to 300 is that if the join is > converted to a map join, it throws a disk error.) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
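The parallelism propagation described in HIVE-17407 above can be illustrated with a toy version of the createReduceWork logic (hypothetical simplified types, not Hive's GenSparkUtils API): the child reduce work takes the maximum numReducers over its parent ReduceSinks, so when all three parents (Map 1, Map 5, Reducer 7) have parallelism 1, the edge into Reducer 2 also gets only 1 partition.

```java
import java.util.Arrays;
import java.util.List;

public class ReduceParallelismSketch {
    // Toy model of GenSparkUtils#createReduceWork's parallelism rule:
    // each parent ReduceSink carries its configured numReducers, and the
    // child reduce work inherits the maximum across parents.
    static int childNumReduceTasks(List<Integer> parentNumReducers) {
        int maxExecutors = 0;
        for (int n : parentNumReducers) {
            maxExecutors = Math.max(maxExecutors, n);
        }
        return maxExecutors;
    }

    public static void main(String[] args) {
        // Map 1, Map 5 and Reducer 7 all have parallelism 1 in the explain,
        // so the edge into Reducer 2 gets numPartitions = 1.
        System.out.println(childNumReduceTasks(Arrays.asList(1, 1, 1))); // prints 1
    }
}
```

This also shows why one high-parallelism parent is enough to raise the child's parallelism: max(1, 1009) is 1009, which matches the 1009-way edges elsewhere in the explain.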
[jira] [Updated] (HIVE-17407) TPC-DS/query65 hangs on HoS in certain settings
[ https://issues.apache.org/jira/browse/HIVE-17407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-17407: Description: [TPC-DS/query65.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query65.sql] hangs when using following settings on 3TB scale. {code} set hive.auto.convert.join.noconditionaltask.size=300; {code} the explain is attached in [explain65|https://issues.apache.org/jira/secure/attachment/12884210/explain.65]. The [screenshot| shows that it hanged in the Stage5. Let's explain why hang. {code} Reducer 10 <- Map 9 (GROUP, 1009) Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 1), Map 5 (PARTITION-LEVEL SORT, 1), Reducer 7 (PARTITION-LEVEL SORT, 1) Reducer 3 <- Reducer 10 (PARTITION-LEVEL SORT, 1009), Reducer 2 (PARTITION-LEVEL SORT, 1009) Reducer 4 <- Reducer 3 (SORT, 1) Reducer 7 <- Map 6 (GROUP PARTITION-LEVEL SORT, 1009) {code} The numPartitions of SparkEdgeProperty which connects Reducer 2 and Reducer 3 is 1. This is because org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils#createReduceWork {code} public ReduceWork createReduceWork(GenSparkProcContext context, Operator root, SparkWork sparkWork) throws SemanticException { for (Operator parentOfRoot : root.getParentOperators()) { Preconditions.checkArgument(parentOfRoot instanceof ReduceSinkOperator, "AssertionError: expected parentOfRoot to be an " + "instance of ReduceSinkOperator, but was " + parentOfRoot.getClass().getName()); ReduceSinkOperator reduceSink = (ReduceSinkOperator) parentOfRoot; maxExecutors = Math.max(maxExecutors, reduceSink.getConf().getNumReducers()); } reduceWork.setNumReduceTasks(maxExecutors); {code} here the numReducers of all parentOfRoot is 1( in the explain, the parallelism of Map 1, Map 5, Reducer 7 is 1), so the numPartitions of SparkEdgeProperty which connects Reducer 2 and Reducer 3 is 1. More explain why the parallelism of Map 1, Map 5,Reducer 7 are 1. 
The physical plan of the query is
{code}
TS[0]-FIL[50]-RS[2]-JOIN[5]-FIL[49]-SEL[7]-GBY[8]-RS[9]-GBY[10]-SEL[11]-GBY[15]-SEL[16]-RS[33]-JOIN[34]-RS[36]-JOIN[39]-FIL[48]-SEL[41]-RS[42]-SEL[43]-LIM[44]-FS[45]
TS[1]-FIL[51]-RS[4]-JOIN[5]
TS[17]-FIL[53]-RS[19]-JOIN[22]-FIL[52]-SEL[24]-GBY[25]-RS[26]-GBY[27]-RS[38]-JOIN[39]
TS[18]-FIL[54]-RS[21]-JOIN[22]
TS[29]-FIL[55]-RS[31]-JOIN[34]
TS[30]-FIL[56]-RS[32]-JOIN[34]
{code}
The related RSs of Map 1, Map 5 and Reducer 7 are RS\[31\], RS\[32\] and RS\[33\]. The parallelism is set by [SemanticAnalyzer#genJoinReduceSinkChild|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L8267]. There seems to be no logical error in the code, but it is not reasonable to use a single task to process so much data (more than 30GB). Is there any way to make the query pass in this situation? (The reason I set hive.auto.convert.join.noconditionaltask.size to 300 is that if the join is converted to a map join, it throws a disk error.)
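The max-over-parents rule in createReduceWork above can be sketched as a standalone snippet (a minimal sketch, not Hive's actual API: {{deriveNumReduceTasks}} is a hypothetical helper that mirrors only the loop over the parent ReduceSinkOperators):

```java
import java.util.Arrays;
import java.util.List;

public class ReduceParallelismSketch {
    // Mirrors the loop in createReduceWork: the new ReduceWork gets the
    // maximum numReducers found among its parent ReduceSinkOperators.
    static int deriveNumReduceTasks(List<Integer> parentNumReducers) {
        int maxExecutors = Integer.MIN_VALUE;
        for (int numReducers : parentNumReducers) {
            maxExecutors = Math.max(maxExecutors, numReducers);
        }
        return maxExecutors;
    }

    public static void main(String[] args) {
        // Map 1, Map 5 and Reducer 7 were all planned with numReducers = 1,
        // so the edge into Reducer 3 also gets numPartitions = 1.
        System.out.println(deriveNumReduceTasks(Arrays.asList(1, 1, 1)));    // 1
        // A single high-parallelism parent would be enough to raise it.
        System.out.println(deriveNumReduceTasks(Arrays.asList(1, 1, 1009))); // 1009
    }
}
```

This shows why fixing any one parent RS's parallelism would be enough to lift the downstream edge out of the single-task bottleneck.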
[jira] [Commented] (HIVE-17383) ArrayIndexOutOfBoundsException in VectorGroupByOperator
[ https://issues.apache.org/jira/browse/HIVE-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143467#comment-16143467 ] liyunzhang_intel commented on HIVE-17383: - [~lirui]: after enable vectorization, it throws ArrayIndexOutOfBoundsException. query {code} set hive.cbo.enable=false; set hive.user.install.directory=file:///tmp; set fs.default.name=file:///; set fs.defaultFS=file:///; set tez.staging-dir=/tmp; set tez.ignore.lib.uris=true; set tez.runtime.optimize.local.fetch=true; set tez.local.mode=true; set hive.explain.user=false; set hive.vectorized.execution.enabled=true; select count(*) from (select key from src group by key) s where s.key='98'; {code} the explain {code} STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez DagId: root_20170828025707_7b882df3-3e96-47f0-b189-9b6919d44512:1 Edges: Reducer 2 <- Map 1 (SIMPLE_EDGE) Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE) DagName: root_20170828025707_7b882df3-3e96-47f0-b189-9b6919d44512:1 Vertices: Map 1 Map Operator Tree: TableScan alias: src Statistics: Num rows: 2906 Data size: 5812 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (key = '98') (type: boolean) Statistics: Num rows: 1453 Data size: 2906 Basic stats: COMPLETE Column stats: NONE Select Operator Statistics: Num rows: 1453 Data size: 2906 Basic stats: COMPLETE Column stats: NONE Group By Operator keys: '98' (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 1453 Data size: 2906 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: '98' (type: string) sort order: + Map-reduce partition columns: '98' (type: string) Statistics: Num rows: 1453 Data size: 2906 Basic stats: COMPLETE Column stats: NONE Execution mode: vectorized Reducer 2 Execution mode: vectorized Reduce Operator Tree: Group By Operator keys: '98' (type: string) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 726 Data 
size: 1452 Basic stats: COMPLETE Column stats: NONE Select Operator Statistics: Num rows: 726 Data size: 1452 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: count() mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: bigint) Reducer 3 Execution mode: vectorized Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink {code} > ArrayIndexOutOfBoundsException in VectorGroupByOperator > --- > > Key: HIVE-17383 > URL: https://issues.apache.org/jira/browse/HIVE-17383 > Project: Hive > Issue Type: Bug >Reporter: Rui Li > > Query to reproduce: > {noformat} > set hive.cbo.enable=false; > select count(*) from (select key from src group by key) s where s.key='98'; > {noformat} > The stack trace is: > {noformat} > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107) > at >
[jira] [Commented] (HIVE-17383) ArrayIndexOutOfBoundsException in VectorGroupByOperator
[ https://issues.apache.org/jira/browse/HIVE-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143439#comment-16143439 ] liyunzhang_intel commented on HIVE-17383: - [~lirui]: this passes in latest master(6be50b7) in my tez env. If there is some wrong with the configuration, tell me! query {code} set hive.cbo.enable=false; set hive.user.install.directory=file:///tmp; set fs.default.name=file:///; set fs.defaultFS=file:///; set tez.staging-dir=/tmp; set tez.ignore.lib.uris=true; set tez.runtime.optimize.local.fetch=true; set tez.local.mode=true; set hive.explain.user=false; explain select count(*) from (select key from src group by key) s where s.key='98'; {code} explain {code} STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez DagId: root_20170828023743_be3df7bf-49cc-4c71-a4a7-25814558804c:1 Edges: Reducer 2 <- Map 1 (SIMPLE_EDGE) Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE) DagName: root_20170828023743_be3df7bf-49cc-4c71-a4a7-25814558804c:1 Vertices: Map 1 Map Operator Tree: TableScan alias: src Statistics: Num rows: 2906 Data size: 5812 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (key = '98') (type: boolean) Statistics: Num rows: 1453 Data size: 2906 Basic stats: COMPLETE Column stats: NONE Select Operator Statistics: Num rows: 1453 Data size: 2906 Basic stats: COMPLETE Column stats: NONE Group By Operator keys: '98' (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 1453 Data size: 2906 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: '98' (type: string) sort order: + Map-reduce partition columns: '98' (type: string) Statistics: Num rows: 1453 Data size: 2906 Basic stats: COMPLETE Column stats: NONE Reducer 2 Reduce Operator Tree: Group By Operator keys: '98' (type: string) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 726 Data size: 1452 Basic stats: COMPLETE Column stats: NONE 
Select Operator Statistics: Num rows: 726 Data size: 1452 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: count() mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: bigint) Reducer 3 Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink {code} > ArrayIndexOutOfBoundsException in VectorGroupByOperator > --- > > Key: HIVE-17383 > URL: https://issues.apache.org/jira/browse/HIVE-17383 > Project: Hive > Issue Type: Bug >Reporter: Rui Li > > Query to reproduce: > {noformat} > set hive.cbo.enable=false; > select count(*) from (select key from src group by key) s where s.key='98'; > {noformat} > The stack trace is: > {noformat} > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107) > at > org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:831) > at >
[jira] [Comment Edited] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q
[ https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143397#comment-16143397 ] liyunzhang_intel edited comment on HIVE-16823 at 8/28/17 6:06 AM: -- [~lirui]: can you help review the patch? I have one question about {{spark_vectorized_dynamic_partition_pruning.q}}: should we add {{-- SORT_QUERY_RESULTS}} to the file? Otherwise the result of
{code}
select distinct ds from srcpart
{code}
is
{code}
2008-04-09
2008-04-08
{code}
while the result in the q.out is
{code}
2008-04-08
2008-04-09
{code}
> "ArrayIndexOutOfBoundsException" in > spark_vectorized_dynamic_partition_pruning.q > > > Key: HIVE-16823 > URL: https://issues.apache.org/jira/browse/HIVE-16823 > Project: Hive > Issue Type: Bug >Reporter: Jianguo Tian >Assignee: liyunzhang_intel > Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, > HIVE-16823.patch > > > spark_vectorized_dynamic_partition_pruning.q > {code} > set hive.optimize.ppd=true; > set hive.ppd.remove.duplicatefilters=true; > set hive.spark.dynamic.partition.pruning=true; > set hive.optimize.metadataonly=false; > set hive.optimize.index.filter=true; > set hive.vectorized.execution.enabled=true; > set hive.strict.checks.cartesian.product=false; > -- parent is reduce tasks > select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart > group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08'; > {code} > The exceptions are as follows: > {code} > 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] > spark.SparkReduceRecordHandler: Fatal error: > org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing > vector batch
(tag=0) Column vector types: 0:BYTES, 1:BYTES > ["2008-04-08", "2008-04-08"] > org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing > vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES > ["2008-04-08", "2008-04-08"] > at > org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) > ~[scala-library-2.11.8.jar:?] > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > ~[scala-library-2.11.8.jar:?] > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > ~[scala-library-2.11.8.jar:?] 
> at > org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at org.apache.spark.scheduler.Task.run(Task.scala:85) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [?:1.8.0_112] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [?:1.8.0_112] > at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112] > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
[jira] [Commented] (HIVE-17383) ArrayIndexOutOfBoundsException in VectorGroupByOperator
[ https://issues.apache.org/jira/browse/HIVE-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143405#comment-16143405 ] liyunzhang_intel commented on HIVE-17383: - [~lirui]: can you help verify whether the ArrayIndexOutOfBoundsException appears or not for the above query? In my env (hive version: f86878b), no similar exception is thrown and the query passes. If there is an RS following the GBY, the exception will not be thrown. > ArrayIndexOutOfBoundsException in VectorGroupByOperator > --- > > Key: HIVE-17383 > URL: https://issues.apache.org/jira/browse/HIVE-17383 > Project: Hive > Issue Type: Bug >Reporter: Rui Li > > Query to reproduce: > {noformat} > set hive.cbo.enable=false; > select count(*) from (select key from src group by key) s where s.key='98'; > {noformat} > The stack trace is: > {noformat} > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107) > at > org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:831) > at > org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeBase.processBatch(VectorGroupByOperator.java:174) > at > org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.process(VectorGroupByOperator.java:1046) > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.processVectorGroup(ReduceRecordSource.java:462) > ... 18 more > {noformat} > More details can be found in HIVE-16823 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q
[ https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143397#comment-16143397 ] liyunzhang_intel commented on HIVE-16823: - [~lirui]: can you help review the patch? i have 1 question about {{spark_vectorized_dynamic_partition_pruning.q}}, should we add {{-- SORT_QUERY_RESULTS}} to the file, otherwise in the q.out the result of {code} select distinct ds from srcpart {code} {code} 2008-04-09 2008-04-08 {code} > "ArrayIndexOutOfBoundsException" in > spark_vectorized_dynamic_partition_pruning.q > > > Key: HIVE-16823 > URL: https://issues.apache.org/jira/browse/HIVE-16823 > Project: Hive > Issue Type: Bug >Reporter: Jianguo Tian >Assignee: liyunzhang_intel > Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, > HIVE-16823.patch > > > spark_vectorized_dynamic_partition_pruning.q > {code} > set hive.optimize.ppd=true; > set hive.ppd.remove.duplicatefilters=true; > set hive.spark.dynamic.partition.pruning=true; > set hive.optimize.metadataonly=false; > set hive.optimize.index.filter=true; > set hive.vectorized.execution.enabled=true; > set hive.strict.checks.cartesian.product=false; > -- parent is reduce tasks > select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart > group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08'; > {code} > The exceptions are as follows: > {code} > 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] > spark.SparkReduceRecordHandler: Fatal error: > org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing > vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES > ["2008-04-08", "2008-04-08"] > org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing > vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES > ["2008-04-08", "2008-04-08"] > at > org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413) > 
~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) > ~[scala-library-2.11.8.jar:?] > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > ~[scala-library-2.11.8.jar:?] > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > ~[scala-library-2.11.8.jar:?] 
> at > org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at org.apache.spark.scheduler.Task.run(Task.scala:85) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [?:1.8.0_112] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [?:1.8.0_112] > at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112] > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:832) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at >
[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q
[ https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141446#comment-16141446 ] liyunzhang_intel commented on HIVE-16823: - some update {quote} This is why the query runs if map join is disabled, in which case GBY is followed by SEL/RS instead of SparkHashTableSinkOperator. {quote} More explanation about this:
{code}
set spark.master=local;
set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=true;
set hive.spark.dynamic.partition.pruning=false;
set hive.optimize.metadataonly=false;
set hive.optimize.index.filter=true;
set hive.vectorized.execution.enabled=true;
set hive.strict.checks.cartesian.product=false;
set hive.auto.convert.join=false;
set hive.cbo.enable=false;
set hive.optimize.constant.propagation=true;
select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
{code}
Here cbo is disabled and the explain is not right: the key of the GroupBy in the Reducer is {{keys: '2008-04-08' (type: string)}} while it should be {{keys: KEY._col0 (type: string)}}, but the query finishes successfully. The reason is that there is an {{RS\[9\]}} after {{GBY\[4\]}}
{code}
GBY[4]-SEL[5]-RS[9]
{code}
{{RS\[9\]}} triggers the following stack; OutputColumnManager#allocateOutputColumn makes OutputColumnManager#getScratchColumnTypeNames return a value.
{code}
org.apache.hadoop.hive.ql.exec.vector.VectorizationContext$OutputColumnManager.allocateOutputColumn(VectorizationContext.java:478)
at org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getConstantVectorExpression(VectorizationContext.java:1153)
at org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getVectorExpression(VectorizationContext.java:688)
at org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getVectorExpressions(VectorizationContext.java:590)
at org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getVectorExpressions(VectorizationContext.java:578)
at org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.canSpecializeReduceSink(Vectorizer.java:3490)
at org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.vectorizeOperator(Vectorizer.java:4174)
at org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationNodeProcessor.doVectorize(Vectorizer.java:1632)
at org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$ReduceWorkVectorizationNodeProcessor.process(Vectorizer.java:1772)
{code}
In the log, we can see that after VectorizationNodeProcessor#doVectorize processes {{GBY\[4\]}}, the vectorization context is
{code}
2017-08-25T03:40:21,316 DEBUG [cd30697a-7797-4bbe-ad92-1fcec8a89689 main] physical.Vectorizer: Vectorized ReduceWork reduce shuffle vectorization context Context name __Reduce_Shuffle__, level 0, sorted projectionColumnMap {0=KEY._col0}, scratchColumnTypeNames []
{code}
and after VectorizationNodeProcessor#doVectorize processes {{RS\[9\]}}, the vectorization context is (here scratchColumnTypeNames returns a value)
{code}
2017-08-25T03:48:00,245 DEBUG [cd30697a-7797-4bbe-ad92-1fcec8a89689 main] physical.Vectorizer: vectorizeOperator org.apache.hadoop.hive.ql.plan.ReduceSinkDesc
2017-08-25T03:48:43,101 DEBUG [cd30697a-7797-4bbe-ad92-1fcec8a89689 main] physical.Vectorizer: Vectorized ReduceWork operator RS added vectorization context Context name SEL, level 1, sorted projectionColumnMap {}, scratchColumnTypeNames [string]
{code}
The difference in scratchColumnTypeNames causes a different value in the outputBatch in [VectorGroupKeyHelper|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupKeyHelper.java#L107]. I guess the ArrayIndexOutOfBoundsException can be reproduced under the following conditions, whether in Spark or Tez mode: 1. cbo is disabled; 2. no RS follows the GBY in the reducer. > "ArrayIndexOutOfBoundsException" in > spark_vectorized_dynamic_partition_pruning.q > > > Key: HIVE-16823 > URL: https://issues.apache.org/jira/browse/HIVE-16823 > Project: Hive > Issue Type: Bug >Reporter: Jianguo Tian >Assignee: liyunzhang_intel > Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, > HIVE-16823.patch > > > spark_vectorized_dynamic_partition_pruning.q > {code} > set hive.optimize.ppd=true; > set hive.ppd.remove.duplicatefilters=true; > set hive.spark.dynamic.partition.pruning=true; > set hive.optimize.metadataonly=false; > set hive.optimize.index.filter=true; > set hive.vectorized.execution.enabled=true; > set hive.strict.checks.cartesian.product=false; > -- parent is reduce tasks > select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart > group by ds) s on (srcpart.ds = s.ds) where s.`date` =
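The scratch-column effect described above can be illustrated with a standalone sketch (hedged: this is a toy model, not Hive's VectorizedRowBatch or VectorGroupKeyHelper; {{writeScratchColumn}} is a hypothetical helper showing only that a batch sized from an empty scratchColumnTypeNames list has no slot for the scratch column that copyGroupKey later writes to):

```java
public class ScratchColumnSketch {
    // Sizes a toy output batch from the projected key columns plus the
    // scratch columns, then writes into the first scratch slot, the way
    // copyGroupKey writes into the output batch's column array.
    static String writeScratchColumn(int keyColumns, String[] scratchColumnTypeNames) {
        Object[][] outputBatch = new Object[keyColumns + scratchColumnTypeNames.length][];
        try {
            outputBatch[keyColumns] = new Object[1024]; // index 1 = first scratch slot
            return "ok";
        } catch (ArrayIndexOutOfBoundsException e) {
            return "ArrayIndexOutOfBoundsException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        int keyColumns = 1;
        // Context built when RS[9] is vectorized: one scratch column allocated.
        System.out.println(writeScratchColumn(keyColumns, new String[]{"string"}));
        // Context built when no RS follows the GBY: scratchColumnTypeNames is [].
        System.out.println(writeScratchColumn(keyColumns, new String[]{}));
    }
}
```

The second call fails exactly like the reported stack: the write targets column index 1, but the batch only has one column.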
[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q
[ https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139655#comment-16139655 ] liyunzhang_intel commented on HIVE-16823: - [~lirui]: Although ConstantPropagate influences the logical plan, Hive on Tez does not throw the exception.
{code}
set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=true;
set hive.tez.dynamic.partition.pruning=true;
set hive.optimize.metadataonly=false;
set hive.optimize.index.filter=true;
set hive.vectorized.execution.enabled=true;
set hive.strict.checks.cartesian.product=false;
set hive.cbo.enable=false;
set hive.user.install.directory=file:///tmp;
set fs.default.name=file:///;
set fs.defaultFS=file:///;
set tez.staging-dir=/tmp;
set tez.ignore.lib.uris=true;
set tez.runtime.optimize.local.fetch=true;
set tez.local.mode=true;
set hive.explain.user=false;
select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
{code}
The explain (it seems the key of the GroupByOperator is not right):
{code}
Reducer 2
  Execution mode: vectorized
  Reduce Operator Tree:
    Group By Operator
      keys: '2008-04-08' (type: string)
      mode: mergepartial
      outputColumnNames: _col0
      Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
      Select Operator
        Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
        Map Join Operator
{code}
I need more time to investigate why Tez is not affected when cbo is disabled. But I guess this is another problem; any suggestions?
> "ArrayIndexOutOfBoundsException" in > spark_vectorized_dynamic_partition_pruning.q > > > Key: HIVE-16823 > URL: https://issues.apache.org/jira/browse/HIVE-16823 > Project: Hive > Issue Type: Bug >Reporter: Jianguo Tian >Assignee: liyunzhang_intel > Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, > HIVE-16823.patch > > > spark_vectorized_dynamic_partition_pruning.q > {code} > set hive.optimize.ppd=true; > set hive.ppd.remove.duplicatefilters=true; > set hive.spark.dynamic.partition.pruning=true; > set hive.optimize.metadataonly=false; > set hive.optimize.index.filter=true; > set hive.vectorized.execution.enabled=true; > set hive.strict.checks.cartesian.product=false; > -- parent is reduce tasks > select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart > group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08'; > {code} > The exceptions are as follows: > {code} > 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] > spark.SparkReduceRecordHandler: Fatal error: > org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing > vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES > ["2008-04-08", "2008-04-08"] > org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing > vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES > ["2008-04-08", "2008-04-08"] > at > org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > 
org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85) > ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT] > at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) > ~[scala-library-2.11.8.jar:?] > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > ~[scala-library-2.11.8.jar:?] > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > ~[scala-library-2.11.8.jar:?] > at > org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) > ~[spark-core_2.11-2.0.0.jar:2.0.0] > at > org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) >
[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q
[ https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139569#comment-16139569 ] liyunzhang_intel commented on HIVE-16823: - [~lirui]: I found that if cbo is enabled with your settings, everything works fine.
{code}
set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=true;
set hive.spark.dynamic.partition.pruning=false;
set hive.optimize.metadataonly=false;
set hive.optimize.index.filter=true;
set hive.vectorized.execution.enabled=true;
set hive.strict.checks.cartesian.product=false;
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask = true;
set hive.auto.convert.join.noconditionaltask.size = 1000;
set hive.optimize.constant.propagation=true;
{code}
When cbo is enabled, the explain is
{code}
Map 3
  Map Operator Tree:
    TableScan
      alias: srcpart
      filterExpr: (ds = '2008-04-08') (type: boolean)
      Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats: NONE
      Select Operator
        Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats: NONE
        Group By Operator
          keys: '2008-04-08' (type: string)
          mode: hash
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
          Reduce Output Operator
            key expressions: _col0 (type: string)
            sort order: +
            Map-reduce partition columns: _col0 (type: string)
            Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
  Execution mode: vectorized
Reducer 4
  Execution mode: vectorized
  Local Work:
    Map Reduce Local Work
  Reduce Operator Tree:
    Group By Operator
      keys: KEY._col0 (type: string)
      mode: mergepartial
      outputColumnNames: _col0
{code}
When cbo is disabled, the explain is
{code}
Map 1
  Map Operator Tree:
    TableScan
      alias: srcpart
      filterExpr: (true and (ds = '2008-04-08')) (type: boolean)
      Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats: NONE
      Filter Operator
        predicate: true (type: boolean)
        Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
        Select Operator
          Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
          Group By Operator
            keys: '2008-04-08' (type: string)
            mode: hash
            outputColumnNames: _col0
            Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: '2008-04-08' (type: string)
              sort order: +
              Map-reduce partition columns: '2008-04-08' (type: string)
              Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
  Execution mode: vectorized
Reducer 2
  Execution mode: vectorized
  Local Work:
    Map Reduce Local Work
  Reduce Operator Tree:
    Group By Operator
      keys: '2008-04-08' (type: string)
{code}
The difference is the key of the GroupByOperator in the Reducer. I do not know yet why cbo causes the wrong explain; I need to investigate. > "ArrayIndexOutOfBoundsException" in > spark_vectorized_dynamic_partition_pruning.q > > > Key: HIVE-16823 > URL: https://issues.apache.org/jira/browse/HIVE-16823 > Project: Hive > Issue Type: Bug >Reporter: Jianguo Tian >Assignee: liyunzhang_intel > Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, > HIVE-16823.patch > > > spark_vectorized_dynamic_partition_pruning.q > {code} > set hive.optimize.ppd=true; > set hive.ppd.remove.duplicatefilters=true; > set hive.spark.dynamic.partition.pruning=true; > set hive.optimize.metadataonly=false; > set hive.optimize.index.filter=true; > set hive.vectorized.execution.enabled=true; > set hive.strict.checks.cartesian.product=false; > -- parent is reduce tasks > select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart > group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08'; > {code} > The exceptions are as follows: > {code} >
[jira] [Commented] (HIVE-10349) overflow in stats
[ https://issues.apache.org/jira/browse/HIVE-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139497#comment-16139497 ] liyunzhang_intel commented on HIVE-10349:
-
[~sershe]: I met a similar overflow problem when running TPC-DS query17 on Hive on Spark; the explain is in the [link|https://issues.apache.org/jira/secure/attachment/12875204/query17_explain.log]. What's the root cause of the problem?
> overflow in stats
> -
> Key: HIVE-10349
> URL: https://issues.apache.org/jira/browse/HIVE-10349
> Project: Hive
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Prasanth Jayachandran
>
> Discovered while running q17 in LLAP.
> {noformat}
> Reducer 2
>   Execution mode: llap
>   Reduce Operator Tree:
>     Merge Join Operator
>       condition map:
>            Inner Join 0 to 1
>       keys:
>         0 _col28 (type: int), _col27 (type: int)
>         1 cs_bill_customer_sk (type: int), cs_item_sk (type: int)
>       outputColumnNames: _col1, _col2, _col6, _col8, _col9, _col22, _col27, _col28, _col34, _col35, _col45, _col51, _col63, _col66, _col82
>       Statistics: Num rows: 1047651367827495040 Data size: 9223372036854775807 Basic stats: COMPLETE Column stats: PARTIAL
>       Map Join Operator
>         condition map:
>              Inner Join 0 to 1
>         keys:
>           0 _col22 (type: int)
>           1 d_date_sk (type: int)
>         outputColumnNames: _col1, _col2, _col6, _col8, _col9, _col22, _col27, _col28, _col34, _col35, _col45, _col51, _col63, _col66, _col82, _col86
>         input vertices:
>           1 Map 7
>         Statistics: Num rows: 1152416529588199552 Data size: 9223372036854775807 Basic stats: COMPLETE Column stats: NONE
> {noformat}
> Data size overflows and row count also looks wrong. I wonder if this is why it generates 1009 reducers for this stage on 6 machines
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
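The overflow above comes from multiplying 64-bit row-count and data-size estimates during join cardinality estimation, which silently wraps in Java. A minimal standalone sketch (a hypothetical helper, not Hive's actual StatsUtils API) of the usual fix, saturating the estimate at Long.MAX_VALUE instead of wrapping:

```java
// Hypothetical helper illustrating saturating stats arithmetic: detect
// overflow of a long multiplication and clamp at Long.MAX_VALUE rather
// than letting the estimate wrap around (possibly going negative).
public class SaturatingStats {

    // Multiply two non-negative estimates, clamping on overflow.
    public static long safeMult(long a, long b) {
        if (a == 0 || b == 0) {
            return 0;
        }
        long result = a * b;
        // Overflow check: division must recover the original operand.
        if (result / b != a) {
            return Long.MAX_VALUE;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(safeMult(3, 4));                       // 12
        // (Long.MAX_VALUE / 2 + 1) * 2 wraps to Long.MIN_VALUE in plain
        // long arithmetic; the saturating version clamps instead.
        System.out.println(safeMult(Long.MAX_VALUE / 2 + 1, 2));  // 9223372036854775807
    }
}
```

Java 8+ also offers `Math.multiplyExact`, which throws `ArithmeticException` on overflow; clamping is preferable here because a stats estimate should degrade gracefully rather than abort planning.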
[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q
[ https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138071#comment-16138071 ] liyunzhang_intel commented on HIVE-16823:
-
I explained more about the big changes in spark_vectorized_dynamic_partition_pruning.q.out on Review Board. [~lirui] and [~stakiar]: if you have time, please help review, thanks!
[jira] [Updated] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q
[ https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-16823:
Attachment: HIVE-16823.1.patch
[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q
[ https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137991#comment-16137991 ] liyunzhang_intel commented on HIVE-16823:
-
Updated the q.out changes in HIVE-16823.1.patch. Most of the changes look like the following; this is because we remove ConstantPropagate in SparkCompiler#runDynamicPartitionPruning:
{code}
 Map Operator Tree:
     TableScan
       alias: srcpart_date
-      filterExpr: ((date = '2008-04-08') and ds is not null) (type: boolean)
+      filterExpr: ((date = '2008-04-08') and ds is not null and true) (type: boolean)
       Statistics: Num rows: 2 Data size: 42 Basic stats: COMPLETE Column stats: NONE
       Filter Operator
-        predicate: ((date = '2008-04-08') and ds is not null) (type: boolean)
+        predicate: ((date = '2008-04-08') and ds is not null and true) (type: boolean)
         Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column stats: NONE
{code}
The big changes in spark_vectorized_dynamic_partition_pruning.q.out are because this file had not been updated for a long time.
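The literal `true` conjuncts survive because the removed constant-folding pass is what used to drop them. What that "shortcut" folding does can be illustrated with a standalone toy simplifier (purely illustrative, not Hive's ConstantPropagate code): in an AND, `true` is neutral and is dropped, while `false` dominates and collapses the whole expression.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of shortcut constant folding for AND expressions, the
// kind of simplification ConstantPropagateOption.SHORTCUT performs: only
// constant true/false children are handled, other conjuncts pass through.
public class ShortcutFolder {

    // Fold an AND over child predicates given as strings, where "true" and
    // "false" are the only constants we recognize.
    public static String foldAnd(List<String> children) {
        List<String> kept = new ArrayList<>();
        for (String c : children) {
            if (c.equals("false")) {
                return "false";   // dominating constant collapses the AND
            }
            if (!c.equals("true")) {
                kept.add(c);      // "true" is neutral in AND, drop it
            }
        }
        if (kept.isEmpty()) {
            return "true";        // all conjuncts were neutral
        }
        return String.join(" and ", kept);
    }

    public static void main(String[] args) {
        // The predicate from the q.out diff above, with the trailing literal.
        System.out.println(foldAnd(List.of("(date = '2008-04-08')", "ds is not null", "true")));
        // (date = '2008-04-08') and ds is not null
    }
}
```

Without this pass, the planner emits the unsimplified predicate, which is why the q.out diffs show `and true` appearing in both filterExpr and the Filter Operator predicate.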
[jira] [Updated] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q
[ https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-16823:
Status: Patch Available (was: Open)
[jira] [Updated] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q
[ https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated HIVE-16823:
Attachment: HIVE-16823.patch
In HIVE-15269 (Dynamic Min-Max/BloomFilter runtime-filtering for Tez), ConstantPropagate was removed from TezCompiler#runDynamicPartitionPruning. The similar code should be removed from SparkCompiler#runDynamicPartitionPruning:
{code}
private void runDynamicPartitionPruning(OptimizeTezProcContext procCtx, Set<ReadEntity> inputs,
    Set<WriteEntity> outputs) throws SemanticException {

  if (!procCtx.conf.getBoolVar(ConfVars.TEZ_DYNAMIC_PARTITION_PRUNING)) {
    return;
  }

  // Sequence of TableScan operators to be walked
  Deque<Operator<?>> deque = new LinkedList<Operator<?>>();
  deque.addAll(procCtx.parseContext.getTopOps().values());

  Map<Rule, NodeProcessor> opRules = new LinkedHashMap<Rule, NodeProcessor>();
  opRules.put(
      new RuleRegExp(new String("Dynamic Partition Pruning"), FilterOperator.getOperatorName() + "%"),
      new DynamicPartitionPruningOptimization());

  // The dispatcher fires the processor corresponding to the closest matching
  // rule and passes the context along
  Dispatcher disp = new DefaultRuleDispatcher(null, opRules, procCtx);
  List<Node> topNodes = new ArrayList<Node>();
  topNodes.addAll(procCtx.parseContext.getTopOps().values());
  GraphWalker ogw = new ForwardWalker(disp);
  ogw.startWalking(topNodes, null);

  /*** Similar code is removed in TezCompiler in HIVE-15269: Dynamic Min-Max/BloomFilter runtime-filtering for Tez ***/
  // need a new run of the constant folding because we might have created lots
  // of "and true and true" conditions.
  // Rather than run the full constant folding just need to shortcut AND/OR expressions
  // involving constant true/false values.
  if (procCtx.conf.getBoolVar(ConfVars.HIVEOPTCONSTANTPROPAGATION)) {
    new ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(procCtx.parseContext);
  }
}
{code}
[~lirui], [~stakiar]: can you help review?
[jira] [Comment Edited] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q
[ https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16136156#comment-16136156 ] liyunzhang_intel edited comment on HIVE-16823 at 8/22/17 2:16 AM:
--
Some updates about the jira. The root cause of the problem is the difference in the plan for the sub-query {{select ds as ds, ds as `date` from srcpart group by ds}} between tez and spark mode.
The spark explain (the full spark explain is attached [here|https://issues.apache.org/jira/secure/attachment/12883036/explain.spark]):
{code}
Map 3
    Map Operator Tree:
        TableScan
          alias: srcpart
          filterExpr: (ds = '2008-04-08') (type: boolean)
          Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats: NONE
          Select Operator
            Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats: NONE
            Group By Operator
              keys: '2008-04-08' (type: string)
              mode: hash
              outputColumnNames: _col0
              Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: '2008-04-08' (type: string)
                sort order: +
                Map-reduce partition columns: '2008-04-08' (type: string)
                Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
Reducer 4
    Local Work:
      Map Reduce Local Work
    Reduce Operator Tree:
      Group By Operator
        keys: '2008-04-08' (type: string)
        mode: mergepartial
        outputColumnNames: _col0
        Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
{code}
The tez explain (the full tez explain is attached [here|https://issues.apache.org/jira/secure/attachment/12883035/explain.tez]):
{code}
Map 2
    Map Operator Tree:
        TableScan
          alias: srcpart
          filterExpr: (ds = '2008-04-08') (type: boolean)
          Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats: NONE
          Select Operator
            Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats: NONE
            Group By Operator
              keys: '2008-04-08' (type: string)
              mode: hash
              outputColumnNames: _col0
              Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: string)
                sort order: +
                Map-reduce partition columns: _col0 (type: string)
                Statistics: Num rows: 1 Data size: 11624 Basic stats: COMPLETE Column stats: NONE
    Execution mode: vectorized
Reducer 3
    Execution mode: vectorized
    Reduce Operator Tree:
      Group By Operator
        keys: KEY._col0 (type: string)
        mode: mergepartial
        outputColumnNames: _col0
{code}
The Group By Operator appears in both the Map and the Reducer in tez and spark mode, but the key of the GroupByOperator in the Reducer is different. In tez the key is {{keys: KEY._col0 (type: string)}}, while in spark the key is {{keys: '2008-04-08' (type: string)}}. This difference causes [VectorizationContext#getVectorExpression|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizationContext.java#L579] to return [getColumnVectorExpression|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizationContext.java#L582] in tez mode while returning [getConstantVectorExpression|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizationContext.java#L660] in spark mode.
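The two VectorizationContext branches can be pictured with a toy dispatcher (purely illustrative stand-ins, not Hive's real classes or signatures): a group-by key that is still a column reference takes the column-expression path, while a key that constant folding has turned into a literal takes the constant-expression path, which is the divergence that later trips up the vectorized reducer.

```java
// Toy model of the expression dispatch discussed above (illustrative only,
// not Hive's VectorizationContext API): a column-reference key yields a
// column expression, a folded string literal yields a constant expression.
public class KeyExpressionDispatch {

    // Return a label naming which vector expression kind the key maps to.
    public static String vectorize(String keyExpr) {
        if (keyExpr.startsWith("'") && keyExpr.endsWith("'")) {
            // Constant key, as in the spark plan after folding.
            return "ConstantVectorExpression(" + keyExpr + ")";
        }
        // Column reference key, as in the tez plan.
        return "IdentityExpression(" + keyExpr + ")";
    }

    public static void main(String[] args) {
        System.out.println(vectorize("KEY._col0"));      // tez-style key
        System.out.println(vectorize("'2008-04-08'"));   // spark-style key
    }
}
```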