[jira] [Commented] (HIVE-18340) Dynamic Min-Max/BloomFilter runtime-filtering in HoS
[ https://issues.apache.org/jira/browse/HIVE-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372473#comment-16372473 ]

Ke Jia commented on HIVE-18340:
-------------------------------

[~stakiar]: This optimization has the following effect:
{code:java}
set hive.optimize.index.filter=true;
set hive.auto.convert.join=false;
create table pokes(foo int);
create table poke1(foo1 int, fil string);
insert into table pokes values(1);
insert into table poke1 values(1, "123");
explain select count(*) from pokes join poke1 on (pokes.foo = poke1.foo1) where poke1.fil=123;
{code}
With RF enabled ("set hive.spark.dynamic.runtimefilter.pruning=true;"), the explain output shows:
{code:java}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      Edges:
        Reducer 6 <- Map 5 (GROUP, 1)
      DagName: root_20180222135336_d8f32495-a93d-4c59-8b56-7a9d78304a41:4
      Vertices:
        Map 5
            Map Operator Tree:
                TableScan
                  alias: pokes
                  filterExpr: foo is not null (type: boolean)
                  Statistics: Num rows: 3 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: foo is not null (type: boolean)
                    Statistics: Num rows: 3 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: foo (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 3 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: min(_col0), max(_col0), bloom_filter(_col0, expectedEntries=3)
                        mode: hash
                        outputColumnNames: _col0, _col1, _col2
                        Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          sort order:
                          Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
                          value expressions: _col0 (type: int), _col1 (type: int), _col2 (type: binary)
        Reducer 6
            Reduce Operator Tree:
              Group By Operator
                aggregations: min(VALUE._col0), max(VALUE._col1), bloom_filter(VALUE._col2, expectedEntries=3)
                mode: final
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
                Spark Runtime Filter Partition Pruning Sink Operator
                  Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
                  target column name: foo1
                  target work: Map 4

  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 28), Map 4 (PARTITION-LEVEL SORT, 28)
        Reducer 3 <- Reducer 2 (GROUP, 1)
      DagName: root_20180222135336_d8f32495-a93d-4c59-8b56-7a9d78304a41:3
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: pokes
                  filterExpr: foo is not null (type: boolean)
                  Statistics: Num rows: 3 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: foo is not null (type: boolean)
                    Statistics: Num rows: 3 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: foo (type: int)
                      sort order: +
                      Map-reduce partition columns: foo (type: int)
                      Statistics: Num rows: 3 Data size: 4 Basic stats: COMPLETE Column stats: NONE
        Map 4
            Map Operator Tree:
                TableScan
                  alias: poke1
                  filterExpr: (foo1 is not null and (foo1 BETWEEN DynamicValue(RS_3_pokes_foo_min) AND DynamicValue(RS_3_pokes_foo_max) and in_bloom_filter(foo1, DynamicValue(RS_3_pokes_foo_bloom_filter))) and (fil = 123)) (type: boolean)
                  Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (foo1 is not null and (foo1 BETWEEN DynamicValue(RS_3_pokes_foo_min) AND DynamicValue(RS_3_pokes_foo_max) and in_bloom_filter(foo1, DynamicValue(RS_3_pokes_foo_bloom_filter))) and (fil = 123)) (type: boolean)
                    Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: foo1 (type: int)
                      sort order: +
                      Map-reduce partition columns:
[jira] [Commented] (HIVE-18340) Dynamic Min-Max/BloomFilter runtime-filtering in HoS
[ https://issues.apache.org/jira/browse/HIVE-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356426#comment-16356426 ]

Ke Jia commented on HIVE-18340:
-------------------------------

[~stakiar]:
{quote}Hive-on-Tez has an implementation of DynamicValueRegistry that uses some special Tez APIs such as ProcessorContext#waitForAllInputsReady, how are we simulating this in HoS?
{quote}
[~kellyzly], yes. For HoS, I flush the runtime filter info (min/max and bloom filter) to HDFS in the SparkRuntimeFilterPruningSinkOperator operator and read it back from HDFS in SparkRuntimeFilterPruner. This is similar to what the SparkPartitionPruningSinkOperator and SparkDynamicPartitionPruner classes do in Spark DPP.
{quote}It would be nice to have some qtests to help visualize what the explain plan with RF would look like
{quote}
I have uploaded HIVE-18340.2.patch, which adds the qtest "spark_runtime_filter_pruning.q" and its expected output "spark_runtime_filter_pruning.q.out".

Thanks [~stakiar], [~kellyzly] for your review!

> Dynamic Min-Max/BloomFilter runtime-filtering in HoS
> ----------------------------------------------------
>
> Key: HIVE-18340
> URL: https://issues.apache.org/jira/browse/HIVE-18340
> Project: Hive
> Issue Type: New Feature
> Components: Spark
> Affects Versions: 3.0.0
> Reporter: Ke Jia
> Assignee: Ke Jia
> Priority: Major
> Attachments: HIVE-18340.1.patch, HIVE-18340.2.patch
>
> Tez implemented Dynamic Min-Max/BloomFilter runtime-filtering in HIVE-15269
> and we should implement the same in HoS.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
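The HDFS hand-off described in the comment above can be pictured roughly as follows. This is only a minimal sketch and not the code in the patch: the class name `RuntimeFilter`, its fixed 1024-bit size, the two hash functions, and the use of a local temp file in place of an HDFS path are all assumptions made for illustration. The build side folds join keys into a min/max range plus a small bloom filter and writes them out; the probe side reads them back and drops rows that cannot match.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.BitSet;

/** Hypothetical min/max + bloom filter summary of a join key (not a Hive class). */
final class RuntimeFilter {
    static final int NUM_BITS = 1024; // illustrative size, not Hive's sizing rule

    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    final BitSet bits = new BitSet(NUM_BITS);

    // Two simple bit positions per key; a production bloom filter hashes more carefully.
    private static int h1(long v) {
        return Math.floorMod(Long.hashCode(v * 0x9E3779B97F4A7C15L), NUM_BITS);
    }

    private static int h2(long v) {
        return Math.floorMod(Long.hashCode(Long.rotateLeft(v, 31) ^ 0xC2B2AE3D27D4EB4FL), NUM_BITS);
    }

    /** Build side: fold one join-key value into the summary. */
    void add(long v) {
        min = Math.min(min, v);
        max = Math.max(max, v);
        bits.set(h1(v));
        bits.set(h2(v));
    }

    /** Probe side: false means the row cannot match; true means it might. */
    boolean mightContain(long v) {
        return v >= min && v <= max && bits.get(h1(v)) && bits.get(h2(v));
    }

    /** Sink side: write the summary to a file (a local path stands in for HDFS here). */
    void writeTo(Path file) throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
            out.writeLong(min);
            out.writeLong(max);
            byte[] raw = bits.toByteArray();
            out.writeInt(raw.length);
            out.write(raw);
        }
    }

    /** Pruner side: read the summary back before scanning the probe table. */
    static RuntimeFilter readFrom(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            RuntimeFilter f = new RuntimeFilter();
            f.min = in.readLong();
            f.max = in.readLong();
            byte[] raw = new byte[in.readInt()];
            in.readFully(raw);
            f.bits.or(BitSet.valueOf(raw));
            return f;
        }
    }
}

final class RuntimeFilterDemo {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("runtime-filter", ".bin");
        RuntimeFilter built = new RuntimeFilter();
        built.add(1L); // the single value inserted into pokes.foo in the example
        built.writeTo(file);

        RuntimeFilter loaded = RuntimeFilter.readFrom(file);
        System.out.println("foo1 = 1 kept?  " + loaded.mightContain(1L));
        System.out.println("foo1 = 42 kept? " + loaded.mightContain(42L));
        Files.delete(file);
    }
}
```

The min/max check is the cheap first test, matching the `BETWEEN ... AND ...` part of the filterExpr in the explain plan; the bloom filter then rejects most in-range values that were never seen on the build side, at the cost of occasional false positives.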
[jira] [Commented] (HIVE-18340) Dynamic Min-Max/BloomFilter runtime-filtering in HoS
[ https://issues.apache.org/jira/browse/HIVE-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356365#comment-16356365 ]

liyunzhang commented on HIVE-18340:
-----------------------------------

[~stakiar]:
{quote}
Hive-on-Tez has an implementation of DynamicValueRegistry that uses some special Tez APIs such as ProcessorContext#waitForAllInputsReady, how are we simulating this in HoS?
{quote}
ProcessorContext#waitForAllInputsReady is called by {{org.apache.hadoop.hive.ql.exec.tez.DynamicValueRegistryTez#init}} to read the runtime filter info. For HoS, I guess [~Jk_self] will read the info from HDFS, similar to Spark DPP. If my understanding is wrong, [~stakiar], [~Jk_Self], please correct me.
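As a rough illustration of the "wait until the inputs are ready, then resolve the placeholder" pattern discussed above, here is a minimal, self-contained sketch. It does not use the Tez or Hive APIs: `SketchValueRegistry`, `publish`, and `resolve` are hypothetical names, and the latch merely stands in for whatever readiness signal the engine provides.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical registry mapping placeholder ids to values produced upstream. */
final class SketchValueRegistry {
    private final Map<String, Object> values = new ConcurrentHashMap<>();
    private final CountDownLatch ready = new CountDownLatch(1);

    /** Producer side: publish all values, then signal that lookups may proceed. */
    void publish(Map<String, Object> produced) {
        values.putAll(produced);
        ready.countDown();
    }

    /** Consumer side: block until values are published, then resolve the id. */
    Object resolve(String id, long timeoutMillis) throws InterruptedException {
        if (!ready.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException("runtime filter values not ready: " + id);
        }
        Object value = values.get(id);
        if (value == null) {
            throw new IllegalArgumentException("unknown dynamic value: " + id);
        }
        return value;
    }
}
```

A consumer would call something like `resolve("RS_3_pokes_foo_min", 5000)` before evaluating the range predicate. In the HoS design described in this thread, the blocking step is presumably unnecessary because the stage that writes the filter finishes before the stage that reads it starts.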
[jira] [Commented] (HIVE-18340) Dynamic Min-Max/BloomFilter runtime-filtering in HoS
[ https://issues.apache.org/jira/browse/HIVE-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356047#comment-16356047 ]

Sahil Takiar commented on HIVE-18340:
-------------------------------------

Some high level questions:
* Hive-on-Tez has an implementation of {{DynamicValueRegistry}} that uses some special Tez APIs such as {{ProcessorContext#waitForAllInputsReady}}, how are we simulating this in HoS?
* It would be nice to have some qtests to help visualize what the explain plan with RF would look like
[jira] [Commented] (HIVE-18340) Dynamic Min-Max/BloomFilter runtime-filtering in HoS
[ https://issues.apache.org/jira/browse/HIVE-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354871#comment-16354871 ]

Ke Jia commented on HIVE-18340:
-------------------------------

[~stakiar]:
>Have we done any performance analysis for this feature?
We benchmarked this feature on TPC-DS in HoT and saw improvements of +20% in query82/88 and +10% in query34/90. The design doc and HIVE-18340.1.patch are the initial implementation for HoS. Could you help review the design doc and HIVE-18340.1.patch? Thanks for your help!
[jira] [Commented] (HIVE-18340) Dynamic Min-Max/BloomFilter runtime-filtering in HoS
[ https://issues.apache.org/jira/browse/HIVE-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354529#comment-16354529 ]

Sahil Takiar commented on HIVE-18340:
-------------------------------------

Have we done any performance analysis for this feature?