[jira] [Commented] (HIVE-18340) Dynamic Min-Max/BloomFilter runtime-filtering in HoS

2018-02-21 Thread Ke Jia (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372473#comment-16372473
 ] 

Ke Jia commented on HIVE-18340:
---

[~stakiar]:

This optimization has the following effect:
{code:java}
set hive.optimize.index.filter=true;
set hive.auto.convert.join=false;
create table pokes(foo int);
create table poke1(foo1 int, fil string);
insert into table pokes values(1);
insert into table poke1 values(1, "123");

explain select count(*) from pokes join poke1  on (pokes.foo = poke1.foo1) 
where poke1.fil=123;
{code}
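For the join above, a runtime filter built from pokes.foo amounts to a range check plus a Bloom filter membership test on poke1.foo1. The following is a minimal Python sketch of that idea, not Hive's implementation: the class and function names are hypothetical, and the Bloom filter is a toy one with an assumed size and hash scheme.

```python
import hashlib


class ToyBloomFilter:
    """Toy Bloom filter (assumption: not the structure Hive's bloom_filter uses)."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a Python int

    def _positions(self, value):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # May return True for a value never added (false positive),
        # but never returns False for a value that was added.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_runtime_filter(build_keys):
    """Summarize the build side as (min, max, bloom); assumes at least one non-null key."""
    keys = [k for k in build_keys if k is not None]
    bloom = ToyBloomFilter()
    for k in keys:
        bloom.add(k)
    return min(keys), max(keys), bloom


def passes_runtime_filter(value, lo, hi, bloom):
    """Probe-side check: not null, within [lo, hi], and possibly in the Bloom filter."""
    return value is not None and lo <= value <= hi and bloom.might_contain(value)


lo, hi, bloom = build_runtime_filter([1, 5, 9])   # stands in for pokes.foo
probe = [0, 1, 5, 9, 42, None]                    # stands in for poke1.foo1
kept = [v for v in probe if passes_runtime_filter(v, lo, hi, bloom)]
print(kept)
```

The range check alone rejects 0 and 42 here. Values inside the range that were never added can still pass the Bloom filter check, so the join itself remains responsible for exact matching.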
With RF enabled ("set hive.spark.dynamic.runtimefilter.pruning=true;"), the 
explain output shows:

{code:java}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      Edges:
        Reducer 6 <- Map 5 (GROUP, 1)
      DagName: root_20180222135336_d8f32495-a93d-4c59-8b56-7a9d78304a41:4
      Vertices:
        Map 5
            Map Operator Tree:
                TableScan
                  alias: pokes
                  filterExpr: foo is not null (type: boolean)
                  Statistics: Num rows: 3 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: foo is not null (type: boolean)
                    Statistics: Num rows: 3 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: foo (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 3 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: min(_col0), max(_col0), bloom_filter(_col0, expectedEntries=3)
                        mode: hash
                        outputColumnNames: _col0, _col1, _col2
                        Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          sort order:
                          Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
                          value expressions: _col0 (type: int), _col1 (type: int), _col2 (type: binary)
        Reducer 6
            Reduce Operator Tree:
              Group By Operator
                aggregations: min(VALUE._col0), max(VALUE._col1), bloom_filter(VALUE._col2, expectedEntries=3)
                mode: final
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
                Spark Runtime Filter Partition Pruning Sink Operator
                  Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
                  target column name: foo1
                  target work: Map 4

  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 28), Map 4 (PARTITION-LEVEL SORT, 28)
        Reducer 3 <- Reducer 2 (GROUP, 1)
      DagName: root_20180222135336_d8f32495-a93d-4c59-8b56-7a9d78304a41:3
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: pokes
                  filterExpr: foo is not null (type: boolean)
                  Statistics: Num rows: 3 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: foo is not null (type: boolean)
                    Statistics: Num rows: 3 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: foo (type: int)
                      sort order: +
                      Map-reduce partition columns: foo (type: int)
                      Statistics: Num rows: 3 Data size: 4 Basic stats: COMPLETE Column stats: NONE
        Map 4
            Map Operator Tree:
                TableScan
                  alias: poke1
                  filterExpr: (foo1 is not null and (foo1 BETWEEN DynamicValue(RS_3_pokes_foo_min) AND DynamicValue(RS_3_pokes_foo_max) and in_bloom_filter(foo1, DynamicValue(RS_3_pokes_foo_bloom_filter))) and (fil = 123)) (type: boolean)
                  Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (foo1 is not null and (foo1 BETWEEN DynamicValue(RS_3_pokes_foo_min) AND DynamicValue(RS_3_pokes_foo_max) and in_bloom_filter(foo1, DynamicValue(RS_3_pokes_foo_bloom_filter))) and (fil = 123)) (type: boolean)
                    Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: foo1 (type: int)
                      sort order: +
                      Map-reduce partition columns: 

[jira] [Commented] (HIVE-18340) Dynamic Min-Max/BloomFilter runtime-filtering in HoS

2018-02-07 Thread Ke Jia (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356426#comment-16356426
 ] 

Ke Jia commented on HIVE-18340:
---

[~stakiar]:
{quote}Hive-on-Tez has an implementation of DynamicValueRegistry that uses 
some special Tez APIs such as ProcessorContext#waitForAllInputsReady; how are 
we simulating this in HoS?
{quote}
[~kellyzly]: Yes. For HoS, I flush the runtime filter info (min/max and bloom 
filter) to HDFS in the SparkRuntimeFilterPruningSinkOperator operator and read 
it back from HDFS in SparkRuntimeFilterPruner. This is similar to how the 
SparkPartitionPruningSinkOperator and SparkDynamicPartitionPruner classes work 
in Spark DPP.
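
The handoff described above can be sketched in Python as follows. This is only an illustration under stated assumptions: a local temporary directory stands in for HDFS, the JSON file layout is invented, an exact set of keys stands in for the serialized bloom filter, and the function names (sink_flush, pruner_load, prune) are hypothetical rather than Hive APIs.

```python
import json
import os
import tempfile


def sink_flush(filter_dir, target_work, column, values):
    """Producer side: summarize the join keys and write the summary to a file."""
    keys = [v for v in values if v is not None]
    info = {
        "column": column,
        "min": min(keys),
        "max": max(keys),
        "members": sorted(set(keys)),  # stand-in for the serialized bloom filter
    }
    path = os.path.join(filter_dir, f"{target_work}.json")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(info, f)
    os.replace(tmp_path, path)  # rename so a reader never sees a partial file
    return path


def pruner_load(filter_dir, target_work):
    """Consumer side: read the summary before scanning the target table."""
    with open(os.path.join(filter_dir, f"{target_work}.json")) as f:
        return json.load(f)


def prune(rows, info):
    """Keep rows whose join key is non-null, in range, and in the membership set."""
    col = info["column"]
    members = set(info["members"])
    return [
        r for r in rows
        if r[col] is not None
        and info["min"] <= r[col] <= info["max"]
        and r[col] in members
    ]


with tempfile.TemporaryDirectory() as filter_dir:
    sink_flush(filter_dir, "map_4", "foo1", [1])       # keys from pokes.foo
    info = pruner_load(filter_dir, "map_4")
    rows = [
        {"foo1": 1, "fil": "123"},
        {"foo1": 2, "fil": "123"},
        {"foo1": None, "fil": "x"},
    ]
    surviving = prune(rows, info)
    print(surviving)
```

The sketch relies on the producer finishing before the consumer starts, which a file-based handoff needs in place of an API that blocks until inputs are ready.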

{quote}It would be nice to have some qtests to help visualize what the explain 
plan with RF would look like
{quote}

I uploaded HIVE-18340.2.patch, which adds the qtest "spark_runtime_filter_pruning.q" 
and its output file "spark_runtime_filter_pruning.q.out". 

Thanks [~stakiar] and [~kellyzly] for your review!

 

> Dynamic Min-Max/BloomFilter runtime-filtering in HoS
> 
>
> Key: HIVE-18340
> URL: https://issues.apache.org/jira/browse/HIVE-18340
> Project: Hive
>  Issue Type: New Feature
>  Components: Spark
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Attachments: HIVE-18340.1.patch, HIVE-18340.2.patch
>
>
> Tez implemented Dynamic Min-Max/BloomFilter runtime-filtering in HIVE-15269 
> and we should implement the same in HOS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18340) Dynamic Min-Max/BloomFilter runtime-filtering in HoS

2018-02-07 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356365#comment-16356365
 ] 

liyunzhang commented on HIVE-18340:
---

[~stakiar]: {quote}
Hive-on-Tez has an implementation of DynamicValueRegistry that uses some 
special Tez APIs such as ProcessorContext#waitForAllInputsReady; how are we 
simulating this in HoS?
{quote}
{{ProcessorContext#waitForAllInputsReady}} is called by 
{{org.apache.hadoop.hive.ql.exec.tez.DynamicValueRegistryTez#init}} to read the 
runtime filter info. For HoS, I guess [~Jk_self] will read the info from HDFS, 
similar to Spark DPP. 

If my understanding is wrong, [~stakiar] and [~Jk_Self], please correct me.



[jira] [Commented] (HIVE-18340) Dynamic Min-Max/BloomFilter runtime-filtering in HoS

2018-02-07 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356047#comment-16356047
 ] 

Sahil Takiar commented on HIVE-18340:
-

Some high level questions:

* Hive-on-Tez has an implementation of {{DynamicValueRegistry}} that uses 
some special Tez APIs such as {{ProcessorContext#waitForAllInputsReady}}; how 
are we simulating this in HoS?
* It would be nice to have some qtests to help visualize what the explain plan 
with RF would look like



[jira] [Commented] (HIVE-18340) Dynamic Min-Max/BloomFilter runtime-filtering in HoS

2018-02-06 Thread Ke Jia (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354871#comment-16354871
 ] 

Ke Jia commented on HIVE-18340:
---

[~stakiar] :

>Have we done any performance analysis for this feature?

We benchmarked this feature on TPC-DS in HoT and saw improvements of +20% in 
query82/88 and +10% in query34/90.

The design doc and HIVE-18340.1.patch are the initial implementation for HoS. 
Could you help review the design doc and HIVE-18340.1.patch? Thanks for your 
help!



[jira] [Commented] (HIVE-18340) Dynamic Min-Max/BloomFilter runtime-filtering in HoS

2018-02-06 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354529#comment-16354529
 ] 

Sahil Takiar commented on HIVE-18340:
-

Have we done any performance analysis for this feature?
