[jira] [Updated] (SPARK-33910) Simplify/Optimize conditional expressions

2020-12-28 Thread Yuming Wang (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-33910:

Description: 
1. Push down foldable expressions through CaseWhen/If
2. Simplify conditionals in predicates
3. Push UnaryExpressions into If/CaseWhen branches
4. Simplify CaseWhen if elseValue is None
5. Simplify CaseWhen clauses whose branch values are (true, false) or (false, true)
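Rules 1 and 3 can be illustrated with a small tree rewrite (an illustrative sketch, not Spark's Catalyst implementation): applying a unary expression to an If/CaseWhen is rewritten as an If/CaseWhen over the expression applied to each branch, so branches with foldable values can be constant-folded independently.

```python
def push_unary_into_if(fn, cond, true_val, false_val):
    """Rewrite fn(IF(cond, true_val, false_val)) as
    IF(cond, fn(true_val), fn(false_val)); when the branch values are
    foldable, both applications fold to constants at optimization time."""
    return (cond, fn(true_val), fn(false_val))

# upper(IF(id = 1, 'b', 'c')) becomes IF(id = 1, 'B', 'C'):
assert push_unary_into_if(str.upper, "id = 1", "b", "c") == ("id = 1", "B", "C")
```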

Common use cases are:
{code:sql}
create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);

create temp view v1 as
select 'a' as event_type, * from t1   
union all 
select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2
{code}

1. Avoid reading the whole table.
{noformat}
explain select * from v1 where event_type = 'a';
Before simplify:
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#7, id#9L]
:  +- *(1) ColumnarToRow
:     +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
+- *(2) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, id#10L]
   +- *(2) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = a)
      +- *(2) ColumnarToRow
         +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: [(CASE WHEN (id#10L = 1) THEN b ELSE c END = a)], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>

After simplify:
== Physical Plan ==
*(1) Project [a AS event_type#8, id#4L]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], Format: Parquet
{noformat}
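The rewrite behind this example can be sketched outside Spark (a toy model, not the actual optimizer code): comparing each CASE branch value to the literal 'a' folds to false at compile time, so the whole predicate on the t2 side becomes false and that side of the union can be pruned.

```python
def simplify_case_equals(branches, else_value, literal):
    """Rewrite `CASE WHEN c1 THEN v1 ... ELSE e END = literal` by
    comparing each branch value to the literal at compile time.
    `branches` is a list of (condition, value) pairs."""
    new_branches = [(cond, value == literal) for cond, value in branches]
    new_else = (else_value == literal) if else_value is not None else None
    # If every branch (and the else) folds to False, the whole predicate
    # is False and the enclosing plan node can be pruned.
    if all(v is False for _, v in new_branches) and new_else is False:
        return False
    return new_branches, new_else

# event_type = CASE WHEN id = 1 THEN 'b' ELSE 'c' END, filtered on = 'a':
assert simplify_case_equals([("id = 1", "b")], "c", "a") is False
```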

2. Push down the conditional expressions to the data source.
{noformat}
explain select * from v1 where event_type = 'b';
Before simplify:
== Physical Plan ==
Union
:- LocalTableScan <empty>, [event_type#7, id#9L]
+- *(1) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, id#10L]
   +- *(1) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = b)
      +- *(1) ColumnarToRow
         +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: [(CASE WHEN (id#10L = 1) THEN b ELSE c END = b)], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>

After simplify:
== Physical Plan ==
*(1) Project [CASE WHEN (id#5L = 1) THEN b ELSE c END AS event_type#8, id#5L AS id#4L]
+- *(1) Filter (isnotnull(id#5L) AND (id#5L = 1))
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t2[id#5L] Batched: true, DataFilters: [isnotnull(id#5L), (id#5L = 1)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct<id:bigint>
{noformat}
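A hypothetical sketch of this predicate rewrite (not Spark's actual rule): when the compared literal matches exactly one branch value and differs from the else value, `CASE WHEN cond THEN v ELSE e END = lit` reduces to `cond` itself, which is a plain column predicate eligible for data-source pushdown.

```python
def case_equals_to_condition(cond, then_value, else_value, literal):
    """Reduce `CASE WHEN cond THEN then_value ELSE else_value END = literal`
    to a plain predicate when both comparisons fold at compile time
    (SQL NULL semantics ignored for brevity)."""
    if then_value == literal and else_value != literal:
        return cond            # only the THEN branch can match
    if then_value != literal and else_value == literal:
        return ("NOT", cond)   # only the ELSE branch can match
    if then_value == literal and else_value == literal:
        return True            # always matches
    return False               # never matches

# event_type = 'b' reduces to the branch condition id = 1, which Spark
# can then push to Parquet as EqualTo(id, 1):
assert case_equals_to_condition("id = 1", "b", "c", "b") == "id = 1"
# event_type = 'a' folds to False (example 1 above):
assert case_equals_to_condition("id = 1", "b", "c", "a") is False
```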

3. Reduce the amount of computation.
{noformat}
explain select event_type = 'e' from v1;
Before simplify:
== Physical Plan ==
Union
:- *(1) Project [false AS (event_type = e)#37]
:  +- *(1) ColumnarToRow
:     +- FileScan parquet default.t1[] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
+- *(2) Project [(CASE WHEN (id#21L = 1) THEN b ELSE c END = e) AS (event_type = e)#38]
   +- *(2) ColumnarToRow
      +- FileScan parquet default.t2[id#21L] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>

After simplify:
== Physical Plan ==
Union
:- *(1) Project [false AS (event_type = e)#10]
:  +- *(1) ColumnarToRow
:     +- FileScan parquet default.t1[] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
+- *(2) Project [false AS (event_type = e)#14]
   +- *(2) ColumnarToRow
      +- FileScan parquet default.t2[] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
{noformat}
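The third case can be sketched as pure constant folding (again an illustration, not Spark code): when every branch value and the else value give the same compile-time comparison result, the whole `CASE ... = literal` expression collapses to that constant, so the underlying column no longer needs to be read at all.

```python
def fold_case_equals(branch_values, else_value, literal):
    """Fold `CASE ... END = literal` to a constant when every branch value
    and the else value yield the same compile-time comparison result."""
    results = {v == literal for v in branch_values}
    results.add(else_value == literal)
    if len(results) == 1:
        return results.pop()   # one constant for every path: fold it
    return None                # branches disagree: cannot fold

# Neither 'b' nor 'c' equals 'e', so the projection folds to false and
# the id column disappears from the Parquet scan (ReadSchema: struct<>):
assert fold_case_equals(["b"], "c", "e") is False
assert fold_case_equals(["b"], "c", "b") is None
```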



>  Simplify/Optimize conditional expressions
> --
>
> Key: SPARK-33910
> URL: https://issues.apache.org/jira/browse/SPARK-33910
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org