[GitHub] spark pull request #18882: [SPARK-21652][SQL] Filter out meaningless constraints inferred in inferAdditionalConstraints

2017-08-09 Thread maropu
GitHub user maropu closed the pull request at:

https://github.com/apache/spark/pull/18882





[GitHub] spark pull request #18882: [SPARK-21652][SQL] Filter out meaningless constraints inferred in inferAdditionalConstraints

2017-08-08 Thread maropu
GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/18882

[SPARK-21652][SQL] Filter out meaningless constraints inferred in inferAdditionalConstraints

## What changes were proposed in this pull request?
This PR adds code to filter out meaningless constraints inferred in `inferAdditionalConstraints` (e.g., given the constraints `a = 1`, `b = 1`, `a = c`, and `b = c`, we infer `a = b`, a predicate that is trivially true). Such constraints can add `Optimizer` overhead, for example:
```
scala> Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
scala> Seq(1, 2).toDF("col").write.saveAsTable("t2")
scala> spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col").explain(true)
```
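For reference, the redundancy can be illustrated outside Catalyst as well. Below is a minimal, self-contained sketch in plain Scala (the `Constraint` ADT, `isTrivial`, and all other names here are hypothetical illustrations, not the Catalyst API) of the filtering idea: an inferred attribute-equality is meaningless when both sides are already constrained to the same literal.
```
// Hypothetical model of constraints; this is NOT Catalyst code, just a
// sketch of the filtering idea under the names defined here.
sealed trait Constraint
case class EqualLit(attr: String, value: Int) extends Constraint     // e.g. a = 1
case class EqualAttr(left: String, right: String) extends Constraint // e.g. a = c

// An inferred `left = right` is trivially true if both attributes are
// already pinned to the same literal by the known constraints.
def isTrivial(inferred: EqualAttr, known: Set[Constraint]): Boolean = {
  val lits = known.collect { case EqualLit(a, v) => a -> v }.toMap
  (lits.get(inferred.left), lits.get(inferred.right)) match {
    case (Some(x), Some(y)) => x == y
    case _                  => false
  }
}

val known: Set[Constraint] =
  Set(EqualLit("a", 1), EqualLit("b", 1), EqualAttr("a", "c"), EqualAttr("b", "c"))

// `a = b`, inferred by transitivity through `c`, carries no new information:
assert(isTrivial(EqualAttr("a", "b"), known))
```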

In this query, `InferFiltersFromConstraints` infers a new constraint '(col2#33 = col1#32)' and appends it to the join condition; `PushPredicateThroughJoin` then pushes it down, `ConstantPropagation` replaces '(col2#33 = col1#32)' with '1 = 1' based on the other propagated constraints, `ConstantFolding` replaces '1 = 1' with 'true', and `BooleanSimplification` finally removes the predicate. However, `InferFiltersFromConstraints` infers '(col2#33 = col1#32)' again on the next iteration, and the cycle repeats until the optimizer's iteration limit is reached.
See the rule trace below for more details.

```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = col#34)))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
Before:
Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = col#34)))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (col2#33 = col1#32)
:  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:     +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (col2#33 = col1#32)
:  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:     +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (1 = 1))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
```
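
To reproduce a trace like the one above in a `spark-shell` session (with `t1` and `t2` created as in the example), raise the log level before re-running the explain; the assumption here is that this Spark version emits the rule-by-rule plan changes at trace level. Without the fix, the cycle repeats until the optimizer's fixed-point limit, `spark.sql.optimizer.maxIterations`, is hit.
```
// Assumes a spark-shell where tables t1 and t2 were created as shown earlier.
sc.setLogLevel("TRACE")  // assumed: plan-change messages are logged at trace level
spark.sql("""
  SELECT * FROM t1, t2
  WHERE t1.col1 = 1 AND 1 = t1.col2
    AND t1.col1 = t2.col AND t1.col2 = t2.col
""").explain(true)
```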