[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...

2018-10-30 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22778#discussion_r229362181
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: 
SessionCatalog)
 // "Extract PythonUDF From JoinCondition".
 Batch("Check Cartesian Products", Once,
   CheckCartesianProducts) :+
-Batch("RewriteSubquery", Once,
+Batch("Rewrite Subquery", Once,
--- End diff --

@gatorsmile Sure Sean.. Let me give it a try.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...

2018-10-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22778#discussion_r229361802
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: 
SessionCatalog)
 // "Extract PythonUDF From JoinCondition".
 Batch("Check Cartesian Products", Once,
   CheckCartesianProducts) :+
-Batch("RewriteSubquery", Once,
+Batch("Rewrite Subquery", Once,
--- End diff --

Sure. That sounds also good to me. @dilipbiswal Could you take the PR 
https://github.com/apache/spark/pull/17520 over?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...

2018-10-30 Thread maryannxue
Github user maryannxue commented on a diff in the pull request:

https://github.com/apache/spark/pull/22778#discussion_r229356678
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: 
SessionCatalog)
 // "Extract PythonUDF From JoinCondition".
 Batch("Check Cartesian Products", Once,
   CheckCartesianProducts) :+
-Batch("RewriteSubquery", Once,
+Batch("Rewrite Subquery", Once,
--- End diff --

@gatorsmile I think @dilipbiswal's suggestion is the right way to go.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...

2018-10-29 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22778#discussion_r229181821
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: 
SessionCatalog)
 // "Extract PythonUDF From JoinCondition".
 Batch("Check Cartesian Products", Once,
   CheckCartesianProducts) :+
-Batch("RewriteSubquery", Once,
+Batch("Rewrite Subquery", Once,
--- End diff --

@gatorsmile Do you think its a good time to revisit Natt's PR to convert 
subquery expressions to Joins early in the optimization process ?  Perhaps then 
we can take advantage of all the subsequent rules firing after the subquery 
rewrite ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...

2018-10-29 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22778#discussion_r229178084
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: 
SessionCatalog)
 // "Extract PythonUDF From JoinCondition".
 Batch("Check Cartesian Products", Once,
   CheckCartesianProducts) :+
-Batch("RewriteSubquery", Once,
+Batch("Rewrite Subquery", Once,
--- End diff --

I do not have a good answer for this PR. Ideally, we should run the whole 
batch `operatorOptimizationBatch`. However, running the rules could be very 
time consuming. I would suggest to add a new parameter for introducing the time 
bound limit for each batch.

cc @maryannxue  WDYT?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...

2018-10-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22778#discussion_r227261253
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -171,9 +171,11 @@ abstract class Optimizer(sessionCatalog: 
SessionCatalog)
 // "Extract PythonUDF From JoinCondition".
 Batch("Check Cartesian Products", Once,
   CheckCartesianProducts) :+
-Batch("RewriteSubquery", Once,
+Batch("Rewrite Subquery", Once,
   RewritePredicateSubquery,
   ColumnPruning,
+  InferFiltersFromConstraints,
+  PushDownPredicate,
--- End diff --

looks good, cc @gatorsmile @maryannxue 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...

2018-10-22 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22778#discussion_r227236021
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteSubquerySuite.scala
 ---
@@ -33,23 +34,44 @@ class RewriteSubquerySuite extends PlanTest {
   Batch("Rewrite Subquery", FixedPoint(1),
 RewritePredicateSubquery,
 ColumnPruning,
+InferFiltersFromConstraints,
+PushDownPredicate,
 CollapseProject,
 RemoveRedundantProject) :: Nil
   }
 
   test("Column pruning after rewriting predicate subquery") {
-val relation = LocalRelation('a.int, 'b.int)
-val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int)
+withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "false") {
+  val relation = LocalRelation('a.int, 'b.int)
+  val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int)
 
-val query = 
relation.where('a.in(ListQuery(relInSubquery.select('x.select('a)
+  val query = 
relation.where('a.in(ListQuery(relInSubquery.select('x.select('a)
 
-val optimized = Optimize.execute(query.analyze)
-val correctAnswer = relation
-  .select('a)
-  .join(relInSubquery.select('x), LeftSemi, Some('a === 'x))
-  .analyze
+  val optimized = Optimize.execute(query.analyze)
+  val correctAnswer = relation
+.select('a)
+.join(relInSubquery.select('x), LeftSemi, Some('a === 'x))
+.analyze
 
-comparePlans(optimized, correctAnswer)
+  comparePlans(optimized, correctAnswer)
+}
+  }
+
+  test("Infer filters and push down predicate after rewriting predicate 
subquery") {
--- End diff --

How about refactor these test to:
```scala
  val relation = LocalRelation('a.int, 'b.int)
  val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int)

  test("Column pruning") {
withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "false") {
  val query = 
relation.where('a.in(ListQuery(relInSubquery.select('x.select('a)

  val optimized = Optimize.execute(query.analyze)
  val correctAnswer = relation
.select('a)
.join(relInSubquery.select('x), LeftSemi, Some('a === 'x))
.analyze

  comparePlans(optimized, correctAnswer)
}
  }

  test("Column pruning, infer filters and push down predicate") {
withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "true") {
  val query = 
relation.where('a.in(ListQuery(relInSubquery.select('x.select('a)

  val optimized = Optimize.execute(query.analyze)
  val correctAnswer = relation
.where(IsNotNull('a)).select('a)
.join(relInSubquery.where(IsNotNull('x)).select('x), LeftSemi, 
Some('a === 'x))
.analyze

  comparePlans(optimized, correctAnswer)
}
  }
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...

2018-10-22 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22778#discussion_r227215135
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteSubquerySuite.scala
 ---
@@ -33,23 +34,44 @@ class RewriteSubquerySuite extends PlanTest {
   Batch("Rewrite Subquery", FixedPoint(1),
 RewritePredicateSubquery,
 ColumnPruning,
+InferFiltersFromConstraints,
+PushDownPredicate,
 CollapseProject,
 RemoveRedundantProject) :: Nil
   }
 
   test("Column pruning after rewriting predicate subquery") {
-val relation = LocalRelation('a.int, 'b.int)
-val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int)
+withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "false") {
+  val relation = LocalRelation('a.int, 'b.int)
+  val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int)
 
-val query = 
relation.where('a.in(ListQuery(relInSubquery.select('x.select('a)
+  val query = 
relation.where('a.in(ListQuery(relInSubquery.select('x.select('a)
 
-val optimized = Optimize.execute(query.analyze)
-val correctAnswer = relation
-  .select('a)
-  .join(relInSubquery.select('x), LeftSemi, Some('a === 'x))
-  .analyze
+  val optimized = Optimize.execute(query.analyze)
+  val correctAnswer = relation
+.select('a)
+.join(relInSubquery.select('x), LeftSemi, Some('a === 'x))
+.analyze
 
-comparePlans(optimized, correctAnswer)
+  comparePlans(optimized, correctAnswer)
+}
+  }
+
+  test("Infer filters and push down predicate after rewriting predicate 
subquery") {
--- End diff --

How about making the test title simple, then leaving comments about what's 
tested clearly here?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...

2018-10-22 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22778#discussion_r227214593
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteSubquerySuite.scala
 ---
@@ -33,23 +34,44 @@ class RewriteSubquerySuite extends PlanTest {
   Batch("Rewrite Subquery", FixedPoint(1),
 RewritePredicateSubquery,
 ColumnPruning,
+InferFiltersFromConstraints,
+PushDownPredicate,
 CollapseProject,
 RemoveRedundantProject) :: Nil
   }
 
   test("Column pruning after rewriting predicate subquery") {
-val relation = LocalRelation('a.int, 'b.int)
-val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int)
+withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "false") {
+  val relation = LocalRelation('a.int, 'b.int)
+  val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int)
 
-val query = 
relation.where('a.in(ListQuery(relInSubquery.select('x.select('a)
+  val query = 
relation.where('a.in(ListQuery(relInSubquery.select('x.select('a)
 
-val optimized = Optimize.execute(query.analyze)
-val correctAnswer = relation
-  .select('a)
-  .join(relInSubquery.select('x), LeftSemi, Some('a === 'x))
-  .analyze
+  val optimized = Optimize.execute(query.analyze)
+  val correctAnswer = relation
+.select('a)
+.join(relInSubquery.select('x), LeftSemi, Some('a === 'x))
+.analyze
 
-comparePlans(optimized, correctAnswer)
+  comparePlans(optimized, correctAnswer)
+}
+  }
+
+  test("Infer filters and push down predicate after rewriting predicate 
subquery") {
--- End diff --

Need the column pruning in the test title?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...

2018-10-22 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22778#discussion_r227214404
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteSubquerySuite.scala
 ---
@@ -33,23 +34,44 @@ class RewriteSubquerySuite extends PlanTest {
   Batch("Rewrite Subquery", FixedPoint(1),
 RewritePredicateSubquery,
 ColumnPruning,
+InferFiltersFromConstraints,
+PushDownPredicate,
 CollapseProject,
 RemoveRedundantProject) :: Nil
   }
 
   test("Column pruning after rewriting predicate subquery") {
-val relation = LocalRelation('a.int, 'b.int)
-val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int)
+withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "false") {
--- End diff --

Ah, I see. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...

2018-10-22 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22778#discussion_r227208719
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteSubquerySuite.scala
 ---
@@ -33,23 +34,44 @@ class RewriteSubquerySuite extends PlanTest {
   Batch("Rewrite Subquery", FixedPoint(1),
 RewritePredicateSubquery,
 ColumnPruning,
+InferFiltersFromConstraints,
+PushDownPredicate,
 CollapseProject,
 RemoveRedundantProject) :: Nil
   }
 
   test("Column pruning after rewriting predicate subquery") {
-val relation = LocalRelation('a.int, 'b.int)
-val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int)
+withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "false") {
--- End diff --

Yes, `spark.sql.constraintPropagation.enabled=false` to test 
`ColumnPruning`.
`spark.sql.constraintPropagation.enabled=true` to test `ColumnPruning`, 
`InferFiltersFromConstraints` and `PushDownPredicate`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...

2018-10-22 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22778#discussion_r227205054
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteSubquerySuite.scala
 ---
@@ -33,23 +34,44 @@ class RewriteSubquerySuite extends PlanTest {
   Batch("Rewrite Subquery", FixedPoint(1),
 RewritePredicateSubquery,
 ColumnPruning,
+InferFiltersFromConstraints,
+PushDownPredicate,
 CollapseProject,
 RemoveRedundantProject) :: Nil
   }
 
   test("Column pruning after rewriting predicate subquery") {
-val relation = LocalRelation('a.int, 'b.int)
-val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int)
+withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "false") {
--- End diff --

We need to modify this existing test?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...

2018-10-19 Thread wangyum
GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/22778

[SPARK-25784][SQL] Infer filters from constraints after rewriting predicate 
subquery

## What changes were proposed in this pull request?

Infer filters from constraints after rewriting predicate subquery.

## How was this patch tested?
unit tests and benchmark tests
```scala
withTempView("t1", "t2") {
  withTempDir { dir =>
spark.range(300)
  .selectExpr("cast(null as int) as c1", "if(id % 2 = 0, null, id) as 
c2", "id as c3")
  .coalesce(1)
  .orderBy("c2")
  .write
  .mode("overwrite")
  .option("parquet.block.size", 10485760)
  .parquet(dir.getCanonicalPath)

spark.read.parquet(dir.getCanonicalPath).createTempView("t1")
spark.read.parquet(dir.getCanonicalPath).createTempView("t2")

Seq("c1", "c2", "c3").foreach { column =>
  val benchmark = new Benchmark(s"join key $column", 10)
  Seq(false, true).foreach { inferFilters =>
benchmark.addCase(s"Is infer filters $inferFilters", numIters = 5) 
{ _ =>
  withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> 
inferFilters.toString) {
sql(s"select t1.* from t1 where t1.$column in (select $column 
from t2)").count()
  }
}
  }
  benchmark.run()
}
  }
}
```

```
ava HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
join key c1: Best/Avg Time(ms)Rate(M/s)   
Per Row(ns)   Relative


Is infer filters false2005 / 2163  0.0   
200481431.0   1.0X
Is infer filters true  190 /  207  0.0
18962935.7  10.6X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
join key c2: Best/Avg Time(ms)Rate(M/s)   
Per Row(ns)   Relative


Is infer filters false2368 / 2498  0.0   
236803743.1   1.0X
Is infer filters true 1234 / 1268  0.0   
123443912.3   1.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
join key c3: Best/Avg Time(ms)Rate(M/s)   
Per Row(ns)   Relative


Is infer filters false2754 / 2907  0.0   
275376009.7   1.0X
Is infer filters true 2237 / 2255  0.0   
223739457.8   1.2X
```


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-25784

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22778.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22778


commit c8d1b91b93e7ad05ca0bd17984fad1c30062d504
Author: Yuming Wang 
Date:   2018-10-20T01:39:51Z

Infer filters from constraints after rewriting predicate subquery




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org