[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22778#discussion_r229362181 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: SessionCatalog) // "Extract PythonUDF From JoinCondition". Batch("Check Cartesian Products", Once, CheckCartesianProducts) :+ -Batch("RewriteSubquery", Once, +Batch("Rewrite Subquery", Once, --- End diff -- @gatorsmile Sure Sean.. Let me give it a try. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22778#discussion_r229361802 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: SessionCatalog) // "Extract PythonUDF From JoinCondition". Batch("Check Cartesian Products", Once, CheckCartesianProducts) :+ -Batch("RewriteSubquery", Once, +Batch("Rewrite Subquery", Once, --- End diff -- Sure. That sounds also good to me. @dilipbiswal Could you take the PR https://github.com/apache/spark/pull/17520 over? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...
Github user maryannxue commented on a diff in the pull request: https://github.com/apache/spark/pull/22778#discussion_r229356678 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: SessionCatalog) // "Extract PythonUDF From JoinCondition". Batch("Check Cartesian Products", Once, CheckCartesianProducts) :+ -Batch("RewriteSubquery", Once, +Batch("Rewrite Subquery", Once, --- End diff -- @gatorsmile I think @dilipbiswal's suggestion is the right way to go. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22778#discussion_r229181821 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: SessionCatalog) // "Extract PythonUDF From JoinCondition". Batch("Check Cartesian Products", Once, CheckCartesianProducts) :+ -Batch("RewriteSubquery", Once, +Batch("Rewrite Subquery", Once, --- End diff -- @gatorsmile Do you think its a good time to revisit Natt's PR to convert subquery expressions to Joins early in the optimization process ? Perhaps then we can take advantage of all the subsequent rules firing after the subquery rewrite ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22778#discussion_r229178084 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: SessionCatalog) // "Extract PythonUDF From JoinCondition". Batch("Check Cartesian Products", Once, CheckCartesianProducts) :+ -Batch("RewriteSubquery", Once, +Batch("Rewrite Subquery", Once, --- End diff -- I do not have a good answer for this PR. Ideally, we should run the whole batch `operatorOptimizationBatch`. However, running the rules could be very time consuming. I would suggest to add a new parameter for introducing the time bound limit for each batch. cc @maryannxue WDYT? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22778#discussion_r227261253 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -171,9 +171,11 @@ abstract class Optimizer(sessionCatalog: SessionCatalog) // "Extract PythonUDF From JoinCondition". Batch("Check Cartesian Products", Once, CheckCartesianProducts) :+ -Batch("RewriteSubquery", Once, +Batch("Rewrite Subquery", Once, RewritePredicateSubquery, ColumnPruning, + InferFiltersFromConstraints, + PushDownPredicate, --- End diff -- looks good, cc @gatorsmile @maryannxue --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22778#discussion_r227236021 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteSubquerySuite.scala --- @@ -33,23 +34,44 @@ class RewriteSubquerySuite extends PlanTest { Batch("Rewrite Subquery", FixedPoint(1), RewritePredicateSubquery, ColumnPruning, +InferFiltersFromConstraints, +PushDownPredicate, CollapseProject, RemoveRedundantProject) :: Nil } test("Column pruning after rewriting predicate subquery") { -val relation = LocalRelation('a.int, 'b.int) -val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int) +withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "false") { + val relation = LocalRelation('a.int, 'b.int) + val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int) -val query = relation.where('a.in(ListQuery(relInSubquery.select('x.select('a) + val query = relation.where('a.in(ListQuery(relInSubquery.select('x.select('a) -val optimized = Optimize.execute(query.analyze) -val correctAnswer = relation - .select('a) - .join(relInSubquery.select('x), LeftSemi, Some('a === 'x)) - .analyze + val optimized = Optimize.execute(query.analyze) + val correctAnswer = relation +.select('a) +.join(relInSubquery.select('x), LeftSemi, Some('a === 'x)) +.analyze -comparePlans(optimized, correctAnswer) + comparePlans(optimized, correctAnswer) +} + } + + test("Infer filters and push down predicate after rewriting predicate subquery") { --- End diff -- How about refactor these test to: ```scala val relation = LocalRelation('a.int, 'b.int) val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int) test("Column pruning") { withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "false") { val query = relation.where('a.in(ListQuery(relInSubquery.select('x.select('a) val optimized = Optimize.execute(query.analyze) val correctAnswer = relation .select('a) .join(relInSubquery.select('x), LeftSemi, Some('a === 'x)) .analyze comparePlans(optimized, correctAnswer) } } test("Column pruning, infer filters and push down predicate") { withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "true") { val query = relation.where('a.in(ListQuery(relInSubquery.select('x.select('a) val optimized = Optimize.execute(query.analyze) val correctAnswer = relation .where(IsNotNull('a)).select('a) .join(relInSubquery.where(IsNotNull('x)).select('x), LeftSemi, Some('a === 'x)) .analyze comparePlans(optimized, correctAnswer) } } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22778#discussion_r227215135 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteSubquerySuite.scala --- @@ -33,23 +34,44 @@ class RewriteSubquerySuite extends PlanTest { Batch("Rewrite Subquery", FixedPoint(1), RewritePredicateSubquery, ColumnPruning, +InferFiltersFromConstraints, +PushDownPredicate, CollapseProject, RemoveRedundantProject) :: Nil } test("Column pruning after rewriting predicate subquery") { -val relation = LocalRelation('a.int, 'b.int) -val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int) +withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "false") { + val relation = LocalRelation('a.int, 'b.int) + val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int) -val query = relation.where('a.in(ListQuery(relInSubquery.select('x.select('a) + val query = relation.where('a.in(ListQuery(relInSubquery.select('x.select('a) -val optimized = Optimize.execute(query.analyze) -val correctAnswer = relation - .select('a) - .join(relInSubquery.select('x), LeftSemi, Some('a === 'x)) - .analyze + val optimized = Optimize.execute(query.analyze) + val correctAnswer = relation +.select('a) +.join(relInSubquery.select('x), LeftSemi, Some('a === 'x)) +.analyze -comparePlans(optimized, correctAnswer) + comparePlans(optimized, correctAnswer) +} + } + + test("Infer filters and push down predicate after rewriting predicate subquery") { --- End diff -- How about making the test title simple, then leaving comments about what's tested clearly here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22778#discussion_r227214593 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteSubquerySuite.scala --- @@ -33,23 +34,44 @@ class RewriteSubquerySuite extends PlanTest { Batch("Rewrite Subquery", FixedPoint(1), RewritePredicateSubquery, ColumnPruning, +InferFiltersFromConstraints, +PushDownPredicate, CollapseProject, RemoveRedundantProject) :: Nil } test("Column pruning after rewriting predicate subquery") { -val relation = LocalRelation('a.int, 'b.int) -val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int) +withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "false") { + val relation = LocalRelation('a.int, 'b.int) + val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int) -val query = relation.where('a.in(ListQuery(relInSubquery.select('x.select('a) + val query = relation.where('a.in(ListQuery(relInSubquery.select('x.select('a) -val optimized = Optimize.execute(query.analyze) -val correctAnswer = relation - .select('a) - .join(relInSubquery.select('x), LeftSemi, Some('a === 'x)) - .analyze + val optimized = Optimize.execute(query.analyze) + val correctAnswer = relation +.select('a) +.join(relInSubquery.select('x), LeftSemi, Some('a === 'x)) +.analyze -comparePlans(optimized, correctAnswer) + comparePlans(optimized, correctAnswer) +} + } + + test("Infer filters and push down predicate after rewriting predicate subquery") { --- End diff -- Need the column pruning in the test title? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22778#discussion_r227214404 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteSubquerySuite.scala --- @@ -33,23 +34,44 @@ class RewriteSubquerySuite extends PlanTest { Batch("Rewrite Subquery", FixedPoint(1), RewritePredicateSubquery, ColumnPruning, +InferFiltersFromConstraints, +PushDownPredicate, CollapseProject, RemoveRedundantProject) :: Nil } test("Column pruning after rewriting predicate subquery") { -val relation = LocalRelation('a.int, 'b.int) -val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int) +withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "false") { --- End diff -- Ah, I see. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22778#discussion_r227208719 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteSubquerySuite.scala --- @@ -33,23 +34,44 @@ class RewriteSubquerySuite extends PlanTest { Batch("Rewrite Subquery", FixedPoint(1), RewritePredicateSubquery, ColumnPruning, +InferFiltersFromConstraints, +PushDownPredicate, CollapseProject, RemoveRedundantProject) :: Nil } test("Column pruning after rewriting predicate subquery") { -val relation = LocalRelation('a.int, 'b.int) -val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int) +withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "false") { --- End diff -- Yes, `spark.sql.constraintPropagation.enabled=false` to test `ColumnPruning`. `spark.sql.constraintPropagation.enabled=true` to test `ColumnPruning`, `InferFiltersFromConstraints` and `PushDownPredicate`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22778#discussion_r227205054 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteSubquerySuite.scala --- @@ -33,23 +34,44 @@ class RewriteSubquerySuite extends PlanTest { Batch("Rewrite Subquery", FixedPoint(1), RewritePredicateSubquery, ColumnPruning, +InferFiltersFromConstraints, +PushDownPredicate, CollapseProject, RemoveRedundantProject) :: Nil } test("Column pruning after rewriting predicate subquery") { -val relation = LocalRelation('a.int, 'b.int) -val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int) +withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "false") { --- End diff -- We need to modify this existing test? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22778 [SPARK-25784][SQL] Infer filters from constraints after rewriting predicate subquery ## What changes were proposed in this pull request? Infer filters from constraints after rewriting predicate subquery. ## How was this patch tested? unit tests and benchmark tests ```scala withTempView("t1", "t2") { withTempDir { dir => spark.range(300) .selectExpr("cast(null as int) as c1", "if(id % 2 = 0, null, id) as c2", "id as c3") .coalesce(1) .orderBy("c2") .write .mode("overwrite") .option("parquet.block.size", 10485760) .parquet(dir.getCanonicalPath) spark.read.parquet(dir.getCanonicalPath).createTempView("t1") spark.read.parquet(dir.getCanonicalPath).createTempView("t2") Seq("c1", "c2", "c3").foreach { column => val benchmark = new Benchmark(s"join key $column", 10) Seq(false, true).foreach { inferFilters => benchmark.addCase(s"Is infer filters $inferFilters", numIters = 5) { _ => withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> inferFilters.toString) { sql(s"select t1.* from t1 where t1.$column in (select $column from t2)").count() } } } benchmark.run() } } } ``` ``` ava HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz join key c1: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative Is infer filters false2005 / 2163 0.0 200481431.0 1.0X Is infer filters true 190 / 207 0.0 18962935.7 10.6X Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz join key c2: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative Is infer filters false2368 / 2498 0.0 236803743.1 1.0X Is infer filters true 1234 / 1268 0.0 123443912.3 1.9X Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz join key c3: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative Is infer filters false2754 / 2907 0.0 275376009.7 1.0X Is infer filters true 2237 / 2255 0.0 223739457.8 1.2X ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-25784 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22778.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22778 commit c8d1b91b93e7ad05ca0bd17984fad1c30062d504 Author: Yuming Wang Date: 2018-10-20T01:39:51Z Infer filters from constraints after rewriting predicate subquery --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org