[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21052#discussion_r183826745

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -392,6 +392,10 @@ case class FilterEstimation(plan: Filter) extends Logging {
           val dataType = attr.dataType
           var newNdv = ndv
    +      if (ndv.toDouble == 0 || colStat.min.isEmpty || colStat.max.isEmpty) {
    --- End diff --

    Yeah, we need to correct it in the next PR

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21052#discussion_r183685527

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -392,6 +392,10 @@ case class FilterEstimation(plan: Filter) extends Logging {
           val dataType = attr.dataType
           var newNdv = ndv
    +      if (ndv.toDouble == 0 || colStat.min.isEmpty || colStat.max.isEmpty) {
    --- End diff --

    why `colStat.min.isEmpty || colStat.max.isEmpty` means empty output? string type always has no max/min

---
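cloud-fan's objection can be illustrated without Spark: the analyzer stores no min/max statistics for string columns, so a guard keyed on `min.isEmpty || max.isEmpty` would report an empty result for any string column, however many rows it has. A minimal plain-Scala sketch; `ColStat`, `naiveSurelyEmpty`, and `saferSurelyEmpty` are hypothetical stand-ins for illustration, not Spark's actual `ColumnStat` API:

```scala
// Hypothetical stand-in for a column-statistics record; string columns
// carry a distinct count but no min/max bounds.
case class ColStat(distinctCount: BigInt, min: Option[Any], max: Option[Any])

// The guard as written in the diff: zero ndv OR missing min/max => "empty output".
def naiveSurelyEmpty(s: ColStat): Boolean =
  s.distinctCount.toDouble == 0 || s.min.isEmpty || s.max.isEmpty

// Safer reading: only ndv == 0 proves emptiness; absent min/max merely means
// the bounds are unknown, which is always the case for strings.
def saferSurelyEmpty(s: ColStat): Boolean =
  s.distinctCount.toDouble == 0

val stringCol = ColStat(distinctCount = BigInt(1000), min = None, max = None)
assert(naiveSurelyEmpty(stringCol))   // false positive: the column is not empty
assert(!saferSurelyEmpty(stringCol))  // ndv alone gives the right answer
```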
[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21052

---
[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...
Github user mshtelma commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21052#discussion_r183220650

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
    @@ -382,4 +382,32 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
           }
         }
       }
    +
    +  test("Simple queries must be working, if CBO is turned on") {
    +    withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
    +      withTable("TBL1", "TBL") {
    +        import org.apache.spark.sql.functions._
    +        val df = spark.range(1000L).select('id,
    +          'id * 2 as "FLD1",
    +          'id * 12 as "FLD2",
    +          lit("aaa") + 'id as "fld3")
    +        df.write
    +          .mode(SaveMode.Overwrite)
    +          .bucketBy(10, "id", "FLD1", "FLD2")
    +          .sortBy("id", "FLD1", "FLD2")
    +          .saveAsTable("TBL")
    +        sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
    +        sql("ANALYZE TABLE TBL COMPUTE STATISTICS FOR COLUMNS ID, FLD1, FLD2, FLD3")
    +        val df2 = spark.sql(
    +          """
    +           SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
    +           FROM tbl t1
    +           JOIN tbl t2 on t1.id=t2.id
    +           WHERE t1.fld3 IN (-123.23,321.23)
    +          """.stripMargin)
    +        df2.createTempView("TBL2")
    +        sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").explain()
    --- End diff --

    done

---
[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...
Github user mshtelma commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21052#discussion_r183220647

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
    @@ -382,4 +382,32 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
           }
         }
       }
    +
    +  test("Simple queries must be working, if CBO is turned on") {
    +    withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
    +      withTable("TBL1", "TBL") {
    +        import org.apache.spark.sql.functions._
    +        val df = spark.range(1000L).select('id,
    +          'id * 2 as "FLD1",
    +          'id * 12 as "FLD2",
    +          lit("aaa") + 'id as "fld3")
    +        df.write
    +          .mode(SaveMode.Overwrite)
    +          .bucketBy(10, "id", "FLD1", "FLD2")
    +          .sortBy("id", "FLD1", "FLD2")
    +          .saveAsTable("TBL")
    +        sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
    +        sql("ANALYZE TABLE TBL COMPUTE STATISTICS FOR COLUMNS ID, FLD1, FLD2, FLD3")
    +        val df2 = spark.sql(
    +          """
    +           SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
    +           FROM tbl t1
    +           JOIN tbl t2 on t1.id=t2.id
    +           WHERE t1.fld3 IN (-123.23,321.23)
    +          """.stripMargin)
    --- End diff --

    done

---
[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21052#discussion_r183219812

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
    @@ -382,4 +382,32 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
           }
         }
       }
    +
    +  test("Simple queries must be working, if CBO is turned on") {
    +    withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
    +      withTable("TBL1", "TBL") {
    +        import org.apache.spark.sql.functions._
    +        val df = spark.range(1000L).select('id,
    +          'id * 2 as "FLD1",
    +          'id * 12 as "FLD2",
    +          lit("aaa") + 'id as "fld3")
    +        df.write
    +          .mode(SaveMode.Overwrite)
    +          .bucketBy(10, "id", "FLD1", "FLD2")
    +          .sortBy("id", "FLD1", "FLD2")
    +          .saveAsTable("TBL")
    +        sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
    +        sql("ANALYZE TABLE TBL COMPUTE STATISTICS FOR COLUMNS ID, FLD1, FLD2, FLD3")
    +        val df2 = spark.sql(
    +          """
    +           SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
    +           FROM tbl t1
    +           JOIN tbl t2 on t1.id=t2.id
    +           WHERE t1.fld3 IN (-123.23,321.23)
    +          """.stripMargin)
    --- End diff --

    Nit:
    ```Scala
          """
            |SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
            |FROM tbl t1
            |JOIN tbl t2 on t1.id=t2.id
            |WHERE t1.fld3 IN (-123.23,321.23)
          """.stripMargin)
    ```

---
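The nit above is not purely cosmetic: `stripMargin` only removes leading whitespace up to and including a `|` margin character, so without the `|` prefixes the call is a no-op and the embedded SQL keeps its indentation. A small plain-Scala illustration (the literals are made up for the example):

```scala
// Without margin characters, stripMargin finds no '|' and leaves every
// line, including its indentation, untouched.
val noMargin =
  """
    SELECT 1
  """.stripMargin

// With a leading '|' on the content line, everything up to and including
// the margin character is removed, so the SQL is flush left.
val withMargin =
  """
    |SELECT 1
  """.stripMargin

assert(noMargin.contains("    SELECT 1"))   // indentation survived
assert(withMargin.contains("\nSELECT 1"))   // indentation stripped
```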
[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21052#discussion_r183219803

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
    @@ -382,4 +382,32 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
           }
         }
       }
    +
    +  test("Simple queries must be working, if CBO is turned on") {
    +    withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
    +      withTable("TBL1", "TBL") {
    +        import org.apache.spark.sql.functions._
    +        val df = spark.range(1000L).select('id,
    +          'id * 2 as "FLD1",
    +          'id * 12 as "FLD2",
    +          lit("aaa") + 'id as "fld3")
    +        df.write
    +          .mode(SaveMode.Overwrite)
    +          .bucketBy(10, "id", "FLD1", "FLD2")
    +          .sortBy("id", "FLD1", "FLD2")
    +          .saveAsTable("TBL")
    +        sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
    +        sql("ANALYZE TABLE TBL COMPUTE STATISTICS FOR COLUMNS ID, FLD1, FLD2, FLD3")
    +        val df2 = spark.sql(
    +          """
    +           SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
    +           FROM tbl t1
    +           JOIN tbl t2 on t1.id=t2.id
    +           WHERE t1.fld3 IN (-123.23,321.23)
    +          """.stripMargin)
    +        df2.createTempView("TBL2")
    +        sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").explain()
    --- End diff --

    Please do not use `explain()`. It will output the strings to the console. You can just do this:
    ```
    sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe')").queryExecution.executedPlan
    ```

---
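The suggestion works because merely constructing a `DataFrame` does not run the optimizer; both `explain()` and touching `queryExecution.executedPlan` force analysis, optimization (which is where `FilterEstimation.evaluateInSet` fires under CBO), and physical planning, but only the latter stays silent. A sketch of the difference, assuming a local `SparkSession` (the session setup and the toy query are not part of the original thread):

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session just for the sketch.
val spark = SparkSession.builder().master("local[1]").appName("plan-demo").getOrCreate()
val df = spark.range(10).filter("id IN (1, 2)")

// Noisy: builds the physical plan AND prints it to the console.
df.explain()

// Quiet: builds the same physical plan; nothing is written to stdout,
// but the optimizer (and thus any filter estimation) still runs.
val plan = df.queryExecution.executedPlan
assert(plan.toString.nonEmpty)

spark.stop()
```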
[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...
Github user mshtelma commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21052#discussion_r183206908

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
    @@ -382,4 +382,34 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
           }
         }
       }
    +
    +  test("Simple queries must be working, if CBO is turned on") {
    +    withSQLConf(("spark.sql.cbo.enabled", "true")) {
    +      withTable("TBL1", "TBL") {
    +        import org.apache.spark.sql.functions._
    +        val df = spark.range(1000L).select('id,
    +          'id * 2 as "FLD1",
    +          'id * 12 as "FLD2",
    +          lit("aaa") + 'id as "fld3")
    +        df.write
    +          .mode(SaveMode.Overwrite)
    +          .bucketBy(10, "id", "FLD1", "FLD2")
    +          .sortBy("id", "FLD1", "FLD2")
    +          .saveAsTable("TBL")
    +        spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
    --- End diff --

    done

---
[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...
Github user mshtelma commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21052#discussion_r183206916

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
    @@ -382,4 +382,34 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
           }
         }
       }
    +
    +  test("Simple queries must be working, if CBO is turned on") {
    +    withSQLConf(("spark.sql.cbo.enabled", "true")) {
    +      withTable("TBL1", "TBL") {
    +        import org.apache.spark.sql.functions._
    +        val df = spark.range(1000L).select('id,
    +          'id * 2 as "FLD1",
    +          'id * 12 as "FLD2",
    +          lit("aaa") + 'id as "fld3")
    +        df.write
    +          .mode(SaveMode.Overwrite)
    +          .bucketBy(10, "id", "FLD1", "FLD2")
    +          .sortBy("id", "FLD1", "FLD2")
    +          .saveAsTable("TBL")
    +        spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
    +        spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS FOR COLUMNS ID, FLD1, FLD2, FLD3")
    +        val df2 = spark.sql(
    +          """
    +           SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
    +           FROM tbl t1
    +           JOIN tbl t2 on t1.id=t2.id
    +           WHERE t1.fld3 IN (-123.23,321.23)
    +          """.stripMargin)
    +        df2.createTempView("TBL2")
    +        spark.sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").explain()
    +      }
    +    }
    +
    --- End diff --

    done

---
[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...
Github user mshtelma commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21052#discussion_r183206919

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
    @@ -382,4 +382,34 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
           }
         }
       }
    +
    +  test("Simple queries must be working, if CBO is turned on") {
    +    withSQLConf(("spark.sql.cbo.enabled", "true")) {
    --- End diff --

    done

---
[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...
Github user mshtelma commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21052#discussion_r183206913

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
    @@ -382,4 +382,34 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
           }
         }
       }
    +
    +  test("Simple queries must be working, if CBO is turned on") {
    +    withSQLConf(("spark.sql.cbo.enabled", "true")) {
    +      withTable("TBL1", "TBL") {
    +        import org.apache.spark.sql.functions._
    +        val df = spark.range(1000L).select('id,
    +          'id * 2 as "FLD1",
    +          'id * 12 as "FLD2",
    +          lit("aaa") + 'id as "fld3")
    +        df.write
    +          .mode(SaveMode.Overwrite)
    +          .bucketBy(10, "id", "FLD1", "FLD2")
    +          .sortBy("id", "FLD1", "FLD2")
    +          .saveAsTable("TBL")
    +        spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
    +        spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS FOR COLUMNS ID, FLD1, FLD2, FLD3")
    +        val df2 = spark.sql(
    +          """
    +           SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
    +           FROM tbl t1
    +           JOIN tbl t2 on t1.id=t2.id
    +           WHERE t1.fld3 IN (-123.23,321.23)
    +          """.stripMargin)
    +        df2.createTempView("TBL2")
    +        spark.sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").explain()
    +      }
    +    }
    +
    +  }
    +
    --- End diff --

    done

---
[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...
Github user mshtelma commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21052#discussion_r183206628

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
    @@ -382,4 +382,34 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
           }
         }
       }
    +
    +  test("Simple queries must be working, if CBO is turned on") {
    +    withSQLConf(("spark.sql.cbo.enabled", "true")) {
    +      withTable("TBL1", "TBL") {
    +        import org.apache.spark.sql.functions._
    +        val df = spark.range(1000L).select('id,
    +          'id * 2 as "FLD1",
    +          'id * 12 as "FLD2",
    +          lit("aaa") + 'id as "fld3")
    +        df.write
    +          .mode(SaveMode.Overwrite)
    +          .bucketBy(10, "id", "FLD1", "FLD2")
    +          .sortBy("id", "FLD1", "FLD2")
    +          .saveAsTable("TBL")
    +        spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
    +        spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS FOR COLUMNS ID, FLD1, FLD2, FLD3")
    +        val df2 = spark.sql(
    +          """
    +           SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
    +           FROM tbl t1
    +           JOIN tbl t2 on t1.id=t2.id
    +           WHERE t1.fld3 IN (-123.23,321.23)
    +          """.stripMargin)
    +        df2.createTempView("TBL2")
    +        spark.sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").explain()
    --- End diff --

    @wzhfy has suggested calling explain in order to trigger query optimization and calling FilterEstimation.evaluateInSet method. I can call collect() instead. I think explain() is sufficient for this test.

---
[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21052#discussion_r183190383

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
    @@ -382,4 +382,34 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
           }
         }
       }
    +
    +  test("Simple queries must be working, if CBO is turned on") {
    +    withSQLConf(("spark.sql.cbo.enabled", "true")) {
    +      withTable("TBL1", "TBL") {
    +        import org.apache.spark.sql.functions._
    +        val df = spark.range(1000L).select('id,
    +          'id * 2 as "FLD1",
    +          'id * 12 as "FLD2",
    +          lit("aaa") + 'id as "fld3")
    +        df.write
    +          .mode(SaveMode.Overwrite)
    +          .bucketBy(10, "id", "FLD1", "FLD2")
    +          .sortBy("id", "FLD1", "FLD2")
    +          .saveAsTable("TBL")
    +        spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
    --- End diff --

    nit: you don't need the `spark.` prefix

---
[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21052#discussion_r183190432

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
    @@ -382,4 +382,34 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
           }
         }
       }
    +
    +  test("Simple queries must be working, if CBO is turned on") {
    +    withSQLConf(("spark.sql.cbo.enabled", "true")) {
    +      withTable("TBL1", "TBL") {
    +        import org.apache.spark.sql.functions._
    +        val df = spark.range(1000L).select('id,
    +          'id * 2 as "FLD1",
    +          'id * 12 as "FLD2",
    +          lit("aaa") + 'id as "fld3")
    +        df.write
    +          .mode(SaveMode.Overwrite)
    +          .bucketBy(10, "id", "FLD1", "FLD2")
    +          .sortBy("id", "FLD1", "FLD2")
    +          .saveAsTable("TBL")
    +        spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
    +        spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS FOR COLUMNS ID, FLD1, FLD2, FLD3")
    +        val df2 = spark.sql(
    +          """
    +           SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
    +           FROM tbl t1
    +           JOIN tbl t2 on t1.id=t2.id
    +           WHERE t1.fld3 IN (-123.23,321.23)
    +          """.stripMargin)
    +        df2.createTempView("TBL2")
    +        spark.sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").explain()
    --- End diff --

    Why this `explain()` called?

---
[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21052#discussion_r183190234

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
    @@ -382,4 +382,34 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
           }
         }
       }
    +
    +  test("Simple queries must be working, if CBO is turned on") {
    +    withSQLConf(("spark.sql.cbo.enabled", "true")) {
    +      withTable("TBL1", "TBL") {
    +        import org.apache.spark.sql.functions._
    +        val df = spark.range(1000L).select('id,
    +          'id * 2 as "FLD1",
    +          'id * 12 as "FLD2",
    +          lit("aaa") + 'id as "fld3")
    +        df.write
    +          .mode(SaveMode.Overwrite)
    +          .bucketBy(10, "id", "FLD1", "FLD2")
    +          .sortBy("id", "FLD1", "FLD2")
    +          .saveAsTable("TBL")
    +        spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
    +        spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS FOR COLUMNS ID, FLD1, FLD2, FLD3")
    +        val df2 = spark.sql(
    +          """
    +           SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
    +           FROM tbl t1
    +           JOIN tbl t2 on t1.id=t2.id
    +           WHERE t1.fld3 IN (-123.23,321.23)
    +          """.stripMargin)
    +        df2.createTempView("TBL2")
    +        spark.sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").explain()
    +      }
    +    }
    +
    +  }
    +
    --- End diff --

    ditto

---
[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21052#discussion_r183190221

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
    @@ -382,4 +382,34 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
           }
         }
       }
    +
    +  test("Simple queries must be working, if CBO is turned on") {
    +    withSQLConf(("spark.sql.cbo.enabled", "true")) {
    +      withTable("TBL1", "TBL") {
    +        import org.apache.spark.sql.functions._
    +        val df = spark.range(1000L).select('id,
    +          'id * 2 as "FLD1",
    +          'id * 12 as "FLD2",
    +          lit("aaa") + 'id as "fld3")
    +        df.write
    +          .mode(SaveMode.Overwrite)
    +          .bucketBy(10, "id", "FLD1", "FLD2")
    +          .sortBy("id", "FLD1", "FLD2")
    +          .saveAsTable("TBL")
    +        spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
    +        spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS FOR COLUMNS ID, FLD1, FLD2, FLD3")
    +        val df2 = spark.sql(
    +          """
    +           SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
    +           FROM tbl t1
    +           JOIN tbl t2 on t1.id=t2.id
    +           WHERE t1.fld3 IN (-123.23,321.23)
    +          """.stripMargin)
    +        df2.createTempView("TBL2")
    +        spark.sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").explain()
    +      }
    +    }
    +
    --- End diff --

    nit: drop this line

---
[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21052#discussion_r183190136

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
    @@ -382,4 +382,34 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
           }
         }
       }
    +
    +  test("Simple queries must be working, if CBO is turned on") {
    +    withSQLConf(("spark.sql.cbo.enabled", "true")) {
    --- End diff --

    nit: `withSQLConf(SQLConf.CBO_ENABLED.key -> "true")`

---