[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19488 **[Test build #82722 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82722/testReport)** for PR 19488 at commit [`32bdf77`](https://github.com/apache/spark/commit/32bdf771fe70444ac23adf796702b5a26e085805).
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19488 add to whitelist
[GitHub] spark issue #19488: SPARK-22266 The same aggregate function was evaluated mu...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19488 Update the title to `[SPARK-22266][SQL] The same aggregate function was evaluated multiple times`
[GitHub] spark issue #19488: SPARK-22266 The same aggregate function was evaluated mu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Can one of the admins verify this patch?
[GitHub] spark pull request #19488: SPARK-22266 The same aggregate function was evalu...
GitHub user maryannxue opened a pull request: https://github.com/apache/spark/pull/19488 SPARK-22266 The same aggregate function was evaluated multiple times

## What changes were proposed in this pull request?

To let the same aggregate function that appears multiple times in an Aggregate be evaluated only once, we need to deduplicate the aggregate expressions. The original code tried to use a "distinct" call to get a set of aggregate expressions, but that did not work, since "distinct" does not compare semantic equality. And even if it did, further work would still be needed when rewriting the result expressions. In this PR, I changed the "set" to a map that maps the semantic identity of an aggregate expression to the expression itself. Thus, later on, when rewriting the result expressions (i.e., the output expressions), the aggregate expression references can be fixed.

## How was this patch tested?

Added a new test in SQLQuerySuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maryannxue/spark spark-22266

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19488.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19488

commit 32bdf771fe70444ac23adf796702b5a26e085805
Author: maryannxue
Date: 2017-10-13T05:31:10Z

SPARK-22266 The same aggregate function was evaluated multiple times
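[Editorial illustration] To make the map-based deduplication concrete, here is a minimal, self-contained sketch of the idea described above; it is not the actual Catalyst patch. `AggExpr` and `semanticId` are invented stand-ins for Catalyst's `AggregateExpression` and its semantic identity (Spark's own notion of semantic equality is based on canonicalized expressions).
```scala
// Minimal sketch: deduplicate expressions by a semantic identity so that
// syntactically different but semantically equal aggregates collapse to one.
case class AggExpr(fn: String, child: String, alias: String) {
  // Semantic identity ignores cosmetic differences such as the alias.
  def semanticId: (String, String) = (fn, child)
}

object DedupAggregates {
  def apply(aggs: Seq[AggExpr]): (Seq[AggExpr], Map[(String, String), AggExpr]) = {
    // Map each semantic identity to the first expression that carried it.
    val byId = aggs.foldLeft(Map.empty[(String, String), AggExpr]) { (m, a) =>
      if (m.contains(a.semanticId)) m else m + (a.semanticId -> a)
    }
    // Result expressions can later be rewritten to reference the single
    // deduplicated aggregate via a lookup in this map.
    (byId.values.toSeq, byId)
  }

  def main(args: Array[String]): Unit = {
    val aggs = Seq(
      AggExpr("max", "a", "x"),
      AggExpr("max", "a", "y"), // semantically equal to the first
      AggExpr("sum", "b", "z"))
    val (dedup, table) = DedupAggregates(aggs)
    assert(dedup.size == 2) // max(a) is evaluated only once
    println(table(AggExpr("max", "a", "y").semanticId).alias) // prints "x"
  }
}
```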
[GitHub] spark pull request #19475: [SPARK-22257][SQL]Reserve all non-deterministic e...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19475
[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparison should respect case-s...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18460 Thank you, @gatorsmile . Sure, I agree.
[GitHub] spark issue #19475: [SPARK-22257][SQL]Reserve all non-deterministic expressi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19475 Thanks! Merged to master
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19470 One minor comment doesn't affect this. LGTM.
[GitHub] spark issue #19475: [SPARK-22257][SQL]Reserve all non-deterministic expressi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19475 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82714/ Test PASSed.
[GitHub] spark issue #19475: [SPARK-22257][SQL]Reserve all non-deterministic expressi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19475 Merged build finished. Test PASSed.
[GitHub] spark issue #19475: [SPARK-22257][SQL]Reserve all non-deterministic expressi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19475 **[Test build #82714 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82714/testReport)** for PR 19475 at commit [`60a0360`](https://github.com/apache/spark/commit/60a0360b946e9b9b3ee851d8ea5e68b498251d52). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144469314

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,64 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  Seq("orc", "parquet").foreach { format =>
+    test(s"SPARK-18355 Read data from a hive table with a new column - $format") {
+      val client = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+
+      Seq("true", "false").foreach { value =>
+        withSQLConf(
+          HiveUtils.CONVERT_METASTORE_ORC.key -> value,
+          HiveUtils.CONVERT_METASTORE_PARQUET.key -> value) {
--- End diff --

Since you in fact separate orc and parquet into two tests, maybe you just need to test against one config at a time, i.e., orc -> HiveUtils.CONVERT_METASTORE_ORC.key, parquet -> HiveUtils.CONVERT_METASTORE_PARQUET.key.
[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparison should respect case-s...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18460 BTW, we are unable to merge this to Spark 2.2 although this is a bug fix.
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19433 **[Test build #82721 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82721/testReport)** for PR 19433 at commit [`93e17fc`](https://github.com/apache/spark/commit/93e17fc74958d4fa8f3bea38731ecec662e4ca66).
[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparison should respect case-s...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18460 LGTM cc @cloud-fan
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19448 Will check it next time when I am not confident.
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19448 Ok. Next time, please check it with the committers who are familiar with Spark SQL.
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19448 I did this as I was confident it is a bug, because the doc says it should work but it actually does not, and the fix does not break the previous support.
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19470 LGTM too.
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19448 This one has existed since at least Spark 1.5. If you are not confident whether this is a bug or not, please check before merging it.
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19470 **[Test build #82720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82720/testReport)** for PR 19470 at commit [`8e7fe9b`](https://github.com/apache/spark/commit/8e7fe9b9e3fc6121686caad45dcfdb4ff08f0c4a).
[GitHub] spark pull request #19483: [SPARK-21165][SQL] FileFormatWriter should handle...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19483
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19433 Merged build finished. Test PASSed.
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19433 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82717/ Test PASSed.
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19433 **[Test build #82717 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82717/testReport)** for PR 19433 at commit [`c9a8e01`](https://github.com/apache/spark/commit/c9a8e01cead78d2ea6eb6a9ffa007314cbbcf60a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19483: [SPARK-21165][SQL] FileFormatWriter should handle mismat...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19483 thanks for the review, merging to master!
[GitHub] spark pull request #19484: [SPARK-22252][SQL][2.2] FileFormatWriter should r...
Github user cloud-fan closed the pull request at: https://github.com/apache/spark/pull/19484
[GitHub] spark issue #19269: [SPARK-22026][SQL][WIP] data source v2 write path
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19269 **[Test build #82719 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82719/testReport)** for PR 19269 at commit [`ac3de3c`](https://github.com/apache/spark/commit/ac3de3c2c5c50df8e5f2c618a78a4fa4c6756797).
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144466637

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,60 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
--- End diff --

Yep. I'll add `parquet`, too.
[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/18692#discussion_r144466472

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
@@ -152,3 +152,71 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
     if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
   }
 }
+
+/**
+ * A rule that uses propagated constraints to infer join conditions. The optimization is applicable
+ * only to CROSS joins.
+ *
+ * For instance, if there is a CROSS join, where the left relation has 'a = 1' and the right
+ * relation has 'b = 1', then the rule infers 'a = b' as a join predicate.
+ */
+object InferJoinConditionsFromConstraints extends Rule[LogicalPlan] with PredicateHelper {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+    if (SQLConf.get.constraintPropagationEnabled) {
+      inferJoinConditions(plan)
+    } else {
+      plan
+    }
+  }
+
+  private def inferJoinConditions(plan: LogicalPlan): LogicalPlan = plan transform {
+    case join @ Join(left, right, Cross, conditionOpt) =>
+      val leftConstraints = join.constraints.filter(_.references.subsetOf(left.outputSet))
+      val rightConstraints = join.constraints.filter(_.references.subsetOf(right.outputSet))
--- End diff --

I don't think we need to separate the constraints as left only and right only. The following case can infer `t1.col1 = t2.col1`:
```scala
Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t2")
val df = spark.sql("SELECT * FROM t1 CROSS JOIN t2 ON t1.col1 >= t2.col1 " +
  "WHERE t1.col1 = t1.col2 + t2.col2 and t2.col1 = t1.col2 + t2.col2")
```
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19448 How come fixing the behaviour to match the documentation is not a bug fix? That would basically mean we don't backport fixes for things not working as documented for other internal configurations, either. This does not extend the functionalities; it fixes the functionality to work as documented and expected, and I call that a bugfix.
[GitHub] spark issue #19483: [SPARK-21165][SQL] FileFormatWriter should handle mismat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19483 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82712/ Test PASSed.
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144465324

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,60 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
--- End diff --

Improve the test case for checking the other formats?
[GitHub] spark issue #19484: [SPARK-22252][SQL][2.2] FileFormatWriter should respect ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19484 Thanks! Merged to 2.2. Could you close this?
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144465575

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,60 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
--- End diff --

since it depends on the CONVERT_METASTORE_XXX conf, maybe also test parquet.
[GitHub] spark issue #19483: [SPARK-21165][SQL] FileFormatWriter should handle mismat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19483 Merged build finished. Test PASSed.
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19470 LGTM, pending jenkins
[GitHub] spark issue #19483: [SPARK-21165][SQL] FileFormatWriter should handle mismat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19483 **[Test build #82712 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82712/testReport)** for PR 19483 at commit [`f4a7337`](https://github.com/apache/spark/commit/f4a7337b3c4c2b58931550afc5d57902fa98ba96). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19448 That conf is an internal one. The end users will not see it. This is not a bug fix. We should not extend the existing functions or introduce new behaviors/features in 2.2.x releases.
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19448 Since the risk is low, I did not revert it.
[GitHub] spark issue #19269: [SPARK-22026][SQL][WIP] data source v2 write path
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19269 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82713/ Test FAILed.
[GitHub] spark issue #19269: [SPARK-22026][SQL][WIP] data source v2 write path
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19269 Merged build finished. Test FAILed.
[GitHub] spark issue #19269: [SPARK-22026][SQL][WIP] data source v2 write path
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19269 **[Test build #82713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82713/testReport)** for PR 19269 at commit [`d65f2c4`](https://github.com/apache/spark/commit/d65f2c4f7a1d25072f02b97d6454b1ee2670f1b1).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class WriteToDataSourceV2Command(writer: DataSourceV2Writer, query: LogicalPlan)`
   * `class RowToInternalRowDataWriterFactory(`
   * `class RowToInternalRowDataWriter(rowWriter: DataWriter[Row], encoder: ExpressionEncoder[Row])`
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18692 cc @gengliangwang Review this?
[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparison should respect case-s...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18460 Hi, @gatorsmile and @cloud-fan . Could you review this again, too?
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19470 **[Test build #82718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82718/testReport)** for PR 19470 at commit [`ef2123e`](https://github.com/apache/spark/commit/ef2123ecc516fce6feb2a4abe051b4ef862c51a0).
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 @tejasapatil Thank you for your comment. I hope that the benchmark result is a response to the concern about virtual methods raised by @hvanhovell. @hvanhovell **What do you think?** As a long-term plan, it would be good if all (low-level) memory access (i.e., through `Platform`) were in the form of `MemoryBlock`, for two reasons: one is ease of understanding, the other is performance. This transition would proceed step by step. I prepared [another refactoring](https://github.com/apache/spark/pull/19472) for some classes.
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19470 @cloud-fan . Thank you so much for the review! I updated the PR except for one point: if `fieldValue` is `null`, we would also need `setNull` again inside the `else` branch, so the current code is simpler.
```scala
if (fieldRef == null) {
  row.setNull...
} else {
  val fieldValue = ...
  ...
}
```
[GitHub] spark issue #19484: [SPARK-22252][SQL][2.2] FileFormatWriter should respect ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19484 Merged build finished. Test PASSed.
[GitHub] spark issue #19484: [SPARK-22252][SQL][2.2] FileFormatWriter should respect ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19484 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82711/ Test PASSed.
[GitHub] spark issue #19484: [SPARK-22252][SQL][2.2] FileFormatWriter should respect ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19484 **[Test build #82711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82711/testReport)** for PR 19484 at commit [`e527540`](https://github.com/apache/spark/commit/e527540776f2016228dd7ae7ba4e45aff602401d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19448 I think this is a bug to fix, as the previous behaviour does not work as documented:
```
subclass of org.apache.hadoop.mapreduce.OutputCommitter...
```
and the fix does not change existing behaviour. Could you elaborate on why you think it is not eligible?
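[Editorial illustration] For context, the internal configuration under discussion is set like any other SQL conf; a rough sketch follows. The committer class named here is a hypothetical stand-in, not a real class.
```scala
// Illustrative only: the committer class named here is a stand-in and,
// per the documentation quoted above, must be a subclass of
// org.apache.hadoop.mapreduce.OutputCommitter.
spark.conf.set(
  "spark.sql.parquet.output.committer.class",
  "com.example.MyParquetOutputCommitter")
```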
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/19222 At a high level, this idea is good and worth moving forward with. I still have to dig into your analysis in response to the concern raised by @hvanhovell. In terms of the PR itself, is the long-term plan to ensure that all memory access would be in the form of `MemoryBlock`? I mean there are still methods which do vanilla `Object`-based access, which I feel should be gotten rid of (not in this PR, as it will touch a lot more places).
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 Sure, I agree with you. I will try performance evaluation for other methods like `getByte()`, and so on.
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/19222 I pulled up the frequency of `UTF8String` methods being invoked from FB prod clusters and picked the top 25.
```
.writeToMemory()
.getBytes()
.toString()
.fromString()
.clone()
.toLong()
.fromString()
.translate()
.compare()
.getByte()
.compareTo()
.fromAddress()
.getBytes()
.fromBytes()
.<init>()
.substringSQL()
.substring()
.fromBytes()
.fromAddress()
.equals()
.compareTo()
.compare()
.toString()
.hashCode()
.toInt()
```
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/19222 Apart from `UTF8String.trim`, can you try some other methods? If we have to eval perf, it's better to pick a method which is most frequently used... if I have to guess, `trim()` won't be one of the most called methods in that class.
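[Editorial illustration] As a concrete starting point for such an evaluation, below is a hand-rolled timing sketch for one of the frequently invoked methods from the list above (`compareTo`). This is an illustration, not the benchmark used in the PR; numbers from a loop like this are only indicative, and a real measurement should use JMH to control for warm-up and dead-code elimination.
```scala
import org.apache.spark.unsafe.types.UTF8String

// Quick-and-dirty timing loop for a frequently called UTF8String method.
object UTF8StringBench {
  def run(n: Int, a: UTF8String, b: UTF8String): Long = {
    var sink = 0L
    var i = 0
    while (i < n) { sink += a.compareTo(b); i += 1 }
    sink // returning the sink keeps the loop from being optimized away
  }

  def main(args: Array[String]): Unit = {
    val a = UTF8String.fromString("hello world")
    val b = UTF8String.fromString("hello spark")
    run(10000000, a, b) // warm up so the JIT compiles the loop first
    val start = System.nanoTime()
    val sink = run(100000000, a, b)
    println(s"compareTo: ${(System.nanoTime() - start) / 1e6} ms (sink=$sink)")
  }
}
```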
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user krishna-pandey commented on the issue: https://github.com/apache/spark/pull/19419 @tgravescs These generic headers provide the available client-side protection for the application. Even if there is no sensitive data here to formulate an attack by itself, I think the exposed information can be used in conjunction with other data to target other ecosystem components. Also, in the future we may add an interface for data access, so now is the time to think of security first. Cross-site scripting is one of the most prevalent attack vectors and has been an OWASP Top 10 risk for web applications for decades. As the effort to have these headers in place is minimal, IMHO we should set them. As you rightly mentioned, deployment on the cloud can expand the attack surface pretty wide in the absence of the right firewall policy; also, let's not forget insider threats inside corporate networks. Going forward, maybe we will have enough insight to choose which headers need to be enabled by default and enforce them from the application side rather than leaving it to users.
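[Editorial illustration] The kind of client-side protection headers under discussion can be added with a standard servlet filter along these lines. This is a generic sketch, not the code in this PR, and the header values shown are common defaults rather than Spark's.
```scala
import javax.servlet.{Filter, FilterChain, FilterConfig, ServletRequest, ServletResponse}
import javax.servlet.http.HttpServletResponse

// Generic servlet filter that adds common client-side protection headers.
class SecurityHeadersFilter extends Filter {
  override def init(config: FilterConfig): Unit = {}
  override def destroy(): Unit = {}

  override def doFilter(req: ServletRequest, res: ServletResponse,
      chain: FilterChain): Unit = {
    val response = res.asInstanceOf[HttpServletResponse]
    // Ask browsers to enable their reflected-XSS protection.
    response.setHeader("X-XSS-Protection", "1; mode=block")
    // Prevent MIME sniffing of responses away from the declared type.
    response.setHeader("X-Content-Type-Options", "nosniff")
    chain.doFilter(req, res)
  }
}
```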
[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19451 Actually you already have it in the classdoc, so please just update the pr description with it.
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/19451#discussion_r144461898

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
 }

 /**
+ * If one or both of the datasets in the logical [[Except]] operator are purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
+ *   ==> SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case Except(left, right) if isEligible(left, right) =>
+      val filterCondition = combineFilters(right).asInstanceOf[Filter].condition
+      Distinct(
+        Filter(Not(replaceAttributesIn(filterCondition, left)), left)
+      )
+  }
+
+  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, right) match {
+    case (left, right: Filter) => nonFilterChild(left).sameResult(nonFilterChild(right))
+    case _ => false
+  }
+
+  def nonFilterChild(plan: LogicalPlan): LogicalPlan = plan.find(!_.isInstanceOf[Filter]).get
--- End diff --

it shouldn't be an AnalysisException, if there is a bug in catalyst.
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/19451#discussion_r144461913

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
 }

 /**
+ * If one or both of the datasets in the logical [[Except]] operator are purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
+ *   ==> SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case Except(left, right) if isEligible(left, right) =>
+      val filterCondition = combineFilters(right).asInstanceOf[Filter].condition
+      Distinct(
+        Filter(Not(replaceAttributesIn(filterCondition, left)), left)
+      )
+  }
+
+  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, right) match {
+    case (left, right: Filter) => nonFilterChild(left).sameResult(nonFilterChild(right))
+    case _ => false
+  }
+
+  def nonFilterChild(plan: LogicalPlan): LogicalPlan = plan.find(!_.isInstanceOf[Filter]).get
--- End diff --

but i agree we should throw a more meaningful error message in case there is a bug
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/19451#discussion_r144461813

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
 }

 /**
+ * If one or both of the datasets in the logical [[Except]] operator are purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
+ *   ==> SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
--- End diff --

I'd put this in a new file. the optimizer file is getting too long.
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19433 **[Test build #82717 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82717/testReport)** for PR 19433 at commit [`c9a8e01`](https://github.com/apache/spark/commit/c9a8e01cead78d2ea6eb6a9ffa007314cbbcf60a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19451 Can you update the pr description with an example plan before / after this optimization, and also put that example in the comment section of the doc.
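[Editorial illustration] For anyone reproducing the requested before/after plans, one quick way to print them from a spark-shell is sketched below; the table name `t1` is illustrative.
```scala
// Print the plan containing the Except operator, then the plan after the
// optimizer runs (where the proposed ReplaceExceptWithFilter rule would fire).
val df = spark.sql(
  "SELECT a1, a2 FROM t1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM t1 WHERE a1 = 5")
println(df.queryExecution.analyzed)
println(df.queryExecution.optimizedPlan)
```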
[GitHub] spark issue #19487: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19487 Merged build finished. Test PASSed.
[GitHub] spark issue #19487: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19487 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82709/ Test PASSed.
[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19464 **[Test build #82716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82716/testReport)** for PR 19464 at commit [`25f98d0`](https://github.com/apache/spark/commit/25f98d0d89e4566339d9ba7701975af4e175c918).
[GitHub] spark issue #19487: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19487 **[Test build #82709 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82709/testReport)** for PR 19487 at commit [`b15ebe4`](https://github.com/apache/spark/commit/b15ebe4b6721f9533150aa5831986bea081843e2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19480#discussion_r144459033

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala ---
@@ -201,6 +201,23 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
     }
   }

+  test("SPARK-22226: group split expressions into one method per nested class") {
--- End diff --

@mgaido91 I can reproduce the issue with the following test case. You can check it:
```scala
test("SPARK-22226: too many split expressions should not exceed the constant pool limit") {
  withSQLConf((SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "false")) {
    val colNumber = 1000
    val baseDF = spark.range(10).toDF()
    val newCols = (1 to colNumber).map { colIndex =>
      expr(s"id + $colIndex").as(s"_$colIndex")
    }
    val input = baseDF.select(newCols: _*)
    val aggs2 = (1 to colNumber).flatMap { colIndex =>
      val colName = s"_$colIndex"
      Seq(expr(s"stddev($colName)"),
        expr(s"stddev_samp($colName)"),
        expr(s"stddev_pop($colName)"),
        expr(s"variance($colName)"),
        expr(s"var_samp($colName)"),
        expr(s"var_pop($colName)"),
        expr(s"skewness($colName)"),
        expr(s"kurtosis($colName)"))
    }
    input.agg(aggs2.head, aggs2.tail: _*).collect()
  }
}
```
```
[info] Cause: org.codehaus.janino.JaninoRuntimeException: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection has grown past JVM limit of 0xFFFF
[info]   at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1079)
```
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19448 This is not eligible for backporting. We should not do it next time.
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19433 Sorry, realized I conflated feature subsampling and `subsampleWeights` (instance weights for training examples). IMO feature subsampling can be added in a follow-up PR, but `subsampleWeights` should go in this PR.
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r144458346

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -426,4 +426,11 @@ package object config {
     .toSequence
     .createOptional
+
+  private[spark] val MAX_REMOTE_BLOCK_SIZE_TO_MEM =
--- End diff --

maybe just `spark.maxRemoteBlockSizeFetchToMemory`?
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r144458322

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -426,4 +426,11 @@ package object config {
     .toSequence
     .createOptional
+
+  private[spark] val MAX_REMOTE_BLOCK_SIZE_TO_MEM =
--- End diff --

with `spark.storage.maxRemoteBlockSizeFetchToMemory` it is not very clear that it works for shuffle too...
[GitHub] spark pull request #19222: [SPARK-10399][CORE][SQL] Introduce multiple Memor...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19222#discussion_r144458324

--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java ---
@@ -17,47 +17,168 @@
 package org.apache.spark.unsafe.memory;

-import javax.annotation.Nullable;
-
 import org.apache.spark.unsafe.Platform;

+import javax.annotation.Nullable;
+
 /**
- * A consecutive block of memory, starting at a {@link MemoryLocation} with a fixed size.
+ * A declaration of interfaces of MemoryBlock classes .
  */
-public class MemoryBlock extends MemoryLocation {
+public abstract class MemoryBlock {
+  @Nullable
+  protected final Object obj;

-  private final long length;
+  protected final long offset;
+
+  protected final long length;

   /**
    * Optional page number; used when this MemoryBlock represents a page allocated by a
-   * TaskMemoryManager. This field is public so that it can be modified by the TaskMemoryManager,
-   * which lives in a different package.
+   * TaskMemoryManager. This field can be updated using setPageNumber method so that
--- End diff --

@tejasapatil While we make it private, the accessors (`setPageNumber` and `getPageNumber`) have the `final` attribute, so the JIT compiler will inline these methods. As a result, I think this version with `final` accessors can achieve similar performance to the approach with `public` fields. What do you think? BTW, I will fix the typo soon.
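[Editorial illustration] A tiny sketch of the pattern under discussion, with hypothetical names: the field stays private, and because the accessors are trivial and final, HotSpot is expected to inline them down to a plain field access.
```scala
// Hypothetical sketch of the private-field-plus-final-accessor pattern.
// The trivial final methods are inlining candidates for the JIT, so after
// compilation the accessor call should cost the same as a public field read.
abstract class MemoryBlockLike {
  private var pageNumber: Int = -1
  final def getPageNumber: Int = pageNumber
  final def setPageNumber(n: Int): Unit = { pageNumber = n }
}
```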
[GitHub] spark issue #19483: [SPARK-21165][SQL] FileFormatWriter should handle mismat...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19483 that will be great, thanks @tejasapatil !
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19470 LGTM except some minor comments
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144458108

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,80 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
+    val client = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+
+    Seq("true", "false").foreach { value =>
+      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
+        withTempDatabase { db =>
+          client.runSqlHive(
+            s"""
+               |CREATE TABLE $db.t(
+               |  click_id string,
+               |  search_id string,
+               |  uid bigint)
+               |PARTITIONED BY (
+               |  ts string,
+               |  hour string)
+               |STORED AS ORC
+             """.stripMargin)
+
+          client.runSqlHive(
+            s"""
+               |INSERT INTO TABLE $db.t
+               |PARTITION (ts = '98765', hour = '01')
+               |VALUES (12, 2, 12345)
+             """.stripMargin
+          )
+
+          checkAnswer(
+            sql(s"SELECT * FROM $db.t"),
+            Row("12", "2", 12345, "98765", "01"))
+
+          client.runSqlHive(s"ALTER TABLE $db.t ADD COLUMNS (dummy string)")
+
+          checkAnswer(
+            sql(s"SELECT click_id, search_id FROM $db.t"),
+            Row("12", "2"))
+
+          checkAnswer(
+            sql(s"SELECT search_id, click_id FROM $db.t"),
+            Row("2", "12"))
+
+          checkAnswer(
+            sql(s"SELECT search_id FROM $db.t"),
+            Row("2"))
+
+          checkAnswer(
+            sql(s"SELECT dummy, click_id FROM $db.t"),
+            Row(null, "12"))
+
+          checkAnswer(
+            sql(s"SELECT * FROM $db.t"),
+            Row("12", "2", 12345, null, "98765", "01"))
+        }
+      }
+    }
+  }
+
+  // This test case is added to prevent regression.
+  test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order is different") {
--- End diff --

it's weird to have a test verifying a bug, I think it's good enough to have a JIRA tracking this bug.
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144457977

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,80 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
+    val client = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+
+    Seq("true", "false").foreach { value =>
+      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
+        withTempDatabase { db =>
+          client.runSqlHive(
+            s"""
+               |CREATE TABLE $db.t(
+               |  click_id string,
+               |  search_id string,
+               |  uid bigint)
+               |PARTITIONED BY (
+               |  ts string,
+               |  hour string)
+               |STORED AS ORC
+             """.stripMargin)
+
+          client.runSqlHive(
+            s"""
+               |INSERT INTO TABLE $db.t
+               |PARTITION (ts = '98765', hour = '01')
+               |VALUES (12, 2, 12345)
+             """.stripMargin
+          )
+
+          checkAnswer(
+            sql(s"SELECT * FROM $db.t"),
+            Row("12", "2", 12345, "98765", "01"))
+
+          client.runSqlHive(s"ALTER TABLE $db.t ADD COLUMNS (dummy string)")
+
+          checkAnswer(
+            sql(s"SELECT click_id, search_id FROM $db.t"),
+            Row("12", "2"))
+
+          checkAnswer(
+            sql(s"SELECT search_id, click_id FROM $db.t"),
+            Row("2", "12"))
+
+          checkAnswer(
+            sql(s"SELECT search_id FROM $db.t"),
+            Row("2"))
+
+          checkAnswer(
+            sql(s"SELECT dummy, click_id FROM $db.t"),
+            Row(null, "12"))
+
+          checkAnswer(
+            sql(s"SELECT * FROM $db.t"),
--- End diff --

ditto
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144457932

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,80 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
+    val client = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+
+    Seq("true", "false").foreach { value =>
+      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
+        withTempDatabase { db =>
+          client.runSqlHive(
+            s"""
+               |CREATE TABLE $db.t(
+               |  click_id string,
+               |  search_id string,
+               |  uid bigint)
+               |PARTITIONED BY (
+               |  ts string,
+               |  hour string)
+               |STORED AS ORC
+             """.stripMargin)
+
+          client.runSqlHive(
+            s"""
+               |INSERT INTO TABLE $db.t
+               |PARTITION (ts = '98765', hour = '01')
+               |VALUES (12, 2, 12345)
+             """.stripMargin
+          )
+
+          checkAnswer(
+            sql(s"SELECT * FROM $db.t"),
--- End diff --

please list all columns here instead of `*`, to make the test more clear
[GitHub] spark issue #19452: [SPARK-22136][SS] Evaluate one-sided conditions early in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19452 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82710/ Test PASSed.
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144457796

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -272,25 +272,35 @@ private[orc] object OrcRelation extends HiveInspectors {
   def unwrapOrcStructs(
       conf: Configuration,
       dataSchema: StructType,
+      requiredSchema: StructType,
       maybeStructOI: Option[StructObjectInspector],
       iterator: Iterator[Writable]): Iterator[InternalRow] = {
     val deserializer = new OrcSerde
-    val mutableRow = new SpecificInternalRow(dataSchema.map(_.dataType))
-    val unsafeProjection = UnsafeProjection.create(dataSchema)
+    val mutableRow = new SpecificInternalRow(requiredSchema.map(_.dataType))
+    val unsafeProjection = UnsafeProjection.create(requiredSchema)

     def unwrap(oi: StructObjectInspector): Iterator[InternalRow] = {
-      val (fieldRefs, fieldOrdinals) = dataSchema.zipWithIndex.map {
-        case (field, ordinal) => oi.getStructFieldRef(field.name) -> ordinal
+      val (fieldRefs, fieldOrdinals) = requiredSchema.zipWithIndex.map {
+        case (field, ordinal) =>
+          var ref = oi.getStructFieldRef(field.name)
+          if (ref == null) {
+            val maybeIndex = dataSchema.getFieldIndex(field.name)
+            if (maybeIndex.isDefined) {
+              ref = oi.getStructFieldRef("_col" + maybeIndex.get)
+            }
+          }
+          ref -> ordinal
       }.unzip

-      val unwrappers = fieldRefs.map(unwrapperFor)
+      val unwrappers = fieldRefs.map(r => if (r == null) null else unwrapperFor(r))

      iterator.map { value =>
        val raw = deserializer.deserialize(value)
        var i = 0
        val length = fieldRefs.length
        while (i < length) {
-          val fieldValue = oi.getStructFieldData(raw, fieldRefs(i))
+          val fieldRef = fieldRefs(i)
+          val fieldValue = if (fieldRef == null) null else oi.getStructFieldData(raw, fieldRefs(i))
           if (fieldValue == null) {
--- End diff --

nit:
```
if (fieldRef == null) {
  row.setNull...
} else {
  val fieldValue = ...
  ...
}
```
[GitHub] spark issue #19452: [SPARK-22136][SS] Evaluate one-sided conditions early in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19452 Merged build finished. Test PASSed.
[GitHub] spark issue #19452: [SPARK-22136][SS] Evaluate one-sided conditions early in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19452 **[Test build #82710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82710/testReport)** for PR 19452 at commit [`6753825`](https://github.com/apache/spark/commit/67538255c00e01a8b1553c82c2c83b5ae0a7ddde). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144457643

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -138,8 +138,7 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       if (maybePhysicalSchema.isEmpty) {
         Iterator.empty
       } else {
-        val physicalSchema = maybePhysicalSchema.get
-        OrcRelation.setRequiredColumns(conf, physicalSchema, requiredSchema)
+        OrcRelation.setRequiredColumns(conf, dataSchema, requiredSchema)
--- End diff --

oh i see, we only need to pass the required column indices to the orc reader.
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144457479

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -272,25 +272,35 @@ private[orc] object OrcRelation extends HiveInspectors {
   def unwrapOrcStructs(
       conf: Configuration,
       dataSchema: StructType,
+      requiredSchema: StructType,
       maybeStructOI: Option[StructObjectInspector],
       iterator: Iterator[Writable]): Iterator[InternalRow] = {
     val deserializer = new OrcSerde
-    val mutableRow = new SpecificInternalRow(dataSchema.map(_.dataType))
-    val unsafeProjection = UnsafeProjection.create(dataSchema)
+    val mutableRow = new SpecificInternalRow(requiredSchema.map(_.dataType))
+    val unsafeProjection = UnsafeProjection.create(requiredSchema)

     def unwrap(oi: StructObjectInspector): Iterator[InternalRow] = {
-      val (fieldRefs, fieldOrdinals) = dataSchema.zipWithIndex.map {
-        case (field, ordinal) => oi.getStructFieldRef(field.name) -> ordinal
+      val (fieldRefs, fieldOrdinals) = requiredSchema.zipWithIndex.map {
+        case (field, ordinal) =>
+          var ref = oi.getStructFieldRef(field.name)
+          if (ref == null) {
+            val maybeIndex = dataSchema.getFieldIndex(field.name)
--- End diff --

the `requiredSchema` is guaranteed to be contained in the `dataSchema`.
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144457357

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -138,8 +138,7 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       if (maybePhysicalSchema.isEmpty) {
         Iterator.empty
       } else {
-        val physicalSchema = maybePhysicalSchema.get
-        OrcRelation.setRequiredColumns(conf, physicalSchema, requiredSchema)
+        OrcRelation.setRequiredColumns(conf, dataSchema, requiredSchema)
--- End diff --

does it work? seems here we lie to the orc reader about the physical schema.
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144457235

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -138,8 +138,7 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       if (maybePhysicalSchema.isEmpty) {
--- End diff --

nit
```
val isEmptyFile = OrcFileOperator.readSchema(Seq(file.filePath), Some(conf)).isEmpty
if (isEmptyFile) {
  ...
} else ...
```
[GitHub] spark pull request #19222: [SPARK-10399][CORE][SQL] Introduce multiple Memor...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/19222#discussion_r144457118

--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java ---
@@ -17,47 +17,168 @@
 package org.apache.spark.unsafe.memory;

-import javax.annotation.Nullable;
-
 import org.apache.spark.unsafe.Platform;

+import javax.annotation.Nullable;
+
 /**
- * A consecutive block of memory, starting at a {@link MemoryLocation} with a fixed size.
+ * A declaration of interfaces of MemoryBlock classes .
  */
-public class MemoryBlock extends MemoryLocation {
+public abstract class MemoryBlock {
+  @Nullable
+  protected final Object obj;

-  private final long length;
+  protected final long offset;
+
+  protected final long length;

   /**
    * Optional page number; used when this MemoryBlock represents a page allocated by a
-   * TaskMemoryManager. This field is public so that it can be modified by the TaskMemoryManager,
-   * which lives in a different package.
+   * TaskMemoryManager. This field can be updated using setPageNumber method so that
--- End diff --

typo: `TaskMemoryManager`
[GitHub] spark pull request #19222: [SPARK-10399][CORE][SQL] Introduce multiple Memor...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/19222#discussion_r144457126

--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java ---
@@ -17,47 +17,168 @@
 package org.apache.spark.unsafe.memory;

-import javax.annotation.Nullable;
-
 import org.apache.spark.unsafe.Platform;

+import javax.annotation.Nullable;
+
 /**
- * A consecutive block of memory, starting at a {@link MemoryLocation} with a fixed size.
+ * A declaration of interfaces of MemoryBlock classes .
  */
-public class MemoryBlock extends MemoryLocation {
+public abstract class MemoryBlock {
+  @Nullable
+  protected final Object obj;

-  private final long length;
+  protected final long offset;
+
+  protected final long length;

   /**
    * Optional page number; used when this MemoryBlock represents a page allocated by a
-   * TaskMemoryManager. This field is public so that it can be modified by the TaskMemoryManager,
-   * which lives in a different package.
+   * TaskMemoryManager. This field can be updated using setPageNumber method so that
--- End diff --

I think keeping it public would be more performant vs. mutating the field through a method. I know that using a method is the better way in terms of design, but there are places in the codebase where we do this solely for perf. reasons.
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19451#discussion_r144456518

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
 }

 /**
+ * If one or both of the datasets in the logical [[Except]] operator are purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
+ *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case Except(left, right) if isEligible(left, right) =>
+      val filterCondition = combineFilters(right).asInstanceOf[Filter].condition
+      Distinct(
+        Filter(Not(replaceAttributesIn(filterCondition, left)), left)
+      )
+  }
+
+  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, right) match {
+    case (left, right: Filter) => nonFilterChild(left).sameResult(nonFilterChild(right))
+    case _ => false
+  }
+
+  def nonFilterChild(plan: LogicalPlan): LogicalPlan = plan.find(!_.isInstanceOf[Filter]).get
+
+  def combineFilters(plan: LogicalPlan): LogicalPlan = {
+    @tailrec
+    def fixedPoint(plan: LogicalPlan, acc: LogicalPlan): LogicalPlan = {
+      if (acc.fastEquals(plan)) acc else fixedPoint(acc, CombineFilters(acc))
+    }
+
+    fixedPoint(plan, CombineFilters(plan))
+  }
+
+  def replaceAttributesIn(condition: Expression, node: LogicalPlan): Expression = {
--- End diff --

We do not need this function. Could we inline this logic in the rule?
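[Editor's aside: one plausible way to inline the attribute rewrite into the rule body, sketched under the assumption that `replaceAttributesIn` (whose body is not shown in the hunk) rewrites the right child's attribute references to the left child's output position by position. Hypothetical, not the committed code.]

```scala
case Except(left, right) if isEligible(left, right) =>
  val filterCondition = combineFilters(right).asInstanceOf[Filter].condition
  // Pair the two outputs positionally and rewrite attribute references
  // from the right plan's output to the left plan's output.
  val rewrites = AttributeMap(right.output.zip(left.output))
  val rewrittenCondition = filterCondition.transform {
    case a: Attribute => rewrites.getOrElse(a, a)
  }
  Distinct(Filter(Not(rewrittenCondition), left))
```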
[GitHub] spark issue #19483: [SPARK-21165][SQL] FileFormatWriter should handle mismat...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/19483

> I'll refactor it later, to use `requiredChildOrdering` to do the sort.

The hive bucketing PR does that: https://github.com/apache/spark/pull/19001. I can isolate that piece and put out a PR.
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19451#discussion_r144456402

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
 }

 /**
+ * If one or both of the datasets in the logical [[Except]] operator are purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
+ *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case Except(left, right) if isEligible(left, right) =>
+      val filterCondition = combineFilters(right).asInstanceOf[Filter].condition
+      Distinct(
+        Filter(Not(replaceAttributesIn(filterCondition, left)), left)
+      )
+  }
+
+  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, right) match {
+    case (left, right: Filter) => nonFilterChild(left).sameResult(nonFilterChild(right))
+    case _ => false
+  }
+
+  def nonFilterChild(plan: LogicalPlan): LogicalPlan = plan.find(!_.isInstanceOf[Filter]).get
+
+  def combineFilters(plan: LogicalPlan): LogicalPlan = {
+    @tailrec
+    def fixedPoint(plan: LogicalPlan, acc: LogicalPlan): LogicalPlan = {
--- End diff --

The func name is confusing...
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19451#discussion_r144456109

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
 }

 /**
+ * If one or both of the datasets in the logical [[Except]] operator are purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
+ *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case Except(left, right) if isEligible(left, right) =>
+      val filterCondition = combineFilters(right).asInstanceOf[Filter].condition
+      Distinct(
+        Filter(Not(replaceAttributesIn(filterCondition, left)), left)
+      )
+  }
+
+  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, right) match {
+    case (left, right: Filter) => nonFilterChild(left).sameResult(nonFilterChild(right))
+    case _ => false
+  }
+
+  def nonFilterChild(plan: LogicalPlan): LogicalPlan = plan.find(!_.isInstanceOf[Filter]).get
--- End diff --

Although it should be impossible, please use
```Scala
plan.find(!_.isInstanceOf[Filter]).getOrElse {
  throw new AnalysisException("xyz")
}
```
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19451#discussion_r144455482

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
 }

 /**
+ * If one or both of the datasets in the logical [[Except]] operator are purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
+ *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case Except(left, right) if isEligible(left, right) =>
+      val filterCondition = combineFilters(right).asInstanceOf[Filter].condition
+      Distinct(
+        Filter(Not(replaceAttributesIn(filterCondition, left)), left)
+      )
+  }
+
+  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, right) match {
--- End diff --

Please add `private` to all the internal functions?
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19451#discussion_r144455603

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
 }

 /**
+ * If one or both of the datasets in the logical [[Except]] operator are purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
+ *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case Except(left, right) if isEligible(left, right) =>
+      val filterCondition = combineFilters(right).asInstanceOf[Filter].condition
+      Distinct(
+        Filter(Not(replaceAttributesIn(filterCondition, left)), left)
+      )
+  }
+
+  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, right) match {
+    case (left, right: Filter) => nonFilterChild(left).sameResult(nonFilterChild(right))
+    case _ => false
+  }
+
+  def nonFilterChild(plan: LogicalPlan): LogicalPlan = plan.find(!_.isInstanceOf[Filter]).get
--- End diff --

Nit: no need to add the return type `: LogicalPlan` for the private functions.
[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19458#discussion_r144456908

--- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -100,7 +100,16 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolea
   /** List all the blocks currently stored on disk by the disk manager. */
   def getAllBlocks(): Seq[BlockId] = {
-    getAllFiles().map(f => BlockId(f.getName))
+    getAllFiles().flatMap { f =>
+      val blockId = BlockId.guess(f.getName)
--- End diff --

+1 for using try-catch here.
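[Editor's aside: a self-contained toy of the shape being endorsed here, with stand-in names rather than the `DiskBlockManager` code: parse each file name inside a try and skip anything unrecognized, instead of failing the whole listing.]

```scala
import scala.util.Try

// Stand-in parser: accepts names that look like block ids, throws otherwise.
def parseBlockId(name: String): String = {
  require(name.startsWith("rdd_") || name.startsWith("shuffle_"), s"unrecognized: $name")
  name
}

val files = Seq("rdd_0_1", "temp_local_0a1b", "shuffle_3_0_0")

// try-catch style (via Try): the temp file is skipped rather than fatal.
val blocks = files.flatMap(f => Try(parseBlockId(f)).toOption)
assert(blocks == Seq("rdd_0_1", "shuffle_3_0_0"))
```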
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r144456817

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -426,4 +426,11 @@ package object config {
       .toSequence
       .createOptional

+  private[spark] val MAX_REMOTE_BLOCK_SIZE_TO_MEM =
--- End diff --

I would prefer to use `spark.storage.maxRemoteBlockSizeFetchToMemory`, since the driver-side block manager will also leverage this feature.
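[Editor's aside: for reference, a sketch of how the renamed entry might read under this proposal. Both the name and the default are hypothetical here, since they were still under discussion, and the `ConfigBuilder` DSL is assumed to be the one used elsewhere in `config/package.scala`.]

```scala
private[spark] val MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM =
  ConfigBuilder("spark.storage.maxRemoteBlockSizeFetchToMemory")
    .doc("Remote blocks above this size are spilled to disk while being " +
      "fetched, instead of being buffered in memory.")
    .bytesConf(ByteUnit.BYTE)
    .createWithDefault(Long.MaxValue)
```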
[GitHub] spark issue #19263: [SPARK-22050][CORE] Allow BlockUpdated events to be opti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19263

**[Test build #82715 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82715/testReport)** for PR 19263 at commit [`21b262a`](https://github.com/apache/spark/commit/21b262a4cf945b22ce1ace84acded29852bc73b5).
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r144456522

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -426,4 +426,11 @@ package object config {
       .toSequence
       .createOptional

+  private[spark] val MAX_REMOTE_BLOCK_SIZE_TO_MEM =
--- End diff --

how about `MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM` and `spark.executor.maxRemoteBlockSizeFetchToMemory`?
[GitHub] spark issue #19483: [SPARK-21165][SQL] FileFormatWriter should handle mismat...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19483

I'll refactor it later, to use `requiredChildOrdering` to do the sort. I just wanna make this bug fix as simple as possible.
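[Editor's aside: context on the mechanism referred to above. A physical plan node can declare the ordering it needs from its children, letting the planner insert the sort instead of the node doing it by hand; the node and column names in this sketch are hypothetical.]

```scala
// In a hypothetical SparkPlan node that needs sorted input: declaring the
// requirement lets EnsureRequirements place a SortExec below this node,
// instead of the node sorting its input manually.
override def requiredChildOrdering: Seq[Seq[SortOrder]] =
  Seq(bucketColumns.map(col => SortOrder(col, Ascending)))
```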