[GitHub] [spark] LuciferYang commented on pull request #37331: [WIP][SPARK-39913][BUILD] Upgrade to Arrow 9.0.0
LuciferYang commented on PR #37331: URL: https://github.com/apache/spark/pull/37331#issuecomment-1202050954 I revert the configuration of AFS staging due to the previous test passed, but I still can't find arrow-*** 9.0 from the central repository. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sumeetgajjar commented on a diff in pull request #37325: [SPARK-39902][SQL] Add Scan details to spark plan scan node in SparkUI
sumeetgajjar commented on code in PR #37325: URL: https://github.com/apache/spark/pull/37325#discussion_r935139884 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala: ## @@ -132,4 +132,10 @@ case class BatchScanExec( val result = s"$nodeName$truncatedOutputString ${scan.description()} $runtimeFiltersString" redact(result) } + + /** + * Returns the name of this type of TreeNode. Defaults to the class name. + * Note that we remove the "Exec" suffix for physical operators here. + */ + override def nodeName: String = s"BatchScan ${scan.name()}" Review Comment: Sure - let me check -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37355: [SPARK-39930][SQL] Introduce Cache Hints
cloud-fan commented on code in PR #37355: URL: https://github.com/apache/spark/pull/37355#discussion_r935138416 ## sql/core/src/test/scala/org/apache/spark/sql/CacheHintsSuite.scala: ## @@ -0,0 +1,75 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.sql.catalyst.plans.PlanTest +import org.apache.spark.sql.execution.CachedData +import org.apache.spark.sql.execution.analysis.ResolveCacheHints.NO_RESULT_CACHE_TAG +import org.apache.spark.sql.test.SharedSparkSession + +class CacheHintsSuite extends PlanTest with SharedSparkSession { + private def lookupCachedData(df: DataFrame): Option[CachedData] = { +spark.sharedState.cacheManager.lookupCachedData(df) + } + + test("RESULT_CACHE Hint") { +val e = intercept[AnalysisException]( + sql("SELECT /*+ RESULT_CACHE(abc) */ id from range(0, 1)").collect()) Review Comment: I don't think `CACHE TABLE` command fails if the plan is already cached. So basically the hint is to combine two sql statements? ``` CACHE TABLE abc AS SELECT id from range(0,1) SELECT * FROM abc ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on pull request #37373: [SPARK-39911][SQL][3.3] Optimize global Sort to RepartitionByExpression
ulysses-you commented on PR #37373: URL: https://github.com/apache/spark/pull/37373#issuecomment-1202043485 @dongjoon-hyun yes, the story is we fix a bug in https://github.com/apache/spark/pull/37250 and that pr backport into branch-3.3. However, that fix may introduce performance regression. This pr itself is only to improve performance but in order to avoid the regression, we also backport this pr. see the details https://github.com/apache/spark/pull/37330#issuecomment-1201979396 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37372: [SPARK-39944][BUILD] Upgrade dropwizard metrics to 4.2.10
LuciferYang commented on PR #37372: URL: https://github.com/apache/spark/pull/37372#issuecomment-1202041183 @dongjoon-hyun There should be no new page to update. [https://metrics.dropwizard.io/4.2.0/](https://metrics.dropwizard.io/4.2.0/) seems to be the latest, there is no `https://metrics.dropwizard.io/4.2.10/` similar pages exist. If it's my problem, please correct me. In addition, do we need to change `http` to `https`? For example: ``` docs/monitoring.md:Executor memory metrics are also exposed via the Spark metrics system based on the [Dropwizard metrics library](http://metrics.dropwizard.io/4.2.0). ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn commented on a diff in pull request #37355: [SPARK-39930][SQL] Introduce Cache Hints
yaooqinn commented on code in PR #37355: URL: https://github.com/apache/spark/pull/37355#discussion_r935131842 ## sql/core/src/test/scala/org/apache/spark/sql/CacheHintsSuite.scala: ## @@ -0,0 +1,75 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.sql.catalyst.plans.PlanTest +import org.apache.spark.sql.execution.CachedData +import org.apache.spark.sql.execution.analysis.ResolveCacheHints.NO_RESULT_CACHE_TAG +import org.apache.spark.sql.test.SharedSparkSession + +class CacheHintsSuite extends PlanTest with SharedSparkSession { + private def lookupCachedData(df: DataFrame): Option[CachedData] = { +spark.sharedState.cacheManager.lookupCachedData(df) + } + + test("RESULT_CACHE Hint") { +val e = intercept[AnalysisException]( + sql("SELECT /*+ RESULT_CACHE(abc) */ id from range(0, 1)").collect()) Review Comment: They are very similar with few differences: 1. the result set is not the same 2. it skips caching step when already cached, so it will not fail Parameters are disabled now because 1. I don't know if it is good to do parsing table identifiers in post-resolution 2. oracle does not have params support either, maybe it's simple and easy for users? 3. minimize the patch size -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on pull request #37284: [SPARK-39867][SQL] Global limit should not inherit OrderPreservingUnaryNode
ulysses-you commented on PR #37284: URL: https://github.com/apache/spark/pull/37284#issuecomment-1202032739 > do we have a physical rule to remove unnecessary sort if the global limit can produce sorted data? yes, we have `RemoveRedundantSorts` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #37284: [SPARK-39867][SQL] Global limit should not inherit OrderPreservingUnaryNode
cloud-fan commented on PR #37284: URL: https://github.com/apache/spark/pull/37284#issuecomment-1202029531 BTW, do we have a physical rule to remove unnecessary sort if the global limit can produce sorted data? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang opened a new pull request, #37375: [SPARK-39947][BUILD] Upgrade Jersey to 2.36
LuciferYang opened a new pull request, #37375: URL: https://github.com/apache/spark/pull/37375 ### What changes were proposed in this pull request? This pr upgrade Jersey from 2.35 to 2.36. ### Why are the changes needed? This version adapts to Jack 2.13.3, which is also used by Spark currently - [Adopt Jackson 2.13](https://github.com/eclipse-ee4j/jersey/pull/4928) - [Update Jackson to 2.13.3](https://github.com/eclipse-ee4j/jersey/pull/5076) The release notes as follows: - https://github.com/eclipse-ee4j/jersey/releases/tag/2.36 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] deshanxiao closed pull request #37336: [SPARK-39916][SQL][MLLIB][REFACTOR] Merge ml SchemaUtils to SQL
deshanxiao closed pull request #37336: [SPARK-39916][SQL][MLLIB][REFACTOR] Merge ml SchemaUtils to SQL URL: https://github.com/apache/spark/pull/37336 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37325: [SPARK-39902][SQL] Add Scan details to spark plan scan node in SparkUI
cloud-fan commented on code in PR #37325: URL: https://github.com/apache/spark/pull/37325#discussion_r935119977 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala: ## @@ -132,4 +132,10 @@ case class BatchScanExec( val result = s"$nodeName$truncatedOutputString ${scan.description()} $runtimeFiltersString" redact(result) } + + /** + * Returns the name of this type of TreeNode. Defaults to the class name. + * Note that we remove the "Exec" suffix for physical operators here. + */ + override def nodeName: String = s"BatchScan ${scan.name()}" Review Comment: Can we refactor the code a bit and pass `Table#name` when constructing `BatchScanExec`? I think it's better than adding a new public API. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r935119522 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,255 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +withSQLConf(SQLConf.ANSI_ENABLED.key -> "true") { + val df4 = spark.read +.table("h2.test.employee") +.groupBy("dept").sum("SALARY") +.orderBy($"dept" + 100) Review Comment: BTW, can we make the grouping key an expression? like this https://github.com/apache/spark/pull/37320/files#diff-1496378d9e7817c45c962f1af48e5e765cb475bd01d58edec118d98225e02ef3R887 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r935119295 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,255 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +withSQLConf(SQLConf.ANSI_ENABLED.key -> "true") { + val df4 = spark.read +.table("h2.test.employee") +.groupBy("dept").sum("SALARY") +.orderBy($"dept" + 100) Review Comment: can we use some functions that don't need ansi mode to simplify the test? e.g. `log(10, dept)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on pull request #37374: [SPARK-39917][SQL] Use different error classes for numeric/interval arithmetic overflow
gengliangwang commented on PR #37374: URL: https://github.com/apache/spark/pull/37374#issuecomment-1202019831 This PR is to cherry-pick https://github.com/apache/spark/pull/37337 again, which was reverted since it fail to compile after https://github.com/apache/spark/pull/37343 is merged -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang opened a new pull request, #37374: [SPARK-39917][SQL] Use different error classes for numeric/interval arithmetic overflow
gengliangwang opened a new pull request, #37374: URL: https://github.com/apache/spark/pull/37374 ### What changes were proposed in this pull request? Similar with https://github.com/apache/spark/pull/37313, currently, when arithmetic overflow errors happen under ANSI mode, the error messages are like ``` [ARITHMETIC_OVERFLOW] long overflow. Use 'try_multiply' to tolerate overflow and return NULL instead. If necessary set spark.sql.ansi.enabled to "false" ``` The "(except for ANSI interval type)" part is confusing. We should remove it for the numeric arithmetic operations and have a new error class for the interval division error: `INTERVAL_ARITHMETIC_OVERFLOW` ### Why are the changes needed? For better error messages ### Does this PR introduce _any_ user-facing change? Yes, Use different error classes for arithmetic overflows of numeric/interval.. After changes, the error messages are simpler and more clear. ### How was this patch tested? UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on pull request #37284: [SPARK-39867][SQL] Global limit should not inherit OrderPreservingUnaryNode
ulysses-you commented on PR #37284: URL: https://github.com/apache/spark/pull/37284#issuecomment-1202015625 @cloud-fan @viirya , addressed comments -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on pull request #37276: [SPARK-39835][SQL][3.1] Fix EliminateSorts remove global sort below the local sort
ulysses-you commented on PR #37276: URL: https://github.com/apache/spark/pull/37276#issuecomment-1202013868 @cloud-fan done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on pull request #37373: [SPARK-39911][SQL][3.3] Optimize global Sort to RepartitionByExpression
ulysses-you commented on PR #37373: URL: https://github.com/apache/spark/pull/37373#issuecomment-1202011636 cc @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you opened a new pull request, #37373: [SPARK-39911][SQL][3.3] Optimize global Sort to RepartitionByExpression
ulysses-you opened a new pull request, #37373: URL: https://github.com/apache/spark/pull/37373 this is for backport https://github.com/apache/spark/pull/37330 into branch-3.3 ### What changes were proposed in this pull request? Optimize Global sort to RepartitionByExpression, for example: ``` Sort local Sort local Sort global=> RepartitionByExpression ``` ### Why are the changes needed? If a global sort below a local sort, the only meaningful thing is it's distribution. So this pr optimizes that global sort to RepartitionByExpression to save a local sort. ### Does this PR introduce _any_ user-facing change? no, only improve performance ### How was this patch tested? add test Closes #37330 from ulysses-you/optimize-sort. Authored-by: ulysses-you Signed-off-by: Wenchen Fan ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #37369: [SPARK-39942][PYTHON][PS] Need to verify the input nums is integer in nsmallest func
zhengruifeng commented on PR #37369: URL: https://github.com/apache/spark/pull/37369#issuecomment-1202009196 I think this is a more general problem: ``` def nsmallest(self, n: int = 5) ``` this method already hint the input type, but given a boolean it will fail with a `AnalysisException` (which should be a `TypeError`). I guess it could accept a floating number like 5.5, and run successfully. I did a quick investigation of [`typing`](https://docs.python.org/3/library/typing.html#module-typing), there is an annotation `@typing.runtime_checkable`, but only works for `class typing.Protocol` is there some method to enforce the input type checking? @zero323 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
beliefer commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r935094921 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") Review Comment: OK -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
beliefer commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r935093496 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("DEPT", "IS_MANAGER").sum("SALARY") + .orderBy("DEPT", "IS_MANAGER") + .limit(1) +checkSortRemoved(df4) +checkLimitRemoved(df4) +checkPushedInfo(df4, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df4, Seq(Row(1, false, 9000.00))) + +val df5 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY") + .groupBy("my_dept", "my_manager").sum("SALARY") + .orderBy("my_dept", "my_manager") + .limit(1) +checkSortRemoved(df5) +checkLimitRemoved(df5) +checkPushedInfo(df5, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df5, Seq(Row(1, false, 9000.00))) + +val df6 = spark.read + .table("h2.test.employee") + .select($"SALARY", $"IS_MANAGER", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key", "IS_MANAGER").sum("SALARY") + .orderBy("key", "IS_MANAGER") + .limit(1) +checkSortRemoved(df6) +checkLimitRemoved(df6) +checkPushedInfo(df6, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END, " + +"IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df6, Seq(Row(0.00, false, 12000.00))) + +val df7 = spark.read + .table("h2.test.employee") + .select($"DEPT", $"SALARY") + .groupBy("dept").agg(sum("SALARY")) + .orderBy(sum("SALARY")) Review Comment: I added the comments into `V2ScanRelationPushDown` directly. ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", +
[GitHub] [spark] sumeetgajjar commented on a diff in pull request #37325: [SPARK-39902][SQL] Add Scan details to spark plan scan node in SparkUI
sumeetgajjar commented on code in PR #37325: URL: https://github.com/apache/spark/pull/37325#discussion_r935094151 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala: ## @@ -132,4 +132,10 @@ case class BatchScanExec( val result = s"$nodeName$truncatedOutputString ${scan.description()} $runtimeFiltersString" redact(result) } + + /** + * Returns the name of this type of TreeNode. Defaults to the class name. + * Note that we remove the "Exec" suffix for physical operators here. + */ + override def nodeName: String = s"BatchScan ${scan.name()}" Review Comment: > Can we simply use v2 `Table.name`? Hi @cloud-fan - thank you for your comment, can you please tell me how can we use v2 `Table#name()` here? `Scan` or `BatchScanExec` does not have a reference to `Table`. Thus I decided to use `Scan#name()`. I would be happy to publish a patch that uses v2 `Table#name()` if you can please tell me how it can be accessed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37329: [SPARK-39832][PYTHON] Support column arguments in regexp_replace
HyukjinKwon commented on PR #37329: URL: https://github.com/apache/spark/pull/37329#issuecomment-1201990422 @physinet mind enabling https://github.com/physinet/spark/actions/workflows/build_main.yml and rebasing please? Apache Spark leverages the Github resources from the PR author's fork. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
beliefer commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r935093496 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("DEPT", "IS_MANAGER").sum("SALARY") + .orderBy("DEPT", "IS_MANAGER") + .limit(1) +checkSortRemoved(df4) +checkLimitRemoved(df4) +checkPushedInfo(df4, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df4, Seq(Row(1, false, 9000.00))) + +val df5 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY") + .groupBy("my_dept", "my_manager").sum("SALARY") + .orderBy("my_dept", "my_manager") + .limit(1) +checkSortRemoved(df5) +checkLimitRemoved(df5) +checkPushedInfo(df5, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df5, Seq(Row(1, false, 9000.00))) + +val df6 = spark.read + .table("h2.test.employee") + .select($"SALARY", $"IS_MANAGER", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key", "IS_MANAGER").sum("SALARY") + .orderBy("key", "IS_MANAGER") + .limit(1) +checkSortRemoved(df6) +checkLimitRemoved(df6) +checkPushedInfo(df6, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END, " + +"IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df6, Seq(Row(0.00, false, 12000.00))) + +val df7 = spark.read + .table("h2.test.employee") + .select($"DEPT", $"SALARY") + .groupBy("dept").agg(sum("SALARY")) + .orderBy(sum("SALARY")) Review Comment: OK -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #37341: [SPARK-38639][HIVE]Ignore the corrupted rows that failed to deserialize in hive sequence table
AmplabJenkins commented on PR #37341: URL: https://github.com/apache/spark/pull/37341#issuecomment-1201989368 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on pull request #37295: [SPARK-39873][SQL] Remove `OptimizeLimitZero` and merge it into `EliminateLimits`
beliefer commented on PR #37295: URL: https://github.com/apache/spark/pull/37295#issuecomment-1201989202 @cloud-fan @gengliangwang @dongjoon-hyun Thank you ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wForget commented on pull request #37346: [SPARK-37210][CORE][SQL] Allow forced use of staging directory
wForget commented on PR #37346: URL: https://github.com/apache/spark/pull/37346#issuecomment-1201989180 > Why it is an issue particular for `InsertIntoHadoopFsRelationCommand`? `InsertIntoHiveTable` always uses hive staging dir https://github.com/apache/spark/blob/b0c831d3408dddfbbf3acacbe8100a9e08b400de/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L107 `InsertIntoHadoopFsRelationCommand` only uses spark staging dir in dynamic overwrite mode, otherwise it uses `table_location/_temporary` which leads to concurrency conflicts. https://github.com/apache/spark/blob/b0c831d3408dddfbbf3acacbe8100a9e08b400de/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L171 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37355: [SPARK-39930][SQL] Introduce Cache Hints
cloud-fan commented on code in PR #37355: URL: https://github.com/apache/spark/pull/37355#discussion_r935089294 ## sql/core/src/test/scala/org/apache/spark/sql/CacheHintsSuite.scala: ## @@ -0,0 +1,75 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.sql.catalyst.plans.PlanTest +import org.apache.spark.sql.execution.CachedData +import org.apache.spark.sql.execution.analysis.ResolveCacheHints.NO_RESULT_CACHE_TAG +import org.apache.spark.sql.test.SharedSparkSession + +class CacheHintsSuite extends PlanTest with SharedSparkSession { + private def lookupCachedData(df: DataFrame): Option[CachedData] = { +spark.sharedState.cacheManager.lookupCachedData(df) + } + + test("RESULT_CACHE Hint") { +val e = intercept[AnalysisException]( + sql("SELECT /*+ RESULT_CACHE(abc) */ id from range(0, 1)").collect()) Review Comment: is it the same as `CACHE TABLE abc AS SELECT id from range(0,1)`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #37341: [SPARK-38639][HIVE]Ignore the corrupted rows that failed to deserialize in hive sequence table
cloud-fan commented on PR #37341: URL: https://github.com/apache/spark/pull/37341#issuecomment-1201981954 Does Hive have this feature? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37355: [SPARK-39930][SQL] Introduce Cache Hints
HyukjinKwon commented on PR #37355: URL: https://github.com/apache/spark/pull/37355#issuecomment-1201980467 I like the idea. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #37330: [SPARK-39911][SQL] Optimize global Sort to RepartitionByExpression
cloud-fan commented on PR #37330: URL: https://github.com/apache/spark/pull/37330#issuecomment-1201979396 @ulysses-you can you open a backport PR for 3.3? I think this is a necessary followup of https://github.com/apache/spark/pull/37250 to avoid perf regression. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #37276: [SPARK-39835][SQL][3.1] Fix EliminateSorts remove global sort below the local sort
cloud-fan commented on PR #37276: URL: https://github.com/apache/spark/pull/37276#issuecomment-1201978590 to avoid any perf regression, can we include the global sort -> repartition optimization? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #37358: [SPARK-39932][SQL] WindowExec should clear the final partition buffer
cloud-fan commented on PR #37358: URL: https://github.com/apache/spark/pull/37358#issuecomment-1201978144 also cc @maryannxue @hvanhovell -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a diff in pull request #37368: [SPARK-39940][SS] Refresh catalog table on streaming query with DSv1 sink
viirya commented on code in PR #37368: URL: https://github.com/apache/spark/pull/37368#discussion_r935082833 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataStreamTableAPISuite.scala: ## @@ -445,6 +445,44 @@ class DataStreamTableAPISuite extends StreamTest with BeforeAndAfter { } } + test("SPARK-39940: refresh table when streaming query writes to the catalog table via DSv1") { +withTable("tbl1", "tbl2") { + withTempDir { dir => +val baseTbls = new File(dir, "tables") +val tbl1File = new File(baseTbls, "tbl1") +val tbl2File = new File(baseTbls, "tbl2") +val checkpointLocation = new File(dir, "checkpoint") + +val format = "parquet" +Seq((1, 2)).toDF("i", "d") + .write.format(format).option("path", tbl1File.getCanonicalPath).saveAsTable("tbl1") + +val query = spark.readStream.format(format).table("tbl1") + .writeStream.format(format) + .option("checkpointLocation", checkpointLocation.getCanonicalPath) + .option("path", tbl2File.getCanonicalPath) + .toTable("tbl2") + +try { + query.processAllAvailable() + checkAnswer(spark.table("tbl2").sort($"i"), Seq(Row(1, 2))) + + Seq((3, 4)).toDF("i", "d") +.write.format(format).option("path", tbl1File.getCanonicalPath) +.mode(SaveMode.Append).saveAsTable("tbl1") + + query.processAllAvailable() + checkAnswer(spark.table("tbl2").sort($"i"), Seq(Row(1, 2), Row(3, 4))) + + assert(query.exception.isEmpty, s"No exception should happen in streaming query: " + Review Comment: ```suggestion assert(query.exception.isEmpty, "No exception should happen in streaming query: " + ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37365: [SPARK-39938][PYTHON][PS] Accept all inputs of prefix/suffix which implement __str__ in add_predix/add_suffix
HyukjinKwon commented on PR #37365: URL: https://github.com/apache/spark/pull/37365#issuecomment-1201973069 cc @xinrong-meng @itholic @zhengruifeng FYI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37366: [SPARK-39939][PYTHON][PS] return self.copy during calling shift with period == 0
HyukjinKwon commented on PR #37366: URL: https://github.com/apache/spark/pull/37366#issuecomment-1201972652 cc @itholic @xinrong-meng @zhengruifeng FYI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37367: [SPARK-39941][PYTHON][PS] window and min_periods should be integer in rolling func
HyukjinKwon commented on PR #37367: URL: https://github.com/apache/spark/pull/37367#issuecomment-1201972549 cc @zhengruifeng @xinrong-meng @itholic FYI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37369: [SPARK-39942][PYTHON][PS] Need to verify the input nums is integer in nsmallest func
HyukjinKwon commented on PR #37369: URL: https://github.com/apache/spark/pull/37369#issuecomment-1201971564 cc @itholic @xinrong-meng @zhengruifeng FYI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #37356: [SPARK-39877][PYTHON][FOLLOW-UP] Add DataFrame melt to PySpark docs
HyukjinKwon closed pull request #37356: [SPARK-39877][PYTHON][FOLLOW-UP] Add DataFrame melt to PySpark docs URL: https://github.com/apache/spark/pull/37356 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37356: [SPARK-39877][PYTHON][FOLLOW-UP] Add DataFrame melt to PySpark docs
HyukjinKwon commented on PR #37356: URL: https://github.com/apache/spark/pull/37356#issuecomment-1201971011 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #37351: [SPARK-38864][SQL][FOLLOW-UP] Make AnalysisException message deterministic
HyukjinKwon closed pull request #37351: [SPARK-38864][SQL][FOLLOW-UP] Make AnalysisException message deterministic URL: https://github.com/apache/spark/pull/37351 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37351: [SPARK-38864][SQL][FOLLOW-UP] Make AnalysisException message deterministic
HyukjinKwon commented on PR #37351: URL: https://github.com/apache/spark/pull/37351#issuecomment-1201970303 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37350: [SPARK-39900][SQL] Address partial or negated condition in binary format's predicate pushdown
HyukjinKwon commented on PR #37350: URL: https://github.com/apache/spark/pull/37350#issuecomment-1201968982 cc @WeichenXu123 too -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37350: SPARK-39900 : Issue with querying dataframe produced by 'binaryFile' …
HyukjinKwon commented on PR #37350: URL: https://github.com/apache/spark/pull/37350#issuecomment-1201968566 LGTM otherwise. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37169: [SPARK-38901][SQL] DS V2 supports push down misc functions
cloud-fan commented on code in PR #37169: URL: https://github.com/apache/spark/pull/37169#discussion_r935075376 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -1115,6 +1115,69 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df8, Seq(Row("alex"))) } + test("scan with filter push-down with misc functions") { +Seq(false, true).foreach { ansiMode => + withSQLConf(SQLConf.ANSI_ENABLED.key -> ansiMode.toString) { Review Comment: Can we avoid the cast? Let's add a binary type column to the testing table. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang opened a new pull request, #37372: [SPARK-39944][BUILD] Upgrade dropwizard metrics to 4.2.10
LuciferYang opened a new pull request, #37372: URL: https://github.com/apache/spark/pull/37372 ### What changes were proposed in this pull request? This pr upgrade dropwizard metrics from 4.2.7 to 4.2.10. ### Why are the changes needed? There are 3 versions after 4.2.7, the release notes as follows: - https://github.com/dropwizard/metrics/releases/tag/v4.2.8 - https://github.com/dropwizard/metrics/releases/tag/v4.2.9 - https://github.com/dropwizard/metrics/releases/tag/v4.2.10 The new version brings a new API for more type safe of registering gauges(https://github.com/dropwizard/metrics/pull/2642) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37360: [SPARK-39931][PYTHON][WIP] Improve applyInPandas performance for very small groups
HyukjinKwon commented on PR #37360: URL: https://github.com/apache/spark/pull/37360#issuecomment-1201963930 Hm, the general idea might be fine but I think the implementation is the problem. For example, the current design is that the user defined `function` always takes one group for `pdf`. To keep this behaviour, you should send the multiple groups into one, and apply the same function multiple times for each group. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #37295: [SPARK-39873][SQL] Remove `OptimizeLimitZero` and merge it into `EliminateLimits`
cloud-fan closed pull request #37295: [SPARK-39873][SQL] Remove `OptimizeLimitZero` and merge it into `EliminateLimits` URL: https://github.com/apache/spark/pull/37295 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #37295: [SPARK-39873][SQL] Remove `OptimizeLimitZero` and merge it into `EliminateLimits`
cloud-fan commented on PR #37295: URL: https://github.com/apache/spark/pull/37295#issuecomment-1201963450 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk closed pull request #37357: [SPARK-39933][SQL][TESTS] Check query context by `checkError()`
MaxGekk closed pull request #37357: [SPARK-39933][SQL][TESTS] Check query context by `checkError()` URL: https://github.com/apache/spark/pull/37357 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #37357: [SPARK-39933][SQL][TESTS] Check query context by `checkError()`
MaxGekk commented on PR #37357: URL: https://github.com/apache/spark/pull/37357#issuecomment-1201961100 Merging to master. Thank you, @cloud-fan for review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] williamhyun opened a new pull request, #37371: [SPARK-39945][BUILD] Upgrade sbt-mima-plugin to 1.1.0
williamhyun opened a new pull request, #37371: URL: https://github.com/apache/spark/pull/37371 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pralabhkumar commented on pull request #37203: [SPARK-39755][CORE] Randomization in Spark local directory for K8 resource managers
pralabhkumar commented on PR #37203: URL: https://github.com/apache/spark/pull/37203#issuecomment-1201957796 @HyukjinKwon @cloud-fan @dongjoon-hyun Please find some time to review the PR . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] williamhyun opened a new pull request, #37370: [SPARK-39943][BUILD] Upgrade rocksdbjni to 7.4.4
williamhyun opened a new pull request, #37370: URL: https://github.com/apache/spark/pull/37370 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #36200: [SPARK-38909][CORE][YARN] Encapsulate `LevelDB` used by `ExternalShuffleBlockResolver` and `YarnShuffleService` as `DB`
LuciferYang commented on PR #36200: URL: https://github.com/apache/spark/pull/36200#issuecomment-1201952802 thanks @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r935064814 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") Review Comment: can we also test `.orderBy($"key" + 100)`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r935064627 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("DEPT", "IS_MANAGER").sum("SALARY") + .orderBy("DEPT", "IS_MANAGER") + .limit(1) +checkSortRemoved(df4) +checkLimitRemoved(df4) +checkPushedInfo(df4, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df4, Seq(Row(1, false, 9000.00))) + +val df5 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY") + .groupBy("my_dept", "my_manager").sum("SALARY") + .orderBy("my_dept", "my_manager") + .limit(1) +checkSortRemoved(df5) +checkLimitRemoved(df5) +checkPushedInfo(df5, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df5, Seq(Row(1, false, 9000.00))) + +val df6 = spark.read + .table("h2.test.employee") + .select($"SALARY", $"IS_MANAGER", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key", "IS_MANAGER").sum("SALARY") + .orderBy("key", "IS_MANAGER") + .limit(1) +checkSortRemoved(df6) +checkLimitRemoved(df6) +checkPushedInfo(df6, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END, " + +"IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df6, Seq(Row(0.00, false, 12000.00))) + +val df7 = spark.read + .table("h2.test.employee") + .select($"DEPT", $"SALARY") + .groupBy("dept").agg(sum("SALARY")) + .orderBy(sum("SALARY")) + .limit(1) +checkSortRemoved(df7, false) +checkLimitRemoved(df7, false) +checkPushedInfo(df7, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []") +checkAnswer(df7, Seq(Row(6, 12000.00))) + +val df8 = spark.read + .table("h2.test.employee") + .select($"DEPT", $"SALARY") + .groupBy("dept").agg(sum("SALARY").as("total")) + .orderBy("total") + .limit(1) +checkSortRemoved(df8, false) +checkLimitRemoved(df8, false) +checkPushedInfo(df8, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []") +checkAnswer(df8, Seq(Row(6, 12000.00))) + } + +
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r935064235 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("DEPT", "IS_MANAGER").sum("SALARY") + .orderBy("DEPT", "IS_MANAGER") + .limit(1) +checkSortRemoved(df4) +checkLimitRemoved(df4) +checkPushedInfo(df4, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df4, Seq(Row(1, false, 9000.00))) + +val df5 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY") + .groupBy("my_dept", "my_manager").sum("SALARY") + .orderBy("my_dept", "my_manager") + .limit(1) +checkSortRemoved(df5) +checkLimitRemoved(df5) +checkPushedInfo(df5, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df5, Seq(Row(1, false, 9000.00))) + +val df6 = spark.read + .table("h2.test.employee") + .select($"SALARY", $"IS_MANAGER", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key", "IS_MANAGER").sum("SALARY") + .orderBy("key", "IS_MANAGER") + .limit(1) +checkSortRemoved(df6) +checkLimitRemoved(df6) +checkPushedInfo(df6, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END, " + +"IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df6, Seq(Row(0.00, false, 12000.00))) + +val df7 = spark.read + .table("h2.test.employee") + .select($"DEPT", $"SALARY") + .groupBy("dept").agg(sum("SALARY")) + .orderBy(sum("SALARY")) + .limit(1) +checkSortRemoved(df7, false) +checkLimitRemoved(df7, false) +checkPushedInfo(df7, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []") +checkAnswer(df7, Seq(Row(6, 12000.00))) + +val df8 = spark.read + .table("h2.test.employee") + .select($"DEPT", $"SALARY") + .groupBy("dept").agg(sum("SALARY").as("total")) Review Comment: ditto -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r935063965 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("DEPT", "IS_MANAGER").sum("SALARY") + .orderBy("DEPT", "IS_MANAGER") + .limit(1) +checkSortRemoved(df4) +checkLimitRemoved(df4) +checkPushedInfo(df4, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df4, Seq(Row(1, false, 9000.00))) + +val df5 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY") + .groupBy("my_dept", "my_manager").sum("SALARY") + .orderBy("my_dept", "my_manager") + .limit(1) +checkSortRemoved(df5) +checkLimitRemoved(df5) +checkPushedInfo(df5, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df5, Seq(Row(1, false, 9000.00))) + +val df6 = spark.read + .table("h2.test.employee") + .select($"SALARY", $"IS_MANAGER", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key", "IS_MANAGER").sum("SALARY") + .orderBy("key", "IS_MANAGER") + .limit(1) +checkSortRemoved(df6) +checkLimitRemoved(df6) +checkPushedInfo(df6, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END, " + +"IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df6, Seq(Row(0.00, false, 12000.00))) + +val df7 = spark.read + .table("h2.test.employee") + .select($"DEPT", $"SALARY") + .groupBy("dept").agg(sum("SALARY")) + .orderBy(sum("SALARY")) Review Comment: can we add some comments to explain why top n can't be pushed here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r935063573 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("DEPT", "IS_MANAGER").sum("SALARY") + .orderBy("DEPT", "IS_MANAGER") + .limit(1) +checkSortRemoved(df4) +checkLimitRemoved(df4) +checkPushedInfo(df4, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df4, Seq(Row(1, false, 9000.00))) + +val df5 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY") Review Comment: We already have tests to verify the alias, we don't need to test it again with 2 columns. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #37368: [SPARK-39940][SS] Refresh catalog table on streaming query with DSv1 sink
HeartSaVioR commented on PR #37368: URL: https://github.com/apache/spark/pull/37368#issuecomment-1201948170 cc. @cloud-fan @viirya @zsxwing @xuanyuanking Appreciate your review. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37368: [SPARK-39940][SS] Refresh catalog table on streaming query with DSv1 sink
HeartSaVioR commented on code in PR #37368: URL: https://github.com/apache/spark/pull/37368#discussion_r935061578 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala: ## @@ -680,7 +680,14 @@ class MicroBatchExecution( val batchSinkProgress: Option[StreamWriterCommitProgress] = reportTimeTaken("addBatch") { SQLExecution.withNewExecutionId(lastExecution) { sink match { - case s: Sink => s.addBatch(currentBatchId, nextBatch) + case s: Sink => +s.addBatch(currentBatchId, nextBatch) +// DSv2 write node has a mechanism to invalidate DSv2 relation, but there is no +// corresponding one for DSv1. Given we have an information of catalog table for sink, +// we can refresh the catalog table once the write has succeeded. +plan.catalogTable.foreach { tbl => Review Comment: Here I only deal with DSv1 path since DSv2 streaming writer node (`WriteToDataSourceV2Exec`) contains the mechanism of refreshing cache. I think this would cover the case of DSv2 table and I can't think of coordination between DSv2 sink and catalog table, but please correct me if I'm missing something. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r935061143 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -410,12 +413,21 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper { case s @ Sort(order, _, operation @ ScanOperation(project, filter, sHolder: ScanBuilderHolder)) // Without building the Scan, we do not know the resulting column names after aggregate // push-down, and thus can't push down Top-N which needs to know the ordering column names. -// TODO: we can support simple cases like GROUP BY columns directly and ORDER BY the same -// columns, which we know the resulting column names: the original table columns. -if sHolder.pushedAggregate.isEmpty && filter.isEmpty && +// In particular, we push down the simple cases like GROUP BY expressions directly and +// ORDER BY the same expressions, which we know the original table columns. +if filter.isEmpty && CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) => val aliasMap = getAliasMap(project) - val newOrder = order.map(replaceAlias(_, aliasMap)).asInstanceOf[Seq[SortOrder]] + val aliasReplacedOrder = order.map(replaceAlias(_, aliasMap)) + val newOrder = if (sHolder.pushedAggregate.isDefined) { +aliasReplacedOrder.map { Review Comment: let's add some comments here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r935060915 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -410,12 +413,21 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper { case s @ Sort(order, _, operation @ ScanOperation(project, filter, sHolder: ScanBuilderHolder)) // Without building the Scan, we do not know the resulting column names after aggregate // push-down, and thus can't push down Top-N which needs to know the ordering column names. -// TODO: we can support simple cases like GROUP BY columns directly and ORDER BY the same -// columns, which we know the resulting column names: the original table columns. -if sHolder.pushedAggregate.isEmpty && filter.isEmpty && +// In particular, we push down the simple cases like GROUP BY expressions directly and +// ORDER BY the same expressions, which we know the original table columns. Review Comment: I think we can remove all these comments here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bzhaoopenstack opened a new pull request, #37369: [SPARK-39942][PYTHON][PS] Need to verify the input nums is integer in nsmallest func
bzhaoopenstack opened a new pull request, #37369: URL: https://github.com/apache/spark/pull/37369 The input parameter of nsmallest should be validated as Integer. So I think we might miss this validation. And PySpark will raise Error when we input the strange types into nsmallest func. ### What changes were proposed in this pull request? validate the input num is integer type only. ### Why are the changes needed? PySpark will raise Error if we not limit the type. ··· >>> df = ps.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B']) >>> df.groupby(['A'])['B'] >>> df.groupby(['A'])['B'].nsmallest(True) Traceback (most recent call last): File "", line 1, in File "/home/spark/spark/python/pyspark/pandas/groupby.py", line 3598, in nsmallest sdf.withColumn(temp_rank_column, F.row_number().over(window)) File "/home/spark/spark/python/pyspark/sql/dataframe.py", line 2129, in filter jdf = self._jdf.filter(condition._jc) File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__ return_value = get_return_value( File "/home/spark/spark/python/pyspark/sql/utils.py", line 196, in deco raise converted from None pyspark.sql.utils.AnalysisException: cannot resolve '(__rank__ <= true)' due to data type mismatch: differing types in '(__rank__ <= true)' (int and boolean).; 'Filter (__rank__#4995 <= true) +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995] +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995, __rank__#4995] +- Window [row_number() windowspecdefinition(__index_level_0__#4988L, B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank__#4995], [__index_level_0__#4988L], [B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST] +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L] +- Project [A#4978L AS __index_level_0__#4988L, __index_level_0__#4977L AS __index_level_1__#4989L, B#4979L, __natural_order__#4983L] +- Project [__index_level_0__#4977L, A#4978L, B#4979L, monotonically_increasing_id() AS __natural_order__#4983L] +- LogicalRDD [__index_level_0__#4977L, A#4978L, B#4979L], false ··· ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? input non-Integer type will raise AssersionError during calling nsmallest func -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR opened a new pull request, #37368: [SPARK-39940][SS] Refresh catalog table on streaming query with DSv1 sink
HeartSaVioR opened a new pull request, #37368: URL: https://github.com/apache/spark/pull/37368 Credit to @pranavanand on figuring out the issue and providing the broken test code! ### What changes were proposed in this pull request? This PR proposes to refresh the destination catalog table when streaming query is writing to the catalog table via DSv1 sink. ### Why are the changes needed? It has been long standing issue that streaming query is not aware of catalog table (not the table brought by DSv2), hence no way for streaming query to refresh the catalog table after updating it. As a side effect of SPARK-39564, Spark brings up the information of destination catalog table into MicrobatchExecution, which enables to refresh table. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New UT. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37325: [SPARK-39902][SQL] Add Scan details to spark plan scan node in SparkUI
cloud-fan commented on code in PR #37325: URL: https://github.com/apache/spark/pull/37325#discussion_r935059096 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala: ## @@ -132,4 +132,10 @@ case class BatchScanExec( val result = s"$nodeName$truncatedOutputString ${scan.description()} $runtimeFiltersString" redact(result) } + + /** + * Returns the name of this type of TreeNode. Defaults to the class name. + * Note that we remove the "Exec" suffix for physical operators here. + */ + override def nodeName: String = s"BatchScan ${scan.name()}" Review Comment: I don't think `Scan` can provide more information than `Table` for UI node name purposes. Even the time-travel version is also available in `Table`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37325: [SPARK-39902][SQL] Add Scan details to spark plan scan node in SparkUI
cloud-fan commented on code in PR #37325: URL: https://github.com/apache/spark/pull/37325#discussion_r935058486 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala: ## @@ -132,4 +132,10 @@ case class BatchScanExec( val result = s"$nodeName$truncatedOutputString ${scan.description()} $runtimeFiltersString" redact(result) } + + /** + * Returns the name of this type of TreeNode. Defaults to the class name. + * Note that we remove the "Exec" suffix for physical operators here. + */ + override def nodeName: String = s"BatchScan ${scan.name()}" Review Comment: Can we simply use v2 `Table.name`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #37350: SPARK-39900 : Issue with querying dataframe produced by 'binaryFile' …
AmplabJenkins commented on PR #37350: URL: https://github.com/apache/spark/pull/37350#issuecomment-1201940622 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #37351: [SPARK-38864][SQL][FOLLOW-UP] Make AnalysisException message deterministic
AmplabJenkins commented on PR #37351: URL: https://github.com/apache/spark/pull/37351#issuecomment-1201940599 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #37352: [SPARK-39927][BUILD] Upgrade to Avro 1.11.1
AmplabJenkins commented on PR #37352: URL: https://github.com/apache/spark/pull/37352#issuecomment-1201940582 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bzhaoopenstack opened a new pull request, #37367: [SPARK-39941][PYTHON][PS] period and min_periods should be integer in rolling func
bzhaoopenstack opened a new pull request, #37367: URL: https://github.com/apache/spark/pull/37367 window and min_periods parameters is not be validated in rolling function. ### What changes were proposed in this pull request? Validate the said 2 parameters to be a integer only in rolling func, will raise ValueError if they are not Integer ### Why are the changes needed? I guess we miss the validation towards those parameters. PySpark will raise Error during using other type inputs ``` >>> s = ps.Series([4, 3, 5, 2, 6]) >>> s.rolling(1).sum() 04 13 25 32 46 dtype: int64 >>> s.rolling('STRING').sum() Traceback (most recent call last): File "", line 1, in File "/home/spark/spark/python/pyspark/pandas/generic.py", line 2707, in rolling return Rolling(self, window=window, min_periods=min_periods) File "/home/spark/spark/python/pyspark/pandas/window.py", line 179, in __init__ super().__init__(window, min_periods) File "/home/spark/spark/python/pyspark/pandas/window.py", line 147, in __init__ if window < 0: TypeError: '<' not supported between instances of 'str' and 'int' >>> ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Input other types of window and min_periods in rolling func will raise ValueError -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] deshanxiao commented on pull request #37336: [SPARK-39916][SQL][MLLIB][REFACTOR] Merge ml SchemaUtils to SQL
deshanxiao commented on PR #37336: URL: https://github.com/apache/spark/pull/37336#issuecomment-1201931299 Very happy to receive these suggestions. Please let me know if anyone have any questions. Will close this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bzhaoopenstack opened a new pull request, #37366: [SPARK-39939][PYTHON][PS] return self.copy during calling shift with period == 0
bzhaoopenstack opened a new pull request, #37366: URL: https://github.com/apache/spark/pull/37366 PySpark raises Error when we call shift func with periods=0. The behavior of Pandas will return a same copy for the said obj. ### What changes were proposed in this pull request? Will return self.copy when period == 0 ### Why are the changes needed? Behaviors between PySpark and pandas are different PySpark: ``` >>> df = ps.DataFrame({'Col1': [10, 20, 15, 30, 45], 'Col2': [13, 23, 18, 33, 48],'Col3': [17, 27, 22, 37, 52]},columns=['Col1', 'Col2', 'Col3']) >>> df.Col1.shift(periods=3) 0 NaN 1 NaN 2 NaN 3 10.0 4 20.0 Name: Col1, dtype: float64 >>> df.Col1.shift(periods=0) Traceback (most recent call last): File "", line 1, in File "/home/spark/spark/python/pyspark/pandas/base.py", line 1170, in shift return self._shift(periods, fill_value).spark.analyzed File "/home/spark/spark/python/pyspark/pandas/spark/accessors.py", line 256, in analyzed return first_series(DataFrame(self._data._internal.resolved_copy)) File "/home/spark/spark/python/pyspark/pandas/utils.py", line 589, in wrapped_lazy_property setattr(self, attr_name, fn(self)) File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1173, in resolved_copy sdf = self.spark_frame.select(self.spark_columns + list(HIDDEN_COLUMNS)) File "/home/spark/spark/python/pyspark/sql/dataframe.py", line 2073, in select jdf = self._jdf.select(self._jcols(*cols)) File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__ return_value = get_return_value( File "/home/spark/spark/python/pyspark/sql/utils.py", line 196, in deco raise converted from None pyspark.sql.utils.AnalysisException: Cannot specify window frame for lag function ``` pandas: ``` >>> pdf = pd.DataFrame({'Col1': [10, 20, 15, 30, 45], 'Col2': [13, 23, 18, 33, 48],'Col3': [17, 27, 22, 37, 52]},columns=['Col1', 'Col2', 'Col3']) >>> pdf.Col1.shift(periods=3) 0 NaN 1 NaN 2 NaN 3 10.0 4 20.0 Name: Col1, dtype: float64 >>> pdf.Col1.shift(periods=0) 0 10 1 20 2 15 3 30 4 45 Name: Col1, dtype: int64 ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? call shift func with period == 0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
dongjoon-hyun commented on PR #37287: URL: https://github.com/apache/spark/pull/37287#issuecomment-1201919934 Got it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #37325: [SPARK-39902][SQL] Add Scan details to spark plan scan node in SparkUI
dongjoon-hyun closed pull request #37325: [SPARK-39902][SQL] Add Scan details to spark plan scan node in SparkUI URL: https://github.com/apache/spark/pull/37325 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bzhaoopenstack opened a new pull request, #37365: [SPARK-39938][python][PS] Accept all inputs of prefix/suffix which implement __str__ in add_predix/add_suffix
bzhaoopenstack opened a new pull request, #37365: URL: https://github.com/apache/spark/pull/37365 We need to follow the pandas behavior of prefix/suffix parameter validation in add_prefix/add_suffix. Now, we force to validate it as a String type. But pandas looks all values which can be formated as String(implement __str__ func). So it's different here. ### What changes were proposed in this pull request? We support all kind inputs which can be formated as string. ### Why are the changes needed? As pandas behavior is different with PySpark when we input other types into add_prefix/add_suffix funcs. PySpark ``` >>> from pyspark import pandas as ps >>> df = ps.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B']) >>> df.add_suffix(666) Traceback (most recent call last): File "", line 1, in File "/home/spark/spark/python/pyspark/pandas/frame.py", line 9060, in add_suffix assert isinstance(suffix, str) AssertionError >>> df.add_suffix(True) Traceback (most recent call last): File "", line 1, in File "/home/spark/spark/python/pyspark/pandas/frame.py", line 9060, in add_suffix assert isinstance(suffix, str) AssertionError ``` Pandas: 1.3.X/1.4.X ``` >>> pdf.add_suffix(0.1) A0.1 B0.1 0 1 3 1 2 4 2 3 5 3 4 6 >>> pdf.add_suffix(True) ATrue BTrue 0 1 3 1 2 4 2 3 5 3 4 6 ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Input any can be stringable input into add_prefix/add_suffix funcs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Jeffreychen99 commented on a diff in pull request #37364: [SPARK-39936][SQL] Store schema in properties for Spark Views
Jeffreychen99 commented on code in PR #37364: URL: https://github.com/apache/spark/pull/37364#discussion_r935031668 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala: ## @@ -2386,10 +2385,6 @@ class HiveDDLSuite "CREATE TABLE t1 USING PARQUET AS SELECT NULL AS null_col", "Parquet data source does not support void data type") - assertAnalysisError( Review Comment: Oops - sorry forgot to add them back -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37364: [SPARK-39936][SQL] Store schema in properties for Spark Views
cloud-fan commented on code in PR #37364: URL: https://github.com/apache/spark/pull/37364#discussion_r935030049 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala: ## @@ -2386,10 +2385,6 @@ class HiveDDLSuite "CREATE TABLE t1 USING PARQUET AS SELECT NULL AS null_col", "Parquet data source does not support void data type") - assertAnalysisError( Review Comment: do we still need to remove this test? ## sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala: ## @@ -2400,10 +2395,6 @@ class HiveDDLSuite "CREATE TABLE t1 (v VOID) USING PARQUET", "Parquet data source does not support void data type") - assertAnalysisError( Review Comment: ditto -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
cloud-fan commented on PR #37287: URL: https://github.com/apache/spark/pull/37287#issuecomment-1201897650 This is another instance that `db.tbl` is inconsistent of `spark_catalog.db.tbl`. This PR fixed it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36200: [SPARK-38909][CORE][YARN] Encapsulate `LevelDB` used by `ExternalShuffleBlockResolver` and `YarnShuffleService` as `DB`
dongjoon-hyun commented on PR #36200: URL: https://github.com/apache/spark/pull/36200#issuecomment-1201896768 Although I couldn't take a look at this, you can ask YARN committers' help if you want to make this proceed, @LuciferYang . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ivoson commented on pull request #37268: [SPARK-39853][CORE] Support stage level task resource schedule for standalone cluster when dynamic allocation disabled
ivoson commented on PR #37268: URL: https://github.com/apache/spark/pull/37268#issuecomment-1201892453 > so I would like to see the issue or this PR description have much more details about design, API, and its behavior. For instance: > > ``` > Does this PR introduce any user-facing change? > > No > ``` > > This is not true, this is very much user impacting API. > > Also the issue I linked to talked about the policy for reusing executors. What is the proposal for that here? What happens if the task resource specified doesn't fit in the executor resources? I'd like to make sure we come to agreement on what this api is and does and user impact of that. Does this api also apply to reusing executor with dynamic allocation on... as commented above seems limiting. This is a public API so I want to make sure the changes work in various situations. > > Docs will need to update to explain new API and behavior. Thanks @tgravescs , will update the doc and then we can discuss. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37364: [SPARK-39936] Store schema in properties for Spark Views
cloud-fan commented on code in PR #37364: URL: https://github.com/apache/spark/pull/37364#discussion_r935011158 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSourceSuite.scala: ## @@ -378,4 +378,15 @@ class HiveParquetSourceSuite extends ParquetPartitioningTest with ParquetTest { } } } + + test("Create view with dashes in column type") { +sql( + """ +|CREATE TABLE tbl (f array>) USING parquet +|""".stripMargin) +sql( + """ +|CREATE VIEW tblView AS SELECT f FROM tbl Review Comment: and let's make the test complete ``` withView("t") { sql("CREATE VIEW t AS SELECT STRUCT('a' AS $a, 1 AS b) q") checkAnswer(spark.table("t"), Row(Row("a", 1))) } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37364: [SPARK-39936] Store schema in properties for Spark Views
cloud-fan commented on code in PR #37364: URL: https://github.com/apache/spark/pull/37364#discussion_r935010782 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSourceSuite.scala: ## @@ -378,4 +378,15 @@ class HiveParquetSourceSuite extends ParquetPartitioningTest with ParquetTest { } } } + + test("Create view with dashes in column type") { +sql( + """ +|CREATE TABLE tbl (f array>) USING parquet +|""".stripMargin) +sql( + """ +|CREATE VIEW tblView AS SELECT f FROM tbl Review Comment: we can simply use the existing test case `CREATE VIEW t AS SELECT STRUCT('a' AS `$a`, 1 AS b) q` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37364: [SPARK-39936] Store schema in properties for Spark Views
cloud-fan commented on code in PR #37364: URL: https://github.com/apache/spark/pull/37364#discussion_r935010522 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala: ## @@ -160,18 +160,6 @@ class HiveCatalogedDDLSuite extends DDLSuite with TestHiveSingleton with BeforeA } test("SPARK-22431: illegal nested type") { -val queries = Seq( - "CREATE TABLE t USING hive AS SELECT STRUCT('a' AS `$a`, 1 AS b) q", - "CREATE TABLE t(q STRUCT<`$a`:INT, col2:STRING>, i1 INT) USING hive", - "CREATE VIEW t AS SELECT STRUCT('a' AS `$a`, 1 AS b) q") Review Comment: I think we only need to remove this line from the test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37364: [SPARK-39936] Store schema in properties for Spark Views
cloud-fan commented on code in PR #37364: URL: https://github.com/apache/spark/pull/37364#discussion_r935010178 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala: ## @@ -283,7 +283,22 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat // columns etc. in table properties, so that we can work around the Hive metastore issue // about not case preserving and make Hive serde table and view support mixed-case column // names. -properties = tableDefinition.properties ++ tableMetaToTableProps(tableDefinition)) +properties = tableDefinition.properties ++ tableMetaToTableProps(tableDefinition) + ) + try { +client.createTable(tableWithDataSourceProps, ignoreIfExists) + } catch { +case NonFatal(e) => + // If for some reason we fail to store the schema we store it as empty there + // since we already store the real schema in the table properties. This try-catch + // should only be necessary for Spark views which are incompatible with Hive + client.createTable( +tableWithDataSourceProps.copy( + schema = StructType(EMPTY_DATA_SCHEMA ++ tableDefinition.partitionSchema) Review Comment: A view cannot have partition columns, so here can simply be `new StructType()` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37364: [SPARK-39936] Store schema in properties for Spark Views
cloud-fan commented on code in PR #37364: URL: https://github.com/apache/spark/pull/37364#discussion_r93501 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala: ## @@ -283,7 +283,22 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat // columns etc. in table properties, so that we can work around the Hive metastore issue // about not case preserving and make Hive serde table and view support mixed-case column // names. -properties = tableDefinition.properties ++ tableMetaToTableProps(tableDefinition)) +properties = tableDefinition.properties ++ tableMetaToTableProps(tableDefinition) + ) + try { +client.createTable(tableWithDataSourceProps, ignoreIfExists) + } catch { +case NonFatal(e) => Review Comment: ```suggestion case NonFatal(e) if ...is_view... => ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
amaliujia commented on PR #37287: URL: https://github.com/apache/spark/pull/37287#issuecomment-1201853151 The test is failing for example on ``` Expected: Database(name='default', catalog=None, description='default database', ... Got: Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/__w/spark/spark/python/target/4ab5b07b-a1fd-4be3-b29b-6bfc1ac33d6d/e75ca4c7-9d5e-409d-b319-fba011a1ad51') ``` I think the actual result is expected as of now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Jeffreychen99 opened a new pull request, #37364: [SPARK-39936] Store schema in properties for Spark Views
Jeffreychen99 opened a new pull request, #37364: URL: https://github.com/apache/spark/pull/37364 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #37356: [SPARK-39877][PYTHON][FOLLOW-UP] Add DataFrame melt to PySpark docs
AmplabJenkins commented on PR #37356: URL: https://github.com/apache/spark/pull/37356#issuecomment-1201816031 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] iemejia commented on pull request #37352: [SPARK-39927][BUILD] Upgrade to Avro 1.11.1
iemejia commented on PR #37352: URL: https://github.com/apache/spark/pull/37352#issuecomment-1201761684 @xkrogen It is indeed. The announcement has not gone out yet but the binaries are already available. https://lists.apache.org/thread/8tk92x804owmjbvtj57xxzwhxn1qh3ho -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #37346: [SPARK-37210][CORE][SQL] Allow forced use of staging directory
dongjoon-hyun commented on PR #37346: URL: https://github.com/apache/spark/pull/37346#issuecomment-1201742819 Thank you for making a PR, @wForget . To @viirya and @sunchao . This issue has a reproducible example in the JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #37337: [SPARK-39917][SQL] Use different error classes for numeric/interval arithmetic overflow
dongjoon-hyun commented on PR #37337: URL: https://github.com/apache/spark/pull/37337#issuecomment-1201728069 Thank you for swift recovering `master` branch, @gengliangwang . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on pull request #37337: [SPARK-39917][SQL] Use different error classes for numeric/interval arithmetic overflow
gengliangwang commented on PR #37337: URL: https://github.com/apache/spark/pull/37337#issuecomment-1201718125 The master branch failed to compile after merging this PR: https://github.com/apache/spark/runs/7616787860?check_suite_focus=true I am reverting it to unblock the development of other contributors. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #37359: [SPARK-25342][CORE][SQL]Support rolling back a result stage and rerunning all result tasks when writing files
AmplabJenkins commented on PR #37359: URL: https://github.com/apache/spark/pull/37359#issuecomment-1201674509 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #37360: [SPARK-39931][PYTHON][WIP] Improve applyInPandas performance for very small groups
AmplabJenkins commented on PR #37360: URL: https://github.com/apache/spark/pull/37360#issuecomment-1201674443 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #37361: [SPARK-39925][SQL] Add array_sort(column, comparator) overload to DataFrame operations
AmplabJenkins commented on PR #37361: URL: https://github.com/apache/spark/pull/37361#issuecomment-1201674401 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #37362: WIP: Revert "[SPARK-33933][SQL] Materialize BroadcastQueryStage first to try to avoid broadcast timeout in AQE"
AmplabJenkins commented on PR #37362: URL: https://github.com/apache/spark/pull/37362#issuecomment-1201674356 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on pull request #37254: [SPARK-39841][SQL] simplify conflict binary comparison
gengliangwang commented on PR #37254: URL: https://github.com/apache/spark/pull/37254#issuecomment-1201666315 cc @sigmod as well -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org