[GitHub] [spark] dependabot[bot] opened a new pull request, #37425: Bump postgresql from 42.3.3 to 42.4.1
dependabot[bot] opened a new pull request, #37425: URL: https://github.com/apache/spark/pull/37425

Bumps postgresql (https://github.com/pgjdbc/pgjdbc) from 42.3.3 to 42.4.1.

Release notes, sourced from postgresql's releases (https://github.com/pgjdbc/pgjdbc/releases):

42.4.0

What's Changed
- Enhancement: Made TimestampUtils.utcTz static and renamed to UTC_TIMEZONE by @svendiedrichsen in pgjdbc/pgjdbc#2519
- fix: return correct base type for domain from getUDTs (#2520) by @alurie in pgjdbc/pgjdbc#2522
- fix: support queries with up to 65535 (inclusive) parameters by @vlsi in pgjdbc/pgjdbc#2525
- chore: use META-INF/licenses/$group/$artifact-$version/... folder for licenses by @vlsi in pgjdbc/pgjdbc#2531
- fix: added GROUP_STARTUP_PARAMETERS boolean property to determine whether or not to group startup parameters in a transaction; fixes Issue 2423 (pgbouncer cannot deal with transactions in statement pooling mode) by @davecramer in pgjdbc/pgjdbc#2425
- chore: Make the readme version agnostic by @jorsol in pgjdbc/pgjdbc#2540
- Release notes 42.4.0 by @davecramer in pgjdbc/pgjdbc#2541

New Contributors
- @svendiedrichsen made their first contribution in pgjdbc/pgjdbc#2519

Full Changelog: https://github.com/pgjdbc/pgjdbc/compare/REL42.3.6...REL42.4.0

Changelog, sourced from postgresql's changelog (https://github.com/pgjdbc/pgjdbc/blob/master/CHANGELOG.md):

Notable changes since version 42.0.0; read the complete History of Changes at https://jdbc.postgresql.org/documentation/changelog.html. The format is based on Keep a Changelog (http://keepachangelog.com/en/1.0.0/).

[Unreleased]
Changed
Added
Fixed

[42.4.1] (2022-08-01 16:24:20 -0400)

Security
- fix: CVE-2022-31197. Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so as to prevent SQL injection. Previously, the column names for both key and data columns in the table were copied as-is into the generated SQL. This allowed a malicious table with column names that include a statement terminator to be parsed and executed as multiple separate commands. Also adds a new test class ResultSetRefreshTest to verify this change. Reported by Sho Kato (@kato-sho).

Changed
- chore: skip publishing pgjdbc-osgi-test to Central
- chore: bump Gradle to 7.5
- test: update JUnit to 5.8.2

Added
- chore: added Gradle Wrapper Validation for verifying gradle-wrapper.jar
- chore: added "permissions: contents: read" for GitHub Actions to avoid unintentional modifications by the CI
- chore: support building pgjdbc with Java 17

Fixed

[42.4.0] (2022-06-09 08:14:02 -0400)

Changed
- fix: added GROUP_STARTUP_PARAMETERS boolean property to determine whether or not to group startup parameters in a transaction (default=false like 42.2.x); fixes Issue #2425 (pgbouncer cannot deal with transactions in statement pooling mode) [PR pgjdbc/pgjdbc#2425]

Fixed
- fix: queries with up to 65535 (inclusive) parameters are supported now (previous limit was 32767) [PR pgjdbc/pgjdbc#2525, Issue pgjdbc/pgjdbc#1311]
- fix: workaround JarIndex parsing issue by using groupId/artifactId-version directory namings. Regression since 42.2.13. [PR pgjdbc/pgjdbc#2531, issue
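The CVE-2022-31197 entry above concerns unescaped column identifiers being interpolated into generated SQL. As a hedged illustration of the idea, not pgjdbc's actual code, standard-SQL identifier quoting wraps the name in double quotes and doubles any embedded double quote, so a hostile column name stays a single inert identifier:

```scala
// Illustrative sketch of standard-SQL identifier quoting, the kind of escaping
// the CVE-2022-31197 fix applies to column names used in refresh()-generated SQL.
// This is NOT pgjdbc's actual implementation.
object IdentifierQuoting {
  def quoteIdentifier(name: String): String =
    "\"" + name.replace("\"", "\"\"") + "\""

  def main(args: Array[String]): Unit = {
    // A column name containing a statement terminator can no longer split the
    // generated statement into multiple commands once quoted.
    val hostile = "x\"; DROP TABLE t; --"
    println(quoteIdentifier(hostile))
  }
}
```

Before the fix, names like the one above were copied as-is into the generated SQL, which is exactly the injection vector described in the advisory.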
[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
beliefer commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r939485352

## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ##

@@ -864,6 +851,253 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel
     checkAnswer(df2, Seq(Row(2, "david", 1.00)))
   }
+  test("scan with aggregate push-down and top N push-down") {
+    val df1 = spark.read
+      .table("h2.test.employee")
+      .groupBy("DEPT").sum("SALARY")
+      .orderBy("DEPT")
+      .limit(1)
+    checkSortRemoved(df1)
+    checkLimitRemoved(df1)
+    checkPushedInfo(df1,
+      "PushedAggregates: [SUM(SALARY)]",
+      "PushedGroupByExpressions: [DEPT]",
+      "PushedFilters: []",
+      "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+    checkAnswer(df1, Seq(Row(1, 19000.00)))
+
+    val df2 = spark.read
+      .table("h2.test.employee")
+      .select($"DEPT".as("my_dept"), $"SALARY")
+      .groupBy("my_dept").sum("SALARY")
+      .orderBy("my_dept")
+      .limit(1)
+    checkSortRemoved(df2)
+    checkLimitRemoved(df2)
+    checkPushedInfo(df2,
+      "PushedAggregates: [SUM(SALARY)]",
+      "PushedGroupByExpressions: [DEPT]",
+      "PushedFilters: []",
+      "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+    checkAnswer(df2, Seq(Row(1, 19000.00)))
+
+    val df3 = spark.read
+      .table("h2.test.employee")
+      .select($"SALARY",
+        when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key"))
+      .groupBy("key").sum("SALARY")
+      .orderBy("key")
+      .limit(1)
+    checkSortRemoved(df3)
+    checkLimitRemoved(df3)
+    checkPushedInfo(df3,
+      "PushedAggregates: [SUM(SALARY)]",
+      "PushedGroupByExpressions: " +
+        "[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]",
+      "PushedFilters: []",
+      "PushedTopN: ORDER BY [" +
+        "CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " +
+        "ASC NULLS FIRST] LIMIT 1")
+    checkAnswer(df3, Seq(Row(0, 44000.00)))
+
+    val df4 = spark.read
+      .table("h2.test.employee")
+      .groupBy("dept").sum("SALARY")
+      .orderBy($"dept".gt(1))

Review Comment: OK

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
beliefer commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r939485083

## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ##

@@ -864,6 +851,253 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel
     checkAnswer(df2, Seq(Row(2, "david", 1.00)))
   }
+  test("scan with aggregate push-down and top N push-down") {
+    val df1 = spark.read
+      .table("h2.test.employee")
+      .groupBy("DEPT").sum("SALARY")
+      .orderBy("DEPT")
+      .limit(1)
+    checkSortRemoved(df1)
+    checkLimitRemoved(df1)
+    checkPushedInfo(df1,
+      "PushedAggregates: [SUM(SALARY)]",
+      "PushedGroupByExpressions: [DEPT]",
+      "PushedFilters: []",
+      "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+    checkAnswer(df1, Seq(Row(1, 19000.00)))
+
+    val df2 = spark.read
+      .table("h2.test.employee")
+      .select($"DEPT".as("my_dept"), $"SALARY")

Review Comment: OK
[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
beliefer commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r939479595

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ##

@@ -408,16 +411,27 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper {
         }
         (operation, isPushed && !isPartiallyPushed)
       case s @ Sort(order, _, operation @ ScanOperation(project, filter, sHolder: ScanBuilderHolder))
-        // Without building the Scan, we do not know the resulting column names after aggregate
-        // push-down, and thus can't push down Top-N which needs to know the ordering column names.
-        // TODO: we can support simple cases like GROUP BY columns directly and ORDER BY the same
-        // columns, which we know the resulting column names: the original table columns.
-        if sHolder.pushedAggregate.isEmpty && filter.isEmpty &&
-          CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) =>
+        if filter.isEmpty &&
+          CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) =>
         val aliasMap = getAliasMap(project)
-        val newOrder = order.map(replaceAlias(_, aliasMap)).asInstanceOf[Seq[SortOrder]]
+        val aliasReplacedOrder = order.map(replaceAlias(_, aliasMap))
+        val newOrder = if (sHolder.pushedAggregate.isDefined) {
+          // Without building the Scan, Aggregate push-down give the expected output starts with
+          // `group_col_` or `agg_func_`. When Aggregate push-down working with Sort for top n
+          // push-down, we need replace these expected output with the origin expressions.
+          aliasReplacedOrder.map {
+            _.transform {
+              case a: Attribute => sHolder.pushedAggOutputMap.getOrElse(a, a)
+            }.asInstanceOf[SortOrder]
+          }
+        } else {
+          aliasReplacedOrder.asInstanceOf[Seq[SortOrder]]
+        }
         val normalizedOrders = DataSourceStrategy.normalizeExprs(
           newOrder, sHolder.relation.output).asInstanceOf[Seq[SortOrder]]
+        // Because V2ExpressionBuilder can't translate aggregate functions, so we can't

Review Comment: OK
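The hunk above rewrites sort expressions that reference the synthetic aggregate output (attributes named like `group_col_0` or `agg_func_0`) back to the original expressions via `pushedAggOutputMap` before translation. A Spark-free sketch of that substitution, with a toy expression tree and purely illustrative names (these are not Spark's classes):

```scala
// Toy model of the attribute substitution in the diff above: sort keys that
// point at synthetic aggregate output are mapped back to original expressions.
// Attr/Col/Sum and all names below are illustrative, not Spark internals.
object SortKeyRewrite {
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Col(name: String) extends Expr
  final case class Sum(child: Expr) extends Expr

  // Replace attributes that appear in the push-down output map; leave the rest.
  def rewrite(order: Seq[Expr], pushedAggOutputMap: Map[Attr, Expr]): Seq[Expr] =
    order.map {
      case a: Attr => pushedAggOutputMap.getOrElse(a, a)
      case other   => other
    }

  def main(args: Array[String]): Unit = {
    val m: Map[Attr, Expr] = Map(
      Attr("group_col_0") -> Col("DEPT"),
      Attr("agg_func_0")  -> Sum(Col("SALARY")))
    println(rewrite(Seq(Attr("group_col_0"), Attr("agg_func_0")), m))
  }
}
```

The real rule performs the same lookup inside `transform` over each `SortOrder`, so nested references inside a sort expression are rewritten as well.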
[GitHub] [spark] AmplabJenkins commented on pull request #37408: [SPARK-39982][WIP] Add doc string to StructType.fromJson
AmplabJenkins commented on PR #37408: URL: https://github.com/apache/spark/pull/37408#issuecomment-1207133120 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #37404: [SPARK-39866][SQL] Memory leak when closing a session of Spark ThriftServer
AmplabJenkins commented on PR #37404: URL: https://github.com/apache/spark/pull/37404#issuecomment-1207133142 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #37409: [SPARK-39970][CORE] Introduce ThrottledLogger to prevent log message flooding caused by network issues
AmplabJenkins commented on PR #37409: URL: https://github.com/apache/spark/pull/37409#issuecomment-1207133115 Can one of the admins verify this patch?
[GitHub] [spark] beliefer commented on pull request #37389: [SPARK-39963][SQL] Simplify `SimplifyCasts.isWiderCast`
beliefer commented on PR #37389: URL: https://github.com/apache/spark/pull/37389#issuecomment-1207131840 @cloud-fan @dongjoon-hyun @viirya @gengliangwang Thank you! @xkrogen Thank you for your comments.
[GitHub] [spark] beliefer commented on pull request #37391: [SPARK-39964][SQL] DS V2 pushdown should unify the translate path
beliefer commented on PR #37391: URL: https://github.com/apache/spark/pull/37391#issuecomment-1207131948 @cloud-fan @huaxingao Thank you for your review.
[GitHub] [spark] AmplabJenkins commented on pull request #37411: [SPARK-39984][CORE] Check workerLastHeartbeat with master before HeartbeatReceiver expires an executor
AmplabJenkins commented on PR #37411: URL: https://github.com/apache/spark/pull/37411#issuecomment-1207116271 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #37413: [SPARK-39983][CORE][SQL] Do not cache unserialized broadcast relations on the driver
AmplabJenkins commented on PR #37413: URL: https://github.com/apache/spark/pull/37413#issuecomment-1207116255 Can one of the admins verify this patch?
[GitHub] [spark] github-actions[bot] commented on pull request #35948: [SPARK-38634][SQL] Support download data through thrift server
github-actions[bot] commented on PR #35948: URL: https://github.com/apache/spark/pull/35948#issuecomment-1207101576 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] HyukjinKwon closed pull request #37422: [SPARK-39992][BUILD] Upgrade `slf4j` to 1.7.36
HyukjinKwon closed pull request #37422: [SPARK-39992][BUILD] Upgrade `slf4j` to 1.7.36 URL: https://github.com/apache/spark/pull/37422
[GitHub] [spark] HyukjinKwon commented on pull request #37422: [SPARK-39992][BUILD] Upgrade `slf4j` to 1.7.36
HyukjinKwon commented on PR #37422: URL: https://github.com/apache/spark/pull/37422#issuecomment-1207092774 Merged to master.
[GitHub] [spark] AmplabJenkins commented on pull request #37416: [SPARK-39743][DOCS] Updated some spark.io.compression configuration descriptions to clarify parameter applica…
AmplabJenkins commented on PR #37416: URL: https://github.com/apache/spark/pull/37416#issuecomment-1207088912 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #37417: [SPARK-33782][K8s][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster
AmplabJenkins commented on PR #37417: URL: https://github.com/apache/spark/pull/37417#issuecomment-1207088896 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled
AmplabJenkins commented on PR #37419: URL: https://github.com/apache/spark/pull/37419#issuecomment-1207088877 Can one of the admins verify this patch?
[GitHub] [spark] amaliujia commented on a diff in pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
amaliujia commented on code in PR #37287: URL: https://github.com/apache/spark/pull/37287#discussion_r939283477

## sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala: ##

@@ -90,30 +92,31 @@ abstract class Catalog {
   /**
    * Returns a list of columns for the given table/view or temporary view.
    *
-   * @param tableName is either a qualified or unqualified name that designates a table/view.
-   *                  If no database identifier is provided, it refers to a temporary view or
-   *                  a table/view in the current database.
+   * @param tableName is either a qualified or unqualified name that designates a table/view. It
+   *                  follows the same resolution rule with SQL: search for temp views first then
+   *                  table/views in the current database (namespace).
    * @since 2.0.0
    */
   @throws[AnalysisException]("table does not exist")
   def listColumns(tableName: String): Dataset[Column]

   /**
-   * Returns a list of columns for the given table/view in the specified database.
+   * Returns a list of columns for the given table/view in the specified database under the Hive
+   * Metastore.

Review Comment: +1, this looks nice to explicitly say HMS
[GitHub] [spark] andygrove opened a new pull request, #37424: [SPARK-39991][SQL][AQE] Use available column statistics from completed query stages
andygrove opened a new pull request, #37424: URL: https://github.com/apache/spark/pull/37424

### What changes were proposed in this pull request?
AQE uses statistics from completed query stages and feeds them back into the logical optimizer. AQE currently only uses `dataSize` and `numOutputRows` and ignores any available `attributeMap` (column statistics). This PR updates AQE to also populate `attributeMap` in the statistics that it uses for re-optimizing the plan.

### Why are the changes needed?
These changes are needed so that Spark plugins that provide custom implementations of the `ShuffleExchangeLike` trait can leverage column statistics for better plan optimization during AQE execution.

### Does this PR introduce _any_ user-facing change?
No. The current Spark implementation of `ShuffleExchangeLike` (`ShuffleExchangeExec`) does not populate `attributeMap`, so this PR is a no-op for regular Spark.

### How was this patch tested?
New unit test added.
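The statistics shape this PR describes can be sketched without Spark. Field names below loosely mirror Catalyst's `Statistics`/`ColumnStat` but are hypothetical stand-ins, not Spark's actual API: a completed stage's statistics carry per-column stats alongside size and row count, and a re-optimizer can consult them.

```scala
// Illustrative model of stage statistics that include per-column statistics
// (the "attributeMap" the PR feeds back into the logical optimizer).
// All names here are assumptions for the sketch, not Spark's classes.
object StageStatsSketch {
  final case class ColumnStat(distinctCount: Option[BigInt], nullCount: Option[BigInt])
  final case class Statistics(
      sizeInBytes: BigInt,
      rowCount: Option[BigInt],
      attributeStats: Map[String, ColumnStat])

  // A re-optimizer could, for example, use distinct counts to estimate join
  // selectivity instead of relying on size and row count alone.
  def distinctCountOf(stats: Statistics, column: String): Option[BigInt] =
    stats.attributeStats.get(column).flatMap(_.distinctCount)

  def main(args: Array[String]): Unit = {
    val s = Statistics(
      sizeInBytes = BigInt(1 << 20),
      rowCount = Some(BigInt(10000)),
      attributeStats = Map("DEPT" -> ColumnStat(Some(BigInt(6)), Some(BigInt(0)))))
    println(distinctCountOf(s, "DEPT"))
  }
}
```

The PR's point is the last field: previously only the equivalents of `sizeInBytes` and `rowCount` were propagated from completed stages, so column-level information was lost during re-optimization.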
[GitHub] [spark] dtenedor commented on pull request #37423: [SPARK-39985][SQL] Enable implicit DEFAULT column values in inserts from DataFrames
dtenedor commented on PR #37423: URL: https://github.com/apache/spark/pull/37423#issuecomment-1206863387 Hi @gengliangwang this fixes a small bug in column DEFAULT values with DataFrames, and adds some testing.
[GitHub] [spark] dtenedor opened a new pull request, #37423: [SPARK-39985][SQL] Enable implicit DEFAULT column values in inserts from DataFrames
dtenedor opened a new pull request, #37423: URL: https://github.com/apache/spark/pull/37423

### What changes were proposed in this pull request?
Enable implicit DEFAULT column values in inserts from DataFrames. This mostly already worked since the DataFrame inserts already converted to LogicalPlans. I added testing and a small analysis change since the operators are resolved one-by-one instead of all at once. Note that explicit column "default" references are not supported in write operations from DataFrames: since the operators are resolved one-by-one, any `.select` referring to "default" generates a "column not found" error before any following `.insertInto`.

### Why are the changes needed?
This makes inserts from DataFrames produce the same results as those from SQL commands, for consistency and correctness.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Extended the `InsertSuite` in this PR.
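The behavior being enabled, columns omitted by an insert falling back to their declared DEFAULT values, can be sketched without Spark. The `Column`/`padWithDefaults` names below are illustrative helpers for this sketch, not the PR's code:

```scala
// Spark-free sketch of implicit DEFAULT column values: when an insert supplies
// fewer values than the table has columns, the trailing columns take their
// declared defaults. Types and names here are assumptions for illustration.
object DefaultColumnsSketch {
  final case class Column(name: String, default: Option[Any])

  def padWithDefaults(row: Seq[Any], schema: Seq[Column]): Seq[Any] = {
    require(row.length <= schema.length, "too many values for schema")
    // Columns beyond the supplied values fall back to DEFAULT (or null).
    row ++ schema.drop(row.length).map(_.default.orNull)
  }

  def main(args: Array[String]): Unit = {
    val schema = Seq(
      Column("id", None),
      Column("status", Some("active")),
      Column("score", Some(0)))
    println(padWithDefaults(Seq(42), schema))
  }
}
```

The PR's goal is that a DataFrame `.insertInto` resolves omitted columns the same way a SQL `INSERT INTO t (id) VALUES (42)` would.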
[GitHub] [spark] AmplabJenkins commented on pull request #37422: [SPARK-39992][BUILD] Upgrade `slf4j` to 1.7.36
AmplabJenkins commented on PR #37422: URL: https://github.com/apache/spark/pull/37422#issuecomment-1206819116 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #35334: [SPARK-38034][SQL] Optimize TransposeWindow rule
AmplabJenkins commented on PR #35334: URL: https://github.com/apache/spark/pull/35334#issuecomment-1206818862 Can one of the admins verify this patch?
[GitHub] [spark] bjornjorgensen opened a new pull request, #37422: [SPARK-39992] Upgrade `slf4j` to 1.7.36
bjornjorgensen opened a new pull request, #37422: URL: https://github.com/apache/spark/pull/37422

### What changes were proposed in this pull request?
Upgrade `slf4j` from 1.7.32 to 1.7.36.

### Why are the changes needed?
Snyk has [opened a PR at my branch](https://github.com/bjornjorgensen/spark/pull/19), where it wants to [upgrade slf4j](https://www.slf4j.org). The recommended version is 4 versions ahead of our current version and was released 6 months ago, on 2022-02-08. [Release log](https://www.slf4j.org/news.html)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.
[GitHub] [spark] mridulm commented on pull request #37413: [SPARK-39983][CORE][SQL] Do not cache unserialized broadcast relations on the driver
mridulm commented on PR #37413: URL: https://github.com/apache/spark/pull/37413#issuecomment-1206775251 @JoshRosen Ah yes, missed out on that - should have taken a more detailed look.
[GitHub] [spark] JoshRosen commented on pull request #37413: [SPARK-39983][CORE] Do not cache unserialized broadcast relations on the driver
JoshRosen commented on PR #37413: URL: https://github.com/apache/spark/pull/37413#issuecomment-1206762340

@mridulm, it doesn't break local mode because there's a carve-out to preserve the existing behavior in that case: in both places where the `if(serializedOnly` check changes behavior, there's a check for `isLocalMaster` to avoid behavior changes.

We'll still store the original object in the driver block manager at write time in local mode:
https://github.com/apache/spark/blob/75ab18ee0e382b8117bf65fc9ef05190d4fdf01a/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L133-L136

There's a similar carve-out in `readBroadcastBlock` (although I don't think we'd ever actually hit that branch in local mode given that we would have already stored the re-assembled broadcast block in `writeBlocks`):
https://github.com/apache/spark/blob/75ab18ee0e382b8117bf65fc9ef05190d4fdf01a/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L277-L284
[GitHub] [spark] sunchao commented on pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled
sunchao commented on PR #37419: URL: https://github.com/apache/spark/pull/37419#issuecomment-1206745959 @sadikovi in the example you gave:
```
root/
  col0=0/
    part-0001.parquet (schema: COL0)
```
what's the content of `part-0001.parquet`? I wonder why we need to push down partition filters to Parquet, given that we don't materialize the partition values in the Parquet files. What are the filters pushed down to Parquet in this example?
[GitHub] [spark] peter-toth commented on pull request #37334: [SPARK-39887][SQL] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique
peter-toth commented on PR #37334: URL: https://github.com/apache/spark/pull/37334#issuecomment-1206707424 > > Fixes AliasAwareOutputPartitioning and AliasAwareOutputOrdering as testing discovered that ... > > Can we create a new PR for this? Sure, I can open a new one for this next week.
[GitHub] [spark] peter-toth commented on a diff in pull request #37334: [SPARK-39887][SQL] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique
peter-toth commented on code in PR #37334: URL: https://github.com/apache/spark/pull/37334#discussion_r939045315 ## sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q70.sf100/explain.txt: ## @@ -157,31 +157,31 @@ Input [2]: [s_state#14, sum#16] Keys [1]: [s_state#14] Functions [1]: [sum(UnscaledValue(ss_net_profit#10))] Aggregate Attributes [1]: [sum(UnscaledValue(ss_net_profit#10))#17] -Results [3]: [s_state#14, s_state#14, MakeDecimal(sum(UnscaledValue(ss_net_profit#10))#17,17,2) AS _w2#18] +Results [3]: [s_state#14 AS s_state#18, s_state#14, MakeDecimal(sum(UnscaledValue(ss_net_profit#10))#17,17,2) AS _w2#19] Review Comment: Ok. No problem, I can revert this PR to the first version next week.
[GitHub] [spark] amaliujia commented on a diff in pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
amaliujia commented on code in PR #37287: URL: https://github.com/apache/spark/pull/37287#discussion_r939003429 ## R/pkg/tests/fulltests/test_sparkSQL.R: ## @@ -4098,14 +4098,13 @@ test_that("catalog APIs, listTables, getTable, listColumns, listFunctions, funct c("name", "description", "dataType", "nullable", "isPartition", "isBucket")) expect_equal(collect(c)[[1]][[1]], "speed") expect_error(listColumns("zxwtyswklpf", "default"), - paste("Error in listColumns : analysis error - Table", - "'zxwtyswklpf' does not exist in database 'default'")) + paste("Table or view not found: spark_catalog.default.zxwtyswklpf")) Review Comment: SG
[GitHub] [spark] amaliujia commented on a diff in pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
amaliujia commented on code in PR #37287: URL: https://github.com/apache/spark/pull/37287#discussion_r939003155 ## sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala: ## @@ -33,36 +33,37 @@ import org.apache.spark.storage.StorageLevel abstract class Catalog { /** - * Returns the current default database in this session. + * Returns the current database (namespace) in this session. Review Comment: sounds good. Just wanted to confirm that we don't miss anything obvious.
[GitHub] [spark] dongjoon-hyun closed pull request #37418: [SPARK-39987][K8S] Support `PEAK_JVM_(ON|OFF)HEAP_MEMORY` executor rolling policy
dongjoon-hyun closed pull request #37418: [SPARK-39987][K8S] Support `PEAK_JVM_(ON|OFF)HEAP_MEMORY` executor rolling policy URL: https://github.com/apache/spark/pull/37418
[GitHub] [spark] dongjoon-hyun commented on pull request #37418: [SPARK-39987][K8S] Support `PEAK_JVM_(ON|OFF)HEAP_MEMORY` executor rolling policy
dongjoon-hyun commented on PR #37418: URL: https://github.com/apache/spark/pull/37418#issuecomment-1206646328 Thank you, @viirya ! Merged to master.
[GitHub] [spark] cloud-fan commented on pull request #37334: [SPARK-39887][SQL] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique
cloud-fan commented on PR #37334: URL: https://github.com/apache/spark/pull/37334#issuecomment-1206635362 > Fixes AliasAwareOutputPartitioning and AliasAwareOutputOrdering as testing discovered that ... Can we create a new PR for this?
[GitHub] [spark] cloud-fan commented on a diff in pull request #37334: [SPARK-39887][SQL] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique
cloud-fan commented on code in PR #37334: URL: https://github.com/apache/spark/pull/37334#discussion_r938976805 ## sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q70.sf100/explain.txt: ## @@ -157,31 +157,31 @@ Input [2]: [s_state#14, sum#16] Keys [1]: [s_state#14] Functions [1]: [sum(UnscaledValue(ss_net_profit#10))] Aggregate Attributes [1]: [sum(UnscaledValue(ss_net_profit#10))#17] -Results [3]: [s_state#14, s_state#14, MakeDecimal(sum(UnscaledValue(ss_net_profit#10))#17,17,2) AS _w2#18] +Results [3]: [s_state#14 AS s_state#18, s_state#14, MakeDecimal(sum(UnscaledValue(ss_net_profit#10))#17,17,2) AS _w2#19] Review Comment: I'm surprised that this query plan works fine today (two `s_state#14`). Reading the source code a bit more, I think Spark is actually fine with duplicated attribute ids. It assumes columns with the same attr id always output the same values, so it can safely bind attributes to the first match. See `BindReferences`. So the problem is still with `Union`: it's not fine to introduce duplicated attr ids in the first child of Union, as that introduces duplicated attr ids in the output of Union, which causes wrong results. I think eventually we should make `Union` use new attr ids to build its output columns. As a bug fix, your first attempt to simply not remove aliases in the first child of Union looks like the best one: it's simple, and a few unnecessary aliases won't impact performance. Sorry for the back and forth!
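The "first match wins" binding that makes duplicated attribute ids dangerous can be modeled outside Spark. A small sketch (the `Attr` and `bindOrdinal` names are hypothetical; Spark's real logic lives in `BindReferences`): once two distinct columns share an id, references to the second column silently resolve to the first.

```scala
// Hypothetical model of attribute-id binding; not Spark's actual classes.
object DupAttrIdDemo {
  final case class Attr(name: String, id: Long)

  // Resolve an attribute to an ordinal in the child's output by id,
  // taking the first match -- the behavior described in the comment above.
  def bindOrdinal(attr: Attr, output: Seq[Attr]): Int =
    output.indexWhere(_.id == attr.id)

  def main(args: Array[String]): Unit = {
    // A projection that reuses id 14 for two logically different columns,
    // like the two s_state#14 entries in the plan above:
    val dup = Seq(Attr("s_state", 14L), Attr("s_state_copy", 14L))
    println(bindOrdinal(Attr("s_state", 14L), dup))      // 0
    println(bindOrdinal(Attr("s_state_copy", 14L), dup)) // also 0: wrong column
  }
}
```

Inside one operator this is harmless when both columns really hold the same values, which is why the plan "works fine today"; under `Union`, where the duplicated ids leak into the parent's output, it produces wrong results.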
[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8s][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster m
pralabhkumar commented on PR #37417: URL: https://github.com/apache/spark/pull/37417#issuecomment-1206608039 @HyukjinKwon Please review; the build is failing because of an unrelated error (it passes locally).
[GitHub] [spark] HyukjinKwon closed pull request #37403: [SPARK-39980][INFRA] Change infra image to static tag: `ubuntu:focal-20220801`
HyukjinKwon closed pull request #37403: [SPARK-39980][INFRA] Change infra image to static tag: `ubuntu:focal-20220801` URL: https://github.com/apache/spark/pull/37403
[GitHub] [spark] HyukjinKwon commented on pull request #37403: [SPARK-39980][INFRA] Change infra image to static tag: `ubuntu:focal-20220801`
HyukjinKwon commented on PR #37403: URL: https://github.com/apache/spark/pull/37403#issuecomment-1206478723 Merged to master.
[GitHub] [spark] smallzhongfeng commented on pull request #37395: [SPARK-39967][CORE] Use the successful array to count the number of t…
smallzhongfeng commented on PR #37395: URL: https://github.com/apache/spark/pull/37395#issuecomment-1206476150 At present, on the `master` branch, I cannot reproduce the same situation. I agree with you, @mridulm. So I'm going to close this PR temporarily.
[GitHub] [spark] smallzhongfeng closed pull request #37395: [SPARK-39967][CORE] Use the successful array to count the number of t…
smallzhongfeng closed pull request #37395: [SPARK-39967][CORE] Use the successful array to count the number of t… URL: https://github.com/apache/spark/pull/37395
[GitHub] [spark] ivoson commented on pull request #37268: [SPARK-39853][CORE] Support stage level task resource profile for standalone cluster when dynamic allocation disabled
ivoson commented on PR #37268: URL: https://github.com/apache/spark/pull/37268#issuecomment-1206460341
> ```
> /**
>  * Return resource profile Ids of executors where tasks can be assigned to.
>  */
> def compatibleExecutorRpIds(rpMgr: ResourceProfileManager): Set[Int]
> ```
>
> It seems a little bit odd to ask the ResourceProfile to give you compatible other ResourceProfiles. This feels like it should be in the ResourceProfileManager, which knows about all the ResourceProfiles. I guess that is why you pass in the ResourceProfileManager here? Is the intention that the user could explicitly set which ResourceProfiles it's compatible with? If so, I definitely would want a way to not have to specify it.

Yes, exactly. I put the `ResourceProfileManager` here because it knows about all the ResourceProfiles. Adding this API is just to make sure that we have one interface for the scheduler to get compatible RP ids when assigning tasks. The implementation can be further enriched in the future, for example to re-use executors with dynamic allocation on and to add more reuse policies, as #33941 does. And we won't let users specify compatible ResourceProfiles; in this case, only `TaskResourceProfile` will be compatible with the default ResourceProfile.

> The other issue raised that wasn't addressed was the reuse policy. I guess in this case we are limiting the executor profile to 1 because we don't have dynamic allocation, so one could argue that if you use a task resource request with that, you know what you get. Which I am fine with, but we need to be clear that it might very well waste resources.
>
> Also, if the intent is to not support TaskResourceProfile with dynamic allocation, I think we should throw an exception if anyone uses it with the dynamic allocation config on.

As mentioned above, in this case `TaskResourceProfile` will only re-use executors with the default ResourceProfile when dynamic allocation is off. The behavior will be limited to dynamic allocation off, and we will throw an exception if the user uses it with dynamic allocation on. Does this behavior change make sense? @tgravescs
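The rule being agreed on above can be sketched in a few lines (all names here — `RpCompatSketch`, `Profile`, `compatibleExecutorRpIds` taking a flag — are hypothetical illustrations, not the actual Spark API under review): a `TaskResourceProfile` is compatible with its own executors and the default profile's executors when dynamic allocation is off, and is rejected outright when dynamic allocation is on.

```scala
// Hypothetical sketch of the discussed rule; not Spark's actual API.
object RpCompatSketch {
  val DefaultProfileId = 0

  sealed trait Profile { def id: Int }
  final case class DefaultProfile() extends Profile { val id: Int = DefaultProfileId }
  final case class TaskResourceProfile(id: Int) extends Profile

  // Which executor resource-profile ids can run tasks from `rp`?
  def compatibleExecutorRpIds(rp: Profile, dynamicAllocationEnabled: Boolean): Set[Int] =
    rp match {
      case TaskResourceProfile(_) if dynamicAllocationEnabled =>
        // Per the discussion: reject TaskResourceProfile when dynamic allocation is on.
        throw new IllegalArgumentException(
          "TaskResourceProfile is not supported with dynamic allocation enabled")
      case TaskResourceProfile(id) =>
        // Tasks may also land on executors launched with the default profile.
        Set(id, DefaultProfileId)
      case other =>
        Set(other.id)
    }
}
```

For example, `compatibleExecutorRpIds(TaskResourceProfile(3), dynamicAllocationEnabled = false)` yields `Set(3, 0)`, matching the "re-use default-profile executors" behavior described in the comment.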
[GitHub] [spark] yikf commented on pull request #37254: [SPARK-39841][SQL] simplify conflict binary comparison
yikf commented on PR #37254: URL: https://github.com/apache/spark/pull/37254#issuecomment-1206452674 Friendly ping @gengliangwang @cloud-fan @sigmod. Sorry for the late reply, I was busy last week. I fixed the comments you left; please take a look again when you have time~
[GitHub] [spark] cloud-fan commented on pull request #37362: [SPARK-39950][SQL] It's unnecessary to materialize BroadcastQueryStage firstly, because the BroadcastQueryStage does not timeout in AQE.
cloud-fan commented on PR #37362: URL: https://github.com/apache/spark/pull/37362#issuecomment-1206450655 Looks fine, but let's think a bit more and see if there is any benefit to submitting broadcast jobs first. cc @yaooqinn @maryannxue
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938796458 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("DEPT", "IS_MANAGER").sum("SALARY") + .orderBy("DEPT", "IS_MANAGER") + .limit(1) 
+checkSortRemoved(df4) +checkLimitRemoved(df4) +checkPushedInfo(df4, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df4, Seq(Row(1, false, 9000.00))) + +val df5 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY") + .groupBy("my_dept", "my_manager").sum("SALARY") + .orderBy("my_dept", "my_manager") + .limit(1) +checkSortRemoved(df5) +checkLimitRemoved(df5) +checkPushedInfo(df5, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df5, Seq(Row(1, false, 9000.00))) + +val df6 = spark.read + .table("h2.test.employee") + .select($"SALARY", $"IS_MANAGER", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key", "IS_MANAGER").sum("SALARY") + .orderBy("key", "IS_MANAGER") + .limit(1) +checkSortRemoved(df6) +checkLimitRemoved(df6) +checkPushedInfo(df6, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END, " + +"IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df6, Seq(Row(0.00, false, 12000.00))) + +val df7 = spark.read + .table("h2.test.employee") + .select($"DEPT", $"SALARY") + .groupBy("dept").agg(sum("SALARY")) + .orderBy(sum("SALARY")) + .limit(1) +checkSortRemoved(df7, false) +checkLimitRemoved(df7, false) +checkPushedInfo(df7, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []") +checkAnswer(df7, Seq(Row(6, 12000.00))) + +val df8 = spark.read + 
.table("h2.test.employee") + .select($"DEPT", $"SALARY") + .groupBy("dept").agg(sum("SALARY").as("total")) + .orderBy("total") + .limit(1) +checkSortRemoved(df8, false) +checkLimitRemoved(df8, false) +checkPushedInfo(df8, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []") +checkAnswer(df8, Seq(Row(6, 12000.00))) + } + +
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938794884 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("DEPT", "IS_MANAGER").sum("SALARY") + .orderBy("DEPT", "IS_MANAGER") + .limit(1) 
+checkSortRemoved(df4) +checkLimitRemoved(df4) +checkPushedInfo(df4, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df4, Seq(Row(1, false, 9000.00))) + +val df5 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY") + .groupBy("my_dept", "my_manager").sum("SALARY") + .orderBy("my_dept", "my_manager") + .limit(1) +checkSortRemoved(df5) +checkLimitRemoved(df5) +checkPushedInfo(df5, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df5, Seq(Row(1, false, 9000.00))) + +val df6 = spark.read + .table("h2.test.employee") + .select($"SALARY", $"IS_MANAGER", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key", "IS_MANAGER").sum("SALARY") + .orderBy("key", "IS_MANAGER") + .limit(1) +checkSortRemoved(df6) +checkLimitRemoved(df6) +checkPushedInfo(df6, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END, " + +"IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df6, Seq(Row(0.00, false, 12000.00))) + +val df7 = spark.read + .table("h2.test.employee") + .select($"DEPT", $"SALARY") + .groupBy("dept").agg(sum("SALARY")) + .orderBy(sum("SALARY")) Review Comment: We can translate agg expressions now, why this test still can't trigger top-n pushdown? -- This is an automated message from the Apache Git Service. 
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938794073 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("DEPT", "IS_MANAGER").sum("SALARY") + .orderBy("DEPT", "IS_MANAGER") + .limit(1) 
+checkSortRemoved(df4) +checkLimitRemoved(df4) +checkPushedInfo(df4, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df4, Seq(Row(1, false, 9000.00))) + +val df5 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY") + .groupBy("my_dept", "my_manager").sum("SALARY") + .orderBy("my_dept", "my_manager") + .limit(1) +checkSortRemoved(df5) +checkLimitRemoved(df5) +checkPushedInfo(df5, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df5, Seq(Row(1, false, 9000.00))) + +val df6 = spark.read + .table("h2.test.employee") + .select($"SALARY", $"IS_MANAGER", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key", "IS_MANAGER").sum("SALARY") + .orderBy("key", "IS_MANAGER") + .limit(1) +checkSortRemoved(df6) +checkLimitRemoved(df6) +checkPushedInfo(df6, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END, " + +"IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df6, Seq(Row(0.00, false, 12000.00))) + +val df7 = spark.read + .table("h2.test.employee") + .select($"DEPT", $"SALARY") + .groupBy("dept").agg(sum("SALARY")) + .orderBy(sum("SALARY")) Review Comment: how about ``` .groupBy("dept").agg(sum("SALARY").as("sum_salary")) .orderBy("sum_salary") ``` -- This is an automated message from the Apache Git Service. 
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938792943 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,253 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("dept").sum("SALARY") + .orderBy($"dept".gt(1)) + .limit(1) +checkSortRemoved(df4) 
+checkLimitRemoved(df4) +checkPushedInfo(df4, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT > 1 ASC NULLS FIRST] LIMIT 1") +checkAnswer(df4, Seq(Row(1, 19000.00))) + +val df5 = spark.read + .table("h2.test.employee") + .groupBy("DEPT", "IS_MANAGER").sum("SALARY") + .orderBy("DEPT", "IS_MANAGER") + .limit(1) +checkSortRemoved(df5) +checkLimitRemoved(df5) +checkPushedInfo(df5, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df5, Seq(Row(1, false, 9000.00))) + +val df6 = spark.read + .table("h2.test.employee") + .select($"SALARY", $"IS_MANAGER", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) Review Comment: let's use case instead of case when, which is simpler and easier to review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938792433 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,253 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("dept").sum("SALARY") + .orderBy($"dept".gt(1)) Review Comment: let's use cast now. 
group by a predicate is super weird.
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938791655 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,253 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") Review Comment: let's add a cast instead of just a simple alias ``` .select($"DEPT".cast("string").as("my_dept"), $"SALARY") ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938792037 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,253 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) Review Comment: we can remove this test after you address https://github.com/apache/spark/pull/37320/files#r938791655 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938790245 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -408,16 +411,27 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper { } (operation, isPushed && !isPartiallyPushed) case s @ Sort(order, _, operation @ ScanOperation(project, filter, sHolder: ScanBuilderHolder)) -// Without building the Scan, we do not know the resulting column names after aggregate -// push-down, and thus can't push down Top-N which needs to know the ordering column names. -// TODO: we can support simple cases like GROUP BY columns directly and ORDER BY the same -// columns, which we know the resulting column names: the original table columns. -if sHolder.pushedAggregate.isEmpty && filter.isEmpty && - CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) => + if filter.isEmpty && +CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) => val aliasMap = getAliasMap(project) - val newOrder = order.map(replaceAlias(_, aliasMap)).asInstanceOf[Seq[SortOrder]] + val aliasReplacedOrder = order.map(replaceAlias(_, aliasMap)) + val newOrder = if (sHolder.pushedAggregate.isDefined) { +// Without building the Scan, Aggregate push-down give the expected output starts with +// `group_col_` or `agg_func_`. When Aggregate push-down working with Sort for top n +// push-down, we need replace these expected output with the origin expressions. Review Comment: ``` // `ScanBuilderHolder` has different output columns after aggregate pushdown. Here we // replace the attributes in ordering expressions with the original table output columns. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
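The suggested comment above describes mapping the aggregate-pushdown output attributes (named with `group_col_` / `agg_func_` prefixes before the scan is built) back to the original table columns before the ordering is translated. A minimal sketch of that rewrite step, for illustration only: plain strings stand in for Catalyst attributes, and the map below is a hypothetical `pushedAggOutputMap` — in Spark the rewrite runs over `SortOrder` expression trees, not flat lists.

```python
# Hypothetical pushedAggOutputMap: pushdown output attribute -> original column.
pushed_agg_output_map = {"group_col_0": "DEPT", "agg_func_0": "SUM(SALARY)"}

def rewrite_ordering(order_exprs):
    # Replace pushdown output attributes with the original table columns;
    # anything not in the map is left untouched.
    return [pushed_agg_output_map.get(e, e) for e in order_exprs]

print(rewrite_ordering(["group_col_0", "other_col"]))
```

With this mapping in place, a Top-N over grouped output can be expressed in terms of the source table's own columns and pushed down.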
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938790445 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -408,16 +411,27 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper { } (operation, isPushed && !isPartiallyPushed) case s @ Sort(order, _, operation @ ScanOperation(project, filter, sHolder: ScanBuilderHolder)) -// Without building the Scan, we do not know the resulting column names after aggregate -// push-down, and thus can't push down Top-N which needs to know the ordering column names. -// TODO: we can support simple cases like GROUP BY columns directly and ORDER BY the same -// columns, which we know the resulting column names: the original table columns. -if sHolder.pushedAggregate.isEmpty && filter.isEmpty && - CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) => + if filter.isEmpty && +CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) => val aliasMap = getAliasMap(project) - val newOrder = order.map(replaceAlias(_, aliasMap)).asInstanceOf[Seq[SortOrder]] + val aliasReplacedOrder = order.map(replaceAlias(_, aliasMap)) + val newOrder = if (sHolder.pushedAggregate.isDefined) { +// Without building the Scan, Aggregate push-down give the expected output starts with +// `group_col_` or `agg_func_`. When Aggregate push-down working with Sort for top n +// push-down, we need replace these expected output with the origin expressions. 
+aliasReplacedOrder.map { + _.transform { +case a: Attribute => sHolder.pushedAggOutputMap.getOrElse(a, a) + }.asInstanceOf[SortOrder] +} + } else { +aliasReplacedOrder.asInstanceOf[Seq[SortOrder]] + } val normalizedOrders = DataSourceStrategy.normalizeExprs( newOrder, sHolder.relation.output).asInstanceOf[Seq[SortOrder]] + // Because V2ExpressionBuilder can't translate aggregate functions, so we can't Review Comment: we can remove this TODO now. ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -408,16 +411,27 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper { } (operation, isPushed && !isPartiallyPushed) case s @ Sort(order, _, operation @ ScanOperation(project, filter, sHolder: ScanBuilderHolder)) -// Without building the Scan, we do not know the resulting column names after aggregate -// push-down, and thus can't push down Top-N which needs to know the ordering column names. -// TODO: we can support simple cases like GROUP BY columns directly and ORDER BY the same -// columns, which we know the resulting column names: the original table columns. -if sHolder.pushedAggregate.isEmpty && filter.isEmpty && - CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) => + if filter.isEmpty && +CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) => val aliasMap = getAliasMap(project) - val newOrder = order.map(replaceAlias(_, aliasMap)).asInstanceOf[Seq[SortOrder]] + val aliasReplacedOrder = order.map(replaceAlias(_, aliasMap)) + val newOrder = if (sHolder.pushedAggregate.isDefined) { +// Without building the Scan, Aggregate push-down give the expected output starts with +// `group_col_` or `agg_func_`. When Aggregate push-down working with Sort for top n +// push-down, we need replace these expected output with the origin expressions. 
+aliasReplacedOrder.map { + _.transform { +case a: Attribute => sHolder.pushedAggOutputMap.getOrElse(a, a) + }.asInstanceOf[SortOrder] +} + } else { +aliasReplacedOrder.asInstanceOf[Seq[SortOrder]] + } val normalizedOrders = DataSourceStrategy.normalizeExprs( newOrder, sHolder.relation.output).asInstanceOf[Seq[SortOrder]] + // Because V2ExpressionBuilder can't translate aggregate functions, so we can't Review Comment: and let's add tests for it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37393: [SPARK-39966][SQL] Use V2 Filter in SupportsDelete
cloud-fan commented on code in PR #37393: URL: https://github.com/apache/spark/pull/37393#discussion_r938782700 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala: ## @@ -277,14 +277,14 @@ class DataSourceV2Strategy(session: SparkSession) extends Strategy with Predicat // correctness depends on removing all matching data. val filters = DataSourceStrategy.normalizeExprs(Seq(condition), output) .flatMap(splitConjunctivePredicates(_).map { -f => DataSourceStrategy.translateFilter(f, true).getOrElse( +f => DataSourceV2Strategy.translateFilterV2(f).getOrElse( throw QueryCompilationErrors.cannotTranslateExpressionToSourceFilterError(f)) Review Comment: `SupportsDelete` is different from runtime filtering. For runtime filtering, we need to push as many predicates as possible, but this is just a perf improvement. `SupportsDelete` must take all predicates, otherwise the source can't delete enough data which leads to wrong query result later. We need to follow the same behavior in https://github.com/apache/spark/pull/37393/files#diff-dc485c81773a73a5a462994af50e17a5043a8d66c47399cf29b0a3cb56c85591R80
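The distinction drawn in the review comment above — best-effort translation for runtime filtering versus all-or-nothing translation for DELETE — can be sketched as follows. The `translate` function and predicate shapes here are hypothetical stand-ins, not Spark's actual `translateFilterV2`:

```python
def translate(pred):
    # Hypothetical translator: pretend only equality predicates translate.
    return pred if "=" in pred else None

def translate_for_delete(predicates):
    # DELETE semantics: every predicate must translate, otherwise fail up
    # front -- silently dropping one would delete too little data and yield
    # wrong query results later.
    translated = []
    for p in predicates:
        t = translate(p)
        if t is None:
            raise ValueError(f"cannot translate predicate: {p}")
        translated.append(t)
    return translated

def translate_for_runtime_filter(predicates):
    # Runtime filtering is purely a performance optimization, so predicates
    # that cannot be translated are simply skipped.
    return [t for t in map(translate, predicates) if t is not None]
```

The same query condition thus either fails fast (DELETE path) or is pushed partially (runtime-filtering path), which is the behavioral difference the reviewer asks to preserve.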
[GitHub] [spark] cloud-fan commented on a diff in pull request #37393: [SPARK-39966][SQL] Use V2 Filter in SupportsDelete
cloud-fan commented on code in PR #37393: URL: https://github.com/apache/spark/pull/37393#discussion_r938778661 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsDelete.java: ## @@ -28,7 +30,7 @@ * @since 3.0.0 */ @Evolving -public interface SupportsDelete extends TruncatableTable { +public interface SupportsDelete extends TruncatableTable, SupportsDeleteV2 { Review Comment: ```suggestion public interface SupportsDelete extends SupportsDeleteV2 { ```
[GitHub] [spark] zhengruifeng commented on a diff in pull request #37335: [SPARK-39895][PYTHON] Support multiple column drop
zhengruifeng commented on code in PR #37335: URL: https://github.com/apache/spark/pull/37335#discussion_r938728719 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala: ## @@ -826,6 +826,16 @@ class DataFrameSuite extends QueryTest assert(df.schema.map(_.name) === Seq("key", "value")) } + test("drop two column references") { Review Comment: ```suggestion test("SPARK-39895: drop two column references") { ``` ## python/pyspark/sql/tests/test_dataframe.py: ## @@ -87,6 +87,21 @@ def test_help_command(self): pydoc.render_doc(df.foo) pydoc.render_doc(df.take(1)) +def test_drop(self): +df = self.spark.createDataFrame([("A", 50, "Y"), ("B", 60, "Y")], ["name", "age", "active"]) + Review Comment: I think we can remove these empty lines here
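The feature under test above — dropping several columns in a single `drop` call — amounts to a set-difference over the schema. A minimal plain-Python stand-in of that semantics (no Spark required; the function name and ignore-missing behavior here are illustrative assumptions, not PySpark's implementation):

```python
def drop_columns(schema, *cols):
    # Mimics the multi-column drop under test: remove every named column in
    # one call; names not present in the schema are ignored rather than
    # raising, preserving order of the remaining columns.
    to_drop = set(cols)
    return [c for c in schema if c not in to_drop]

print(drop_columns(["name", "age", "active"], "age", "active"))
```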
[GitHub] [spark] cloud-fan commented on pull request #35334: [SPARK-38034][SQL] Optimize TransposeWindow rule
cloud-fan commented on PR #35334: URL: https://github.com/apache/spark/pull/35334#issuecomment-1206308133 thanks, merging to master/3.3/3.2!
[GitHub] [spark] cloud-fan closed pull request #35334: [SPARK-38034][SQL] Optimize TransposeWindow rule
cloud-fan closed pull request #35334: [SPARK-38034][SQL] Optimize TransposeWindow rule URL: https://github.com/apache/spark/pull/35334
[GitHub] [spark] cloud-fan commented on pull request #35334: [SPARK-38034][SQL] Optimize TransposeWindow rule
cloud-fan commented on PR #35334: URL: https://github.com/apache/spark/pull/35334#issuecomment-1206305491 I think the previous O(n!) time complexity code is unexpected and buggy. Let's backport this PR.
[GitHub] [spark] cloud-fan commented on a diff in pull request #35334: [SPARK-38034][SQL] Optimize TransposeWindow rule
cloud-fan commented on code in PR #35334: URL: https://github.com/apache/spark/pull/35334#discussion_r938688689 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -1148,9 +1148,9 @@ object CollapseWindow extends Rule[LogicalPlan] { */ object TransposeWindow extends Rule[LogicalPlan] { private def compatiblePartitions(ps1 : Seq[Expression], ps2: Seq[Expression]): Boolean = { -ps1.length < ps2.length && ps2.take(ps1.length).permutations.exists(ps1.zip(_).forall { Review Comment: I mean the previous code is kind of buggy. This PR is more like a bug fix instead of perf improvement and we should backport it.
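The O(n!) behavior comes from `ps2.take(ps1.length).permutations.exists(...)`, which is factorial in the number of partition expressions and, as the quoted diff also shows, only ever inspects a prefix of `ps2`. A hedged Python sketch (strings stand in for Catalyst expressions, `==` for `semanticEquals`) contrasting the old prefix-permutation check with an order-insensitive subset check like the one this PR moves to:

```python
from itertools import permutations

def compatible_old(ps1, ps2):
    # Previous approach: permute a prefix of ps2 -- O(len(ps1)!) work, and it
    # misses matches outside the first len(ps1) elements of ps2.
    return len(ps1) < len(ps2) and any(
        all(a == b for a, b in zip(ps1, perm))
        for perm in permutations(ps2[: len(ps1)])
    )

def compatible_new(ps1, ps2):
    # Subset check: every partition expression of the smaller window must
    # appear somewhere in the larger one -- order-insensitive, no permutations.
    return len(ps1) < len(ps2) and all(any(a == b for b in ps2) for a in ps1)

# The prefix-permutation version misses this compatible pair:
print(compatible_old(["a", "c"], ["a", "b", "c"]))  # False
print(compatible_new(["a", "c"], ["a", "b", "c"]))  # True
```

This is why the comment above treats the change as a bug fix rather than a pure performance improvement: the two checks disagree on inputs like the one shown.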
[GitHub] [spark] cloud-fan closed pull request #37391: [SPARK-39964][SQL] DS V2 pushdown should unify the translate path
cloud-fan closed pull request #37391: [SPARK-39964][SQL] DS V2 pushdown should unify the translate path URL: https://github.com/apache/spark/pull/37391
[GitHub] [spark] cloud-fan commented on pull request #37391: [SPARK-39964][SQL] DS V2 pushdown should unify the translate path
cloud-fan commented on PR #37391: URL: https://github.com/apache/spark/pull/37391#issuecomment-1206303422 thanks, merging to master!
[GitHub] [spark] cloud-fan closed pull request #37389: [SPARK-39963][SQL] Simplify `SimplifyCasts.isWiderCast`
cloud-fan closed pull request #37389: [SPARK-39963][SQL] Simplify `SimplifyCasts.isWiderCast` URL: https://github.com/apache/spark/pull/37389
[GitHub] [spark] cloud-fan commented on pull request #37389: [SPARK-39963][SQL] Simplify `SimplifyCasts.isWiderCast`
cloud-fan commented on PR #37389: URL: https://github.com/apache/spark/pull/37389#issuecomment-1206302474 thanks, merging to master!
[GitHub] [spark] cloud-fan commented on pull request #37389: [SPARK-39963][SQL] Simplify `SimplifyCasts.isWiderCast`
cloud-fan commented on PR #37389: URL: https://github.com/apache/spark/pull/37389#issuecomment-1206302144 thanks, merging to master!
[GitHub] [spark] cloud-fan commented on a diff in pull request #37340: [MINOR][SQL] Improve the comments about null tracking for UnsafeRow
cloud-fan commented on code in PR #37340: URL: https://github.com/apache/spark/pull/37340#discussion_r938686329 ## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java: ## @@ -46,10 +46,10 @@ /** * An Unsafe implementation of Row which is backed by raw memory instead of Java objects. * - * Each tuple has three parts: [null bit set] [values] [variable length portion] + * Each tuple has three parts: [null-tracking bit set] [values] [variable length portion] * - * The bit set is used for null tracking and is aligned to 8-byte word boundaries. It stores - * one bit per field. + * The null-tracking bit set is used for null tracking and is aligned to 8-byte word boundaries. Review Comment: This is not addressed
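The layout being documented in the diff above: the null-tracking bit set holds one bit per field and is padded out to 8-byte word boundaries so the values region stays word-aligned. A sketch of that width calculation (mirroring, under my reading of the javadoc, what `UnsafeRow.calculateBitSetWidthInBytes` computes):

```python
def bitset_width_in_bytes(num_fields):
    # One null-tracking bit per field, rounded up to whole 64-bit (8-byte)
    # words: 0 fields -> 0 bytes, 1..64 fields -> 8 bytes, 65..128 -> 16, ...
    return ((num_fields + 63) // 64) * 8

for n in (0, 1, 64, 65):
    print(n, bitset_width_in_bytes(n))
```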
[GitHub] [spark] cloud-fan closed pull request #37169: [SPARK-38901][SQL] DS V2 supports push down misc functions
cloud-fan closed pull request #37169: [SPARK-38901][SQL] DS V2 supports push down misc functions URL: https://github.com/apache/spark/pull/37169
[GitHub] [spark] cloud-fan commented on pull request #37169: [SPARK-38901][SQL] DS V2 supports push down misc functions
cloud-fan commented on PR #37169: URL: https://github.com/apache/spark/pull/37169#issuecomment-1206299101 thanks, merging to master!
[GitHub] [spark] cloud-fan commented on a diff in pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
cloud-fan commented on code in PR #37287: URL: https://github.com/apache/spark/pull/37287#discussion_r938682574 ## sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala: ## @@ -33,36 +33,37 @@ import org.apache.spark.storage.StorageLevel abstract class Catalog { /** - * Returns the current default database in this session. + * Returns the current database (namespace) in this session. Review Comment: `namespace` is more like the official name. database/schema is only for the hive catalog. We can change `database` to `database/schema` though.
[GitHub] [spark] cloud-fan commented on a diff in pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
cloud-fan commented on code in PR #37287: URL: https://github.com/apache/spark/pull/37287#discussion_r938681100 ## R/pkg/tests/fulltests/test_sparkSQL.R: ## @@ -4098,14 +4098,13 @@ test_that("catalog APIs, listTables, getTable, listColumns, listFunctions, funct c("name", "description", "dataType", "nullable", "isPartition", "isBucket")) expect_equal(collect(c)[[1]][[1]], "speed") expect_error(listColumns("zxwtyswklpf", "default"), - paste("Error in listColumns : analysis error - Table", - "'zxwtyswklpf' does not exist in database 'default'")) + paste("Table or view not found: spark_catalog.default.zxwtyswklpf")) Review Comment: I don't think we treat error message change as behavior change. We change error messages from time to time.
[GitHub] [spark] wang-zhun commented on pull request #37406: [SPARK-39921][SQL] SkewJoin--Stream side skew in BroadcastHashJoin
wang-zhun commented on PR #37406: URL: https://github.com/apache/spark/pull/37406#issuecomment-1206294767 > you do not use AQE ? Turning off AQE will be a `SortMergeJoin`, we need to turn on AQE and solve the data skew
[GitHub] [spark] cloud-fan closed pull request #37398: [SPARK-39867][SQL][3.1] Global limit should not inherit OrderPreservingUnaryNode
cloud-fan closed pull request #37398: [SPARK-39867][SQL][3.1] Global limit should not inherit OrderPreservingUnaryNode URL: https://github.com/apache/spark/pull/37398
[GitHub] [spark] cloud-fan commented on pull request #37398: [SPARK-39867][SQL][3.1] Global limit should not inherit OrderPreservingUnaryNode
cloud-fan commented on PR #37398: URL: https://github.com/apache/spark/pull/37398#issuecomment-1206293236 The Python style check failure is unrelated. Thanks, merging to 3.1!
[GitHub] [spark] cloud-fan closed pull request #37397: [SPARK-39867][SQL][3.2] Global limit should not inherit OrderPreservingUnaryNode
cloud-fan closed pull request #37397: [SPARK-39867][SQL][3.2] Global limit should not inherit OrderPreservingUnaryNode URL: https://github.com/apache/spark/pull/37397
[GitHub] [spark] cloud-fan commented on pull request #37397: [SPARK-39867][SQL][3.2] Global limit should not inherit OrderPreservingUnaryNode
cloud-fan commented on PR #37397: URL: https://github.com/apache/spark/pull/37397#issuecomment-1206290710 The failed Python style check is not related to this PR. Merging to 3.2, thanks!
[GitHub] [spark] Yikun commented on pull request #37403: [SPARK-39980][INFRA] Change infra image to static tag: `ubuntu:focal-20220801`
Yikun commented on PR #37403: URL: https://github.com/apache/spark/pull/37403#issuecomment-1206288324 @HyukjinKwon Ready to go.
[GitHub] [spark] Resol1992 commented on pull request #37404: [SPARK-39866][SQL] Memory leak when closing a session of Spark ThriftServer
Resol1992 commented on PR #37404: URL: https://github.com/apache/spark/pull/37404#issuecomment-1206257121 > Different sessions can share this cache. Maybe you can set `spark.sql.metadataCacheTTLSeconds` to a positive value to workaround this issue? @wangyum Thanks for your advice. I have tried to work around this by setting `spark.sql.metadataCacheTTLSeconds = 10`, but it does not work: the fileStatus objects still exist after the sparkSession is closed. In fact,
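For reference, the TTL workaround discussed above is typically applied as a SQL conf setting. This is an illustrative sketch only; as the reporter notes, it did not stop the leak in this case.

```scala
// Illustrative only: the TTL workaround discussed above, applied to an
// existing SparkSession. Per the report, the cached FileStatus objects
// still survived after the session was closed.
spark.conf.set("spark.sql.metadataCacheTTLSeconds", "10")

// Equivalent at submit time:
//   spark-submit --conf spark.sql.metadataCacheTTLSeconds=10 ...
```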
[GitHub] [spark] HyukjinKwon commented on pull request #37414: [SPARK-39981][SQL] Throw the exception QueryExecutionErrors.castingCauseOverflowErrorInTableInsert in Cast
HyukjinKwon commented on PR #37414: URL: https://github.com/apache/spark/pull/37414#issuecomment-1206247682 Test conflicts. I just removed the test part in branch-3.3.
[GitHub] [spark] HyukjinKwon closed pull request #37414: [SPARK-39981][SQL] Throw the exception QueryExecutionErrors.castingCauseOverflowErrorInTableInsert in Cast
HyukjinKwon closed pull request #37414: [SPARK-39981][SQL] Throw the exception QueryExecutionErrors.castingCauseOverflowErrorInTableInsert in Cast URL: https://github.com/apache/spark/pull/37414
[GitHub] [spark] HyukjinKwon commented on pull request #37414: [SPARK-39981][SQL] Throw the exception QueryExecutionErrors.castingCauseOverflowErrorInTableInsert in Cast
HyukjinKwon commented on PR #37414: URL: https://github.com/apache/spark/pull/37414#issuecomment-1206246390 Merged to master and branch-3.3.
[GitHub] [spark] mridulm commented on a diff in pull request #37411: [SPARK-39984][CORE] Check workerLastHeartbeat with master before HeartbeatReceiver expires an executor
mridulm commented on code in PR #37411: URL: https://github.com/apache/spark/pull/37411#discussion_r938566024 ## core/src/main/scala/org/apache/spark/internal/config/package.scala: ## @@ -2398,4 +2398,11 @@ package object config { .version("3.3.0") .intConf .createWithDefault(5) + + private[spark] val HEARTBEAT_RECEIVER_CHECK_WORKER_LAST_HEARTBEAT = +ConfigBuilder("spark.driver.heartbeat.checkWorkerLastHeartbeat") + .internal() + .version("3.4.0") + .booleanConf + .createWithDefault(true) Review Comment: Default this to `false` ## core/src/main/scala/org/apache/spark/internal/config/Network.scala: ## @@ -49,7 +49,13 @@ private[spark] object Network { ConfigBuilder("spark.network.timeoutInterval") .version("1.3.2") .timeConf(TimeUnit.MILLISECONDS) - .createWithDefaultString(STORAGE_BLOCKMANAGER_TIMEOUTINTERVAL.defaultValueString) + .createWithDefaultString("15s") + + private[spark] val NETWORK_EXECUTOR_TIMEOUT = +ConfigBuilder("spark.network.executorTimeout") + .version("1.3.0") Review Comment: `1.3.0` -> `3.4.0` ## core/src/main/scala/org/apache/spark/internal/config/Network.scala: ## @@ -49,7 +49,13 @@ private[spark] object Network { ConfigBuilder("spark.network.timeoutInterval") .version("1.3.2") .timeConf(TimeUnit.MILLISECONDS) - .createWithDefaultString(STORAGE_BLOCKMANAGER_TIMEOUTINTERVAL.defaultValueString) + .createWithDefaultString("15s") + + private[spark] val NETWORK_EXECUTOR_TIMEOUT = +ConfigBuilder("spark.network.executorTimeout") + .version("1.3.0") + .timeConf(TimeUnit.MILLISECONDS) + .createWithDefaultString("60s") Review Comment: fallback to `NETWORK_TIMEOUT` to preserve existing behavior. 
## core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala: ## @@ -199,41 +222,131 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock) removeExecutor(executorRemoved.executorId) } + private def killExecutor(executorId: String, timeout: Long): Unit = { +logWarning(s"Removing executor $executorId with no recent heartbeats: " + + s"${timeout} ms exceeds timeout $executorTimeoutMs ms") +killExecutorThread.submit(new Runnable { + override def run(): Unit = Utils.tryLogNonFatalError { +// Note: we want to get an executor back after expiring this one, +// so do not simply call `sc.killExecutor` here (SPARK-8119) +sc.killAndReplaceExecutor(executorId) +// SPARK-27348: in case of the executors which are not gracefully shut down, +// we should remove lost executors from CoarseGrainedSchedulerBackend manually +// here to guarantee two things: +// 1) explicitly remove executor information from CoarseGrainedSchedulerBackend for +//a lost executor instead of waiting for disconnect message +// 2) call scheduler.executorLost() underlying to fail any tasks assigned to +//those executors to avoid app hang +sc.schedulerBackend match { + case backend: CoarseGrainedSchedulerBackend => +backend.driverEndpoint.send(RemoveExecutor(executorId, + ExecutorProcessLost( +s"Executor heartbeat timed out after ${timeout} ms", +causedByApp = !sc.conf.get(HEARTBEAT_RECEIVER_CHECK_WORKER_LAST_HEARTBEAT + + // LocalSchedulerBackend is used locally and only has one single executor + case _: LocalSchedulerBackend => + + case other => throw new UnsupportedOperationException( +s"Unknown scheduler backend: ${other.getClass}") +} + } +}) + } + private def expireDeadHosts(): Unit = { + /** + * [SC-105641] + * Originally, the driver’s HeartbeatReceiver will expire an executor if it does not receive any + * heartbeat from the executor for 120 seconds. However, 120 seconds is too long, but we will face + * other challenges when we try to lower the timeout threshold. 
To elaborate, when an executor is + * performing full GC, it cannot send/reply any message. Next paragraphs describe the solution to + * detect network disconnection between driver and executor in a short time. + * + * An executor is running on a worker but in different JVMs, and a driver is running on a master + * but in different JVMs. Hence, the network connection between driver/executor and master/worker + * is the same. Because executor and worker are running on different JVMs, worker can still send + * heartbeat to master when executor performs GC. + * + * For new Heartbeat Receiver, if driver does not receive any heartbeat from the executor for + * `executorTimeoutMs` (default: 60s) seconds, HeartbeatReceiver will send a request to master to + * ask for the latest heartbeat from the worker which the executor runs on
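Applying the review suggestions above, the two config definitions might look like the following. This is a sketch only, using Spark's `ConfigBuilder` API as it appears in the quoted diff (the final PR may differ): the new flag defaults to `false`, the version is corrected to `3.4.0`, and the new executor timeout falls back to `spark.network.timeout` so existing deployments keep their behavior.

```scala
private[spark] val HEARTBEAT_RECEIVER_CHECK_WORKER_LAST_HEARTBEAT =
  ConfigBuilder("spark.driver.heartbeat.checkWorkerLastHeartbeat")
    .internal()
    .version("3.4.0")
    .booleanConf
    .createWithDefault(false)        // reviewer: default this to false

private[spark] val NETWORK_EXECUTOR_TIMEOUT =
  ConfigBuilder("spark.network.executorTimeout")
    .version("3.4.0")                // reviewer: 1.3.0 -> 3.4.0
    .fallbackConf(NETWORK_TIMEOUT)   // reviewer: preserve existing behavior
```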
[GitHub] [spark] wangyum opened a new pull request, #37421: [SPARK-39989][SQL] Support estimate column statistics if it is foldable expression
wangyum opened a new pull request, #37421: URL: https://github.com/apache/spark/pull/37421 ### What changes were proposed in this pull request? This PR adds support for estimating column statistics when the expression is foldable. For example: estimate the column statistics of `'a' AS a` in `SELECT 'a' AS a FROM tbl`. 1. If the foldable expression is null: ```scala ColumnStat(Some(0), None, None, Some(rowCount), Some(size), Some(size), None, 2) ``` 2. If the foldable expression is not null: ```scala ColumnStat(Some(1), Some(value), Some(value), Some(0), Some(size), Some(size), None, 2) ``` ### Why are the changes needed? Improve column statistics. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test.
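The two cases from the PR description can be sketched as a single function. This is a hypothetical illustration using a simplified stand-in for Spark's `ColumnStat` (the real class lives in `org.apache.spark.sql.catalyst`; field order follows the constructor calls in the PR body): a foldable expression has one known value, or is always null, so its stats follow directly from the table's row count.

```scala
// Simplified stand-in for Spark's ColumnStat; fields mirror the PR body:
// distinctCount, min, max, nullCount, avgLen, maxLen, histogram, version.
case class ColumnStat(
    distinctCount: Option[BigInt],
    min: Option[Any],
    max: Option[Any],
    nullCount: Option[BigInt],
    avgLen: Option[Long],
    maxLen: Option[Long],
    histogram: Option[Nothing],
    version: Int)

// Hypothetical sketch of the proposed estimation for a foldable expression.
def foldableColumnStat(value: Option[Any], rowCount: BigInt, size: Long): ColumnStat =
  value match {
    // NULL literal: zero distinct values, every row is null.
    case None =>
      ColumnStat(Some(0), None, None, Some(rowCount), Some(size), Some(size), None, 2)
    // Non-null literal: one distinct value, no nulls, min == max == value.
    case Some(v) =>
      ColumnStat(Some(1), Some(v), Some(v), Some(0), Some(size), Some(size), None, 2)
  }
```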
[GitHub] [spark] dongjoon-hyun commented on pull request #37418: [SPARK-39987][K8S] Support `PEAK_JVM_(ON|OFF)HEAP_MEMORY` executor rolling policy
dongjoon-hyun commented on PR #37418: URL: https://github.com/apache/spark/pull/37418#issuecomment-1206159404 Could you review this when you have some time, @viirya?
[GitHub] [spark] CHzxp commented on pull request #37406: [SPARK-39921][SQL] SkewJoin--Stream side skew in BroadcastHashJoin
CHzxp commented on PR #37406: URL: https://github.com/apache/spark/pull/37406#issuecomment-1206140696 You do not use AQE?
[GitHub] [spark] Zhangshunyu commented on pull request #37404: [SPARK-39866][SQL] Memory leak when closing a session of Spark ThriftServer
Zhangshunyu commented on PR #37404: URL: https://github.com/apache/spark/pull/37404#issuecomment-1206127023 This is a very good fix when using multiple connections to query the same table and then closing the connections; I don't think relying on TTL is enough.
[GitHub] [spark] mridulm commented on pull request #37413: [SPARK-39983][CORE] Do not cache unserialized broadcast relations on the driver
mridulm commented on PR #37413: URL: https://github.com/apache/spark/pull/37413#issuecomment-1206124066 Doesn't this break in local mode?
[GitHub] [spark] LuciferYang commented on a diff in pull request #36200: [SPARK-38909][CORE][YARN] Encapsulate `LevelDB` used by `ExternalShuffleBlockResolver`,`YarnShuffleService` and `RemoteBlockPus
LuciferYang commented on code in PR #36200: URL: https://github.com/apache/spark/pull/36200#discussion_r938512128 ## common/network-common/src/main/java/org/apache/spark/network/util/DBProvider.java: ## @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.network.util; + +import java.io.File; +import java.io.IOException; + +import com.fasterxml.jackson.databind.ObjectMapper; +import com.google.common.annotations.VisibleForTesting; + +import org.apache.spark.network.shuffledb.LevelDB; +import org.apache.spark.network.shuffledb.DB; +import org.apache.spark.network.shuffledb.StoreVersion; + +public class DBProvider { +public static DB initDB(File dbFile, StoreVersion version, ObjectMapper mapper) +throws IOException { +if (dbFile != null) { +if (dbFile.getName().endsWith(".ldb")) { +org.iq80.leveldb.DB levelDB = LevelDBProvider.initLevelDB(dbFile, version, mapper); +return levelDB != null ? new LevelDB(levelDB) : null; +} else { +return null; +} +} Review Comment: Let me think about it
[GitHub] [spark] mridulm commented on pull request #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages
mridulm commented on PR #37384: URL: https://github.com/apache/spark/pull/37384#issuecomment-1206118791 My query was slightly different: are you actually seeing this issue? Can you elaborate on what was observed? There are corner cases which we do not deal with currently in Spark, given the complexity of handling them and given how infrequent they are. A stage/job is not immediately failed when a single task fails - it is retried, and there are various exclude lists which prevent repeated scheduling on the same executor/node and quickly fail the stage/job.
[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled
sadikovi commented on code in PR #37419: URL: https://github.com/apache/spark/pull/37419#discussion_r938492838 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala: ## @@ -228,6 +228,13 @@ class ParquetFileFormat SQLConf.PARQUET_TIMESTAMP_NTZ_ENABLED.key, sparkSession.sessionState.conf.parquetTimestampNTZEnabled) +// See PARQUET-2170. +// Disable column index optimisation when required schema is empty so we get the correct +// row count from parquet-mr. +if (requiredSchema.isEmpty) { Review Comment: No, this is not required for DSv2. The test works in DSv2 due to another inconsistency: Parquet DSv2 does not consider the full file schema when creating pushdown filters. There is a check in FileScanBuilder to ignore partition columns, so in this case the schema is empty and no filters will be pushed down, returning the correct number of records. It is rather a performance inefficiency in DSv2, as the entire file will be scanned; however, the result will be correct. I thought about fixing it the same way DSv2 fixed the issue, but it is a much bigger change as it would affect not just this case but others as well. I hope my explanation makes sense. Let me know your thoughts.
[GitHub] [spark] gengliangwang commented on pull request #37414: [SPARK-39981][SQL] Throw the exception QueryExecutionErrors.castingCauseOverflowErrorInTableInsert in Cast
gengliangwang commented on PR #37414: URL: https://github.com/apache/spark/pull/37414#issuecomment-1206083866 BTW, it would be great if we could have a test case to cover the non-codegen code path.
[GitHub] [spark] gengliangwang commented on pull request #37414: [SPARK-39981][SQL] Throw the exception QueryExecutionErrors.castingCauseOverflowErrorInTableInsert in Cast
gengliangwang commented on PR #37414: URL: https://github.com/apache/spark/pull/37414#issuecomment-1206082240 Thanks for catching it, @HyukjinKwon!