[GitHub] [spark] dependabot[bot] opened a new pull request, #37425: Bump postgresql from 42.3.3 to 42.4.1
dependabot[bot] opened a new pull request, #37425: URL: https://github.com/apache/spark/pull/37425

Bumps postgresql (https://github.com/pgjdbc/pgjdbc) from 42.3.3 to 42.4.1.

Release notes, sourced from postgresql's releases (https://github.com/pgjdbc/pgjdbc/releases):

42.4.0

What's Changed
- Enhancement: Made TimestampUtils.utcTz static and renamed to UTC_TIMEZONE by @svendiedrichsen in pgjdbc/pgjdbc#2519
- fix: return correct base type for domain from getUDTs (#2520) by @alurie in pgjdbc/pgjdbc#2522
- fix: support queries with up to 65535 (inclusive) parameters by @vlsi in pgjdbc/pgjdbc#2525
- chore: use META-INF/licenses/$group/$artifact-$version/... folder for licenses by @vlsi in pgjdbc/pgjdbc#2531
- fix: added GROUP_STARTUP_PARAMETERS boolean property to determine whether or not to group startup parameters in a transaction; fixes Issue 2423 (pgbouncer cannot deal with transactions in statement pooling mode) by @davecramer in pgjdbc/pgjdbc#2425
- chore: Make the readme version agnostic by @jorsol in pgjdbc/pgjdbc#2540
- Release notes 42.4.0 by @davecramer in pgjdbc/pgjdbc#2541

New Contributors
- @svendiedrichsen made their first contribution in pgjdbc/pgjdbc#2519

Full Changelog: https://github.com/pgjdbc/pgjdbc/compare/REL42.3.6...REL42.4.0

Changelog, sourced from postgresql's changelog (https://github.com/pgjdbc/pgjdbc/blob/master/CHANGELOG.md):

Notable changes since version 42.0.0; read the complete History of Changes at https://jdbc.postgresql.org/documentation/changelog.html. The format is based on Keep a Changelog (http://keepachangelog.com/en/1.0.0/).

[Unreleased]
Changed
Added
Fixed

[42.4.1] (2022-08-01 16:24:20 -0400)

Security
- fix: CVE-2022-31197. Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so as to prevent SQL injection. Previously, the column names for both key and data columns in the table were copied as-is into the generated SQL. This allowed a malicious table with column names that include a statement terminator to be parsed and executed as multiple separate commands. Also adds a new test class ResultSetRefreshTest to verify this change. Reported by Sho Kato (@kato-sho).

Changed
- chore: skip publishing pgjdbc-osgi-test to Central
- chore: bump Gradle to 7.5
- test: update JUnit to 5.8.2

Added
- chore: added Gradle Wrapper Validation for verifying gradle-wrapper.jar
- chore: added "permissions: contents: read" for GitHub Actions to avoid unintentional modifications by the CI
- chore: support building pgjdbc with Java 17

Fixed

[42.4.0] (2022-06-09 08:14:02 -0400)

Changed
- fix: added GROUP_STARTUP_PARAMETERS boolean property to determine whether or not to group startup parameters in a transaction (default=false like 42.2.x); fixes Issue #2425 (pgbouncer cannot deal with transactions in statement pooling mode) [PR pgjdbc/pgjdbc#2425]

Fixed
- fix: queries with up to 65535 (inclusive) parameters are supported now (previous limit was 32767) [PR pgjdbc/pgjdbc#2525, Issue pgjdbc/pgjdbc#1311]
- fix: workaround JarIndex parsing issue by using groupId/artifactId-version directory namings. Regression since 42.2.13. [PR pgjdbc/pgjdbc#2531, issue
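The CVE-2022-31197 entry above concerns unescaped column identifiers being interpolated into generated SQL. As a hedged illustration of the idea, not pgjdbc's actual code, standard-SQL identifier quoting wraps the name in double quotes and doubles any embedded double quote, so a hostile column name stays a single inert identifier:

```scala
// Illustrative sketch of standard-SQL identifier quoting, the kind of escaping
// the CVE-2022-31197 fix applies to column names used in refresh()-generated SQL.
// This is NOT pgjdbc's actual implementation.
object IdentifierQuoting {
  def quoteIdentifier(name: String): String =
    "\"" + name.replace("\"", "\"\"") + "\""

  def main(args: Array[String]): Unit = {
    // A column name containing a statement terminator can no longer split the
    // generated statement into multiple commands once quoted.
    val hostile = "x\"; DROP TABLE t; --"
    println(quoteIdentifier(hostile))
  }
}
```

Before the fix, names like the one above were copied as-is into the generated SQL, which is exactly the injection vector described in the advisory.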
[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
beliefer commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r939485352

## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ##

@@ -864,6 +851,253 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel
     checkAnswer(df2, Seq(Row(2, "david", 1.00)))
   }
+  test("scan with aggregate push-down and top N push-down") {
+    val df1 = spark.read
+      .table("h2.test.employee")
+      .groupBy("DEPT").sum("SALARY")
+      .orderBy("DEPT")
+      .limit(1)
+    checkSortRemoved(df1)
+    checkLimitRemoved(df1)
+    checkPushedInfo(df1,
+      "PushedAggregates: [SUM(SALARY)]",
+      "PushedGroupByExpressions: [DEPT]",
+      "PushedFilters: []",
+      "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+    checkAnswer(df1, Seq(Row(1, 19000.00)))
+
+    val df2 = spark.read
+      .table("h2.test.employee")
+      .select($"DEPT".as("my_dept"), $"SALARY")
+      .groupBy("my_dept").sum("SALARY")
+      .orderBy("my_dept")
+      .limit(1)
+    checkSortRemoved(df2)
+    checkLimitRemoved(df2)
+    checkPushedInfo(df2,
+      "PushedAggregates: [SUM(SALARY)]",
+      "PushedGroupByExpressions: [DEPT]",
+      "PushedFilters: []",
+      "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+    checkAnswer(df2, Seq(Row(1, 19000.00)))
+
+    val df3 = spark.read
+      .table("h2.test.employee")
+      .select($"SALARY",
+        when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key"))
+      .groupBy("key").sum("SALARY")
+      .orderBy("key")
+      .limit(1)
+    checkSortRemoved(df3)
+    checkLimitRemoved(df3)
+    checkPushedInfo(df3,
+      "PushedAggregates: [SUM(SALARY)]",
+      "PushedGroupByExpressions: " +
+        "[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]",
+      "PushedFilters: []",
+      "PushedTopN: ORDER BY [" +
+        "CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " +
+        "ASC NULLS FIRST] LIMIT 1")
+    checkAnswer(df3, Seq(Row(0, 44000.00)))
+
+    val df4 = spark.read
+      .table("h2.test.employee")
+      .groupBy("dept").sum("SALARY")
+      .orderBy($"dept".gt(1))

Review Comment: OK

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
beliefer commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r939485083

## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ##

@@ -864,6 +851,253 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel
     checkAnswer(df2, Seq(Row(2, "david", 1.00)))
   }
+  test("scan with aggregate push-down and top N push-down") {
+    val df1 = spark.read
+      .table("h2.test.employee")
+      .groupBy("DEPT").sum("SALARY")
+      .orderBy("DEPT")
+      .limit(1)
+    checkSortRemoved(df1)
+    checkLimitRemoved(df1)
+    checkPushedInfo(df1,
+      "PushedAggregates: [SUM(SALARY)]",
+      "PushedGroupByExpressions: [DEPT]",
+      "PushedFilters: []",
+      "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+    checkAnswer(df1, Seq(Row(1, 19000.00)))
+
+    val df2 = spark.read
+      .table("h2.test.employee")
+      .select($"DEPT".as("my_dept"), $"SALARY")

Review Comment: OK
[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
beliefer commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r939479595

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ##

@@ -408,16 +411,27 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper {
         }
         (operation, isPushed && !isPartiallyPushed)
       case s @ Sort(order, _, operation @ ScanOperation(project, filter, sHolder: ScanBuilderHolder))
-        // Without building the Scan, we do not know the resulting column names after aggregate
-        // push-down, and thus can't push down Top-N which needs to know the ordering column names.
-        // TODO: we can support simple cases like GROUP BY columns directly and ORDER BY the same
-        // columns, which we know the resulting column names: the original table columns.
-        if sHolder.pushedAggregate.isEmpty && filter.isEmpty &&
-          CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) =>
+        if filter.isEmpty &&
+          CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) =>
         val aliasMap = getAliasMap(project)
-        val newOrder = order.map(replaceAlias(_, aliasMap)).asInstanceOf[Seq[SortOrder]]
+        val aliasReplacedOrder = order.map(replaceAlias(_, aliasMap))
+        val newOrder = if (sHolder.pushedAggregate.isDefined) {
+          // Without building the Scan, Aggregate push-down give the expected output starts with
+          // `group_col_` or `agg_func_`. When Aggregate push-down working with Sort for top n
+          // push-down, we need replace these expected output with the origin expressions.
+          aliasReplacedOrder.map {
+            _.transform {
+              case a: Attribute => sHolder.pushedAggOutputMap.getOrElse(a, a)
+            }.asInstanceOf[SortOrder]
+          }
+        } else {
+          aliasReplacedOrder.asInstanceOf[Seq[SortOrder]]
+        }
         val normalizedOrders = DataSourceStrategy.normalizeExprs(
           newOrder, sHolder.relation.output).asInstanceOf[Seq[SortOrder]]
+        // Because V2ExpressionBuilder can't translate aggregate functions, so we can't

Review Comment: OK
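The hunk above rewrites sort expressions that reference the synthetic aggregate output (attributes named like `group_col_0` or `agg_func_0`) back to the original expressions via `pushedAggOutputMap` before translation. A Spark-free sketch of that substitution, with a toy expression tree and purely illustrative names (these are not Spark's classes):

```scala
// Toy model of the attribute substitution in the diff above: sort keys that
// point at synthetic aggregate output are mapped back to original expressions.
// Attr/Col/Sum and all names below are illustrative, not Spark internals.
object SortKeyRewrite {
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Col(name: String) extends Expr
  final case class Sum(child: Expr) extends Expr

  // Replace attributes that appear in the push-down output map; leave the rest.
  def rewrite(order: Seq[Expr], pushedAggOutputMap: Map[Attr, Expr]): Seq[Expr] =
    order.map {
      case a: Attr => pushedAggOutputMap.getOrElse(a, a)
      case other   => other
    }

  def main(args: Array[String]): Unit = {
    val m: Map[Attr, Expr] = Map(
      Attr("group_col_0") -> Col("DEPT"),
      Attr("agg_func_0")  -> Sum(Col("SALARY")))
    println(rewrite(Seq(Attr("group_col_0"), Attr("agg_func_0")), m))
  }
}
```

The real rule performs the same lookup inside `transform` over each `SortOrder`, so nested references inside a sort expression are rewritten as well.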
[GitHub] [spark] AmplabJenkins commented on pull request #37408: [SPARK-39982][WIP] Add doc string to StructType.fromJson
AmplabJenkins commented on PR #37408: URL: https://github.com/apache/spark/pull/37408#issuecomment-1207133120 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #37404: [SPARK-39866][SQL] Memory leak when closing a session of Spark ThriftServer
AmplabJenkins commented on PR #37404: URL: https://github.com/apache/spark/pull/37404#issuecomment-1207133142 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #37409: [SPARK-39970][CORE] Introduce ThrottledLogger to prevent log message flooding caused by network issues
AmplabJenkins commented on PR #37409: URL: https://github.com/apache/spark/pull/37409#issuecomment-1207133115 Can one of the admins verify this patch?
[GitHub] [spark] beliefer commented on pull request #37389: [SPARK-39963][SQL] Simplify `SimplifyCasts.isWiderCast`
beliefer commented on PR #37389: URL: https://github.com/apache/spark/pull/37389#issuecomment-1207131840 @cloud-fan @dongjoon-hyun @viirya @gengliangwang Thank you! @xkrogen Thank you for your comments.
[GitHub] [spark] beliefer commented on pull request #37391: [SPARK-39964][SQL] DS V2 pushdown should unify the translate path
beliefer commented on PR #37391: URL: https://github.com/apache/spark/pull/37391#issuecomment-1207131948 @cloud-fan @huaxingao Thank you for your review.
[GitHub] [spark] AmplabJenkins commented on pull request #37411: [SPARK-39984][CORE] Check workerLastHeartbeat with master before HeartbeatReceiver expires an executor
AmplabJenkins commented on PR #37411: URL: https://github.com/apache/spark/pull/37411#issuecomment-1207116271 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #37413: [SPARK-39983][CORE][SQL] Do not cache unserialized broadcast relations on the driver
AmplabJenkins commented on PR #37413: URL: https://github.com/apache/spark/pull/37413#issuecomment-1207116255 Can one of the admins verify this patch?
[GitHub] [spark] github-actions[bot] commented on pull request #35948: [SPARK-38634][SQL] Support download data through thrift server
github-actions[bot] commented on PR #35948: URL: https://github.com/apache/spark/pull/35948#issuecomment-1207101576 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] HyukjinKwon closed pull request #37422: [SPARK-39992][BUILD] Upgrade `slf4j` to 1.7.36
HyukjinKwon closed pull request #37422: [SPARK-39992][BUILD] Upgrade `slf4j` to 1.7.36 URL: https://github.com/apache/spark/pull/37422
[GitHub] [spark] HyukjinKwon commented on pull request #37422: [SPARK-39992][BUILD] Upgrade `slf4j` to 1.7.36
HyukjinKwon commented on PR #37422: URL: https://github.com/apache/spark/pull/37422#issuecomment-1207092774 Merged to master.
[GitHub] [spark] AmplabJenkins commented on pull request #37416: [SPARK-39743][DOCS] Updated some spark.io.compression configuration descriptions to clarify parameter applica…
AmplabJenkins commented on PR #37416: URL: https://github.com/apache/spark/pull/37416#issuecomment-1207088912 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #37417: [SPARK-33782][K8s][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster
AmplabJenkins commented on PR #37417: URL: https://github.com/apache/spark/pull/37417#issuecomment-1207088896 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled
AmplabJenkins commented on PR #37419: URL: https://github.com/apache/spark/pull/37419#issuecomment-1207088877 Can one of the admins verify this patch?
[GitHub] [spark] amaliujia commented on a diff in pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
amaliujia commented on code in PR #37287: URL: https://github.com/apache/spark/pull/37287#discussion_r939283477

## sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala: ##

@@ -90,30 +92,31 @@ abstract class Catalog {
   /**
    * Returns a list of columns for the given table/view or temporary view.
    *
-   * @param tableName is either a qualified or unqualified name that designates a table/view.
-   *                  If no database identifier is provided, it refers to a temporary view or
-   *                  a table/view in the current database.
+   * @param tableName is either a qualified or unqualified name that designates a table/view. It
+   *                  follows the same resolution rule with SQL: search for temp views first then
+   *                  table/views in the current database (namespace).
    * @since 2.0.0
    */
   @throws[AnalysisException]("table does not exist")
   def listColumns(tableName: String): Dataset[Column]

   /**
-   * Returns a list of columns for the given table/view in the specified database.
+   * Returns a list of columns for the given table/view in the specified database under the Hive
+   * Metastore.

Review Comment: +1, this looks nice to explicitly say HMS
[GitHub] [spark] andygrove opened a new pull request, #37424: [SPARK-39991][SQL][AQE] Use available column statistics from completed query stages
andygrove opened a new pull request, #37424: URL: https://github.com/apache/spark/pull/37424

### What changes were proposed in this pull request?
AQE uses statistics from completed query stages and feeds them back into the logical optimizer. AQE currently only uses `dataSize` and `numOutputRows` and ignores any available `attributeMap` (column statistics). This PR updates AQE to also populate `attributeMap` in the statistics that it uses for re-optimizing the plan.

### Why are the changes needed?
These changes are needed so that Spark plugins that provide custom implementations of the `ShuffleExchangeLike` trait can leverage column statistics for better plan optimization during AQE execution.

### Does this PR introduce _any_ user-facing change?
No. The current Spark implementation of `ShuffleExchangeLike` (`ShuffleExchangeExec`) does not populate `attributeMap`, so this PR is a no-op for regular Spark.

### How was this patch tested?
New unit test added.
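The statistics shape this PR describes can be sketched without Spark. Field names below loosely mirror Catalyst's `Statistics`/`ColumnStat` but are hypothetical stand-ins, not Spark's actual API: a completed stage's statistics carry per-column stats alongside size and row count, and a re-optimizer can consult them.

```scala
// Illustrative model of stage statistics that include per-column statistics
// (the "attributeMap" the PR feeds back into the logical optimizer).
// All names here are assumptions for the sketch, not Spark's classes.
object StageStatsSketch {
  final case class ColumnStat(distinctCount: Option[BigInt], nullCount: Option[BigInt])
  final case class Statistics(
      sizeInBytes: BigInt,
      rowCount: Option[BigInt],
      attributeStats: Map[String, ColumnStat])

  // A re-optimizer could, for example, use distinct counts to estimate join
  // selectivity instead of relying on size and row count alone.
  def distinctCountOf(stats: Statistics, column: String): Option[BigInt] =
    stats.attributeStats.get(column).flatMap(_.distinctCount)

  def main(args: Array[String]): Unit = {
    val s = Statistics(
      sizeInBytes = BigInt(1 << 20),
      rowCount = Some(BigInt(10000)),
      attributeStats = Map("DEPT" -> ColumnStat(Some(BigInt(6)), Some(BigInt(0)))))
    println(distinctCountOf(s, "DEPT"))
  }
}
```

The PR's point is the last field: previously only the equivalents of `sizeInBytes` and `rowCount` were propagated from completed stages, so column-level information was lost during re-optimization.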
[GitHub] [spark] dtenedor commented on pull request #37423: [SPARK-39985][SQL] Enable implicit DEFAULT column values in inserts from DataFrames
dtenedor commented on PR #37423: URL: https://github.com/apache/spark/pull/37423#issuecomment-1206863387 Hi @gengliangwang this fixes a small bug in column DEFAULT values with DataFrames, and adds some testing.
[GitHub] [spark] dtenedor opened a new pull request, #37423: [SPARK-39985][SQL] Enable implicit DEFAULT column values in inserts from DataFrames
dtenedor opened a new pull request, #37423: URL: https://github.com/apache/spark/pull/37423

### What changes were proposed in this pull request?
Enable implicit DEFAULT column values in inserts from DataFrames. This mostly already worked since the DataFrame inserts already converted to LogicalPlans. I added testing and a small analysis change since the operators are resolved one-by-one instead of all at once. Note that explicit column "default" references are not supported in write operations from DataFrames: since the operators are resolved one-by-one, any `.select` referring to "default" generates a "column not found" error before any following `.insertInto`.

### Why are the changes needed?
This makes inserts from DataFrames produce the same results as those from SQL commands, for consistency and correctness.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Extended the `InsertSuite` in this PR.
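The behavior being enabled, columns omitted by an insert falling back to their declared DEFAULT values, can be sketched without Spark. The `Column`/`padWithDefaults` names below are illustrative helpers for this sketch, not the PR's code:

```scala
// Spark-free sketch of implicit DEFAULT column values: when an insert supplies
// fewer values than the table has columns, the trailing columns take their
// declared defaults. Types and names here are assumptions for illustration.
object DefaultColumnsSketch {
  final case class Column(name: String, default: Option[Any])

  def padWithDefaults(row: Seq[Any], schema: Seq[Column]): Seq[Any] = {
    require(row.length <= schema.length, "too many values for schema")
    // Columns beyond the supplied values fall back to DEFAULT (or null).
    row ++ schema.drop(row.length).map(_.default.orNull)
  }

  def main(args: Array[String]): Unit = {
    val schema = Seq(
      Column("id", None),
      Column("status", Some("active")),
      Column("score", Some(0)))
    println(padWithDefaults(Seq(42), schema))
  }
}
```

The PR's goal is that a DataFrame `.insertInto` resolves omitted columns the same way a SQL `INSERT INTO t (id) VALUES (42)` would.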
[GitHub] [spark] AmplabJenkins commented on pull request #37422: [SPARK-39992][BUILD] Upgrade `slf4j` to 1.7.36
AmplabJenkins commented on PR #37422: URL: https://github.com/apache/spark/pull/37422#issuecomment-1206819116 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #35334: [SPARK-38034][SQL] Optimize TransposeWindow rule
AmplabJenkins commented on PR #35334: URL: https://github.com/apache/spark/pull/35334#issuecomment-1206818862 Can one of the admins verify this patch?
[GitHub] [spark] bjornjorgensen opened a new pull request, #37422: [SPARK-39992] Upgrade `slf4j` to 1.7.36
bjornjorgensen opened a new pull request, #37422: URL: https://github.com/apache/spark/pull/37422

### What changes were proposed in this pull request?
Upgrade `slf4j` from 1.7.32 to 1.7.36.

### Why are the changes needed?
Snyk has [opened a PR at my branch](https://github.com/bjornjorgensen/spark/pull/19), where it wants to [upgrade slf4j](https://www.slf4j.org). The recommended version is 4 versions ahead of our current version and was released 6 months ago, on 2022-02-08. [Release log](https://www.slf4j.org/news.html)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.
[GitHub] [spark] mridulm commented on pull request #37413: [SPARK-39983][CORE][SQL] Do not cache unserialized broadcast relations on the driver
mridulm commented on PR #37413: URL: https://github.com/apache/spark/pull/37413#issuecomment-1206775251 @JoshRosen Ah yes, missed out on that - should have taken a more detailed look.
[GitHub] [spark] JoshRosen commented on pull request #37413: [SPARK-39983][CORE] Do not cache unserialized broadcast relations on the driver
JoshRosen commented on PR #37413: URL: https://github.com/apache/spark/pull/37413#issuecomment-1206762340

@mridulm, it doesn't break local mode because there's a carve-out to preserve the existing behavior in that case: in both places where the `if(serializedOnly` check changes behavior, there's a check for `isLocalMaster` to avoid behavior changes.

We'll still store the original object in the driver block manager at write time in local mode:
https://github.com/apache/spark/blob/75ab18ee0e382b8117bf65fc9ef05190d4fdf01a/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L133-L136

There's a similar carve-out in `readBroadcastBlock` (although I don't think we'd ever actually hit that branch in local mode given that we would have already stored the re-assembled broadcast block in `writeBlocks`):
https://github.com/apache/spark/blob/75ab18ee0e382b8117bf65fc9ef05190d4fdf01a/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L277-L284
[GitHub] [spark] sunchao commented on pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled
sunchao commented on PR #37419: URL: https://github.com/apache/spark/pull/37419#issuecomment-1206745959 @sadikovi in the example you gave:
```
root/
  col0=0/
    part-0001.parquet (schema: COL0)
```
what's the content of `part-0001.parquet`? I wonder why we need to push down partition filters to Parquet, given that we don't materialize the partition values in the Parquet files. What are the filters pushed down to Parquet in this example?
[GitHub] [spark] peter-toth commented on pull request #37334: [SPARK-39887][SQL] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique
peter-toth commented on PR #37334: URL: https://github.com/apache/spark/pull/37334#issuecomment-1206707424 > > Fixes AliasAwareOutputPartitioning and AliasAwareOutputOrdering as testing discovered that ... > > Can we create a new PR for this? Sure, I can open a new one for this next week.
[GitHub] [spark] peter-toth commented on a diff in pull request #37334: [SPARK-39887][SQL] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique
peter-toth commented on code in PR #37334: URL: https://github.com/apache/spark/pull/37334#discussion_r939045315 ## sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q70.sf100/explain.txt: ## @@ -157,31 +157,31 @@ Input [2]: [s_state#14, sum#16] Keys [1]: [s_state#14] Functions [1]: [sum(UnscaledValue(ss_net_profit#10))] Aggregate Attributes [1]: [sum(UnscaledValue(ss_net_profit#10))#17] -Results [3]: [s_state#14, s_state#14, MakeDecimal(sum(UnscaledValue(ss_net_profit#10))#17,17,2) AS _w2#18] +Results [3]: [s_state#14 AS s_state#18, s_state#14, MakeDecimal(sum(UnscaledValue(ss_net_profit#10))#17,17,2) AS _w2#19] Review Comment: Ok. No problem, I can revert this PR to the first version next week.
[GitHub] [spark] amaliujia commented on a diff in pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
amaliujia commented on code in PR #37287: URL: https://github.com/apache/spark/pull/37287#discussion_r939003429 ## R/pkg/tests/fulltests/test_sparkSQL.R: ## @@ -4098,14 +4098,13 @@ test_that("catalog APIs, listTables, getTable, listColumns, listFunctions, funct c("name", "description", "dataType", "nullable", "isPartition", "isBucket")) expect_equal(collect(c)[[1]][[1]], "speed") expect_error(listColumns("zxwtyswklpf", "default"), - paste("Error in listColumns : analysis error - Table", - "'zxwtyswklpf' does not exist in database 'default'")) + paste("Table or view not found: spark_catalog.default.zxwtyswklpf")) Review Comment: SG
[GitHub] [spark] amaliujia commented on a diff in pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
amaliujia commented on code in PR #37287: URL: https://github.com/apache/spark/pull/37287#discussion_r939003155 ## sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala: ## @@ -33,36 +33,37 @@ import org.apache.spark.storage.StorageLevel abstract class Catalog { /** - * Returns the current default database in this session. + * Returns the current database (namespace) in this session. Review Comment: sounds good. Just wanted to confirm that we don't miss anything obvious.
[GitHub] [spark] dongjoon-hyun closed pull request #37418: [SPARK-39987][K8S] Support `PEAK_JVM_(ON|OFF)HEAP_MEMORY` executor rolling policy
dongjoon-hyun closed pull request #37418: [SPARK-39987][K8S] Support `PEAK_JVM_(ON|OFF)HEAP_MEMORY` executor rolling policy URL: https://github.com/apache/spark/pull/37418
[GitHub] [spark] dongjoon-hyun commented on pull request #37418: [SPARK-39987][K8S] Support `PEAK_JVM_(ON|OFF)HEAP_MEMORY` executor rolling policy
dongjoon-hyun commented on PR #37418: URL: https://github.com/apache/spark/pull/37418#issuecomment-1206646328 Thank you, @viirya ! Merged to master.
[GitHub] [spark] cloud-fan commented on pull request #37334: [SPARK-39887][SQL] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique
cloud-fan commented on PR #37334: URL: https://github.com/apache/spark/pull/37334#issuecomment-1206635362 > Fixes AliasAwareOutputPartitioning and AliasAwareOutputOrdering as testing discovered that ... Can we create a new PR for this?
[GitHub] [spark] cloud-fan commented on a diff in pull request #37334: [SPARK-39887][SQL] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique
cloud-fan commented on code in PR #37334: URL: https://github.com/apache/spark/pull/37334#discussion_r938976805 ## sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q70.sf100/explain.txt: ## @@ -157,31 +157,31 @@ Input [2]: [s_state#14, sum#16] Keys [1]: [s_state#14] Functions [1]: [sum(UnscaledValue(ss_net_profit#10))] Aggregate Attributes [1]: [sum(UnscaledValue(ss_net_profit#10))#17] -Results [3]: [s_state#14, s_state#14, MakeDecimal(sum(UnscaledValue(ss_net_profit#10))#17,17,2) AS _w2#18] +Results [3]: [s_state#14 AS s_state#18, s_state#14, MakeDecimal(sum(UnscaledValue(ss_net_profit#10))#17,17,2) AS _w2#19] Review Comment: I'm surprised that this query plan works fine today (two `s_state#14`). Reading the source code a bit more, I think Spark is actually fine with duplicated attribute ids. It assumes columns with the same attr id always output the same values, so it can safely bind attributes to the first match. See `BindReferences`. So the problem is still with `Union`: it's not fine to introduce duplicated attr ids in the first child of Union, as that introduces duplicated attr ids in the output of Union, which causes wrong results. I think eventually we should make `Union` use new attr ids to build its output columns. As a bug fix, your first attempt to simply not remove aliases in the first child of Union looks like the best one: it's simple, and a few unnecessary aliases won't impact performance. Sorry for the back and forth!
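The "first match wins" binding that makes duplicated attribute ids dangerous can be modeled outside Spark. A small sketch (the `Attr` and `bindOrdinal` names are hypothetical; Spark's real logic lives in `BindReferences`): once two distinct columns share an id, references to the second column silently resolve to the first.

```scala
// Hypothetical model of attribute-id binding; not Spark's actual classes.
object DupAttrIdDemo {
  final case class Attr(name: String, id: Long)

  // Resolve an attribute to an ordinal in the child's output by id,
  // taking the first match -- the behavior described in the comment above.
  def bindOrdinal(attr: Attr, output: Seq[Attr]): Int =
    output.indexWhere(_.id == attr.id)

  def main(args: Array[String]): Unit = {
    // A projection that reuses id 14 for two logically different columns,
    // like the two s_state#14 entries in the plan above:
    val dup = Seq(Attr("s_state", 14L), Attr("s_state_copy", 14L))
    println(bindOrdinal(Attr("s_state", 14L), dup))      // 0
    println(bindOrdinal(Attr("s_state_copy", 14L), dup)) // also 0: wrong column
  }
}
```

Inside one operator this is harmless when both columns really hold the same values, which is why the plan "works fine today"; under `Union`, where the duplicated ids leak into the parent's output, it produces wrong results.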
[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8s][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster m
pralabhkumar commented on PR #37417: URL: https://github.com/apache/spark/pull/37417#issuecomment-1206608039 @HyukjinKwon Please review; the build is failing because of an unrelated error (it passes locally).
[GitHub] [spark] HyukjinKwon closed pull request #37403: [SPARK-39980][INFRA] Change infra image to static tag: `ubuntu:focal-20220801`
HyukjinKwon closed pull request #37403: [SPARK-39980][INFRA] Change infra image to static tag: `ubuntu:focal-20220801` URL: https://github.com/apache/spark/pull/37403
[GitHub] [spark] HyukjinKwon commented on pull request #37403: [SPARK-39980][INFRA] Change infra image to static tag: `ubuntu:focal-20220801`
HyukjinKwon commented on PR #37403: URL: https://github.com/apache/spark/pull/37403#issuecomment-1206478723 Merged to master.
[GitHub] [spark] smallzhongfeng commented on pull request #37395: [SPARK-39967][CORE] Use the successful array to count the number of t…
smallzhongfeng commented on PR #37395: URL: https://github.com/apache/spark/pull/37395#issuecomment-1206476150 At present, on the `master` branch, I cannot reproduce the same situation. I agree with you, @mridulm. So I'm going to close this PR temporarily.
[GitHub] [spark] smallzhongfeng closed pull request #37395: [SPARK-39967][CORE] Use the successful array to count the number of t…
smallzhongfeng closed pull request #37395: [SPARK-39967][CORE] Use the successful array to count the number of t… URL: https://github.com/apache/spark/pull/37395
[GitHub] [spark] ivoson commented on pull request #37268: [SPARK-39853][CORE] Support stage level task resource profile for standalone cluster when dynamic allocation disabled
ivoson commented on PR #37268: URL: https://github.com/apache/spark/pull/37268#issuecomment-1206460341
> ```
> /**
>  * Return resource profile Ids of executors where tasks can be assigned to.
>  */
> def compatibleExecutorRpIds(rpMgr: ResourceProfileManager): Set[Int]
> ```
>
> It seems a little bit odd to ask the ResourceProfile to give you compatible other ResourceProfiles. This feels like it should be in the ResourceProfileManager, which knows about all the ResourceProfiles. I guess that is why you pass in the ResourceProfileManager here? Is the intention that the user could explicitly set which ResourceProfiles it's compatible with? If so, I definitely would want a way to not have to specify it.

Yes, exactly. I put the `ResourceProfileManager` here because it knows about all the ResourceProfiles. Adding this API is just to make sure that we have one interface for the scheduler to get compatible RP ids when assigning tasks. The implementation can be further enriched in the future, for example to re-use executors with dynamic allocation on and to add more reuse policies, as #33941 does. And we won't let users specify compatible ResourceProfiles; in this case, only `TaskResourceProfile` will be compatible with the default ResourceProfile.

> The other issue raised that wasn't addressed was the reuse policy. I guess in this case we are limiting the executor profile to 1 because we don't have dynamic allocation, so one could argue that if you use a task resource request with that, you know what you get. Which I am fine with, but we need to be clear that it might very well waste resources.
>
> Also, if the intent is to not support TaskResourceProfile with dynamic allocation, I think we should throw an exception if anyone uses it with the dynamic allocation config on.

As mentioned above, in this case `TaskResourceProfile` will only re-use executors with the default ResourceProfile when dynamic allocation is off. The behavior will be limited to dynamic allocation off, and we will throw an exception if the user uses it with dynamic allocation on. Does this behavior change make sense? @tgravescs
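The rule being agreed on above can be sketched in a few lines (all names here — `RpCompatSketch`, `Profile`, `compatibleExecutorRpIds` taking a flag — are hypothetical illustrations, not the actual Spark API under review): a `TaskResourceProfile` is compatible with its own executors and the default profile's executors when dynamic allocation is off, and is rejected outright when dynamic allocation is on.

```scala
// Hypothetical sketch of the discussed rule; not Spark's actual API.
object RpCompatSketch {
  val DefaultProfileId = 0

  sealed trait Profile { def id: Int }
  final case class DefaultProfile() extends Profile { val id: Int = DefaultProfileId }
  final case class TaskResourceProfile(id: Int) extends Profile

  // Which executor resource-profile ids can run tasks from `rp`?
  def compatibleExecutorRpIds(rp: Profile, dynamicAllocationEnabled: Boolean): Set[Int] =
    rp match {
      case TaskResourceProfile(_) if dynamicAllocationEnabled =>
        // Per the discussion: reject TaskResourceProfile when dynamic allocation is on.
        throw new IllegalArgumentException(
          "TaskResourceProfile is not supported with dynamic allocation enabled")
      case TaskResourceProfile(id) =>
        // Tasks may also land on executors launched with the default profile.
        Set(id, DefaultProfileId)
      case other =>
        Set(other.id)
    }
}
```

For example, `compatibleExecutorRpIds(TaskResourceProfile(3), dynamicAllocationEnabled = false)` yields `Set(3, 0)`, matching the "re-use default-profile executors" behavior described in the comment.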
[GitHub] [spark] yikf commented on pull request #37254: [SPARK-39841][SQL] simplify conflict binary comparison
yikf commented on PR #37254: URL: https://github.com/apache/spark/pull/37254#issuecomment-1206452674 Friendly ping @gengliangwang @cloud-fan @sigmod. Sorry for the late reply, I was busy last week. I fixed the comments you left; please take a look again when you have time~
[GitHub] [spark] cloud-fan commented on pull request #37362: [SPARK-39950][SQL] It's unnecessary to materialize BroadcastQueryStage firstly, because the BroadcastQueryStage does not timeout in AQE.
cloud-fan commented on PR #37362: URL: https://github.com/apache/spark/pull/37362#issuecomment-1206450655 Looks fine, but let's think a bit more and see if there is any benefit to submitting broadcast jobs first. cc @yaooqinn @maryannxue
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938796458 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("DEPT", "IS_MANAGER").sum("SALARY") + .orderBy("DEPT", "IS_MANAGER") + .limit(1) 
+checkSortRemoved(df4) +checkLimitRemoved(df4) +checkPushedInfo(df4, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df4, Seq(Row(1, false, 9000.00))) + +val df5 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY") + .groupBy("my_dept", "my_manager").sum("SALARY") + .orderBy("my_dept", "my_manager") + .limit(1) +checkSortRemoved(df5) +checkLimitRemoved(df5) +checkPushedInfo(df5, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df5, Seq(Row(1, false, 9000.00))) + +val df6 = spark.read + .table("h2.test.employee") + .select($"SALARY", $"IS_MANAGER", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key", "IS_MANAGER").sum("SALARY") + .orderBy("key", "IS_MANAGER") + .limit(1) +checkSortRemoved(df6) +checkLimitRemoved(df6) +checkPushedInfo(df6, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END, " + +"IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df6, Seq(Row(0.00, false, 12000.00))) + +val df7 = spark.read + .table("h2.test.employee") + .select($"DEPT", $"SALARY") + .groupBy("dept").agg(sum("SALARY")) + .orderBy(sum("SALARY")) + .limit(1) +checkSortRemoved(df7, false) +checkLimitRemoved(df7, false) +checkPushedInfo(df7, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []") +checkAnswer(df7, Seq(Row(6, 12000.00))) + +val df8 = spark.read + 
.table("h2.test.employee") + .select($"DEPT", $"SALARY") + .groupBy("dept").agg(sum("SALARY").as("total")) + .orderBy("total") + .limit(1) +checkSortRemoved(df8, false) +checkLimitRemoved(df8, false) +checkPushedInfo(df8, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []") +checkAnswer(df8, Seq(Row(6, 12000.00))) + } + +
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938794884 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("DEPT", "IS_MANAGER").sum("SALARY") + .orderBy("DEPT", "IS_MANAGER") + .limit(1) 
+checkSortRemoved(df4) +checkLimitRemoved(df4) +checkPushedInfo(df4, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df4, Seq(Row(1, false, 9000.00))) + +val df5 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY") + .groupBy("my_dept", "my_manager").sum("SALARY") + .orderBy("my_dept", "my_manager") + .limit(1) +checkSortRemoved(df5) +checkLimitRemoved(df5) +checkPushedInfo(df5, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df5, Seq(Row(1, false, 9000.00))) + +val df6 = spark.read + .table("h2.test.employee") + .select($"SALARY", $"IS_MANAGER", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key", "IS_MANAGER").sum("SALARY") + .orderBy("key", "IS_MANAGER") + .limit(1) +checkSortRemoved(df6) +checkLimitRemoved(df6) +checkPushedInfo(df6, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END, " + +"IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df6, Seq(Row(0.00, false, 12000.00))) + +val df7 = spark.read + .table("h2.test.employee") + .select($"DEPT", $"SALARY") + .groupBy("dept").agg(sum("SALARY")) + .orderBy(sum("SALARY")) Review Comment: We can translate agg expressions now, why this test still can't trigger top-n pushdown? -- This is an automated message from the Apache Git Service. 
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938794073 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("DEPT", "IS_MANAGER").sum("SALARY") + .orderBy("DEPT", "IS_MANAGER") + .limit(1) 
+checkSortRemoved(df4) +checkLimitRemoved(df4) +checkPushedInfo(df4, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df4, Seq(Row(1, false, 9000.00))) + +val df5 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY") + .groupBy("my_dept", "my_manager").sum("SALARY") + .orderBy("my_dept", "my_manager") + .limit(1) +checkSortRemoved(df5) +checkLimitRemoved(df5) +checkPushedInfo(df5, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df5, Seq(Row(1, false, 9000.00))) + +val df6 = spark.read + .table("h2.test.employee") + .select($"SALARY", $"IS_MANAGER", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key", "IS_MANAGER").sum("SALARY") + .orderBy("key", "IS_MANAGER") + .limit(1) +checkSortRemoved(df6) +checkLimitRemoved(df6) +checkPushedInfo(df6, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END, " + +"IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df6, Seq(Row(0.00, false, 12000.00))) + +val df7 = spark.read + .table("h2.test.employee") + .select($"DEPT", $"SALARY") + .groupBy("dept").agg(sum("SALARY")) + .orderBy(sum("SALARY")) Review Comment: how about ``` .groupBy("dept").agg(sum("SALARY").as("sum_salary")) .orderBy("sum_salary") ``` -- This is an automated message from the Apache Git Service. 
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938792943 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,253 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("dept").sum("SALARY") + .orderBy($"dept".gt(1)) + .limit(1) +checkSortRemoved(df4) 
+checkLimitRemoved(df4) +checkPushedInfo(df4, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT > 1 ASC NULLS FIRST] LIMIT 1") +checkAnswer(df4, Seq(Row(1, 19000.00))) + +val df5 = spark.read + .table("h2.test.employee") + .groupBy("DEPT", "IS_MANAGER").sum("SALARY") + .orderBy("DEPT", "IS_MANAGER") + .limit(1) +checkSortRemoved(df5) +checkLimitRemoved(df5) +checkPushedInfo(df5, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df5, Seq(Row(1, false, 9000.00))) + +val df6 = spark.read + .table("h2.test.employee") + .select($"SALARY", $"IS_MANAGER", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) Review Comment: let's use case instead of case when, which is simpler and easier to review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938792433 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,253 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("dept").sum("SALARY") + .orderBy($"dept".gt(1)) Review Comment: let's use cast now. 
group by a predicate is super weird.
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938791655 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,253 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") Review Comment: let's add a cast instead of just a simple alias ``` .select($"DEPT".cast("string").as("my_dept"), $"SALARY") ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938792037 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -864,6 +851,253 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = spark.read + .table("h2.test.employee") + .select($"DEPT".as("my_dept"), $"SALARY") + .groupBy("my_dept").sum("SALARY") + .orderBy("my_dept") + .limit(1) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) Review Comment: we can remove this test after you address https://github.com/apache/spark/pull/37320/files#r938791655 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938790245 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -408,16 +411,27 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper { } (operation, isPushed && !isPartiallyPushed) case s @ Sort(order, _, operation @ ScanOperation(project, filter, sHolder: ScanBuilderHolder)) -// Without building the Scan, we do not know the resulting column names after aggregate -// push-down, and thus can't push down Top-N which needs to know the ordering column names. -// TODO: we can support simple cases like GROUP BY columns directly and ORDER BY the same -// columns, which we know the resulting column names: the original table columns. -if sHolder.pushedAggregate.isEmpty && filter.isEmpty && - CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) => + if filter.isEmpty && +CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) => val aliasMap = getAliasMap(project) - val newOrder = order.map(replaceAlias(_, aliasMap)).asInstanceOf[Seq[SortOrder]] + val aliasReplacedOrder = order.map(replaceAlias(_, aliasMap)) + val newOrder = if (sHolder.pushedAggregate.isDefined) { +// Without building the Scan, Aggregate push-down give the expected output starts with +// `group_col_` or `agg_func_`. When Aggregate push-down working with Sort for top n +// push-down, we need replace these expected output with the origin expressions. Review Comment: ``` // `ScanBuilderHolder` has different output columns after aggregate pushdown. Here we // replace the attributes in ordering expressions with the original table output columns. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
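The suggested comment above describes mapping the aggregate-pushdown output attributes (named with `group_col_` / `agg_func_` prefixes before the scan is built) back to the original table columns before the ordering is translated. A minimal sketch of that rewrite step, for illustration only: plain strings stand in for Catalyst attributes, and the map below is a hypothetical `pushedAggOutputMap` — in Spark the rewrite runs over `SortOrder` expression trees, not flat lists.

```python
# Hypothetical pushedAggOutputMap: pushdown output attribute -> original column.
pushed_agg_output_map = {"group_col_0": "DEPT", "agg_func_0": "SUM(SALARY)"}

def rewrite_ordering(order_exprs):
    # Replace pushdown output attributes with the original table columns;
    # anything not in the map is left untouched.
    return [pushed_agg_output_map.get(e, e) for e in order_exprs]

print(rewrite_ordering(["group_col_0", "other_col"]))
```

With this mapping in place, a Top-N over grouped output can be expressed in terms of the source table's own columns and pushed down.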
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r938790445 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -408,16 +411,27 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper { } (operation, isPushed && !isPartiallyPushed) case s @ Sort(order, _, operation @ ScanOperation(project, filter, sHolder: ScanBuilderHolder)) -// Without building the Scan, we do not know the resulting column names after aggregate -// push-down, and thus can't push down Top-N which needs to know the ordering column names. -// TODO: we can support simple cases like GROUP BY columns directly and ORDER BY the same -// columns, which we know the resulting column names: the original table columns. -if sHolder.pushedAggregate.isEmpty && filter.isEmpty && - CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) => + if filter.isEmpty && +CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) => val aliasMap = getAliasMap(project) - val newOrder = order.map(replaceAlias(_, aliasMap)).asInstanceOf[Seq[SortOrder]] + val aliasReplacedOrder = order.map(replaceAlias(_, aliasMap)) + val newOrder = if (sHolder.pushedAggregate.isDefined) { +// Without building the Scan, Aggregate push-down give the expected output starts with +// `group_col_` or `agg_func_`. When Aggregate push-down working with Sort for top n +// push-down, we need replace these expected output with the origin expressions. 
+aliasReplacedOrder.map { + _.transform { +case a: Attribute => sHolder.pushedAggOutputMap.getOrElse(a, a) + }.asInstanceOf[SortOrder] +} + } else { +aliasReplacedOrder.asInstanceOf[Seq[SortOrder]] + } val normalizedOrders = DataSourceStrategy.normalizeExprs( newOrder, sHolder.relation.output).asInstanceOf[Seq[SortOrder]] + // Because V2ExpressionBuilder can't translate aggregate functions, so we can't Review Comment: we can remove this TODO now. ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -408,16 +411,27 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper { } (operation, isPushed && !isPartiallyPushed) case s @ Sort(order, _, operation @ ScanOperation(project, filter, sHolder: ScanBuilderHolder)) -// Without building the Scan, we do not know the resulting column names after aggregate -// push-down, and thus can't push down Top-N which needs to know the ordering column names. -// TODO: we can support simple cases like GROUP BY columns directly and ORDER BY the same -// columns, which we know the resulting column names: the original table columns. -if sHolder.pushedAggregate.isEmpty && filter.isEmpty && - CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) => + if filter.isEmpty && +CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) => val aliasMap = getAliasMap(project) - val newOrder = order.map(replaceAlias(_, aliasMap)).asInstanceOf[Seq[SortOrder]] + val aliasReplacedOrder = order.map(replaceAlias(_, aliasMap)) + val newOrder = if (sHolder.pushedAggregate.isDefined) { +// Without building the Scan, Aggregate push-down give the expected output starts with +// `group_col_` or `agg_func_`. When Aggregate push-down working with Sort for top n +// push-down, we need replace these expected output with the origin expressions. 
+aliasReplacedOrder.map { + _.transform { +case a: Attribute => sHolder.pushedAggOutputMap.getOrElse(a, a) + }.asInstanceOf[SortOrder] +} + } else { +aliasReplacedOrder.asInstanceOf[Seq[SortOrder]] + } val normalizedOrders = DataSourceStrategy.normalizeExprs( newOrder, sHolder.relation.output).asInstanceOf[Seq[SortOrder]] + // Because V2ExpressionBuilder can't translate aggregate functions, so we can't Review Comment: and let's add tests for it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37393: [SPARK-39966][SQL] Use V2 Filter in SupportsDelete
cloud-fan commented on code in PR #37393: URL: https://github.com/apache/spark/pull/37393#discussion_r938782700 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala: ## @@ -277,14 +277,14 @@ class DataSourceV2Strategy(session: SparkSession) extends Strategy with Predicat // correctness depends on removing all matching data. val filters = DataSourceStrategy.normalizeExprs(Seq(condition), output) .flatMap(splitConjunctivePredicates(_).map { -f => DataSourceStrategy.translateFilter(f, true).getOrElse( +f => DataSourceV2Strategy.translateFilterV2(f).getOrElse( throw QueryCompilationErrors.cannotTranslateExpressionToSourceFilterError(f)) Review Comment: `SupportsDelete` is different from runtime filtering. For runtime filtering, we need to push as many predicates as possible, but this is just a perf improvement. `SupportsDelete` must take all predicates, otherwise the source can't delete enough data which leads to wrong query result later. We need to follow the same behavior in https://github.com/apache/spark/pull/37393/files#diff-dc485c81773a73a5a462994af50e17a5043a8d66c47399cf29b0a3cb56c85591R80
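The distinction drawn in the review comment above — best-effort translation for runtime filtering versus all-or-nothing translation for DELETE — can be sketched as follows. The `translate` function and predicate shapes here are hypothetical stand-ins, not Spark's actual `translateFilterV2`:

```python
def translate(pred):
    # Hypothetical translator: pretend only equality predicates translate.
    return pred if "=" in pred else None

def translate_for_delete(predicates):
    # DELETE semantics: every predicate must translate, otherwise fail up
    # front -- silently dropping one would delete too little data and yield
    # wrong query results later.
    translated = []
    for p in predicates:
        t = translate(p)
        if t is None:
            raise ValueError(f"cannot translate predicate: {p}")
        translated.append(t)
    return translated

def translate_for_runtime_filter(predicates):
    # Runtime filtering is purely a performance optimization, so predicates
    # that cannot be translated are simply skipped.
    return [t for t in map(translate, predicates) if t is not None]
```

The same query condition thus either fails fast (DELETE path) or is pushed partially (runtime-filtering path), which is the behavioral difference the reviewer asks to preserve.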
[GitHub] [spark] cloud-fan commented on a diff in pull request #37393: [SPARK-39966][SQL] Use V2 Filter in SupportsDelete
cloud-fan commented on code in PR #37393: URL: https://github.com/apache/spark/pull/37393#discussion_r938778661 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsDelete.java: ## @@ -28,7 +30,7 @@ * @since 3.0.0 */ @Evolving -public interface SupportsDelete extends TruncatableTable { +public interface SupportsDelete extends TruncatableTable, SupportsDeleteV2 { Review Comment: ```suggestion public interface SupportsDelete extends SupportsDeleteV2 { ```
[GitHub] [spark] zhengruifeng commented on a diff in pull request #37335: [SPARK-39895][PYTHON] Support multiple column drop
zhengruifeng commented on code in PR #37335: URL: https://github.com/apache/spark/pull/37335#discussion_r938728719 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala: ## @@ -826,6 +826,16 @@ class DataFrameSuite extends QueryTest assert(df.schema.map(_.name) === Seq("key", "value")) } + test("drop two column references") { Review Comment: ```suggestion test("SPARK-39895: drop two column references") { ``` ## python/pyspark/sql/tests/test_dataframe.py: ## @@ -87,6 +87,21 @@ def test_help_command(self): pydoc.render_doc(df.foo) pydoc.render_doc(df.take(1)) +def test_drop(self): +df = self.spark.createDataFrame([("A", 50, "Y"), ("B", 60, "Y")], ["name", "age", "active"]) + Review Comment: I think we can remove these empty lines here
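The feature under test above — dropping several columns in a single `drop` call — amounts to a set-difference over the schema. A minimal plain-Python stand-in of that semantics (no Spark required; the function name and ignore-missing behavior here are illustrative assumptions, not PySpark's implementation):

```python
def drop_columns(schema, *cols):
    # Mimics the multi-column drop under test: remove every named column in
    # one call; names not present in the schema are ignored rather than
    # raising, preserving order of the remaining columns.
    to_drop = set(cols)
    return [c for c in schema if c not in to_drop]

print(drop_columns(["name", "age", "active"], "age", "active"))
```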
[GitHub] [spark] cloud-fan commented on pull request #35334: [SPARK-38034][SQL] Optimize TransposeWindow rule
cloud-fan commented on PR #35334: URL: https://github.com/apache/spark/pull/35334#issuecomment-1206308133 thanks, merging to master/3.3/3.2!
[GitHub] [spark] cloud-fan closed pull request #35334: [SPARK-38034][SQL] Optimize TransposeWindow rule
cloud-fan closed pull request #35334: [SPARK-38034][SQL] Optimize TransposeWindow rule URL: https://github.com/apache/spark/pull/35334
[GitHub] [spark] cloud-fan commented on pull request #35334: [SPARK-38034][SQL] Optimize TransposeWindow rule
cloud-fan commented on PR #35334: URL: https://github.com/apache/spark/pull/35334#issuecomment-1206305491 I think the previous O(n!) time complexity code is unexpected and buggy. Let's backport this PR.
[GitHub] [spark] cloud-fan commented on a diff in pull request #35334: [SPARK-38034][SQL] Optimize TransposeWindow rule
cloud-fan commented on code in PR #35334: URL: https://github.com/apache/spark/pull/35334#discussion_r938688689 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -1148,9 +1148,9 @@ object CollapseWindow extends Rule[LogicalPlan] { */ object TransposeWindow extends Rule[LogicalPlan] { private def compatiblePartitions(ps1 : Seq[Expression], ps2: Seq[Expression]): Boolean = { -ps1.length < ps2.length && ps2.take(ps1.length).permutations.exists(ps1.zip(_).forall { Review Comment: I mean the previous code is kind of buggy. This PR is more like a bug fix instead of perf improvement and we should backport it.
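The O(n!) behavior comes from `ps2.take(ps1.length).permutations.exists(...)`, which is factorial in the number of partition expressions and, as the quoted diff also shows, only ever inspects a prefix of `ps2`. A hedged Python sketch (strings stand in for Catalyst expressions, `==` for `semanticEquals`) contrasting the old prefix-permutation check with an order-insensitive subset check like the one this PR moves to:

```python
from itertools import permutations

def compatible_old(ps1, ps2):
    # Previous approach: permute a prefix of ps2 -- O(len(ps1)!) work, and it
    # misses matches outside the first len(ps1) elements of ps2.
    return len(ps1) < len(ps2) and any(
        all(a == b for a, b in zip(ps1, perm))
        for perm in permutations(ps2[: len(ps1)])
    )

def compatible_new(ps1, ps2):
    # Subset check: every partition expression of the smaller window must
    # appear somewhere in the larger one -- order-insensitive, no permutations.
    return len(ps1) < len(ps2) and all(any(a == b for b in ps2) for a in ps1)

# The prefix-permutation version misses this compatible pair:
print(compatible_old(["a", "c"], ["a", "b", "c"]))  # False
print(compatible_new(["a", "c"], ["a", "b", "c"]))  # True
```

This is why the comment above treats the change as a bug fix rather than a pure performance improvement: the two checks disagree on inputs like the one shown.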
[GitHub] [spark] cloud-fan closed pull request #37391: [SPARK-39964][SQL] DS V2 pushdown should unify the translate path
cloud-fan closed pull request #37391: [SPARK-39964][SQL] DS V2 pushdown should unify the translate path URL: https://github.com/apache/spark/pull/37391
[GitHub] [spark] cloud-fan commented on pull request #37391: [SPARK-39964][SQL] DS V2 pushdown should unify the translate path
cloud-fan commented on PR #37391: URL: https://github.com/apache/spark/pull/37391#issuecomment-1206303422 thanks, merging to master!
[GitHub] [spark] cloud-fan closed pull request #37389: [SPARK-39963][SQL] Simplify `SimplifyCasts.isWiderCast`
cloud-fan closed pull request #37389: [SPARK-39963][SQL] Simplify `SimplifyCasts.isWiderCast` URL: https://github.com/apache/spark/pull/37389
[GitHub] [spark] cloud-fan commented on pull request #37389: [SPARK-39963][SQL] Simplify `SimplifyCasts.isWiderCast`
cloud-fan commented on PR #37389: URL: https://github.com/apache/spark/pull/37389#issuecomment-1206302474 thanks, merging to master!
[GitHub] [spark] cloud-fan commented on pull request #37389: [SPARK-39963][SQL] Simplify `SimplifyCasts.isWiderCast`
cloud-fan commented on PR #37389: URL: https://github.com/apache/spark/pull/37389#issuecomment-1206302144 thanks, merging to master!
[GitHub] [spark] cloud-fan commented on a diff in pull request #37340: [MINOR][SQL] Improve the comments about null tracking for UnsafeRow
cloud-fan commented on code in PR #37340: URL: https://github.com/apache/spark/pull/37340#discussion_r938686329 ## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java: ## @@ -46,10 +46,10 @@ /** * An Unsafe implementation of Row which is backed by raw memory instead of Java objects. * - * Each tuple has three parts: [null bit set] [values] [variable length portion] + * Each tuple has three parts: [null-tracking bit set] [values] [variable length portion] * - * The bit set is used for null tracking and is aligned to 8-byte word boundaries. It stores - * one bit per field. + * The null-tracking bit set is used for null tracking and is aligned to 8-byte word boundaries. Review Comment: This is not addressed
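The layout being documented in the diff above: the null-tracking bit set holds one bit per field and is padded out to 8-byte word boundaries so the values region stays word-aligned. A sketch of that width calculation (mirroring, under my reading of the javadoc, what `UnsafeRow.calculateBitSetWidthInBytes` computes):

```python
def bitset_width_in_bytes(num_fields):
    # One null-tracking bit per field, rounded up to whole 64-bit (8-byte)
    # words: 0 fields -> 0 bytes, 1..64 fields -> 8 bytes, 65..128 -> 16, ...
    return ((num_fields + 63) // 64) * 8

for n in (0, 1, 64, 65):
    print(n, bitset_width_in_bytes(n))
```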
[GitHub] [spark] cloud-fan closed pull request #37169: [SPARK-38901][SQL] DS V2 supports push down misc functions
cloud-fan closed pull request #37169: [SPARK-38901][SQL] DS V2 supports push down misc functions URL: https://github.com/apache/spark/pull/37169
[GitHub] [spark] cloud-fan commented on pull request #37169: [SPARK-38901][SQL] DS V2 supports push down misc functions
cloud-fan commented on PR #37169: URL: https://github.com/apache/spark/pull/37169#issuecomment-1206299101 thanks, merging to master!
[GitHub] [spark] cloud-fan commented on a diff in pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
cloud-fan commented on code in PR #37287: URL: https://github.com/apache/spark/pull/37287#discussion_r938682574 ## sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala: ## @@ -33,36 +33,37 @@ import org.apache.spark.storage.StorageLevel abstract class Catalog { /** - * Returns the current default database in this session. + * Returns the current database (namespace) in this session. Review Comment: `namespace` is more like the official name. database/schema is only for the hive catalog. We can change `database` to `database/schema` though.
[GitHub] [spark] cloud-fan commented on a diff in pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
cloud-fan commented on code in PR #37287: URL: https://github.com/apache/spark/pull/37287#discussion_r938681100 ## R/pkg/tests/fulltests/test_sparkSQL.R: ## @@ -4098,14 +4098,13 @@ test_that("catalog APIs, listTables, getTable, listColumns, listFunctions, funct c("name", "description", "dataType", "nullable", "isPartition", "isBucket")) expect_equal(collect(c)[[1]][[1]], "speed") expect_error(listColumns("zxwtyswklpf", "default"), - paste("Error in listColumns : analysis error - Table", - "'zxwtyswklpf' does not exist in database 'default'")) + paste("Table or view not found: spark_catalog.default.zxwtyswklpf")) Review Comment: I don't think we treat error message change as behavior change. We change error messages from time to time.
[GitHub] [spark] wang-zhun commented on pull request #37406: [SPARK-39921][SQL] SkewJoin--Stream side skew in BroadcastHashJoin
wang-zhun commented on PR #37406: URL: https://github.com/apache/spark/pull/37406#issuecomment-1206294767 > you do not use AQE ? Turning off AQE will be a `SortMergeJoin`, we need to turn on AQE and solve the data skew
[GitHub] [spark] cloud-fan closed pull request #37398: [SPARK-39867][SQL][3.1] Global limit should not inherit OrderPreservingUnaryNode
cloud-fan closed pull request #37398: [SPARK-39867][SQL][3.1] Global limit should not inherit OrderPreservingUnaryNode URL: https://github.com/apache/spark/pull/37398
[GitHub] [spark] cloud-fan commented on pull request #37398: [SPARK-39867][SQL][3.1] Global limit should not inherit OrderPreservingUnaryNode
cloud-fan commented on PR #37398: URL: https://github.com/apache/spark/pull/37398#issuecomment-1206293236 The Python style check failure is unrelated. Thanks, merging to 3.1!
[GitHub] [spark] cloud-fan closed pull request #37397: [SPARK-39867][SQL][3.2] Global limit should not inherit OrderPreservingUnaryNode
cloud-fan closed pull request #37397: [SPARK-39867][SQL][3.2] Global limit should not inherit OrderPreservingUnaryNode URL: https://github.com/apache/spark/pull/37397
[GitHub] [spark] cloud-fan commented on pull request #37397: [SPARK-39867][SQL][3.2] Global limit should not inherit OrderPreservingUnaryNode
cloud-fan commented on PR #37397: URL: https://github.com/apache/spark/pull/37397#issuecomment-1206290710 The failed Python style check is not related to this PR. Merging to 3.2, thanks!
[GitHub] [spark] Yikun commented on pull request #37403: [SPARK-39980][INFRA] Change infra image to static tag: `ubuntu:focal-20220801`
Yikun commented on PR #37403: URL: https://github.com/apache/spark/pull/37403#issuecomment-1206288324 @HyukjinKwon Ready to go.
[GitHub] [spark] Resol1992 commented on pull request #37404: [SPARK-39866][SQL] Memory leak when closing a session of Spark ThriftServer
Resol1992 commented on PR #37404: URL: https://github.com/apache/spark/pull/37404#issuecomment-1206257121 > Different sessions can share this cache. Maybe you can set `spark.sql.metadataCacheTTLSeconds` to a positive value to workaround this issue? @wangyum Thanks for your advice. I have tried to work around this by setting `spark.sql.metadataCacheTTLSeconds = 10`, but it does not work: the fileStatus objects still exist after the sparkSession is closed. In fact,
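For reference, the TTL workaround discussed above is typically applied as a SQL conf setting. This is an illustrative sketch only; as the reporter notes, it did not stop the leak in this case.

```scala
// Illustrative only: the TTL workaround discussed above, applied to an
// existing SparkSession. Per the report, the cached FileStatus objects
// still survived after the session was closed.
spark.conf.set("spark.sql.metadataCacheTTLSeconds", "10")

// Equivalent at submit time:
//   spark-submit --conf spark.sql.metadataCacheTTLSeconds=10 ...
```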
[GitHub] [spark] HyukjinKwon commented on pull request #37414: [SPARK-39981][SQL] Throw the exception QueryExecutionErrors.castingCauseOverflowErrorInTableInsert in Cast
HyukjinKwon commented on PR #37414: URL: https://github.com/apache/spark/pull/37414#issuecomment-1206247682 Test conflicts. I just removed the test part in branch-3.3.
[GitHub] [spark] HyukjinKwon closed pull request #37414: [SPARK-39981][SQL] Throw the exception QueryExecutionErrors.castingCauseOverflowErrorInTableInsert in Cast
HyukjinKwon closed pull request #37414: [SPARK-39981][SQL] Throw the exception QueryExecutionErrors.castingCauseOverflowErrorInTableInsert in Cast URL: https://github.com/apache/spark/pull/37414
[GitHub] [spark] HyukjinKwon commented on pull request #37414: [SPARK-39981][SQL] Throw the exception QueryExecutionErrors.castingCauseOverflowErrorInTableInsert in Cast
HyukjinKwon commented on PR #37414: URL: https://github.com/apache/spark/pull/37414#issuecomment-1206246390 Merged to master and branch-3.3.
[GitHub] [spark] mridulm commented on a diff in pull request #37411: [SPARK-39984][CORE] Check workerLastHeartbeat with master before HeartbeatReceiver expires an executor
mridulm commented on code in PR #37411: URL: https://github.com/apache/spark/pull/37411#discussion_r938566024 ## core/src/main/scala/org/apache/spark/internal/config/package.scala: ## @@ -2398,4 +2398,11 @@ package object config { .version("3.3.0") .intConf .createWithDefault(5) + + private[spark] val HEARTBEAT_RECEIVER_CHECK_WORKER_LAST_HEARTBEAT = +ConfigBuilder("spark.driver.heartbeat.checkWorkerLastHeartbeat") + .internal() + .version("3.4.0") + .booleanConf + .createWithDefault(true) Review Comment: Default this to `false` ## core/src/main/scala/org/apache/spark/internal/config/Network.scala: ## @@ -49,7 +49,13 @@ private[spark] object Network { ConfigBuilder("spark.network.timeoutInterval") .version("1.3.2") .timeConf(TimeUnit.MILLISECONDS) - .createWithDefaultString(STORAGE_BLOCKMANAGER_TIMEOUTINTERVAL.defaultValueString) + .createWithDefaultString("15s") + + private[spark] val NETWORK_EXECUTOR_TIMEOUT = +ConfigBuilder("spark.network.executorTimeout") + .version("1.3.0") Review Comment: `1.3.0` -> `3.4.0` ## core/src/main/scala/org/apache/spark/internal/config/Network.scala: ## @@ -49,7 +49,13 @@ private[spark] object Network { ConfigBuilder("spark.network.timeoutInterval") .version("1.3.2") .timeConf(TimeUnit.MILLISECONDS) - .createWithDefaultString(STORAGE_BLOCKMANAGER_TIMEOUTINTERVAL.defaultValueString) + .createWithDefaultString("15s") + + private[spark] val NETWORK_EXECUTOR_TIMEOUT = +ConfigBuilder("spark.network.executorTimeout") + .version("1.3.0") + .timeConf(TimeUnit.MILLISECONDS) + .createWithDefaultString("60s") Review Comment: fallback to `NETWORK_TIMEOUT` to preserve existing behavior. 
## core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala: ## @@ -199,41 +222,131 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock) removeExecutor(executorRemoved.executorId) } + private def killExecutor(executorId: String, timeout: Long): Unit = { +logWarning(s"Removing executor $executorId with no recent heartbeats: " + + s"${timeout} ms exceeds timeout $executorTimeoutMs ms") +killExecutorThread.submit(new Runnable { + override def run(): Unit = Utils.tryLogNonFatalError { +// Note: we want to get an executor back after expiring this one, +// so do not simply call `sc.killExecutor` here (SPARK-8119) +sc.killAndReplaceExecutor(executorId) +// SPARK-27348: in case of the executors which are not gracefully shut down, +// we should remove lost executors from CoarseGrainedSchedulerBackend manually +// here to guarantee two things: +// 1) explicitly remove executor information from CoarseGrainedSchedulerBackend for +//a lost executor instead of waiting for disconnect message +// 2) call scheduler.executorLost() underlying to fail any tasks assigned to +//those executors to avoid app hang +sc.schedulerBackend match { + case backend: CoarseGrainedSchedulerBackend => +backend.driverEndpoint.send(RemoveExecutor(executorId, + ExecutorProcessLost( +s"Executor heartbeat timed out after ${timeout} ms", +causedByApp = !sc.conf.get(HEARTBEAT_RECEIVER_CHECK_WORKER_LAST_HEARTBEAT + + // LocalSchedulerBackend is used locally and only has one single executor + case _: LocalSchedulerBackend => + + case other => throw new UnsupportedOperationException( +s"Unknown scheduler backend: ${other.getClass}") +} + } +}) + } + private def expireDeadHosts(): Unit = { + /** + * [SC-105641] + * Originally, the driver’s HeartbeatReceiver will expire an executor if it does not receive any + * heartbeat from the executor for 120 seconds. However, 120 seconds is too long, but we will face + * other challenges when we try to lower the timeout threshold. 
To elaborate, when an executor is + * performing full GC, it cannot send/reply any message. Next paragraphs describe the solution to + * detect network disconnection between driver and executor in a short time. + * + * An executor is running on a worker but in different JVMs, and a driver is running on a master + * but in different JVMs. Hence, the network connection between driver/executor and master/worker + * is the same. Because executor and worker are running on different JVMs, worker can still send + * heartbeat to master when executor performs GC. + * + * For new Heartbeat Receiver, if driver does not receive any heartbeat from the executor for + * `executorTimeoutMs` (default: 60s) seconds, HeartbeatReceiver will send a request to master to + * ask for the latest heartbeat from the worker which the executor runs on
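Applying the review suggestions above, the two config definitions might look like the following. This is a sketch only, using Spark's `ConfigBuilder` API as it appears in the quoted diff (the final PR may differ): the new flag defaults to `false`, the version is corrected to `3.4.0`, and the new executor timeout falls back to `spark.network.timeout` so existing deployments keep their behavior.

```scala
private[spark] val HEARTBEAT_RECEIVER_CHECK_WORKER_LAST_HEARTBEAT =
  ConfigBuilder("spark.driver.heartbeat.checkWorkerLastHeartbeat")
    .internal()
    .version("3.4.0")
    .booleanConf
    .createWithDefault(false)        // reviewer: default this to false

private[spark] val NETWORK_EXECUTOR_TIMEOUT =
  ConfigBuilder("spark.network.executorTimeout")
    .version("3.4.0")                // reviewer: 1.3.0 -> 3.4.0
    .fallbackConf(NETWORK_TIMEOUT)   // reviewer: preserve existing behavior
```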
[GitHub] [spark] wangyum opened a new pull request, #37421: [SPARK-39989][SQL] Support estimate column statistics if it is foldable expression
wangyum opened a new pull request, #37421: URL: https://github.com/apache/spark/pull/37421 ### What changes were proposed in this pull request? This PR adds support for estimating column statistics when the expression is foldable. For example: estimate the column statistics of `'a' AS a` in `SELECT 'a' AS a FROM tbl`. 1. If the foldable expression is null: ```scala ColumnStat(Some(0), None, None, Some(rowCount), Some(size), Some(size), None, 2) ``` 2. If the foldable expression is not null: ```scala ColumnStat(Some(1), Some(value), Some(value), Some(0), Some(size), Some(size), None, 2) ``` ### Why are the changes needed? Improve column statistics. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test.
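The two cases from the PR description can be sketched as a single function. This is a hypothetical illustration using a simplified stand-in for Spark's `ColumnStat` (the real class lives in `org.apache.spark.sql.catalyst`; field order follows the constructor calls in the PR body): a foldable expression has one known value, or is always null, so its stats follow directly from the table's row count.

```scala
// Simplified stand-in for Spark's ColumnStat; fields mirror the PR body:
// distinctCount, min, max, nullCount, avgLen, maxLen, histogram, version.
case class ColumnStat(
    distinctCount: Option[BigInt],
    min: Option[Any],
    max: Option[Any],
    nullCount: Option[BigInt],
    avgLen: Option[Long],
    maxLen: Option[Long],
    histogram: Option[Nothing],
    version: Int)

// Hypothetical sketch of the proposed estimation for a foldable expression.
def foldableColumnStat(value: Option[Any], rowCount: BigInt, size: Long): ColumnStat =
  value match {
    // NULL literal: zero distinct values, every row is null.
    case None =>
      ColumnStat(Some(0), None, None, Some(rowCount), Some(size), Some(size), None, 2)
    // Non-null literal: one distinct value, no nulls, min == max == value.
    case Some(v) =>
      ColumnStat(Some(1), Some(v), Some(v), Some(0), Some(size), Some(size), None, 2)
  }
```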
[GitHub] [spark] dongjoon-hyun commented on pull request #37418: [SPARK-39987][K8S] Support `PEAK_JVM_(ON|OFF)HEAP_MEMORY` executor rolling policy
dongjoon-hyun commented on PR #37418: URL: https://github.com/apache/spark/pull/37418#issuecomment-1206159404 Could you review this when you have some time, @viirya?
[GitHub] [spark] CHzxp commented on pull request #37406: [SPARK-39921][SQL] SkewJoin--Stream side skew in BroadcastHashJoin
CHzxp commented on PR #37406: URL: https://github.com/apache/spark/pull/37406#issuecomment-1206140696 You do not use AQE?
[GitHub] [spark] Zhangshunyu commented on pull request #37404: [SPARK-39866][SQL] Memory leak when closing a session of Spark ThriftServer
Zhangshunyu commented on PR #37404: URL: https://github.com/apache/spark/pull/37404#issuecomment-1206127023 This is a very good fix when using multiple connections to query the same table and then closing the connections; I don't think relying on TTL is enough.
[GitHub] [spark] mridulm commented on pull request #37413: [SPARK-39983][CORE] Do not cache unserialized broadcast relations on the driver
mridulm commented on PR #37413: URL: https://github.com/apache/spark/pull/37413#issuecomment-1206124066 Doesn't this break in local mode?
[GitHub] [spark] LuciferYang commented on a diff in pull request #36200: [SPARK-38909][CORE][YARN] Encapsulate `LevelDB` used by `ExternalShuffleBlockResolver`,`YarnShuffleService` and `RemoteBlockPus
LuciferYang commented on code in PR #36200: URL: https://github.com/apache/spark/pull/36200#discussion_r938512128 ## common/network-common/src/main/java/org/apache/spark/network/util/DBProvider.java: ## @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.network.util; + +import java.io.File; +import java.io.IOException; + +import com.fasterxml.jackson.databind.ObjectMapper; +import com.google.common.annotations.VisibleForTesting; + +import org.apache.spark.network.shuffledb.LevelDB; +import org.apache.spark.network.shuffledb.DB; +import org.apache.spark.network.shuffledb.StoreVersion; + +public class DBProvider { +public static DB initDB(File dbFile, StoreVersion version, ObjectMapper mapper) +throws IOException { +if (dbFile != null) { +if (dbFile.getName().endsWith(".ldb")) { +org.iq80.leveldb.DB levelDB = LevelDBProvider.initLevelDB(dbFile, version, mapper); +return levelDB != null ? new LevelDB(levelDB) : null; +} else { +return null; +} +} Review Comment: Let me think about it
[GitHub] [spark] mridulm commented on pull request #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages
mridulm commented on PR #37384: URL: https://github.com/apache/spark/pull/37384#issuecomment-1206118791 My query was slightly different: are you actually seeing this issue? Can you elaborate on what was observed? There are corner cases which we do not deal with currently in Spark, given the complexity of handling them and given how infrequent they are. A stage/job is not immediately failed when a single task fails - it is retried, and there are various exclude lists which prevent repeated scheduling on the same executor/node and quickly fail the stage/job.
[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled
sadikovi commented on code in PR #37419: URL: https://github.com/apache/spark/pull/37419#discussion_r938492838 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala: ## @@ -228,6 +228,13 @@ class ParquetFileFormat SQLConf.PARQUET_TIMESTAMP_NTZ_ENABLED.key, sparkSession.sessionState.conf.parquetTimestampNTZEnabled) +// See PARQUET-2170. +// Disable column index optimisation when required schema is empty so we get the correct +// row count from parquet-mr. +if (requiredSchema.isEmpty) { Review Comment: No, this is not required for DSv2. The test works in DSv2 due to another inconsistency: Parquet DSv2 does not consider the full file schema when creating pushdown filters. There is a check in FileScanBuilder to ignore partition columns, so in this case the schema is empty and no filters will be pushed down, returning the correct number of records. It is rather a performance inefficiency in DSv2, as the entire file will be scanned; however, the result will be correct. I thought about fixing it the same way DSv2 fixed the issue, but it is a much bigger change as it would affect not just this case but others as well. I hope my explanation makes sense. Let me know your thoughts.
[GitHub] [spark] gengliangwang commented on pull request #37414: [SPARK-39981][SQL] Throw the exception QueryExecutionErrors.castingCauseOverflowErrorInTableInsert in Cast
gengliangwang commented on PR #37414: URL: https://github.com/apache/spark/pull/37414#issuecomment-1206083866 BTW, it would be great if we could have a test case to cover the non-codegen code path.
[GitHub] [spark] gengliangwang commented on pull request #37414: [SPARK-39981][SQL] Throw the exception QueryExecutionErrors.castingCauseOverflowErrorInTableInsert in Cast
gengliangwang commented on PR #37414: URL: https://github.com/apache/spark/pull/37414#issuecomment-1206082240 Thanks for catching it, @HyukjinKwon!