[GitHub] [spark] LuciferYang commented on pull request #37331: [WIP][SPARK-39913][BUILD] Upgrade to Arrow 9.0.0

2022-08-01 Thread GitBox


LuciferYang commented on PR #37331:
URL: https://github.com/apache/spark/pull/37331#issuecomment-1202050954

   I reverted the ASF staging repository configuration since the previous test passed, but I still can't find arrow-*** 9.0 in the central repository.





[GitHub] [spark] sumeetgajjar commented on a diff in pull request #37325: [SPARK-39902][SQL] Add Scan details to spark plan scan node in SparkUI

2022-08-01 Thread GitBox


sumeetgajjar commented on code in PR #37325:
URL: https://github.com/apache/spark/pull/37325#discussion_r935139884


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala:
##
@@ -132,4 +132,10 @@ case class BatchScanExec(
 val result = s"$nodeName$truncatedOutputString ${scan.description()} 
$runtimeFiltersString"
 redact(result)
   }
+
+  /**
+   * Returns the name of this type of TreeNode.  Defaults to the class name.
+   * Note that we remove the "Exec" suffix for physical operators here.
+   */
+  override def nodeName: String = s"BatchScan ${scan.name()}"

Review Comment:
   Sure - let me check






[GitHub] [spark] cloud-fan commented on a diff in pull request #37355: [SPARK-39930][SQL] Introduce Cache Hints

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37355:
URL: https://github.com/apache/spark/pull/37355#discussion_r935138416


##
sql/core/src/test/scala/org/apache/spark/sql/CacheHintsSuite.scala:
##
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.execution.CachedData
+import org.apache.spark.sql.execution.analysis.ResolveCacheHints.NO_RESULT_CACHE_TAG
+import org.apache.spark.sql.test.SharedSparkSession
+
+class CacheHintsSuite extends PlanTest with SharedSparkSession {
+  private def lookupCachedData(df: DataFrame): Option[CachedData] = {
+    spark.sharedState.cacheManager.lookupCachedData(df)
+  }
+
+  test("RESULT_CACHE Hint") {
+    val e = intercept[AnalysisException](
+      sql("SELECT /*+ RESULT_CACHE(abc) */ id from range(0, 1)").collect())

Review Comment:
   I don't think the `CACHE TABLE` command fails if the plan is already cached. So basically the hint is to combine two SQL statements?
   ```
   CACHE TABLE abc AS SELECT id from range(0,1)
   SELECT * FROM abc
   ```






[GitHub] [spark] ulysses-you commented on pull request #37373: [SPARK-39911][SQL][3.3] Optimize global Sort to RepartitionByExpression

2022-08-01 Thread GitBox


ulysses-you commented on PR #37373:
URL: https://github.com/apache/spark/pull/37373#issuecomment-1202043485

   @dongjoon-hyun yes, the story is that we fixed a bug in https://github.com/apache/spark/pull/37250 and that PR was backported into branch-3.3. However, that fix may introduce a performance regression. This PR itself only improves performance, but in order to avoid the regression we backport it as well. See the details at https://github.com/apache/spark/pull/37330#issuecomment-1201979396





[GitHub] [spark] LuciferYang commented on pull request #37372: [SPARK-39944][BUILD] Upgrade dropwizard metrics to 4.2.10

2022-08-01 Thread GitBox


LuciferYang commented on PR #37372:
URL: https://github.com/apache/spark/pull/37372#issuecomment-1202041183

   @dongjoon-hyun There should be no new page to update. [https://metrics.dropwizard.io/4.2.0/](https://metrics.dropwizard.io/4.2.0/) seems to be the latest; no page like `https://metrics.dropwizard.io/4.2.10/` exists. If I'm mistaken, please correct me.
   
   In addition, do we need to change `http` to `https`? For example:
   
   ```
   docs/monitoring.md:Executor memory metrics are also exposed via the Spark metrics system based on the [Dropwizard metrics library](http://metrics.dropwizard.io/4.2.0).
   ```
   
   





[GitHub] [spark] yaooqinn commented on a diff in pull request #37355: [SPARK-39930][SQL] Introduce Cache Hints

2022-08-01 Thread GitBox


yaooqinn commented on code in PR #37355:
URL: https://github.com/apache/spark/pull/37355#discussion_r935131842


##
sql/core/src/test/scala/org/apache/spark/sql/CacheHintsSuite.scala:
##
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.execution.CachedData
+import org.apache.spark.sql.execution.analysis.ResolveCacheHints.NO_RESULT_CACHE_TAG
+import org.apache.spark.sql.test.SharedSparkSession
+
+class CacheHintsSuite extends PlanTest with SharedSparkSession {
+  private def lookupCachedData(df: DataFrame): Option[CachedData] = {
+    spark.sharedState.cacheManager.lookupCachedData(df)
+  }
+
+  test("RESULT_CACHE Hint") {
+    val e = intercept[AnalysisException](
+      sql("SELECT /*+ RESULT_CACHE(abc) */ id from range(0, 1)").collect())

Review Comment:
   They are very similar, with a few differences:
   
   1. the result set is not the same
   2. it skips the caching step when the plan is already cached, so it will not fail
   
   
   Parameters are disabled for now because:
   
   1. I don't know if it is good to parse table identifiers in post-resolution
   2. Oracle does not support parameters either; maybe that is simpler and easier for users?
   3. it minimizes the patch size
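   For readers following the thread, here is a minimal sketch of the two forms being compared. The `RESULT_CACHE` hint syntax is only what this PR proposes and is not part of released Spark; the Scala below assumes a local SparkSession:
   
   ```
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder().master("local[*]").getOrCreate()
   
   // Proposed in this PR: cache the result of a single statement via a hint.
   spark.sql("SELECT /*+ RESULT_CACHE(abc) */ id FROM range(0, 1)").collect()
   
   // Existing equivalent: two statements, the second reading the cached table.
   spark.sql("CACHE TABLE abc AS SELECT id FROM range(0, 1)")
   spark.sql("SELECT * FROM abc").collect()
   ```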
   






[GitHub] [spark] ulysses-you commented on pull request #37284: [SPARK-39867][SQL] Global limit should not inherit OrderPreservingUnaryNode

2022-08-01 Thread GitBox


ulysses-you commented on PR #37284:
URL: https://github.com/apache/spark/pull/37284#issuecomment-1202032739

   >  do we have a physical rule to remove unnecessary sort if the global limit 
can produce sorted data?
   
   yes, we have `RemoveRedundantSorts`
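   For context, a hedged sketch of the kind of plan `RemoveRedundantSorts` targets, assuming a local SparkSession (`spark.sql.execution.removeRedundantSorts` is the flag for this physical rule, true by default):
   
   ```
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder().master("local[*]").getOrCreate()
   import spark.implicits._
   
   // The outer orderBy is redundant: filter preserves the ordering that the
   // inner orderBy on the same key already produced.
   val df = spark.range(100).orderBy($"id").filter($"id" > 10).orderBy($"id")
   
   // With the rule enabled (the default), only one Sort should remain
   // in the physical plan.
   df.explain()
   ```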





[GitHub] [spark] cloud-fan commented on pull request #37284: [SPARK-39867][SQL] Global limit should not inherit OrderPreservingUnaryNode

2022-08-01 Thread GitBox


cloud-fan commented on PR #37284:
URL: https://github.com/apache/spark/pull/37284#issuecomment-1202029531

   BTW, do we have a physical rule to remove unnecessary sort if the global 
limit can produce sorted data?





[GitHub] [spark] LuciferYang opened a new pull request, #37375: [SPARK-39947][BUILD] Upgrade Jersey to 2.36

2022-08-01 Thread GitBox


LuciferYang opened a new pull request, #37375:
URL: https://github.com/apache/spark/pull/37375

   ### What changes were proposed in this pull request?
   This PR upgrades Jersey from 2.35 to 2.36.
   
   
   ### Why are the changes needed?
   This version adopts Jackson 2.13.3, which is also the version Spark currently uses:
   
   - [Adopt Jackson 2.13](https://github.com/eclipse-ee4j/jersey/pull/4928)
   - [Update Jackson to 
2.13.3](https://github.com/eclipse-ee4j/jersey/pull/5076)
   
   The release notes are as follows:
   
   - https://github.com/eclipse-ee4j/jersey/releases/tag/2.36
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Pass GitHub Actions





[GitHub] [spark] deshanxiao closed pull request #37336: [SPARK-39916][SQL][MLLIB][REFACTOR] Merge ml SchemaUtils to SQL

2022-08-01 Thread GitBox


deshanxiao closed pull request #37336: [SPARK-39916][SQL][MLLIB][REFACTOR] 
Merge ml SchemaUtils to SQL
URL: https://github.com/apache/spark/pull/37336





[GitHub] [spark] cloud-fan commented on a diff in pull request #37325: [SPARK-39902][SQL] Add Scan details to spark plan scan node in SparkUI

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37325:
URL: https://github.com/apache/spark/pull/37325#discussion_r935119977


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala:
##
@@ -132,4 +132,10 @@ case class BatchScanExec(
 val result = s"$nodeName$truncatedOutputString ${scan.description()} 
$runtimeFiltersString"
 redact(result)
   }
+
+  /**
+   * Returns the name of this type of TreeNode.  Defaults to the class name.
+   * Note that we remove the "Exec" suffix for physical operators here.
+   */
+  override def nodeName: String = s"BatchScan ${scan.name()}"

Review Comment:
   Can we refactor the code a bit and pass `Table#name` when constructing 
`BatchScanExec`? I think it's better than adding a new public API.






[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r935119522


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala:
##
@@ -864,6 +851,255 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHelper
     checkAnswer(df2, Seq(Row(2, "david", 10000.00)))
   }
 
+  test("scan with aggregate push-down and top N push-down") {
+val df1 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT").sum("SALARY")
+  .orderBy("DEPT")
+  .limit(1)
+checkSortRemoved(df1)
+checkLimitRemoved(df1)
+checkPushedInfo(df1,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df1, Seq(Row(1, 19000.00)))
+
+val df2 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT".as("my_dept"), $"SALARY")
+  .groupBy("my_dept").sum("SALARY")
+  .orderBy("my_dept")
+  .limit(1)
+checkSortRemoved(df2)
+checkLimitRemoved(df2)
+checkPushedInfo(df2,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df2, Seq(Row(1, 19000.00)))
+
+val df3 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key").sum("SALARY")
+  .orderBy("key")
+  .limit(1)
+checkSortRemoved(df3)
+checkLimitRemoved(df3)
+checkPushedInfo(df3,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: " +
+"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY 
ELSE 0.00 END]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [" +
+"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 
0.00 END " +
+"ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df3, Seq(Row(0, 44000.00)))
+
+withSQLConf(SQLConf.ANSI_ENABLED.key -> "true") {
+  val df4 = spark.read
+.table("h2.test.employee")
+.groupBy("dept").sum("SALARY")
+.orderBy($"dept" + 100)

Review Comment:
   BTW, can we make the grouping key an expression? like this 
https://github.com/apache/spark/pull/37320/files#diff-1496378d9e7817c45c962f1af48e5e765cb475bd01d58edec118d98225e02ef3R887






[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r935119295


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala:
##
@@ -864,6 +851,255 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHelper
     checkAnswer(df2, Seq(Row(2, "david", 10000.00)))
   }
 
+  test("scan with aggregate push-down and top N push-down") {
+val df1 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT").sum("SALARY")
+  .orderBy("DEPT")
+  .limit(1)
+checkSortRemoved(df1)
+checkLimitRemoved(df1)
+checkPushedInfo(df1,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df1, Seq(Row(1, 19000.00)))
+
+val df2 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT".as("my_dept"), $"SALARY")
+  .groupBy("my_dept").sum("SALARY")
+  .orderBy("my_dept")
+  .limit(1)
+checkSortRemoved(df2)
+checkLimitRemoved(df2)
+checkPushedInfo(df2,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df2, Seq(Row(1, 19000.00)))
+
+val df3 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key").sum("SALARY")
+  .orderBy("key")
+  .limit(1)
+checkSortRemoved(df3)
+checkLimitRemoved(df3)
+checkPushedInfo(df3,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: " +
+"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY 
ELSE 0.00 END]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [" +
+"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 
0.00 END " +
+"ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df3, Seq(Row(0, 44000.00)))
+
+withSQLConf(SQLConf.ANSI_ENABLED.key -> "true") {
+  val df4 = spark.read
+.table("h2.test.employee")
+.groupBy("dept").sum("SALARY")
+.orderBy($"dept" + 100)

Review Comment:
   can we use some functions that don't need ansi mode to simplify the test? 
e.g. `log(10, dept)`
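   A sketch of that suggestion, assuming the `h2.test.employee` catalog table and the SparkSession implicits from this suite are in scope (`log` is `org.apache.spark.sql.functions.log`):
   
   ```
   import org.apache.spark.sql.functions.log
   
   // Order by a function of the grouping column that needs no ANSI mode.
   val df = spark.read
     .table("h2.test.employee")
     .groupBy("dept").sum("SALARY")
     .orderBy(log(10.0, $"dept"))
     .limit(1)
   ```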






[GitHub] [spark] gengliangwang commented on pull request #37374: [SPARK-39917][SQL] Use different error classes for numeric/interval arithmetic overflow

2022-08-01 Thread GitBox


gengliangwang commented on PR #37374:
URL: https://github.com/apache/spark/pull/37374#issuecomment-1202019831

   This PR cherry-picks https://github.com/apache/spark/pull/37337 again; it was reverted since it failed to compile after https://github.com/apache/spark/pull/37343 was merged.





[GitHub] [spark] gengliangwang opened a new pull request, #37374: [SPARK-39917][SQL] Use different error classes for numeric/interval arithmetic overflow

2022-08-01 Thread GitBox


gengliangwang opened a new pull request, #37374:
URL: https://github.com/apache/spark/pull/37374

   
   
   ### What changes were proposed in this pull request?
   
   Similar to https://github.com/apache/spark/pull/37313: currently, when arithmetic overflow errors happen under ANSI mode, the error messages look like
   ```
   [ARITHMETIC_OVERFLOW] long overflow. Use 'try_multiply' to tolerate overflow and return NULL instead. If necessary set spark.sql.ansi.enabled to "false" (except for ANSI interval type) to bypass this error.
   ```
   
   
   The "(except for ANSI interval type)" part is confusing. We should remove it for the numeric arithmetic operations and have a new error class for the interval division error: `INTERVAL_ARITHMETIC_OVERFLOW`
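   An illustrative sketch of the intended split under ANSI mode; the SQL literals below are assumed examples near the overflow boundaries, not taken from the PR:
   
   ```
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder().master("local[*]").getOrCreate()
   spark.conf.set("spark.sql.ansi.enabled", "true")
   
   // Numeric overflow: reported under ARITHMETIC_OVERFLOW.
   spark.sql("SELECT 9223372036854775807L * 2L").collect()
   
   // Interval overflow: reported under the new INTERVAL_ARITHMETIC_OVERFLOW.
   spark.sql("SELECT INTERVAL '178956970' YEAR * 2").collect()
   ```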
   
   
   
   ### Why are the changes needed?
   
   For better error messages
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it uses different error classes for numeric/interval arithmetic overflows. After the changes, the error messages are simpler and clearer.
   
   ### How was this patch tested?
   
   UT





[GitHub] [spark] ulysses-you commented on pull request #37284: [SPARK-39867][SQL] Global limit should not inherit OrderPreservingUnaryNode

2022-08-01 Thread GitBox


ulysses-you commented on PR #37284:
URL: https://github.com/apache/spark/pull/37284#issuecomment-1202015625

   @cloud-fan @viirya , addressed comments





[GitHub] [spark] ulysses-you commented on pull request #37276: [SPARK-39835][SQL][3.1] Fix EliminateSorts remove global sort below the local sort

2022-08-01 Thread GitBox


ulysses-you commented on PR #37276:
URL: https://github.com/apache/spark/pull/37276#issuecomment-1202013868

   @cloud-fan done





[GitHub] [spark] ulysses-you commented on pull request #37373: [SPARK-39911][SQL][3.3] Optimize global Sort to RepartitionByExpression

2022-08-01 Thread GitBox


ulysses-you commented on PR #37373:
URL: https://github.com/apache/spark/pull/37373#issuecomment-1202011636

   cc @cloud-fan 





[GitHub] [spark] ulysses-you opened a new pull request, #37373: [SPARK-39911][SQL][3.3] Optimize global Sort to RepartitionByExpression

2022-08-01 Thread GitBox


ulysses-you opened a new pull request, #37373:
URL: https://github.com/apache/spark/pull/37373

   This backports https://github.com/apache/spark/pull/37330 into branch-3.3.
   ### What changes were proposed in this pull request?
   
   Optimize a global Sort into a RepartitionByExpression, for example:
   ```
   Sort local                  Sort local
     Sort global       =>       RepartitionByExpression
   ```
   
   ### Why are the changes needed?
   
   If a global sort sits below a local sort, the only meaningful thing it contributes is its distribution. So this PR optimizes that global sort into a RepartitionByExpression to save a local sort.
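   A hedged DataFrame-level sketch of the pattern this rewrite targets, assuming a local SparkSession:
   
   ```
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder().master("local[*]").getOrCreate()
   import spark.implicits._
   
   val df = Seq((3, "c"), (1, "a"), (2, "b")).toDF("id", "v")
   
   // Before: the global sort's per-partition ordering is immediately
   // overwritten by the local sort on v, so only its range distribution matters.
   val before = df.orderBy($"id").sortWithinPartitions($"v")
   
   // After the rewrite, the plan is equivalent to:
   val after = df.repartitionByRange($"id").sortWithinPartitions($"v")
   ```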
   
   ### Does this PR introduce _any_ user-facing change?
   
   no, only improve performance
   
   ### How was this patch tested?
   
   add test
   
   Closes #37330 from ulysses-you/optimize-sort.
   
   Authored-by: ulysses-you 
   Signed-off-by: Wenchen Fan 
   
   
   





[GitHub] [spark] zhengruifeng commented on pull request #37369: [SPARK-39942][PYTHON][PS] Need to verify the input nums is integer in nsmallest func

2022-08-01 Thread GitBox


zhengruifeng commented on PR #37369:
URL: https://github.com/apache/spark/pull/37369#issuecomment-1202009196

   I think this is a more general problem:
   
   ```
   def nsmallest(self, n: int = 5)
   ```
   this method already hints the input type, but given a boolean it will fail with an `AnalysisException` (which should be a `TypeError`). 
   I guess it could also accept a floating-point number like 5.5 and run successfully.
   
   I did a quick investigation of [`typing`](https://docs.python.org/3/library/typing.html#module-typing); there is an annotation `@typing.runtime_checkable`, but it only works with `class typing.Protocol`.
   
   Is there some way to enforce input type checking?   @zero323





[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-08-01 Thread GitBox


beliefer commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r935094921


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala:
##
@@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHelper
     checkAnswer(df2, Seq(Row(2, "david", 10000.00)))
   }
 
+  test("scan with aggregate push-down and top N push-down") {
+val df1 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT").sum("SALARY")
+  .orderBy("DEPT")
+  .limit(1)
+checkSortRemoved(df1)
+checkLimitRemoved(df1)
+checkPushedInfo(df1,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df1, Seq(Row(1, 19000.00)))
+
+val df2 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT".as("my_dept"), $"SALARY")
+  .groupBy("my_dept").sum("SALARY")
+  .orderBy("my_dept")
+  .limit(1)
+checkSortRemoved(df2)
+checkLimitRemoved(df2)
+checkPushedInfo(df2,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df2, Seq(Row(1, 19000.00)))
+
+val df3 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key").sum("SALARY")
+  .orderBy("key")

Review Comment:
   OK






[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-08-01 Thread GitBox


beliefer commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r935093496


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala:
##
@@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHelper
     checkAnswer(df2, Seq(Row(2, "david", 10000.00)))
   }
 
+  test("scan with aggregate push-down and top N push-down") {
+val df1 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT").sum("SALARY")
+  .orderBy("DEPT")
+  .limit(1)
+checkSortRemoved(df1)
+checkLimitRemoved(df1)
+checkPushedInfo(df1,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df1, Seq(Row(1, 19000.00)))
+
+val df2 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT".as("my_dept"), $"SALARY")
+  .groupBy("my_dept").sum("SALARY")
+  .orderBy("my_dept")
+  .limit(1)
+checkSortRemoved(df2)
+checkLimitRemoved(df2)
+checkPushedInfo(df2,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df2, Seq(Row(1, 19000.00)))
+
+val df3 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key").sum("SALARY")
+  .orderBy("key")
+  .limit(1)
+checkSortRemoved(df3)
+checkLimitRemoved(df3)
+checkPushedInfo(df3,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: " +
+"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY 
ELSE 0.00 END]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [" +
+"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 
0.00 END " +
+"ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df3, Seq(Row(0, 44000.00)))
+
+val df4 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT", "IS_MANAGER").sum("SALARY")
+  .orderBy("DEPT", "IS_MANAGER")
+  .limit(1)
+checkSortRemoved(df4)
+checkLimitRemoved(df4)
+checkPushedInfo(df4,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT, IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] 
LIMIT 1")
+checkAnswer(df4, Seq(Row(1, false, 9000.00)))
+
+val df5 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY")
+  .groupBy("my_dept", "my_manager").sum("SALARY")
+  .orderBy("my_dept", "my_manager")
+  .limit(1)
+checkSortRemoved(df5)
+checkLimitRemoved(df5)
+checkPushedInfo(df5,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT, IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] 
LIMIT 1")
+checkAnswer(df5, Seq(Row(1, false, 9000.00)))
+
+val df6 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY", $"IS_MANAGER",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key", "IS_MANAGER").sum("SALARY")
+  .orderBy("key", "IS_MANAGER")
+  .limit(1)
+checkSortRemoved(df6)
+checkLimitRemoved(df6)
+checkPushedInfo(df6,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: " +
+"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY 
ELSE 0.00 END, " +
+"IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [" +
+"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 
0.00 END " +
+"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df6, Seq(Row(0.00, false, 12000.00)))
+
+val df7 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT", $"SALARY")
+  .groupBy("dept").agg(sum("SALARY"))
+  .orderBy(sum("SALARY"))

Review Comment:
   I added the comments into `V2ScanRelationPushDown` directly.




[GitHub] [spark] sumeetgajjar commented on a diff in pull request #37325: [SPARK-39902][SQL] Add Scan details to spark plan scan node in SparkUI

2022-08-01 Thread GitBox


sumeetgajjar commented on code in PR #37325:
URL: https://github.com/apache/spark/pull/37325#discussion_r935094151


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala:
##
@@ -132,4 +132,10 @@ case class BatchScanExec(
 val result = s"$nodeName$truncatedOutputString ${scan.description()} 
$runtimeFiltersString"
 redact(result)
   }
+
+  /**
+   * Returns the name of this type of TreeNode.  Defaults to the class name.
+   * Note that we remove the "Exec" suffix for physical operators here.
+   */
+  override def nodeName: String = s"BatchScan ${scan.name()}"

Review Comment:
   >  Can we simply use v2 `Table.name`?
   
   Hi @cloud-fan - thank you for your comment. Can you please tell me how we can use v2 `Table#name()` here?
   `Scan` and `BatchScanExec` do not have a reference to `Table`, which is why I decided to use `Scan#name()`. 
   
   I would be happy to publish a patch that uses v2 `Table#name()` if you can tell me how it can be accessed.
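   For reference, a self-contained analogue of the suggested refactor; all names below are illustrative, not the actual Spark change. In Spark, the strategy that plans a `DataSourceV2ScanRelation` still has the relation's `Table` in scope, so the name could be captured at construction time:
   
   ```
   // Stand-in for org.apache.spark.sql.connector.read.Scan.
   trait Scan {
     def description(): String
   }
   
   // The planner would pass table.name() in when building the exec node,
   // instead of adding a new public API on Scan.
   case class BatchScanLikeNode(scan: Scan, tableName: String) {
     def nodeName: String = s"BatchScan $tableName"
   }
   ```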
   






[GitHub] [spark] HyukjinKwon commented on pull request #37329: [SPARK-39832][PYTHON] Support column arguments in regexp_replace

2022-08-01 Thread GitBox


HyukjinKwon commented on PR #37329:
URL: https://github.com/apache/spark/pull/37329#issuecomment-1201990422

   @physinet mind enabling https://github.com/physinet/spark/actions/workflows/build_main.yml and rebasing, please? Apache Spark leverages the GitHub resources from the PR author's fork.





[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-08-01 Thread GitBox


beliefer commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r935093496


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala:
##
@@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHelper
     checkAnswer(df2, Seq(Row(2, "david", 10000.00)))
   }
 
+  test("scan with aggregate push-down and top N push-down") {
+val df1 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT").sum("SALARY")
+  .orderBy("DEPT")
+  .limit(1)
+checkSortRemoved(df1)
+checkLimitRemoved(df1)
+checkPushedInfo(df1,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df1, Seq(Row(1, 19000.00)))
+
+val df2 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT".as("my_dept"), $"SALARY")
+  .groupBy("my_dept").sum("SALARY")
+  .orderBy("my_dept")
+  .limit(1)
+checkSortRemoved(df2)
+checkLimitRemoved(df2)
+checkPushedInfo(df2,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df2, Seq(Row(1, 19000.00)))
+
+val df3 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key").sum("SALARY")
+  .orderBy("key")
+  .limit(1)
+checkSortRemoved(df3)
+checkLimitRemoved(df3)
+checkPushedInfo(df3,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: " +
+"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY 
ELSE 0.00 END]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [" +
+"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 
0.00 END " +
+"ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df3, Seq(Row(0, 44000.00)))
+
+val df4 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT", "IS_MANAGER").sum("SALARY")
+  .orderBy("DEPT", "IS_MANAGER")
+  .limit(1)
+checkSortRemoved(df4)
+checkLimitRemoved(df4)
+checkPushedInfo(df4,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT, IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] 
LIMIT 1")
+checkAnswer(df4, Seq(Row(1, false, 9000.00)))
+
+val df5 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY")
+  .groupBy("my_dept", "my_manager").sum("SALARY")
+  .orderBy("my_dept", "my_manager")
+  .limit(1)
+checkSortRemoved(df5)
+checkLimitRemoved(df5)
+checkPushedInfo(df5,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT, IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] 
LIMIT 1")
+checkAnswer(df5, Seq(Row(1, false, 9000.00)))
+
+val df6 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY", $"IS_MANAGER",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key", "IS_MANAGER").sum("SALARY")
+  .orderBy("key", "IS_MANAGER")
+  .limit(1)
+checkSortRemoved(df6)
+checkLimitRemoved(df6)
+checkPushedInfo(df6,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: " +
+"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY 
ELSE 0.00 END, " +
+"IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [" +
+"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 
0.00 END " +
+"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df6, Seq(Row(0.00, false, 12000.00)))
+
+val df7 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT", $"SALARY")
+  .groupBy("dept").agg(sum("SALARY"))
+  .orderBy(sum("SALARY"))

Review Comment:
   OK






[GitHub] [spark] AmplabJenkins commented on pull request #37341: [SPARK-38639][HIVE]Ignore the corrupted rows that failed to deserialize in hive sequence table

2022-08-01 Thread GitBox


AmplabJenkins commented on PR #37341:
URL: https://github.com/apache/spark/pull/37341#issuecomment-1201989368

   Can one of the admins verify this patch?





[GitHub] [spark] beliefer commented on pull request #37295: [SPARK-39873][SQL] Remove `OptimizeLimitZero` and merge it into `EliminateLimits`

2022-08-01 Thread GitBox


beliefer commented on PR #37295:
URL: https://github.com/apache/spark/pull/37295#issuecomment-1201989202

   @cloud-fan @gengliangwang @dongjoon-hyun Thank you !





[GitHub] [spark] wForget commented on pull request #37346: [SPARK-37210][CORE][SQL] Allow forced use of staging directory

2022-08-01 Thread GitBox


wForget commented on PR #37346:
URL: https://github.com/apache/spark/pull/37346#issuecomment-1201989180

   > Why it is an issue particular for `InsertIntoHadoopFsRelationCommand`?
   
   `InsertIntoHiveTable` always uses a Hive staging dir: https://github.com/apache/spark/blob/b0c831d3408dddfbbf3acacbe8100a9e08b400de/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L107
   `InsertIntoHadoopFsRelationCommand` only uses a Spark staging dir in dynamic overwrite mode; otherwise it uses `table_location/_temporary`, which leads to concurrency conflicts:
   https://github.com/apache/spark/blob/b0c831d3408dddfbbf3acacbe8100a9e08b400de/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L171
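   For readers, a simplified sketch of the difference described above, assuming a SparkSession `spark` with its implicits in scope and a hypothetical partitioned table `target_table` (the ".spark-staging-<jobId>" naming is how HadoopMapReduceCommitProtocol stages dynamic overwrites):
   
   ```
   // Dynamic overwrite: output is staged under a job-specific
   // ".spark-staging-<jobId>" directory, safe for concurrent writers.
   spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
   
   Seq((1, "a")).toDF("id", "p")
     .write.mode("overwrite")
     .insertInto("target_table")
   
   // Other code paths commit through <table_location>/_temporary, which
   // concurrent writers into the same table share; hence the conflicts.
   ```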





[GitHub] [spark] cloud-fan commented on a diff in pull request #37355: [SPARK-39930][SQL] Introduce Cache Hints

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37355:
URL: https://github.com/apache/spark/pull/37355#discussion_r935089294


##
sql/core/src/test/scala/org/apache/spark/sql/CacheHintsSuite.scala:
##
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.execution.CachedData
+import org.apache.spark.sql.execution.analysis.ResolveCacheHints.NO_RESULT_CACHE_TAG
+import org.apache.spark.sql.test.SharedSparkSession
+
+class CacheHintsSuite extends PlanTest with SharedSparkSession {
+  private def lookupCachedData(df: DataFrame): Option[CachedData] = {
+    spark.sharedState.cacheManager.lookupCachedData(df)
+  }
+
+  test("RESULT_CACHE Hint") {
+    val e = intercept[AnalysisException](
+      sql("SELECT /*+ RESULT_CACHE(abc) */ id from range(0, 1)").collect())

Review Comment:
   is it the same as `CACHE TABLE abc AS SELECT id from range(0,1)`?






[GitHub] [spark] cloud-fan commented on pull request #37341: [SPARK-38639][HIVE]Ignore the corrupted rows that failed to deserialize in hive sequence table

2022-08-01 Thread GitBox


cloud-fan commented on PR #37341:
URL: https://github.com/apache/spark/pull/37341#issuecomment-1201981954

   Does Hive have this feature?





[GitHub] [spark] HyukjinKwon commented on pull request #37355: [SPARK-39930][SQL] Introduce Cache Hints

2022-08-01 Thread GitBox


HyukjinKwon commented on PR #37355:
URL: https://github.com/apache/spark/pull/37355#issuecomment-1201980467

   I like the idea.





[GitHub] [spark] cloud-fan commented on pull request #37330: [SPARK-39911][SQL] Optimize global Sort to RepartitionByExpression

2022-08-01 Thread GitBox


cloud-fan commented on PR #37330:
URL: https://github.com/apache/spark/pull/37330#issuecomment-1201979396

   @ulysses-you can you open a backport PR for 3.3? I think this is a necessary follow-up of https://github.com/apache/spark/pull/37250 to avoid perf regression.





[GitHub] [spark] cloud-fan commented on pull request #37276: [SPARK-39835][SQL][3.1] Fix EliminateSorts remove global sort below the local sort

2022-08-01 Thread GitBox


cloud-fan commented on PR #37276:
URL: https://github.com/apache/spark/pull/37276#issuecomment-1201978590

   to avoid any perf regression, can we include the global sort -> repartition 
optimization?





[GitHub] [spark] cloud-fan commented on pull request #37358: [SPARK-39932][SQL] WindowExec should clear the final partition buffer

2022-08-01 Thread GitBox


cloud-fan commented on PR #37358:
URL: https://github.com/apache/spark/pull/37358#issuecomment-1201978144

   also cc @maryannxue @hvanhovell 





[GitHub] [spark] viirya commented on a diff in pull request #37368: [SPARK-39940][SS] Refresh catalog table on streaming query with DSv1 sink

2022-08-01 Thread GitBox


viirya commented on code in PR #37368:
URL: https://github.com/apache/spark/pull/37368#discussion_r935082833


##
sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataStreamTableAPISuite.scala:
##
@@ -445,6 +445,44 @@ class DataStreamTableAPISuite extends StreamTest with BeforeAndAfter {
 }
   }
 
+  test("SPARK-39940: refresh table when streaming query writes to the catalog 
table via DSv1") {
+withTable("tbl1", "tbl2") {
+  withTempDir { dir =>
+val baseTbls = new File(dir, "tables")
+val tbl1File = new File(baseTbls, "tbl1")
+val tbl2File = new File(baseTbls, "tbl2")
+val checkpointLocation = new File(dir, "checkpoint")
+
+val format = "parquet"
+Seq((1, 2)).toDF("i", "d")
  .write.format(format).option("path", tbl1File.getCanonicalPath).saveAsTable("tbl1")
+
+val query = spark.readStream.format(format).table("tbl1")
+  .writeStream.format(format)
+  .option("checkpointLocation", checkpointLocation.getCanonicalPath)
+  .option("path", tbl2File.getCanonicalPath)
+  .toTable("tbl2")
+
+try {
+  query.processAllAvailable()
+  checkAnswer(spark.table("tbl2").sort($"i"), Seq(Row(1, 2)))
+
+  Seq((3, 4)).toDF("i", "d")
+.write.format(format).option("path", tbl1File.getCanonicalPath)
+.mode(SaveMode.Append).saveAsTable("tbl1")
+
+  query.processAllAvailable()
+  checkAnswer(spark.table("tbl2").sort($"i"), Seq(Row(1, 2), Row(3, 4)))
+
+  assert(query.exception.isEmpty, s"No exception should happen in streaming query: " +

Review Comment:
   ```suggestion
  assert(query.exception.isEmpty, "No exception should happen in streaming query: " +
   ```






[GitHub] [spark] HyukjinKwon commented on pull request #37365: [SPARK-39938][PYTHON][PS] Accept all inputs of prefix/suffix which implement __str__ in add_predix/add_suffix

2022-08-01 Thread GitBox


HyukjinKwon commented on PR #37365:
URL: https://github.com/apache/spark/pull/37365#issuecomment-1201973069

   cc @xinrong-meng @itholic @zhengruifeng FYI





[GitHub] [spark] HyukjinKwon commented on pull request #37366: [SPARK-39939][PYTHON][PS] return self.copy during calling shift with period == 0

2022-08-01 Thread GitBox


HyukjinKwon commented on PR #37366:
URL: https://github.com/apache/spark/pull/37366#issuecomment-1201972652

   cc @itholic @xinrong-meng @zhengruifeng FYI





[GitHub] [spark] HyukjinKwon commented on pull request #37367: [SPARK-39941][PYTHON][PS] window and min_periods should be integer in rolling func

2022-08-01 Thread GitBox


HyukjinKwon commented on PR #37367:
URL: https://github.com/apache/spark/pull/37367#issuecomment-1201972549

   cc @zhengruifeng @xinrong-meng @itholic FYI





[GitHub] [spark] HyukjinKwon commented on pull request #37369: [SPARK-39942][PYTHON][PS] Need to verify the input nums is integer in nsmallest func

2022-08-01 Thread GitBox


HyukjinKwon commented on PR #37369:
URL: https://github.com/apache/spark/pull/37369#issuecomment-1201971564

   cc @itholic @xinrong-meng @zhengruifeng FYI





[GitHub] [spark] HyukjinKwon closed pull request #37356: [SPARK-39877][PYTHON][FOLLOW-UP] Add DataFrame melt to PySpark docs

2022-08-01 Thread GitBox


HyukjinKwon closed pull request #37356: [SPARK-39877][PYTHON][FOLLOW-UP] Add 
DataFrame melt to PySpark docs
URL: https://github.com/apache/spark/pull/37356





[GitHub] [spark] HyukjinKwon commented on pull request #37356: [SPARK-39877][PYTHON][FOLLOW-UP] Add DataFrame melt to PySpark docs

2022-08-01 Thread GitBox


HyukjinKwon commented on PR #37356:
URL: https://github.com/apache/spark/pull/37356#issuecomment-1201971011

   Merged to master.





[GitHub] [spark] HyukjinKwon closed pull request #37351: [SPARK-38864][SQL][FOLLOW-UP] Make AnalysisException message deterministic

2022-08-01 Thread GitBox


HyukjinKwon closed pull request #37351: [SPARK-38864][SQL][FOLLOW-UP] Make 
AnalysisException message deterministic
URL: https://github.com/apache/spark/pull/37351





[GitHub] [spark] HyukjinKwon commented on pull request #37351: [SPARK-38864][SQL][FOLLOW-UP] Make AnalysisException message deterministic

2022-08-01 Thread GitBox


HyukjinKwon commented on PR #37351:
URL: https://github.com/apache/spark/pull/37351#issuecomment-1201970303

   Merged to master.





[GitHub] [spark] HyukjinKwon commented on pull request #37350: [SPARK-39900][SQL] Address partial or negated condition in binary format's predicate pushdown

2022-08-01 Thread GitBox


HyukjinKwon commented on PR #37350:
URL: https://github.com/apache/spark/pull/37350#issuecomment-1201968982

   cc @WeichenXu123 too





[GitHub] [spark] HyukjinKwon commented on pull request #37350: SPARK-39900 : Issue with querying dataframe produced by 'binaryFile' …

2022-08-01 Thread GitBox


HyukjinKwon commented on PR #37350:
URL: https://github.com/apache/spark/pull/37350#issuecomment-1201968566

   LGTM otherwise.





[GitHub] [spark] cloud-fan commented on a diff in pull request #37169: [SPARK-38901][SQL] DS V2 supports push down misc functions

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37169:
URL: https://github.com/apache/spark/pull/37169#discussion_r935075376


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala:
##
@@ -1115,6 +1115,69 @@ class JDBCV2Suite extends QueryTest with 
SharedSparkSession with ExplainSuiteHel
 checkAnswer(df8, Seq(Row("alex")))
   }
 
+  test("scan with filter push-down with misc functions") {
+Seq(false, true).foreach { ansiMode =>
+  withSQLConf(SQLConf.ANSI_ENABLED.key -> ansiMode.toString) {

Review Comment:
   Can we avoid the cast? Let's add a binary type column to the testing table.






[GitHub] [spark] LuciferYang opened a new pull request, #37372: [SPARK-39944][BUILD] Upgrade dropwizard metrics to 4.2.10

2022-08-01 Thread GitBox


LuciferYang opened a new pull request, #37372:
URL: https://github.com/apache/spark/pull/37372

   ### What changes were proposed in this pull request?
   This PR upgrades dropwizard metrics from 4.2.7 to 4.2.10.
   
   
   
   
   ### Why are the changes needed?
   There are 3 versions after 4.2.7; the release notes are as follows:
   
   - https://github.com/dropwizard/metrics/releases/tag/v4.2.8
   - https://github.com/dropwizard/metrics/releases/tag/v4.2.9
   - https://github.com/dropwizard/metrics/releases/tag/v4.2.10
   
   The new version brings a new API for more type-safe registration of gauges (https://github.com/dropwizard/metrics/pull/2642).
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Pass GitHub Actions





[GitHub] [spark] HyukjinKwon commented on pull request #37360: [SPARK-39931][PYTHON][WIP] Improve applyInPandas performance for very small groups

2022-08-01 Thread GitBox


HyukjinKwon commented on PR #37360:
URL: https://github.com/apache/spark/pull/37360#issuecomment-1201963930

   Hm, the general idea might be fine, but I think the implementation is the problem. For example, the current design is that the user-defined `function` always takes exactly one group as `pdf`. To keep this behaviour, you should send multiple groups in one batch, and then apply the same function separately to each group.
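
   As a rough illustration of that contract-preserving approach, a minimal pandas sketch (the helper name and signature are hypothetical, not from the PR):

   ```
   import pandas as pd

   def apply_per_group(batch_pdf: pd.DataFrame, key_cols, func):
       # A batch may carry several small groups; re-split it locally so the
       # user function still sees exactly one group per call.
       parts = [func(group) for _, group in batch_pdf.groupby(key_cols)]
       return pd.concat(parts, ignore_index=True)
   ```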





[GitHub] [spark] cloud-fan closed pull request #37295: [SPARK-39873][SQL] Remove `OptimizeLimitZero` and merge it into `EliminateLimits`

2022-08-01 Thread GitBox


cloud-fan closed pull request #37295: [SPARK-39873][SQL] Remove 
`OptimizeLimitZero` and merge it into `EliminateLimits`
URL: https://github.com/apache/spark/pull/37295





[GitHub] [spark] cloud-fan commented on pull request #37295: [SPARK-39873][SQL] Remove `OptimizeLimitZero` and merge it into `EliminateLimits`

2022-08-01 Thread GitBox


cloud-fan commented on PR #37295:
URL: https://github.com/apache/spark/pull/37295#issuecomment-1201963450

   thanks, merging to master!





[GitHub] [spark] MaxGekk closed pull request #37357: [SPARK-39933][SQL][TESTS] Check query context by `checkError()`

2022-08-01 Thread GitBox


MaxGekk closed pull request #37357: [SPARK-39933][SQL][TESTS] Check query 
context by `checkError()`
URL: https://github.com/apache/spark/pull/37357





[GitHub] [spark] MaxGekk commented on pull request #37357: [SPARK-39933][SQL][TESTS] Check query context by `checkError()`

2022-08-01 Thread GitBox


MaxGekk commented on PR #37357:
URL: https://github.com/apache/spark/pull/37357#issuecomment-1201961100

   Merging to master. Thank you, @cloud-fan for review.





[GitHub] [spark] williamhyun opened a new pull request, #37371: [SPARK-39945][BUILD] Upgrade sbt-mima-plugin to 1.1.0

2022-08-01 Thread GitBox


williamhyun opened a new pull request, #37371:
URL: https://github.com/apache/spark/pull/37371

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] pralabhkumar commented on pull request #37203: [SPARK-39755][CORE] Randomization in Spark local directory for K8 resource managers

2022-08-01 Thread GitBox


pralabhkumar commented on PR #37203:
URL: https://github.com/apache/spark/pull/37203#issuecomment-1201957796

   @HyukjinKwon @cloud-fan @dongjoon-hyun Please find some time to review the PR.





[GitHub] [spark] williamhyun opened a new pull request, #37370: [SPARK-39943][BUILD] Upgrade rocksdbjni to 7.4.4

2022-08-01 Thread GitBox


williamhyun opened a new pull request, #37370:
URL: https://github.com/apache/spark/pull/37370

   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] LuciferYang commented on pull request #36200: [SPARK-38909][CORE][YARN] Encapsulate `LevelDB` used by `ExternalShuffleBlockResolver` and `YarnShuffleService` as `DB`

2022-08-01 Thread GitBox


LuciferYang commented on PR #36200:
URL: https://github.com/apache/spark/pull/36200#issuecomment-1201952802

   thanks @dongjoon-hyun 





[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r935064814


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala:
##
@@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with 
SharedSparkSession with ExplainSuiteHel
 checkAnswer(df2, Seq(Row(2, "david", 10000.00)))
   }
 
+  test("scan with aggregate push-down and top N push-down") {
+val df1 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT").sum("SALARY")
+  .orderBy("DEPT")
+  .limit(1)
+checkSortRemoved(df1)
+checkLimitRemoved(df1)
+checkPushedInfo(df1,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df1, Seq(Row(1, 19000.00)))
+
+val df2 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT".as("my_dept"), $"SALARY")
+  .groupBy("my_dept").sum("SALARY")
+  .orderBy("my_dept")
+  .limit(1)
+checkSortRemoved(df2)
+checkLimitRemoved(df2)
+checkPushedInfo(df2,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df2, Seq(Row(1, 19000.00)))
+
+val df3 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key").sum("SALARY")
+  .orderBy("key")

Review Comment:
   can we also test `.orderBy($"key" + 100)`?






[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r935064627


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala:
##
@@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with 
SharedSparkSession with ExplainSuiteHel
 checkAnswer(df2, Seq(Row(2, "david", 10000.00)))
   }
 
+  test("scan with aggregate push-down and top N push-down") {
+val df1 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT").sum("SALARY")
+  .orderBy("DEPT")
+  .limit(1)
+checkSortRemoved(df1)
+checkLimitRemoved(df1)
+checkPushedInfo(df1,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df1, Seq(Row(1, 19000.00)))
+
+val df2 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT".as("my_dept"), $"SALARY")
+  .groupBy("my_dept").sum("SALARY")
+  .orderBy("my_dept")
+  .limit(1)
+checkSortRemoved(df2)
+checkLimitRemoved(df2)
+checkPushedInfo(df2,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df2, Seq(Row(1, 19000.00)))
+
+val df3 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key").sum("SALARY")
+  .orderBy("key")
+  .limit(1)
+checkSortRemoved(df3)
+checkLimitRemoved(df3)
+checkPushedInfo(df3,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: " +
+"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY 
ELSE 0.00 END]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [" +
+"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 
0.00 END " +
+"ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df3, Seq(Row(0, 44000.00)))
+
+val df4 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT", "IS_MANAGER").sum("SALARY")
+  .orderBy("DEPT", "IS_MANAGER")
+  .limit(1)
+checkSortRemoved(df4)
+checkLimitRemoved(df4)
+checkPushedInfo(df4,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT, IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] 
LIMIT 1")
+checkAnswer(df4, Seq(Row(1, false, 9000.00)))
+
+val df5 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY")
+  .groupBy("my_dept", "my_manager").sum("SALARY")
+  .orderBy("my_dept", "my_manager")
+  .limit(1)
+checkSortRemoved(df5)
+checkLimitRemoved(df5)
+checkPushedInfo(df5,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT, IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] 
LIMIT 1")
+checkAnswer(df5, Seq(Row(1, false, 9000.00)))
+
+val df6 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY", $"IS_MANAGER",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key", "IS_MANAGER").sum("SALARY")
+  .orderBy("key", "IS_MANAGER")
+  .limit(1)
+checkSortRemoved(df6)
+checkLimitRemoved(df6)
+checkPushedInfo(df6,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: " +
+"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY 
ELSE 0.00 END, " +
+"IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [" +
+"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 
0.00 END " +
+"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df6, Seq(Row(0.00, false, 12000.00)))
+
+val df7 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT", $"SALARY")
+  .groupBy("dept").agg(sum("SALARY"))
+  .orderBy(sum("SALARY"))
+  .limit(1)
+checkSortRemoved(df7, false)
+checkLimitRemoved(df7, false)
+checkPushedInfo(df7,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []")
+checkAnswer(df7, Seq(Row(6, 12000.00)))
+
+val df8 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT", $"SALARY")
+  .groupBy("dept").agg(sum("SALARY").as("total"))
+  .orderBy("total")
+  .limit(1)
+checkSortRemoved(df8, false)
+checkLimitRemoved(df8, false)
+checkPushedInfo(df8,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []")
+checkAnswer(df8, Seq(Row(6, 12000.00)))
+  }
+
+ 

[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r935064235


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala:
##
@@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with 
SharedSparkSession with ExplainSuiteHel
 checkAnswer(df2, Seq(Row(2, "david", 10000.00)))
   }
 
+  test("scan with aggregate push-down and top N push-down") {
+val df1 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT").sum("SALARY")
+  .orderBy("DEPT")
+  .limit(1)
+checkSortRemoved(df1)
+checkLimitRemoved(df1)
+checkPushedInfo(df1,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df1, Seq(Row(1, 19000.00)))
+
+val df2 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT".as("my_dept"), $"SALARY")
+  .groupBy("my_dept").sum("SALARY")
+  .orderBy("my_dept")
+  .limit(1)
+checkSortRemoved(df2)
+checkLimitRemoved(df2)
+checkPushedInfo(df2,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df2, Seq(Row(1, 19000.00)))
+
+val df3 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key").sum("SALARY")
+  .orderBy("key")
+  .limit(1)
+checkSortRemoved(df3)
+checkLimitRemoved(df3)
+checkPushedInfo(df3,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: " +
+"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY 
ELSE 0.00 END]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [" +
+"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 
0.00 END " +
+"ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df3, Seq(Row(0, 44000.00)))
+
+val df4 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT", "IS_MANAGER").sum("SALARY")
+  .orderBy("DEPT", "IS_MANAGER")
+  .limit(1)
+checkSortRemoved(df4)
+checkLimitRemoved(df4)
+checkPushedInfo(df4,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT, IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] 
LIMIT 1")
+checkAnswer(df4, Seq(Row(1, false, 9000.00)))
+
+val df5 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY")
+  .groupBy("my_dept", "my_manager").sum("SALARY")
+  .orderBy("my_dept", "my_manager")
+  .limit(1)
+checkSortRemoved(df5)
+checkLimitRemoved(df5)
+checkPushedInfo(df5,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT, IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] 
LIMIT 1")
+checkAnswer(df5, Seq(Row(1, false, 9000.00)))
+
+val df6 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY", $"IS_MANAGER",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key", "IS_MANAGER").sum("SALARY")
+  .orderBy("key", "IS_MANAGER")
+  .limit(1)
+checkSortRemoved(df6)
+checkLimitRemoved(df6)
+checkPushedInfo(df6,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: " +
+"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY 
ELSE 0.00 END, " +
+"IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [" +
+"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 
0.00 END " +
+"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df6, Seq(Row(0.00, false, 12000.00)))
+
+val df7 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT", $"SALARY")
+  .groupBy("dept").agg(sum("SALARY"))
+  .orderBy(sum("SALARY"))
+  .limit(1)
+checkSortRemoved(df7, false)
+checkLimitRemoved(df7, false)
+checkPushedInfo(df7,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []")
+checkAnswer(df7, Seq(Row(6, 12000.00)))
+
+val df8 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT", $"SALARY")
+  .groupBy("dept").agg(sum("SALARY").as("total"))

Review Comment:
   ditto




[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r935063965


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala:
##
@@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with 
SharedSparkSession with ExplainSuiteHel
 checkAnswer(df2, Seq(Row(2, "david", 10000.00)))
   }
 
+  test("scan with aggregate push-down and top N push-down") {
+val df1 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT").sum("SALARY")
+  .orderBy("DEPT")
+  .limit(1)
+checkSortRemoved(df1)
+checkLimitRemoved(df1)
+checkPushedInfo(df1,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df1, Seq(Row(1, 19000.00)))
+
+val df2 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT".as("my_dept"), $"SALARY")
+  .groupBy("my_dept").sum("SALARY")
+  .orderBy("my_dept")
+  .limit(1)
+checkSortRemoved(df2)
+checkLimitRemoved(df2)
+checkPushedInfo(df2,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df2, Seq(Row(1, 19000.00)))
+
+val df3 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key").sum("SALARY")
+  .orderBy("key")
+  .limit(1)
+checkSortRemoved(df3)
+checkLimitRemoved(df3)
+checkPushedInfo(df3,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: " +
+"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY 
ELSE 0.00 END]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [" +
+"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 
0.00 END " +
+"ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df3, Seq(Row(0, 44000.00)))
+
+val df4 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT", "IS_MANAGER").sum("SALARY")
+  .orderBy("DEPT", "IS_MANAGER")
+  .limit(1)
+checkSortRemoved(df4)
+checkLimitRemoved(df4)
+checkPushedInfo(df4,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT, IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] 
LIMIT 1")
+checkAnswer(df4, Seq(Row(1, false, 9000.00)))
+
+val df5 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY")
+  .groupBy("my_dept", "my_manager").sum("SALARY")
+  .orderBy("my_dept", "my_manager")
+  .limit(1)
+checkSortRemoved(df5)
+checkLimitRemoved(df5)
+checkPushedInfo(df5,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT, IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] 
LIMIT 1")
+checkAnswer(df5, Seq(Row(1, false, 9000.00)))
+
+val df6 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY", $"IS_MANAGER",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key", "IS_MANAGER").sum("SALARY")
+  .orderBy("key", "IS_MANAGER")
+  .limit(1)
+checkSortRemoved(df6)
+checkLimitRemoved(df6)
+checkPushedInfo(df6,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: " +
+"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY 
ELSE 0.00 END, " +
+"IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [" +
+"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 
0.00 END " +
+"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df6, Seq(Row(0.00, false, 12000.00)))
+
+val df7 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT", $"SALARY")
+  .groupBy("dept").agg(sum("SALARY"))
+  .orderBy(sum("SALARY"))

Review Comment:
   can we add some comments to explain why top n can't be pushed here?






[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r935063573


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala:
##
@@ -864,6 +851,254 @@ class JDBCV2Suite extends QueryTest with 
SharedSparkSession with ExplainSuiteHel
 checkAnswer(df2, Seq(Row(2, "david", 10000.00)))
   }
 
+  test("scan with aggregate push-down and top N push-down") {
+val df1 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT").sum("SALARY")
+  .orderBy("DEPT")
+  .limit(1)
+checkSortRemoved(df1)
+checkLimitRemoved(df1)
+checkPushedInfo(df1,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df1, Seq(Row(1, 19000.00)))
+
+val df2 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT".as("my_dept"), $"SALARY")
+  .groupBy("my_dept").sum("SALARY")
+  .orderBy("my_dept")
+  .limit(1)
+checkSortRemoved(df2)
+checkLimitRemoved(df2)
+checkPushedInfo(df2,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df2, Seq(Row(1, 19000.00)))
+
+val df3 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key").sum("SALARY")
+  .orderBy("key")
+  .limit(1)
+checkSortRemoved(df3)
+checkLimitRemoved(df3)
+checkPushedInfo(df3,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: " +
+"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY 
ELSE 0.00 END]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [" +
+"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 
0.00 END " +
+"ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df3, Seq(Row(0, 44000.00)))
+
+val df4 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT", "IS_MANAGER").sum("SALARY")
+  .orderBy("DEPT", "IS_MANAGER")
+  .limit(1)
+checkSortRemoved(df4)
+checkLimitRemoved(df4)
+checkPushedInfo(df4,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT, IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] 
LIMIT 1")
+checkAnswer(df4, Seq(Row(1, false, 9000.00)))
+
+val df5 = spark.read
+  .table("h2.test.employee")
+  .select($"DEPT".as("my_dept"), $"IS_MANAGER".as("my_manager"), $"SALARY")

Review Comment:
   We already have tests to verify the alias; we don't need to test it again with 2 columns.






[GitHub] [spark] HeartSaVioR commented on pull request #37368: [SPARK-39940][SS] Refresh catalog table on streaming query with DSv1 sink

2022-08-01 Thread GitBox


HeartSaVioR commented on PR #37368:
URL: https://github.com/apache/spark/pull/37368#issuecomment-1201948170

   cc. @cloud-fan @viirya @zsxwing @xuanyuanking Appreciate your review. Thanks!





[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37368: [SPARK-39940][SS] Refresh catalog table on streaming query with DSv1 sink

2022-08-01 Thread GitBox


HeartSaVioR commented on code in PR #37368:
URL: https://github.com/apache/spark/pull/37368#discussion_r935061578


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
##
@@ -680,7 +680,14 @@ class MicroBatchExecution(
 val batchSinkProgress: Option[StreamWriterCommitProgress] = 
reportTimeTaken("addBatch") {
   SQLExecution.withNewExecutionId(lastExecution) {
 sink match {
-  case s: Sink => s.addBatch(currentBatchId, nextBatch)
+  case s: Sink =>
+s.addBatch(currentBatchId, nextBatch)
+// DSv2 write node has a mechanism to invalidate the DSv2 relation, but there is no
+// corresponding one for DSv1. Given we have the information of the catalog table for the sink,
+// we can refresh the catalog table once the write has succeeded.
+plan.catalogTable.foreach { tbl =>

Review Comment:
   Here I only deal with the DSv1 path, since the DSv2 streaming writer node (`WriteToDataSourceV2Exec`) already contains the mechanism for refreshing the cache. I think this covers the case of a DSv2 table, and I can't think of any coordination between a DSv2 sink and a catalog table, but please correct me if I'm missing something.






[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r935061143


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala:
##
@@ -410,12 +413,21 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] 
with PredicateHelper {
 case s @ Sort(order, _, operation @ ScanOperation(project, filter, 
sHolder: ScanBuilderHolder))
 // Without building the Scan, we do not know the resulting column 
names after aggregate
 // push-down, and thus can't push down Top-N which needs to know the 
ordering column names.
-// TODO: we can support simple cases like GROUP BY columns directly 
and ORDER BY the same
-//   columns, which we know the resulting column names: the 
original table columns.
-if sHolder.pushedAggregate.isEmpty && filter.isEmpty &&
+// In particular, we push down the simple cases like GROUP BY 
expressions directly and
+// ORDER BY the same expressions, which we know the original table 
columns.
+if filter.isEmpty &&
   CollapseProject.canCollapseExpressions(order, project, alwaysInline 
= true) =>
   val aliasMap = getAliasMap(project)
-  val newOrder = order.map(replaceAlias(_, 
aliasMap)).asInstanceOf[Seq[SortOrder]]
+  val aliasReplacedOrder = order.map(replaceAlias(_, aliasMap))
+  val newOrder = if (sHolder.pushedAggregate.isDefined) {
+aliasReplacedOrder.map {

Review Comment:
   let's add some comments here.






[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r935060915


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala:
##
@@ -410,12 +413,21 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] 
with PredicateHelper {
 case s @ Sort(order, _, operation @ ScanOperation(project, filter, 
sHolder: ScanBuilderHolder))
 // Without building the Scan, we do not know the resulting column 
names after aggregate
 // push-down, and thus can't push down Top-N which needs to know the 
ordering column names.
-// TODO: we can support simple cases like GROUP BY columns directly 
and ORDER BY the same
-//   columns, which we know the resulting column names: the 
original table columns.
-if sHolder.pushedAggregate.isEmpty && filter.isEmpty &&
+// In particular, we push down the simple cases like GROUP BY 
expressions directly and
+// ORDER BY the same expressions, which we know the original table 
columns.

Review Comment:
   I think we can remove all these comments here.






[GitHub] [spark] bzhaoopenstack opened a new pull request, #37369: [SPARK-39942][PYTHON][PS] Need to verify the input nums is integer in nsmallest func

2022-08-01 Thread GitBox


bzhaoopenstack opened a new pull request, #37369:
URL: https://github.com/apache/spark/pull/37369

   The input parameter of nsmallest should be validated as an integer; it looks like we are missing this validation.

   PySpark raises a confusing error when unexpected types are passed to the nsmallest func.
   
   
   ### What changes were proposed in this pull request?
   Validate that the input num is an integer, as sketched below.
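
   A minimal sketch of the intended check (the helper name is hypothetical). Note that bool is a subclass of int in Python, so inputs like True have to be rejected explicitly:

   ```
   def _validate_n(n):
       # bool passes isinstance(n, int), so exclude it explicitly
       assert isinstance(n, int) and not isinstance(n, bool), (
           "n must be an integer, got %r" % type(n).__name__)
       return n
   ```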
   
   
   ### Why are the changes needed?
   PySpark raises an error if we do not restrict the type:
   
   ```
   >>> df = ps.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 
'B'])
   >>> df.groupby(['A'])['B']
   
   >>> df.groupby(['A'])['B'].nsmallest(True)
   Traceback (most recent call last):
 File "", line 1, in 
 File "/home/spark/spark/python/pyspark/pandas/groupby.py", line 3598, in 
nsmallest
   sdf.withColumn(temp_rank_column, F.row_number().over(window))
 File "/home/spark/spark/python/pyspark/sql/dataframe.py", line 2129, in 
filter
   jdf = self._jdf.filter(condition._jc)
 File 
"/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/py4j/java_gateway.py",
 line 1321, in __call__
   return_value = get_return_value(
 File "/home/spark/spark/python/pyspark/sql/utils.py", line 196, in deco
   raise converted from None
   pyspark.sql.utils.AnalysisException: cannot resolve '(__rank__ <= true)' due 
to data type mismatch: differing types in '(__rank__ <= true)' (int and 
boolean).;
   'Filter (__rank__#4995 <= true)
   +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, 
__natural_order__#4983L, __rank__#4995]
  +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, 
__natural_order__#4983L, __rank__#4995, __rank__#4995]
 +- Window [row_number() windowspecdefinition(__index_level_0__#4988L, 
B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
__rank__#4995], [__index_level_0__#4988L], [B#4979L ASC NULLS FIRST, 
__natural_order__#4983L ASC NULLS FIRST]
+- Project [__index_level_0__#4988L, __index_level_1__#4989L, 
B#4979L, __natural_order__#4983L]
   +- Project [A#4978L AS __index_level_0__#4988L, 
__index_level_0__#4977L AS __index_level_1__#4989L, B#4979L, 
__natural_order__#4983L]
  +- Project [__index_level_0__#4977L, A#4978L, B#4979L, 
monotonically_increasing_id() AS __natural_order__#4983L]
 +- LogicalRDD [__index_level_0__#4977L, A#4978L, B#4979L], 
false
   
   ```
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Passing a non-integer type now raises an AssertionError when calling the nsmallest func.
   





[GitHub] [spark] HeartSaVioR opened a new pull request, #37368: [SPARK-39940][SS] Refresh catalog table on streaming query with DSv1 sink

2022-08-01 Thread GitBox


HeartSaVioR opened a new pull request, #37368:
URL: https://github.com/apache/spark/pull/37368

   Credit to @pranavanand on figuring out the issue and providing the broken 
test code!
   
   ### What changes were proposed in this pull request?
   
   This PR proposes to refresh the destination catalog table when a streaming query writes to the catalog table via a DSv1 sink.
   
   ### Why are the changes needed?
   
   It has been a long-standing issue that a streaming query is not aware of the catalog table (as opposed to a table brought in by DSv2), hence there is no way for the streaming query to refresh the catalog table after updating it. As a side effect of SPARK-39564, Spark now carries the information of the destination catalog table into MicroBatchExecution, which makes it possible to refresh the table.
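
   For illustration, a user-level PySpark sketch of the same idea (the table name and source DataFrame are hypothetical): refresh the destination catalog table after each micro-batch write, which is what this change does automatically for DSv1 sinks.

   ```
   def write_and_refresh(batch_df, batch_id):
       # write the micro-batch to a catalog table via the DSv1 path
       batch_df.write.mode("append").saveAsTable("default.sink_table")
       # refresh so cached readers of the table see the new data
       spark.catalog.refreshTable("default.sink_table")

   query = source_df.writeStream.foreachBatch(write_and_refresh).start()
   ```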
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New UT.





[GitHub] [spark] cloud-fan commented on a diff in pull request #37325: [SPARK-39902][SQL] Add Scan details to spark plan scan node in SparkUI

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37325:
URL: https://github.com/apache/spark/pull/37325#discussion_r935059096


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala:
##
@@ -132,4 +132,10 @@ case class BatchScanExec(
 val result = s"$nodeName$truncatedOutputString ${scan.description()} 
$runtimeFiltersString"
 redact(result)
   }
+
+  /**
+   * Returns the name of this type of TreeNode.  Defaults to the class name.
+   * Note that we remove the "Exec" suffix for physical operators here.
+   */
+  override def nodeName: String = s"BatchScan ${scan.name()}"

Review Comment:
   I don't think `Scan` can provide more information than `Table` for UI node name purposes. Even the time-travel version is available in `Table`.






[GitHub] [spark] cloud-fan commented on a diff in pull request #37325: [SPARK-39902][SQL] Add Scan details to spark plan scan node in SparkUI

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37325:
URL: https://github.com/apache/spark/pull/37325#discussion_r935058486


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala:
##
@@ -132,4 +132,10 @@ case class BatchScanExec(
 val result = s"$nodeName$truncatedOutputString ${scan.description()} 
$runtimeFiltersString"
 redact(result)
   }
+
+  /**
+   * Returns the name of this type of TreeNode.  Defaults to the class name.
+   * Note that we remove the "Exec" suffix for physical operators here.
+   */
+  override def nodeName: String = s"BatchScan ${scan.name()}"

Review Comment:
   Can we simply use v2 `Table.name`?






[GitHub] [spark] AmplabJenkins commented on pull request #37350: SPARK-39900 : Issue with querying dataframe produced by 'binaryFile' …

2022-08-01 Thread GitBox


AmplabJenkins commented on PR #37350:
URL: https://github.com/apache/spark/pull/37350#issuecomment-1201940622

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #37351: [SPARK-38864][SQL][FOLLOW-UP] Make AnalysisException message deterministic

2022-08-01 Thread GitBox


AmplabJenkins commented on PR #37351:
URL: https://github.com/apache/spark/pull/37351#issuecomment-1201940599

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #37352: [SPARK-39927][BUILD] Upgrade to Avro 1.11.1

2022-08-01 Thread GitBox


AmplabJenkins commented on PR #37352:
URL: https://github.com/apache/spark/pull/37352#issuecomment-1201940582

   Can one of the admins verify this patch?





[GitHub] [spark] bzhaoopenstack opened a new pull request, #37367: [SPARK-39941][PYTHON][PS] period and min_periods should be integer in rolling func

2022-08-01 Thread GitBox


bzhaoopenstack opened a new pull request, #37367:
URL: https://github.com/apache/spark/pull/37367

   The window and min_periods parameters are not validated in the rolling function.
   
   ### What changes were proposed in this pull request?
   Validate that these two parameters are integers in the rolling func, raising a ValueError if they are not, roughly as sketched below.
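
   A minimal sketch of the intended validation (the helper name is hypothetical; bool is excluded because it is a subclass of int):

   ```
   def _validate_rolling_args(window, min_periods):
       if isinstance(window, bool) or not isinstance(window, int):
           raise ValueError("window must be an integer, got %r" % window)
       if min_periods is not None and (
               isinstance(min_periods, bool) or not isinstance(min_periods, int)):
           raise ValueError("min_periods must be an integer or None, got %r" % min_periods)
   ```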
   
   
   ### Why are the changes needed?
   It looks like we are missing validation for these parameters.
   
   PySpark raises an error when other input types are used:
   ```
   >>> s = ps.Series([4, 3, 5, 2, 6])
   >>> s.rolling(1).sum()
    0    4
    1    3
    2    5
    3    2
    4    6
   dtype: int64
   >>> s.rolling('STRING').sum()
   Traceback (most recent call last):
 File "", line 1, in 
 File "/home/spark/spark/python/pyspark/pandas/generic.py", line 2707, in 
rolling
   return Rolling(self, window=window, min_periods=min_periods)
 File "/home/spark/spark/python/pyspark/pandas/window.py", line 179, in 
__init__
   super().__init__(window, min_periods)
 File "/home/spark/spark/python/pyspark/pandas/window.py", line 147, in 
__init__
   if window < 0:
   TypeError: '<' not supported between instances of 'str' and 'int'
   >>> 
   ```
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Passing other types for window and min_periods to the rolling func now raises a ValueError.
   





[GitHub] [spark] deshanxiao commented on pull request #37336: [SPARK-39916][SQL][MLLIB][REFACTOR] Merge ml SchemaUtils to SQL

2022-08-01 Thread GitBox


deshanxiao commented on PR #37336:
URL: https://github.com/apache/spark/pull/37336#issuecomment-1201931299

   Very happy to receive these suggestions. Please let me know if anyone has any questions. I will close this PR.





[GitHub] [spark] bzhaoopenstack opened a new pull request, #37366: [SPARK-39939][PYTHON][PS] return self.copy during calling shift with period == 0

2022-08-01 Thread GitBox


bzhaoopenstack opened a new pull request, #37366:
URL: https://github.com/apache/spark/pull/37366

   PySpark raises an error when we call the shift func with periods=0.

   Pandas, by contrast, returns a copy of the same object in this case.
   
   ### What changes were proposed in this pull request?
   Return self.copy() when periods == 0, as sketched below.
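
   A minimal sketch of the proposed early return (the surrounding method body is reconstructed from the traceback below, not copied from the PR):

   ```
   def shift(self, periods=1, fill_value=None):
       # pandas treats shift(periods=0) as an identity operation, so
       # short-circuit before building the lag window expression that
       # Spark rejects with "Cannot specify window frame for lag function"
       if periods == 0:
           return self.copy()
       return self._shift(periods, fill_value).spark.analyzed
   ```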
   
   
   ### Why are the changes needed?
   The behaviors of PySpark and pandas differ:
   
   PySpark:
   ```
   >>> df = ps.DataFrame({'Col1': [10, 20, 15, 30, 45], 'Col2': [13, 23, 18, 
33, 48],'Col3': [17, 27, 22, 37, 52]},columns=['Col1', 'Col2', 'Col3'])
   >>> df.Col1.shift(periods=3)
   0     NaN
   1     NaN
   2     NaN
   3    10.0
   4    20.0
   Name: Col1, dtype: float64
   >>> df.Col1.shift(periods=0)
   Traceback (most recent call last):
     File "", line 1, in 
     File "/home/spark/spark/python/pyspark/pandas/base.py", line 1170, in shift
       return self._shift(periods, fill_value).spark.analyzed
     File "/home/spark/spark/python/pyspark/pandas/spark/accessors.py", line 
256, in analyzed
       return first_series(DataFrame(self._data._internal.resolved_copy))
     File "/home/spark/spark/python/pyspark/pandas/utils.py", line 589, in 
wrapped_lazy_property
       setattr(self, attr_name, fn(self))
     File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1173, in 
resolved_copy
       sdf = self.spark_frame.select(self.spark_columns + list(HIDDEN_COLUMNS))
     File "/home/spark/spark/python/pyspark/sql/dataframe.py", line 2073, in 
select
       jdf = self._jdf.select(self._jcols(*cols))
     File 
"/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/py4j/java_gateway.py",
 line 1321, in __call__
       return_value = get_return_value(
     File "/home/spark/spark/python/pyspark/sql/utils.py", line 196, in deco
       raise converted from None
   pyspark.sql.utils.AnalysisException: Cannot specify window frame for lag 
function
   ```
   
   pandas:
   ```
   >>> pdf = pd.DataFrame({'Col1': [10, 20, 15, 30, 45], 'Col2': [13, 23, 18, 
33, 48],'Col3': [17, 27, 22, 37, 52]},columns=['Col1', 'Col2', 'Col3'])
   >>> pdf.Col1.shift(periods=3)
   0     NaN
   1     NaN
   2     NaN
   3    10.0
   4    20.0
   Name: Col1, dtype: float64
   >>> pdf.Col1.shift(periods=0)
   0    10
   1    20
   2    15
   3    30
   4    45
   Name: Col1, dtype: int64
   ```
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Call the shift func with periods == 0.
   





[GitHub] [spark] dongjoon-hyun commented on pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl

2022-08-01 Thread GitBox


dongjoon-hyun commented on PR #37287:
URL: https://github.com/apache/spark/pull/37287#issuecomment-1201919934

   Got it.





[GitHub] [spark] dongjoon-hyun closed pull request #37325: [SPARK-39902][SQL] Add Scan details to spark plan scan node in SparkUI

2022-08-01 Thread GitBox


dongjoon-hyun closed pull request #37325: [SPARK-39902][SQL] Add Scan details 
to spark plan scan node in SparkUI
URL: https://github.com/apache/spark/pull/37325





[GitHub] [spark] bzhaoopenstack opened a new pull request, #37365: [SPARK-39938][python][PS] Accept all inputs of prefix/suffix which implement __str__ in add_predix/add_suffix

2022-08-01 Thread GitBox


bzhaoopenstack opened a new pull request, #37365:
URL: https://github.com/apache/spark/pull/37365

   We need to follow pandas' behavior for prefix/suffix parameter validation in add_prefix/add_suffix.
   
   Currently we force-validate the parameter as a String type, but pandas accepts any value that can be formatted as a string (i.e., that implements the __str__ function), so the behavior differs.
   
   ### What changes were proposed in this pull request?
   Support all kinds of inputs that can be formatted as a string.
   
   ### Why are the changes needed?
   pandas behaves differently from PySpark when other types are passed to the add_prefix/add_suffix functions.
   PySpark
   ```
   >>> from pyspark import pandas as ps
   >>> df = ps.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B'])
   >>> df.add_suffix(666)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/home/spark/spark/python/pyspark/pandas/frame.py", line 9060, in add_suffix
       assert isinstance(suffix, str)
   AssertionError
   >>> df.add_suffix(True)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/home/spark/spark/python/pyspark/pandas/frame.py", line 9060, in add_suffix
       assert isinstance(suffix, str)
   AssertionError
   ```
   
   Pandas: 1.3.X/1.4.X
   ```
   >>> pdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B'])
   >>> pdf.add_suffix(0.1)
      A0.1  B0.1
   0     1     3
   1     2     4
   2     3     5
   3     4     6
   >>> pdf.add_suffix(True)
      ATrue  BTrue
   0      1      3
   1      2      4
   2      3      5
   3      4      6
   ```
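
   A minimal sketch of the relaxed validation (assumed shape; the actual patch may differ):
   ```
   # Hypothetical: coerce the argument instead of asserting its type.
   def add_suffix(self, suffix):
       # previously: assert isinstance(suffix, str)
       suffix = str(suffix)  # accept anything implementing __str__, as pandas does
       return self.rename(columns=lambda name: str(name) + suffix)
   ```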
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Pass various stringable inputs into the add_prefix/add_suffix functions.
   





[GitHub] [spark] Jeffreychen99 commented on a diff in pull request #37364: [SPARK-39936][SQL] Store schema in properties for Spark Views

2022-08-01 Thread GitBox


Jeffreychen99 commented on code in PR #37364:
URL: https://github.com/apache/spark/pull/37364#discussion_r935031668


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala:
##
@@ -2386,10 +2385,6 @@ class HiveDDLSuite
 "CREATE TABLE t1 USING PARQUET AS SELECT NULL AS null_col",
 "Parquet data source does not support void data type")
 
-  assertAnalysisError(

Review Comment:
   Oops - sorry forgot to add them back






[GitHub] [spark] cloud-fan commented on a diff in pull request #37364: [SPARK-39936][SQL] Store schema in properties for Spark Views

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37364:
URL: https://github.com/apache/spark/pull/37364#discussion_r935030049


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala:
##
@@ -2386,10 +2385,6 @@ class HiveDDLSuite
 "CREATE TABLE t1 USING PARQUET AS SELECT NULL AS null_col",
 "Parquet data source does not support void data type")
 
-  assertAnalysisError(

Review Comment:
   do we still need to remove this test?



##
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala:
##
@@ -2400,10 +2395,6 @@ class HiveDDLSuite
 "CREATE TABLE t1 (v VOID) USING PARQUET",
 "Parquet data source does not support void data type")
 
-  assertAnalysisError(

Review Comment:
   ditto






[GitHub] [spark] cloud-fan commented on pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl

2022-08-01 Thread GitBox


cloud-fan commented on PR #37287:
URL: https://github.com/apache/spark/pull/37287#issuecomment-1201897650

   This is another instance where `db.tbl` is inconsistent with `spark_catalog.db.tbl`. This PR fixes it.
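
   As an illustration (hypothetical session; the exact surface depends on this PR), both spellings should behave the same after the change:
   ```
   # Both identifiers should resolve to the same table in the session catalog:
   spark.catalog.tableExists("default.tbl")
   spark.catalog.tableExists("spark_catalog.default.tbl")
   ```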





[GitHub] [spark] dongjoon-hyun commented on pull request #36200: [SPARK-38909][CORE][YARN] Encapsulate `LevelDB` used by `ExternalShuffleBlockResolver` and `YarnShuffleService` as `DB`

2022-08-01 Thread GitBox


dongjoon-hyun commented on PR #36200:
URL: https://github.com/apache/spark/pull/36200#issuecomment-1201896768

   Although I couldn't take a look at this, you can ask the YARN committers for help if you want to move this forward, @LuciferYang.





[GitHub] [spark] ivoson commented on pull request #37268: [SPARK-39853][CORE] Support stage level task resource schedule for standalone cluster when dynamic allocation disabled

2022-08-01 Thread GitBox


ivoson commented on PR #37268:
URL: https://github.com/apache/spark/pull/37268#issuecomment-1201892453

   > so I would like to see the issue or this PR description have much more 
details about design, API, and its behavior. For instance:
   > 
   > ```
   > Does this PR introduce any user-facing change?
   > 
   > No 
   > ```
   > 
   > This is not true, this is very much user impacting API.
   > 
   > Also the issue I linked to talked about the policy for reusing executors. 
What is the proposal for that here? What happens if the task resource specified 
doesn't fit in the executor resources? I'd like to make sure we come to 
agreement on what this api is and does and user impact of that. Does this api 
also apply to reusing executor with dynamic allocation on... as commented above 
seems limiting. This is a public API so I want to make sure the changes work in 
various situations.
   > 
   > Docs will need to update to explain new API and behavior.
   
   Thanks @tgravescs , will update the doc and then we can discuss.
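
   For readers following the thread, a minimal sketch of the stage-level scheduling API whose standalone-mode behavior is under discussion (illustrative only; `sc` is an existing SparkContext):
   ```
   from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests

   # Ask for 2 CPUs per task on the stages computed from this RDD.
   task_reqs = TaskResourceRequests().cpus(2)
   profile = ResourceProfileBuilder().require(task_reqs).build  # `build` is a property
   rdd = sc.parallelize(range(10)).withResources(profile)
   rdd.collect()
   ```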





[GitHub] [spark] cloud-fan commented on a diff in pull request #37364: [SPARK-39936] Store schema in properties for Spark Views

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37364:
URL: https://github.com/apache/spark/pull/37364#discussion_r935011158


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSourceSuite.scala:
##
@@ -378,4 +378,15 @@ class HiveParquetSourceSuite extends ParquetPartitioningTest with ParquetTest {
   }
 }
   }
+
+  test("Create view with dashes in column type") {
+    sql(
+      """
+        |CREATE TABLE tbl (f array>) USING parquet
+        |""".stripMargin)
+    sql(
+      """
+        |CREATE VIEW tblView AS SELECT f FROM tbl

Review Comment:
   and let's make the test complete
   ```
   withView("t") {
     sql("CREATE VIEW t AS SELECT STRUCT('a' AS `$a`, 1 AS b) q")
     checkAnswer(spark.table("t"), Row(Row("a", 1)))
   }
   ```






[GitHub] [spark] cloud-fan commented on a diff in pull request #37364: [SPARK-39936] Store schema in properties for Spark Views

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37364:
URL: https://github.com/apache/spark/pull/37364#discussion_r935010782


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSourceSuite.scala:
##
@@ -378,4 +378,15 @@ class HiveParquetSourceSuite extends ParquetPartitioningTest with ParquetTest {
   }
 }
   }
+
+  test("Create view with dashes in column type") {
+    sql(
+      """
+        |CREATE TABLE tbl (f array>) USING parquet
+        |""".stripMargin)
+    sql(
+      """
+        |CREATE VIEW tblView AS SELECT f FROM tbl

Review Comment:
   we can simply use the existing test case `CREATE VIEW t AS SELECT STRUCT('a' AS `$a`, 1 AS b) q`






[GitHub] [spark] cloud-fan commented on a diff in pull request #37364: [SPARK-39936] Store schema in properties for Spark Views

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37364:
URL: https://github.com/apache/spark/pull/37364#discussion_r935010522


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala:
##
@@ -160,18 +160,6 @@ class HiveCatalogedDDLSuite extends DDLSuite with TestHiveSingleton with BeforeA
   }
 
   test("SPARK-22431: illegal nested type") {
-    val queries = Seq(
-      "CREATE TABLE t USING hive AS SELECT STRUCT('a' AS `$a`, 1 AS b) q",
-      "CREATE TABLE t(q STRUCT<`$a`:INT, col2:STRING>, i1 INT) USING hive",
-      "CREATE VIEW t AS SELECT STRUCT('a' AS `$a`, 1 AS b) q")

Review Comment:
   I think we only need to remove this line from the test






[GitHub] [spark] cloud-fan commented on a diff in pull request #37364: [SPARK-39936] Store schema in properties for Spark Views

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37364:
URL: https://github.com/apache/spark/pull/37364#discussion_r935010178


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:
##
@@ -283,7 +283,22 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     // columns etc. in table properties, so that we can work around the Hive metastore issue
     // about not case preserving and make Hive serde table and view support mixed-case column
     // names.
-    properties = tableDefinition.properties ++ tableMetaToTableProps(tableDefinition))
+    properties = tableDefinition.properties ++ tableMetaToTableProps(tableDefinition)
+  )
+  try {
+    client.createTable(tableWithDataSourceProps, ignoreIfExists)
+  } catch {
+    case NonFatal(e) =>
+      // If for some reason we fail to store the schema we store it as empty there
+      // since we already store the real schema in the table properties. This try-catch
+      // should only be necessary for Spark views which are incompatible with Hive
+      client.createTable(
+        tableWithDataSourceProps.copy(
+          schema = StructType(EMPTY_DATA_SCHEMA ++ tableDefinition.partitionSchema)

Review Comment:
   A view cannot have partition columns, so this can simply be `new StructType()`.






[GitHub] [spark] cloud-fan commented on a diff in pull request #37364: [SPARK-39936] Store schema in properties for Spark Views

2022-08-01 Thread GitBox


cloud-fan commented on code in PR #37364:
URL: https://github.com/apache/spark/pull/37364#discussion_r93501


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:
##
@@ -283,7 +283,22 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     // columns etc. in table properties, so that we can work around the Hive metastore issue
     // about not case preserving and make Hive serde table and view support mixed-case column
     // names.
-    properties = tableDefinition.properties ++ tableMetaToTableProps(tableDefinition))
+    properties = tableDefinition.properties ++ tableMetaToTableProps(tableDefinition)
+  )
+  try {
+    client.createTable(tableWithDataSourceProps, ignoreIfExists)
+  } catch {
+    case NonFatal(e) =>

Review Comment:
   ```suggestion
   case NonFatal(e) if ...is_view... =>
   ```






[GitHub] [spark] amaliujia commented on pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl

2022-08-01 Thread GitBox


amaliujia commented on PR #37287:
URL: https://github.com/apache/spark/pull/37287#issuecomment-1201853151

   The test is failing for example on 
   ```
   Expected:
   Database(name='default', catalog=None, description='default database', ...
   Got:
   Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/__w/spark/spark/python/target/4ab5b07b-a1fd-4be3-b29b-6bfc1ac33d6d/e75ca4c7-9d5e-409d-b319-fba011a1ad51')
   ```
   
   I think the actual result is expected as of now.





[GitHub] [spark] Jeffreychen99 opened a new pull request, #37364: [SPARK-39936] Store schema in properties for Spark Views

2022-08-01 Thread GitBox


Jeffreychen99 opened a new pull request, #37364:
URL: https://github.com/apache/spark/pull/37364

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] AmplabJenkins commented on pull request #37356: [SPARK-39877][PYTHON][FOLLOW-UP] Add DataFrame melt to PySpark docs

2022-08-01 Thread GitBox


AmplabJenkins commented on PR #37356:
URL: https://github.com/apache/spark/pull/37356#issuecomment-1201816031

   Can one of the admins verify this patch?





[GitHub] [spark] iemejia commented on pull request #37352: [SPARK-39927][BUILD] Upgrade to Avro 1.11.1

2022-08-01 Thread GitBox


iemejia commented on PR #37352:
URL: https://github.com/apache/spark/pull/37352#issuecomment-1201761684

   @xkrogen It is indeed. The announcement has not gone out yet but the 
binaries are already available.
   https://lists.apache.org/thread/8tk92x804owmjbvtj57xxzwhxn1qh3ho





[GitHub] [spark] dongjoon-hyun commented on pull request #37346: [SPARK-37210][CORE][SQL] Allow forced use of staging directory

2022-08-01 Thread GitBox


dongjoon-hyun commented on PR #37346:
URL: https://github.com/apache/spark/pull/37346#issuecomment-1201742819

   Thank you for making a PR, @wForget.
   
   To @viirya and @sunchao: this issue has a reproducible example in the JIRA.





[GitHub] [spark] dongjoon-hyun commented on pull request #37337: [SPARK-39917][SQL] Use different error classes for numeric/interval arithmetic overflow

2022-08-01 Thread GitBox


dongjoon-hyun commented on PR #37337:
URL: https://github.com/apache/spark/pull/37337#issuecomment-1201728069

   Thank you for swiftly recovering the `master` branch, @gengliangwang.





[GitHub] [spark] gengliangwang commented on pull request #37337: [SPARK-39917][SQL] Use different error classes for numeric/interval arithmetic overflow

2022-08-01 Thread GitBox


gengliangwang commented on PR #37337:
URL: https://github.com/apache/spark/pull/37337#issuecomment-1201718125

   The master branch failed to compile after merging this PR: 
https://github.com/apache/spark/runs/7616787860?check_suite_focus=true
   I am reverting it to unblock the development of other contributors.





[GitHub] [spark] AmplabJenkins commented on pull request #37359: [SPARK-25342][CORE][SQL]Support rolling back a result stage and rerunning all result tasks when writing files

2022-08-01 Thread GitBox


AmplabJenkins commented on PR #37359:
URL: https://github.com/apache/spark/pull/37359#issuecomment-1201674509

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #37360: [SPARK-39931][PYTHON][WIP] Improve applyInPandas performance for very small groups

2022-08-01 Thread GitBox


AmplabJenkins commented on PR #37360:
URL: https://github.com/apache/spark/pull/37360#issuecomment-1201674443

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #37361: [SPARK-39925][SQL] Add array_sort(column, comparator) overload to DataFrame operations

2022-08-01 Thread GitBox


AmplabJenkins commented on PR #37361:
URL: https://github.com/apache/spark/pull/37361#issuecomment-1201674401

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #37362: WIP: Revert "[SPARK-33933][SQL] Materialize BroadcastQueryStage first to try to avoid broadcast timeout in AQE"

2022-08-01 Thread GitBox


AmplabJenkins commented on PR #37362:
URL: https://github.com/apache/spark/pull/37362#issuecomment-1201674356

   Can one of the admins verify this patch?





[GitHub] [spark] gengliangwang commented on pull request #37254: [SPARK-39841][SQL] simplify conflict binary comparison

2022-08-01 Thread GitBox


gengliangwang commented on PR #37254:
URL: https://github.com/apache/spark/pull/37254#issuecomment-1201666315

   cc @sigmod as well




