[GitHub] [spark] LuciferYang commented on a diff in pull request #37624: [SPARK-40186][CORE][YARN] Ensure `mergedShuffleCleaner` have been shutdown before `db` close

2022-08-23 Thread GitBox


LuciferYang commented on code in PR #37624:
URL: https://github.com/apache/spark/pull/37624#discussion_r953372176


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -795,13 +796,25 @@ public void registerExecutor(String appId, ExecutorShuffleInfo executorInfo) {
   }
 
   /**
-   * Close the DB during shutdown
+   * Shutdown mergedShuffleCleaner and close the DB during shutdown
    */
   @Override
   public void close() {
+    if (!mergedShuffleCleaner.isShutdown()) {
+      try {
+        mergedShuffleCleaner.shutdown();
+        boolean terminated = mergedShuffleCleaner.awaitTermination(10L, TimeUnit.SECONDS);
+        if (!terminated) {
+          mergedShuffleCleaner.shutdownNow();
+        }
+      } catch (InterruptedException ignored) {
+        // ignore InterruptedException.
+      }
+    }

Review Comment:
   
[7bc9e1c](https://github.com/apache/spark/pull/37624/commits/7bc9e1c6d515b3b7646a6800fcaacc3cd088a1ac) fixes this.






[GitHub] [spark] LuciferYang commented on a diff in pull request #37624: [SPARK-40186][CORE][YARN] Ensure `mergedShuffleCleaner` have been shutdown before `db` close

2022-08-23 Thread GitBox


LuciferYang commented on code in PR #37624:
URL: https://github.com/apache/spark/pull/37624#discussion_r953368551


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -795,13 +796,25 @@ public void registerExecutor(String appId, ExecutorShuffleInfo executorInfo) {
   }
 
   /**
-   * Close the DB during shutdown
+   * Shutdown mergedShuffleCleaner and close the DB during shutdown
    */
   @Override
   public void close() {
+    if (!mergedShuffleCleaner.isShutdown()) {
+      try {
+        mergedShuffleCleaner.shutdown();
+        boolean terminated = mergedShuffleCleaner.awaitTermination(10L, TimeUnit.SECONDS);
+        if (!terminated) {
+          mergedShuffleCleaner.shutdownNow();
+        }
+      } catch (InterruptedException ignored) {
+        // ignore InterruptedException.
+      }
+    }

Review Comment:
   ```java
   pool.shutdown(); // Disable new tasks from being submitted
   try {
     // Wait a while for existing tasks to terminate
     if (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
       pool.shutdownNow(); // Cancel currently executing tasks
       // Wait a while for tasks to respond to being cancelled
       if (!pool.awaitTermination(60, TimeUnit.SECONDS))
         System.err.println("Pool did not terminate");
     }
   } catch (InterruptedException ie) {
     // (Re-)Cancel if current thread also interrupted
     pool.shutdownNow();
     // Preserve interrupt status
     Thread.currentThread().interrupt();
   }
   ```
   
   like above?
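
   For reference, a minimal self-contained sketch of that two-phase pattern as a reusable helper (the helper name and the parameterized timeout are illustrative, not code from this PR); `close()` could call something like this on `mergedShuffleCleaner` before closing the DB:

   ```java
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.TimeUnit;

   final class TwoPhaseShutdown {
     // Hypothetical helper: shut a pool down in two phases before a
     // dependent resource (e.g. the DB) is closed.
     static void shutdownAndAwait(ExecutorService pool, long timeoutSeconds) {
       pool.shutdown(); // disable new tasks from being submitted
       try {
         // Phase 1: wait for in-flight tasks to finish on their own.
         if (!pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
           pool.shutdownNow(); // Phase 2: cancel currently executing tasks
           // Give cancelled tasks a moment to respond to interruption.
           if (!pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
             System.err.println("Pool did not terminate");
           }
         }
       } catch (InterruptedException ie) {
         // (Re-)cancel if the current thread is also interrupted,
         // then preserve the interrupt status for callers.
         pool.shutdownNow();
         Thread.currentThread().interrupt();
       }
     }
   }
   ```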






[GitHub] [spark] AngersZhuuuu commented on pull request #36056: [SPARK-36571][SQL] Add an SQLOverwriteHadoopMapReduceCommitProtocol to support all SQL overwrite write data to staging dir

2022-08-23 Thread GitBox


AngersZhuuuu commented on PR #36056:
URL: https://github.com/apache/spark/pull/36056#issuecomment-1225216261

   > Could you rebase this PR once more? I'll review this PR, @AngersZhuuuu.
   
   Conflicts resolved





[GitHub] [spark] cloud-fan commented on a diff in pull request #37544: [SPARK-40110][SQL][TESTS] Add JDBCWithAQESuite

2022-08-23 Thread GitBox


cloud-fan commented on code in PR #37544:
URL: https://github.com/apache/spark/pull/37544#discussion_r953362653


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:
##
@@ -44,7 +45,8 @@ import org.apache.spark.sql.test.SharedSparkSession
 import org.apache.spark.sql.types._
 import org.apache.spark.util.Utils
 
-class JDBCSuite extends QueryTest with SharedSparkSession {

Review Comment:
   `spark.sql.adaptive.forceApply` is designed for testing. Do you have use 
cases to turn it on in production? If use cases are valid, we should make it a 
public conf (instead of internal).






[GitHub] [spark] cloud-fan commented on pull request #37619: [SPARK-40088][SQL][TESTS] Add SparkPlanWithAQESuite

2022-08-23 Thread GitBox


cloud-fan commented on PR #37619:
URL: https://github.com/apache/spark/pull/37619#issuecomment-1225207900

   why do we care about AQE force apply? It's a testing feature and should 
never be enabled in production.





[GitHub] [spark] HyukjinKwon commented on pull request #37642: [SPARK-40202][PYTHON][SQL] Allow a dictionary in SparkSession.config in PySpark

2022-08-23 Thread GitBox


HyukjinKwon commented on PR #37642:
URL: https://github.com/apache/spark/pull/37642#issuecomment-1225206430

   cc @srowen, @viirya @ueshin FYI
   
   I know that the main reason for adding that signature is the typing issue, which isn't a problem on the Python side, so I don't feel strongly about it if you are in doubt. I just thought it's still a bit useful to pass a dictionary, plus it gives feature parity.





[GitHub] [spark] HyukjinKwon opened a new pull request, #37642: [SPARK-40202][PYTHON][SQL] Allow a dictionary in SparkSession.config in PySpark

2022-08-23 Thread GitBox


HyukjinKwon opened a new pull request, #37642:
URL: https://github.com/apache/spark/pull/37642

   ### What changes were proposed in this pull request?
   
   This PR adds dictionary support in `SparkSession.config`; see also SPARK-40163.
   
   ### Why are the changes needed?
   
   For API parity w/ Scala side.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. After this change, users can pass a dictionary as below:
   
   ```python
   >>> SparkSession.builder.config(
   ...     map={"spark.some.config.number": 123, "spark.some.config.float": 0.123})
   ```
   
   ### How was this patch tested?
   
   Manually tested, and doctest was added.





[GitHub] [spark] zhouyejoe commented on a diff in pull request #37624: [SPARK-40186][CORE][YARN] Ensure `mergedShuffleCleaner` have been shutdown before `db` close

2022-08-23 Thread GitBox


zhouyejoe commented on code in PR #37624:
URL: https://github.com/apache/spark/pull/37624#discussion_r953356051


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -795,13 +796,25 @@ public void registerExecutor(String appId, ExecutorShuffleInfo executorInfo) {
   }
 
   /**
-   * Close the DB during shutdown
+   * Shutdown mergedShuffleCleaner and close the DB during shutdown
    */
   @Override
   public void close() {
+    if (!mergedShuffleCleaner.isShutdown()) {
+      try {
+        mergedShuffleCleaner.shutdown();
+        boolean terminated = mergedShuffleCleaner.awaitTermination(10L, TimeUnit.SECONDS);
+        if (!terminated) {
+          mergedShuffleCleaner.shutdownNow();
+        }
+      } catch (InterruptedException ignored) {
+        // ignore InterruptedException.
+      }
+    }

Review Comment:
   Just wondering, should we follow the example provided in the Javadoc, with a two-phase shutdown?
   
   https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html






[GitHub] [spark] mskapilks commented on pull request #37615: [SPARK-40124][SQL][TEST][3.3] Update TPCDS v1.4 q32 for Plan Stability tests

2022-08-23 Thread GitBox


mskapilks commented on PR #37615:
URL: https://github.com/apache/spark/pull/37615#issuecomment-1225183559

   Thanks for the review.
   Makes sense not to update the benchmark file for this PR. Will raise PRs for the other branches now.





[GitHub] [spark] HyukjinKwon closed pull request #37637: [SPARK-40152][SQL][TESTS] Move tests from SplitPart to elementAt

2022-08-23 Thread GitBox


HyukjinKwon closed pull request #37637: [SPARK-40152][SQL][TESTS] Move tests 
from SplitPart to elementAt
URL: https://github.com/apache/spark/pull/37637





[GitHub] [spark] HyukjinKwon commented on pull request #37637: [SPARK-40152][SQL][TESTS] Move tests from SplitPart to elementAt

2022-08-23 Thread GitBox


HyukjinKwon commented on PR #37637:
URL: https://github.com/apache/spark/pull/37637#issuecomment-1225178389

   Merged to master





[GitHub] [spark] kazuyukitanimura commented on pull request #37619: [SPARK-40088][SQL][TESTS] Add SparkPlanWithAQESuite

2022-08-23 Thread GitBox


kazuyukitanimura commented on PR #37619:
URL: https://github.com/apache/spark/pull/37619#issuecomment-1225168794

   Actually, there is only one more suite that fails with AQE force apply as far as I know. I don't think we need to test all suites with a separate GitHub Actions job.





[GitHub] [spark] dongjoon-hyun commented on pull request #37633: [SPARK-40198][CORE] Enable `spark.storage.decommission.(rdd|shuffle)Blocks.enabled` by default

2022-08-23 Thread GitBox


dongjoon-hyun commented on PR #37633:
URL: https://github.com/apache/spark/pull/37633#issuecomment-1225167337

   Merged to master for Apache Spark 3.4.0.





[GitHub] [spark] dongjoon-hyun closed pull request #37633: [SPARK-40198][CORE] Enable `spark.storage.decommission.(rdd|shuffle)Blocks.enabled` by default

2022-08-23 Thread GitBox


dongjoon-hyun closed pull request #37633: [SPARK-40198][CORE] Enable 
`spark.storage.decommission.(rdd|shuffle)Blocks.enabled` by default
URL: https://github.com/apache/spark/pull/37633





[GitHub] [spark] kazuyukitanimura commented on a diff in pull request #37544: [SPARK-40110][SQL][TESTS] Add JDBCWithAQESuite

2022-08-23 Thread GitBox


kazuyukitanimura commented on code in PR #37544:
URL: https://github.com/apache/spark/pull/37544#discussion_r953322660


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:
##
@@ -44,7 +45,8 @@ import org.apache.spark.sql.test.SharedSparkSession
 import org.apache.spark.sql.types._
 import org.apache.spark.util.Utils
 
-class JDBCSuite extends QueryTest with SharedSparkSession {

Review Comment:
   In that sense, I could avoid adding `DisableAdaptiveExecutionSuite` to the base suite, so that it stays closer to the default.






[GitHub] [spark] dongjoon-hyun commented on pull request #37633: [SPARK-40198][CORE] Enable `spark.storage.decommission.(rdd|shuffle)Blocks.enabled` by default

2022-08-23 Thread GitBox


dongjoon-hyun commented on PR #37633:
URL: https://github.com/apache/spark/pull/37633#issuecomment-1225166713

   Thank you!





[GitHub] [spark] LuciferYang commented on pull request #37604: [DON'T MERGE] Try to replace all `json4s` with `Jackson`

2022-08-23 Thread GitBox


LuciferYang commented on PR #37604:
URL: https://github.com/apache/spark/pull/37604#issuecomment-1225165035

   
[2ea9dcf](https://github.com/apache/spark/pull/37604/commits/2ea9dcf8d41d1bf44ddba496707e696d50ce1cc9) adds a simple benchmark; I will update the results later.
   
   





[GitHub] [spark] kazuyukitanimura commented on pull request #37619: [SPARK-40088][SQL][TESTS] Add SparkPlanWithAQESuite

2022-08-23 Thread GitBox


kazuyukitanimura commented on PR #37619:
URL: https://github.com/apache/spark/pull/37619#issuecomment-1225164465

   The intention is to add tests for cases that fail with AQE "force" apply on. I can add one more job in GitHub Actions with AQE force apply enabled, if that makes sense?





[GitHub] [spark] kazuyukitanimura commented on pull request #37500: [SPARK-40049][SQL][TESTS] Add ReplaceNullWithFalseInPredicateWithAQEEndToEndSuite

2022-08-23 Thread GitBox


kazuyukitanimura commented on PR #37500:
URL: https://github.com/apache/spark/pull/37500#issuecomment-1225161845

   Yes, AQE is on by default; however, `spark.sql.adaptive.forceApply` is off by default, and enabling force apply changes the execution plan. Perhaps I should have been clearer on this point.
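
   For illustration, a minimal standalone sketch of how a user could turn the flag on when building a session (a hypothetical example, not code from this PR; the app name and master are placeholders):

   ```java
   import org.apache.spark.sql.SparkSession;

   public class AqeForceApplyDemo {
     public static void main(String[] args) {
       // AQE itself defaults to on; forceApply additionally wraps every query
       // in adaptive execution, even plans without exchanges or subqueries.
       SparkSession spark = SparkSession.builder()
           .master("local[*]")
           .appName("aqe-force-apply-demo")
           .config("spark.sql.adaptive.enabled", "true")
           .config("spark.sql.adaptive.forceApply", "true")
           .getOrCreate();
       // The printed plan differs from what the default forceApply=false yields.
       spark.range(10).groupBy().count().explain();
       spark.stop();
     }
   }
   ```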





[GitHub] [spark] kazuyukitanimura commented on a diff in pull request #37544: [SPARK-40110][SQL][TESTS] Add JDBCWithAQESuite

2022-08-23 Thread GitBox


kazuyukitanimura commented on code in PR #37544:
URL: https://github.com/apache/spark/pull/37544#discussion_r953317605


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:
##
@@ -44,7 +45,8 @@ import org.apache.spark.sql.test.SharedSparkSession
 import org.apache.spark.sql.types._
 import org.apache.spark.util.Utils
 
-class JDBCSuite extends QueryTest with SharedSparkSession {

Review Comment:
   It is actually the opposite. This PR adds a test for the case where a user enables `spark.sql.adaptive.forceApply`, which can happen in production.






[GitHub] [spark] dongjoon-hyun commented on pull request #37633: [SPARK-40198][CORE] Enable `spark.storage.decommission.(rdd|shuffle)Blocks.enabled` by default

2022-08-23 Thread GitBox


dongjoon-hyun commented on PR #37633:
URL: https://github.com/apache/spark/pull/37633#issuecomment-1225159349

   Hi, @viirya . Could you review this? This PR will help the executor rolling 
and worker decommission users.





[GitHub] [spark] ulysses-you opened a new pull request, #37641: [SPARK-40201][SQL][TESTS] Improve v1 write test coverage

2022-08-23 Thread GitBox


ulysses-you opened a new pull request, #37641:
URL: https://github.com/apache/spark/pull/37641

   
   
   ### What changes were proposed in this pull request?
   
   A mismatched ordering should be considered a bug if `spark.sql.optimizer.plannedWrite.enabled` is enabled, so throw an exception in test mode when the ordering does not match, to help improve test coverage.
   
   
   ### Why are the changes needed?
   
   Make the v1 write check work across all SQL tests
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, this only affects tests
   
   ### How was this patch tested?
   
   pass CI





[GitHub] [spark] cloud-fan commented on a diff in pull request #36918: [SQL][SPARK-39528] Use V2 Filter in SupportsRuntimeFiltering

2022-08-23 Thread GitBox


cloud-fan commented on code in PR #36918:
URL: https://github.com/apache/spark/pull/36918#discussion_r953292454


##
sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala:
##
@@ -1805,3 +1805,21 @@ class DynamicPartitionPruningV2SuiteAEOff extends DynamicPartitionPruningV2Suite
 
 class DynamicPartitionPruningV2SuiteAEOn extends DynamicPartitionPruningV2Suite
   with EnableAdaptiveExecutionSuite
+
+abstract class DynamicPartitionPruningV2FilterSuite
+    extends DynamicPartitionPruningDataSourceSuiteBase {

Review Comment:
   Shall we extend `DynamicPartitionPruningV2Suite` here? Then we can drop the `override protected def runAnalyzeColumnCommands: Boolean = false`, and the catalog configs will be overwritten.






[GitHub] [spark] cfmcgrady commented on a diff in pull request #37439: [SPARK-39896][SQL] UnwrapCastInBinaryComparison should work when the literal of In/InSet downcast failed

2022-08-23 Thread GitBox


cfmcgrady commented on code in PR #37439:
URL: https://github.com/apache/spark/pull/37439#discussion_r953283999


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala:
##
@@ -156,29 +157,36 @@ object UnwrapCastInBinaryComparison extends Rule[LogicalPlan] {
         case lit @ Literal(null, _) => nullList += lit
         case lit @ NonNullLiteral(_, _) =>
          unwrapCast(EqualTo(in.value, lit)) match {
-            case EqualTo(_, unwrapLit: Literal) => canCastList += unwrapLit
-            case e @ And(IsNull(_), Literal(null, BooleanType)) => cannotCastList += e
+            // `unwrapCast` returning None means the literal cannot be cast to fromType,
+            // for instance (the bound reference is of type DECIMAL(5,2)):
+            // CAST(boundreference() AS DECIMAL(10,4)) = 123456.1234BD
+            // Due to `cast(lit, fromExp.dataType) == null` we simply return
+            // `falseIfNotNull(fromExp)`.
+            case None | Some(And(IsNull(_), Literal(null, BooleanType))) =>

Review Comment:
   As the comment at L149 ~ L153 shows:
   
   > There are 3 kinds of literals in the in.list:
   > 1. null literals
   > 2. The literals that can cast to fromExp.dataType
   > 3. The literals that cannot cast to fromExp.dataType
   
   For kind 3, we have two cases.
   
   - A. The literal cannot be cast to fromExp.dataType and there is no min/max for the `fromType`; `unwrapCast` then returns None.
   
   for instance:
   
   ```
   cast(input[2, decimal(5,2), true] as decimal(10,4)) = 123456.1234
   ```
   
   
   - B. The literal `value` is out of the `fromType` range `(min, max)`; `unwrapCast` then returns `falseIfNotNull(fromExp)`.
   
   for instance:
   
   ```
   cast(input[0, smallint, true] as bigint) = 2147483647
   ```
   






[GitHub] [spark] zhengruifeng commented on pull request #37628: [SPARK-40192][SQL][ML] Remove redundant groupby

2022-08-23 Thread GitBox


zhengruifeng commented on PR #37628:
URL: https://github.com/apache/spark/pull/37628#issuecomment-1225121461

   LGTM for the ML side
   
   But for the SQL side, all changes are in test suites; I am not sure whether those `groupby` calls were added intentionally.





[GitHub] [spark] gengliangwang commented on a diff in pull request #37620: [SPARK-40183][SQL] Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow in decimal conversion

2022-08-23 Thread GitBox


gengliangwang commented on code in PR #37620:
URL: https://github.com/apache/spark/pull/37620#discussion_r953272400


##
sql/core/src/test/resources/sql-tests/results/cast.sql.out:
##
@@ -866,10 +866,10 @@ struct<>
 -- !query output
 org.apache.spark.SparkArithmeticException
 {
-  "errorClass" : "CANNOT_CHANGE_DECIMAL_PRECISION",
+  "errorClass" : "NUMERIC_VALUE_OUT_OF_RANGE",
   "sqlState" : "22005",
   "messageParameters" : {
-"value" : "Decimal(compact, 10, 18, 6)",
+"value" : "0.10",

Review Comment:
   I would prefer showing the original interval value in the message.






[GitHub] [spark] cloud-fan commented on a diff in pull request #37620: [SPARK-40183][SQL] Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow in decimal conversion

2022-08-23 Thread GitBox


cloud-fan commented on code in PR #37620:
URL: https://github.com/apache/spark/pull/37620#discussion_r953269944


##
sql/core/src/test/resources/sql-tests/results/cast.sql.out:
##
@@ -866,10 +866,10 @@ struct<>
 -- !query output
 org.apache.spark.SparkArithmeticException
 {
-  "errorClass" : "CANNOT_CHANGE_DECIMAL_PRECISION",
+  "errorClass" : "NUMERIC_VALUE_OUT_OF_RANGE",
   "sqlState" : "22005",
   "messageParameters" : {
-"value" : "Decimal(compact, 10, 18, 6)",
+"value" : "0.10",

Review Comment:
   and we should use `toPrecision` instead?






[GitHub] [spark] yabola commented on pull request #37545: [SPARK-40113][SQL] Reactor ParquetScanBuilder DataSourceV2 interface implementations

2022-08-23 Thread GitBox


yabola commented on PR #37545:
URL: https://github.com/apache/spark/pull/37545#issuecomment-1225106869

   @cloud-fan cc~





[GitHub] [spark] LuciferYang commented on a diff in pull request #37624: [SPARK-40186][CORE][YARN] Ensure `mergedShuffleCleaner` have been shutdown before `db` close

2022-08-23 Thread GitBox


LuciferYang commented on code in PR #37624:
URL: https://github.com/apache/spark/pull/37624#discussion_r953266185


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -795,13 +796,25 @@ public void registerExecutor(String appId, ExecutorShuffleInfo executorInfo) {
   }
 
   /**
-   * Close the DB during shutdown
+   * Shutdown mergedShuffleCleaner and close the DB during shutdown
    */
   @Override
   public void close() {
+    if (!mergedShuffleCleaner.isShutdown()) {
+      try {
+        mergedShuffleCleaner.shutdown();
+        boolean isTermination = mergedShuffleCleaner.awaitTermination(10L, TimeUnit.SECONDS);

Review Comment:
   done






[GitHub] [spark] cloud-fan commented on pull request #37619: [SPARK-40088][SQL][TESTS] Add SparkPlanWithAQESuite

2022-08-23 Thread GitBox


cloud-fan commented on PR #37619:
URL: https://github.com/apache/spark/pull/37619#issuecomment-1225087639

   If we want to test AQE off for all test suites, can we add one more job in GitHub Actions with AQE turned off?





[GitHub] [spark] cloud-fan commented on pull request #37500: [SPARK-40049][SQL][TESTS] Add ReplaceNullWithFalseInPredicateWithAQEEndToEndSuite

2022-08-23 Thread GitBox


cloud-fan commented on PR #37500:
URL: https://github.com/apache/spark/pull/37500#issuecomment-1225085428

   Hey, are we going to update every test suite to run with both AQE on and 
off? I don't think it's worthwhile since AQE is already on by default.





[GitHub] [spark] cloud-fan commented on a diff in pull request #37544: [SPARK-40110][SQL][TESTS] Add JDBCWithAQESuite

2022-08-23 Thread GitBox


cloud-fan commented on code in PR #37544:
URL: https://github.com/apache/spark/pull/37544#discussion_r953252623


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:
##
@@ -44,7 +45,8 @@ import org.apache.spark.sql.test.SharedSparkSession
 import org.apache.spark.sql.types._
 import org.apache.spark.util.Utils
 
-class JDBCSuite extends QueryTest with SharedSparkSession {

Review Comment:
   does it mean this PR is testing something that will never happen in 
production?






[GitHub] [spark] lvshaokang opened a new pull request, #37640: [SPARK-38752][SQL][TESTS] Test the error class: UNSUPPORTED_DATATYPE

2022-08-23 Thread GitBox


lvshaokang opened a new pull request, #37640:
URL: https://github.com/apache/spark/pull/37640

   ### What changes were proposed in this pull request?
   
   I added a test case for the error class `UNSUPPORTED_DATATYPE` in the QueryExecutionErrorsSuite.
   
   ### Why are the changes needed?
   
   To improve test coverage, and record expected error messages in tests.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   ./build/sbt "test:testOnly *QueryExecutionErrorsSuite"





[GitHub] [spark] deshanxiao commented on pull request #37628: [SPARK-40192][SQL][ML] Remove redundant groupby

2022-08-23 Thread GitBox


deshanxiao commented on PR #37628:
URL: https://github.com/apache/spark/pull/37628#issuecomment-1225075925

   cc @zhengruifeng @LuciferYang @cloud-fan 





[GitHub] [spark] panbingkun opened a new pull request, #37639: [Don't review][TEST] investigate the root cause for SPARK-40165

2022-08-23 Thread GitBox


panbingkun opened a new pull request, #37639:
URL: https://github.com/apache/spark/pull/37639

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   





[GitHub] [spark] Yikun commented on pull request #37584: [SPARK-39150][PS] Enable doctest which was disabled when pandas 1.4 upgrade

2022-08-23 Thread GitBox


Yikun commented on PR #37584:
URL: https://github.com/apache/spark/pull/37584#issuecomment-1225062529

   @HyukjinKwon @dongjoon-hyun Oh, Thanks!
   
   I thought the unit tests had to consider multiple versions but overlooked the doctests. I will pay more attention in the future.





[GitHub] [spark] ljfgem commented on a diff in pull request #35636: [SPARK-31357][SQL][WIP] Catalog API for view metadata

2022-08-23 Thread GitBox


ljfgem commented on code in PR #35636:
URL: https://github.com/apache/spark/pull/35636#discussion_r953238606


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -447,6 +454,74 @@ class Analyzer(override val catalogManager: CatalogManager)
     }
   }
 
+  /**
+   * Substitute persisted views in parsed plans with parsed view sql text.
+   */
+  case class ViewSubstitution(sqlParser: ParserInterface) extends Rule[LogicalPlan] {
+
+    def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
+      case u @ UnresolvedRelation(nameParts, _, _) if v1SessionCatalog.isTempView(nameParts) =>
+        u
+      case u @ UnresolvedRelation(
+          parts @ NonSessionCatalogAndIdentifier(catalog, ident), _, _) if !isSQLOnFile(parts) =>

Review Comment:
   Could we avoid limiting it to the non-session catalog? There might be a use case where `spark_catalog` is set to a view catalog which also extends `CatalogExtension`, like [CoralSparkViewCatalog](https://github.com/linkedin/coral/pull/297/files#diff-e0aa76aa937fc5e2d279def7c494f4e33d2ab35cdc30b5822f1e8746b2a21500)









[GitHub] [spark] rmcyang commented on a diff in pull request #34074: [SPARK-33573][SHUFFLE][YARN] Shuffle server side metrics for Push-based shuffle

2022-08-23 Thread GitBox


rmcyang commented on code in PR #34074:
URL: https://github.com/apache/spark/pull/34074#discussion_r953226026


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -1410,4 +1442,42 @@ long getPos() {
       return pos;
     }
   }
+
+  /**
+   * A class that wraps all the push-based metrics.
+   */
+  static class PushMergeMetrics implements MetricSet {
+    static final String NO_OPPORTUNITY_RESPONSES_METRIC = "couldNotFindOpportunityResponses";
+    static final String TOO_LATE_RESPONSES_METRIC = "tooLateResponses";
+    static final String PUSHED_BYTES_WRITTEN_METRIC = "pushedBytesWritten";
+    static final String CACHED_BLOCKS_BYTES_METRIC = "cachedBlocksBytes";
+
+    private final Map<String, Metric> allMetrics;
+    private final Meter noOpportunityResponses;
+    private final Meter tooLateResponses;
+    private final Meter pushedBytesWritten;
+    private final Counter cachedBlockBytes;

Review Comment:
   Fixed in the PR: https://github.com/apache/spark/pull/37638



##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -904,13 +933,15 @@ public void onComplete(String streamId) throws IOException {
         // the error code to the client.
         AppShuffleMergePartitionsInfo info = appShuffleInfo.shuffles.get(partitionInfo.shuffleId);
         if (isTooLate(info, partitionInfo.reduceId)) {
-          deferredBufs = null;
+          freeAnyDeferredBlocks();
+          mergeManager.pushMergeMetrics.tooLateResponses.mark();
           throw new BlockPushNonFatalFailure(
             new BlockPushReturnCode(ReturnCode.TOO_LATE_BLOCK_PUSH.id(), streamId).toByteBuffer(),
             BlockPushNonFatalFailure.getErrorMsg(streamId, ReturnCode.TOO_LATE_BLOCK_PUSH));
         }
         if (isStale(info, partitionInfo.shuffleMergeId)) {
-          deferredBufs = null;
+          freeAnyDeferredBlocks();
+          mergeManager.pushMergeMetrics.noOpportunityResponses.mark();

Review Comment:
   Fixed in the PR: https://github.com/apache/spark/pull/37638






[GitHub] [spark] rmcyang commented on a diff in pull request #34074: [SPARK-33573][SHUFFLE][YARN] Shuffle server side metrics for Push-based shuffle

2022-08-23 Thread GitBox


rmcyang commented on code in PR #34074:
URL: https://github.com/apache/spark/pull/34074#discussion_r953225817


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -1410,4 +1442,42 @@ long getPos() {
       return pos;
     }
   }
+
+  /**
+   * A class that wraps all the push-based metrics.
+   */
+  static class PushMergeMetrics implements MetricSet {
+    static final String NO_OPPORTUNITY_RESPONSES_METRIC = "couldNotFindOpportunityResponses";
+    static final String TOO_LATE_RESPONSES_METRIC = "tooLateResponses";
+    static final String PUSHED_BYTES_WRITTEN_METRIC = "pushedBytesWritten";
+    static final String CACHED_BLOCKS_BYTES_METRIC = "cachedBlocksBytes";
+
+    private final Map<String, Metric> allMetrics;
+    private final Meter noOpportunityResponses;
+    private final Meter tooLateResponses;
+    private final Meter pushedBytesWritten;
+    private final Counter cachedBlockBytes;
+
+    private PushMergeMetrics() {
+      allMetrics = new HashMap<>();
+      noOpportunityResponses = new Meter();
+      allMetrics.put(NO_OPPORTUNITY_RESPONSES_METRIC, noOpportunityResponses);
+      tooLateResponses = new Meter();
+      allMetrics.put(TOO_LATE_RESPONSES_METRIC, tooLateResponses);
+      pushedBytesWritten = new Meter();
+      allMetrics.put(PUSHED_BYTES_WRITTEN_METRIC, pushedBytesWritten);
+      cachedBlockBytes = new Counter();
+      allMetrics.put(CACHED_BLOCKS_BYTES_METRIC, cachedBlockBytes);
+    }
+
+    @Override
+    public Map<String, Metric> getMetrics() {
+      return allMetrics;
+    }
+
+    @VisibleForTesting
+    Counter getCachedBlockBytes() {
+      return cachedBlockBytes;
+    }

Review Comment:
   Thanks for the suggestion, fixed in the PR: 
https://github.com/apache/spark/pull/37638
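
   As context for reviewers, a minimal standalone sketch of how a Codahale `MetricSet` like `PushMergeMetrics` is typically consumed, assuming the standard Dropwizard Metrics API (the class and metric names here are illustrative stand-ins, not code from the PR):

   ```java
   import java.util.HashMap;
   import java.util.Map;

   import com.codahale.metrics.Counter;
   import com.codahale.metrics.Meter;
   import com.codahale.metrics.Metric;
   import com.codahale.metrics.MetricRegistry;
   import com.codahale.metrics.MetricSet;

   public class MetricSetDemo {
     // Hypothetical stand-in for PushMergeMetrics with one meter and one counter.
     static class DemoMetrics implements MetricSet {
       final Meter tooLateResponses = new Meter();
       final Counter cachedBlockBytes = new Counter();

       @Override
       public Map<String, Metric> getMetrics() {
         Map<String, Metric> m = new HashMap<>();
         m.put("tooLateResponses", tooLateResponses);
         m.put("cachedBlocksBytes", cachedBlockBytes);
         return m;
       }
     }

     public static void main(String[] args) {
       DemoMetrics metrics = new DemoMetrics();
       MetricRegistry registry = new MetricRegistry();
       registry.registerAll(metrics); // exposes every entry of getMetrics()

       metrics.tooLateResponses.mark();    // one too-late block push response
       metrics.cachedBlockBytes.inc(4096); // bytes buffered for a deferred block

       System.out.println(registry.meter("tooLateResponses").getCount()); // 1
     }
   }
   ```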






[GitHub] [spark] rmcyang commented on pull request #37638: [SPARK-33573][SHUFFLE][YARN] Shuffle server side metrics for Push-based shuffle

2022-08-23 Thread GitBox


rmcyang commented on PR #37638:
URL: https://github.com/apache/spark/pull/37638#issuecomment-1225047079

   cc @zhouyejoe @otterc @mridulm @xkrogen. Please take a look, thanks.





[GitHub] [spark] rmcyang opened a new pull request, #37638: [SPARK-33573][SHUFFLE][YARN] Shuffle server side metrics for Push-based shuffle

2022-08-23 Thread GitBox


rmcyang opened a new pull request, #37638:
URL: https://github.com/apache/spark/pull/37638

   ### What changes were proposed in this pull request?
   This is one of the patches for SPARK-33235: Push-based Shuffle Improvement 
Tasks.
   Added a class `PushMergeMetrics` to collect the metrics below from the shuffle server side for push-based shuffle:
   - no opportunity responses
   - too late responses
   - pushed bytes written
   - cached block bytes
   
   ### Why are the changes needed?
   This helps with understanding push-based shuffle metrics from the shuffle server side.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Added a method `verifyMetrics` to verify those metrics in existing unit 
tests.
   





[GitHub] [spark] wangyum commented on a diff in pull request #37626: [SPARK-40152][SQL][TESTS] Add tests for SplitPart

2022-08-23 Thread GitBox


wangyum commented on code in PR #37626:
URL: https://github.com/apache/spark/pull/37626#discussion_r953222334


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala:
##
@@ -2522,4 +2522,24 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
       Date.valueOf("2017-02-12")))
   }
+
+  test("SplitPart") {
+    val delimiter = Literal.create(".", StringType)
+    val str = StringSplitSQL(Literal.create("11.12.13", StringType), delimiter)
+    val outOfBoundValue = Some(Literal.create("", StringType))
+
+    checkEvaluation(ElementAt(str, Literal(3), outOfBoundValue), UTF8String.fromString("13"))
+    checkEvaluation(ElementAt(str, Literal(1), outOfBoundValue), UTF8String.fromString("11"))
+    checkEvaluation(ElementAt(str, Literal(10), outOfBoundValue), UTF8String.fromString(""))
+    checkEvaluation(ElementAt(str, Literal(-10), outOfBoundValue), UTF8String.fromString(""))
+
+    checkEvaluation(ElementAt(StringSplitSQL(Literal.create(null, StringType), delimiter),
+      Literal(1), outOfBoundValue), null)
+    checkEvaluation(ElementAt(StringSplitSQL(Literal.create("11.12.13", StringType),
+      Literal.create(null, StringType)), Literal(1), outOfBoundValue), null)
+
+    intercept[Exception] {
+      checkEvaluation(ElementAt(str, Literal(0), outOfBoundValue), null)

Review Comment:
   https://github.com/apache/spark/pull/37637






[GitHub] [spark] wangyum opened a new pull request, #37637: [SPARK-40152][SQL][TESTS] Move tests from SplitPart to elementAt

2022-08-23 Thread GitBox


wangyum opened a new pull request, #37637:
URL: https://github.com/apache/spark/pull/37637

   ### What changes were proposed in this pull request?
   
   Move tests from SplitPart to elementAt in CollectionExpressionsSuite.
   
   ### Why are the changes needed?
   
   Simplify test.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
   
   N/A.
   





[GitHub] [spark] zzzzming95 opened a new pull request, #37636: url_decode() should the return an ERROR_CLASS

2022-08-23 Thread GitBox


zzzzming95 opened a new pull request, #37636:
URL: https://github.com/apache/spark/pull/37636

   
   
   ### What changes were proposed in this pull request?
   
   
   url_decode() should return an ERROR_CLASS when the parameter input is invalid, like:
   ```
   spark.sql("SELECT url_decode('http%3A%2F%2spark.apache.org')").show
   ```
   
   output:
   ```
   org.apache.spark.SparkIllegalArgumentException: [CANNOT_DECODE_URL] Cannot decode url : http%3A%2F%2spark.apache.org.
   URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "2s"
     at org.apache.spark.sql.errors.QueryExecutionErrors$.illegalUrlError(QueryExecutionErrors.scala:351)
     at org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:117)
     at org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:498)
     at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike.invoke(objects.scala:148)
   ```
   
   ### Why are the changes needed?
   
   
   url_decode() should return an ERROR_CLASS.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   no
   
   ### How was this patch tested?
   
   
   build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z 
url-functions.sql"





[GitHub] [spark] panbingkun commented on a diff in pull request #37598: [SPARK-40165][BUILD] Update test plugins to latest versions

2022-08-23 Thread GitBox


panbingkun commented on code in PR #37598:
URL: https://github.com/apache/spark/pull/37598#discussion_r953215392


##
pom.xml:
##
@@ -2932,7 +2932,7 @@
 
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-surefire-plugin</artifactId>
-        <version>3.0.0-M5</version>
+        <version>3.0.0-M7</version>

Review Comment:
   OK






[GitHub] [spark] xinrong-meng commented on a diff in pull request #37635: [WIP] Support NumPy arrays in built-in functions

2022-08-23 Thread GitBox


xinrong-meng commented on code in PR #37635:
URL: https://github.com/apache/spark/pull/37635#discussion_r953211654


##
python/pyspark/sql/tests/test_functions.py:
##
@@ -1003,6 +1003,30 @@ def test_np_scalar_input(self):
             res = df.select(array_position(df.data, dtype(1)).alias("c")).collect()
             self.assertEqual([Row(c=1), Row(c=0)], res)
 
+    @unittest.skipIf(not have_numpy, "NumPy not installed")
+    def test_ndarray_input(self):
+        import numpy as np
+
+        int_arrs = [np.array([1, 2]).astype(t) for t in ["int8", "int16", "int32", "int64"]]
+        for arr in int_arrs:
+            self.assertEqual(
+                [Row(b=[1, 2])], self.spark.range(1).select(lit(arr).alias("b")).collect()
+            )
+
+        float_arrs = [np.array([0.1, 0.2]).astype(t) for t in ["float32", "float64"]]
+
+        self.assertEqual(

Review Comment:
   Cannot compare Row equality:
   ```
   self.assertEqual(Row(b=[0.1000149011612, 0.2000298023224]), self.spark.range(1).select(lit(float_arrs[0]).alias("b")).collect())
   ```
   due to
   ```
   AssertionError: Row(b=[0.1000149011612, 0.2000298023224]) != [Row(b=[0.1000149011612, 0.2000298023224])]
   ```
   Instead, we compare `dtypes` and the actual data in the Row.









[GitHub] [spark] xinrong-meng opened a new pull request, #37635: [WIP] Support NumPy arrays in built-in functions

2022-08-23 Thread GitBox


xinrong-meng opened a new pull request, #37635:
URL: https://github.com/apache/spark/pull/37635

   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   





[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

2022-08-23 Thread GitBox


xkrogen commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1225021291

   cc @shardulm94 (thanks for the push towards constructing the path into a 
single message instead of using chained exceptions, the additional code changes 
were minimal and the resulting message is much nicer)
   and a few folks who have been working on error reporting @MaxGekk @karenfeng 
@gengliangwang





[GitHub] [spark] xkrogen opened a new pull request, #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

2022-08-23 Thread GitBox


xkrogen opened a new pull request, #37634:
URL: https://github.com/apache/spark/pull/37634

   ### What changes were proposed in this pull request?
   This modifies `GenerateUnsafeProjection` to wrap projections of non-null 
fields in try-catch blocks which swallow any `NullPointerException` that is 
thrown, and instead throw a helpful error message indicating that a null value 
was encountered where the schema indicated non-null. This new error is added in 
`QueryExecutionErrors`.
   
   Small modifications are made to a few methods in `GenerateUnsafeProjection` 
to allow for passing down the path to the projection in question, to give the 
user a helpful indication of what they need to change. Getting the name of the 
top-level column seems to require substantial changes since the name is thrown 
away when the `BoundReference` is created (in favor of an ordinal), so for the 
top-level only an ordinal is given; for nested fields the name is provided. An 
example error message looks like:
   
   ```
   java.lang.RuntimeException: The value at .`middle`.`bottom` cannot be 
null, but a NULL was found. This is typically caused by the presence of a NULL 
value when the schema indicates the value should be non-null. Check that the 
input data matches the schema and/or that UDFs which can return null have a 
nullable return schema.
   ```
   
   ### Why are the changes needed?
   This is needed to help users decipher the error message; currently a 
`NullPointerException` without any message is thrown, which provides the user 
no guidance on what they've done wrong, and typically leads them to believe 
there is a bug in Spark. See the Jira for a specific example of how this 
behavior can be triggered and what the exception looks like currently.
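
   A hypothetical minimal repro of that mismatch (the schema and UDF below are
   illustrative, not taken from the Jira; assumes an active SparkSession `spark`):

   ```python
   from pyspark.sql.functions import udf
   from pyspark.sql.types import StructType, StructField, StringType

   # The declared schema promises `bottom` is non-null, but the UDF returns
   # None for it anyway.
   schema = StructType([StructField("bottom", StringType(), nullable=False)])
   broken = udf(lambda: (None,), schema)
   spark.range(1).select(broken().alias("middle")).collect()
   # Previously: a bare NullPointerException from the generated projection.
   # With this change: the descriptive message shown above.
   ```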
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, in the case that a user has a data-schema mismatch, they will now get a much more helpful error message. In other cases, no change.
   
   ### How was this patch tested?
   See tests in `DataFrameSuite` and `GeneratedProjectionSuite`.





[GitHub] [spark] dongjoon-hyun commented on pull request #37584: [SPARK-39150][PS] Enable doctest which was disabled when pandas 1.4 upgrade

2022-08-23 Thread GitBox


dongjoon-hyun commented on PR #37584:
URL: https://github.com/apache/spark/pull/37584#issuecomment-1225015233

   +1 for reverting. Thank you!





[GitHub] [spark] dongjoon-hyun closed pull request #37615: [SPARK-40124][SQL][TEST][3.3] Update TPCDS v1.4 q32 for Plan Stability tests

2022-08-23 Thread GitBox


dongjoon-hyun closed pull request #37615: [SPARK-40124][SQL][TEST][3.3] Update 
TPCDS v1.4 q32 for Plan Stability tests
URL: https://github.com/apache/spark/pull/37615





[GitHub] [spark] HyukjinKwon commented on pull request #37584: [SPARK-39150][PS] Enable doctest which was disabled when pandas 1.4 upgrade

2022-08-23 Thread GitBox


HyukjinKwon commented on PR #37584:
URL: https://github.com/apache/spark/pull/37584#issuecomment-1225006025

   Oh, actually we should revert this because the tests should pass with other 
pandas versions. Sorry it was my mistake.





[GitHub] [spark] dongjoon-hyun commented on pull request #37615: [SPARK-40124][SQL][TEST][3.3] Update TPCDS v1.4 q32 for Plan Stability tests

2022-08-23 Thread GitBox


dongjoon-hyun commented on PR #37615:
URL: https://github.com/apache/spark/pull/37615#issuecomment-1225005197

   Thank you for pinging me, @kazuyukitanimura .





[GitHub] [spark] HyukjinKwon closed pull request #37521: [SPARK-40078][PYTHON][DOCS] Make pyspark.sql.column examples self-contained

2022-08-23 Thread GitBox


HyukjinKwon closed pull request #37521: [SPARK-40078][PYTHON][DOCS] Make 
pyspark.sql.column examples self-contained
URL: https://github.com/apache/spark/pull/37521





[GitHub] [spark] HyukjinKwon commented on pull request #37521: [SPARK-40078][PYTHON][DOCS] Make pyspark.sql.column examples self-contained

2022-08-23 Thread GitBox


HyukjinKwon commented on PR #37521:
URL: https://github.com/apache/spark/pull/37521#issuecomment-1225003808

   Merged to master.





[GitHub] [spark] HyukjinKwon commented on pull request #37444: [SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1)

2022-08-23 Thread GitBox


HyukjinKwon commented on PR #37444:
URL: https://github.com/apache/spark/pull/37444#issuecomment-1225003672

   We should make the tests pass before merging it in (https://github.com/Transurgeon/spark/runs/7981691501).
   
   cc @dcoliversun @khalidmammadov FYI if you guys find some time to review, 
and work on the rest of API.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37444: [SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1)

2022-08-23 Thread GitBox


HyukjinKwon commented on code in PR #37444:
URL: https://github.com/apache/spark/pull/37444#discussion_r953198186


##
python/pyspark/sql/dataframe.py:
##
@@ -3393,12 +3549,22 @@ def toDF(self, *cols: "ColumnOrName") -> "DataFrame":
 Parameters
 --
 cols : str
-new column names
+new column names. The length of the list needs to be the same as 
the number of columns in the initial DataFrame

Review Comment:
   
   ```suggestion
   new column names. The length of the list needs to be the same as 
the number of columns in the initial :class:`DataFrame`
   ```
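
   A quick usage sketch of that constraint (illustrative; assumes an active
   SparkSession `spark`):

   ```python
   df = spark.createDataFrame([(14, "Tom"), (23, "Alice")], ["age", "name"])
   df.toDF("a", "b").columns  # ['a', 'b'] -- one new name per existing column
   ```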






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37444: [SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1)

2022-08-23 Thread GitBox


HyukjinKwon commented on code in PR #37444:
URL: https://github.com/apache/spark/pull/37444#discussion_r953198052


##
python/pyspark/sql/dataframe.py:
##
@@ -3356,11 +3476,29 @@ def drop(self, *cols: "ColumnOrName") -> "DataFrame":  
# type: ignore[misc]
 
 Examples
 
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, 
"Bob")], ["age", "name"])
+>>> df.show()
++---+-+
+|age| name|
++---+-+
+| 14|  Tom|
+| 23|Alice|
+| 16|  Bob|
++---+-+
 >>> df.drop('age').collect()
-[Row(name='Alice'), Row(name='Bob')]
+[Row(name='Tom'), Row(name='Alice'), Row(name='Bob')]
 
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, 
"Bob")], ["age", "name"])
+>>> df.show()
++---+-+
+|age| name|
++---+-+
+| 14|  Tom|
+| 23|Alice|
+| 16|  Bob|
++---+-+
 >>> df.drop(df.age).collect()
-[Row(name='Alice'), Row(name='Bob')]
+[Row(name='Tom'), Row(name='Alice'), Row(name='Bob')]
 
 >>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).collect()

Review Comment:
   I think it's showing a common example that joins two DataFrames and then drops the join key.
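
   For instance, a sketch of that pattern (here `df2` is assumed to share the
   join key `name`):

   ```python
   df = spark.createDataFrame([(14, "Tom"), (23, "Alice")], ["age", "name"])
   df2 = spark.createDataFrame([("Tom", 80)], ["name", "height"])
   # Inner join on name, then drop the duplicated join key from the left side.
   df.join(df2, df.name == df2.name, "inner").drop(df.name).columns
   # ['age', 'name', 'height']
   ```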






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37444: [SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1)

2022-08-23 Thread GitBox


HyukjinKwon commented on code in PR #37444:
URL: https://github.com/apache/spark/pull/37444#discussion_r953197497


##
python/pyspark/sql/dataframe.py:
##
@@ -2743,7 +2814,20 @@ def fillna(
 
 Examples
 
->>> df4.na.fill(50).show()
+Fill all null values with 50 when the data type of the column is an 
integer
+
+   >>> df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"),
+... (None, None, "Tom"), (None, None, None)], ["age", "height", 
"name"])

Review Comment:
   indentation



##
python/pyspark/sql/dataframe.py:
##
@@ -2753,7 +2837,19 @@ def fillna(
 | 50|50| null|
 +---+--+-+
 
->>> df5.na.fill(False).show()
+   Fill all null values with False when the data type of the column is a 
boolean

Review Comment:
   ```suggestion
Fill all null values with ``False`` when the data type of the column is 
a boolean
   ```






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37444: [SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1)

2022-08-23 Thread GitBox


HyukjinKwon commented on code in PR #37444:
URL: https://github.com/apache/spark/pull/37444#discussion_r953197392


##
python/pyspark/sql/dataframe.py:
##
@@ -1179,6 +1231,16 @@ def distinct(self) -> "DataFrame":
 
 Examples
 
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
+... (23, "Alice")], ["age", "name"]) 
+>>> df.show()
++---+-+
+|age| name|
++---+-+
+| 14|  Tom|
+| 23|Alice|
+| 23|Alice|
++---+-+
 >>> df.distinct().count()

Review Comment:
   ditto






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37444: [SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1)

2022-08-23 Thread GitBox


HyukjinKwon commented on code in PR #37444:
URL: https://github.com/apache/spark/pull/37444#discussion_r953197159


##
python/pyspark/sql/dataframe.py:
##
@@ -798,8 +820,18 @@ def count(self) -> int:
 
 Examples
 
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
+... (16, "Bob")], ["age", "name"])
+>>> df.show()
++---+-+
+|age| name|
++---+-+
+| 14|  Tom|
+| 23|Alice|
+| 16|  Bob|
++---+-+
 >>> df.count()

Review Comment:
   ditto. let's add some description



##
python/pyspark/sql/dataframe.py:
##
@@ -862,8 +894,18 @@ def take(self, num: int) -> List[Row]:
 
 Examples
 
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
+... (16, "Bob")], ["age", "name"])
+>>> df.show()
++---+-+
+|age| name|
++---+-+
+| 14|  Tom|
+| 23|Alice|
+| 16|  Bob|
++---+-+
 >>> df.take(2)

Review Comment:
   ditto






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37444: [SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1)

2022-08-23 Thread GitBox


HyukjinKwon commented on code in PR #37444:
URL: https://github.com/apache/spark/pull/37444#discussion_r953197013


##
python/pyspark/sql/dataframe.py:
##
@@ -571,29 +580,42 @@ def show(self, n: int = 20, truncate: Union[bool, int] = 
True, vertical: bool =
 
 Examples
 
->>> df
-DataFrame[age: int, name: string]
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
+... (16, "Bob")], ["age", "name"])
 >>> df.show()
 +---+-+
 |age| name|
 +---+-+
-|  2|Alice|
-|  5|  Bob|
+| 14|  Tom|
+| 23|Alice|
+| 16|  Bob|
 +---+-+
+>>> df.show(2)

Review Comment:
   Let's add some description for each example.
   e.g.)
   ```
   Show only top 2 rows.
   
   >>> df.show(2)
   +---+-+
   |age| name|
   ```
   






[GitHub] [spark] kazuyukitanimura commented on pull request #37615: [SPARK-40124][SQL][TEST][3.3] Update TPCDS v1.4 q32 for Plan Stability tests

2022-08-23 Thread GitBox


kazuyukitanimura commented on PR #37615:
URL: https://github.com/apache/spark/pull/37615#issuecomment-1225001567

   cc @dongjoon-hyun @viirya @huaxingao @sunchao @parthchandra @wangyum @Swinky 





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37444: [SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1)

2022-08-23 Thread GitBox


HyukjinKwon commented on code in PR #37444:
URL: https://github.com/apache/spark/pull/37444#discussion_r953196728


##
python/pyspark/sql/dataframe.py:
##
@@ -363,9 +363,18 @@ def schema(self) -> StructType:
 
 Examples
 
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
+... (16, "Bob")], ["age", "name"])
+>>> df.show()
++---+-+
+|age| name|
++---+-+
+| 14|  Tom|
+| 23|Alice|
+| 16|  Bob|
++---+-+
 >>> df.schema

Review Comment:
   Let's describe the example like:
   
   ```
   >>> df = spark.createDataFrame(
   ... [(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
   
   Retrieve the schema of the current DataFrame.
   
   >>> df.schema
   ```



##
python/pyspark/sql/dataframe.py:
##
@@ -571,29 +580,42 @@ def show(self, n: int = 20, truncate: Union[bool, int] = 
True, vertical: bool =
 
 Examples
 
->>> df
-DataFrame[age: int, name: string]
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
+... (16, "Bob")], ["age", "name"])

Review Comment:
   ditto for indentation






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37444: [SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1)

2022-08-23 Thread GitBox


HyukjinKwon commented on code in PR #37444:
URL: https://github.com/apache/spark/pull/37444#discussion_r953195936


##
python/pyspark/sql/dataframe.py:
##
@@ -363,9 +363,18 @@ def schema(self) -> StructType:
 
 Examples
 
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
+... (16, "Bob")], ["age", "name"])

Review Comment:
   ```suggestion
   >>> df = spark.createDataFrame(
   ... [(14, "Tom"), (23, "Alice"),(16, "Bob")], ["age", "name"])
   ```






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37444: [SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1)

2022-08-23 Thread GitBox


HyukjinKwon commented on code in PR #37444:
URL: https://github.com/apache/spark/pull/37444#discussion_r953195599


##
python/pyspark/sql/dataframe.py:
##
@@ -363,9 +363,18 @@ def schema(self) -> StructType:
 
 Examples

Review Comment:
   Can we add a `Returns` section?






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37444: [SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1)

2022-08-23 Thread GitBox


HyukjinKwon commented on code in PR #37444:
URL: https://github.com/apache/spark/pull/37444#discussion_r953195766


##
python/pyspark/sql/dataframe.py:
##
@@ -363,9 +363,18 @@ def schema(self) -> StructType:
 
 Examples
 
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
+... (16, "Bob")], ["age", "name"])
+>>> df.show()
++---+-+
+|age| name|
++---+-+
+| 14|  Tom|
+| 23|Alice|
+| 16|  Bob|
++---+-+
 >>> df.schema
-StructType([StructField('age', IntegerType(), True),
-StructField('name', StringType(), True)])
+StructType([StructField('age', LongType(), True), StructField('name', 
StringType(), True)])

Review Comment:
   The change here seems unnecessary.






[GitHub] [spark] HyukjinKwon commented on pull request #37444: [SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1)

2022-08-23 Thread GitBox


HyukjinKwon commented on PR #37444:
URL: https://github.com/apache/spark/pull/37444#issuecomment-1224999813

   I think you can reuse the same JIRA, and make a followup.





[GitHub] [spark] dongjoon-hyun opened a new pull request, #37633: [SPARK-40198][CORE] Enable `spark.storage.decommission.(rdd|shuffle)Blocks.enabled` by default

2022-08-23 Thread GitBox


dongjoon-hyun opened a new pull request, #37633:
URL: https://github.com/apache/spark/pull/37633

   …
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] vitaliili-db commented on pull request #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

2022-08-23 Thread GitBox


vitaliili-db commented on PR #37631:
URL: https://github.com/apache/spark/pull/37631#issuecomment-1224996770

   @cloud-fan 





[GitHub] [spark] vitaliili-db commented on pull request #37632: [SPARK-40197][SQL] Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR

2022-08-23 Thread GitBox


vitaliili-db commented on PR #37632:
URL: https://github.com/apache/spark/pull/37632#issuecomment-1224996642

   @gengliangwang @MaxGekk 





[GitHub] [spark] kazuyukitanimura commented on pull request #37615: [SPARK-40124][SQL][TEST][3.3] Update TPCDS v1.4 q32 for Plan Stability tests

2022-08-23 Thread GitBox


kazuyukitanimura commented on PR #37615:
URL: https://github.com/apache/spark/pull/37615#issuecomment-1224996584

   > Seems the GitHub action to generate benchmark files was added in Spark 3.4 
only. Any idea on how to generate those files for Spark 3.3 as well?
   
   @mskapilks The TPCDS benchmarks had a bug before Spark 3.4, so the results are incorrect. I don't think you need benchmark results for this backport, IMO.





[GitHub] [spark] vitaliili-db opened a new pull request, #37632: [SPARK-40197][SQL] Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR

2022-08-23 Thread GitBox


vitaliili-db opened a new pull request, #37632:
URL: https://github.com/apache/spark/pull/37632

   
   
   ### What changes were proposed in this pull request?
   
   Replace query plan with context for `MULTI_VALUE_SUBQUERY_ERROR`
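
   A sketch of a query that hits this error (assuming it is raised when a
   scalar subquery yields more than one row):

   ```python
   spark.sql("SELECT (SELECT id FROM range(2)) AS v").collect()
   # Fails with MULTI_VALUE_SUBQUERY_ERROR; the message should now carry the
   # offending query context instead of a dump of the whole plan.
   ```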
   
   ### Why are the changes needed?
   
   User-friendly errors.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Unit test





[GitHub] [spark] dongjoon-hyun commented on pull request #37627: [SPARK-40191][PYTHON][CORE][DOCS] Make pyspark.resource examples self-contained

2022-08-23 Thread GitBox


dongjoon-hyun commented on PR #37627:
URL: https://github.com/apache/spark/pull/37627#issuecomment-1224994825

   Thank you, @HyukjinKwon , @zero323 , @tgravescs .





[GitHub] [spark] dongjoon-hyun closed pull request #37627: [SPARK-40191][PYTHON][CORE][DOCS] Make pyspark.resource examples self-contained

2022-08-23 Thread GitBox


dongjoon-hyun closed pull request #37627: [SPARK-40191][PYTHON][CORE][DOCS] 
Make pyspark.resource examples self-contained
URL: https://github.com/apache/spark/pull/37627





[GitHub] [spark] romainr commented on pull request #37621: [SPARK-40185][SQL] Remove column suggestion when the candidate list is empty

2022-08-23 Thread GitBox


romainr commented on PR #37621:
URL: https://github.com/apache/spark/pull/37621#issuecomment-1224971894

   +1 for the capping and for avoiding suggesting an empty list!





[GitHub] [spark] xinrong-meng commented on a diff in pull request #37560: [SPARK-40130][PYTHON] Support NumPy scalars in built-in functions

2022-08-23 Thread GitBox


xinrong-meng commented on code in PR #37560:
URL: https://github.com/apache/spark/pull/37560#discussion_r953173437


##
python/pyspark/sql/tests/test_functions.py:
##
@@ -959,9 +961,29 @@ def test_lit_day_time_interval(self):
 actual = self.spark.range(1).select(lit(td)).first()[0]
 self.assertEqual(actual, td)
 
+@unittest.skipIf(not have_numpy, "NumPy not installed")
+def test_np_scalar_input(self):
+import numpy as np
+from pyspark.sql.functions import array_contains, array_position
+
+df = self.spark.createDataFrame([([1, 2, 3],), ([],)], ["data"])
+for dtype in [np.int8, np.int16, np.int32, np.int64]:
+self.assertEqual(df.select(lit(dtype(1))).dtypes, [("1", "int")])

Review Comment:
   Created https://issues.apache.org/jira/browse/SPARK-40196 as a follow-up.






[GitHub] [spark] xinrong-meng commented on a diff in pull request #37560: [SPARK-40130][PYTHON] Support NumPy scalars in built-in functions

2022-08-23 Thread GitBox


xinrong-meng commented on code in PR #37560:
URL: https://github.com/apache/spark/pull/37560#discussion_r953163031


##
python/pyspark/sql/tests/test_functions.py:
##
@@ -959,9 +961,29 @@ def test_lit_day_time_interval(self):
 actual = self.spark.range(1).select(lit(td)).first()[0]
 self.assertEqual(actual, td)
 
+@unittest.skipIf(not have_numpy, "NumPy not installed")
+def test_np_scalar_input(self):
+import numpy as np
+from pyspark.sql.functions import array_contains, array_position
+
+df = self.spark.createDataFrame([([1, 2, 3],), ([],)], ["data"])
+for dtype in [np.int8, np.int16, np.int32, np.int64]:
+self.assertEqual(df.select(lit(dtype(1))).dtypes, [("1", "int")])
+res = df.select(array_contains(df.data, 
dtype(1)).alias("b")).collect()
+self.assertEqual([Row(b=True), Row(b=False)], res)
+res = df.select(array_position(df.data, 
dtype(1)).alias("c")).collect()
+self.assertEqual([Row(c=1), Row(c=0)], res)

Review Comment:
   Tests added for `int -> bigint`.






[GitHub] [spark] xinrong-meng commented on a diff in pull request #37560: [SPARK-40130][PYTHON] Support NumPy scalars in built-in functions

2022-08-23 Thread GitBox


xinrong-meng commented on code in PR #37560:
URL: https://github.com/apache/spark/pull/37560#discussion_r952882996


##
python/pyspark/sql/tests/test_functions.py:
##
@@ -959,9 +961,29 @@ def test_lit_day_time_interval(self):
 actual = self.spark.range(1).select(lit(td)).first()[0]
 self.assertEqual(actual, td)
 
+@unittest.skipIf(not have_numpy, "NumPy not installed")
+def test_np_scalar_input(self):
+import numpy as np
+from pyspark.sql.functions import array_contains, array_position
+
+df = self.spark.createDataFrame([([1, 2, 3],), ([],)], ["data"])
+for dtype in [np.int8, np.int16, np.int32, np.int64]:
+self.assertEqual(df.select(lit(dtype(1))).dtypes, [("1", "int")])

Review Comment:
   The data mapping between NumPy dtype and PySpark dtypes when 
`spark.createDataFrame` looks as below:
   ```
   NumPy dtype: ["int8", "int16", "int32", "int64", "float32", "float64"]
   
   PySpark dtypes: ["tinyint", "smallint", "int", "bigint", "float", "double"]
   ```
   
   Unfortunately, we lost part of the precision in Py4J. I may file a follow-up PR to at least adjust that for the `lit` function.
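
   Spelled out as a sketch (assumes the NumPy input support discussed in this
   PR and an active SparkSession `spark`):

   ```python
   import numpy as np

   np_to_spark = {
       "int8": "tinyint", "int16": "smallint", "int32": "int",
       "int64": "bigint", "float32": "float", "float64": "double",
   }
   for np_t, spark_t in np_to_spark.items():
       v = np.array([1], dtype=np_t)[0]           # a NumPy scalar of that dtype
       df = spark.createDataFrame([(v,)], ["v"])
       assert df.dtypes == [("v", spark_t)], (np_t, df.dtypes)
   ```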






[GitHub] [spark] xinrong-meng commented on a diff in pull request #37560: [SPARK-40130][PYTHON] Support NumPy scalars in built-in functions

2022-08-23 Thread GitBox


xinrong-meng commented on code in PR #37560:
URL: https://github.com/apache/spark/pull/37560#discussion_r953162738


##
python/pyspark/sql/types.py:
##
@@ -2256,11 +2256,33 @@ def convert(self, obj: datetime.timedelta, 
gateway_client: GatewayClient) -> Jav
 )
 
 
+class NumpyScalarConverter:
+@property

Review Comment:
   Sounds good! Defined `has_numpy` in `pyspark.sql.utils`.
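
   Presumably something of this shape (a sketch; the actual definition lives in
   `pyspark.sql.utils`):

   ```python
   try:
       import numpy  # noqa: F401
       has_numpy = True
   except ImportError:
       has_numpy = False
   ```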






[GitHub] [spark] vitaliili-db opened a new pull request, #37631: [SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string.

2022-08-23 Thread GitBox


vitaliili-db opened a new pull request, #37631:
URL: https://github.com/apache/spark/pull/37631

   
   
   
   ### What changes were proposed in this pull request?
   
   Special-case the `split` function when the `regex` parameter is empty. The result still splits the input string but avoids an empty trailing string, i.e. `split('hello', '')` produces `['h', 'e', 'l', 'l', 'o']` instead of `['h', 'e', 'l', 'l', 'o', '']`. The old behavior is preserved when the `limit` parameter is explicitly set to a negative value: `split('hello', '', -1)`.
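
   A sketch of the proposed semantics (outputs follow the description above,
   not a run against this branch; assumes an active SparkSession `spark`):

   ```python
   spark.sql("SELECT split('hello', '')").first()[0]
   # ['h', 'e', 'l', 'l', 'o']        -- trailing empty string now dropped
   spark.sql("SELECT split('hello', '', -1)").first()[0]
   # ['h', 'e', 'l', 'l', 'o', '']    -- old behavior, kept for negative limit
   ```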
   
   ### Why are the changes needed?
   
   This is nice to have and matches intuitively expected behavior. 
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. 
   - The `split` function output changes when an empty `regex` parameter is provided and the `limit` parameter is not specified or set to 0. In this case a trailing empty string is ignored.
   
   ### How was this patch tested?
   
   Unit tests.





[GitHub] [spark] kazuyukitanimura commented on a diff in pull request #37544: [SPARK-40110][SQL][TESTS] Add JDBCWithAQESuite

2022-08-23 Thread GitBox


kazuyukitanimura commented on code in PR #37544:
URL: https://github.com/apache/spark/pull/37544#discussion_r953154117


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:
##
@@ -44,7 +45,8 @@ import org.apache.spark.sql.test.SharedSparkSession
 import org.apache.spark.sql.types._
 import org.apache.spark.util.Utils
 
-class JDBCSuite extends QueryTest with SharedSparkSession {

Review Comment:
   I see your point. I revisited this, and yes AQE is enabled. However, AQE was 
not actually applied in the execution plan as `shouldApplyAQE()` returns 
`false`. In that sense, this PR description should have been `AQE applied` 
instead of `AQE enabled`.
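
   One quick way to check whether AQE was actually applied, as a sketch
   (assumes an active SparkSession `spark`):

   ```python
   from pyspark.sql.functions import col

   spark.conf.set("spark.sql.adaptive.enabled", "true")
   df = spark.range(10).groupBy((col("id") % 2).alias("k")).count()
   df.explain()  # an AdaptiveSparkPlan root node means AQE was applied
   ```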








[GitHub] [spark] vitaliili-db commented on pull request #37621: [SPARK-40185][SQL] Remove column suggestion when the candidate list is empty

2022-08-23 Thread GitBox


vitaliili-db commented on PR #37621:
URL: https://github.com/apache/spark/pull/37621#issuecomment-1224902237

   @MaxGekk @gengliangwang ptal.





[GitHub] [spark] yangwwei commented on pull request #37622: [SPARK-40187][DOCS] Add `Apache YuniKorn` scheduler docs

2022-08-23 Thread GitBox


yangwwei commented on PR #37622:
URL: https://github.com/apache/spark/pull/37622#issuecomment-1224867456

   > It seems that I wasn't clear enough to you. We need a specific version number, @yangwwei. -1 for adding a doc without a version number.
   > 
   > > https://yunikorn.apache.org/docs/get_started/core_features.
   
   You are absolutely clear. Sure, let me find out how to get this supported with our documentation framework. Thanks





[GitHub] [spark] ljfgem commented on a diff in pull request #35636: [SPARK-31357][SQL][WIP] Catalog API for view metadata

2022-08-23 Thread GitBox


ljfgem commented on code in PR #35636:
URL: https://github.com/apache/spark/pull/35636#discussion_r953092252


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -447,6 +454,74 @@ class Analyzer(override val catalogManager: CatalogManager)
 }
   }
 
+  /**
+   * Substitute persisted views in parsed plans with parsed view sql text.
+   */
+  case class ViewSubstitution(sqlParser: ParserInterface) extends 
Rule[LogicalPlan] {
+
+def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
+  case u @ UnresolvedRelation(nameParts, _, _) if 
v1SessionCatalog.isTempView(nameParts) =>
+u
+  case u @ UnresolvedRelation(
+  parts @ NonSessionCatalogAndIdentifier(catalog, ident), _, _) if 
!isSQLOnFile(parts) =>
+CatalogV2Util.loadView(catalog, ident)
+.map(createViewRelation(parts.quoted, _))
+.getOrElse(u)
+}
+
+private def isSQLOnFile(parts: Seq[String]): Boolean = parts match {
+  case Seq(_, path) if path.contains("/") => true
+  case _ => false
+}
+
+private def createViewRelation(name: String, view: V2View): LogicalPlan = {
+  if (!catalogManager.isCatalogRegistered(view.currentCatalog)) {
+throw new AnalysisException(
+  s"Invalid current catalog '${view.currentCatalog}' in view '$name'")
+  }
+
+  val child = parseViewText(name, view.sql)
+  val desc = V2ViewDescription(name, view)
+  val qualifiedChild = desc.viewCatalogAndNamespace match {
+case Seq() =>
+  // Views from Spark 2.2 or prior do not store catalog or namespace,
+  // however its sql text should already be fully qualified.
+  child
+case catalogAndNamespace =>
+  // Substitute CTEs within the view before qualifying table 
identifiers
+  qualifyTableIdentifiers(CTESubstitution.apply(child), 
catalogAndNamespace)
+  }
+
+  // The relation is a view, so we wrap the relation by:
+  // 1. Add a [[View]] operator over the relation to keep track of the 
view desc;
+  // 2. Wrap the logical plan in a [[SubqueryAlias]] which tracks the name 
of the view.
+  SubqueryAlias(name, View(desc, false, qualifiedChild))

Review Comment:
   Hi @jzhuge, I found that with the original Hive session catalog, the 
resolved plan of the view `select * from default.t`  is like:
   ```
   'SubqueryAlias spark_catalog.default.test_view
   +- View (`default`.`test_view`, ['intCol,'structCol,'boolCol])
  +- 'Project [upcast(getviewcolumnbynameandordinal(`default`.`test_view`, 
intCol, 0, 1), IntegerType) AS intCol#6, 
upcast(getviewcolumnbynameandordinal(`default`.`test_view`, structCol, 0, 1), 
StructField(doubleCol,DoubleType,true), StructField(stringCol,StringType,true)) 
AS structCol#7, upcast(getviewcolumnbynameandordinal(`default`.`test_view`, 
boolCol, 0, 1), BooleanType) AS boolCol#8]
 +- 'Project [*]
+- 'UnresolvedRelation [default, t], [], false
   ```
   Looks like the project node:
   ```
  +- 'Project [upcast(getviewcolumnbynameandordinal(`default`.`test_view`, 
intCol, 0, 1), IntegerType) AS intCol#6, 
upcast(getviewcolumnbynameandordinal(`default`.`test_view`, structCol, 0, 1), 
StructField(doubleCol,DoubleType,true), StructField(stringCol,StringType,true)) 
AS structCol#7, upcast(getviewcolumnbynameandordinal(`default`.`test_view`, 
boolCol, 0, 1), BooleanType) AS boolCol#8]
   ```
   is added based on the schema stored in the view's desc (`CatalogTable`), so that the schema info (casing and nullability) is preserved.
   
   But in the current implementation, there is no such project node based on the view's schema. Do you think we need to add a similar project node if the view's schema is provided?
   






[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37622: [SPARK-40187][DOCS] Add `Apache YuniKorn` scheduler docs

2022-08-23 Thread GitBox


dongjoon-hyun commented on code in PR #37622:
URL: https://github.com/apache/spark/pull/37622#discussion_r953077805


##
docs/running-on-kubernetes.md:
##
@@ -1811,6 +1811,50 @@ spec:
   queue: default
 ```
 
+ Using Apache YuniKorn as Customized Scheduler for Spark on Kubernetes
+
+[Apache YuniKorn](https://yunikorn.apache.org/) is a resource scheduler for 
Kubernetes that provides advanced batch scheduling
+capabilities, such as job queuing, resource fairness, min/max queue capacity 
and flexible job ordering policies.
+For available Apache YuniKorn features, please refer to [this 
doc](https://yunikorn.apache.org/docs/next/get_started/core_features).
+
+# Prerequisites
+
+Install Apache YuniKorn:
+
+```bash
+helm repo add yunikorn https://apache.github.io/yunikorn-release
+helm repo update
+kubectl create namespace yunikorn
+helm install yunikorn yunikorn/yunikorn --namespace yunikorn
+```
+
+the above steps will install the latest version of YuniKorn on an existing 
Kubernetes cluster.
+
+# Get started
+
+Submit Spark jobs with the following extra options:
+
+```bash
+--conf spark.kubernetes.scheduler.name=yunikorn
+--conf spark.kubernetes.driver.annotation.yunikorn.apache.org/app-id={{APP_ID}}
+--conf 
spark.kubernetes.executor.annotation.yunikorn.apache.org/app-id={{APP_ID}}
+```
+
+Note, `{{APP_ID}}` is the builtin variable that will be substituted with Spark 
job ID automatically.
+With the above configuration, the job will be scheduled by YuniKorn scheduler 
instead of the default Kubernetes scheduler.
+
+# Work with YuniKorn queues

Review Comment:
   If this is all you can add, we had better remove this section, `Work with YuniKorn queues`, because there is nothing for the user to do.






[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37622: [SPARK-40187][DOCS] Add `Apache YuniKorn` scheduler docs

2022-08-23 Thread GitBox


dongjoon-hyun commented on code in PR #37622:
URL: https://github.com/apache/spark/pull/37622#discussion_r953076970


##
docs/running-on-kubernetes.md:
##
@@ -1811,6 +1811,50 @@ spec:
   queue: default
 ```
 
+ Using Apache YuniKorn as Customized Scheduler for Spark on Kubernetes
+
+[Apache YuniKorn](https://yunikorn.apache.org/) is a resource scheduler for 
Kubernetes that provides advanced batch scheduling
+capabilities, such as job queuing, resource fairness, min/max queue capacity 
and flexible job ordering policies.
+For available Apache YuniKorn features, please refer to [this 
doc](https://yunikorn.apache.org/docs/next/get_started/core_features).
+
+# Prerequisites
+
+Install Apache YuniKorn:
+
+```bash
+helm repo add yunikorn https://apache.github.io/yunikorn-release
+helm repo update
+kubectl create namespace yunikorn
+helm install yunikorn yunikorn/yunikorn --namespace yunikorn

Review Comment:
   Thanks!






[GitHub] [spark] otterc commented on a diff in pull request #37624: [SPARK-40186][CORE][YARN] Ensure `mergedShuffleCleaner` have been shutdown before `db` close

2022-08-23 Thread GitBox


otterc commented on code in PR #37624:
URL: https://github.com/apache/spark/pull/37624#discussion_r953070286


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -795,13 +796,25 @@ public void registerExecutor(String appId, 
ExecutorShuffleInfo executorInfo) {
   }
 
   /**
-   * Close the DB during shutdown
+   * Shutdown mergedShuffleCleaner and close the DB during shutdown
*/
   @Override
   public void close() {
+if (!mergedShuffleCleaner.isShutdown()) {
+  try {
+mergedShuffleCleaner.shutdown();
+boolean isTermination = mergedShuffleCleaner.awaitTermination(10L, 
TimeUnit.SECONDS);

Review Comment:
   Nit -> `isTermination` -> `terminated`






[GitHub] [spark] xinrong-meng commented on pull request #37560: [SPARK-40130][PYTHON] Support NumPy scalars in built-in functions

2022-08-23 Thread GitBox


xinrong-meng commented on PR #37560:
URL: https://github.com/apache/spark/pull/37560#issuecomment-1224813767

   Rebased to the latest master to resolve code conflicts.





[GitHub] [spark] yangwwei commented on a diff in pull request #37622: [SPARK-40187][DOCS] Add `Apache YuniKorn` scheduler docs

2022-08-23 Thread GitBox


yangwwei commented on code in PR #37622:
URL: https://github.com/apache/spark/pull/37622#discussion_r953051876


##
docs/running-on-kubernetes.md:
##
@@ -1811,6 +1811,50 @@ spec:
   queue: default
 ```
 
+ Using Apache YuniKorn as Customized Scheduler for Spark on Kubernetes
+
+[Apache YuniKorn](https://yunikorn.apache.org/) is a resource scheduler for 
Kubernetes that provides advanced batch scheduling
+capabilities, such as job queuing, resource fairness, min/max queue capacity 
and flexible job ordering policies.
+For available Apache YuniKorn features, please refer to [this 
doc](https://yunikorn.apache.org/docs/next/get_started/core_features).
+
+# Prerequisites
+
+Install Apache YuniKorn:
+
+```bash
+helm repo add yunikorn https://apache.github.io/yunikorn-release
+helm repo update
+kubectl create namespace yunikorn
+helm install yunikorn yunikorn/yunikorn --namespace yunikorn

Review Comment:
   this page https://apache.github.io/yunikorn-release has been fixed.
   
   > It seems that YuniKorn helm chart doesn't have TEST SUITE.
   
   Yes, there is no test suite today in the helm charts.
   
   






[GitHub] [spark] Transurgeon commented on pull request #37444: [SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1)

2022-08-23 Thread GitBox


Transurgeon commented on PR #37444:
URL: https://github.com/apache/spark/pull/37444#issuecomment-1224746488

   @HyukjinKwon I made some additional changes. I think we can start by merging this PR, and then I will open another one for the rest of the changes.
   
   I have a list of all the functions I changed in this PR; should I add it to the JIRA ticket to avoid duplicate changes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Transurgeon commented on a diff in pull request #37444: [SPARK-40012][PYTHON][DOCS][WIP] Make pyspark.sql.dataframe examples self-contained

2022-08-23 Thread GitBox


Transurgeon commented on code in PR #37444:
URL: https://github.com/apache/spark/pull/37444#discussion_r953047129


##
python/pyspark/sql/dataframe.py:
##
@@ -3356,11 +3476,29 @@ def drop(self, *cols: "ColumnOrName") -> "DataFrame":  # type: ignore[misc]
 
 Examples
 
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
+>>> df.show()
++---+-+
+|age| name|
++---+-+
+| 14|  Tom|
+| 23|Alice|
+| 16|  Bob|
++---+-+
 >>> df.drop('age').collect()
-[Row(name='Alice'), Row(name='Bob')]
+[Row(name='Tom'), Row(name='Alice'), Row(name='Bob')]
 
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
+>>> df.show()
++---+-+
+|age| name|
++---+-+
+| 14|  Tom|
+| 23|Alice|
+| 16|  Bob|
++---+-+
 >>> df.drop(df.age).collect()
-[Row(name='Alice'), Row(name='Bob')]
+[Row(name='Tom'), Row(name='Alice'), Row(name='Bob')]
 
 >>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).collect()

Review Comment:
   I am not sure what these 3 inner joins do exactly. I don't see an instantiation of `df2` anywhere.
   
   What should I do with these 3 examples?
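   For what it's worth, one way to make the example self-contained would be to instantiate a hypothetical `df2` next to `df` (a sketch only; the `height` column and expected rows are assumptions, not from the PR):
   
   ```python
   >>> df2 = spark.createDataFrame([(80, "Tom"), (85, "Bob")], ["height", "name"])
   >>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).sort("name").collect()
   [Row(age=16, height=85, name='Bob'), Row(age=14, height=80, name='Tom')]
   ```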



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on a diff in pull request #33538: [SPARK-36107][SQL] Refactor first set of 20 query execution errors to use error classes

2022-08-23 Thread GitBox


MaxGekk commented on code in PR #33538:
URL: https://github.com/apache/spark/pull/33538#discussion_r953021255


##
core/src/main/resources/error/error-classes.json:
##
@@ -3,6 +3,31 @@
 "message" : [ "Field name %s is ambiguous and has %s matching fields in 
the struct." ],
 "sqlState" : "42000"
   },
+  "CANNOT_CAST_DATATYPE" : {

Review Comment:
   Any ideas how to trigger this error class? I haven't found any way to trigger it from user code. WDYT? @cloud-fan @HyukjinKwon



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37629: [SPARK-40160][PYTHON][DOCS] Make pyspark.broadcast examples self-contained

2022-08-23 Thread GitBox


HyukjinKwon commented on code in PR #37629:
URL: https://github.com/apache/spark/pull/37629#discussion_r952967676


##
python/pyspark/broadcast.py:
##
@@ -149,6 +147,18 @@ def __init__(
 self._path = path
 
 def dump(self, value: T, f: BinaryIO) -> None:
+"""
+Write a pickled representation of value to the open file or socket.
+The protocol pickle is HIGHEST_PROTOCOL.
+
+Parameters
+--
+value : T
+Value to write.
+
+f : :class:`BinaryIO`
+File or socket where the pickled value will be stored.
+"""

Review Comment:
   Could we maybe add an example with `dump` and `load`, e.g. with `tempfile.TemporaryDirectory`?
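   Something along these lines, perhaps (a rough sketch; assumes `load_from_path` is available on `Broadcast` and an active `spark` session):
   
   ```python
   >>> import os
   >>> import tempfile
   >>> b = spark.sparkContext.broadcast([1, 2, 3])
   >>> with tempfile.TemporaryDirectory() as d:
   ...     path = os.path.join(d, "test.dump")
   ...     # Pickle the broadcast value to a file, then read it back.
   ...     with open(path, "wb") as f:
   ...         b.dump(b.value, f)
   ...     b.load_from_path(path)
   [1, 2, 3]
   ```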



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #37610: [SPARK-38888][BUILD][CORE][YARN][DOCS] Add `RocksDB` support for shuffle state store

2022-08-23 Thread GitBox


LuciferYang commented on code in PR #37610:
URL: https://github.com/apache/spark/pull/37610#discussion_r952955164


##
docs/spark-standalone.md:
##
@@ -328,6 +328,17 @@ SPARK_WORKER_OPTS supports the following system properties:
   
   3.0.0
 
+
+  spark.shuffle.service.db.enabled
+  LEVELDB
+  
+When spark.shuffle.service.db.enabled is true, user can use this to specify the kind of disk-based
+store used in shuffle state store. This supports `LEVELDB` and `ROCKSDB` now and `LEVELDB` as default value.
+This only affects standalone mode (yarn always has this behavior enabled).
+The original data store in `LevelDB/RocksDB` will not be automatically convert to another kind of storage now.
+  
+  3.4.0
+

Review Comment:
   Yes, you are right: both standalone and yarn are OK, and `spark.shuffle.service.db.enabled` can be used together with it.
   
   
https://github.com/apache/spark/blob/a12d121159d0ab8293f70b819cb489cf6126224d/docs/spark-standalone.md?plain=1#L321-L328
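   
   For example, a standalone worker could turn this on via `SPARK_WORKER_OPTS` (sketch only, using the property name from the quoted table):
   
   ```bash
   # spark-env.sh sketch: persist shuffle state for the external shuffle service.
   SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Dspark.shuffle.service.enabled=true -Dspark.shuffle.service.db.enabled=true"
   ```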



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yangwwei commented on a diff in pull request #37622: [SPARK-40187][DOCS] Add `Apache YuniKorn` scheduler docs

2022-08-23 Thread GitBox


yangwwei commented on code in PR #37622:
URL: https://github.com/apache/spark/pull/37622#discussion_r952964187


##
docs/running-on-kubernetes.md:
##
@@ -1811,6 +1811,50 @@ spec:
   queue: default
 ```
 
+ Using Apache YuniKorn as Customized Scheduler for Spark on Kubernetes
+
+[Apache YuniKorn](https://yunikorn.apache.org/) is a resource scheduler for Kubernetes that provides advanced batch scheduling
+capabilities, such as job queuing, resource fairness, min/max queue capacity and flexible job ordering policies.
+For available Apache YuniKorn features, please refer to [this doc](https://yunikorn.apache.org/docs/next/get_started/core_features).
+
+# Prerequisites
+
+Install Apache YuniKorn:
+
+```bash
+helm repo add yunikorn https://apache.github.io/yunikorn-release
+helm repo update
+kubectl create namespace yunikorn
+helm install yunikorn yunikorn/yunikorn --namespace yunikorn
+```
+
+the above steps will install the latest version of YuniKorn on an existing Kubernetes cluster.
+
+# Get started
+
+Submit Spark jobs with the following extra options:
+
+```bash
+--conf spark.kubernetes.scheduler.name=yunikorn
+--conf spark.kubernetes.driver.annotation.yunikorn.apache.org/app-id={{APP_ID}}
+--conf spark.kubernetes.executor.annotation.yunikorn.apache.org/app-id={{APP_ID}}
+```
+
+Note, `{{APP_ID}}` is the builtin variable that will be substituted with Spark job ID automatically.
+With the above configuration, the job will be scheduled by YuniKorn scheduler instead of the default Kubernetes scheduler.
+
+# Work with YuniKorn queues

Review Comment:
   Could you please share what kind of info I can add to make this more complete?
   I am trying to find a balance here: I do not want to make the doc tedious and hard to read, but I still want to make sure enough info is provided.
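   
   One concrete addition might be a snippet on routing a job to a specific queue, e.g. (a sketch assuming YuniKorn honors the `queue` pod label; `root.default` is a placeholder, not from this PR):
   
   ```bash
   # Hypothetical: pin driver and executor pods to the root.default queue.
   --conf spark.kubernetes.driver.label.queue=root.default
   --conf spark.kubernetes.executor.label.queue=root.default
   ```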



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


