[GitHub] [spark] SparkQA removed a comment on pull request #32031: [WIP] Initial work of Remote Shuffle Service on Kubernetes

2021-05-10 Thread GitBox


SparkQA removed a comment on pull request #32031:
URL: https://github.com/apache/spark/pull/32031#issuecomment-837717297


   **[Test build #138353 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138353/testReport)**
 for PR 32031 at commit 
[`48a0be7`](https://github.com/apache/spark/commit/48a0be7b58a1a9d5c556b35359427fb0cd903b13).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #32031: [WIP] Initial work of Remote Shuffle Service on Kubernetes

2021-05-10 Thread GitBox


SparkQA commented on pull request #32031:
URL: https://github.com/apache/spark/pull/32031#issuecomment-837873050


   **[Test build #138353 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138353/testReport)**
 for PR 32031 at commit 
[`48a0be7`](https://github.com/apache/spark/commit/48a0be7b58a1a9d5c556b35359427fb0cd903b13).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA removed a comment on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-05-10 Thread GitBox


SparkQA removed a comment on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-837593393


   **[Test build #138348 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138348/testReport)**
 for PR 32303 at commit 
[`62d8a97`](https://github.com/apache/spark/commit/62d8a9756239ba92129ce6a20d037aa00b0707f1).





[GitHub] [spark] SparkQA commented on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-05-10 Thread GitBox


SparkQA commented on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-837870821


   **[Test build #138348 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138348/testReport)**
 for PR 32303 at commit 
[`62d8a97`](https://github.com/apache/spark/commit/62d8a9756239ba92129ce6a20d037aa00b0707f1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] beliefer commented on pull request #32464: [SPARK-35062][SQL] Group exception messages in sql/streaming

2021-05-10 Thread GitBox


beliefer commented on pull request #32464:
URL: https://github.com/apache/spark/pull/32464#issuecomment-837864187


   retest this please





[GitHub] [spark] SparkQA removed a comment on pull request #32464: [SPARK-35062][SQL] Group exception messages in sql/streaming

2021-05-10 Thread GitBox


SparkQA removed a comment on pull request #32464:
URL: https://github.com/apache/spark/pull/32464#issuecomment-837716519


   **[Test build #138352 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138352/testReport)**
 for PR 32464 at commit 
[`4284458`](https://github.com/apache/spark/commit/42844586a5e53559187aa98797bd82b5c1f9601d).





[GitHub] [spark] SparkQA commented on pull request #32464: [SPARK-35062][SQL] Group exception messages in sql/streaming

2021-05-10 Thread GitBox


SparkQA commented on pull request #32464:
URL: https://github.com/apache/spark/pull/32464#issuecomment-837856847


   **[Test build #138352 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138352/testReport)**
 for PR 32464 at commit 
[`4284458`](https://github.com/apache/spark/commit/42844586a5e53559187aa98797bd82b5c1f9601d).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA removed a comment on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-10 Thread GitBox


SparkQA removed a comment on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-837593024


   **[Test build #138347 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138347/testReport)**
 for PR 32482 at commit 
[`f0c99db`](https://github.com/apache/spark/commit/f0c99db18079fc3bbf9ff4e31074b6774a8d6416).





[GitHub] [spark] SparkQA commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-10 Thread GitBox


SparkQA commented on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-837853557


   **[Test build #138347 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138347/testReport)**
 for PR 32482 at commit 
[`f0c99db`](https://github.com/apache/spark/commit/f0c99db18079fc3bbf9ff4e31074b6774a8d6416).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or

2021-05-10 Thread GitBox


SparkQA commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-837836414


   **[Test build #138359 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138359/testReport)**
 for PR 32399 at commit 
[`35872b7`](https://github.com/apache/spark/commit/35872b7b0bdc435fa93439aaa957a718f3d3f8f4).





[GitHub] [spark] SparkQA commented on pull request #32470: [WIP] Simplify ResolveAggregateFunctions

2021-05-10 Thread GitBox


SparkQA commented on pull request #32470:
URL: https://github.com/apache/spark/pull/32470#issuecomment-837836239


   **[Test build #138358 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138358/testReport)**
 for PR 32470 at commit 
[`c0bb807`](https://github.com/apache/spark/commit/c0bb8070cbb52f9a20da0c5e3e791db72ea4bf04).





[GitHub] [spark] SparkQA commented on pull request #32497: [SPARK-35366][SQL] Avoid using deprecated `buildForBatch` and `buildForStreaming`

2021-05-10 Thread GitBox


SparkQA commented on pull request #32497:
URL: https://github.com/apache/spark/pull/32497#issuecomment-837836024


   **[Test build #138357 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138357/testReport)**
 for PR 32497 at commit 
[`d078953`](https://github.com/apache/spark/commit/d078953d3b3b6e14b7f51ee3fd321cd892da02d5).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32457: [SPARK-35329][SQL] Split generated switch code into pieces in ExpandExec

2021-05-10 Thread GitBox


AmplabJenkins removed a comment on pull request #32457:
URL: https://github.com/apache/spark/pull/32457#issuecomment-837833240


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138346/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-10 Thread GitBox


AmplabJenkins removed a comment on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-837833239


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42877/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31986: [SPARK-34888][SS] Introduce UpdatingSessionIterator adjusting session window on elements

2021-05-10 Thread GitBox


AmplabJenkins removed a comment on pull request #31986:
URL: https://github.com/apache/spark/pull/31986#issuecomment-837833241


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42878/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still co

2021-05-10 Thread GitBox


AmplabJenkins removed a comment on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-837833238


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138350/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32457: [SPARK-35329][SQL] Split generated switch code into pieces in ExpandExec

2021-05-10 Thread GitBox


AmplabJenkins commented on pull request #32457:
URL: https://github.com/apache/spark/pull/32457#issuecomment-837833240


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138346/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue r

2021-05-10 Thread GitBox


AmplabJenkins commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-837833238


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138350/
   





[GitHub] [spark] AmplabJenkins commented on pull request #31986: [SPARK-34888][SS] Introduce UpdatingSessionIterator adjusting session window on elements

2021-05-10 Thread GitBox


AmplabJenkins commented on pull request #31986:
URL: https://github.com/apache/spark/pull/31986#issuecomment-837833241


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42878/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-10 Thread GitBox


AmplabJenkins commented on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-837833239


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42877/
   





[GitHub] [spark] gerashegalov commented on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

2021-05-10 Thread GitBox


gerashegalov commented on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-837830422


   @zhengruifeng can you provide a minimal code example that reproduces the
NPEs you are observing?





[GitHub] [spark] venkata91 commented on a change in pull request #30691: [SPARK-32920][SHUFFLE] Finalization of Shuffle push/merge with Push based shuffle and preparation step for the reduce stage

2021-05-10 Thread GitBox


venkata91 commented on a change in pull request #30691:
URL: https://github.com/apache/spark/pull/30691#discussion_r629848900



##
File path: .idea/vcs.xml
##
@@ -1,36 +0,0 @@
-

Review comment:
   Sorry, my bad. I think it got added as part of `[SPARK-35223] Add 
IssueNavigationLink`; I thought I had added it by mistake. Fixed it. It should 
be good now.







[GitHub] [spark] SparkQA commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-10 Thread GitBox


SparkQA commented on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-837823698









[GitHub] [spark] SparkQA commented on pull request #31986: [SPARK-34888][SS] Introduce UpdatingSessionIterator adjusting session window on elements

2021-05-10 Thread GitBox


SparkQA commented on pull request #31986:
URL: https://github.com/apache/spark/pull/31986#issuecomment-837818648









[GitHub] [spark] venkata91 commented on a change in pull request #30691: [SPARK-32920][SHUFFLE] Finalization of Shuffle push/merge with Push based shuffle and preparation step for the reduce stage

2021-05-10 Thread GitBox


venkata91 commented on a change in pull request #30691:
URL: https://github.com/apache/spark/pull/30691#discussion_r629845232



##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -1271,21 +1302,28 @@ private[spark] class DAGScheduler(
    * locations for block push/merge by getting the historical locations of past executors.
    */
   private def prepareShuffleServicesForShuffleMapStage(stage: ShuffleMapStage): Unit = {
-    // TODO(SPARK-32920) Handle stage reuse/retry cases separately as without finalize
-    // TODO changes we cannot disable shuffle merge for the retry/reuse cases
-    val mergerLocs = sc.schedulerBackend.getShufflePushMergerLocations(
-      stage.shuffleDep.partitioner.numPartitions, stage.resourceProfileId)
-
-    if (mergerLocs.nonEmpty) {
-      stage.shuffleDep.setMergerLocs(mergerLocs)
-      logInfo(s"Push-based shuffle enabled for $stage (${stage.name}) with" +
-        s" ${stage.shuffleDep.getMergerLocs.size} merger locations")
-
-      logDebug("List of shuffle push merger locations " +
-        s"${stage.shuffleDep.getMergerLocs.map(_.host).mkString(", ")}")
-    } else {
-      logInfo("No available merger locations." +
-        s" Push-based shuffle disabled for $stage (${stage.name})")
+    if (stage.shuffleDep.shuffleMergeEnabled && !stage.shuffleDep.shuffleMergeFinalized
+      && stage.shuffleDep.getMergerLocs.isEmpty) {
+      val mergerLocs = sc.schedulerBackend.getShufflePushMergerLocations(
+        stage.shuffleDep.partitioner.numPartitions, stage.resourceProfileId)
+      if (mergerLocs.nonEmpty) {
+        stage.shuffleDep.setMergerLocs(mergerLocs)
+        logInfo(s"Push-based shuffle enabled for $stage (${stage.name}) with" +
+          s" ${stage.shuffleDep.getMergerLocs.size} merger locations")
+
+        logDebug("List of shuffle push merger locations " +
+          s"${stage.shuffleDep.getMergerLocs.map(_.host).mkString(", ")}")
+      } else {
+        stage.shuffleDep.setShuffleMergeEnabled(false)
+        logInfo("Push-based shuffle disabled for $stage (${stage.name})")
+      }
+    } else if (stage.shuffleDep.shuffleMergeFinalized) {
+      // Disable Shuffle merge for the retry/reuse of the same shuffle dependency if it has
+      // already been merge finalized. If the shuffle dependency was previously assigned merger
+      // locations but the corresponding shuffle map stage did not complete successfully, we
+      // would still enable push for its retry.

Review comment:
   Yes, we are disabling merge in those cases since it is already finalized.
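
   The branching in the hunk above can be sketched as a small, self-contained model. This is only an illustration under assumptions: `ShuffleMergeSketch`, `DepState`, and `prepare` are hypothetical names, not Spark classes, and merger locations are reduced to plain host strings. The third branch reflects the code comment's statement that a retry of a non-finalized stage keeps push enabled.

```scala
// Hypothetical, simplified model of the state checked in the diff hunk.
// None of these types are Spark's; they only mirror the three values involved.
object ShuffleMergeSketch {
  final case class DepState(
      mergeEnabled: Boolean,
      mergeFinalized: Boolean,
      mergerLocs: Seq[String])

  /** Returns the dependency state after the "prepare" step, given the hosts on offer. */
  def prepare(dep: DepState, available: Seq[String]): DepState =
    if (dep.mergeEnabled && !dep.mergeFinalized && dep.mergerLocs.isEmpty) {
      // First attempt: try to obtain merger locations.
      if (available.nonEmpty) dep.copy(mergerLocs = available)
      else dep.copy(mergeEnabled = false) // nothing available, so disable merge
    } else if (dep.mergeFinalized) {
      // Retry/reuse of an already finalized shuffle: disable merge.
      dep.copy(mergeEnabled = false)
    } else {
      // Locations were assigned earlier but the stage did not finish: keep pushing.
      dep
    }
}
```

   For example, `prepare(DepState(true, true, Seq("host-a")), Seq("host-b"))` leaves the merger locations untouched and only turns `mergeEnabled` off, which is the retry case discussed in this comment.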







[GitHub] [spark] venkata91 commented on a change in pull request #30691: [SPARK-32920][SHUFFLE] Finalization of Shuffle push/merge with Push based shuffle and preparation step for the reduce stage

2021-05-10 Thread GitBox


venkata91 commented on a change in pull request #30691:
URL: https://github.com/apache/spark/pull/30691#discussion_r629844923



##
File path: .idea/vcs.xml
##
@@ -1,36 +0,0 @@
-

Review comment:
   Somehow my `.idea` file got added and pushed. I think I removed it, didn't 
I? Let me check again.







[GitHub] [spark] SparkQA removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue

2021-05-10 Thread GitBox


SparkQA removed a comment on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-837652969


   **[Test build #138350 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138350/testReport)**
 for PR 32399 at commit 
[`c6c677c`](https://github.com/apache/spark/commit/c6c677cfd9c6a7f9cc71b5870d2cf71c6483613f).





[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or

2021-05-10 Thread GitBox


SparkQA commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-837803879


   **[Test build #138350 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138350/testReport)**
 for PR 32399 at commit 
[`c6c677c`](https://github.com/apache/spark/commit/c6c677cfd9c6a7f9cc71b5870d2cf71c6483613f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] linhongliu-db opened a new pull request #32497: [SPARK-35366][SQL] Avoid using deprecated `buildForBatch` and `buildForStreaming`

2021-05-10 Thread GitBox


linhongliu-db opened a new pull request #32497:
URL: https://github.com/apache/spark/pull/32497


   ### What changes were proposed in this pull request?
   Currently, in DSv2, we are still using the deprecated `buildForBatch` and 
`buildForStreaming`.
   This PR implements the `build`, `toBatch`, `toStreaming` interfaces to 
replace the deprecated ones.
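
   As a rough sketch of the shape this change moves to, the snippet below uses stand-in traits that borrow the names from the description (`build`, `toBatch`, `toStreaming`). They are not Spark's own interfaces, and `ConsoleWriteBuilder` and `describe` are hypothetical, added only so the example runs on its own. The idea is that the builder returns a single logical write, and the caller then picks the batch or streaming form.

```scala
// Stand-in traits, NOT Spark's classes: they only mimic the shape of the
// interfaces named in the PR description so the snippet is self-contained.
object WriteBuilderSketch {
  trait BatchWrite { def describe: String }
  trait StreamingWrite { def describe: String }

  // A logical write that can be turned into a batch or a streaming write.
  trait Write {
    def toBatch: BatchWrite
    def toStreaming: StreamingWrite
  }

  // One build() entry point instead of separate batch/streaming builders.
  trait WriteBuilder {
    def build(): Write
  }

  // Hypothetical sink used only for illustration.
  final class ConsoleWriteBuilder(table: String) extends WriteBuilder {
    override def build(): Write = new Write {
      override def toBatch: BatchWrite = new BatchWrite {
        def describe: String = s"batch write to $table"
      }
      override def toStreaming: StreamingWrite = new StreamingWrite {
        def describe: String = s"streaming write to $table"
      }
    }
  }
}
```

   A caller would write `new ConsoleWriteBuilder("t").build().toBatch` for a batch job and `.toStreaming` for a streaming one.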
   
   
   ### Why are the changes needed?
   Code refactor
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Existing UT
   





[GitHub] [spark] SparkQA removed a comment on pull request #32457: [SPARK-35329][SQL] Split generated switch code into pieces in ExpandExec

2021-05-10 Thread GitBox


SparkQA removed a comment on pull request #32457:
URL: https://github.com/apache/spark/pull/32457#issuecomment-837543445


   **[Test build #138346 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138346/testReport)**
 for PR 32457 at commit 
[`b01bb8c`](https://github.com/apache/spark/commit/b01bb8ce8dec363541efd17c336b109c40706535).





[GitHub] [spark] SparkQA commented on pull request #32457: [SPARK-35329][SQL] Split generated switch code into pieces in ExpandExec

2021-05-10 Thread GitBox


SparkQA commented on pull request #32457:
URL: https://github.com/apache/spark/pull/32457#issuecomment-837793182


   **[Test build #138346 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138346/testReport)**
 for PR 32457 at commit 
[`b01bb8c`](https://github.com/apache/spark/commit/b01bb8ce8dec363541efd17c336b109c40706535).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA commented on pull request #31986: [SPARK-34888][SS] Introduce UpdatingSessionIterator adjusting session window on elements

2021-05-10 Thread GitBox


SparkQA commented on pull request #31986:
URL: https://github.com/apache/spark/pull/31986#issuecomment-837775575


   **[Test build #138356 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138356/testReport)**
 for PR 31986 at commit 
[`7764c72`](https://github.com/apache/spark/commit/7764c72a932aa058f9c864c8da8a5479c2be0c68).





[GitHub] [spark] SparkQA commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-10 Thread GitBox


SparkQA commented on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-837774793


   **[Test build #138355 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138355/testReport)**
 for PR 32482 at commit 
[`708bb0c`](https://github.com/apache/spark/commit/708bb0c78256256043b14ecfa7b62a6cb96fea6a).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32365: [SPARK-35228][SQL] Add expression ToPrettyString for keep consistent between hive/spark format in df.show and transform

2021-05-10 Thread GitBox


AmplabJenkins removed a comment on pull request #32365:
URL: https://github.com/apache/spark/pull/32365#issuecomment-837771490


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42876/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32495: [SPARK-35363][SQL] Refactor sort merge join code-gen be agnostic to join type

2021-05-10 Thread GitBox


AmplabJenkins removed a comment on pull request #32495:
URL: https://github.com/apache/spark/pull/32495#issuecomment-837771494


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138343/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32464: [SPARK-35062][SQL] Group exception messages in sql/streaming

2021-05-10 Thread GitBox


AmplabJenkins removed a comment on pull request #32464:
URL: https://github.com/apache/spark/pull/32464#issuecomment-837771487


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42875/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32031: [WIP] Initial work of Remote Shuffle Service on Kubernetes

2021-05-10 Thread GitBox


AmplabJenkins removed a comment on pull request #32031:
URL: https://github.com/apache/spark/pull/32031#issuecomment-837771488


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42874/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32494: [Minor][SPARK-35362][SQL]Update null count in the column stats for UNION operator stats estimation

2021-05-10 Thread GitBox


AmplabJenkins removed a comment on pull request #32494:
URL: https://github.com/apache/spark/pull/32494#issuecomment-837771486


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42873/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32365: [SPARK-35228][SQL] Add expression ToPrettyString for keep consistent between hive/spark format in df.show and transform

2021-05-10 Thread GitBox


AmplabJenkins commented on pull request #32365:
URL: https://github.com/apache/spark/pull/32365#issuecomment-837771490


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42876/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32495: [SPARK-35363][SQL] Refactor sort merge join code-gen be agnostic to join type

2021-05-10 Thread GitBox


AmplabJenkins commented on pull request #32495:
URL: https://github.com/apache/spark/pull/32495#issuecomment-837771494


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138343/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32464: [SPARK-35062][SQL] Group exception messages in sql/streaming

2021-05-10 Thread GitBox


AmplabJenkins commented on pull request #32464:
URL: https://github.com/apache/spark/pull/32464#issuecomment-837771487


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42875/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32031: [WIP] Initial work of Remote Shuffle Service on Kubernetes

2021-05-10 Thread GitBox


AmplabJenkins commented on pull request #32031:
URL: https://github.com/apache/spark/pull/32031#issuecomment-837771488


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42874/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32494: [Minor][SPARK-35362][SQL]Update null count in the column stats for UNION operator stats estimation

2021-05-10 Thread GitBox


AmplabJenkins commented on pull request #32494:
URL: https://github.com/apache/spark/pull/32494#issuecomment-837771486


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42873/
   





[GitHub] [spark] SparkQA commented on pull request #32365: [SPARK-35228][SQL] Add expression ToPrettyString for keep consistent between hive/spark format in df.show and transform

2021-05-10 Thread GitBox


SparkQA commented on pull request #32365:
URL: https://github.com/apache/spark/pull/32365#issuecomment-837768084









[GitHub] [spark] SparkQA commented on pull request #32464: [SPARK-35062][SQL] Group exception messages in sql/streaming

2021-05-10 Thread GitBox


SparkQA commented on pull request #32464:
URL: https://github.com/apache/spark/pull/32464#issuecomment-837765641


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42875/
   





[GitHub] [spark] HeartSaVioR commented on a change in pull request #31986: [SPARK-34888][SS] Introduce UpdatingSessionIterator adjusting session window on elements

2021-05-10 Thread GitBox


HeartSaVioR commented on a change in pull request #31986:
URL: https://github.com/apache/spark/pull/31986#discussion_r629830829



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/UpdatingSessionsExec.scala
##
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.aggregate
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, SortOrder}
+import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
+import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}
+
+/**
+ * This node updates the session window spec of each input rows via analyzing neighbor rows and
+ * determining rows belong to the same session window. The number of input rows remains the same.
+ * This node requires sort on input rows by group keys + the start time of session window.
+ *
+ * There are lots of overhead compared to [[MergingSessionsExec]]. Use [[MergingSessionsExec]]
+ * instead whenever possible. Use this node only when we cannot apply both calculations
+ * determining session windows and aggregating rows in session window altogether.
+ *
+ * Refer [[UpdatingSessionsIterator]] for more details.
+ */
+case class UpdatingSessionsExec(
+keyExpressions: Seq[Attribute],
+sessionExpression: Attribute,
+child: SparkPlan) extends UnaryExecNode {
+
+  private val groupingWithoutSessionExpression = keyExpressions.filterNot {
+p => p.semanticEquals(sessionExpression)
+  }
+  private val groupingWithoutSessionAttributes =
+groupingWithoutSessionExpression.map(_.toAttribute)
+
+  override protected def doExecute(): RDD[InternalRow] = {
+val inMemoryThreshold = sqlContext.conf.sessionWindowBufferInMemoryThreshold
+val spillThreshold = sqlContext.conf.sessionWindowBufferSpillThreshold
+
+child.execute().mapPartitions { iter =>
+  new UpdatingSessionsIterator(iter, keyExpressions, sessionExpression,
+child.output, inMemoryThreshold, spillThreshold)
+}
+  }
+
+  override def output: Seq[Attribute] = child.output
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  override def requiredChildDistribution: Seq[Distribution] = {
+if (groupingWithoutSessionExpression.isEmpty) {
+  AllTuples :: Nil
+} else {
+  ClusteredDistribution(groupingWithoutSessionExpression) :: Nil
+}
+  }
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] = {
+Seq((groupingWithoutSessionAttributes ++ Seq(sessionExpression))

Review comment:
   Ah, I remember the reason now. We can't safely assume the session expression
   is placed at the end of the grouping keys. Let me revert this.
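
   To illustrate the point about key position, here is a minimal sketch. It
   uses plain strings instead of Catalyst attributes, and `requiredOrdering`,
   `SessionOrderingSketch`, and the key names are hypothetical, not part of the
   PR. It shows why the session key is filtered out by equality and then
   appended, rather than assumed to be the last grouping key.

```scala
object SessionOrderingSketch {
  // Hypothetical helper: keys are plain strings here, not Catalyst attributes.
  // Remove the session key wherever it appears, then sort by it last.
  def requiredOrdering(groupingKeys: Seq[String], sessionKey: String): Seq[String] =
    groupingKeys.filterNot(_ == sessionKey) :+ sessionKey

  def main(args: Array[String]): Unit = {
    // The session key is deliberately NOT the last grouping key.
    val keys = Seq("session_window", "userId", "region")

    // Filtering by equality keeps every non-session key.
    println(requiredOrdering(keys, "session_window"))

    // Assuming the session key is last (keys.init) would drop "region"
    // and keep "session_window" among the ordinary keys.
    println(keys.init :+ "session_window")
  }
}
```

   With the session key first in the input, the filtered version still sorts
   by `userId`, `region`, and then the session key, while the positional
   version silently loses `region`.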







[GitHub] [spark] SparkQA commented on pull request #32464: [SPARK-35062][SQL] Group exception messages in sql/streaming

2021-05-10 Thread GitBox


SparkQA commented on pull request #32464:
URL: https://github.com/apache/spark/pull/32464#issuecomment-837759960


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42875/
   





[GitHub] [spark] SparkQA commented on pull request #32494: [Minor][SPARK-35362][SQL]Update null count in the column stats for UNION operator stats estimation

2021-05-10 Thread GitBox


SparkQA commented on pull request #32494:
URL: https://github.com/apache/spark/pull/32494#issuecomment-837756097









[GitHub] [spark] SparkQA commented on pull request #32031: [WIP] Initial work of Remote Shuffle Service on Kubernetes

2021-05-10 Thread GitBox


SparkQA commented on pull request #32031:
URL: https://github.com/apache/spark/pull/32031#issuecomment-837755790


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42874/
   





[GitHub] [spark] sigmod commented on a change in pull request #32439: [SPARK-35298][SQL] Migrate to transformWithPruning for rules in Optimizer.scala

2021-05-10 Thread GitBox


sigmod commented on a change in pull request #32439:
URL: https://github.com/apache/spark/pull/32439#discussion_r629828096



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##
@@ -745,7 +754,8 @@ object PushProjectionThroughUnion extends Rule[LogicalPlan] with PredicateHelper
  */
 object ColumnPruning extends Rule[LogicalPlan] {
 
-  def apply(plan: LogicalPlan): LogicalPlan = removeProjectBeforeFilter(plan transform {
+  def apply(plan: LogicalPlan): LogicalPlan = removeProjectBeforeFilter(
+plan.transformWithPruning(AlwaysProcess.fn, ruleId) {

Review comment:
   `transform` internally calls
   `transformWithPruning(AlwaysProcess.fn, UnknownRuleId)` ...
   
   The argument comments are here:
   
https://github.com/apache/spark/blob/e08c40fa3f7054bdc713873c99d3aaf0014d9314/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L434-L441
   
   So here, `plan.transformWithPruning(AlwaysProcess.fn, ruleId)` means there is
   no pruning based on TreePattern bits, but there is pruning based on rule IDs:
   if the rule is known to be ineffective on a tree instance `T`, it is skipped
   the next time it is invoked on the same tree instance `T`.
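
   The rule-ID pruning described above can be sketched with a toy tree. This
   is not Spark's `TreeNode` implementation: `PruningSketch`, `Node`, `RuleId`,
   and `alwaysProcess` are hypothetical stand-ins, and the bookkeeping is a
   plain mutable set, chosen only to show the behavior.

```scala
import scala.collection.mutable

object PruningSketch {
  // Toy stand-ins; none of these are Spark's real classes.
  final case class RuleId(id: Int)
  val UnknownRuleId: RuleId = RuleId(-1)
  val alwaysProcess: Node => Boolean = _ => true

  final class Node(val name: String, val children: Seq[Node] = Nil) {
    // Rules already known to leave this exact instance unchanged.
    private val ineffectiveRules = mutable.Set.empty[RuleId]

    def transformWithPruning(cond: Node => Boolean, ruleId: RuleId = UnknownRuleId)(
        rule: PartialFunction[Node, Node]): Node = {
      if (!cond(this) || ineffectiveRules.contains(ruleId)) {
        this // pruned: the condition rejects it, or the rule is known to do nothing here
      } else {
        val afterRule = rule.applyOrElse(this, (n: Node) => n)
        val newChildren = afterRule.children.map(_.transformWithPruning(cond, ruleId)(rule))
        val changed = newChildren.zip(afterRule.children).exists { case (a, b) => a ne b }
        val result = if (changed) new Node(afterRule.name, newChildren) else afterRule
        // Remember a no-op only when the caller supplied a real rule ID.
        if ((result eq this) && ruleId != UnknownRuleId) ineffectiveRules += ruleId
        result
      }
    }

    // Plain transform: always process, and never remember anything.
    def transform(rule: PartialFunction[Node, Node]): Node =
      transformWithPruning(alwaysProcess, UnknownRuleId)(rule)
  }

  def main(args: Array[String]): Unit = {
    val tree = new Node("Project", Seq(new Node("Filter", Seq(new Node("Scan")))))
    var calls = 0
    val noOp: PartialFunction[Node, Node] = { case n => calls += 1; n }

    tree.transformWithPruning(alwaysProcess, RuleId(1))(noOp)
    println(s"after first run with a rule ID: $calls rule invocations")
    tree.transformWithPruning(alwaysProcess, RuleId(1))(noOp)
    println(s"after second run with the same rule ID: $calls rule invocations")
    tree.transform(noOp)
    println(s"after plain transform: $calls rule invocations")
  }
}
```

   In this sketch the second run with the same rule ID returns at the root
   without invoking the rule, while plain `transform` visits every node again
   because it passes the unknown rule ID and so records nothing.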







[GitHub] [spark] maryannxue commented on a change in pull request #32439: [SPARK-35298][SQL] Migrate to transformWithPruning for rules in Optimizer.scala

2021-05-10 Thread GitBox


maryannxue commented on a change in pull request #32439:
URL: https://github.com/apache/spark/pull/32439#discussion_r629826756



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##
@@ -745,7 +754,8 @@ object PushProjectionThroughUnion extends Rule[LogicalPlan] with PredicateHelper
  */
 object ColumnPruning extends Rule[LogicalPlan] {
 
-  def apply(plan: LogicalPlan): LogicalPlan = removeProjectBeforeFilter(plan transform {
+  def apply(plan: LogicalPlan): LogicalPlan = removeProjectBeforeFilter(
+plan.transformWithPruning(AlwaysProcess.fn, ruleId) {

Review comment:
   I might have missed something from the previous commit, but what is the
   difference between regular `transform` and
   `transformWithPruning(AlwaysProcess.fn, ...)`?







[GitHub] [spark] SparkQA removed a comment on pull request #32495: [SPARK-35363][SQL] Refactor sort merge join code-gen be agnostic to join type

2021-05-10 Thread GitBox


SparkQA removed a comment on pull request #32495:
URL: https://github.com/apache/spark/pull/32495#issuecomment-837480412


   **[Test build #138343 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138343/testReport)** for PR 32495 at commit [`1912436`](https://github.com/apache/spark/commit/191243698d17978e1d4ed2bb966de2564926b309).





[GitHub] [spark] SparkQA commented on pull request #32495: [SPARK-35363][SQL] Refactor sort merge join code-gen be agnostic to join type

2021-05-10 Thread GitBox


SparkQA commented on pull request #32495:
URL: https://github.com/apache/spark/pull/32495#issuecomment-837741723


   **[Test build #138343 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138343/testReport)** for PR 32495 at commit [`1912436`](https://github.com/apache/spark/commit/191243698d17978e1d4ed2bb966de2564926b309).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
 * `trait ShuffledJoin extends JoinCodegenSupport `





[GitHub] [spark] HeartSaVioR commented on a change in pull request #31986: [SPARK-34888][SS] Introduce UpdatingSessionIterator adjusting session window on elements

2021-05-10 Thread GitBox


HeartSaVioR commented on a change in pull request #31986:
URL: https://github.com/apache/spark/pull/31986#discussion_r629795680



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/UpdatingSessionsExec.scala
##
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.aggregate
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, SortOrder}
+import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
+import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}
+
+/**
+ * This node updates the session window spec of each input rows via analyzing neighbor rows and
+ * determining rows belong to the same session window. The number of input rows remains the same.
+ * This node requires sort on input rows by group keys + the start time of session window.
+ *
+ * There are lots of overhead compared to [[MergingSessionsExec]]. Use [[MergingSessionsExec]]
+ * instead whenever possible. Use this node only when we cannot apply both calculations
+ * determining session windows and aggregating rows in session window altogether.
+ *
+ * Refer [[UpdatingSessionsIterator]] for more details.
+ */
+case class UpdatingSessionsExec(
+keyExpressions: Seq[Attribute],
+sessionExpression: Attribute,
+child: SparkPlan) extends UnaryExecNode {
+
+  private val groupingWithoutSessionExpression = keyExpressions.filterNot {
+p => p.semanticEquals(sessionExpression)
+  }
+  private val groupingWithoutSessionAttributes =
+groupingWithoutSessionExpression.map(_.toAttribute)
+
+  override protected def doExecute(): RDD[InternalRow] = {
+val inMemoryThreshold = sqlContext.conf.sessionWindowBufferInMemoryThreshold
+val spillThreshold = sqlContext.conf.sessionWindowBufferSpillThreshold
+
+child.execute().mapPartitions { iter =>
+  new UpdatingSessionsIterator(iter, keyExpressions, sessionExpression,
+child.output, inMemoryThreshold, spillThreshold)
+}
+  }
+
+  override def output: Seq[Attribute] = child.output
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  override def requiredChildDistribution: Seq[Distribution] = {
+if (groupingWithoutSessionExpression.isEmpty) {
+  AllTuples :: Nil
+} else {
+  ClusteredDistribution(groupingWithoutSessionExpression) :: Nil
+}
+  }
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] = {
+Seq((groupingWithoutSessionAttributes ++ Seq(sessionExpression))

Review comment:
   Nice finding. Will update.

##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/UpdatingSessionsIterator.scala
##
@@ -0,0 +1,224 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.aggregate
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
+import org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray
+
+/**
+ * This class calculates and updates the session window for each element in the given iterator,
+ * which makes elements in the same session window having same session spec. 

[GitHub] [spark] SparkQA commented on pull request #32365: [SPARK-35228][SQL] Add expression ToPrettyString for keep consistent between hive/spark format in df.show and transform

2021-05-10 Thread GitBox


SparkQA commented on pull request #32365:
URL: https://github.com/apache/spark/pull/32365#issuecomment-837727960


   **[Test build #138354 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138354/testReport)** for PR 32365 at commit [`1994dfd`](https://github.com/apache/spark/commit/1994dfde6fc3f21709b4e968c42d85d063e882cc).





[GitHub] [spark] SparkQA commented on pull request #32031: [WIP] Initial work of Remote Shuffle Service on Kubernetes

2021-05-10 Thread GitBox


SparkQA commented on pull request #32031:
URL: https://github.com/apache/spark/pull/32031#issuecomment-837717297


   **[Test build #138353 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138353/testReport)** for PR 32031 at commit [`48a0be7`](https://github.com/apache/spark/commit/48a0be7b58a1a9d5c556b35359427fb0cd903b13).





[GitHub] [spark] SparkQA commented on pull request #32494: [Minor][SPARK-35362][SQL]Update null count in the column stats for UNION operator stats estimation

2021-05-10 Thread GitBox


SparkQA commented on pull request #32494:
URL: https://github.com/apache/spark/pull/32494#issuecomment-837716583


   **[Test build #138351 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138351/testReport)** for PR 32494 at commit [`8e02f19`](https://github.com/apache/spark/commit/8e02f1937d1c725d8fa3d95a90d9b22d82d7b01d).





[GitHub] [spark] SparkQA commented on pull request #32464: [SPARK-35062][SQL] Group exception messages in sql/streaming

2021-05-10 Thread GitBox


SparkQA commented on pull request #32464:
URL: https://github.com/apache/spark/pull/32464#issuecomment-837716519


   **[Test build #138352 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138352/testReport)** for PR 32464 at commit [`4284458`](https://github.com/apache/spark/commit/42844586a5e53559187aa98797bd82b5c1f9601d).





[GitHub] [spark] beliefer commented on a change in pull request #32464: [SPARK-35062][SQL] Group exception messages in sql/streaming

2021-05-10 Thread GitBox


beliefer commented on a change in pull request #32464:
URL: https://github.com/apache/spark/pull/32464#discussion_r629818005



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
##
@@ -1391,4 +1391,58 @@ private[spark] object QueryCompilationErrors {
   def functionUnsupportedInV2CatalogError(): Throwable = {
     new AnalysisException("function is only supported in v1 catalog")
   }
+
+  def operateHiveDataSourceDirectlyError(operation: String): Throwable = {

Review comment:
   How about `cannotOperateOnHiveDataSourceFilesError` ?







[GitHub] [spark] AmplabJenkins removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still co

2021-05-10 Thread GitBox


AmplabJenkins removed a comment on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-837712777


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42872/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-10 Thread GitBox


AmplabJenkins removed a comment on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-837670869









[GitHub] [spark] AmplabJenkins commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue r

2021-05-10 Thread GitBox


AmplabJenkins commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-837712777


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42872/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-10 Thread GitBox


AmplabJenkins commented on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-837712776


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138349/
   





[GitHub] [spark] beliefer commented on pull request #32464: [SPARK-35062][SQL] Group exception messages in sql/streaming

2021-05-10 Thread GitBox


beliefer commented on pull request #32464:
URL: https://github.com/apache/spark/pull/32464#issuecomment-837711621


   > Looks good! There are two more exceptions under `streaming/ui`. How about adding them in the same PR?
   > 
   > 'sql/core/src/main/scala/org/apache/spark/sql/streaming/ui'
   > 
   > Filename   Count
   > StreamingQueryPage.scala   1
   > StreamingQueryStatisticsPage.scala 1
   
   I checked the two files and found that all the errors are assert-like exceptions. According to the discussion between @cloud-fan and me, we do not need to handle assert-like exceptions.





[GitHub] [spark] SparkQA removed a comment on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-10 Thread GitBox


SparkQA removed a comment on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-837604703


   **[Test build #138349 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138349/testReport)** for PR 32482 at commit [`30b0572`](https://github.com/apache/spark/commit/30b0572acef36f79b0f3bf6b7e094eaf0b762e33).





[GitHub] [spark] SparkQA commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-10 Thread GitBox


SparkQA commented on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-837709031


   **[Test build #138349 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138349/testReport)** for PR 32482 at commit [`30b0572`](https://github.com/apache/spark/commit/30b0572acef36f79b0f3bf6b7e094eaf0b762e33).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] cfmcgrady commented on a change in pull request #32488: [SPARK-35316][SQL] UnwrapCastInBinaryComparison support In/InSet predicate

2021-05-10 Thread GitBox


cfmcgrady commented on a change in pull request #32488:
URL: https://github.com/apache/spark/pull/32488#discussion_r629816464



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala
##
@@ -89,10 +89,11 @@ import org.apache.spark.sql.types._
  */
 object UnwrapCastInBinaryComparison extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning(
-    _.containsPattern(BINARY_COMPARISON), ruleId) {
+    _.containsAnyPattern(BINARY_COMPARISON, IN), ruleId) {

Review comment:
   Oh, yes, I'll update later







[GitHub] [spark] zhengruifeng commented on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

2021-05-10 Thread GitBox


zhengruifeng commented on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-837702523


   > This does not necessarily solve the issue that @zsxwing detailed - the issue here is `registerAccumulator` should not be called in `readObject` before subclasses have completed readObject.
   > 
   > One possible solution would be to introduce two methods.
   > 
   > a) A protected method `doHandleDriverSideAccumulator()` in `AccumulatorV2` - which has all the code after `defaultReadObject` in readObject.
   > b) Call `handleDriverSideAccumulator` after `defaultReadObject` in `AccumulatorV2`. In `AccumulatorV2`, this protected method will simply delegate to `doHandleDriverSideAccumulator`.
   > c) In subclasses with local state, override `doHandleDriverSideAccumulator` to make it do nothing - and after readObject in subclass is done, invoke `doHandleDriverSideAccumulator`
   > 
   > This will ensure AccumulatorV2 and subclasses will register only after state has been initialized.
   > (Rough sketch, please change logic/names/etc as relevant).
   > 
   > Note, there are other accumulators with local state; we should do this for all.
   > Thoughts ?
   
   +1
   
   I recently implemented some `AccumulatorV2` subclasses at work (complex statistics containing transient lazy vars and using collections like openhashmap/array/etc), and they hit lots of NPEs, which will probably make tasks fail. I have tried an approach like the one in this PR, but it evidently did not help.
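
The ordering problem discussed above can be reproduced without Spark. The following is a minimal sketch, assuming hypothetical stand-ins (`Registry`, `BaseAcc`, `ListAcc`, `SerDe`) rather than the real `AccumulatorV2` and its registry. It reads the quoted rough sketch as: the base class calls an overridable hook from `readObject`, and a subclass with local state turns that hook into a no-op and registers itself at the end of its own `readObject`.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the driver-side registry; not a Spark API.
object Registry {
  val seen: ArrayBuffer[String] = ArrayBuffer.empty[String]
}

abstract class BaseAcc extends Serializable {
  def describe: String

  // (a) Holds the work that would otherwise follow defaultReadObject.
  protected final def doHandleDriverSideAccumulator(): Unit =
    Registry.seen += describe

  // (b) Overridable hook; by default it registers immediately.
  protected def handleDriverSideAccumulator(): Unit =
    doHandleDriverSideAccumulator()

  // Java serialization calls this before any subclass readObject runs.
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    handleDriverSideAccumulator()
  }
}

class ListAcc extends BaseAcc {
  // Transient local state: null right after deserialization.
  @transient private var items: ArrayBuffer[Int] = ArrayBuffer.empty[Int]

  def add(i: Int): Unit = items += i
  def describe: String = s"ListAcc(size=${items.size})"

  // (c) Skip the early registration, which would see items == null.
  override protected def handleDriverSideAccumulator(): Unit = ()

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    items = ArrayBuffer.empty[Int]  // rebuild local state first
    doHandleDriverSideAccumulator() // then register
  }
}

object SerDe {
  // Serializes and deserializes a value through an in-memory buffer.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

Round-tripping a `ListAcc` through `SerDe.roundTrip` registers it once, after `items` has been rebuilt. Without the override in step (c), `describe` would be called from `BaseAcc.readObject` while `items` is still null, which is the kind of NPE described above.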





[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or

2021-05-10 Thread GitBox


SparkQA commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-837699020









[GitHub] [spark] mridulm commented on a change in pull request #30691: [SPARK-32920][SHUFFLE] Finalization of Shuffle push/merge with Push based shuffle and preparation step for the reduce stage

2021-05-10 Thread GitBox


mridulm commented on a change in pull request #30691:
URL: https://github.com/apache/spark/pull/30691#discussion_r629813442



##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -2004,6 +2020,131 @@ private[spark] class DAGScheduler(
 }
   }
 
+  /**
+   * Schedules shuffle merge finalize.
+   */
+  private[scheduler] def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = {
+    logInfo(("%s (%s) scheduled for finalizing" +
+      " shuffle merge in %s s").format(stage, stage.name, shuffleMergeFinalizeWaitSec))
+    shuffleMergeFinalizeScheduler.schedule(
+      new Runnable {
+        override def run(): Unit = finalizeShuffleMerge(stage)
+      },
+      shuffleMergeFinalizeWaitSec,
+      TimeUnit.SECONDS
+    )
+  }
+
+  /**
+   * DAGScheduler notifies all the remote shuffle services chosen to serve shuffle merge request for
+   * the given shuffle map stage to finalize the shuffle merge process for this shuffle. This is
+   * invoked in a separate thread to reduce the impact on the DAGScheduler main thread, as the
+   * scheduler might need to talk to 1000s of shuffle services to finalize shuffle merge.
+   */
+  private[scheduler] def finalizeShuffleMerge(stage: ShuffleMapStage): Unit = {
+    logInfo("%s (%s) finalizing the shuffle merge".format(stage, stage.name))

Review comment:
   What happens if the stage was cancelled during `shuffleMergeFinalizeWaitSec`?
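
To make the question concrete, here is a minimal sketch of one possible way to guard a delayed finalize task against cancellation; it is not what the PR does. `FakeStage` and `FinalizeSketch` are hypothetical names invented for illustration, and only the `java.util.concurrent` calls are real APIs. The idea is to keep the `ScheduledFuture` returned by `schedule` so that it can be cancelled, and to re-check a cancellation flag inside the task in case it has already started.

```scala
import java.util.concurrent.{Executors, ScheduledFuture, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical stand-in for a stage; not Spark's ShuffleMapStage.
final class FakeStage(val name: String) {
  val cancelled = new AtomicBoolean(false)
  val finalized = new AtomicBoolean(false)
}

object FinalizeSketch {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  // Schedules the finalize step and returns a handle that can be cancelled.
  def scheduleFinalize(stage: FakeStage, delayMs: Long): ScheduledFuture[_] =
    scheduler.schedule(new Runnable {
      override def run(): Unit = {
        // Re-check inside the task: cancellation may race with the timer.
        if (!stage.cancelled.get()) stage.finalized.set(true)
      }
    }, delayMs, TimeUnit.MILLISECONDS)

  // Marks the stage as cancelled and drops the pending finalize task.
  def cancel(stage: FakeStage, pending: ScheduledFuture[_]): Unit = {
    stage.cancelled.set(true)
    pending.cancel(false)
  }

  def shutdown(): Unit = scheduler.shutdown()
}
```

In this sketch a stage cancelled during the wait is never finalized; whether the PR should behave this way is exactly the open question in the review comment.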







[GitHub] [spark] mridulm commented on a change in pull request #30691: [SPARK-32920][SHUFFLE] Finalization of Shuffle push/merge with Push based shuffle and preparation step for the reduce stage

2021-05-10 Thread GitBox


mridulm commented on a change in pull request #30691:
URL: https://github.com/apache/spark/pull/30691#discussion_r629812883



##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -1271,21 +1302,28 @@ private[spark] class DAGScheduler(
    * locations for block push/merge by getting the historical locations of past executors.
    */
   private def prepareShuffleServicesForShuffleMapStage(stage: ShuffleMapStage): Unit = {
-    // TODO(SPARK-32920) Handle stage reuse/retry cases separately as without finalize
-    // TODO changes we cannot disable shuffle merge for the retry/reuse cases
-    val mergerLocs = sc.schedulerBackend.getShufflePushMergerLocations(
-      stage.shuffleDep.partitioner.numPartitions, stage.resourceProfileId)
-
-    if (mergerLocs.nonEmpty) {
-      stage.shuffleDep.setMergerLocs(mergerLocs)
-      logInfo(s"Push-based shuffle enabled for $stage (${stage.name}) with" +
-        s" ${stage.shuffleDep.getMergerLocs.size} merger locations")
-
-      logDebug("List of shuffle push merger locations " +
-        s"${stage.shuffleDep.getMergerLocs.map(_.host).mkString(", ")}")
-    } else {
-      logInfo("No available merger locations." +
-        s" Push-based shuffle disabled for $stage (${stage.name})")
+    if (stage.shuffleDep.shuffleMergeEnabled && !stage.shuffleDep.shuffleMergeFinalized
+      && stage.shuffleDep.getMergerLocs.isEmpty) {
+      val mergerLocs = sc.schedulerBackend.getShufflePushMergerLocations(
+        stage.shuffleDep.partitioner.numPartitions, stage.resourceProfileId)
+      if (mergerLocs.nonEmpty) {
+        stage.shuffleDep.setMergerLocs(mergerLocs)
+        logInfo(s"Push-based shuffle enabled for $stage (${stage.name}) with" +
+          s" ${stage.shuffleDep.getMergerLocs.size} merger locations")
+
+        logDebug("List of shuffle push merger locations " +
+          s"${stage.shuffleDep.getMergerLocs.map(_.host).mkString(", ")}")
+      } else {
+        stage.shuffleDep.setShuffleMergeEnabled(false)
+        logInfo("Push-based shuffle disabled for $stage (${stage.name})")
+      }
+    } else if (stage.shuffleDep.shuffleMergeFinalized) {
+      // Disable Shuffle merge for the retry/reuse of the same shuffle dependency if it has
+      // already been merge finalized. If the shuffle dependency was previously assigned merger
+      // locations but the corresponding shuffle map stage did not complete successfully, we
+      // would still enable push for its retry.

Review comment:
   If the stage is getting re-executed due to a lost node (for example), requiring recomputation, are we disabling merge?







[GitHub] [spark] mridulm commented on a change in pull request #30691: [SPARK-32920][SHUFFLE] Finalization of Shuffle push/merge with Push based shuffle and preparation step for the reduce stage

2021-05-10 Thread GitBox


mridulm commented on a change in pull request #30691:
URL: https://github.com/apache/spark/pull/30691#discussion_r629811689



##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -2004,6 +2020,131 @@ private[spark] class DAGScheduler(
 }
   }
 
+  /**
+   * Schedules shuffle merge finalize.
+   */
+  private[scheduler] def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = {
+    logInfo(("%s (%s) scheduled for finalizing" +
+      " shuffle merge in %s s").format(stage, stage.name, shuffleMergeFinalizeWaitSec))
+    shuffleMergeFinalizeScheduler.schedule(
+      new Runnable {
+        override def run(): Unit = finalizeShuffleMerge(stage)
+      },
+      shuffleMergeFinalizeWaitSec,
+      TimeUnit.SECONDS
+    )
+  }
+
+  /**
+   * DAGScheduler notifies all the remote shuffle services chosen to serve shuffle merge request for
+   * the given shuffle map stage to finalize the shuffle merge process for this shuffle. This is
+   * invoked in a separate thread to reduce the impact on the DAGScheduler main thread, as the
+   * scheduler might need to talk to 1000s of shuffle services to finalize shuffle merge.
+   */
+  private[scheduler] def finalizeShuffleMerge(stage: ShuffleMapStage): Unit = {
+    logInfo("%s (%s) finalizing the shuffle merge".format(stage, stage.name))
+    externalShuffleClient.foreach { shuffleClient =>
+      val shuffleId = stage.shuffleDep.shuffleId
+      val numMergers = stage.shuffleDep.getMergerLocs.length
+      val numResponses = new AtomicInteger()
+      val results = (0 until numMergers).map(_ => SettableFuture.create[Boolean]())
+      val timedOut = new AtomicBoolean()
+
+      def increaseAndCheckResponseCount(): Unit = {
+        if (numResponses.incrementAndGet() == numMergers) {
+          logInfo("%s (%s) shuffle merge finalized".format(stage, stage.name))
+          // Since this runs in the netty client thread and is outside of DAGScheduler
+          // event loop, we only post ShuffleMergeFinalized event into the event queue.
+          // The processing of this event should be done inside the event loop, so it
+          // can safely modify scheduler's internal state.
+          eventProcessLoop.post(ShuffleMergeFinalized(stage))
+        }
+      }
+
+      stage.shuffleDep.getMergerLocs.zipWithIndex.foreach {
+        case (shuffleServiceLoc, index) =>
+          // Sends async request to shuffle service to finalize shuffle merge on that host
+          shuffleClient.finalizeShuffleMerge(shuffleServiceLoc.host,
+            shuffleServiceLoc.port, shuffleId,
+            new MergeFinalizerListener {
+              override def onShuffleMergeSuccess(statuses: MergeStatuses): Unit = {
+                assert(shuffleId == statuses.shuffleId)
+                // Register the merge results even if already timed out, in case the reducer
+                // needing this merged block starts after dag scheduler receives this response.
+                mapOutputTracker.registerMergeResults(statuses.shuffleId,
+                  MergeStatus.convertMergeStatusesToMergeStatusArr(statuses, shuffleServiceLoc))
+                if (!timedOut.get()) {
+                  increaseAndCheckResponseCount
+                  results(index).set(true)
+                }
+              }
+
+              override def onShuffleMergeFailure(e: Throwable): Unit = {
+                if (!timedOut.get()) {
+                  logWarning(s"Exception encountered when trying to finalize shuffle " +
+                    s"merge on ${shuffleServiceLoc.host} for shuffle $shuffleId", e)
+                  increaseAndCheckResponseCount
+                  // Do not fail the future as this would cause dag scheduler to prematurely
+                  // give up on waiting for merge results from the remaining shuffle services
+                  // if one fails
+                  results(index).set(false)
+                }
+              }
+            })
+      }
+      // DAGScheduler only waits for a limited amount of time for the merge results.
+      // It will attempt to submit the next stage(s) irrespective of whether merge results
+      // from all shuffle services are received or not.
+      // TODO: SPARK-33701: Instead of waiting for a constant amount of time for finalization
+      // TODO: for all the stages, adaptively tune timeout for merge finalization
+      try {
+        Futures.allAsList(results: _*).get(shuffleMergeResultsTimeoutSec, TimeUnit.SECONDS)
+      } catch {
+        case _: TimeoutException =>
+          logInfo(s"Timed out on waiting for merge results from all " +
+            s"$numMergers mergers for shuffle $shuffleId")
+          timedOut.set(true)
+          eventProcessLoop.post(ShuffleMergeFinalized(stage))
+      }
+    }
+  }
+
+  private def processShuffleMapStageCompletion(shuffleStage: ShuffleMapStage): Unit = {
+    markStageAsFinished(shuffleStage)
+

[GitHub] [spark] LuciferYang commented on a change in pull request #32455: [SPARK-35253][SQL][BUILD] Bump up the janino version to v3.1.4

2021-05-10 Thread GitBox


LuciferYang commented on a change in pull request #32455:
URL: https://github.com/apache/spark/pull/32455#discussion_r629811420



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
##
@@ -1434,9 +1435,10 @@ object CodeGenerator extends Logging {
   private def updateAndGetCompilationStats(evaluator: ClassBodyEvaluator): ByteCodeStats = {
     // First retrieve the generated classes.
     val classes = {
-      val resultField = classOf[SimpleCompiler].getDeclaredField("result")
-      resultField.setAccessible(true)
-      val loader = resultField.get(evaluator).asInstanceOf[ByteArrayClassLoader]
+      val scField = classOf[ClassBodyEvaluator].getDeclaredField("sc")

Review comment:
   @maropu ok, I can try to do this followup :)
   
   







[GitHub] [spark] mridulm commented on a change in pull request #30691: [SPARK-32920][SHUFFLE] Finalization of Shuffle push/merge with Push based shuffle and preparation step for the reduce stage

2021-05-10 Thread GitBox


mridulm commented on a change in pull request #30691:
URL: https://github.com/apache/spark/pull/30691#discussion_r629810826



##
File path: core/src/main/scala/org/apache/spark/Dependency.scala
##
@@ -110,6 +125,12 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
 
   def getMergerLocs: Seq[BlockManagerId] = mergerLocs
 
+  def markShuffleMergeFinalized: Unit = {

Review comment:
   `private[spark]` ?







[GitHub] [spark] mridulm commented on a change in pull request #30691: [SPARK-32920][SHUFFLE] Finalization of Shuffle push/merge with Push based shuffle and preparation step for the reduce stage

2021-05-10 Thread GitBox


mridulm commented on a change in pull request #30691:
URL: https://github.com/apache/spark/pull/30691#discussion_r629810563



##
File path: core/src/main/scala/org/apache/spark/Dependency.scala
##
@@ -96,12 +96,27 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
   val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
     shuffleId, this)
 
+  // By default, shuffle merge is enabled for ShuffleDependency if push based shuffle is enabled
+  private[this] var _shuffleMergeEnabled = Utils.isPushBasedShuffleEnabled(rdd.sparkContext.getConf)
+
+  def setShuffleMergeEnabled(shuffleMergeEnabled: Boolean): Unit = {

Review comment:
   `private[spark]` ?







[GitHub] [spark] mridulm commented on a change in pull request #30691: [SPARK-32920][SHUFFLE] Finalization of Shuffle push/merge with Push based shuffle and preparation step for the reduce stage

2021-05-10 Thread GitBox


mridulm commented on a change in pull request #30691:
URL: https://github.com/apache/spark/pull/30691#discussion_r629810296



##
File path: .idea/vcs.xml
##
@@ -1,36 +0,0 @@
-

Review comment:
   Where is this coming from ?







[GitHub] [spark] mridulm commented on pull request #32381: [SPARK-35229][WEBUI] Limit the maximum number of items on the timeline view.

2021-05-10 Thread GitBox


mridulm commented on pull request #32381:
URL: https://github.com/apache/spark/pull/32381#issuecomment-837683190


   +CC @zhouyejoe





[GitHub] [spark] shahidki31 commented on pull request #32494: [Minor][SPARK-35362][SQL]Update null count in the column stats for UNION operator stats estimation

2021-05-10 Thread GitBox


shahidki31 commented on pull request #32494:
URL: https://github.com/apache/spark/pull/32494#issuecomment-837680984


   Retest this please





[GitHub] [spark] AmplabJenkins commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-10 Thread GitBox


AmplabJenkins commented on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-837670869


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42871/
   





[GitHub] [spark] SparkQA commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-10 Thread GitBox


SparkQA commented on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-837670817


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42871/
   





[GitHub] [spark] HyukjinKwon commented on a change in pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page

2021-05-10 Thread GitBox


HyukjinKwon commented on a change in pull request #32204:
URL: https://github.com/apache/spark/pull/32204#discussion_r629803248



##
File path: python/pyspark/sql/streaming.py
##
@@ -504,105 +504,13 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
         path : str
             string represents path to the JSON dataset,
             or RDD of Strings storing JSON objects.
-        schema : :class:`pyspark.sql.types.StructType` or str, optional

Review comment:
   This doc doesn't exist anymore either. We should keep it.







[GitHub] [spark] HyukjinKwon commented on a change in pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page

2021-05-10 Thread GitBox


HyukjinKwon commented on a change in pull request #32204:
URL: https://github.com/apache/spark/pull/32204#discussion_r629803050



##
File path: python/pyspark/sql/readwriter.py
##
@@ -233,114 +233,13 @@ def json(self, path, schema=None, 
primitivesAsString=None, prefersDecimal=None,
 path : str, list or :class:`RDD`
 string represents path to the JSON dataset, or a list of paths,
 or RDD of Strings storing JSON objects.
-schema : :class:`pyspark.sql.types.StructType` or str, optional
-an optional :class:`pyspark.sql.types.StructType` for the input 
schema or
-a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).
-primitivesAsString : str or bool, optional
-infers all primitive values as a string type. If None is set,
-it uses the default value, ``false``.
-prefersDecimal : str or bool, optional
-infers all floating-point values as a decimal type. If the values
-do not fit in decimal, then it infers them as doubles. If None is
-set, it uses the default value, ``false``.
-allowComments : str or bool, optional
-ignores Java/C++ style comment in JSON records. If None is set,
-it uses the default value, ``false``.
-allowUnquotedFieldNames : str or bool, optional
-allows unquoted JSON field names. If None is set,
-it uses the default value, ``false``.
-allowSingleQuotes : str or bool, optional
-allows single quotes in addition to double quotes. If None is
-set, it uses the default value, ``true``.
-allowNumericLeadingZero : str or bool, optional
-allows leading zeros in numbers (e.g. 00012). If None is
-set, it uses the default value, ``false``.
-allowBackslashEscapingAnyCharacter : str or bool, optional
-allows accepting quoting of all character
-using backslash quoting mechanism. If None is
-set, it uses the default value, ``false``.
-mode : str, optional
-allows a mode for dealing with corrupt records during parsing. If 
None is
- set, it uses the default value, ``PERMISSIVE``.
-
-* ``PERMISSIVE``: when it meets a corrupted record, puts the 
malformed string \
-  into a field configured by ``columnNameOfCorruptRecord``, and 
sets malformed \
-  fields to ``null``. To keep corrupt records, an user can set a 
string type \
-  field named ``columnNameOfCorruptRecord`` in an user-defined 
schema. If a \
-  schema does not have the field, it drops corrupt records during 
parsing. \
-  When inferring a schema, it implicitly adds a 
``columnNameOfCorruptRecord`` \
-  field in an output schema.
-*  ``DROPMALFORMED``: ignores the whole corrupted records.
-*  ``FAILFAST``: throws an exception when it meets corrupted 
records.
 
-columnNameOfCorruptRecord: str, optional
-allows renaming the new field having malformed string
-created by ``PERMISSIVE`` mode. This overrides
-``spark.sql.columnNameOfCorruptRecord``. If None is set,
-it uses the value specified in
-``spark.sql.columnNameOfCorruptRecord``.
-dateFormat : str, optional
-sets the string that indicates a date format. Custom date formats
-follow the formats at
-`datetime pattern 
`_.  # noqa
-This applies to date type. If None is set, it uses the
-default value, ``yyyy-MM-dd``.
-timestampFormat : str, optional
-sets the string that indicates a timestamp format.
-Custom date formats follow the formats at
-`datetime pattern 
`_.  # noqa
-This applies to timestamp type. If None is set, it uses the
-default value, ``yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]``.
-multiLine : str or bool, optional
-parse one record, which may span multiple lines, per file. If None 
is
-set, it uses the default value, ``false``.
-allowUnquotedControlChars : str or bool, optional
-allows JSON Strings to contain unquoted control
-characters (ASCII characters with value less than 32,
-including tab and line feed characters) or not.
-encoding : str or bool, optional
-allows to forcibly set one of standard basic or extended encoding 
for
-the JSON files. For example UTF-16BE, UTF-32LE. If None is set,
-the encoding of input JSON will be detected automatically
-when the multiLine option is set to ``true``.
-lineSep : str, optional
-defines 

[GitHub] [spark] HyukjinKwon commented on a change in pull request #32161: [SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page.

2021-05-10 Thread GitBox


HyukjinKwon commented on a change in pull request #32161:
URL: https://github.com/apache/spark/pull/32161#discussion_r629802818



##
File path: python/pyspark/sql/readwriter.py
##
@@ -416,53 +416,10 @@ def parquet(self, *paths, **options):
 
 Other Parameters
 
-mergeSchema : str or bool, optional
-sets whether we should merge schemas collected from all
-Parquet part-files. This will override
-``spark.sql.parquet.mergeSchema``. The default value is specified 
in
-``spark.sql.parquet.mergeSchema``.
-pathGlobFilter : str or bool, optional
-an optional glob pattern to only include files with paths matching
-the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
-It does not change the behavior of
-`partition discovery 
`_.
  # noqa
-recursiveFileLookup : str or bool, optional
-recursively scan a directory for files. Using this option
-disables
-`partition discovery 
`_.
  # noqa
-
-modification times occurring before the specified time. The 
provided timestamp
-must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 
2020-06-01T13:00:00)
-modifiedBefore (batch only) : an optional timestamp to only include 
files with

Review comment:
   This too. I think it's not a Parquet-specific option. Can you double-check 
whether the options are Parquet-specific, and whether any options were mistakenly 
removed? For example, the docs for this one were removed and no longer exist anywhere.







[GitHub] [spark] HyukjinKwon commented on a change in pull request #32161: [SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page.

2021-05-10 Thread GitBox


HyukjinKwon commented on a change in pull request #32161:
URL: https://github.com/apache/spark/pull/32161#discussion_r629802560



##
File path: python/pyspark/sql/readwriter.py
##
@@ -1257,14 +1214,13 @@ def parquet(self, path, mode=None, partitionBy=None, 
compression=None):
 * ``ignore``: Silently ignore this operation if data already 
exists.
 * ``error`` or ``errorifexists`` (default case): Throw an 
exception if data already \
 exists.
-partitionBy : str or list, optional

Review comment:
   I think `partitionBy` is not a Parquet-specific option. We'll have to 
restore it here and remove it from the docs.







[GitHub] [spark] c21 commented on pull request #32495: [SPARK-35363][SQL] Refactor sort merge join code-gen be agnostic to join type

2021-05-10 Thread GitBox


c21 commented on pull request #32495:
URL: https://github.com/apache/spark/pull/32495#issuecomment-837659022


   Thank you @maropu for monitoring and reviewing!





[GitHub] [spark] maropu commented on pull request #32495: [SPARK-35363][SQL] Refactor sort merge join code-gen be agnostic to join type

2021-05-10 Thread GitBox


maropu commented on pull request #32495:
URL: https://github.com/apache/spark/pull/32495#issuecomment-837658640


   All the GA tests passed. Merged to master. Thank you, @c21 





[GitHub] [spark] maropu closed pull request #32495: [SPARK-35363][SQL] Refactor sort merge join code-gen be agnostic to join type

2021-05-10 Thread GitBox


maropu closed pull request #32495:
URL: https://github.com/apache/spark/pull/32495


   





[GitHub] [spark] SparkQA commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-10 Thread GitBox


SparkQA commented on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-837656482


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42871/
   





[GitHub] [spark] maropu commented on pull request #32495: [SPARK-35363][SQL] Refactor sort merge join code-gen be agnostic to join type

2021-05-10 Thread GitBox


maropu commented on pull request #32495:
URL: https://github.com/apache/spark/pull/32495#issuecomment-837656333


   Okay, I've checked that it passed. I will merge this.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32494: [Minor][SPARK-35362][SQL]Update null count in the column stats for UNION operator stats estimation

2021-05-10 Thread GitBox


AmplabJenkins removed a comment on pull request #32494:
URL: https://github.com/apache/spark/pull/32494#issuecomment-837655724


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138342/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32494: [Minor][SPARK-35362][SQL]Update null count in the column stats for UNION operator stats estimation

2021-05-10 Thread GitBox


AmplabJenkins commented on pull request #32494:
URL: https://github.com/apache/spark/pull/32494#issuecomment-837655724


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138342/
   





[GitHub] [spark] SparkQA removed a comment on pull request #32494: [Minor][SPARK-35362][SQL]Update null count in the column stats for UNION operator stats estimation

2021-05-10 Thread GitBox


SparkQA removed a comment on pull request #32494:
URL: https://github.com/apache/spark/pull/32494#issuecomment-837428314


   **[Test build #138342 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138342/testReport)**
 for PR 32494 at commit 
[`56ea3d8`](https://github.com/apache/spark/commit/56ea3d805267047b13160b900589faf26740d389).





[GitHub] [spark] SparkQA commented on pull request #32494: [Minor][SPARK-35362][SQL]Update null count in the column stats for UNION operator stats estimation

2021-05-10 Thread GitBox


SparkQA commented on pull request #32494:
URL: https://github.com/apache/spark/pull/32494#issuecomment-837654038


   **[Test build #138342 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138342/testReport)**
 for PR 32494 at commit 
[`56ea3d8`](https://github.com/apache/spark/commit/56ea3d805267047b13160b900589faf26740d389).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or

2021-05-10 Thread GitBox


SparkQA commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-837652969


   **[Test build #138350 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138350/testReport)**
 for PR 32399 at commit 
[`c6c677c`](https://github.com/apache/spark/commit/c6c677cfd9c6a7f9cc71b5870d2cf71c6483613f).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-05-10 Thread GitBox


AmplabJenkins removed a comment on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-837651747


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42870/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-10 Thread GitBox


AmplabJenkins removed a comment on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-837651748


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42869/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-10 Thread GitBox


AmplabJenkins commented on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-837651748


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42869/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-05-10 Thread GitBox


AmplabJenkins commented on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-837651747


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42870/
   





[GitHub] [spark] beliefer commented on pull request #32492: [SPARK-35088][SQL][FOLLOWUP] Improve the error message for Sequence expression

2021-05-10 Thread GitBox


beliefer commented on pull request #32492:
URL: https://github.com/apache/spark/pull/32492#issuecomment-837647119


   @HyukjinKwon @MaxGekk Thanks a lot!




