[GitHub] [spark] xuanyuanking edited a comment on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
xuanyuanking edited a comment on pull request #28707:
URL: https://github.com/apache/spark/pull/28707#issuecomment-643916110

cc @maropu @gatorsmile @HeartSaVioR @dongjoon-hyun

A new regression bug, SPARK-31990, was found while investigating the test failure https://github.com/apache/spark/pull/28707#issuecomment-639861273. The root cause is that [this line](https://github.com/apache/spark/pull/28062/files#diff-7a46f10c3cedbf013cf255564d9483cdL2458) in SPARK-31292 changed the order of `groupCols` in `Deduplicate`, and the changed order breaks the validation logic here. In other words, without this PR, the executor JVM could crash, throw a random exception, or even return a wrong answer when using a checkpoint written by the previous version. So there are two pieces of work related to this PR:
- [ ] **[Block]** Fix and merge the compatibility issue in #28830
- [ ] [Follow-up] Add a new test (or modify the current Kafka test) in #28725

--

### More detailed analysis:
The expected order of `Deduplicate.groupCols` in the UT `KafkaMicroBatchV2SourceSuite` is
```
[timestamp, partition, timestampType, key, offset, topic, value]
```
which is also the order in the checkpoint written by versions before Spark 3.0. After the changes in SPARK-31292, the `groupCols` changed to
```
[key, value, topic, partition, offset, timestamp, timestampType]
```
Why didn't this incompatibility bug fail `KafkaMicroBatchV2SourceSuite` when it was merged? Because the UT `default config of includeHeader doesn't break the existing query from Spark 2.4` doesn't exercise the deduplication scenario and check the answer. Although the UT uses a checkpoint written by version 2.4.3 and a streaming deduplication operation, it only aims to prove that the new header (added in SPARK-23539) doesn't break the original checkpoint file.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
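The ordering problem described in the comment above can be illustrated with plain Scala collections. This is only a sketch: it does not use Spark's `Dataset` or `Deduplicate` code, and the object name `GroupColsOrder` and the helpers `sameColumns` and `sameLayout` are hypothetical names introduced for illustration. It assumes only that a positional comparison of column names is a reasonable stand-in for a check that depends on column order. The two column lists are the ones quoted in the comment.

```scala
// Sketch only: plain Scala collections, not Spark's Dataset/Deduplicate code.
object GroupColsOrder {
  // Order quoted in the comment as the expected (pre-3.0 checkpoint) order.
  val checkpointOrder: Seq[String] =
    Seq("timestamp", "partition", "timestampType", "key", "offset", "topic", "value")

  // Order quoted in the comment as the order after SPARK-31292.
  val changedOrder: Seq[String] =
    Seq("key", "value", "topic", "partition", "offset", "timestamp", "timestampType")

  // Hypothetical helper: order-insensitive check, which cannot see the problem.
  def sameColumns(expected: Seq[String], actual: Seq[String]): Boolean =
    expected.toSet == actual.toSet

  // Hypothetical helper: positional check, which fails when only the order changes.
  def sameLayout(expected: Seq[String], actual: Seq[String]): Boolean =
    expected == actual

  def main(args: Array[String]): Unit = {
    println(s"same columns: ${sameColumns(checkpointOrder, changedOrder)}")
    println(s"same layout:  ${sameLayout(checkpointOrder, changedOrder)}")
    // `distinct` keeps first-occurrence order.
    println(changedOrder.distinct)
    // `toSet.toSeq` keeps the same elements but makes no promise about their order.
    println(changedOrder.toSet.toSeq)
  }
}
```

Both lists contain the same seven columns, so a set-based comparison passes while a positional one does not. That is consistent with the comment's point that a test which never checks deduplicated answers against an old checkpoint would not notice the reordering.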
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28619: [SPARK-21040][CORE] Speculate tasks which are running on decommission executors
AmplabJenkins removed a comment on pull request #28619: URL: https://github.com/apache/spark/pull/28619#issuecomment-643916951
[GitHub] [spark] xuanyuanking commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
xuanyuanking commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643916855

```
How we plan to consolidate both? How we will write JIRA title/description and PR title/description? Which is the type of the consolidated issue? Is the consolidated issue a blocker?
```
Here's my plan to consolidate both: https://github.com/apache/spark/pull/28707#issuecomment-643916110. I will also add this to the JIRA and PR descriptions. Yes, #28707 is blocked by this fix.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28829: [WIP][SQL] Benchmark the EXCEPTION rebase mode
AmplabJenkins removed a comment on pull request #28829: URL: https://github.com/apache/spark/pull/28829#issuecomment-643916877
[GitHub] [spark] SparkQA commented on pull request #28829: [WIP][SQL] Benchmark the EXCEPTION rebase mode
SparkQA commented on pull request #28829: URL: https://github.com/apache/spark/pull/28829#issuecomment-643916564 **[Test build #124033 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124033/testReport)** for PR 28829 at commit [`16e90be`](https://github.com/apache/spark/commit/16e90bebf9314105d20c581a07120adb6d288e0b).
[GitHub] [spark] AmplabJenkins commented on pull request #28829: [WIP][SQL] Benchmark the EXCEPTION rebase mode
AmplabJenkins commented on pull request #28829: URL: https://github.com/apache/spark/pull/28829#issuecomment-643916882 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/28652/ Test PASSed.
[GitHub] [spark] HyukjinKwon commented on pull request #17953: [SPARK-20680][SQL] Spark-sql do not support for void column datatype …
HyukjinKwon commented on pull request #17953: URL: https://github.com/apache/spark/pull/17953#issuecomment-643916503 Yeah .. I personally support this change FWIW.
[GitHub] [spark] AmplabJenkins commented on pull request #28619: [SPARK-21040][CORE] Speculate tasks which are running on decommission executors
AmplabJenkins commented on pull request #28619: URL: https://github.com/apache/spark/pull/28619#issuecomment-643916951
[GitHub] [spark] SparkQA commented on pull request #28619: [SPARK-21040][CORE] Speculate tasks which are running on decommission executors
SparkQA commented on pull request #28619: URL: https://github.com/apache/spark/pull/28619#issuecomment-643916615 **[Test build #124034 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124034/testReport)** for PR 28619 at commit [`4affa58`](https://github.com/apache/spark/commit/4affa58f95f893ef6de1c1bf1c6b731468a2519d).
[GitHub] [spark] HyukjinKwon commented on pull request #27805: [SPARK-31056][SQL] Add CalendarIntervals division
HyukjinKwon commented on pull request #27805: URL: https://github.com/apache/spark/pull/27805#issuecomment-643915859 Do we have an answer to https://github.com/apache/spark/pull/27805#issuecomment-635381702? It's easier to justify with actual references and/or a standard.
[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
xuanyuanking commented on pull request #28707:
URL: https://github.com/apache/spark/pull/28707#issuecomment-643916110

A new regression bug, SPARK-31990, was found while investigating the test failure https://github.com/apache/spark/pull/28707#issuecomment-639861273. The root cause is that [this line](https://github.com/apache/spark/pull/28062/files#diff-7a46f10c3cedbf013cf255564d9483cdL2458) in SPARK-31292 changed the order of `groupCols` in `Deduplicate`, and the changed order breaks the validation logic here. In other words, without this PR, the executor JVM could crash, throw a random exception, or even return a wrong answer when using a checkpoint written by the previous version. So there are two pieces of work related to this PR:
- [ ] Fix and merge the compatibility issue in #28830
- [ ] Add a new test (or modify the current Kafka test) in #28725

--

### More detailed analysis:
The expected order of `Deduplicate.groupCols` in the UT `KafkaMicroBatchV2SourceSuite` is
```
[timestamp, partition, timestampType, key, offset, topic, value]
```
After the changes in SPARK-31292, the `groupCols` changed to
```
[key, value, topic, partition, offset, timestamp, timestampType]
```
Why didn't this incompatibility bug fail `KafkaMicroBatchV2SourceSuite` when it was merged? Because the UT `default config of includeHeader doesn't break the existing query from Spark 2.4` doesn't exercise the deduplication scenario and check the answer. Although the UT uses a checkpoint written by version 2.4.3 and a streaming deduplication operation, it only aims to prove that the new header (added in SPARK-23539) doesn't break the original checkpoint file.
[GitHub] [spark] MaxGekk commented on pull request #28829: [WIP][SQL] Benchmark the EXCEPTION rebase mode
MaxGekk commented on pull request #28829: URL: https://github.com/apache/spark/pull/28829#issuecomment-643915417 jenkins, retest this, please
[GitHub] [spark] Ngone51 commented on pull request #28619: [SPARK-21040][CORE] Speculate tasks which are running on decommission executors
Ngone51 commented on pull request #28619: URL: https://github.com/apache/spark/pull/28619#issuecomment-643915676 retest this please.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition
AmplabJenkins removed a comment on pull request #28642: URL: https://github.com/apache/spark/pull/28642#issuecomment-643914834
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition
HyukjinKwon commented on a change in pull request #28642:
URL: https://github.com/apache/spark/pull/28642#discussion_r439940687

## File path: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
## @@ -1039,7 +1039,7 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
 val pythonEvals = collect(joinNode.get) { case p: BatchEvalPythonExec => p }
-assert(pythonEvals.size == 2)
+assert(pythonEvals.size == 4)

Review comment: Yeah, I don't think it's more efficient to have more `BatchEvalPythonExec` nodes. It will require more Python executions, which aren't trivial.
[GitHub] [spark] AmplabJenkins commented on pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition
AmplabJenkins commented on pull request #28642: URL: https://github.com/apache/spark/pull/28642#issuecomment-643914834
[GitHub] [spark] SparkQA commented on pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition
SparkQA commented on pull request #28642: URL: https://github.com/apache/spark/pull/28642#issuecomment-643914470 **[Test build #124032 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124032/testReport)** for PR 28642 at commit [`65cd324`](https://github.com/apache/spark/commit/65cd324093fac15357fb0ca9bae7c524b40c).
[GitHub] [spark] HyukjinKwon commented on pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition
HyukjinKwon commented on pull request #28642: URL: https://github.com/apache/spark/pull/28642#issuecomment-643913716 retest this please
[GitHub] [spark] Ngone51 commented on pull request #28801: [SPARK-31970][CORE] Make MDC configuration step be consistent between setLocalProperty and log4j.properties
Ngone51 commented on pull request #28801: URL: https://github.com/apache/spark/pull/28801#issuecomment-643912320 thanks all!!
[GitHub] [spark] AmplabJenkins removed a comment on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block
AmplabJenkins removed a comment on pull request #27604: URL: https://github.com/apache/spark/pull/27604#issuecomment-643909975 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124026/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block
AmplabJenkins removed a comment on pull request #27604: URL: https://github.com/apache/spark/pull/27604#issuecomment-643909967 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block
AmplabJenkins commented on pull request #27604: URL: https://github.com/apache/spark/pull/27604#issuecomment-643909975 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124026/ Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block
SparkQA removed a comment on pull request #27604: URL: https://github.com/apache/spark/pull/27604#issuecomment-643877230 **[Test build #124026 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124026/testReport)** for PR 27604 at commit [`2e11d1b`](https://github.com/apache/spark/commit/2e11d1bedf15b59c89b1f686ea716a575802f1e6).
[GitHub] [spark] SparkQA commented on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block
SparkQA commented on pull request #27604: URL: https://github.com/apache/spark/pull/27604#issuecomment-643909627 **[Test build #124026 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124026/testReport)** for PR 27604 at commit [`2e11d1b`](https://github.com/apache/spark/commit/2e11d1bedf15b59c89b1f686ea716a575802f1e6). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] HyukjinKwon commented on pull request #28828: [SPARK-24634][SS][FOLLOWUP] Rename the variable from "numLateInputs" to "numDropppedRowsByWatermark"
HyukjinKwon commented on pull request #28828: URL: https://github.com/apache/spark/pull/28828#issuecomment-643906549 @xuanyuanking too FYI
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp
AmplabJenkins removed a comment on pull request #28821: URL: https://github.com/apache/spark/pull/28821#issuecomment-643904439 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124024/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp
AmplabJenkins removed a comment on pull request #28821: URL: https://github.com/apache/spark/pull/28821#issuecomment-643904434 Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp
SparkQA removed a comment on pull request #28821: URL: https://github.com/apache/spark/pull/28821#issuecomment-643865812 **[Test build #124024 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124024/testReport)** for PR 28821 at commit [`707b0cf`](https://github.com/apache/spark/commit/707b0cf949e2532429bdc62d7ef219fe98a0751e).
[GitHub] [spark] AmplabJenkins commented on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp
AmplabJenkins commented on pull request #28821: URL: https://github.com/apache/spark/pull/28821#issuecomment-643904434
[GitHub] [spark] SparkQA commented on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp
SparkQA commented on pull request #28821: URL: https://github.com/apache/spark/pull/28821#issuecomment-643904220 **[Test build #124024 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124024/testReport)** for PR 28821 at commit [`707b0cf`](https://github.com/apache/spark/commit/707b0cf949e2532429bdc62d7ef219fe98a0751e). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #28807: [SPARK-26905][SQL] Follow the SQL:2016 reserved keywords
AmplabJenkins commented on pull request #28807: URL: https://github.com/apache/spark/pull/28807#issuecomment-643899506
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28807: [SPARK-26905][SQL] Follow the SQL:2016 reserved keywords
AmplabJenkins removed a comment on pull request #28807: URL: https://github.com/apache/spark/pull/28807#issuecomment-643899506
[GitHub] [spark] maropu commented on a change in pull request #28807: [SPARK-26905][SQL] Follow the SQL:2016 reserved keywords
maropu commented on a change in pull request #28807:
URL: https://github.com/apache/spark/pull/28807#discussion_r439927771

## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/TableIdentifierParserSuite.scala
## @@ -388,12 +396,24 @@ class TableIdentifierParserSuite extends SparkFunSuite with SQLHelper {
 val reservedKeywordsInAnsiMode = allCandidateKeywords -- nonReservedKeywordsInAnsiMode
 test("check # of reserved keywords") {
-val numReservedKeywords = 78
+val numReservedKeywords = 74

Review comment: Note: `ANTI`, `SEMI`, `MINUS`, and `!` are removed.
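The arithmetic behind the count change in the diff above (78 down to 74, with four entries removed) can be sketched with ordinary Scala sets. The keyword lists below are hypothetical and heavily shortened, and the object and value names other than `allCandidateKeywords` are made up for illustration. The sketch also assumes the four entries leave the reserved set by joining the non-reserved set, which the review comment does not confirm.

```scala
// Sketch only: hypothetical, shortened keyword lists, not the real suite's data.
object ReservedKeywordCount {
  val allCandidateKeywords: Set[String] =
    Set("SELECT", "FROM", "WHERE", "ANTI", "SEMI", "MINUS", "!", "COMMENT")

  // Assumed non-reserved set before the change.
  val nonReservedBefore: Set[String] = Set("COMMENT")

  // The four entries named in the review comment.
  val noLongerReserved: Set[String] = Set("ANTI", "SEMI", "MINUS", "!")

  // Same set-difference shape as `allCandidateKeywords -- nonReservedKeywordsInAnsiMode`.
  val reservedBefore: Set[String] = allCandidateKeywords -- nonReservedBefore
  val reservedAfter: Set[String] =
    allCandidateKeywords -- (nonReservedBefore ++ noLongerReserved)

  def main(args: Array[String]): Unit = {
    // The reserved count drops by exactly the number of entries removed.
    println(reservedBefore.size - reservedAfter.size)
  }
}
```

With four entries leaving the reserved set, the expected count in the test falls by four, which matches 78 - 4 = 74 in the diff.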
[GitHub] [spark] SparkQA commented on pull request #28807: [SPARK-26905][SQL] Follow the SQL:2016 reserved keywords
SparkQA commented on pull request #28807: URL: https://github.com/apache/spark/pull/28807#issuecomment-643899210 **[Test build #124031 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124031/testReport)** for PR 28807 at commit [`eeceb30`](https://github.com/apache/spark/commit/eeceb30e050c26acdb93372eef0ce14410bd0159).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
AmplabJenkins removed a comment on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-643897872
[GitHub] [spark] AmplabJenkins commented on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
AmplabJenkins commented on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-643897872
[GitHub] [spark] SparkQA commented on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
SparkQA commented on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-643897635 **[Test build #124030 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124030/testReport)** for PR 28710 at commit [`2e6f35c`](https://github.com/apache/spark/commit/2e6f35c8e31fe1cde1637b922673339bfeef65fe).
[GitHub] [spark] huaxingao commented on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
huaxingao commented on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-643896578 retest this please
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28593: [SPARK-31710][SQL] Fail casting numeric to timestamp by default
AmplabJenkins removed a comment on pull request #28593: URL: https://github.com/apache/spark/pull/28593#issuecomment-643892810
[GitHub] [spark] AmplabJenkins commented on pull request #28593: [SPARK-31710][SQL] Fail casting numeric to timestamp by default
AmplabJenkins commented on pull request #28593: URL: https://github.com/apache/spark/pull/28593#issuecomment-643892810
[GitHub] [spark] SparkQA commented on pull request #28593: [SPARK-31710][SQL] Fail casting numeric to timestamp by default
SparkQA commented on pull request #28593: URL: https://github.com/apache/spark/pull/28593#issuecomment-643892530 **[Test build #124029 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124029/testReport)** for PR 28593 at commit [`8fe1960`](https://github.com/apache/spark/commit/8fe1960ef3a0c598a626b7024820b74cec787642).
[GitHub] [spark] dongjoon-hyun commented on pull request #24922: [SPARK-28120][SS] Rocksdb state storage implementation
dongjoon-hyun commented on pull request #24922: URL: https://github.com/apache/spark/pull/24922#issuecomment-643892244 Thank you for the update, @itsvikramagr .
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
AmplabJenkins removed a comment on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-643891541 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124021/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
AmplabJenkins removed a comment on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-643891538 Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
SparkQA removed a comment on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-643855623 **[Test build #124021 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124021/testReport)** for PR 28710 at commit [`2e6f35c`](https://github.com/apache/spark/commit/2e6f35c8e31fe1cde1637b922673339bfeef65fe).
[GitHub] [spark] SparkQA commented on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
SparkQA commented on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-643891334 **[Test build #124021 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124021/testReport)** for PR 28710 at commit [`2e6f35c`](https://github.com/apache/spark/commit/2e6f35c8e31fe1cde1637b922673339bfeef65fe). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
HeartSaVioR edited a comment on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-64318 How do we plan to consolidate both? How will we write the JIRA title/description and the PR title/description? What is the type of the consolidated issue? Is the consolidated issue a blocker? Things would be simpler if we merge the partial revert as it is, and spend our efforts discussing how to guide users on known issues - this is one of the candidates for Spark 3.0.0. This is clearly a bugfix for a "blocker" that prevents some end users from migrating to Spark 3.0.0, so it is worth having its own JIRA issue and its own commit. Sure, this may need to be placed in the migration guide or release notes as well. It would do no harm for #28707 to wait for this patch to be merged, and then rebase to fix the test failure.
[GitHub] [spark] GuoPhilipse commented on pull request #28593: [SPARK-31710][SQL] Fail casting numeric to timestamp by default
GuoPhilipse commented on pull request #28593: URL: https://github.com/apache/spark/pull/28593#issuecomment-643890774 It was generated by the set command; we have now removed it.
[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
HeartSaVioR edited a comment on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-643878976 I’m sorry, but version 4 doesn’t leverage UnsafeRow. (Version 2 did.) Please read the description carefully. As I commented earlier, there are still lots of possible improvements in metadata, but I don’t want to go through them unless we commit dedicated effort to reviewing. This is low-hanging fruit which brings a massive improvement.
[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
HeartSaVioR edited a comment on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-64318 How do we plan to consolidate both? How will we write the JIRA title/description and the PR title/description? What is the type of the consolidated issue? Is the consolidated issue a blocker? Things would be simpler if we merge the partial fix as it is, and spend our efforts discussing how to guide users on known issues - this is one of the candidates for Spark 3.0.0. This is clearly a bugfix for a "blocker" that prevents some end users from migrating to Spark 3.0.0. Sure, this may need to be placed in the migration guide or release notes as well. It would do no harm for #28707 to wait for this patch to be merged, and then rebase to fix the test failure.
[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
HeartSaVioR edited a comment on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-64318 How do we plan to consolidate both? How will we write the JIRA title/description and the PR title/description? What is the type of the consolidated issue? Is the consolidated issue a blocker? Things would be simpler if we merge the partial fix as it is, and spend our efforts discussing how to guide users on known issues - this is one of the candidates for Spark 3.0.0. This is clearly a bugfix for a "blocker" that prevents some end users from migrating to Spark 3.0.0, so it is worth having its own JIRA issue and its own commit. Sure, this may need to be placed in the migration guide or release notes as well. It would do no harm for #28707 to wait for this patch to be merged, and then rebase to fix the test failure.
[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
HeartSaVioR edited a comment on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-64318 How do we plan to consolidate both? How will we write the JIRA title/description and the PR title/description? What is the type of the consolidated issue? Is the consolidated issue a blocker? Things would be simpler if we merge the partial fix as it is, and spend our efforts discussing how to guide users on the known issue - this is one of the candidates for Spark 3.0.0. This is clearly a bugfix for a "blocker" that prevents some end users from migrating to Spark 3.0.0. Sure, this may need to be placed in the migration guide or release notes as well. It would do no harm for #28707 to wait for this patch to be merged, and then rebase to fix the test failure.
[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
HeartSaVioR edited a comment on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-64318 How do we plan to consolidate both? How will we write the JIRA title/description and the PR title/description? What is the type of the consolidated issue? Is the consolidated issue a blocker? Things would be simpler if we merge the partial fix as it is, and spend our efforts discussing how to guide users on the known issue - this is one of the candidates for Spark 3.0.0. This is clearly a bugfix for a "blocker" that prevents some end users from migrating to Spark 3.0.0. It would do no harm for #28707 to wait for this patch to be merged, and then rebase to fix the test failure.
[GitHub] [spark] HeartSaVioR commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
HeartSaVioR commented on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-64318 How do we plan to consolidate both? How will we write the JIRA title/description and the PR title/description? What is the type of the consolidated issue? Is the consolidated issue a blocker? Things would be simple if we merge the partial fix as it is, and spend our efforts discussing how to guide users on the known issue - this is one of the candidates for Spark 3.0.0. This is clearly a bugfix for a "blocker" that prevents some end users from migrating to Spark 3.0.0. It would do no harm for #28707 to wait for this patch to be merged, and then rebase to fix the test failure.
[GitHub] [spark] AmplabJenkins commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
AmplabJenkins commented on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-643887374
[GitHub] [spark] AmplabJenkins removed a comment on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
AmplabJenkins removed a comment on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-643887374
[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r439917190
## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ##
@@ -124,11 +153,24 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
     val hiveVersion = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version
     val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
     val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
+    logDebug(s"path '${path.toString}', staging dir '$stagingDir', " +
+      s"scratch dir '$scratchDir' are used")
     if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) {
       oldVersionExternalTempPath(path, hadoopConf, scratchDir)
     } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) {
Review comment: Got it. I added the description "This option is supported in Hive 2.0 or later." in SQLConf.scala. https://github.com/apache/spark/pull/27690/files#diff-9a6b543db706f1a90f790783d6930a13R849
[GitHub] [spark] SparkQA commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
SparkQA commented on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-643887119 **[Test build #124028 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124028/testReport)** for PR 27690 at commit [`0fbeaf3`](https://github.com/apache/spark/commit/0fbeaf374bf35a7d0cde2b3340d9f3c4551cbdb2).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28786: [SPARK-31925][ML] Summary.totalIterations greater than maxIters
AmplabJenkins removed a comment on pull request #28786: URL: https://github.com/apache/spark/pull/28786#issuecomment-643885908
[GitHub] [spark] SparkQA removed a comment on pull request #28786: [SPARK-31925][ML] Summary.totalIterations greater than maxIters
SparkQA removed a comment on pull request #28786: URL: https://github.com/apache/spark/pull/28786#issuecomment-643867351 **[Test build #124025 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124025/testReport)** for PR 28786 at commit [`4c4d52b`](https://github.com/apache/spark/commit/4c4d52b91e1ebbd018835c3bb2cd565df79bd430).
[GitHub] [spark] AmplabJenkins commented on pull request #28786: [SPARK-31925][ML] Summary.totalIterations greater than maxIters
AmplabJenkins commented on pull request #28786: URL: https://github.com/apache/spark/pull/28786#issuecomment-643885908
[GitHub] [spark] SparkQA commented on pull request #28786: [SPARK-31925][ML] Summary.totalIterations greater than maxIters
SparkQA commented on pull request #28786: URL: https://github.com/apache/spark/pull/28786#issuecomment-643885633 **[Test build #124025 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124025/testReport)** for PR 28786 at commit [`4c4d52b`](https://github.com/apache/spark/commit/4c4d52b91e1ebbd018835c3bb2cd565df79bd430). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] maropu commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
maropu commented on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643885408
> Thanks for the quick fix @maropu! I think maybe we can simplify the bugfix by combining it together with #28707. WDYT? I'll also reference this PR with #28707.
@xuanyuanking yea, looks fine to me. Could you take this over? Thanks, anyway!
[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r439913882
## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ##
@@ -839,6 +839,17 @@ object SQLConf {
       .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
       .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)
+  val HIVE_SUPPORTED_SCHEMES_TO_USE_NONBLOBSTORE =
+    buildConf("spark.sql.hive.supportedSchemesToUseNonBlobstore")
+      .doc("Comma-separated list of supported blobstore schemes (e.g. 's3,s3a'). " +
+        "If any blobstore schemes are specified, this feature is enabled. " +
+        "When writing data out to a Hive table, " +
+        "Spark writes the data first into non blobstore storage, and then moves it to blobstore. " +
+        "By default, this option is set to empty. It means this feature is disabled.")
+      .version("3.1.0")
+      .stringConf
+      .createWithDefault("")
Review comment: Note: I am not 100% sure whether all these blob storage systems have similar characteristics, so I am not sure this option is effective for every one of them. At least, this option is effective for Amazon S3.
[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r439913383
## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ##
@@ -839,6 +839,17 @@ object SQLConf {
       .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
       .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)
+  val HIVE_SUPPORTED_SCHEMES_TO_USE_NONBLOBSTORE =
+    buildConf("spark.sql.hive.supportedSchemesToUseNonBlobstore")
+      .doc("Comma-separated list of supported blobstore schemes (e.g. 's3,s3a'). " +
+        "If any blobstore schemes are specified, this feature is enabled. " +
+        "When writing data out to a Hive table, " +
+        "Spark writes the data first into non blobstore storage, and then moves it to blobstore. " +
+        "By default, this option is set to empty. It means this feature is disabled.")
+      .version("3.1.0")
+      .stringConf
+      .createWithDefault("")
Review comment: Users can specify any blob storage scheme, such as the following. If the copy operation is expensive in the storage system, this option will be effective.
- Amazon S3: `s3`, `s3a`, `s3n`
- Azure Blob Storage: `wasb`, `wasbs`
- Google Cloud Storage: `gs`
- Databricks: `dbfs`
- OpenStack: `swift`
Since any scheme can be used, I believe we cannot define a specific set of supported schemes here. That's why I just listed samples in SQLConf.scala.
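As a rough illustration of how a comma-separated scheme list like the one in the `.doc(...)` text above could be interpreted, here is a minimal standalone sketch. It is not the PR's implementation: `BlobstoreSchemes`, `parseSchemes` and `usesConfiguredBlobstore` are hypothetical names introduced only for this example, and the only library used is `java.net.URI` from the JDK.

```scala
import java.net.URI
import java.util.Locale

// Hypothetical helper object, not part of Spark or of PR #27690.
object BlobstoreSchemes {
  // Split the comma-separated config value into a set of lower-cased schemes.
  // An empty value yields an empty set, which corresponds to "feature disabled".
  def parseSchemes(confValue: String): Set[String] =
    confValue.split(",").map(_.trim.toLowerCase(Locale.ROOT)).filter(_.nonEmpty).toSet

  // True when the path has a URI scheme and that scheme is in the configured list.
  def usesConfiguredBlobstore(path: String, confValue: String): Boolean = {
    val scheme = Option(new URI(path).getScheme).map(_.toLowerCase(Locale.ROOT))
    scheme.exists(parseSchemes(confValue).contains)
  }
}
```

With a value such as `"s3,s3a"`, a path like `s3a://bucket/table` would match while an `hdfs://` path would not; with the default empty value nothing matches.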
[GitHub] [spark] AmplabJenkins commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
AmplabJenkins commented on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643882434
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
AmplabJenkins removed a comment on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643882434
[GitHub] [spark] SparkQA commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
SparkQA commented on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643882321 **[Test build #124027 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124027/testReport)** for PR 28830 at commit [`7546ba4`](https://github.com/apache/spark/commit/7546ba4eebeee480d9a2ff8b948e900cd6023dfc).
[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates
HeartSaVioR edited a comment on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643880332 +1 to a partial revert, which should also be OK with the author. (I guess the change was applied simply by pattern, not for some intended improvement, so it should be no problem for the author either.)
[GitHub] [spark] HeartSaVioR commented on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates
HeartSaVioR commented on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643880332 +1 to a partial revert, which should also be OK with the author. (I guess the change was applied simply by pattern, not for some outstanding improvement, so it should be no problem for the author either.)
[GitHub] [spark] xuanyuanking commented on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates
xuanyuanking commented on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643880347 Yep, I think just reverting that part is good enough. I will give more context and details on #28707.
[GitHub] [spark] dongjoon-hyun commented on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates
dongjoon-hyun commented on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643880429 Ya. +1 for partial revert in this PR.
[GitHub] [spark] xuanyuanking commented on a change in pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates
xuanyuanking commented on a change in pull request #28830: URL: https://github.com/apache/spark/pull/28830#discussion_r439910372
## File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ##
@@ -2548,6 +2548,21 @@ class DataFrameSuite extends QueryTest
assert(df.schema === new StructType().add(StructField("d", DecimalType(38, 0
}
}
+
+ test("SPARK-31990: preserves the input order of colNames in dropDuplicates") {
+val df = Seq((1, 2, 3, 4, 5), (1, 2, 3, 4, 5)).toDF("c", "e", "d", "a", "b")
+val inputColNames = Seq("c", "b", "c", "d", "b", "c", "b")
Review comment: Thanks for adding a new UT here. Since this issue was found when investigating the test failure in https://github.com/apache/spark/pull/28707#issuecomment-639861273, how about reusing the UT `default config of includeHeader doesn't break existing query from Spark 2.4` in `KafkaMicroBatchV2SourceSuite`? I think we don't need to add a new UT for this regression after #28707. That is to say, once #28707 is merged, the mentioned UT will fail if we don't apply the fix.
## File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ##
@@ -2541,7 +2542,20 @@ class Dataset[T] private[sql](
def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
val resolver = sparkSession.sessionState.analyzer.resolver
val allColumns = queryExecution.analyzed.output
-val groupCols = colNames.distinct.flatMap { (colName: String) =>
+// SPARK-31990: We must preserve the input order of `colNames` because of the compatibility
+// issue (the Streaming's state store depends on the `groupCols` order).
+val orderPreservingDistinctColNames = {
+ val nameSeen = mutable.Set[String]()
Review comment: How about simply reverting this line to https://github.com/apache/spark/pull/28062/files#diff-7a46f10c3cedbf013cf255564d9483cdL2458 and using the original `toSet` implementation? Yes, `toSet.toSeq` might be incompatible across Scala versions, but I think the current fix should just keep the original order. Detecting order changes and adding solid validation should be the work of SPARK-31894 and SPARK-27237. WDYT?
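The ordering difference discussed in this review comment can be illustrated with a small Scala sketch. It is not the Spark code itself: the object name and the column list are hypothetical, loosely modelled on the Kafka source columns mentioned in the related analysis. The point it shows is that `distinct` keeps first-occurrence order, while the order produced by `toSet.toSeq` depends on the `Set` implementation and should not be relied on.

```scala
object DedupOrderDemo {
  // Hypothetical column names with one duplicate ("key").
  val colNames: Seq[String] =
    Seq("key", "value", "topic", "partition", "offset", "timestamp", "timestampType", "key")

  // `distinct` keeps the first occurrence of each element, in input order.
  val viaDistinct: Seq[String] = colNames.distinct

  // `toSet.toSeq` also removes duplicates, but the resulting order is an
  // implementation detail of the immutable Set, not part of its contract.
  val viaSet: Seq[String] = colNames.toSet.toSeq

  def main(args: Array[String]): Unit = {
    println(s"distinct:    ${viaDistinct.mkString(", ")}")
    println(s"toSet.toSeq: ${viaSet.mkString(", ")}")
    // Both contain the same elements; only the order may differ.
    println(s"same elements: ${viaDistinct.toSet == viaSet.toSet}")
  }
}
```

Because state written under one ordering is read back positionally, switching between the two expressions changes what an existing checkpoint means, even though both yield the same set of columns.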
[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
HeartSaVioR edited a comment on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-643878976 I’m sorry, but version 4 doesn’t leverage UnsafeRow. (Version 2 did.) Please read the description carefully. As I commented earlier, there are still lots of possible improvements in metadata, but I don’t want to go through them unless we commit dedicated effort to reviewing. This is low-hanging fruit which brings a massive improvement.
[GitHub] [spark] gatorsmile commented on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates
gatorsmile commented on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643879059 Yes. I prefer reverting the original fix in 3.0.1 and then discussing how to solve/avoid the problems in a proper way.
[GitHub] [spark] maropu commented on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates
maropu commented on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643879180 okay, I'll revert that part in this PR first.
[GitHub] [spark] HeartSaVioR commented on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
HeartSaVioR commented on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-643878976 I’m sorry, but version 4 doesn’t leverage UnsafeRow. (Version 2 did.) Please read the description carefully. As I commented earlier, there are still lots of possible improvements in metadata, but I don’t want to go through them unless we commit dedicated effort to reviewing.
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates
dongjoon-hyun edited a comment on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643878606 Hi, All. This issue is marked as a hotfix for the blocker issue, but validating it looks non-trivial. Since `toSet.toSeq` has been used since Apache Spark 2.2.0 (SPARK-19497) and SPARK-31292 is just an `Improvement` issue with `Trivial` priority, I'd like to propose reverting SPARK-31292 from `branch-3.0` first. We will still keep SPARK-31292 in the `master` branch and proceed with @maropu's PR to find a better way for Apache Spark 3.1.0. I know that reverting is not a good solution for the original author, as mentioned by @HeartSaVioR on the dev mailing list, but I believe it is the proper way in this case to cut Apache Spark 3.0.1. What do you think about that?
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates
dongjoon-hyun edited a comment on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643878606
[GitHub] [spark] dongjoon-hyun commented on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates
dongjoon-hyun commented on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643878606 Hi, All. This issue is marked as a hotfix for the blocker issue, but validating it looks non-trivial. Since `toSet.toSeq` has been used since Apache Spark 2.2.0 (SPARK-19497) and SPARK-31292 is just a `Trivial` `Improvement` issue, I'd like to propose reverting SPARK-31292 from `branch-3.0` first. We will still keep SPARK-31292 in the `master` branch and proceed with this PR to find a better way for Apache Spark 3.1.0. I know that reverting is not a good solution for the original author, as mentioned by @HeartSaVioR on the dev mailing list, but I believe it is the proper way in this case to cut Apache Spark 3.0.1. What do you think about that?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block
AmplabJenkins removed a comment on pull request #27604: URL: https://github.com/apache/spark/pull/27604#issuecomment-643877494
[GitHub] [spark] AmplabJenkins commented on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block
AmplabJenkins commented on pull request #27604: URL: https://github.com/apache/spark/pull/27604#issuecomment-643877494
[GitHub] [spark] uncleGen commented on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
uncleGen commented on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-643877261 @HeartSaVioR Thanks for your efforts. The result (version 4) is very impressive. Overall, it makes sense to me, but we should resolve the concern about using `UnsafeRow`. I am not very familiar with the history of the discussion about `UnsafeRow`. By the way, is there any value in, or plan for, using this [idea](https://github.com/apache/spark/pull/24128#issuecomment-558548047)?
[GitHub] [spark] SparkQA commented on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block
SparkQA commented on pull request #27604: URL: https://github.com/apache/spark/pull/27604#issuecomment-643877230 **[Test build #124026 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124026/testReport)** for PR 27604 at commit [`2e11d1b`](https://github.com/apache/spark/commit/2e11d1bedf15b59c89b1f686ea716a575802f1e6).
[GitHub] [spark] iRakson commented on pull request #26901: [SPARK-29152][CORE][2.4] Executor Plugin shutdown when dynamic allocation is enabled
iRakson commented on pull request #26901: URL: https://github.com/apache/spark/pull/26901#issuecomment-643876875 @dongjoon-hyun Its behaviour is pretty confusing. But yeah, if this is breaking the branch again then we should not keep it. Yes, this patch failed twice, so we must move on. Thank you for actively monitoring this patch. :) :)
[GitHub] [spark] maropu commented on a change in pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates
maropu commented on a change in pull request #28830: URL: https://github.com/apache/spark/pull/28830#discussion_r439907906
## File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ##
@@ -2541,7 +2542,20 @@ class Dataset[T] private[sql](
def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
val resolver = sparkSession.sessionState.analyzer.resolver
val allColumns = queryExecution.analyzed.output
-val groupCols = colNames.distinct.flatMap { (colName: String) =>
+// SPARK-31990: We must preserve the input order of `colNames` because of the compatibility
+// issue (the Streaming's state store depends on the `groupCols` order).
+val orderPreservingDistinctColNames = {
+ val nameSeen = mutable.Set[String]()
Review comment: Ah, I see. Yea, I'll update the code based on `toSet.toSeq`.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates
dongjoon-hyun commented on a change in pull request #28830: URL: https://github.com/apache/spark/pull/28830#discussion_r439907916
## File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ##
@@ -2548,6 +2548,21 @@ class DataFrameSuite extends QueryTest
assert(df.schema === new StructType().add(StructField("d", DecimalType(38, 0
}
}
+
+ test("SPARK-31990: preserves the input order of colNames in dropDuplicates") {
+val df = Seq((1, 2, 3, 4, 5), (1, 2, 3, 4, 5)).toDF("c", "e", "d", "a", "b")
+val inputColNames = Seq("c", "b", "c", "d", "b", "c", "b")
Review comment: BTW, @HeartSaVioR, is there a test case failure when using a checkpoint from the same Spark version? I'm curious whether this only occurs between different Spark versions.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates
dongjoon-hyun commented on a change in pull request #28830: URL: https://github.com/apache/spark/pull/28830#discussion_r439907052
## File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ##
@@ -2541,7 +2542,20 @@ class Dataset[T] private[sql](
def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
val resolver = sparkSession.sessionState.analyzer.resolver
val allColumns = queryExecution.analyzed.output
-val groupCols = colNames.distinct.flatMap { (colName: String) =>
+// SPARK-31990: We must preserve the input order of `colNames` because of the compatibility
+// issue (the Streaming's state store depends on the `groupCols` order).
+val orderPreservingDistinctColNames = {
+ val nameSeen = mutable.Set[String]()
Review comment: The reported issue claims that the Scala `distinct` function was not enough. That's why I asked `Is there a change?` to fix the Spark issue. As @HeartSaVioR commented (https://github.com/apache/spark/pull/28830#discussion_r439904302), we need different code.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates
dongjoon-hyun commented on a change in pull request #28830: URL: https://github.com/apache/spark/pull/28830#discussion_r439907052
## File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ##
@@ -2541,7 +2542,20 @@ class Dataset[T] private[sql](
def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
val resolver = sparkSession.sessionState.analyzer.resolver
val allColumns = queryExecution.analyzed.output
-val groupCols = colNames.distinct.flatMap { (colName: String) =>
+// SPARK-31990: We must preserve the input order of `colNames` because of the compatibility
+// issue (the Streaming's state store depends on the `groupCols` order).
+val orderPreservingDistinctColNames = {
+ val nameSeen = mutable.Set[String]()
Review comment: The reported issue claims that the Scala `distinct` function was not enough. That's why I asked `What is the difference to fix Spark issue`. As @HeartSaVioR commented (https://github.com/apache/spark/pull/28830#discussion_r439904302), we need different code.
[GitHub] [spark] HeartSaVioR commented on a change in pull request #28830: [SPARK-31990][SQL][SS] Preserves the input order of colNames in dropDuplicates
HeartSaVioR commented on a change in pull request #28830: URL: https://github.com/apache/spark/pull/28830#discussion_r439905543
## File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ##
@@ -2541,7 +2542,20 @@ class Dataset[T] private[sql](
def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
val resolver = sparkSession.sessionState.analyzer.resolver
val allColumns = queryExecution.analyzed.output
-val groupCols = colNames.distinct.flatMap { (colName: String) =>
+// SPARK-31990: We must preserve the input order of `colNames` because of the compatibility
+// issue (the Streaming's state store depends on the `groupCols` order).
+val orderPreservingDistinctColNames = {
+ val nameSeen = mutable.Set[String]()
Review comment: Oh, I didn't see @maropu's comment while I was commenting. ;) Thanks for explaining.
[GitHub] [spark] cloud-fan commented on pull request #28593: [SPARK-31710][SQL] Fail casting numeric to timestamp by default
cloud-fan commented on pull request #28593: URL: https://github.com/apache/spark/pull/28593#issuecomment-643873707 Why are there empty golden files generated in `sql/hive/src/test/resources/golden`?
[GitHub] [spark] iRakson commented on pull request #28752: [SPARK-31983] Fix Sorting for duration column and make Status column sortable
iRakson commented on pull request #28752: URL: https://github.com/apache/spark/pull/28752#issuecomment-643873658 Thank You. @srowen @sarutak.
[GitHub] [spark] iRakson commented on pull request #28823: [SPARK-31983][WEBUI][3.0] Fix sorting for duration column in structured streaming tab
iRakson commented on pull request #28823: URL: https://github.com/apache/spark/pull/28823#issuecomment-643873542 Thank You. @srowen @sarutak :)
[GitHub] [spark] maropu commented on a change in pull request #28807: [SPARK-26905][SQL] Follow the SQL:2016 reserved keywords
maropu commented on a change in pull request #28807: URL: https://github.com/apache/spark/pull/28807#discussion_r439905202
## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/TableIdentifierParserSuite.scala ##
@@ -388,12 +391,24 @@ class TableIdentifierParserSuite extends SparkFunSuite with SQLHelper {
val reservedKeywordsInAnsiMode = allCandidateKeywords -- nonReservedKeywordsInAnsiMode
test("check # of reserved keywords") {
-val numReservedKeywords = 78
+val numReservedKeywords = 75
assert(reservedKeywordsInAnsiMode.size == numReservedKeywords,
s"The expected number of reserved keywords is $numReservedKeywords, but " +
s"${reservedKeywordsInAnsiMode.size} found.")
}
+ test("should follow reserved keywords in SQL:2016") {
+withTempDir { dir =>
+ val tmpFile = new File(dir, "tmp")
+ val is = Thread.currentThread().getContextClassLoader
+.getResourceAsStream("ansi-sql-2016-reserved-keywords.txt")
+ Files.copy(is, tmpFile.toPath)
+ val reservedKeywordsInSql2016 = Files.readAllLines(tmpFile.toPath)
+.asScala.filterNot(_.startsWith("--")).map(_.trim).toSet
+ assert(((reservedKeywordsInAnsiMode -- Set("!")) -- reservedKeywordsInSql2016).isEmpty)
Review comment: Yea, will do.
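The check in the quoted test boils down to a set difference: parse the standard's keyword list (skipping `--` comment lines), remove any exempted tokens from the reserved set, and assert that nothing remains outside the standard. A minimal standalone sketch of that logic follows; it works on in-memory lines rather than the `ansi-sql-2016-reserved-keywords.txt` resource, and the object name, method names, and sample keywords are all hypothetical.

```scala
object ReservedKeywordCheck {
  // Parses keyword lines: drops `--` comment lines and trims whitespace.
  def parseKeywords(lines: Seq[String]): Set[String] =
    lines.filterNot(_.startsWith("--")).map(_.trim).toSet

  // Returns the reserved keywords that are neither exempt nor in the standard.
  def outsideStandard(
      reserved: Set[String],
      standard: Set[String],
      exempt: Set[String]): Set[String] =
    (reserved -- exempt) -- standard

  def main(args: Array[String]): Unit = {
    // Sample data only; not the real SQL:2016 list.
    val standard = parseKeywords(Seq("-- sample keyword list", "SELECT", " FROM ", "WHERE"))
    val reserved = Set("SELECT", "FROM", "!")
    // An empty result means every non-exempt reserved keyword is in the standard.
    println(outsideStandard(reserved, standard, exempt = Set("!")))
  }
}
```

Returning the offending set instead of asserting `isEmpty` directly makes a failure message more useful, since the unexpected keywords can be printed.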
[GitHub] [spark] AmplabJenkins commented on pull request #28828: [SPARK-24634][SS][FOLLOWUP] Rename the variable from "numLateInputs" to "numDropppedRowsByWatermark"
AmplabJenkins commented on pull request #28828: URL: https://github.com/apache/spark/pull/28828#issuecomment-643873268
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28828: [SPARK-24634][SS][FOLLOWUP] Rename the variable from "numLateInputs" to "numDropppedRowsByWatermark"
AmplabJenkins removed a comment on pull request #28828: URL: https://github.com/apache/spark/pull/28828#issuecomment-643873268
[GitHub] [spark] maropu commented on a change in pull request #28807: [SPARK-26905][SQL] Follow the SQL:2016 reserved keywords
maropu commented on a change in pull request #28807: URL: https://github.com/apache/spark/pull/28807#discussion_r439905098
## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/TableIdentifierParserSuite.scala ##
@@ -388,12 +391,24 @@ class TableIdentifierParserSuite extends SparkFunSuite with SQLHelper {
val reservedKeywordsInAnsiMode = allCandidateKeywords -- nonReservedKeywordsInAnsiMode
test("check # of reserved keywords") {
-val numReservedKeywords = 78
+val numReservedKeywords = 75
assert(reservedKeywordsInAnsiMode.size == numReservedKeywords,
s"The expected number of reserved keywords is $numReservedKeywords, but " +
s"${reservedKeywordsInAnsiMode.size} found.")
}
+ test("should follow reserved keywords in SQL:2016") {
Review comment: Looks clearer, okay, I'll update. Thanks!
[GitHub] [spark] HeartSaVioR commented on a change in pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates
HeartSaVioR commented on a change in pull request #28830: URL: https://github.com/apache/spark/pull/28830#discussion_r439904837
## File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ##
@@ -2541,7 +2542,20 @@ class Dataset[T] private[sql](
def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
val resolver = sparkSession.sessionState.analyzer.resolver
val allColumns = queryExecution.analyzed.output
-val groupCols = colNames.distinct.flatMap { (colName: String) =>
+// SPARK-31990: We must preserve the input order of `colNames` because of the compatibility
+// issue (the Streaming's state store depends on the `groupCols` order).
+val orderPreservingDistinctColNames = {
+ val nameSeen = mutable.Set[String]()
Review comment: So consider this a manual implementation of `distinct`, so that we are not even affected by Scala changes.
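The quoted hunk stops right after `nameSeen` is declared, so the rest of the PR's implementation is not visible here. As a rough illustration of what a "manual implementation of distinct" could look like, the following sketch tracks seen names in a mutable set and keeps only first occurrences; the object and method names are hypothetical and this is not claimed to match the PR's code.

```scala
import scala.collection.mutable

object OrderPreservingDistinct {
  // Keeps the first occurrence of each name, in input order, without
  // relying on the behaviour of `Seq.distinct` or `Set` iteration order.
  def distinctPreservingOrder(names: Seq[String]): Seq[String] = {
    val seen = mutable.HashSet.empty[String]
    val result = Seq.newBuilder[String]
    names.foreach { name =>
      // `add` returns true only if the name was not already in the set.
      if (seen.add(name)) result += name
    }
    result.result()
  }

  def main(args: Array[String]): Unit = {
    // Same input as `inputColNames` in the PR's new unit test.
    println(distinctPreservingOrder(Seq("c", "b", "c", "d", "b", "c", "b")))
  }
}
```

An explicit builder loop is used rather than a side-effecting `filter`, so the result does not depend on whether the input `Seq` is strict or lazy.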
[GitHub] [spark] SparkQA removed a comment on pull request #28828: [SPARK-24634][SS][FOLLOWUP] Rename the variable from "numLateInputs" to "numDropppedRowsByWatermark"
SparkQA removed a comment on pull request #28828: URL: https://github.com/apache/spark/pull/28828#issuecomment-643827509 **[Test build #124015 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124015/testReport)** for PR 28828 at commit [`ca3b3de`](https://github.com/apache/spark/commit/ca3b3de653a92090db33ca8282eea18b75ff2420).