[GitHub] [spark] HeartSaVioR commented on a change in pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
HeartSaVioR commented on a change in pull request #28830: URL: https://github.com/apache/spark/pull/28830#discussion_r439965525

## File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ##
@@ -2541,7 +2541,9 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must keep `toSet.toSeq` here because of the backward compatibility issue
+    // (the Streaming's state store depends on the `groupCols` order).
+    val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>

Review comment: Worth noting that we will eventually need a "concrete" solution: if the columns all have the same type, neither #28830 nor #24173 catches the change, and the result becomes silently incorrect. I roughly remember a similar issue on PySpark, which was trying to fix a problem of order vs. name; I don't remember how it ended up. cc. @HyukjinKwon

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
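To make the "silently incorrect" concern concrete, here is a minimal plain-Scala sketch. It does not use Spark: `StoredRow`, `write`, and `read` are hypothetical stand-ins for a state store that keeps values by position only, and the column names and orders are chosen for illustration. When every column has the same type, reading with a different column order raises no error; the values simply end up under the wrong names.

```scala
object PositionalStateDemo {
  // Hypothetical stand-in for a state-store row: values only, no column names.
  final case class StoredRow(values: Vector[String])

  // Write a record using the column order in effect when the checkpoint was written.
  def write(order: Seq[String], record: Map[String, String]): StoredRow =
    StoredRow(order.map(record).toVector)

  // Read a row back by pairing stored positions with the current column order.
  def read(order: Seq[String], row: StoredRow): Map[String, String] =
    order.zip(row.values).toMap

  def main(args: Array[String]): Unit = {
    val record = Map("key" -> "k1", "topic" -> "events", "value" -> "v1")
    val oldOrder = Seq("topic", "key", "value") // assumed order at checkpoint time
    val newOrder = Seq("key", "value", "topic") // assumed order after an upgrade

    val stored = write(oldOrder, record)
    println(read(oldOrder, stored)) // round-trips to the original record
    println(read(newOrder, stored)) // no exception, but values sit under the wrong names
  }
}
```

Because all three columns are strings, nothing in this sketch can tell the two orders apart, which is the case the comment says neither PR would catch.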
[GitHub] [spark] wangyum edited a comment on pull request #28032: [SPARK-31264][SQL] Repartition by dynamic partition columns before insert partition table
wangyum edited a comment on pull request #28032: URL: https://github.com/apache/spark/pull/28032#issuecomment-643936564 Yes, this strategy may introduce data skew, but skewed data only affects the job itself, whereas creating a large number of files affects the NameNode and therefore the stability of the whole cluster. In fact, our cluster already handles this kind of data skew, and we will contribute that to the community in the future if needed.
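The trade-off described here can be sketched without Spark. The following is a simplified model, assuming each write task emits one file per distinct partition value it sees; `fileCount` and `repartitionByKey` are hypothetical helpers, not Spark APIs. Repartitioning by the partition value reduces the number of files to one per value, at the cost of concentrating a hot value in a single task.

```scala
object DynamicPartitionFilesDemo {
  // Each inner Seq is one write task; each element is the partition value of one row.
  // Assumption of this model: a task writes one file per distinct partition value.
  def fileCount(tasks: Seq[Seq[String]]): Int = tasks.map(_.distinct.size).sum

  // Route every row to a task chosen by hashing its partition value.
  def repartitionByKey(tasks: Seq[Seq[String]], numTasks: Int): Seq[Seq[String]] = {
    val grouped = tasks.flatten.groupBy(v => Math.floorMod(v.hashCode, numTasks))
    (0 until numTasks).map(i => grouped.getOrElse(i, Seq.empty))
  }

  def main(args: Array[String]): Unit = {
    // "a" is a hot partition value that appears in every task.
    val before = Seq(
      Seq("a", "a", "a", "b", "c"),
      Seq("a", "a", "b", "c"),
      Seq("a", "a", "a", "c", "b"))
    val after = repartitionByKey(before, numTasks = 3)

    println(s"files before: ${fileCount(before)}")           // every task touches every value
    println(s"files after:  ${fileCount(after)}")            // one file per distinct value
    println(s"largest task after: ${after.map(_.size).max}") // the hot value lands in one task
  }
}
```

In this model the file count drops from nine to three, while the task that receives "a" holds at least eight of the fourteen rows, which is the skew raised in the thread.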
[GitHub] [spark] HeartSaVioR commented on a change in pull request #28828: [SPARK-24634][SS][FOLLOWUP] Rename the variable from "numLateInputs" to "numDropppedRowsByWatermark"
HeartSaVioR commented on a change in pull request #28828: URL: https://github.com/apache/spark/pull/28828#discussion_r439963800

## File path: sql/core/src/main/scala/org/apache/spark/sql/streaming/progress.scala ##
@@ -43,7 +43,7 @@ class StateOperatorProgress private[sql](
     val numRowsTotal: Long,
     val numRowsUpdated: Long,
     val memoryUsedBytes: Long,
-    val numLateInputs: Long,
+    val numDroppedRowsByWatermark: Long,

Review comment: Yeah, that sounds better for consistency (numRows + something explaining how the number is measured) and conveys the same meaning. Thanks for the suggestion.
[GitHub] [spark] viirya commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator
viirya commented on pull request #28704: URL: https://github.com/apache/spark/pull/28704#issuecomment-643937918 @srowen @huaxingao @zhengruifeng thanks for the previous review. Do you have more comments on this change?
[GitHub] [spark] wangyum commented on pull request #28032: [SPARK-31264][SQL] Repartition by dynamic partition columns before insert partition table
wangyum commented on pull request #28032: URL: https://github.com/apache/spark/pull/28032#issuecomment-643936564 Yes, this strategy may introduce data skew, but skewed data only affects the job itself, whereas creating a large number of files affects the NameNode and therefore the stability of the whole cluster.
[GitHub] [spark] xuanyuanking commented on a change in pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
xuanyuanking commented on a change in pull request #28830: URL: https://github.com/apache/spark/pull/28830#discussion_r439959045

## File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ##
@@ -2541,7 +2541,9 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must keep `toSet.toSeq` here because of the backward compatibility issue
+    // (the Streaming's state store depends on the `groupCols` order).
+    val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>

Review comment: Yep, I also mentioned this at https://github.com/apache/spark/pull/28830#discussion_r439909489; we can rely on the validation check and integration tests.
[GitHub] [spark] cloud-fan commented on a change in pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
cloud-fan commented on a change in pull request #28830: URL: https://github.com/apache/spark/pull/28830#discussion_r439956403

## File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ##
@@ -2541,7 +2541,9 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must keep `toSet.toSeq` here because of the backward compatibility issue
+    // (the Streaming's state store depends on the `groupCols` order).
+    val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>

Review comment: I think this is good for now. In the future, this may still be broken by a Scala version upgrade, and hopefully @xuanyuanking's unsafe row validation can detect it. Then we can change it to use a deterministic order, as it will be broken anyway.
[GitHub] [spark] EnricoMi commented on pull request #27805: [SPARK-31056][SQL] Add CalendarIntervals division
EnricoMi commented on pull request #27805: URL: https://github.com/apache/spark/pull/27805#issuecomment-643932097 I am not aware of any SQL standard going in this direction. I agree that the suggested UDFs give me the same result.
[GitHub] [spark] dongjoon-hyun commented on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp
dongjoon-hyun commented on pull request #28821: URL: https://github.com/apache/spark/pull/28821#issuecomment-643931292 Hi, @Fokko. The last failure means that we need to regenerate the output of `ThriftServerQueryTestSuite.sql`. To proceed, please update it.
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp
dongjoon-hyun edited a comment on pull request #28821: URL: https://github.com/apache/spark/pull/28821#issuecomment-643931292 Hi, @Fokko. The last failure means that we need to regenerate the output file of `ThriftServerQueryTestSuite.sql`. To proceed, please update it.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28781: [SPARK-31953][SS] Add Spark Structured Streaming History Server Support
AmplabJenkins removed a comment on pull request #28781: URL: https://github.com/apache/spark/pull/28781#issuecomment-643930261 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124023/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28781: [SPARK-31953][SS] Add Spark Structured Streaming History Server Support
AmplabJenkins removed a comment on pull request #28781: URL: https://github.com/apache/spark/pull/28781#issuecomment-643930255 Merged build finished. Test FAILed.
[GitHub] [spark] AngersZhuuuu commented on pull request #28032: [SPARK-31264][SQL] Repartition by dynamic partition columns before insert partition table
AngersZhuuuu commented on pull request #28032: URL: https://github.com/apache/spark/pull/28032#issuecomment-643930675

> @wangyum Thanks for your response. If the incoming data is not evenly distributed by the repartitioning key, wouldn't this strategy create issues when there is skew in the data? I.e., we may end up with potentially very large partitions?

+1 for this concern. Maybe we should add an adaptive procedure that adjusts the number of partitions for partition key values with a large amount of data.
[GitHub] [spark] HyukjinKwon edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
HyukjinKwon edited a comment on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643930058 I am okay with that; I was just wondering whether the previous behaviour was even deterministic, and SPARK-31292 looked more correct to me. Given that we're going to revisit this anyway, LGTM from me too.
[GitHub] [spark] AmplabJenkins commented on pull request #28781: [SPARK-31953][SS] Add Spark Structured Streaming History Server Support
AmplabJenkins commented on pull request #28781: URL: https://github.com/apache/spark/pull/28781#issuecomment-643930255
[GitHub] [spark] SparkQA removed a comment on pull request #28781: [SPARK-31953][SS] Add Spark Structured Streaming History Server Support
SparkQA removed a comment on pull request #28781: URL: https://github.com/apache/spark/pull/28781#issuecomment-643862941 **[Test build #124023 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124023/testReport)** for PR 28781 at commit [`c721d8c`](https://github.com/apache/spark/commit/c721d8c7299d503f1010cc8626ffb77d492c1bda).
[GitHub] [spark] MaxGekk commented on pull request #28829: [SPARK-31992][SQL] Benchmark the EXCEPTION rebase mode
MaxGekk commented on pull request #28829: URL: https://github.com/apache/spark/pull/28829#issuecomment-643930188 @cloud-fan Please review the PR.
[GitHub] [spark] HyukjinKwon commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
HyukjinKwon commented on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643930058 I am okay with that; I was just wondering whether the previous behaviour was even deterministic, and the current change looked more correct to me. Given that we're going to revisit this anyway, LGTM from me too.
[GitHub] [spark] SparkQA commented on pull request #28781: [SPARK-31953][SS] Add Spark Structured Streaming History Server Support
SparkQA commented on pull request #28781: URL: https://github.com/apache/spark/pull/28781#issuecomment-643929926

**[Test build #124023 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124023/testReport)** for PR 28781 at commit [`c721d8c`](https://github.com/apache/spark/commit/c721d8c7299d503f1010cc8626ffb77d492c1bda).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `trait TimestampFormatterHelper extends TimeZoneAwareExpression `
   * `case class WidthBucket(`
   * `trait PredicateHelper extends Logging `
   * `case class ProcessingTimeTrigger(intervalMs: Long) extends Trigger `
   * `case class ContinuousTrigger(intervalMs: Long) extends Trigger `
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
dongjoon-hyun edited a comment on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643929047
[GitHub] [spark] dongjoon-hyun commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
dongjoon-hyun commented on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643929047 The last commit tries to preserve the previous behavior (whatever it was) since Apache Spark 2.2.0, although there is no guarantee whether it is safe or not. We will revisit the correctness issue later, after 3.0.1.
[GitHub] [spark] HyukjinKwon commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
HyukjinKwon commented on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643927909 I am okay with reverting it for now, but I couldn't fully follow why we expect an explicit order from a set. Has it ever been guaranteed anywhere? With `distinct` we can expect a deterministic order, but we're reverting to a set because of the deterministic order (?).
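The difference being asked about can be seen in a few lines of plain Scala. This is only an illustrative sketch: the column list is made up from the Kafka column names mentioned elsewhere in this thread, with one duplicate added. `distinct` on a `Seq` keeps the first occurrence of each element in input order, whereas the order produced by `toSet.toSeq` is whatever the `Set` implementation iterates in, which the collection API does not promise.

```scala
object DedupOrderDemo {
  def main(args: Array[String]): Unit = {
    // Illustrative column list with one duplicate ("key").
    val colNames = Seq("key", "value", "topic", "partition", "offset",
      "timestamp", "timestampType", "key")

    // `distinct` keeps the first occurrence of each element, in input order.
    val viaDistinct = colNames.distinct

    // `toSet.toSeq` also removes duplicates, but the order is an implementation
    // detail of the Set and may differ from the input order.
    val viaSet = colNames.toSet.toSeq

    println(viaDistinct.mkString("[", ", ", "]"))
    println(viaSet.mkString("[", ", ", "]"))

    // Same elements either way; only the order may differ.
    assert(viaDistinct.toSet == viaSet.toSet)
  }
}
```

Both calls are repeatable on a given Scala version, so the compatibility concern is not randomness but that the two expressions can yield different orders, and that the set-derived order is tied to the library implementation.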
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28784: [SPARK-31957][SQL][test-maven] Cleanup hive scratch dir for the developer api startWithContext
AmplabJenkins removed a comment on pull request #28784: URL: https://github.com/apache/spark/pull/28784#issuecomment-643926776
[GitHub] [spark] AmplabJenkins commented on pull request #28784: [SPARK-31957][SQL][test-maven] Cleanup hive scratch dir for the developer api startWithContext
AmplabJenkins commented on pull request #28784: URL: https://github.com/apache/spark/pull/28784#issuecomment-643926776
[GitHub] [spark] SparkQA commented on pull request #28784: [SPARK-31957][SQL][test-maven] Cleanup hive scratch dir for the developer api startWithContext
SparkQA commented on pull request #28784: URL: https://github.com/apache/spark/pull/28784#issuecomment-643926432 **[Test build #124035 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124035/testReport)** for PR 28784 at commit [`d055d60`](https://github.com/apache/spark/commit/d055d60aab356401f0f7b00b83b76dc76df0c30c).
[GitHub] [spark] dilipbiswal commented on pull request #28032: [SPARK-31264][SQL] Repartition by dynamic partition columns before insert partition table
dilipbiswal commented on pull request #28032: URL: https://github.com/apache/spark/pull/28032#issuecomment-643926524 @wangyum Thanks for your response. If the incoming data is not evenly distributed by the repartitioning key, wouldn't this strategy create issues when there is skew in the data? I.e., we may end up with potentially very large partitions?
[GitHub] [spark] yaooqinn commented on pull request #28784: [SPARK-31957][SQL][test-maven] Cleanup hive scratch dir for the developer api startWithContext
yaooqinn commented on pull request #28784: URL: https://github.com/apache/spark/pull/28784#issuecomment-643926043 retest this please
[GitHub] [spark] yaooqinn commented on pull request #28784: [SPARK-31957][SQL] Cleanup hive scratch dir for the developer api startWithContext
yaooqinn commented on pull request #28784: URL: https://github.com/apache/spark/pull/28784#issuecomment-643925671 Thanks @HyukjinKwon and @juliuszsompolski, I was waiting for https://github.com/apache/spark/pull/28797 to be merged before pinging you. That is done now, and the conflicts have disappeared. You can check again. Also cc @cloud-fan @maropu, thanks.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28828: [SPARK-24634][SS][FOLLOWUP] Rename the variable from "numLateInputs" to "numDropppedRowsByWatermark"
dongjoon-hyun commented on a change in pull request #28828: URL: https://github.com/apache/spark/pull/28828#discussion_r439949163

## File path: sql/core/src/main/scala/org/apache/spark/sql/streaming/progress.scala ##
@@ -43,7 +43,7 @@ class StateOperatorProgress private[sql](
     val numRowsTotal: Long,
     val numRowsUpdated: Long,
     val memoryUsedBytes: Long,
-    val numLateInputs: Long,
+    val numDroppedRowsByWatermark: Long,

Review comment: nit. Just a question. For the variable naming, `numRowsDroppedByWatermark` seems more consistent with `numRowsTotal` and `numRowsUpdated` within this `StateOperatorProgress` class, doesn't it? (cc @zsxwing since this came from his comment.)
[GitHub] [spark] cloud-fan commented on pull request #28797: [SPARK-31926][SQL][TEST-HIVE1.2][test-maven] Fix concurrency issue for ThriftCLIService to getPortNumber
cloud-fan commented on pull request #28797: URL: https://github.com/apache/spark/pull/28797#issuecomment-643923811 thanks, merging to master/3.0!
[GitHub] [spark] cloud-fan closed pull request #28797: [SPARK-31926][SQL][TEST-HIVE1.2][test-maven] Fix concurrency issue for ThriftCLIService to getPortNumber
cloud-fan closed pull request #28797: URL: https://github.com/apache/spark/pull/28797
[GitHub] [spark] MaxGekk commented on pull request #28809: [SPARK-31959][SQL][3.0] Fix Gregorian-Julian micros rebasing while switching standard time zone offset
MaxGekk commented on pull request #28809: URL: https://github.com/apache/spark/pull/28809#issuecomment-643922594 I am going to skip the test checks if the JDK tzdb is outdated and Asia/Hong_Kong doesn't have any overlapping timestamps in 1945.
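One way such a guard could look is sketched below. It is an illustration only, not the check used in the PR: the object name `HongKongOverlapCheck` and the helper `hasOverlapInYear` are hypothetical. It uses only `java.time` to ask whether the running JDK's time zone database records an overlap transition (clocks set back) for a zone in a given year. The result for Asia/Hong_Kong depends on the installed tzdb, so no expected output is shown.

```scala
import java.time.ZoneId

// Hypothetical helper, not part of Spark.
object HongKongOverlapCheck {
  // True if the zone's rules contain at least one overlap transition
  // (clocks set back, so some local times occur twice) in the given year.
  def hasOverlapInYear(zoneId: String, year: Int): Boolean = {
    val it = ZoneId.of(zoneId).getRules.getTransitions.iterator()
    var found = false
    while (it.hasNext) {
      val t = it.next()
      if (t.isOverlap && t.getDateTimeBefore.getYear == year) found = true
    }
    found
  }

  def main(args: Array[String]): Unit = {
    val ok = hasOverlapInYear("Asia/Hong_Kong", 1945)
    println(s"Asia/Hong_Kong has an overlap in 1945: $ok")
    // A test could skip its assertions when this is false.
  }
}
```

A test that relies on the 1945 overlap could call `hasOverlapInYear` first and skip its assertions when the result is `false`.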
[GitHub] [spark] MaxGekk commented on pull request #28809: [SPARK-31959][SQL][3.0] Fix Gregorian-Julian micros rebasing while switching standard time zone offset
MaxGekk commented on pull request #28809: URL: https://github.com/apache/spark/pull/28809#issuecomment-643920058
> It might be Amplab Jenkins host issue (Java version or environment).
It uses a JDK with an outdated time zone database (the log does not make clear which version):
```
JAVA_HOME=/usr/lib/jvm/java-8-oracle
```
Other Jenkins machines have:
```
JAVA_HOME=/usr/java/jdk1.8.0_191
```
If we are not able to upgrade JDK 1.8 to a recent version, can we at least have the same JDK on all Jenkins machines?
[GitHub] [spark] xuanyuanking edited a comment on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
xuanyuanking edited a comment on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-643916110
cc @maropu @gatorsmile @HeartSaVioR @dongjoon-hyun
A new regression bug, SPARK-31990, was found while investigating the test failure https://github.com/apache/spark/pull/28707#issuecomment-639861273. The root cause is that [this line](https://github.com/apache/spark/pull/28062/files#diff-7a46f10c3cedbf013cf255564d9483cdL2458) in SPARK-31292 changed the order of `groupCols` in `Deduplicate`, and the changed order breaks the validation logic here. That is to say, without this PR, the executor JVM could crash, throw a random exception, or even return a wrong answer when using a checkpoint written by the previous version. So there are 2 pieces of work related to this PR:
- [ ] **[Block]** Fix and merge the compatibility issue in #28830
- [ ] [Follow-up] Add a new test (or modify the current Kafka test) in #28725
--
### More detailed analysis:
The expected order of `Deduplicate.groupCols` in the UT KafkaMicroBatchV2SourceSuite is
```
[timestamp, partition, timestampType, key, offset, topic, value]
```
which is also the order in checkpoints written by versions before Spark 3.0. After the changes in SPARK-31292, the groupCols changed to
```
[key, value, topic, partition, offset, timestamp, timestampType]
```
Why didn't this incompatibility bug fail `KafkaMicroBatchV2SourceSuite` when it was merged? Because the UT `default config of includeHeader doesn't break the existing query from Spark 2.4` didn't exercise the deduplication scenario and check the answer. Although the UT uses the checkpoint written by version 2.4.3 and the streaming deduplication operation, it only aims to prove that the new header (added in SPARK-23539) doesn't break the original checkpoint file.
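The ordering difference described in the analysis can be reproduced outside Spark. The sketch below is a minimal illustration, not the code in `Dataset.scala`: the object name `GroupColsOrder` and its two helpers are hypothetical, and the column list simply reuses the Kafka column names quoted above with one duplicate appended. `distinct` keeps first-occurrence order, while `toSet.toSeq` yields whatever iteration order the `Set` happens to have. That order is an implementation detail of the Scala collections, so no specific output is claimed here.

```scala
// Illustrative sketch only; not the code in Dataset.scala.
object GroupColsOrder {
  // Order-preserving de-duplication: keeps the first occurrence of each name.
  def dedupKeepOrder(colNames: Seq[String]): Seq[String] = colNames.distinct

  // De-duplication through a Set: the resulting order is the Set's
  // iteration order, which need not match the input order.
  def dedupViaSet(colNames: Seq[String]): Seq[String] = colNames.toSet.toSeq

  def main(args: Array[String]): Unit = {
    // Kafka source columns in schema order, with one duplicate appended.
    val cols = Seq("key", "value", "topic", "partition", "offset",
      "timestamp", "timestampType", "key")

    val ordered = dedupKeepOrder(cols)
    val viaSet = dedupViaSet(cols)

    println(s"distinct:    $ordered")
    println(s"toSet.toSeq: $viaSet")
    // Same columns either way; only the order may differ.
    println(s"same columns: ${ordered.toSet == viaSet.toSet}")
    println(s"same order:   ${ordered == viaSet}")
  }
}
```

Because existing checkpoints were written with the `Set`-derived order, switching to the order-preserving form changes the key layout that the state store expects, which is why #28830 restores `toSet.toSeq`.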
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28619: [SPARK-21040][CORE] Speculate tasks which are running on decommission executors
AmplabJenkins removed a comment on pull request #28619: URL: https://github.com/apache/spark/pull/28619#issuecomment-643916951
[GitHub] [spark] xuanyuanking commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
xuanyuanking commented on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643916855 ``` How we plan to consolidate both? How we will write JIRA title/description and PR title/description? Which is the type of the consolidated issue? Is the consolidated issue a blocker? ``` Here's my plan to consolidate both: https://github.com/apache/spark/pull/28707#issuecomment-643916110; this will also be noted in the JIRA and the PR description. Yes, #28707 is blocked by this fix.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28829: [WIP][SQL] Benchmark the EXCEPTION rebase mode
AmplabJenkins removed a comment on pull request #28829: URL: https://github.com/apache/spark/pull/28829#issuecomment-643916877
[GitHub] [spark] SparkQA commented on pull request #28829: [WIP][SQL] Benchmark the EXCEPTION rebase mode
SparkQA commented on pull request #28829: URL: https://github.com/apache/spark/pull/28829#issuecomment-643916564 **[Test build #124033 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124033/testReport)** for PR 28829 at commit [`16e90be`](https://github.com/apache/spark/commit/16e90bebf9314105d20c581a07120adb6d288e0b).
[GitHub] [spark] AmplabJenkins commented on pull request #28829: [WIP][SQL] Benchmark the EXCEPTION rebase mode
AmplabJenkins commented on pull request #28829: URL: https://github.com/apache/spark/pull/28829#issuecomment-643916882 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/28652/ Test PASSed.
[GitHub] [spark] HyukjinKwon commented on pull request #17953: [SPARK-20680][SQL] Spark-sql do not support for void column datatype …
HyukjinKwon commented on pull request #17953: URL: https://github.com/apache/spark/pull/17953#issuecomment-643916503 Yeah .. I personally support this change FWIW.
[GitHub] [spark] AmplabJenkins commented on pull request #28619: [SPARK-21040][CORE] Speculate tasks which are running on decommission executors
AmplabJenkins commented on pull request #28619: URL: https://github.com/apache/spark/pull/28619#issuecomment-643916951
[GitHub] [spark] SparkQA commented on pull request #28619: [SPARK-21040][CORE] Speculate tasks which are running on decommission executors
SparkQA commented on pull request #28619: URL: https://github.com/apache/spark/pull/28619#issuecomment-643916615 **[Test build #124034 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124034/testReport)** for PR 28619 at commit [`4affa58`](https://github.com/apache/spark/commit/4affa58f95f893ef6de1c1bf1c6b731468a2519d).
[GitHub] [spark] HyukjinKwon commented on pull request #27805: [SPARK-31056][SQL] Add CalendarIntervals division
HyukjinKwon commented on pull request #27805: URL: https://github.com/apache/spark/pull/27805#issuecomment-643915859 Do we have an answer to https://github.com/apache/spark/pull/27805#issuecomment-635381702? It's easier to justify with actual references and/or a standard.
[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
xuanyuanking commented on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-643916110
A new regression bug, SPARK-31990, was found while investigating the test failure https://github.com/apache/spark/pull/28707#issuecomment-639861273. The root cause is that [this line](https://github.com/apache/spark/pull/28062/files#diff-7a46f10c3cedbf013cf255564d9483cdL2458) in SPARK-31292 changed the order of `groupCols` in `Deduplicate`, and the changed order breaks the validation logic here. That is to say, without this PR, the executor JVM could crash, throw a random exception, or even return a wrong answer when using a checkpoint written by the previous version. So there are 2 pieces of work related to this PR:
- [ ] Fix and merge the compatibility issue in #28830
- [ ] Add a new test (or modify the current Kafka test) in #28725
--
### More detailed analysis:
The expected order of `Deduplicate.groupCols` in the UT KafkaMicroBatchV2SourceSuite is
```
[timestamp, partition, timestampType, key, offset, topic, value]
```
After the changes in SPARK-31292, the groupCols changed to
```
[key, value, topic, partition, offset, timestamp, timestampType]
```
Why didn't this incompatibility bug fail `KafkaMicroBatchV2SourceSuite` when it was merged? Because the UT `default config of includeHeader doesn't break the existing query from Spark 2.4` didn't exercise the deduplication scenario and check the answer. Although the UT uses the checkpoint written by version 2.4.3 and the streaming deduplication operation, it only aims to prove that the new header (added in SPARK-23539) doesn't break the original checkpoint file.
[GitHub] [spark] Ngone51 commented on pull request #28619: [SPARK-21040][CORE] Speculate tasks which are running on decommission executors
Ngone51 commented on pull request #28619: URL: https://github.com/apache/spark/pull/28619#issuecomment-643915676 retest this please.
[GitHub] [spark] MaxGekk commented on pull request #28829: [WIP][SQL] Benchmark the EXCEPTION rebase mode
MaxGekk commented on pull request #28829: URL: https://github.com/apache/spark/pull/28829#issuecomment-643915417 jenkins, retest this, please
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition
AmplabJenkins removed a comment on pull request #28642: URL: https://github.com/apache/spark/pull/28642#issuecomment-643914834
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition
HyukjinKwon commented on a change in pull request #28642: URL: https://github.com/apache/spark/pull/28642#discussion_r439940687 ## File path: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala ## @@ -1039,7 +1039,7 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan val pythonEvals = collect(joinNode.get) { case p: BatchEvalPythonExec => p } -assert(pythonEvals.size == 2) +assert(pythonEvals.size == 4) Review comment: Yeah, I don't think having more `BatchEvalPythonExec` nodes is more efficient. It will require more Python executions, which aren't trivial.
[GitHub] [spark] AmplabJenkins commented on pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition
AmplabJenkins commented on pull request #28642: URL: https://github.com/apache/spark/pull/28642#issuecomment-643914834
[GitHub] [spark] SparkQA commented on pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition
SparkQA commented on pull request #28642: URL: https://github.com/apache/spark/pull/28642#issuecomment-643914470 **[Test build #124032 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124032/testReport)** for PR 28642 at commit [`65cd324`](https://github.com/apache/spark/commit/65cd324093fac15357fb0ca9bae7c524b40c).
[GitHub] [spark] HyukjinKwon commented on pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition
HyukjinKwon commented on pull request #28642: URL: https://github.com/apache/spark/pull/28642#issuecomment-643913716 retest this please
[GitHub] [spark] Ngone51 commented on pull request #28801: [SPARK-31970][CORE] Make MDC configuration step be consistent between setLocalProperty and log4j.properties
Ngone51 commented on pull request #28801: URL: https://github.com/apache/spark/pull/28801#issuecomment-643912320 thanks all!!
[GitHub] [spark] AmplabJenkins removed a comment on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block
AmplabJenkins removed a comment on pull request #27604: URL: https://github.com/apache/spark/pull/27604#issuecomment-643909975 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124026/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block
AmplabJenkins removed a comment on pull request #27604: URL: https://github.com/apache/spark/pull/27604#issuecomment-643909967 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block
AmplabJenkins commented on pull request #27604: URL: https://github.com/apache/spark/pull/27604#issuecomment-643909975 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124026/ Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block
SparkQA removed a comment on pull request #27604: URL: https://github.com/apache/spark/pull/27604#issuecomment-643877230 **[Test build #124026 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124026/testReport)** for PR 27604 at commit [`2e11d1b`](https://github.com/apache/spark/commit/2e11d1bedf15b59c89b1f686ea716a575802f1e6).
[GitHub] [spark] SparkQA commented on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block
SparkQA commented on pull request #27604: URL: https://github.com/apache/spark/pull/27604#issuecomment-643909627 **[Test build #124026 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124026/testReport)** for PR 27604 at commit [`2e11d1b`](https://github.com/apache/spark/commit/2e11d1bedf15b59c89b1f686ea716a575802f1e6). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] HyukjinKwon commented on pull request #28828: [SPARK-24634][SS][FOLLOWUP] Rename the variable from "numLateInputs" to "numDropppedRowsByWatermark"
HyukjinKwon commented on pull request #28828: URL: https://github.com/apache/spark/pull/28828#issuecomment-643906549 @xuanyuanking too FYI
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp
AmplabJenkins removed a comment on pull request #28821: URL: https://github.com/apache/spark/pull/28821#issuecomment-643904439 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124024/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp
AmplabJenkins removed a comment on pull request #28821: URL: https://github.com/apache/spark/pull/28821#issuecomment-643904434 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp
AmplabJenkins commented on pull request #28821: URL: https://github.com/apache/spark/pull/28821#issuecomment-643904434
[GitHub] [spark] SparkQA removed a comment on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp
SparkQA removed a comment on pull request #28821: URL: https://github.com/apache/spark/pull/28821#issuecomment-643865812 **[Test build #124024 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124024/testReport)** for PR 28821 at commit [`707b0cf`](https://github.com/apache/spark/commit/707b0cf949e2532429bdc62d7ef219fe98a0751e).
[GitHub] [spark] SparkQA commented on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp
SparkQA commented on pull request #28821: URL: https://github.com/apache/spark/pull/28821#issuecomment-643904220 **[Test build #124024 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124024/testReport)** for PR 28821 at commit [`707b0cf`](https://github.com/apache/spark/commit/707b0cf949e2532429bdc62d7ef219fe98a0751e). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #28807: [SPARK-26905][SQL] Follow the SQL:2016 reserved keywords
AmplabJenkins commented on pull request #28807: URL: https://github.com/apache/spark/pull/28807#issuecomment-643899506
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28807: [SPARK-26905][SQL] Follow the SQL:2016 reserved keywords
AmplabJenkins removed a comment on pull request #28807: URL: https://github.com/apache/spark/pull/28807#issuecomment-643899506
[GitHub] [spark] maropu commented on a change in pull request #28807: [SPARK-26905][SQL] Follow the SQL:2016 reserved keywords
maropu commented on a change in pull request #28807: URL: https://github.com/apache/spark/pull/28807#discussion_r439927771 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/TableIdentifierParserSuite.scala ## @@ -388,12 +396,24 @@ class TableIdentifierParserSuite extends SparkFunSuite with SQLHelper { val reservedKeywordsInAnsiMode = allCandidateKeywords -- nonReservedKeywordsInAnsiMode test("check # of reserved keywords") { -val numReservedKeywords = 78 +val numReservedKeywords = 74 Review comment: Note: `ANTI`, `SEMI`, `MINUS`, and `!` are removed.
[GitHub] [spark] SparkQA commented on pull request #28807: [SPARK-26905][SQL] Follow the SQL:2016 reserved keywords
SparkQA commented on pull request #28807: URL: https://github.com/apache/spark/pull/28807#issuecomment-643899210 **[Test build #124031 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124031/testReport)** for PR 28807 at commit [`eeceb30`](https://github.com/apache/spark/commit/eeceb30e050c26acdb93372eef0ce14410bd0159).
[GitHub] [spark] AmplabJenkins commented on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
AmplabJenkins commented on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-643897872
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
AmplabJenkins removed a comment on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-643897872
[GitHub] [spark] SparkQA commented on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
SparkQA commented on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-643897635 **[Test build #124030 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124030/testReport)** for PR 28710 at commit [`2e6f35c`](https://github.com/apache/spark/commit/2e6f35c8e31fe1cde1637b922673339bfeef65fe).
[GitHub] [spark] huaxingao commented on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
huaxingao commented on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-643896578 retest this please
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28593: [SPARK-31710][SQL] Fail casting numeric to timestamp by default
AmplabJenkins removed a comment on pull request #28593: URL: https://github.com/apache/spark/pull/28593#issuecomment-643892810
[GitHub] [spark] AmplabJenkins commented on pull request #28593: [SPARK-31710][SQL] Fail casting numeric to timestamp by default
AmplabJenkins commented on pull request #28593: URL: https://github.com/apache/spark/pull/28593#issuecomment-643892810
[GitHub] [spark] SparkQA commented on pull request #28593: [SPARK-31710][SQL] Fail casting numeric to timestamp by default
SparkQA commented on pull request #28593: URL: https://github.com/apache/spark/pull/28593#issuecomment-643892530 **[Test build #124029 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124029/testReport)** for PR 28593 at commit [`8fe1960`](https://github.com/apache/spark/commit/8fe1960ef3a0c598a626b7024820b74cec787642).
[GitHub] [spark] dongjoon-hyun commented on pull request #24922: [SPARK-28120][SS] Rocksdb state storage implementation
dongjoon-hyun commented on pull request #24922: URL: https://github.com/apache/spark/pull/24922#issuecomment-643892244 Thank you for the update, @itsvikramagr.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
AmplabJenkins removed a comment on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-643891541 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124021/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
AmplabJenkins removed a comment on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-643891538 Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
SparkQA removed a comment on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-643855623 **[Test build #124021 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124021/testReport)** for PR 28710 at commit [`2e6f35c`](https://github.com/apache/spark/commit/2e6f35c8e31fe1cde1637b922673339bfeef65fe).
[GitHub] [spark] SparkQA commented on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
SparkQA commented on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-643891334 **[Test build #124021 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124021/testReport)** for PR 28710 at commit [`2e6f35c`](https://github.com/apache/spark/commit/2e6f35c8e31fe1cde1637b922673339bfeef65fe). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
HeartSaVioR edited a comment on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-64318 How do we plan to consolidate both? How will we write the JIRA title/description and PR title/description? What is the type of the consolidated issue? Is the consolidated issue a blocker? Things would be simpler if we merge the partial revert as it is, and spend our efforts discussing how to provide guidance on known issues - this is one of the candidates for Spark 3.0.0. This is clearly a bugfix for a "blocker" issue that prevents some end users from migrating to Spark 3.0.0, so it is worth having its own JIRA issue and its own commit. Sure, this may need to be placed in the migration guide or release notes as well. There'd be no harm in #28707 waiting for this patch to be merged and then rebasing to fix the test failure.
[GitHub] [spark] GuoPhilipse commented on pull request #28593: [SPARK-31710][SQL] Fail casting numeric to timestamp by default
GuoPhilipse commented on pull request #28593: URL: https://github.com/apache/spark/pull/28593#issuecomment-643890774 It was generated by the set command; we have now removed it.
[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
HeartSaVioR edited a comment on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-643878976 I’m sorry, but version 4 doesn’t leverage UnsafeRow. (Version 2 did.) Please read the description carefully. As I commented earlier, there are still lots of possible improvements to the metadata, but I don’t want to go through them unless we commit to dedicated reviewing effort. This is low-hanging fruit that brings a massive improvement.
[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
HeartSaVioR edited a comment on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-64318 How do we plan to consolidate both? How will we write the JIRA title/description and PR title/description? What is the type of the consolidated issue? Is the consolidated issue a blocker? Things would be simpler if we merge the partial fix as it is, and spend our efforts discussing how to provide guidance on known issues - this is one of the candidates for Spark 3.0.0. This is clearly a bugfix for a "blocker" issue that prevents some end users from migrating to Spark 3.0.0. Sure, this may need to be placed in the migration guide or release notes as well. There'd be no harm in #28707 waiting for this patch to be merged and then rebasing to fix the test failure.
[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
HeartSaVioR edited a comment on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-64318 How do we plan to consolidate both? How will we write the JIRA title/description and PR title/description? What is the type of the consolidated issue? Is the consolidated issue a blocker? Things would be simpler if we merge the partial fix as it is, and spend our efforts discussing how to provide guidance on known issues - this is one of the candidates for Spark 3.0.0. This is clearly a bugfix for a "blocker" issue that prevents some end users from migrating to Spark 3.0.0, so it is worth having its own JIRA issue and its own commit. Sure, this may need to be placed in the migration guide or release notes as well. There'd be no harm in #28707 waiting for this patch to be merged and then rebasing to fix the test failure.
[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
HeartSaVioR edited a comment on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-64318 How do we plan to consolidate both? How will we write the JIRA title/description and PR title/description? What is the type of the consolidated issue? Is the consolidated issue a blocker? Things would be simpler if we merge the partial fix as it is, and spend our efforts discussing how to provide guidance on the known issue - this is one of the candidates for Spark 3.0.0. This is clearly a bugfix for a "blocker" issue that prevents some end users from migrating to Spark 3.0.0. Sure, this may need to be placed in the migration guide or release notes as well. There'd be no harm in #28707 waiting for this patch to be merged and then rebasing to fix the test failure.
[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
HeartSaVioR edited a comment on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-64318 How do we plan to consolidate both? How will we write the JIRA title/description and PR title/description? What is the type of the consolidated issue? Is the consolidated issue a blocker? Things would be simpler if we merge the partial fix as it is, and spend our efforts discussing how to provide guidance on the known issue - this is one of the candidates for Spark 3.0.0. This is clearly a bugfix for a "blocker" issue that prevents some end users from migrating to Spark 3.0.0. There'd be no harm in #28707 waiting for this patch to be merged and then rebasing to fix the test failure.
[GitHub] [spark] HeartSaVioR commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
HeartSaVioR commented on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-64318 How do we plan to consolidate both? How will we write the JIRA title/description and PR title/description? What is the type of the consolidated issue? Is the consolidated issue a blocker? Things would be simple if we merge the partial fix as it is, and spend our efforts discussing how to provide guidance on the known issue - this is one of the candidates for Spark 3.0.0. This is clearly a bugfix for a "blocker" issue that prevents some end users from migrating to Spark 3.0.0. There'd be no harm in #28707 waiting for this patch to be merged and then rebasing to fix the test failure.
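For readers following the #28830 thread, the "partial fix" concerns the order in which `Dataset.dropDuplicates` deduplicates column names. The following is a minimal, Spark-free sketch of why `distinct` and `toSet.toSeq` may disagree on order; the object name and the column names are hypothetical and chosen only for illustration, and this is not the code from the PR.

```scala
object DropDuplicatesOrderSketch {
  // Hypothetical column names, used only for illustration.
  val colNames: Seq[String] = Seq("b", "a", "c", "a", "b")

  // `distinct` keeps the first occurrence of each element, in input order.
  val viaDistinct: Seq[String] = colNames.distinct

  // `toSet.toSeq` also removes duplicates, but the resulting order follows
  // the Set implementation and is not guaranteed to match the input order.
  val viaSet: Seq[String] = colNames.toSet.toSeq

  def main(args: Array[String]): Unit = {
    println(s"distinct:    $viaDistinct")
    println(s"toSet.toSeq: $viaSet")
    // Both hold the same elements; only the ordering may differ.
    assert(viaDistinct.sorted == viaSet.sorted)
  }
}
```

Because the two calls return the same elements but possibly in a different order, anything that persists data keyed by that order (the thread mentions the streaming state store) could be affected when one call is swapped for the other.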
[GitHub] [spark] AmplabJenkins commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
AmplabJenkins commented on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-643887374
[GitHub] [spark] AmplabJenkins removed a comment on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
AmplabJenkins removed a comment on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-643887374
[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r439917190 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ## @@ -124,11 +153,24 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand { val hiveVersion = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging") val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive") +logDebug(s"path '${path.toString}', staging dir '$stagingDir', " + + s"scratch dir '$scratchDir' are used") if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) { oldVersionExternalTempPath(path, hadoopConf, scratchDir) } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) { Review comment: Got it. I added the description "This option is supported in Hive 2.0 or later." in SQLConf.scala. https://github.com/apache/spark/pull/27690/files#diff-9a6b543db706f1a90f790783d6930a13R849
[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
moomindani commented on a change in pull request #27690: URL: https://github.com/apache/spark/pull/27690#discussion_r439917190 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ## @@ -124,11 +153,24 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand { val hiveVersion = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging") val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive") +logDebug(s"path '${path.toString}', staging dir '$stagingDir', " + + s"scratch dir '$scratchDir' are used") if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) { oldVersionExternalTempPath(path, hadoopConf, scratchDir) } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) { Review comment: Got it. I added the description "This option is supported in Hive 2.0 or later." in SQLConf.scala. https://github.com/apache/spark/pull/27690/files#diff-9a6b543db706f1a90f790783d6930a13R849
[GitHub] [spark] SparkQA commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
SparkQA commented on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-643887119 **[Test build #124028 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124028/testReport)** for PR 27690 at commit [`0fbeaf3`](https://github.com/apache/spark/commit/0fbeaf374bf35a7d0cde2b3340d9f3c4551cbdb2).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28786: [SPARK-31925][ML] Summary.totalIterations greater than maxIters
AmplabJenkins removed a comment on pull request #28786: URL: https://github.com/apache/spark/pull/28786#issuecomment-643885908
[GitHub] [spark] SparkQA removed a comment on pull request #28786: [SPARK-31925][ML] Summary.totalIterations greater than maxIters
SparkQA removed a comment on pull request #28786: URL: https://github.com/apache/spark/pull/28786#issuecomment-643867351 **[Test build #124025 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124025/testReport)** for PR 28786 at commit [`4c4d52b`](https://github.com/apache/spark/commit/4c4d52b91e1ebbd018835c3bb2cd565df79bd430).
[GitHub] [spark] AmplabJenkins commented on pull request #28786: [SPARK-31925][ML] Summary.totalIterations greater than maxIters
AmplabJenkins commented on pull request #28786: URL: https://github.com/apache/spark/pull/28786#issuecomment-643885908
[GitHub] [spark] SparkQA commented on pull request #28786: [SPARK-31925][ML] Summary.totalIterations greater than maxIters
SparkQA commented on pull request #28786: URL: https://github.com/apache/spark/pull/28786#issuecomment-643885633 **[Test build #124025 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124025/testReport)** for PR 28786 at commit [`4c4d52b`](https://github.com/apache/spark/commit/4c4d52b91e1ebbd018835c3bb2cd565df79bd430). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] maropu commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates
maropu commented on pull request #28830: URL: https://github.com/apache/spark/pull/28830#issuecomment-643885408 > Thanks for the quick fix @maropu! I think maybe we can simplify the bugfix by combining it together with #28707. WDYT? I'll also reference this PR with #28707. @xuanyuanking yea, looks fine to me. Could you take this over? Thanks, anyway!