[GitHub] [spark] HeartSaVioR commented on a change in pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

2020-06-14 Thread GitBox


HeartSaVioR commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439965525



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##
@@ -2541,7 +2541,9 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must keep `toSet.toSeq` here because of the backward compatibility issue
+    // (the Streaming's state store depends on the `groupCols` order).
+    val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>

Review comment:
   Worth noting that we need a "concrete" solution eventually: if all the 
columns have the same type, neither #28830 nor #24173 catches the change, and 
the result becomes silently incorrect. I roughly remember a similar issue in 
PySpark, where a fix was attempted for the order-vs-name problem, but I don't 
remember how it ended up. cc @HyukjinKwon 
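
The concern about same-typed columns can be illustrated with a minimal sketch. It assumes a check that compares only the column types by position; `Field`, `typesMatch`, and `namesMatch` are hypothetical names used for illustration and are not Spark APIs or the validation logic from either PR.

```scala
object SameTypeReorder {
  // Hypothetical stand-in for a key column: a name plus a type tag.
  final case class Field(name: String, dataType: String)

  // Assumed check: compares only the types, position by position.
  // It never looks at the column names.
  def typesMatch(stored: Seq[Field], current: Seq[Field]): Boolean =
    stored.map(_.dataType) == current.map(_.dataType)

  // A name-aware comparison, shown only for contrast.
  def namesMatch(stored: Seq[Field], current: Seq[Field]): Boolean =
    stored.map(_.name) == current.map(_.name)

  def main(args: Array[String]): Unit = {
    val stored  = Seq(Field("user", "string"), Field("device", "string"))
    val swapped = stored.reverse
    // The order changed, but a type-only check cannot see it.
    println(s"types match: ${typesMatch(stored, swapped)}")
    println(s"names match: ${namesMatch(stored, swapped)}")
  }
}
```

Under this assumption, swapping two `string` key columns passes the type-only check even though each position now holds a different column, which is the silent mismatch described above.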





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org






[GitHub] [spark] wangyum edited a comment on pull request #28032: [SPARK-31264][SQL] Repartition by dynamic partition columns before insert partition table

2020-06-14 Thread GitBox


wangyum edited a comment on pull request #28032:
URL: https://github.com/apache/spark/pull/28032#issuecomment-643936564


   Yes, this strategy may introduce data skew, but a skewed job only affects 
itself. Creating a large number of files affects the NameNode, which in turn 
affects the stability of the whole cluster.
   
   In fact, our cluster already handles this kind of data skew, and we will 
contribute that to the community in the future if needed.
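
The trade-off between file count and skew can be sketched with a small simulation. This is a simplified model and not Spark's writer: it assumes each write task produces one file per distinct partition value it holds, and `DynamicPartitionFiles`, `filesWritten`, `repartitionByKey`, and `largestTask` are hypothetical names.

```scala
object DynamicPartitionFiles {
  // Each inner Seq is one write task; each element is a row's partition value.
  type Tasks = Seq[Seq[String]]

  // Assumption: a task writes one file per distinct partition value it holds.
  def filesWritten(tasks: Tasks): Int = tasks.map(_.distinct.size).sum

  // Route every row to a task chosen by hashing its partition value, so all
  // rows with the same value end up in the same task.
  def repartitionByKey(tasks: Tasks, numTasks: Int): Tasks = {
    val rows = tasks.flatten
    (0 until numTasks).map { i =>
      rows.filter(v => Math.floorMod(v.hashCode, numTasks) == i)
    }
  }

  // Row count of the largest task, a rough indicator of skew.
  def largestTask(tasks: Tasks): Int = tasks.map(_.size).max

  def main(args: Array[String]): Unit = {
    val before = Seq.fill(3)(Seq("2020-06-13", "2020-06-14", "2020-06-14"))
    val after  = repartitionByKey(before, 3)
    println(s"files before: ${filesWritten(before)}, largest task: ${largestTask(before)}")
    println(s"files after:  ${filesWritten(after)}, largest task: ${largestTask(after)}")
  }
}
```

In this model, repartitioning by the partition value reduces the file count to the number of distinct values, while all rows for a heavy value land in a single task. That is the skew raised in the question this comment answers.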






[GitHub] [spark] HeartSaVioR commented on a change in pull request #28828: [SPARK-24634][SS][FOLLOWUP] Rename the variable from "numLateInputs" to "numDroppedRowsByWatermark"

2020-06-14 Thread GitBox


HeartSaVioR commented on a change in pull request #28828:
URL: https://github.com/apache/spark/pull/28828#discussion_r439963800



##
File path: sql/core/src/main/scala/org/apache/spark/sql/streaming/progress.scala
##
@@ -43,7 +43,7 @@ class StateOperatorProgress private[sql](
     val numRowsTotal: Long,
     val numRowsUpdated: Long,
     val memoryUsedBytes: Long,
-    val numLateInputs: Long,
+    val numDroppedRowsByWatermark: Long,

Review comment:
   Yeah, that sounds better for consistency (numRows + a suffix explaining how 
the number is measured) and keeps the same meaning. Thanks for the suggestion.








[GitHub] [spark] viirya commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

2020-06-14 Thread GitBox


viirya commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-643937918


   @srowen @huaxingao @zhengruifeng thanks for the previous reviews. Do you 
have any more comments on this change?









[GitHub] [spark] xuanyuanking commented on a change in pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

2020-06-14 Thread GitBox


xuanyuanking commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439959045



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##
@@ -2541,7 +2541,9 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must keep `toSet.toSeq` here because of the backward compatibility issue
+    // (the Streaming's state store depends on the `groupCols` order).
+    val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>

Review comment:
   Yep, I also mentioned this at 
https://github.com/apache/spark/pull/28830#discussion_r439909489. We can rely 
on the validation check and integration tests.








[GitHub] [spark] cloud-fan commented on a change in pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

2020-06-14 Thread GitBox


cloud-fan commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439956403



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##
@@ -2541,7 +2541,9 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must keep `toSet.toSeq` here because of the backward compatibility issue
+    // (the Streaming's state store depends on the `groupCols` order).
+    val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>

Review comment:
   I think this is good for now.
   
   In the future, this may still be broken by a Scala version upgrade, and 
hopefully @xuanyuanking's unsafe row validation can detect it. Then we can 
change it and use a deterministic order, as it will be broken anyway.








[GitHub] [spark] EnricoMi commented on pull request #27805: [SPARK-31056][SQL] Add CalendarIntervals division

2020-06-14 Thread GitBox


EnricoMi commented on pull request #27805:
URL: https://github.com/apache/spark/pull/27805#issuecomment-643932097


   I am not aware of any SQL standard going this direction. I agree that the 
suggested UDFs give me the same result.












[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp

2020-06-14 Thread GitBox


dongjoon-hyun edited a comment on pull request #28821:
URL: https://github.com/apache/spark/pull/28821#issuecomment-643931292


   Hi, @Fokko. The last failure means that we need to regenerate the output 
file of `ThriftServerQueryTestSuite.sql`. To proceed, please update it.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28781: [SPARK-31953][SS] Add Spark Structured Streaming History Server Support

2020-06-14 Thread GitBox


AmplabJenkins removed a comment on pull request #28781:
URL: https://github.com/apache/spark/pull/28781#issuecomment-643930261


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124023/
   Test FAILed.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28781: [SPARK-31953][SS] Add Spark Structured Streaming History Server Support

2020-06-14 Thread GitBox


AmplabJenkins removed a comment on pull request #28781:
URL: https://github.com/apache/spark/pull/28781#issuecomment-643930255


   Merged build finished. Test FAILed.






[GitHub] [spark] AngersZhuuuu commented on pull request #28032: [SPARK-31264][SQL] Repartition by dynamic partition columns before insert partition table

2020-06-14 Thread GitBox


AngersZhuuuu commented on pull request #28032:
URL: https://github.com/apache/spark/pull/28032#issuecomment-643930675


   > @wangyum Thanks for your response. If the incoming data is not evenly 
   > distributed by the repartitioning key, wouldn't this strategy create issues 
   > when there is skew in the data? I.e., we may end up with potentially very 
   > large partitions?
   
   +1 for this concern. Maybe we should add an adaptive procedure to adjust 
the number of partitions for large partition key values. 






[GitHub] [spark] HyukjinKwon edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

2020-06-14 Thread GitBox


HyukjinKwon edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643930058


   I am okay with that; I was just wondering whether the previous behaviour 
was even deterministic, and SPARK-31292 looked more correct to me. Given that 
we're going to revisit this anyway, LGTM from me too.






[GitHub] [spark] AmplabJenkins commented on pull request #28781: [SPARK-31953][SS] Add Spark Structured Streaming History Server Support

2020-06-14 Thread GitBox


AmplabJenkins commented on pull request #28781:
URL: https://github.com/apache/spark/pull/28781#issuecomment-643930255










[GitHub] [spark] SparkQA removed a comment on pull request #28781: [SPARK-31953][SS] Add Spark Structured Streaming History Server Support

2020-06-14 Thread GitBox


SparkQA removed a comment on pull request #28781:
URL: https://github.com/apache/spark/pull/28781#issuecomment-643862941


   **[Test build #124023 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124023/testReport)** for PR 28781 at commit [`c721d8c`](https://github.com/apache/spark/commit/c721d8c7299d503f1010cc8626ffb77d492c1bda).






[GitHub] [spark] MaxGekk commented on pull request #28829: [SPARK-31992][SQL] Benchmark the EXCEPTION rebase mode

2020-06-14 Thread GitBox


MaxGekk commented on pull request #28829:
URL: https://github.com/apache/spark/pull/28829#issuecomment-643930188


   @cloud-fan Please, review the PR






[GitHub] [spark] HyukjinKwon commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

2020-06-14 Thread GitBox


HyukjinKwon commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643930058


   I am okay with that; I was just wondering whether the previous behaviour 
was even deterministic, and the current change looked more correct to me. 
Given that we're going to revisit this anyway, LGTM from me too.






[GitHub] [spark] SparkQA commented on pull request #28781: [SPARK-31953][SS] Add Spark Structured Streaming History Server Support

2020-06-14 Thread GitBox


SparkQA commented on pull request #28781:
URL: https://github.com/apache/spark/pull/28781#issuecomment-643929926


   **[Test build #124023 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124023/testReport)** for PR 28781 at commit [`c721d8c`](https://github.com/apache/spark/commit/c721d8c7299d503f1010cc8626ffb77d492c1bda).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
 * `trait TimestampFormatterHelper extends TimeZoneAwareExpression `
 * `case class WidthBucket(`
 * `trait PredicateHelper extends Logging `
 * `case class ProcessingTimeTrigger(intervalMs: Long) extends Trigger `
 * `case class ContinuousTrigger(intervalMs: Long) extends Trigger `






[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

2020-06-14 Thread GitBox


dongjoon-hyun edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643929047










[GitHub] [spark] dongjoon-hyun commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

2020-06-14 Thread GitBox


dongjoon-hyun commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643929047


   The last commit tries to preserve the previous behavior (whatever it was) 
since Apache Spark 2.2.0, although there is no guarantee whether it is safe or 
not. We will revisit the correctness issue later, after 3.0.1.






[GitHub] [spark] HyukjinKwon commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

2020-06-14 Thread GitBox


HyukjinKwon commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643927909


   I am okay with reverting it for now, but I couldn't fully follow why we 
expect an explicit order from a set. Has it ever been guaranteed anywhere? 
Using `distinct`, we can expect a deterministic order, but we're reverting back 
to using a set because of the deterministic order (?).
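
The difference between the two calls discussed here can be seen in a minimal standalone sketch. It uses only the Scala standard library; the object and method names (`GroupColsOrder`, `viaDistinct`, `viaSet`) are hypothetical and this is not the code in `Dataset.scala`.

```scala
object GroupColsOrder {
  // `distinct` keeps the first occurrence of each element, in input order.
  def viaDistinct(colNames: Seq[String]): Seq[String] = colNames.distinct

  // `toSet.toSeq` also removes duplicates, but the resulting order is whatever
  // the Set implementation happens to iterate in; the Set API does not
  // promise any particular order.
  def viaSet(colNames: Seq[String]): Seq[String] = colNames.toSet.toSeq

  def main(args: Array[String]): Unit = {
    val cols = Seq("e", "d", "c", "b", "a", "e", "c")
    println(s"distinct:    ${viaDistinct(cols)}")
    println(s"toSet.toSeq: ${viaSet(cols)}")
    // Both hold the same names; only the order may differ.
    assert(viaDistinct(cols).toSet == viaSet(cols).toSet)
  }
}
```

The order produced by `viaSet` may be stable for a given Scala version, which appears to be what existing streaming state relies on, but nothing in the sketch (or in the Set API) guarantees it across versions.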






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28784: [SPARK-31957][SQL][test-maven] Cleanup hive scratch dir for the developer api startWithContext

2020-06-14 Thread GitBox


AmplabJenkins removed a comment on pull request #28784:
URL: https://github.com/apache/spark/pull/28784#issuecomment-643926776










[GitHub] [spark] AmplabJenkins commented on pull request #28784: [SPARK-31957][SQL][test-maven] Cleanup hive scratch dir for the developer api startWithContext

2020-06-14 Thread GitBox


AmplabJenkins commented on pull request #28784:
URL: https://github.com/apache/spark/pull/28784#issuecomment-643926776










[GitHub] [spark] SparkQA commented on pull request #28784: [SPARK-31957][SQL][test-maven] Cleanup hive scratch dir for the developer api startWithContext

2020-06-14 Thread GitBox


SparkQA commented on pull request #28784:
URL: https://github.com/apache/spark/pull/28784#issuecomment-643926432


   **[Test build #124035 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124035/testReport)** for PR 28784 at commit [`d055d60`](https://github.com/apache/spark/commit/d055d60aab356401f0f7b00b83b76dc76df0c30c).






[GitHub] [spark] dilipbiswal commented on pull request #28032: [SPARK-31264][SQL] Repartition by dynamic partition columns before insert partition table

2020-06-14 Thread GitBox


dilipbiswal commented on pull request #28032:
URL: https://github.com/apache/spark/pull/28032#issuecomment-643926524


   @wangyum Thanks for your response. If the incoming data is not evenly 
distributed by the repartitioning key, wouldn't this strategy create issues 
when there is skew in the data? I.e., we may end up with potentially very large 
partitions?






[GitHub] [spark] yaooqinn commented on pull request #28784: [SPARK-31957][SQL][test-maven] Cleanup hive scratch dir for the developer api startWithContext

2020-06-14 Thread GitBox


yaooqinn commented on pull request #28784:
URL: https://github.com/apache/spark/pull/28784#issuecomment-643926043


   retest this please






[GitHub] [spark] yaooqinn commented on pull request #28784: [SPARK-31957][SQL] Cleanup hive scratch dir for the developer api startWithContext

2020-06-14 Thread GitBox


yaooqinn commented on pull request #28784:
URL: https://github.com/apache/spark/pull/28784#issuecomment-643925671


   Thanks @HyukjinKwon and @juliuszsompolski. I was waiting for 
https://github.com/apache/spark/pull/28797 to be merged before pinging you. 
That is done now.
   The conflicts are gone, so you can check again. Also cc @cloud-fan @maropu, 
thanks.






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28828: [SPARK-24634][SS][FOLLOWUP] Rename the variable from "numLateInputs" to "numDroppedRowsByWatermark"

2020-06-14 Thread GitBox


dongjoon-hyun commented on a change in pull request #28828:
URL: https://github.com/apache/spark/pull/28828#discussion_r439949163



##
File path: sql/core/src/main/scala/org/apache/spark/sql/streaming/progress.scala
##
@@ -43,7 +43,7 @@ class StateOperatorProgress private[sql](
     val numRowsTotal: Long,
     val numRowsUpdated: Long,
     val memoryUsedBytes: Long,
-    val numLateInputs: Long,
+    val numDroppedRowsByWatermark: Long,

Review comment:
   nit. Just a question. For the variable naming, 
`numRowsDroppedByWatermark` seems more consistent with `numRowsTotal` and 
`numRowsUpdated` within this `StateOperatorProgress` class, doesn't it? (cc 
@zsxwing since this came from his comment.)











[GitHub] [spark] cloud-fan commented on pull request #28797: [SPARK-31926][SQL][TEST-HIVE1.2][test-maven] Fix concurrency issue for ThriftCLIService to getPortNumber

2020-06-14 Thread GitBox


cloud-fan commented on pull request #28797:
URL: https://github.com/apache/spark/pull/28797#issuecomment-643923811


   thanks, merging to master/3.0!






[GitHub] [spark] cloud-fan closed pull request #28797: [SPARK-31926][SQL][TEST-HIVE1.2][test-maven] Fix concurrency issue for ThriftCLIService to getPortNumber

2020-06-14 Thread GitBox


cloud-fan closed pull request #28797:
URL: https://github.com/apache/spark/pull/28797


   






[GitHub] [spark] MaxGekk commented on pull request #28809: [SPARK-31959][SQL][3.0] Fix Gregorian-Julian micros rebasing while switching standard time zone offset

2020-06-14 Thread GitBox


MaxGekk commented on pull request #28809:
URL: https://github.com/apache/spark/pull/28809#issuecomment-643922594


   I am going to skip the test checks when the JDK's tzdb is outdated, i.e. 
when Asia/Hong_Kong has no overlapping timestamps in 1945 at all.
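
A minimal sketch of what such a guard could look like, assuming only the `java.time` API. `hasOverlapInYear` is a hypothetical helper, not the check used in the PR. It walks the zone's transitions that fall in a given year and reports whether any of them is an overlap (clocks set back, so some local timestamps occur twice):

```scala
import java.time.{LocalDateTime, ZoneId, ZoneOffset}

object TzdbOverlapCheck {
  // Hypothetical helper: true if `zone` has at least one overlap transition
  // whose instant falls in `year` (year boundaries taken in UTC), according
  // to the tzdb bundled with the running JVM.
  def hasOverlapInYear(zone: ZoneId, year: Int): Boolean = {
    val rules = zone.getRules
    val start = LocalDateTime.of(year, 1, 1, 0, 0).toInstant(ZoneOffset.UTC)
    val end = LocalDateTime.of(year + 1, 1, 1, 0, 0).toInstant(ZoneOffset.UTC)
    // nextTransition returns null when there are no further transitions.
    var transition = rules.nextTransition(start)
    var found = false
    while (!found && transition != null && transition.getInstant.isBefore(end)) {
      found = transition.isOverlap
      transition = rules.nextTransition(transition.getInstant)
    }
    found
  }

  def main(args: Array[String]): Unit = {
    val hk = ZoneId.of("Asia/Hong_Kong")
    // The result depends on the tzdb version of the JDK running this code.
    println(s"Asia/Hong_Kong overlap in 1945: ${hasOverlapInYear(hk, 1945)}")
  }
}
```

A test could call the helper first and skip its assertions when it returns `false`, so the outcome no longer depends on which Jenkins machine runs the build.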






[GitHub] [spark] MaxGekk commented on pull request #28809: [SPARK-31959][SQL][3.0] Fix Gregorian-Julian micros rebasing while switching standard time zone offset

2020-06-14 Thread GitBox


MaxGekk commented on pull request #28809:
URL: https://github.com/apache/spark/pull/28809#issuecomment-643920058


   > It might be Amplap Jenkins host issue (Java version or environment).
   
   It uses a JDK with an outdated time zone database (the log doesn't show 
which version):
   ```
   JAVA_HOME=/usr/lib/jvm/java-8-oracle
   ```
   Other Jenkins machines have:
   ```
   JAVA_HOME=/usr/java/jdk1.8.0_191
   ```
   If we cannot upgrade JDK 1.8 to a recent version, can we at least have the 
same JDK on all Jenkins machines?
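
One way to see which tzdb a machine actually has, independent of `JAVA_HOME`, is to ask the JVM itself. A small sketch using `java.time.zone.ZoneRulesProvider.getVersions`. With the JDK's default provider the map key is expected to be the tzdb release id, but that is an assumption about the provider and not something taken from the Jenkins logs. `TzdbVersion` and `latest` are illustrative names:

```scala
import java.time.zone.ZoneRulesProvider

object TzdbVersion {
  // Newest version id under which the installed provider knows `zoneId`.
  // Throws if the zone id is unknown to every registered provider.
  def latest(zoneId: String): String =
    ZoneRulesProvider.getVersions(zoneId).lastKey()

  def main(args: Array[String]): Unit = {
    println(s"java.version = ${System.getProperty("java.version")}")
    println(s"tzdb version = ${latest("Asia/Hong_Kong")}")
  }
}
```

Printing this at the start of a build would make it visible in the log when two workers disagree.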
   






[GitHub] [spark] xuanyuanking edited a comment on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store

2020-06-14 Thread GitBox


xuanyuanking edited a comment on pull request #28707:
URL: https://github.com/apache/spark/pull/28707#issuecomment-643916110


   cc @maropu @gatorsmile @HeartSaVioR @dongjoon-hyun 
   
   A new regression bug, SPARK-31990, was found while investigating the test 
failure https://github.com/apache/spark/pull/28707#issuecomment-639861273. The 
root cause is that [this 
line](https://github.com/apache/spark/pull/28062/files#diff-7a46f10c3cedbf013cf255564d9483cdL2458)
 in SPARK-31292 changed the order of groupCols in Deduplicate, and the changed 
order breaks the validation logic here. In other words, without this PR the 
executor JVM could crash, throw a random exception, or even return a wrong 
answer when using a checkpoint written by the previous version.
   
   So there are two work items related to this PR:
   
   - [ ] **[Blocker]** Fix and merge the compatibility issue in #28830
   - [ ] [Follow-up] Add a new test (or modify the current Kafka test) in #28725
   
   --
   ### More detailed analysis:
   The expected order of `Deduplicate.groupCols` in the UT 
`KafkaMicroBatchV2SourceSuite` is
   ```
   [timestamp, partition, timestampType, key, offset, topic, value]
   ```
   which is also the order in checkpoints written by versions before 
Spark 3.0.
   After the changes in SPARK-31292, the groupCols changed to
   ```
   [key, value, topic, partition, offset, timestamp, timestampType]
   ```
   
    Why didn't this incompatibility bug fail 
`KafkaMicroBatchV2SourceSuite` when it was merged?
   
   Because the UT `default config of includeHeader doesn't break the existing 
query from Spark 2.4` doesn't cover the deduplication scenario and check the 
answer.
   Although the UT uses a checkpoint written by version 2.4.3 and a streaming 
deduplication operation, it only aims to prove that the new header (added in 
SPARK-23539) doesn't break the original checkpoint file.
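
The order difference described in this comment can be reproduced outside Spark. A minimal sketch contrasting the two ways of de-duplicating column names; the object and method names are illustrative, not Spark code. The exact order produced by `toSet.toSeq` depends on the set implementation of the Scala version in use, so no output is shown:

```scala
object GroupColsOrder {
  // `distinct` keeps the first occurrence of each element, in input order.
  def viaDistinct(colNames: Seq[String]): Seq[String] = colNames.distinct

  // `toSet.toSeq` goes through an immutable Set. For more than four elements
  // that is a hash-based set, so the resulting order is driven by hashing
  // and need not match the input order.
  def viaSet(colNames: Seq[String]): Seq[String] = colNames.toSet.toSeq

  def main(args: Array[String]): Unit = {
    val kafkaCols =
      Seq("key", "value", "topic", "partition", "offset", "timestamp", "timestampType")
    println(viaDistinct(kafkaCols)) // same order as the input
    println(viaSet(kafkaCols))      // order decided by the Set implementation
  }
}
```

Both variants return the same elements; only the order can differ, and that matters once the order is persisted in a checkpoint.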






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28619: [SPARK-21040][CORE] Speculate tasks which are running on decommission executors

2020-06-14 Thread GitBox


AmplabJenkins removed a comment on pull request #28619:
URL: https://github.com/apache/spark/pull/28619#issuecomment-643916951










[GitHub] [spark] xuanyuanking commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

2020-06-14 Thread GitBox


xuanyuanking commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643916855


   ```
   How we plan to consolidate both? How we will write JIRA title/description 
and PR title/description? Which is the type of the consolidated issue? Is the 
consolidated issue a blocker?
   ```
   Here's my plan to consolidate both: 
https://github.com/apache/spark/pull/28707#issuecomment-643916110. I will also 
add it to the JIRA and the PR description.
   Yes, #28707 is blocked by this fix.
   






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28829: [WIP][SQL] Benchmark the EXCEPTION rebase mode

2020-06-14 Thread GitBox


AmplabJenkins removed a comment on pull request #28829:
URL: https://github.com/apache/spark/pull/28829#issuecomment-643916877










[GitHub] [spark] SparkQA commented on pull request #28829: [WIP][SQL] Benchmark the EXCEPTION rebase mode

2020-06-14 Thread GitBox


SparkQA commented on pull request #28829:
URL: https://github.com/apache/spark/pull/28829#issuecomment-643916564


   **[Test build #124033 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124033/testReport)**
 for PR 28829 at commit 
[`16e90be`](https://github.com/apache/spark/commit/16e90bebf9314105d20c581a07120adb6d288e0b).






[GitHub] [spark] AmplabJenkins commented on pull request #28829: [WIP][SQL] Benchmark the EXCEPTION rebase mode

2020-06-14 Thread GitBox


AmplabJenkins commented on pull request #28829:
URL: https://github.com/apache/spark/pull/28829#issuecomment-643916882


   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/28652/
   Test PASSed.






[GitHub] [spark] HyukjinKwon commented on pull request #17953: [SPARK-20680][SQL] Spark-sql do not support for void column datatype …

2020-06-14 Thread GitBox


HyukjinKwon commented on pull request #17953:
URL: https://github.com/apache/spark/pull/17953#issuecomment-643916503


   Yeah .. I personally support this change FWIW.






[GitHub] [spark] AmplabJenkins commented on pull request #28619: [SPARK-21040][CORE] Speculate tasks which are running on decommission executors

2020-06-14 Thread GitBox


AmplabJenkins commented on pull request #28619:
URL: https://github.com/apache/spark/pull/28619#issuecomment-643916951










[GitHub] [spark] SparkQA commented on pull request #28619: [SPARK-21040][CORE] Speculate tasks which are running on decommission executors

2020-06-14 Thread GitBox


SparkQA commented on pull request #28619:
URL: https://github.com/apache/spark/pull/28619#issuecomment-643916615


   **[Test build #124034 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124034/testReport)**
 for PR 28619 at commit 
[`4affa58`](https://github.com/apache/spark/commit/4affa58f95f893ef6de1c1bf1c6b731468a2519d).






[GitHub] [spark] HyukjinKwon commented on pull request #27805: [SPARK-31056][SQL] Add CalendarIntervals division

2020-06-14 Thread GitBox


HyukjinKwon commented on pull request #27805:
URL: https://github.com/apache/spark/pull/27805#issuecomment-643915859


   Do we have an answer to 
https://github.com/apache/spark/pull/27805#issuecomment-635381702? It's easier 
to justify with actual references and/or a standard.






[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store

2020-06-14 Thread GitBox


xuanyuanking commented on pull request #28707:
URL: https://github.com/apache/spark/pull/28707#issuecomment-643916110


   A new regression bug, SPARK-31990, was found while investigating the test 
failure https://github.com/apache/spark/pull/28707#issuecomment-639861273. The 
root cause is that [this 
line](https://github.com/apache/spark/pull/28062/files#diff-7a46f10c3cedbf013cf255564d9483cdL2458)
 in SPARK-31292 changed the order of groupCols in Deduplicate, and the changed 
order breaks the validation logic here. In other words, without this PR the 
executor JVM could crash, throw a random exception, or even return a wrong 
answer when using a checkpoint written by the previous version.
   
   So there are two work items related to this PR:
   
   - [ ] Fix and merge the compatibility issue in #28830
   - [ ] Add a new test (or modify the current Kafka test) in #28725
   
   --
   ### More detailed analysis:
   The expected order of `Deduplicate.groupCols` in UT 
KafkaMicroBatchV2SourceSuite is
   ```
   [timestamp, partition, timestampType, key, offset, topic, value]
   ```
   After the changes in SPARK-31292, the groupCols changed to
   ```
   [key, value, topic, partition, offset, timestamp, timestampType]
   ```
   
    Why didn't this incompatibility bug fail 
`KafkaMicroBatchV2SourceSuite` when it was merged?
   
   Because the UT `default config of includeHeader doesn't break the existing 
query from Spark 2.4` doesn't cover the deduplication scenario and check the 
answer.
   Although the UT uses a checkpoint written by version 2.4.3 and a streaming 
deduplication operation, it only aims to prove that the new header (added in 
SPARK-23539) doesn't break the original checkpoint file.
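
Why a reordering can be caught in one case and slip through in another can be illustrated with a toy model. This is a sketch under the assumption that validation compares the stored and the current key columns by position and type only. `Col` and `typesMatchByPosition` are hypothetical names, and the column types are chosen for illustration, not taken from Spark's state store code:

```scala
object PositionalCheck {
  final case class Col(name: String, dataType: String)

  // Toy check: same length and the same type at every position.
  // Column names are deliberately ignored.
  def typesMatchByPosition(stored: Seq[Col], current: Seq[Col]): Boolean =
    stored.length == current.length &&
      stored.zip(current).forall { case (s, c) => s.dataType == c.dataType }

  def main(args: Array[String]): Unit = {
    val mixed = Seq(Col("timestamp", "timestamp"), Col("partition", "int"))
    // Swapping columns of different types is detected (prints false).
    println(typesMatchByPosition(mixed, mixed.reverse))

    val sameTyped = Seq(Col("key", "binary"), Col("value", "binary"))
    // Swapping columns of the same type is not detected (prints true).
    println(typesMatchByPosition(sameTyped, sameTyped.reverse))
  }
}
```

The second case is the one raised in the review of #28830: when all columns have the same type, a type-based check cannot notice the changed order.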






[GitHub] [spark] Ngone51 commented on pull request #28619: [SPARK-21040][CORE] Speculate tasks which are running on decommission executors

2020-06-14 Thread GitBox


Ngone51 commented on pull request #28619:
URL: https://github.com/apache/spark/pull/28619#issuecomment-643915676


   retest this please.






[GitHub] [spark] MaxGekk commented on pull request #28829: [WIP][SQL] Benchmark the EXCEPTION rebase mode

2020-06-14 Thread GitBox


MaxGekk commented on pull request #28829:
URL: https://github.com/apache/spark/pull/28829#issuecomment-643915417


   jenkins, retest this, please






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition

2020-06-14 Thread GitBox


AmplabJenkins removed a comment on pull request #28642:
URL: https://github.com/apache/spark/pull/28642#issuecomment-643914834










[GitHub] [spark] HyukjinKwon commented on a change in pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition

2020-06-14 Thread GitBox


HyukjinKwon commented on a change in pull request #28642:
URL: https://github.com/apache/spark/pull/28642#discussion_r439940687



##
File path: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
##
@@ -1039,7 +1039,7 @@ class JoinSuite extends QueryTest with SharedSparkSession 
with AdaptiveSparkPlan
 val pythonEvals = collect(joinNode.get) {
   case p: BatchEvalPythonExec => p
 }
-assert(pythonEvals.size == 2)
+assert(pythonEvals.size == 4)

Review comment:
   Yeah, I don't think having more `BatchEvalPythonExec` nodes is more 
efficient. It will require more Python executions, which aren't trivial.








[GitHub] [spark] AmplabJenkins commented on pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition

2020-06-14 Thread GitBox


AmplabJenkins commented on pull request #28642:
URL: https://github.com/apache/spark/pull/28642#issuecomment-643914834










[GitHub] [spark] SparkQA commented on pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition

2020-06-14 Thread GitBox


SparkQA commented on pull request #28642:
URL: https://github.com/apache/spark/pull/28642#issuecomment-643914470


   **[Test build #124032 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124032/testReport)**
 for PR 28642 at commit 
[`65cd324`](https://github.com/apache/spark/commit/65cd324093fac15357fb0ca9bae7c524b40c).






[GitHub] [spark] HyukjinKwon commented on pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition

2020-06-14 Thread GitBox


HyukjinKwon commented on pull request #28642:
URL: https://github.com/apache/spark/pull/28642#issuecomment-643913716


   retest this please






[GitHub] [spark] Ngone51 commented on pull request #28801: [SPARK-31970][CORE] Make MDC configuration step be consistent between setLocalProperty and log4j.properties

2020-06-14 Thread GitBox


Ngone51 commented on pull request #28801:
URL: https://github.com/apache/spark/pull/28801#issuecomment-643912320


   thanks all!!






[GitHub] [spark] AmplabJenkins removed a comment on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block

2020-06-14 Thread GitBox


AmplabJenkins removed a comment on pull request #27604:
URL: https://github.com/apache/spark/pull/27604#issuecomment-643909975


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124026/
   Test FAILed.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block

2020-06-14 Thread GitBox


AmplabJenkins removed a comment on pull request #27604:
URL: https://github.com/apache/spark/pull/27604#issuecomment-643909967


   Merged build finished. Test FAILed.






[GitHub] [spark] AmplabJenkins commented on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block

2020-06-14 Thread GitBox


AmplabJenkins commented on pull request #27604:
URL: https://github.com/apache/spark/pull/27604#issuecomment-643909975


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124026/
   Test FAILed.






[GitHub] [spark] SparkQA removed a comment on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block

2020-06-14 Thread GitBox


SparkQA removed a comment on pull request #27604:
URL: https://github.com/apache/spark/pull/27604#issuecomment-643877230


   **[Test build #124026 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124026/testReport)**
 for PR 27604 at commit 
[`2e11d1b`](https://github.com/apache/spark/commit/2e11d1bedf15b59c89b1f686ea716a575802f1e6).






[GitHub] [spark] SparkQA commented on pull request #27604: [SPARK-30849][CORE][SHUFFLE]Fix application failed due to failed to get MapStatuses broadcast block

2020-06-14 Thread GitBox


SparkQA commented on pull request #27604:
URL: https://github.com/apache/spark/pull/27604#issuecomment-643909627


   **[Test build #124026 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124026/testReport)**
 for PR 27604 at commit 
[`2e11d1b`](https://github.com/apache/spark/commit/2e11d1bedf15b59c89b1f686ea716a575802f1e6).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] HyukjinKwon commented on pull request #28828: [SPARK-24634][SS][FOLLOWUP] Rename the variable from "numLateInputs" to "numDropppedRowsByWatermark"

2020-06-14 Thread GitBox


HyukjinKwon commented on pull request #28828:
URL: https://github.com/apache/spark/pull/28828#issuecomment-643906549


   @xuanyuanking too FYI






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp

2020-06-14 Thread GitBox


AmplabJenkins removed a comment on pull request #28821:
URL: https://github.com/apache/spark/pull/28821#issuecomment-643904439


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124024/
   Test FAILed.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp

2020-06-14 Thread GitBox


AmplabJenkins removed a comment on pull request #28821:
URL: https://github.com/apache/spark/pull/28821#issuecomment-643904434


   Merged build finished. Test FAILed.






[GitHub] [spark] AmplabJenkins commented on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp

2020-06-14 Thread GitBox


AmplabJenkins commented on pull request #28821:
URL: https://github.com/apache/spark/pull/28821#issuecomment-643904434










[GitHub] [spark] SparkQA removed a comment on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp

2020-06-14 Thread GitBox


SparkQA removed a comment on pull request #28821:
URL: https://github.com/apache/spark/pull/28821#issuecomment-643865812


   **[Test build #124024 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124024/testReport)**
 for PR 28821 at commit 
[`707b0cf`](https://github.com/apache/spark/commit/707b0cf949e2532429bdc62d7ef219fe98a0751e).






[GitHub] [spark] SparkQA commented on pull request #28821: [SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp

2020-06-14 Thread GitBox


SparkQA commented on pull request #28821:
URL: https://github.com/apache/spark/pull/28821#issuecomment-643904220


   **[Test build #124024 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124024/testReport)**
 for PR 28821 at commit 
[`707b0cf`](https://github.com/apache/spark/commit/707b0cf949e2532429bdc62d7ef219fe98a0751e).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins commented on pull request #28807: [SPARK-26905][SQL] Follow the SQL:2016 reserved keywords

2020-06-14 Thread GitBox


AmplabJenkins commented on pull request #28807:
URL: https://github.com/apache/spark/pull/28807#issuecomment-643899506










[GitHub] [spark] AmplabJenkins removed a comment on pull request #28807: [SPARK-26905][SQL] Follow the SQL:2016 reserved keywords

2020-06-14 Thread GitBox


AmplabJenkins removed a comment on pull request #28807:
URL: https://github.com/apache/spark/pull/28807#issuecomment-643899506










[GitHub] [spark] maropu commented on a change in pull request #28807: [SPARK-26905][SQL] Follow the SQL:2016 reserved keywords

2020-06-14 Thread GitBox


maropu commented on a change in pull request #28807:
URL: https://github.com/apache/spark/pull/28807#discussion_r439927771



##
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/TableIdentifierParserSuite.scala
##
@@ -388,12 +396,24 @@ class TableIdentifierParserSuite extends SparkFunSuite with SQLHelper {
   val reservedKeywordsInAnsiMode = allCandidateKeywords -- nonReservedKeywordsInAnsiMode
 
   test("check # of reserved keywords") {
-val numReservedKeywords = 78
+val numReservedKeywords = 74

Review comment:
   Note: `ANTI`, `SEMI`, `MINUS`, and `!` are removed.
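   As a rough illustration of why the expected count drops by four, here is a minimal Scala sketch. The keyword sets are hypothetical toy data, not the real lists in `TableIdentifierParserSuite`, and the sketch assumes the four removed keywords simply move into the non-reserved set:

```scala
object ReservedKeywordCount {
  // Hypothetical toy data; the real suite works with the full keyword lists.
  val allCandidateKeywords: Set[String] =
    Set("SELECT", "FROM", "WHERE", "ANTI", "SEMI", "MINUS", "!")

  // Assumption: before the change, nothing in this toy set was non-reserved.
  val nonReservedBefore: Set[String] = Set.empty
  // Assumption: after the change, the four keywords are treated as non-reserved.
  val nonReservedAfter: Set[String] = Set("ANTI", "SEMI", "MINUS", "!")

  // Same shape as the test: reserved = all candidates minus non-reserved.
  val reservedBefore: Set[String] = allCandidateKeywords -- nonReservedBefore
  val reservedAfter: Set[String] = allCandidateKeywords -- nonReservedAfter

  // The difference mirrors the 78 -> 74 change in the diff above.
  val removedCount: Int = reservedBefore.size - reservedAfter.size
}
```

   With these toy sets, `ReservedKeywordCount.removedCount` is 4, matching the size of the change to `numReservedKeywords`.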








[GitHub] [spark] SparkQA commented on pull request #28807: [SPARK-26905][SQL] Follow the SQL:2016 reserved keywords

2020-06-14 Thread GitBox


SparkQA commented on pull request #28807:
URL: https://github.com/apache/spark/pull/28807#issuecomment-643899210


   **[Test build #124031 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124031/testReport)** for PR 28807 at commit [`eeceb30`](https://github.com/apache/spark/commit/eeceb30e050c26acdb93372eef0ce14410bd0159).






[GitHub] [spark] AmplabJenkins commented on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait

2020-06-14 Thread GitBox


AmplabJenkins commented on pull request #28710:
URL: https://github.com/apache/spark/pull/28710#issuecomment-643897872










[GitHub] [spark] AmplabJenkins removed a comment on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait

2020-06-14 Thread GitBox


AmplabJenkins removed a comment on pull request #28710:
URL: https://github.com/apache/spark/pull/28710#issuecomment-643897872










[GitHub] [spark] SparkQA commented on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait

2020-06-14 Thread GitBox


SparkQA commented on pull request #28710:
URL: https://github.com/apache/spark/pull/28710#issuecomment-643897635


   **[Test build #124030 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124030/testReport)** for PR 28710 at commit [`2e6f35c`](https://github.com/apache/spark/commit/2e6f35c8e31fe1cde1637b922673339bfeef65fe).






[GitHub] [spark] huaxingao commented on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait

2020-06-14 Thread GitBox


huaxingao commented on pull request #28710:
URL: https://github.com/apache/spark/pull/28710#issuecomment-643896578


   retest this please






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28593: [SPARK-31710][SQL] Fail casting numeric to timestamp by default

2020-06-14 Thread GitBox


AmplabJenkins removed a comment on pull request #28593:
URL: https://github.com/apache/spark/pull/28593#issuecomment-643892810










[GitHub] [spark] AmplabJenkins commented on pull request #28593: [SPARK-31710][SQL] Fail casting numeric to timestamp by default

2020-06-14 Thread GitBox


AmplabJenkins commented on pull request #28593:
URL: https://github.com/apache/spark/pull/28593#issuecomment-643892810










[GitHub] [spark] SparkQA commented on pull request #28593: [SPARK-31710][SQL] Fail casting numeric to timestamp by default

2020-06-14 Thread GitBox


SparkQA commented on pull request #28593:
URL: https://github.com/apache/spark/pull/28593#issuecomment-643892530


   **[Test build #124029 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124029/testReport)** for PR 28593 at commit [`8fe1960`](https://github.com/apache/spark/commit/8fe1960ef3a0c598a626b7024820b74cec787642).






[GitHub] [spark] dongjoon-hyun commented on pull request #24922: [SPARK-28120][SS] Rocksdb state storage implementation

2020-06-14 Thread GitBox


dongjoon-hyun commented on pull request #24922:
URL: https://github.com/apache/spark/pull/24922#issuecomment-643892244


   Thank you for the update, @itsvikramagr .






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait

2020-06-14 Thread GitBox


AmplabJenkins removed a comment on pull request #28710:
URL: https://github.com/apache/spark/pull/28710#issuecomment-643891541


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124021/
   Test FAILed.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait

2020-06-14 Thread GitBox


AmplabJenkins removed a comment on pull request #28710:
URL: https://github.com/apache/spark/pull/28710#issuecomment-643891538


   Merged build finished. Test FAILed.






[GitHub] [spark] SparkQA removed a comment on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait

2020-06-14 Thread GitBox


SparkQA removed a comment on pull request #28710:
URL: https://github.com/apache/spark/pull/28710#issuecomment-643855623


   **[Test build #124021 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124021/testReport)** for PR 28710 at commit [`2e6f35c`](https://github.com/apache/spark/commit/2e6f35c8e31fe1cde1637b922673339bfeef65fe).






[GitHub] [spark] SparkQA commented on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait

2020-06-14 Thread GitBox


SparkQA commented on pull request #28710:
URL: https://github.com/apache/spark/pull/28710#issuecomment-643891334


   **[Test build #124021 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124021/testReport)** for PR 28710 at commit [`2e6f35c`](https://github.com/apache/spark/commit/2e6f35c8e31fe1cde1637b922673339bfeef65fe).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

2020-06-14 Thread GitBox


HeartSaVioR edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-64318


   How do we plan to consolidate both? How will we write the JIRA title/description 
and the PR title/description? What is the type of the consolidated issue? Is the 
consolidated issue a blocker?
   
   Things would be simpler if we merge the partial revert as it is, and spend 
our efforts discussing how to document known issues (this is one of the 
candidates for Spark 3.0.0). This is clearly a bugfix for a "blocker" that 
prevents some end users from migrating to Spark 3.0.0, so it is worth having 
its own JIRA issue and its own commit. Sure, this may need to be mentioned in 
the migration guide or release notes as well.
   
   It would do no harm for #28707 to wait for this patch to be merged, and 
then rebase to fix the test failure.
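
   For context on why the column order matters here, a minimal Scala sketch follows. The column names are made up for illustration; the only claims are that `distinct` keeps first-occurrence order while `toSet.toSeq` makes no ordering promise, so the two results can differ in order even though they hold the same elements.

```scala
object DedupOrder {
  // Hypothetical column names with one duplicate.
  val colNames: Seq[String] = Seq("b", "a", "b", "c")

  // `distinct` keeps the first occurrence of each element, in input order.
  val byDistinct: Seq[String] = colNames.distinct

  // `toSet.toSeq` also removes duplicates, but the resulting order is whatever
  // the Set implementation yields; it is not guaranteed to match input order.
  val bySet: Seq[String] = colNames.toSet.toSeq

  // Both hold the same elements; only the order may differ.
  val sameElements: Boolean = byDistinct.toSet == bySet.toSet
}
```

   If anything persisted depends on the position of each column, as the PR says the streaming state store does for `groupCols`, switching between the two forms could silently change the layout, which is the compatibility concern behind this patch.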






[GitHub] [spark] GuoPhilipse commented on pull request #28593: [SPARK-31710][SQL] Fail casting numeric to timestamp by default

2020-06-14 Thread GitBox


GuoPhilipse commented on pull request #28593:
URL: https://github.com/apache/spark/pull/28593#issuecomment-643890774


   It was generated by the `set` command; we have now removed it.






[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log

2020-06-14 Thread GitBox


HeartSaVioR edited a comment on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-643878976


   I’m sorry, but version 4 doesn’t leverage UnsafeRow. (Version 2 did.) Please 
read the description carefully.
   
   As I commented earlier, there are still lots of possible improvements in 
metadata, but I don’t want to go through them unless we commit dedicated effort 
to reviewing. This is a low-hanging fruit that brings a massive improvement.






[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

2020-06-14 Thread GitBox


HeartSaVioR edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-64318


   How do we plan to consolidate both? How will we write the JIRA title/description 
and the PR title/description? What is the type of the consolidated issue? Is the 
consolidated issue a blocker?
   
   Things would be simpler if we merge the partial fix as it is, and spend our 
efforts discussing how to document known issues (this is one of the candidates 
for Spark 3.0.0). This is clearly a bugfix for a "blocker" that prevents some 
end users from migrating to Spark 3.0.0. Sure, this may need to be mentioned in 
the migration guide or release notes as well.
   
   It would do no harm for #28707 to wait for this patch to be merged, and 
then rebase to fix the test failure.






[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

2020-06-14 Thread GitBox


HeartSaVioR edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-64318


   How do we plan to consolidate both? How will we write the JIRA title/description 
and the PR title/description? What is the type of the consolidated issue? Is the 
consolidated issue a blocker?
   
   Things would be simpler if we merge the partial fix as it is, and spend our 
efforts discussing how to document known issues (this is one of the candidates 
for Spark 3.0.0). This is clearly a bugfix for a "blocker" that prevents some 
end users from migrating to Spark 3.0.0, so it is worth having its own JIRA 
issue and its own commit. Sure, this may need to be mentioned in the migration 
guide or release notes as well.
   
   It would do no harm for #28707 to wait for this patch to be merged, and 
then rebase to fix the test failure.






[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

2020-06-14 Thread GitBox


HeartSaVioR edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-64318


   How do we plan to consolidate both? How will we write the JIRA title/description 
and the PR title/description? What is the type of the consolidated issue? Is the 
consolidated issue a blocker?
   
   Things would be simpler if we merge the partial fix as it is, and spend our 
efforts discussing how to document the known issue (this is one of the 
candidates for Spark 3.0.0). This is clearly a bugfix for a "blocker" that 
prevents some end users from migrating to Spark 3.0.0. Sure, this may need to 
be mentioned in the migration guide or release notes as well.
   
   It would do no harm for #28707 to wait for this patch to be merged, and 
then rebase to fix the test failure.






[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

2020-06-14 Thread GitBox


HeartSaVioR edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-64318


   How do we plan to consolidate both? How will we write the JIRA title/description 
and the PR title/description? What is the type of the consolidated issue? Is the 
consolidated issue a blocker?
   
   Things would be simpler if we merge the partial fix as it is, and spend our 
efforts discussing how to document the known issue (this is one of the 
candidates for Spark 3.0.0). This is clearly a bugfix for a "blocker" that 
prevents some end users from migrating to Spark 3.0.0.
   
   It would do no harm for #28707 to wait for this patch to be merged, and 
then rebase to fix the test failure.






[GitHub] [spark] HeartSaVioR commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

2020-06-14 Thread GitBox


HeartSaVioR commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-64318


   How do we plan to consolidate both? How will we write the JIRA title/description 
and the PR title/description? What is the type of the consolidated issue? Is the 
consolidated issue a blocker?
   
   Things would be simple if we merge the partial fix as it is, and spend our 
efforts discussing how to document the known issue (this is one of the 
candidates for Spark 3.0.0). This is clearly a bugfix for a "blocker" that 
prevents some end users from migrating to Spark 3.0.0.
   
   It would do no harm for #28707 to wait for this patch to be merged, and 
then rebase to fix the test failure.






[GitHub] [spark] AmplabJenkins commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-14 Thread GitBox


AmplabJenkins commented on pull request #27690:
URL: https://github.com/apache/spark/pull/27690#issuecomment-643887374










[GitHub] [spark] AmplabJenkins removed a comment on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-14 Thread GitBox


AmplabJenkins removed a comment on pull request #27690:
URL: https://github.com/apache/spark/pull/27690#issuecomment-643887374










[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-14 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r439917190



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -124,11 +153,24 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
 val hiveVersion = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version
 val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
 val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
+logDebug(s"path '${path.toString}', staging dir '$stagingDir', " +
+  s"scratch dir '$scratchDir' are used")
 
 if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) {
   oldVersionExternalTempPath(path, hadoopConf, scratchDir)
 } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) {

Review comment:
   Got it. I added the description "This option is supported in Hive 2.0 or 
later." in SQLConf.scala.
   
https://github.com/apache/spark/pull/27690/files#diff-9a6b543db706f1a90f790783d6930a13R849








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-14 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r439917190



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -124,11 +153,24 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
 val hiveVersion = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version
 val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
 val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
+logDebug(s"path '${path.toString}', staging dir '$stagingDir', " +
+  s"scratch dir '$scratchDir' are used")
 
 if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) {
   oldVersionExternalTempPath(path, hadoopConf, scratchDir)
 } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) {

Review comment:
   Got it. I added the description "This option is supported in Hive 2.0 or 
later." in SQLConf.scala.
   
https://github.com/apache/spark/pull/27690/files#diff-9a6b543db706f1a90f790783d6930a13R849








[GitHub] [spark] SparkQA commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-14 Thread GitBox


SparkQA commented on pull request #27690:
URL: https://github.com/apache/spark/pull/27690#issuecomment-643887119


   **[Test build #124028 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124028/testReport)** for PR 27690 at commit [`0fbeaf3`](https://github.com/apache/spark/commit/0fbeaf374bf35a7d0cde2b3340d9f3c4551cbdb2).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28786: [SPARK-31925][ML] Summary.totalIterations greater than maxIters

2020-06-14 Thread GitBox


AmplabJenkins removed a comment on pull request #28786:
URL: https://github.com/apache/spark/pull/28786#issuecomment-643885908










[GitHub] [spark] SparkQA removed a comment on pull request #28786: [SPARK-31925][ML] Summary.totalIterations greater than maxIters

2020-06-14 Thread GitBox


SparkQA removed a comment on pull request #28786:
URL: https://github.com/apache/spark/pull/28786#issuecomment-643867351


   **[Test build #124025 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124025/testReport)** for PR 28786 at commit [`4c4d52b`](https://github.com/apache/spark/commit/4c4d52b91e1ebbd018835c3bb2cd565df79bd430).






[GitHub] [spark] AmplabJenkins commented on pull request #28786: [SPARK-31925][ML] Summary.totalIterations greater than maxIters

2020-06-14 Thread GitBox


AmplabJenkins commented on pull request #28786:
URL: https://github.com/apache/spark/pull/28786#issuecomment-643885908










[GitHub] [spark] SparkQA commented on pull request #28786: [SPARK-31925][ML] Summary.totalIterations greater than maxIters

2020-06-14 Thread GitBox


SparkQA commented on pull request #28786:
URL: https://github.com/apache/spark/pull/28786#issuecomment-643885633


   **[Test build #124025 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124025/testReport)** for PR 28786 at commit [`4c4d52b`](https://github.com/apache/spark/commit/4c4d52b91e1ebbd018835c3bb2cd565df79bd430).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] maropu commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

2020-06-14 Thread GitBox


maropu commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643885408


   > Thanks for the quick fix @maropu! I think maybe we can simplify the bugfix 
by combining it together with #28707. WDYT? I'll also reference this PR with 
#28707.
   
   @xuanyuanking Yeah, looks fine to me. Could you take this over? Thanks 
anyway!





